Preference Learning
Learning from human comparative judgments (e.g., 'trajectory A is better than trajectory B') rather than from explicit reward signals or demonstrations. A reward model is trained to be consistent with the human preferences, typically by maximizing the likelihood of the observed comparisons under a Bradley-Terry model, and is then used to optimize the policy via RL. This approach (RLHF applied to robotics) avoids the need for precise scalar reward engineering and can capture nuanced human intent.
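A minimal sketch of the reward-model training step, assuming PyTorch and the standard Bradley-Terry pairwise objective. The `RewardModel` architecture, the feature dimension, and the synthetic preference pairs are illustrative assumptions, not a specific published implementation; in practice the features would come from recorded trajectories labeled by human annotators.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a trajectory feature vector to a scalar reward estimate (illustrative)."""
    def __init__(self, feat_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, traj_feats: torch.Tensor) -> torch.Tensor:
        return self.net(traj_feats).squeeze(-1)

def preference_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(preferred beats other) = sigmoid(r_preferred - r_other).
    # Minimizing the negative log-likelihood pushes the model to assign
    # higher reward to the trajectory the human preferred.
    return -torch.nn.functional.logsigmoid(r_preferred - r_other).mean()

# Synthetic stand-in for a dataset of human comparisons:
# each pair (a, b) is labeled with a preferred over b.
feat_dim, n_pairs = 16, 256
traj_a = torch.randn(n_pairs, feat_dim)   # preferred trajectories
traj_b = torch.randn(n_pairs, feat_dim)   # dispreferred trajectories

model = RewardModel(feat_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    loss = preference_loss(model(traj_a), model(traj_b))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained model now scores trajectories; an RL algorithm (e.g., PPO)
# can optimize the policy against model(traj) in place of a hand-engineered reward.
```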