Language-conditioned Policy
A language-conditioned policy takes a natural language instruction (e.g., "pick up the red cup and place it on the tray") as an additional input alongside visual observations, enabling a single policy network to perform multiple tasks selected at runtime without retraining. Language conditioning is typically implemented by encoding the instruction with a pretrained language model (e.g., CLIP's text encoder, T5, or PaLM) and fusing the resulting embedding with image features, for example via concatenation, FiLM-style modulation, or cross-attention. VLA models such as RT-2, OpenVLA, and π0 are language-conditioned by design. This approach removes the need to train a separate policy per task and can support zero-shot generalization to novel phrasings of an instruction.
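The fusion step above can be sketched in a few lines of numpy. This is a minimal illustration, not any specific model's architecture: the bag-of-words `encode_instruction` stands in for a pretrained text encoder, the random `image_feats` stand in for a vision backbone's output, and all dimensions and weight names are made up. The example uses FiLM-style conditioning, where the language embedding predicts a per-channel scale and shift applied to the image features before the policy head.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_instruction(text: str, dim: int = 8) -> np.ndarray:
    # Stand-in for a pretrained language encoder (e.g., a CLIP/T5 text tower):
    # hash tokens into a fixed-size bag-of-words vector, then L2-normalize.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def film_fuse(image_feats, lang_emb, w_gamma, w_beta):
    # FiLM conditioning: the language embedding predicts a per-channel
    # scale (gamma) and shift (beta) applied to the image features.
    gamma = w_gamma @ lang_emb
    beta = w_beta @ lang_emb
    return gamma * image_feats + beta

# Illustrative dimensions: 16-dim image features, 8-dim language
# embedding, 4-dim continuous action (e.g., end-effector deltas).
img_dim, lang_dim, act_dim = 16, 8, 4
w_gamma = rng.normal(size=(img_dim, lang_dim))
w_beta = rng.normal(size=(img_dim, lang_dim))
w_policy = rng.normal(size=(act_dim, img_dim))

image_feats = rng.normal(size=img_dim)  # stand-in for a vision backbone output

# The same network produces different actions for different instructions,
# because the instruction modulates the visual features.
for instruction in ("pick up the red cup and place it on the tray",
                    "open the top drawer"):
    lang_emb = encode_instruction(instruction)
    fused = film_fuse(image_feats, lang_emb, w_gamma, w_beta)
    action = w_policy @ fused
    print(instruction, "->", action.shape)
```

In a trained model the weights would of course be learned end-to-end rather than random, but the data flow is the same: one observation, two instructions, two different action outputs from a single set of parameters.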