CLIP

Contrastive Language-Image Pre-training — a model trained by OpenAI on 400 million image-text pairs to learn aligned visual and linguistic representations. In robotics, CLIP embeddings are used for open-vocabulary object detection, language-conditioned manipulation, and reward specification. Vision-language-action (VLA) models such as RT-2, and planning frameworks such as SayCan, build on this kind of vision-language grounding to connect language commands to robotic actions.
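The core mechanism can be sketched as follows: the image encoder and text encoder map their inputs into a shared embedding space, and zero-shot "classification" is a softmax over cosine similarities between one image embedding and several candidate text-prompt embeddings. The toy vectors below are stand-ins for real encoder outputs (an assumption for illustration; actual CLIP embeddings are 512- or 768-dimensional):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """Score candidate text prompts against one image embedding."""
    # L2-normalize embeddings so the dot product is cosine similarity,
    # as CLIP does before computing logits
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Temperature ~0.01 mirrors CLIP's learned logit scale of ~100
    logits = img @ txt.T / temperature
    # Softmax over candidate prompts yields zero-shot class probabilities
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Hypothetical 3-d embeddings standing in for encoder outputs
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],   # e.g. "a photo of a mug"
    [0.0, 1.0, 0.0],   # e.g. "a photo of a screwdriver"
])
probs = zero_shot_scores(image_emb, text_embs)
```

In an open-vocabulary robotics pipeline, the text prompts would be object names supplied at runtime, so new categories can be detected without retraining.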

Robot Learning · Vision-Language
