DINOv2
A self-supervised vision transformer trained by Meta on LVD-142M, a curated dataset of 142 million images, using a self-distillation objective, so it learns strong visual representations without any labels. Its features transfer well to robotics: manipulation policies built on frozen DINOv2 encoders reach strong performance with minimal fine-tuning on robot data.
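A minimal sketch of the frozen-encoder pattern described above, assuming PyTorch and the `torch.hub` entry point published in the `facebookresearch/dinov2` repository (`dinov2_vits14` is the ViT-S/14 variant with a 384-dimensional embedding; the first call downloads weights):

```python
import torch

# Load the ViT-S/14 backbone from the official DINOv2 hub entry point.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # keep the backbone frozen, as a policy head would

# Input sides must be multiples of the 14-pixel patch size (224 = 16 * 14).
image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed camera frame
with torch.no_grad():
    features = encoder(image)  # global (CLS-token) embedding, shape (1, 384)
```

A downstream manipulation policy would typically feed `features` into a small trainable head (e.g. an MLP over actions), leaving the encoder untouched.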