RT-2
Robotics Transformer 2, a vision-language-action (VLA) model from Google DeepMind that fine-tunes a large vision-language model (PaLI-X or PaLM-E) to output robot actions as text tokens. RT-2 demonstrates that internet-scale pre-training enables robots to follow novel language instructions and generalize to unseen objects and scenarios. It exemplifies the approach of treating robot action prediction as a vision-language modeling problem.
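
To make "actions as text tokens" concrete, here is a minimal sketch of one way a continuous action vector can be turned into a string of integer tokens and back. It assumes uniform binning into 256 bins over a normalized range of [-1, 1]. The function names, the bounds, and the example action are hypothetical illustrations, not RT-2's actual code or API.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of discrete bins per action dimension


def encode_action(action: Sequence[float], low: float = -1.0, high: float = 1.0) -> str:
    """Map each continuous action dimension to a bin index and join as text."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)  # keep the value inside [low, high]
        scaled = (clipped - low) / (high - low)  # normalize to [0, 1]
        bin_index = min(int(scaled * NUM_BINS), NUM_BINS - 1)  # uniform binning
        tokens.append(str(bin_index))
    return " ".join(tokens)


def decode_action(text: str, low: float = -1.0, high: float = 1.0) -> List[float]:
    """Map a string of bin indices back to the centers of their bins."""
    width = (high - low) / NUM_BINS
    return [low + (int(token) + 0.5) * width for token in text.split()]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, already normalized to [-1, 1].
    text = encode_action([-1.0, 0.0, 1.0])
    print(text)  # 0 128 255
    print(decode_action(text))
```

Because the action becomes an ordinary token string, the language model can emit it with the same decoding machinery it uses for text. The price is quantization error: with these assumed settings, a decoded value differs from the original by at most half a bin width.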