Vision Transformer
An image classification architecture that splits an image into fixed-size patches, linearly embeds each patch, adds positional encodings, and processes the resulting sequence with a standard transformer encoder. ViT has become the dominant visual backbone, matching or surpassing CNNs when pre-trained on sufficiently large datasets. ViT variants (DeiT, Swin, DINOv2) are the default visual encoders in modern vision-language-action (VLA) models.
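The pipeline above (patchify, linearly embed, add positional encodings, run a transformer encoder, classify) can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the reference ViT implementation: the hyperparameters (patch size 16, embedding dim 192, 2 layers, 3 heads) are placeholders, and the classification token and learned positional embeddings follow the common ViT convention.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: patchify -> embed -> +pos -> encoder -> head.
    Hyperparameters are illustrative, not those of any published variant."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=2,
                 heads=3, num_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided conv is equivalent to splitting the image into
        # non-overlapping patches and applying a shared linear projection.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned positional embeddings, one per patch plus the [CLS] slot.
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos      # prepend [CLS]
        x = self.encoder(x)
        return self.head(x[:, 0])                      # classify via [CLS]

logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

A 224x224 image with 16x16 patches yields a sequence of 196 tokens; when a ViT is used as a visual encoder in a VLA model, the classification head is typically dropped and the patch token embeddings are passed downstream instead.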