Transformer Visual Backbone

A Vision Transformer (ViT) used as the feature-extraction backbone of a computer vision pipeline. Unlike CNN backbones, which build up context through stacked local convolutions, a ViT backbone splits the image into patches and models long-range dependencies between them with self-attention. This tends to pay off most when large-scale pretraining data is available. Transformer backbones are widely used in modern detection (e.g., DINO, a DETR variant), segmentation (e.g., Mask2Former), and vision-language-action (VLA) models.
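As a rough illustration of the patch-and-attend idea, the sketch below splits a small grayscale image into non-overlapping patches, projects each patch to an embedding, and runs one single-head self-attention step so every patch token can draw on every other. It is a dependency-free toy, not a real backbone: the helper names (`patchify`, `self_attention`, `random_matrix`), the 8x8 image, the patch size, and the random weights are all assumptions made for the example, and positional embeddings, layer normalization, MLP blocks, and multiple heads are omitted.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tokens


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    # Each token scores every token (including itself), so context is global.
    weights = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q
    ]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 image

tokens = patchify(image, patch=4)          # 4 patches, each flattened to 16 values
dim = 8                                    # assumed embedding width
embedded = matmul(tokens, random_matrix(16, dim, rng))  # linear patch embedding
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

features, attn = self_attention(embedded, w_q, w_k, w_v)
print(len(features), len(features[0]))     # number of tokens, feature width
```

In an actual pipeline the per-patch features would come from a pretrained model in a deep-learning framework and be passed to a detection, segmentation, or policy head; the toy only shows why each output token can depend on the whole image after a single attention layer.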

Vision, ML
