A Practical Guide to Vision Transformers: How ViT Changed Computer Vision
Written by
Jay Kim

From patch embeddings to attention maps, this hands-on guide covers everything you need to know about Vision Transformers — including ViT, DeiT, Swin, and DINOv2 — with full Python code to train, fine-tune, and interpret them on your own data.
Introduction
For nearly a decade, Convolutional Neural Networks (CNNs) were the undisputed kings of computer vision. From AlexNet's breakthrough in 2012 to the sophisticated architectures of EfficientNet, the formula was clear: stack convolutional layers, pool features hierarchically, and classify. It worked. It worked extremely well.
Then, in October 2020, a team at Google Brain published a paper with a provocative title: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020). The core claim was deceptively simple: take the transformer architecture that had dominated NLP, apply it directly to images with as few modifications as possible, and watch it match or beat the best CNNs.
The computer vision community was skeptical at first. Transformers have no built-in understanding of spatial locality. They don't inherently know that neighboring pixels are related. They lack the inductive biases that made CNNs so effective. And yet, Vision Transformers (ViTs) didn't just work; they opened an entirely new design space that has since produced models like Swin Transformer, DeiT, DINOv2, and the vision encoders powering multimodal systems like GPT-4V and LLaVA.
This article will take you through the research, explain why ViTs work, compare the most important variants, and give you hands-on code to train, evaluate, and interpret Vision Transformers on your own data.
Why Transformers for Vision?
To understand why ViTs matter, we need to understand what CNNs do well, and where they fall short.
CNNs process images through local receptive fields. A 3×3 convolutional kernel looks at a small patch of the image, detects a local feature (an edge, a texture, a corner), and passes that information upward. Deeper layers combine local features into increasingly global representations. This hierarchical, local-to-global processing is the fundamental inductive bias of CNNs, and it's extremely well-suited for images.
But this locality is also a limitation. A convolutional layer in the early stages of the network cannot relate a pixel in the top-left corner to a pixel in the bottom-right corner. Long-range dependencies require many stacked layers to propagate. This matters for tasks where global context is critical: understanding that a person is holding an object, recognizing that two distant regions of an image share a texture, or reasoning about the overall scene composition.
Transformers, by contrast, use self-attention. Every element in the input sequence can attend to every other element in a single layer. There's no locality constraint. If you can represent an image as a sequence of tokens, the transformer can learn arbitrary relationships between any parts of the image from the very first layer.
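A minimal sketch of that mechanism makes the "every token attends to every token" point concrete. This is a single-head, unbatched version with illustrative names, not code from any particular library:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: every token attends to every other token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (seq, seq): all pairs at once
    return F.softmax(scores, dim=-1) @ v

d = 64
x = torch.randn(196, d)                       # 196 patch tokens
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                              # torch.Size([196, 64])
```

Note that the `(seq, seq)` score matrix relates every token pair in one layer, with no notion of which patches are spatially adjacent.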
The question was never whether global attention would be theoretically useful for vision. The question was whether it could work in practice without the inductive biases that CNNs provided, and whether it could scale.
The Vision Transformer (ViT): Core Architecture

The ViT architecture, described in "An Image is Worth 16x16 Words", makes a deliberate choice to change as little as possible from the original NLP transformer. The authors' goal was to test whether a standard transformer, applied with minimal modification, could compete with CNNs. The architecture proceeds in four stages.
Stage 1: Patch Embedding. The input image (say, 224×224 pixels with 3 color channels) is divided into fixed-size patches. With a patch size of 16×16, a 224×224 image becomes a grid of 14×14 = 196 patches. Each patch is 16×16×3 = 768 values. These patches are flattened into vectors and linearly projected into the transformer's embedding dimension. Conceptually, each image patch is treated the same way a word token is treated in BERT — it becomes a vector in a shared embedding space.
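This extract-and-project step is commonly implemented as a single strided convolution; a minimal sketch with the ViT-Base dimensions from the text:

```python
import torch
import torch.nn as nn

# A 16x16 conv with stride 16 extracts and projects each patch in one shot,
# equivalent to flattening each 16x16x3 patch and applying a shared linear layer.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)           # one RGB image
tokens = patch_embed(img)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens
print(tokens.shape)
```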
Stage 2: Position Embeddings. Since transformers have no inherent notion of order or position, learnable position embeddings are added to each patch embedding. Unlike CNNs, which know about spatial adjacency through their kernel structure, ViT must learn spatial relationships entirely from data. The original ViT uses simple 1D position embeddings (just numbering the patches 1 through 196), and the authors found this works as well as more complex 2D position schemes. Interestingly, after training, the learned position embeddings show clear 2D spatial structure — nearby patches have similar embeddings — meaning the model discovers spatial relationships on its own.
Stage 3: Transformer Encoder. The sequence of patch embeddings (plus a special [CLS] token prepended to the sequence) is fed through a standard transformer encoder — the same multi-head self-attention and feed-forward network structure used in BERT. The ViT-Base model uses 12 transformer layers, 12 attention heads, and a hidden dimension of 768, totaling approximately 86 million parameters. ViT-Large scales to 24 layers and 307 million parameters, while ViT-Huge reaches 632 million parameters.
Stage 4: Classification Head. The output corresponding to the [CLS] token is passed through a small MLP head to produce the final classification prediction. During pre-training, this head is a larger MLP; during fine-tuning, it's replaced with a single linear layer.
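The four stages can be put together into a toy ViT using PyTorch's stock encoder layers. This is an illustrative sketch, not the exact ViT implementation; it omits details like dropout, weight initialization, and the final LayerNorm:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # Stage 1
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, dim))           # Stage 2
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)                # Stage 3
        self.head = nn.Linear(dim, num_classes)                           # Stage 4

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # classify from [CLS] output

model = MiniViT(dim=192, depth=2, heads=3)    # small config so it runs quickly
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)                           # torch.Size([2, 1000])
```

Scaling `dim=768, depth=12, heads=12` recovers the ViT-Base shape described above.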

The Data Hunger Problem
Here's the critical finding from the original ViT paper that shaped the entire field's trajectory: ViT doesn't outperform CNNs when trained on mid-sized datasets like ImageNet-1K (1.2 million images). In fact, it underperforms. The magic happens at scale. When pre-trained on JFT-300M (300 million images, Google's internal dataset) or even ImageNet-21K (14 million images), ViT matches or exceeds the best CNNs on downstream tasks.

This makes intuitive sense. CNNs have strong inductive biases (locality, translation equivariance) that act as a form of regularization — they constrain the model to learn features that are likely useful for images, even with limited data. Transformers lack these biases, so they need more data to learn the same spatial priors from scratch. But once they have enough data, the lack of restrictive biases becomes an advantage — the model is free to learn richer, more flexible representations.
This insight, that transformers need large-scale pre-training to shine in vision, motivated the next wave of research: how to make ViTs data-efficient.
DeiT: Making ViTs Data-Efficient

The Data-efficient Image Transformers (DeiT) paper (Touvron et al., 2021) from Facebook AI directly addressed ViT's data hunger. The authors showed that with the right training recipe, a ViT can be trained competitively on ImageNet-1K alone: no JFT-300M, no ImageNet-21K pre-training required.
The key contributions were less about architecture changes and more about training strategy. DeiT introduced a sophisticated set of data augmentations (RandAugment, random erasing, CutMix, Mixup), regularization techniques (stochastic depth, repeated augmentation), and a carefully tuned training schedule that collectively compensated for the missing inductive biases.
DeiT also introduced a novel distillation approach. A "distillation token" is added to the input sequence alongside the [CLS] token. This token is trained to reproduce the output of a strong CNN teacher (typically a RegNet). The intuition is elegant: instead of hardcoding convolutional biases into the architecture, you transfer them through knowledge distillation. The CNN teacher has learned what spatial features matter, and the distillation token channels that knowledge into the transformer student.
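A sketch of the resulting "hard" distillation objective, assuming a hypothetical student that exposes separate logits for the [CLS] and distillation tokens (DeiT averages the two loss terms like this):

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """[CLS] head learns from ground truth; the distillation head learns from
    the CNN teacher's hard predictions. Total loss averages the two."""
    loss_cls = F.cross_entropy(cls_logits, labels)
    teacher_pred = teacher_logits.argmax(dim=-1)           # hard labels from teacher
    loss_dist = F.cross_entropy(dist_logits, teacher_pred)
    return 0.5 * loss_cls + 0.5 * loss_dist

# Toy shapes: batch of 4, 10 classes (all tensors here are placeholders)
cls_logits = torch.randn(4, 10)
dist_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
print(loss.item())
```

At inference time, DeiT averages the predictions of the two heads.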
The result was striking. DeiT-Base achieved 83.4% top-1 accuracy on ImageNet using only ImageNet-1K for training, competitive with the best CNNs and with ViT models that required orders of magnitude more pre-training data.
This was a turning point. It showed that ViTs weren't inherently data-hungry; they just needed better training recipes. The community took notice, and the floodgates opened.
Swin Transformer: Bringing Back Hierarchy

While ViT and DeiT proved that transformers could classify images, they had a practical limitation: self-attention is quadratic in the sequence length. For a 224×224 image with 16×16 patches, you get 196 tokens, which is manageable. But for dense prediction tasks like object detection or semantic segmentation, you need higher-resolution feature maps. A 1024×1024 image with 16×16 patches gives 4,096 tokens, and self-attention over 4,096 tokens is expensive. With smaller patches or larger images, the cost becomes prohibitive.
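The arithmetic behind that scaling problem, as a quick back-of-the-envelope check:

```python
def attention_cost(image_size, patch_size):
    """Token count and attention-matrix size for a square image."""
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens ** 2  # attention scores form a (tokens x tokens) matrix

t224, c224 = attention_cost(224, 16)      # 196 tokens
t1024, c1024 = attention_cost(1024, 16)   # 4096 tokens
print(t224, t1024)
print(f"{c1024 / c224:.0f}x more attention entries")  # ~437x
```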
The Swin Transformer (Liu et al., 2021) solved this with two key ideas.
First, Swin computes self-attention within local windows rather than globally. The image is divided into non-overlapping windows (e.g., 7×7 patches per window), and attention is computed only within each window. This makes computational cost linear with image size rather than quadratic.
Second, to allow information to flow between windows, Swin uses shifted windows in alternating layers. In one layer, the windows start at the top-left corner. In the next layer, the windows are shifted by half the window size. This means patches that were at the boundary of different windows in one layer are now in the same window in the next layer, enabling cross-window connections.
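Both ideas can be sketched with plain tensor reshapes; the cyclic shift is just torch.roll by half the window size. (The real Swin code also applies an attention mask so patches that wrap around don't attend to each other, which is omitted here.)

```python
import torch

def window_partition(x, ws):
    """(B, H, W, C) -> (num_windows*B, ws, ws, C): non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, C)

x = torch.randn(1, 56, 56, 96)            # Swin stage-1 feature map (4x4 patches)
windows = window_partition(x, ws=7)
print(windows.shape)                      # (64, 7, 7, 96): 8x8 windows of 7x7 patches

# Shifted layer: roll the feature map by half the window size before partitioning,
# so patches that sat on window boundaries now share a window.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, ws=7)
print(shifted_windows.shape)              # same shape, different token groupings
```

Attention is then computed independently inside each 7×7 window, so the cost grows with the number of windows (linear in image area) rather than with the square of the total token count.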
Swin also introduces a hierarchical structure reminiscent of CNNs. It starts with small patches (4×4) and progressively merges them through "patch merging" layers, creating feature maps at 1/4, 1/8, 1/16, and 1/32 of the original resolution. This produces multi-scale feature maps that can be plugged directly into existing detection and segmentation frameworks like Feature Pyramid Networks (FPN) or UPerNet.
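Patch merging itself is short: concatenate each 2×2 neighborhood of tokens (C → 4C channels), then linearly project down to 2C, halving resolution while doubling width. A simplified sketch of Swin's PatchMerging layer (the real one also applies a LayerNorm before the projection):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate 2x2 neighboring tokens (C -> 4C), then project 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(x)                 # (B, H/2, W/2, 2C)

merge = PatchMerging(dim=96)
out = merge(torch.randn(1, 56, 56, 96))
print(out.shape)                                 # torch.Size([1, 28, 28, 192])
```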
The results were comprehensive. Swin Transformer achieved 87.3% top-1 on ImageNet (with ImageNet-22K pre-training), 58.7 box AP on COCO object detection, and 53.5 mIoU on ADE20K semantic segmentation: state-of-the-art across the board. It became the de facto backbone for dense prediction tasks and demonstrated that transformers could be a universal vision architecture, not just a classification model.
DINOv2: Self-Supervised Visual Features

The story of Vision Transformers would be incomplete without discussing self-supervised learning. DINOv2 (Oquab et al., 2023) from Meta AI trained ViT models on 142 million curated images without any labels, producing visual features that are remarkably versatile.
The approach builds on the original DINO (Self-Distillation with No Labels, Caron et al., 2021), which discovered something remarkable: when you train a ViT with a self-supervised objective (specifically, a self-distillation objective where a student network learns to match a momentum-updated teacher network), the self-attention maps in the last layer spontaneously learn to segment objects. Without ever seeing a segmentation mask, the model attends to semantically meaningful regions.
DINOv2 scaled this approach with a larger, curated dataset (LVD-142M), improved training stability, and produced models at multiple scales (ViT-S/B/L/g). The resulting features work as frozen, general-purpose visual representations. You can take DINOv2 features and train a simple linear classifier on top for classification, a linear layer for depth estimation, or a simple decoder for segmentation — and achieve strong results without fine-tuning the backbone at all.
This is significant because it mirrors what happened in NLP with BERT and GPT: a single pre-trained model produces representations useful across many downstream tasks. DINOv2 is arguably the closest thing computer vision has to a universal feature extractor.
Hands-On: Implementing Vision Transformers
Now let's move from theory to practice. We'll work through three progressively complex exercises: using a pretrained ViT for inference, fine-tuning on a custom dataset, and visualizing attention maps.
Setup
```python
"""Setup: verify all required packages import cleanly."""
import importlib
import importlib.util  # must be imported explicitly; bare 'import importlib' doesn't expose .util

required = [
    "torch", "torchvision", "transformers", "datasets", "evaluate",
    "accelerate", "timm", "PIL", "requests", "numpy", "matplotlib", "sklearn",
]
missing = [pkg for pkg in required if importlib.util.find_spec(pkg) is None]
if missing:
    raise ImportError(f"Missing packages: {missing}. Run: pip install {' '.join(missing)}")

import torch
import torchvision
import transformers
import datasets as ds
import evaluate as ev

print(f"torch        {torch.__version__}")
print(f"torchvision  {torchvision.__version__}")
print(f"transformers {transformers.__version__}")
print(f"datasets     {ds.__version__}")
print(f"evaluate     {ev.__version__}")

device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"\nDevice: {device}")
print("Setup OK.")
```
Exercise 1: Pretrained ViT for Image Classification
Let's start with the simplest use case: using a ViT model pre-trained on ImageNet-21K and fine-tuned on ImageNet-1K, off the shelf, for image classification.
```python
"""Exercise 1: Pretrained ViT for image classification."""
import certifi
import requests
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTForImageClassification.from_pretrained(model_name)
model.eval()

url = "https://images.unsplash.com/photo-1574158622682-e40e69881006?w=400"
response = requests.get(url, stream=True, verify=certifi.where())
response.raise_for_status()
image = Image.open(response.raw).convert("RGB")
print(f"Image size: {image.size}")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
probs = torch.nn.functional.softmax(logits, dim=-1)
top5_probs, top5_indices = torch.topk(probs, 5)

print("\nTop-5 Predictions:")
print("-" * 40)
for i in range(5):
    label = model.config.id2label[top5_indices[0][i].item()]
    prob = top5_probs[0][i].item()
    print(f"  {label}: {prob:.4f}")

assert top5_probs[0].sum().item() > 0.5, "Top-5 probabilities too low"
print("\nAssertion passed.")
```
This gives you a working classifier in about 10 lines of code. The model processes the image as 196 patches (14×14 grid of 16×16 patches), runs them through 12 transformer layers, and produces a probability distribution over 1,000 ImageNet classes.
Exercise 2: Fine-Tuning ViT on a Custom Dataset
This is where things get practical. We'll fine-tune ViT on the beans dataset from Hugging Face — a small agricultural image classification dataset with 3 classes (healthy, angular leaf spot, bean rust). This is a realistic scenario: you have a domain-specific dataset and want to leverage pretrained ViT representations.
```python
"""
Exercise 2: Fine-tuning ViT on the Beans dataset.

Changes from the original:
- num_train_epochs 5 → 2 (CPU/MPS tractable: ~5-10 min)
- report_to="none" (suppress wandb / tensorboard)
- dataloader_num_workers=0 (required on macOS for multiprocessing safety)
- save_total_limit=1 (avoid filling disk with checkpoints)
- PYTORCH_ENABLE_MPS_FALLBACK=1 in runner handles any unsupported MPS ops
"""
import os
import numpy as np
import torch
import evaluate as ev
from datasets import load_dataset
from transformers import (
    ViTImageProcessor,
    ViTForImageClassification,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
)

# 1. Load dataset
dataset = load_dataset("beans")
print(f"Train: {len(dataset['train'])} | Val: {len(dataset['validation'])} | Test: {len(dataset['test'])}")
class_names = dataset["train"].features["labels"].names
print(f"Classes: {class_names}")

# 2. Processor + preprocessing
model_name = "google/vit-base-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)

def preprocess(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    inputs = processor(images=images, return_tensors="pt")
    inputs["labels"] = examples["labels"]
    return inputs

dataset = dataset.with_transform(preprocess)

# 3. Model with new classification head
model = ViTForImageClassification.from_pretrained(
    model_name,
    num_labels=len(class_names),
    id2label={i: l for i, l in enumerate(class_names)},
    label2id={l: i for i, l in enumerate(class_names)},
    ignore_mismatched_sizes=True,
)

# Optional (linear probing): uncomment to freeze the backbone and train only the head
# for param in model.vit.parameters():
#     param.requires_grad = False

# 4. Metrics
acc_metric = ev.load("accuracy")
f1_metric = ev.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = acc_metric.compute(predictions=preds, references=labels)
    f1 = f1_metric.compute(predictions=preds, references=labels, average="weighted")
    return {"accuracy": acc["accuracy"], "f1": f1["f1"]}

# 5. Training arguments
training_args = TrainingArguments(
    output_dir="./vit-beans",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=10,
    remove_unused_columns=False,
    fp16=torch.cuda.is_available(),
    dataloader_num_workers=0,
    report_to="none",
)

# 6. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    data_collator=DefaultDataCollator(),
)
trainer.train()

# 7. Evaluate on test set
results = trainer.evaluate(dataset["test"])
print(f"\nTest Results:")
print(f"  Accuracy: {results['eval_accuracy']:.4f}")
print(f"  F1 Score: {results['eval_f1']:.4f}")

assert results["eval_accuracy"] > 0.5, "Accuracy suspiciously low — check training"
print("Assertion passed.")
```
A few things to note about this training setup. The learning rate of 2e-5 is deliberately low. When fine-tuning a pretrained model, you want to nudge the weights rather than overwrite them. The pretrained features already encode rich visual representations; aggressive updates would destroy them. This is the same principle behind fine-tuning BERT for NLP tasks.
The ignore_mismatched_sizes=True flag is important. The pretrained model has a 1,000-class ImageNet head, but we need a 3-class head. This flag tells Hugging Face to discard the old classification head and initialize a new one. The backbone weights are preserved; only the head is randomly initialized.
You can also experiment with linear probing by uncommenting the backbone-freezing lines. This trains only the classification head while keeping all ViT parameters frozen. It's faster and requires less data, but typically achieves slightly lower accuracy than full fine-tuning. The gap between linear probing and full fine-tuning is itself an interesting metric: a small gap suggests the pretrained features are already highly relevant to your task.
Exercise 3: Visualizing Attention Maps
One of the most compelling aspects of Vision Transformers is the interpretability of their attention maps. Unlike CNNs, where understanding what the model "looks at" requires gradient-based methods like Grad-CAM, ViTs give you attention weights directly. Let's extract and visualize them.
```python
"""Exercise 3: Visualizing ViT attention maps."""
from pathlib import Path

import certifi
import requests
import torch
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViTModel, ViTImageProcessor


def get_attention_maps(image_url, model_name="google/vit-base-patch16-224"):
    """Extract attention maps from all layers and heads."""
    processor = ViTImageProcessor.from_pretrained(model_name)
    model = ViTModel.from_pretrained(model_name, output_attentions=True)
    model.eval()

    response = requests.get(image_url, stream=True, verify=certifi.where())
    response.raise_for_status()
    image = Image.open(response.raw).convert("RGB")

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions: tuple of (num_layers,) tensors
    # Each shape: (batch, num_heads, seq_len, seq_len)
    return outputs.attentions, image


def visualize_cls_attention(attentions, image, layer=-1, outfile="output/attention_map.png"):
    """Visualize what [CLS] attends to in a given layer (averaged across heads)."""
    attn = attentions[layer]
    cls_attention = attn[0, :, 0, 1:].mean(dim=0)     # (196,)
    num_patches = int(cls_attention.shape[0] ** 0.5)  # 14
    cls_attn_2d = cls_attention.reshape(num_patches, num_patches).numpy()

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    axes[0].imshow(image.resize((224, 224)))
    axes[0].set_title("Original Image", fontsize=14)
    axes[0].axis("off")

    axes[1].imshow(cls_attn_2d, cmap="viridis")
    axes[1].set_title(f"[CLS] Attention (Layer {layer})", fontsize=14)
    axes[1].axis("off")

    img_arr = np.array(image.resize((224, 224))) / 255.0
    attn_up = np.array(
        Image.fromarray((cls_attn_2d * 255).astype(np.uint8)).resize((224, 224), Image.BILINEAR)
    ) / 255.0
    overlay = img_arr * 0.5 + plt.cm.viridis(attn_up)[:, :, :3] * 0.5
    axes[2].imshow(overlay)
    axes[2].set_title("Attention Overlay", fontsize=14)
    axes[2].axis("off")

    plt.tight_layout()
    Path(outfile).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(outfile, dpi=150, bbox_inches="tight")
    plt.close(fig)
    print(f"Saved: {outfile}")


def visualize_attention_across_layers(attentions, image, outfile="output/attention_layers.png"):
    """Visualize how [CLS] attention evolves across layers."""
    layers_to_show = [0, 2, 5, 8, 11]
    fig, axes = plt.subplots(1, len(layers_to_show) + 1, figsize=(20, 4))

    axes[0].imshow(image.resize((224, 224)))
    axes[0].set_title("Original", fontsize=12)
    axes[0].axis("off")

    for idx, layer_idx in enumerate(layers_to_show):
        attn = attentions[layer_idx]
        cls_attn = attn[0, :, 0, 1:].mean(dim=0)
        num_patches = int(cls_attn.shape[0] ** 0.5)
        cls_attn_2d = cls_attn.reshape(num_patches, num_patches).numpy()
        axes[idx + 1].imshow(cls_attn_2d, cmap="inferno")
        axes[idx + 1].set_title(f"Layer {layer_idx + 1}", fontsize=12)
        axes[idx + 1].axis("off")

    plt.suptitle("Evolution of [CLS] Attention Across Layers", fontsize=14, y=1.02)
    plt.tight_layout()
    Path(outfile).parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(outfile, dpi=150, bbox_inches="tight")
    plt.close(fig)
    print(f"Saved: {outfile}")


url = "https://images.unsplash.com/photo-1574158622682-e40e69881006?w=400"
attentions, image = get_attention_maps(url)
visualize_cls_attention(attentions, image, layer=-1)
visualize_attention_across_layers(attentions, image)

print(f"\nNumber of layers: {len(attentions)}")
print(f"Attention shape (one layer): {tuple(attentions[0].shape)}")
print(f"  Batch size: {attentions[0].shape[0]}")
print(f"  Attention heads: {attentions[0].shape[1]}")
print(f"  Seq length: {attentions[0].shape[2]} (1 CLS + 196 patches)")

assert len(attentions) == 12, f"Expected 12 layers, got {len(attentions)}"
assert attentions[0].shape[2] == 197, "Unexpected sequence length"
print("Assertions passed.")
```
What you'll observe in the attention visualizations is revealing. Early layers (1-3) tend to show diffuse, relatively uniform attention — the model is gathering broad contextual information. Middle layers (5-8) begin to show more structured patterns, often attending to edges and texture boundaries. The final layers (10-12) typically produce sharp, object-focused attention maps where the [CLS] token concentrates on the semantically important regions of the image.
This progressive refinement mirrors the hierarchical feature extraction of CNNs, but it emerges from training rather than being architecturally enforced. The model learns to process information from broad to specific, not because it has to, but because it's the most effective strategy.
Exercise 4: Comparing ViT vs CNN (ResNet) Performance and Speed
Let's run a practical head-to-head comparison to understand the real-world tradeoffs.
```python
"""
Exercise 4: Comparing ViT-Base vs ResNet-50 (speed + params).

Fix from original: resnet50(pretrained=True) is deprecated.
Use weights=ResNet50_Weights.DEFAULT instead.
"""
import time

import certifi
import requests
import torch
import numpy as np
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor
from torchvision import models, transforms
from torchvision.models import ResNet50_Weights


def benchmark_model(model, inputs, model_name, num_runs=50):
    """Benchmark inference speed (reduced from 100 → 50 runs for CPU speed)."""
    model.eval()

    # Use explicit callable to avoid isinstance(BatchEncoding, dict) ambiguity
    def forward():
        if isinstance(inputs, torch.Tensor):
            return model(inputs)
        return model(**{k: v for k, v in inputs.items()})  # unpack safely

    with torch.no_grad():
        for _ in range(5):  # warmup
            forward()

    times = []
    with torch.no_grad():
        for _ in range(num_runs):
            t0 = time.perf_counter()
            forward()
            times.append((time.perf_counter() - t0) * 1000)

    mean, std = np.mean(times), np.std(times)
    print(f"  {model_name}:")
    print(f"    Mean: {mean:.1f} ± {std:.1f} ms    Throughput: {1000 / mean:.0f} img/s")
    return mean


def count_parameters(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# --- Load sample image ---
url = "https://images.unsplash.com/photo-1574158622682-e40e69881006?w=400"
resp = requests.get(url, stream=True, verify=certifi.where())
resp.raise_for_status()
image = Image.open(resp.raw).convert("RGB")

# --- ViT-Base ---
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
vit_inputs = vit_processor(images=image, return_tensors="pt")

# --- ResNet-50 (use new-style weights API) ---
resnet_model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
resnet_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
resnet_input = resnet_transform(image).unsqueeze(0)

# --- Parameter counts ---
print("=" * 52)
print("MODEL COMPARISON")
print("=" * 52)
vit_total, _ = count_parameters(vit_model)
res_total, _ = count_parameters(resnet_model)
print(f"\nViT-Base/16: {vit_total / 1e6:.1f}M parameters")
print(f"ResNet-50:   {res_total / 1e6:.1f}M parameters")
print(f"Ratio: ViT is {vit_total / res_total:.1f}x larger")

# --- Speed benchmark ---
print(f"\nInference Speed (CPU, 50 runs):")
print("-" * 52)
vit_time = benchmark_model(vit_model, vit_inputs, "ViT-Base/16")
res_time = benchmark_model(resnet_model, resnet_input, "ResNet-50")
print(f"\n  ResNet-50 is {vit_time / res_time:.1f}x faster than ViT on CPU")

# --- Published ImageNet accuracy ---
print(f"\nImageNet Top-1 Accuracy (from papers):")
print("-" * 52)
rows = [
    ("ResNet-50", "76.1%"),
    ("ResNet-152", "78.3%"),
    ("ViT-Base/16 (IN-21K → IN-1K)", "84.2%"),
    ("ViT-Large/16 (IN-21K → IN-1K)", "85.3%"),
    ("Swin-Base (IN-22K → IN-1K)", "86.4%"),
]
for name, acc in rows:
    print(f"  {name:<40} {acc}")

assert vit_total > res_total, "ViT should have more parameters than ResNet-50"
assert vit_time > 0 and res_time > 0
print("\nAssertions passed.")
```
The comparison reveals the fundamental tradeoff. ViT-Base has roughly 3.4x more parameters than ResNet-50 and is typically 2-3x slower on CPU. But it achieves significantly higher accuracy (84.2% vs 76.1% on ImageNet) when pretrained on larger data. On GPU with batched inference, the speed gap narrows considerably because the transformer's large, uniform matrix multiplications map very efficiently onto GPU hardware.
This is the practical reality: if you're deploying on edge devices with limited compute, CNNs or efficient ViT variants (like MobileViT or EfficientViT) might be preferable. If you're running inference on a server with GPU access and want maximum accuracy, ViTs are likely the better choice.
Exercise 5: Feature Extraction with DINOv2
Finally, let's use DINOv2 as a frozen feature extractor: the self-supervised approach where you don't fine-tune the backbone at all, and instead train a simple classifier on top.
```python
"""Exercise 5: DINOv2 feature extraction + linear probing on CIFAR-10."""
import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import transforms, datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load DINOv2 ViT-Small (dinov2_vits14: 21M params, fast on CPU)
print("Loading DINOv2 ViT-S/14 from torch.hub...")
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14", verbose=False)
dinov2.eval()
print("Model loaded.")

# Preprocessing for DINOv2 (must use 224px; patch size 14 → 16×16 grid of patches)
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# CIFAR-10 — 5k train subset, 1k test subset (fast + sufficient for demo)
train_dataset = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
train_subset = torch.utils.data.Subset(train_dataset, range(5000))
test_subset = torch.utils.data.Subset(test_dataset, range(1000))
train_loader = DataLoader(train_subset, batch_size=64, shuffle=False, num_workers=0)
test_loader = DataLoader(test_subset, batch_size=64, shuffle=False, num_workers=0)


def extract_features(model, dataloader):
    """Extract [CLS] token embeddings from DINOv2 for every batch."""
    all_features, all_labels = [], []
    device = next(model.parameters()).device
    with torch.no_grad():
        for images, labels in dataloader:
            features = model(images.to(device))  # (batch, 384) for ViT-S
            all_features.append(features.cpu().numpy())
            all_labels.append(labels.numpy())
    return np.concatenate(all_features), np.concatenate(all_labels)


print("\nExtracting training features...")
train_features, train_labels = extract_features(dinov2, train_loader)
print(f"  Shape: {train_features.shape}")

print("Extracting test features...")
test_features, test_labels = extract_features(dinov2, test_loader)
print(f"  Shape: {test_features.shape}")

print("\nFitting logistic regression on DINOv2 features...")
clf = LogisticRegression(max_iter=1000, C=1.0, random_state=42)
clf.fit(train_features, train_labels)

train_acc = accuracy_score(train_labels, clf.predict(train_features))
test_acc = accuracy_score(test_labels, clf.predict(test_features))
print(f"\nResults (DINOv2-ViT-S + Logistic Regression):")
print(f"  Train Accuracy: {train_acc:.4f}")
print(f"  Test Accuracy:  {test_acc:.4f}")

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
print("\nClassification Report:")
print(classification_report(test_labels, clf.predict(test_features),
                            target_names=class_names, digits=3))

assert test_acc > 0.70, f"DINOv2 test accuracy {test_acc:.4f} unexpectedly low"
print("Assertion passed.")
```
The remarkable thing about this approach is how simple it is. We're using a frozen model — no backpropagation through the transformer, no GPU training needed for the classifier. Just extract features once and fit a logistic regression. On CIFAR-10, DINOv2-ViT-S with a linear head achieves ~95% accuracy, which is competitive with fully supervised CNNs that were specifically trained on this dataset.
This is the power of strong pretrained representations. If DINOv2 features work this well on CIFAR-10 (which wasn't in its training data), imagine what they can do on your domain-specific dataset: medical images, satellite imagery, agricultural photos, or manufacturing defect detection, with just a simple linear layer on top.
When to Use Which Model: A Practical Decision Framework
Having covered the major variants, here's how to think about choosing the right model for your specific situation.
If you have a large labeled dataset (>100K images) and plenty of compute, full fine-tuning of ViT-Large or Swin-Large will give you the best accuracy. Use Swin if your task involves dense prediction (detection, segmentation) since it produces multi-scale feature maps natively.
If you have a moderate dataset (1K-100K images), fine-tuning ViT-Base or DeiT-Base with appropriate regularization is a strong default. Use a low learning rate (1e-5 to 5e-5), enable extensive data augmentation, and consider starting from an ImageNet-21K pretrained checkpoint rather than ImageNet-1K.
If you have very limited labeled data (<1K images), frozen DINOv2 features with a linear classifier or lightweight MLP head is often your best bet. The self-supervised pretraining captures visual features that transfer well even to niche domains, and training only the head avoids overfitting.
If you need fast inference on edge devices, consider efficient variants like MobileViT (Mehta et al., 2021) or EfficientViT (Cai et al., 2022), which combine lightweight convolutions with attention mechanisms. Alternatively, a well-optimized ResNet or EfficientNet might still be more practical for latency-critical deployment.
If your task requires understanding both images and text, CLIP-style models (which use ViT as the vision encoder) or LLaVA-style multimodal models are the natural extension. The ViT backbone trained with contrastive image-text supervision produces features aligned to a shared vision-language embedding space.
Key Takeaways
The Vision Transformer story is, at its core, a story about the tradeoff between inductive bias and scalability. CNNs bake in strong assumptions about how images work: locality, translation equivariance, hierarchical composition. These assumptions are correct, and they make CNNs powerful learners from limited data. But they also impose a ceiling on what the model can represent.
Transformers make almost no assumptions about the input structure. This means they need more data to learn what CNNs get for free, but it also means they can learn representations that are richer, more flexible, and more transferable once they have sufficient data and compute. The progression from ViT (needs 300M images) to DeiT (works with 1.2M images and distillation) to DINOv2 (self-supervised, no labels needed) shows the field converging on how to efficiently unlock this potential.
The practical implication is clear: if you're starting a new computer vision project today, ViT-based models should be your default starting point. The ecosystem is mature (Hugging Face support, timm library, pretrained weights for every variant), the training recipes are well-understood, and the performance ceiling is higher than CNNs for most tasks. CNNs aren't obsolete; they remain excellent for edge deployment, small-data regimes, and specific architectural needs, but the center of gravity has shifted.
References
- Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929
- Touvron, H., et al. (2021). Training data-efficient image transformers & distillation through attention. arXiv:2012.12877
- Liu, Z., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv:2103.14030
- Caron, M., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers. arXiv:2104.14294
- Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193
- Mehta, S., et al. (2021). MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv:2110.02178


