This post is mainly based on the CLIP paper, “Learning Transferable Visual Models From Natural Language Supervision” (Radford et al., 2021).

CLIP

  • Representation Learning: predicting which caption goes with which image is a scalable way to learn SOTA image representations
    • 400 million (image, text) pairs dataset collected from the internet
  • Natural language based zero-shot transfer to downstream tasks
  • Results
    • Evaluation on over 30 benchmarks, including: OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification
    • Transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without fine-tuning
    • Matches the accuracy of the original ResNet-50 on ImageNet zero-shot without using any of the 1.28 million labeled training examples

What is CLIP?

  1. Contrastive pre-training: jointly train an image encoder and a text encoder to predict the correct (image, text) pairings within each batch
  2. Create a dataset classifier from label text: convert classification labels into descriptive sentences
  3. Use for zero-shot prediction: select the class with the highest logit (a minimal usage sketch follows this list)
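A minimal usage sketch of these three steps, assuming the openai/CLIP reference package and its API (`clip.load`, `clip.tokenize`); the image path and class names are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)     # 1. contrastively pre-trained encoders

classes = ["dog", "cat", "airplane"]                          # 2. convert labels to sentences
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():                                         # 3. zero-shot prediction
    logits_per_image, logits_per_text = model(image, text)
    print(classes[logits_per_image.argmax(dim=-1).item()])
```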

Background

  • Previous work focuses on weak supervision
    • Learning from ImageNet-related hashtags
    • Learning from a noisy JFT-300M dataset
  • These methods navigate a trade-off between learning image representations from
    • A limited amount of supervised “gold-label” data
    • Practically unlimited amounts of raw text
  • However, their predictions are limited to fixed sets of 1000 and 18,291 classes, respectively

Overview

  • Collect a new dataset of 400 million (image, text) pairs
  • Design the CLIP architecture (a simplified version of ConVIRT)
  • Study scalability of CLIP: observe that model transfer performance is a smoothly predictable function of compute
  • Study zero-shot performance on over 30 Benchmarks
  • Conduct a linear-probe representation learning analysis: CLIP embeddings outperform SOTA models while also being more computationally efficient

Approach

  • Natural language can be used as a training signal for learning image representations
    • Natural language supervision is much easier to scale than crowd-sourced labeling
    • Beyond learning the representation, it also connects that representation to language, which enables flexible zero-shot transfer
  • WebImageText dataset
    • Common words: all words occurring at least 100 times in the English Wikipedia
    • Query construction: build 500,000 queries from these common words
    • Class balance: include up to 20,000 (image, text) pairs per query
    • Total word count: similar to the WebText dataset used to train GPT-2

Efficient training

  • Previous approaches
    • ResNeXt101-32x48d: 19 GPU years
    • EfficientNet-L2: 33 TPUv3 years
    • Both of these systems were trained to predict only the 1000 ImageNet classes
  • Our approach
    • Baseline: Bag of Words Prediction
      • A CNN learns to predict a bag-of-words encoding of the text (a minimal sketch appears after the figure below)
    • Transformer Language Model
      • Train a 63-million-parameter text transformer together with a ResNet-50 image encoder from scratch to predict the caption of each image
      • The transformer alone uses twice the compute of the ResNet-50 image encoder
      • Learns to recognize ImageNet classes three times more slowly than the bag-of-words baseline
    • Problems of the above approaches
      • Trying to predict the exact words of the text accompanying each image is unnecessarily hard and prone to overfitting (similar to predicting every pixel in image generation)
      • Contrastive Multiview Coding, 2019: contrastive objectives can learn better representations than their equivalent predictive objectives
      • Generative Pretraining from Pixels, ICML 2020: generative models require over an order of magnitude more compute than contrastive models to learn representations of similar quality
    • Bag of Words Contrastive (CLIP)
      • Inspired by contrastive learning
      • Predict which text as a whole is paired with which image, rather than the exact words of that text
      • 4x efficiency improvement over the bag-of-words baseline

clip-eff

CLIP is much more efficient at zero-shot transfer than our image-caption baseline. Blue: the transformer-based language model learns 3x more slowly than the baseline that predicts a bag-of-words encoding of the text. Green: CLIP's contrastive objective is a further 4x more efficient than that baseline.
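To make the baseline concrete, here is a minimal sketch of the bag-of-words prediction objective; the vocabulary, classifier head, and caption handling are simplified placeholders rather than the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def bow_targets(captions, vocab):
    """Multi-hot bag-of-words encoding of each caption over `vocab` (word -> index)."""
    targets = torch.zeros(len(captions), len(vocab))
    for i, caption in enumerate(captions):
        for word in caption.lower().split():
            if word in vocab:
                targets[i, vocab[word]] = 1.0
    return targets

def bow_loss(image_features, captions, vocab, classifier):
    """Binary cross-entropy between the predicted and true bag-of-words."""
    logits = classifier(image_features)                     # [batch, vocab_size]
    targets = bow_targets(captions, vocab).to(logits.device)
    return F.binary_cross_entropy_with_logits(logits, targets)
```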

Architecture

clip-arch-1

  • Overview
    • Given a batch of N (image, text) pairs
    • Predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred
  • CLIP learns a multi-modal embedding space by
    • Jointly training an image encoder and text encoder
    • Maximizing the cosine similarity of the image and text embeddings of the N real pairs
    • Minimizing the cosine similarity of the embeddings of the $N^2 − N$ incorrect pairings
  • Pseudocode
    • Paper Figure-3
    • Encode a batch of images and texts into feature matrices of shape $[n, d_i]$ and $[n, d_t]$ using the image and text encoders
    • Project both into the joint multimodal embedding space, giving two $[n, d_e]$ matrices, via a linear projection followed by L2 normalization
    • Compute the $[n, n]$ pairwise cosine-similarity matrix, scaled by a learned temperature
    • Compute the symmetric cross-entropy loss along both axes (a runnable version is sketched after this list)
  • Image encoder
    • ResNet-50 + attention pooling
    • ViT, with an additional layer normalization and a slightly different initialization scheme
  • Text encoder
    • Embedding layer: BPE, 49152 vocab size
    • Max sequence length: 76 (for efficiency)
    • Special tokens: [SOS], [EOS]
    • Base: 12-layer, 512-dim, 8-attention head, 63M-parameter
    • Masked self-attention (so the text encoder can be initialized from a pre-trained language model, or language modeling can be added as an auxiliary objective)
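A runnable PyTorch rendering of the Figure-3 pseudocode described above; `image_encoder`, `text_encoder`, the projection matrices `W_i`, `W_t`, and the learned temperature `t` are stand-ins for CLIP's actual modules:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_encoder, text_encoder, W_i, W_t, images, texts, t):
    # extract feature representations of each modality
    I_f = image_encoder(images)                   # [n, d_i]
    T_f = text_encoder(texts)                     # [n, d_t]

    # joint multimodal embedding: linear projection + L2 normalization
    I_e = F.normalize(I_f @ W_i, dim=-1)          # [n, d_e]
    T_e = F.normalize(T_f @ W_t, dim=-1)          # [n, d_e]

    # scaled pairwise cosine similarities
    logits = I_e @ T_e.t() * torch.exp(t)         # [n, n]

    # symmetric cross-entropy loss along both axes
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2
```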

Optimization

  • Train from scratch without initializing the image encoder or the text encoder
  • Loss
    • Symmetric cross-entropy loss over the similarity scores
    • Weight decay regularization applied to all weights that are not gains or biases
    • The learnable logit scale is clipped so that logits are never scaled by more than 100, for numerical stability
  • Optimizer
    • Adam
    • Cosine learning rate decay
  • Hyperparams
    • Data augmentation: a random square crop from resized images is the only augmentation used
    • The softmax temperature parameter $\tau$ is optimized directly during training (as a log-parameterized multiplicative scalar), initialized to 0.07
    • Initial hyperparameters were set with a combination of grid search, random search, and manual tuning
  • Mixed Precision Training
    • Gradient checkpointing
    • Half-precision Adam statistics
    • Half-precision stochastically rounded text encoder weights
  • Distributed Training
    • Batch size: 32,768
    • Similarity sharding: each GPU computes only the subset of the pairwise similarities needed for its local batch of embeddings (sketched below)
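A hedged sketch of that sharding, assuming the globally gathered embeddings (e.g. via `torch.distributed.all_gather`) are already available; `rank` and `local_n` are illustrative:

```python
import torch
import torch.nn.functional as F

def local_clip_loss(local_img_e, local_txt_e, global_img_e, global_txt_e,
                    logit_scale, rank, local_n):
    # only the rows of the n x n similarity matrix that involve this
    # worker's local images / texts are computed
    logits_i = logit_scale * local_img_e @ global_txt_e.t()   # [local_n, n]
    logits_t = logit_scale * local_txt_e @ global_img_e.t()   # [local_n, n]

    # each local sample's positive match sits at its global batch index
    labels = rank * local_n + torch.arange(local_n, device=local_img_e.device)

    loss_i = F.cross_entropy(logits_i, labels)
    loss_t = F.cross_entropy(logits_t, labels)
    return (loss_i + loss_t) / 2
```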

Ablations

  • Image encoder
    • 5 ResNets
      • ResNet-50, ResNet-101
      • RN50x4, RN50x16, RN50x64 (EfficientNet-style model scaling, using ~4x, 16x, and 64x the compute of a ResNet-50)
      • RN50x64: 592 V100 GPUs, 18 days
    • 3 Vision Transformers
      • ViT-B/32, ViT-B/16, and ViT-L/14
      • ViT-L/14: 256 V100 GPUs, 12 days
      • ViT-L/14@336px: pre-train at a higher 336 pixel resolution for one additional epoch to boost performance
  • Text encoder
    • Findings: CLIP’s performance is less sensitive to the capacity of the text encoder
  • Train all models for 32 epochs

Experiments: Zero-Shot Transfer

clip-arch-2

  • Problems of existing benchmarks
    • Research benchmarks often pose somewhat artificial problems rather than measuring performance on a specific real-world task
    • SVHN dataset: measures the task of street number transcription on the distribution of Google Street View photos
    • CIFAR-10: unclear what “real” task the dataset measures
    • When we evaluate zero-shot transfer of CLIP on CIFAR-10, it is more an evaluation of CLIP’s robustness to distribution shift and domain generalization than of task generalization
  • CLIP on zero-shot classification
    • Reuse ability learned in pre-training
    • Form $n$ (image, text) pairs, where the texts are the names of the $n$ classes in the dataset, and predict the pairing with the highest logit score
    • Alternative view: compare the cosine similarity of the image embedding against the $n$ text embeddings
    • The text embeddings of the $n$ classes only need to be computed once and can be cached
    • The image embedding is computed once per image (a sketch of this procedure appears at the end of this section)
  • Results
    • Improves zero-shot accuracy on ImageNet from Visual N-Grams’ proof-of-concept 11.5% to 76.2%
    • Top-1 accuracy 76.2%: matches the performance of the original ResNet-50 (without using 1.28 million labels)
    • Top-5 accuracy 95%: matches the performance of Inception-V4
  • Prompt engineering and ensembling
    • Improves ImageNet accuracy by almost 5%, roughly equivalent to a 4x gain in compute efficiency
    • For details, see Section 3.1.4
  • Performance analysis on 27 datasets
    • CLIP outperforms ResNet-50 on 16/27
    • CLIP underperforms on specialized, complex, or abstract tasks
      • Satellite image classification (EuroSAT and RESISC45)
      • Lymph node tumor detection (PatchCamelyon)
      • Counting objects in synthetic scenes (CLEVRCounts)
      • Self-driving related tasks such as German traffic sign recognition (GTSRB)
      • Recognizing distance to the nearest car (KITTI Distance)
    • Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal
      • See paper Figure-8

Comparing zero-shot transfer image classification results of CLIP vs Visual N-Grams (proof of concept on zero-shot transfer).
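A sketch of the zero-shot classifier, including the prompt ensembling mentioned above; the encoders, tokenizer, and prompt templates are placeholders, and the fixed 100x logit scale stands in for the learned temperature:

```python
import torch
import torch.nn.functional as F

TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "an origami {}."]

@torch.no_grad()
def build_zeroshot_weights(text_encoder, tokenize, classnames, device):
    weights = []
    for name in classnames:
        texts = tokenize([t.format(name) for t in TEMPLATES]).to(device)
        emb = F.normalize(text_encoder(texts), dim=-1)   # [num_templates, d]
        emb = F.normalize(emb.mean(dim=0), dim=0)        # ensemble the prompts
        weights.append(emb)
    return torch.stack(weights, dim=1)                   # [d, num_classes], computed once

@torch.no_grad()
def zeroshot_predict(image_encoder, images, zeroshot_weights):
    img = F.normalize(image_encoder(images), dim=-1)     # [batch, d]
    logits = 100.0 * img @ zeroshot_weights              # scaled cosine similarities
    return logits.argmax(dim=-1)                         # class with the highest logit
```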

Few-shot Linear Probe

clip-linear-probe

Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. The 20 datasets with at least 16 examples per class were used in this analysis.
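For reference, the linear-probe baseline in the figure above can be approximated by fitting a logistic regression classifier on frozen CLIP image features; the regularization strength `C` below is illustrative (the paper sweeps it per dataset):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels, C=0.316):
    """train_feats: [num_train, d] frozen image embeddings, e.g. 16 examples per class."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)   # held-out accuracy
```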

Scaling Law

clip-scaling

Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models.

Representation Learning

  • Scale of experiments: evaluating 66 different models on 27 different datasets requires tuning and running 1,782 separate evaluations
  • Conclusions
    • All CLIP models, regardless of scale, outperform all evaluated systems in terms of compute efficiency
    • CLIP-ViT models are about 3x more compute-efficient than CLIP-ResNets
    • CLIP’s features outperform the features of the best ImageNet model on a wide variety of datasets (paper Figure-11)

clip-embed-eff

Linear probe performance of CLIP models in comparison with state-of-the-art computer vision models.

Performance of the linear probe on learned representations. CLIP scales very well to larger models.

Robustness to Natural Distribution Shift

  • Natural distribution shift datasets: ImageNet Sketch, ImageNet Adversarial (ImageNet-A), etc. (see examples below)
  • ResNet-101 makes 5x more mistakes when evaluated on these natural distribution shifts compared to the ImageNet validation set
  • All zero-shot CLIP models improve effective robustness by a large amount and reduce the size of the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%

clip-robust

Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. Left: quantifying CLIP’s robustness gap against ImageNet-trained models. Right: visualizing distribution shift for bananas.

clip-robust-2

CLIP’s features are more robust to task shift when compared to models pre-trained on ImageNet. This suggests that the representations of models trained on ImageNet are somewhat overfit to their task.

Data Overlap Analysis

  • See paper Section 5

Limitations

  • Performance
    • While zero-shot CLIP is competitive with a ResNet-50 baseline, it is still well below the overall supervised SOTA
    • Significant work is still needed to improve CLIP’s data efficiency, task learning, and transfer capabilities
    • An estimated ~1000x increase in compute would be required for zero-shot CLIP to reach overall SOTA performance
  • Robustness
    • CLIP generalizes poorly to data that is truly out-of-distribution (e.g., MNIST handwritten digits, despite CLIP’s strong OCR performance on rendered text)
    • See paper Appendix E and Table 14
    • This suggests CLIP does little to address the underlying problem of brittle generalization in deep learning models.
    • Instead, CLIP tries to circumvent the problem, hoping that training on such a large and varied dataset makes all data effectively in-distribution.
  • Efficiency
    • CLIP also does not address the poor data efficiency of deep learning.
    • Instead CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples.
  • Methodology limitations
    • Repeatedly queried performance on full validation sets to guide the development of CLIP (which runs counter to the goal of true zero-shot transfer)
    • Selection of evaluation datasets: lack of a benchmark task set
    • Social biases: inherited from largely unfiltered internet training data