This post is mainly based on the Sentence-BERT paper, the SimCSE paper, and OpenAI's text and code embeddings (cpt-text) paper.

Text Embeddings can be used for

  • Sentence Similarity: e.g., the Semantic Textual Similarity (STS) benchmarks such as STS-16
  • Information Retrieval / Search: e.g., MSMARCO

Research on sentence similarity and search/information retrieval used to be separate: earlier embedding-based search methods do not report performance on sentence similarity tasks, and earlier sentence embedding methods do not evaluate on search tasks. While the two tasks appear related, their definitions can conflict: a sentence and its negation could be considered relevant during search, but not “similar” in a sentence similarity task.

Sentence-BERT

  • Goal: train a compute-efficient network that outputs sentence embeddings
  • Dataset: NLI datasets (sentence pairs labeled entailment, neutral, or contradiction)
  • Supervised training, initialized from pre-trained BERT
  • Loss: triplet loss / cross-entropy loss / MSE loss on cosine similarity
  • Result: significantly outperforms SOTA on 7 STS tasks
    • +11.7 vs. InferSent (siamese BiLSTM network with max-pooling over the output, trained on SNLI)
    • +5.5 vs. Universal Sentence Encoder (transformer network, unsupervised learning plus SNLI training)

Previous approach

  • Compute inefficient
    • BERT and RoBERTa can perform sentence-pair regression tasks, e.g., semantic textual similarity (STS)
    • However, each comparison requires feeding the sentence pair through the network, which is computationally expensive (see the arithmetic below)
  • Pooling
    • Average the BERT output layer (the per-token BERT embeddings) or use the output of the first token (the [CLS] token)
    • Yields rather bad sentence embeddings, often worse than averaging GloVe embeddings
  • Training: not initialized from a pre-trained network
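
To see the scale of this inefficiency: scoring every pair among $n$ sentences with a cross-encoder takes $n(n-1)/2$ forward passes, whereas a bi-encoder needs only $n$ passes plus cheap cosine comparisons. For the 10,000-sentence example in the SBERT paper:

\[\frac{n(n-1)}{2} = \frac{10000 \cdot 9999}{2} = 49{,}995{,}000 \ \text{pairwise passes} \quad \text{vs.} \quad 10{,}000 \ \text{single-sentence passes}\]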

Dataset

  • Stanford Natural Language Inference dataset (SNLI): 570,000 sentence pairs annotated with the labels contradiction, entailment, and neutral
  • Multi-Genre NLI (MultiNLI): 430,000 sentence pairs and covers a range of genres of spoken and written text

Architecture

  • Pooling layer over the BERT token embeddings
  • Ablation on 3 pooling designs (sketched after this list)
    • CLS-token
    • MEAN-strategy: mean over all token embeddings [Default]
    • MAX-strategy: max over all token embeddings
  • Siamese network vs. triplet network: usage depends on the training data
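
A minimal sketch of the three pooling strategies (my own illustration, not the paper's code), assuming a Hugging Face transformers BERT backbone; the attention mask keeps padding tokens out of the MEAN/MAX pooling:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode(sentences, strategy="mean"):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    token_emb = bert(**enc).last_hidden_state           # (batch, seq_len, hidden)
    mask = enc["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
    if strategy == "cls":
        return token_emb[:, 0]                           # output of the [CLS] token
    if strategy == "mean":                                # SBERT default
        return (token_emb * mask).sum(1) / mask.sum(1)
    if strategy == "max":
        return token_emb.masked_fill(mask == 0, -1e9).max(dim=1).values
    raise ValueError(f"unknown pooling strategy: {strategy}")

u, v = encode(["A man is playing guitar.", "Someone plays an instrument."])
print(torch.cosine_similarity(u, v, dim=0))              # cosine similarity of the pair
```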

sbert-arch

Left: architecture with the Type I loss (linear-probe classification loss) during training. Right: similarity computation at inference.

Optimization

  • Pre-trained BERT and RoBERTa network
  • Fine-tuned in <20 minutes
  • Optimizer: Adam
    • Lr: 2e-5
    • Warm-up: linear learning-rate warm-up over the first 10% of the training data
  • Batch-size: 16
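
A sketch of how this setup could look in PyTorch (my own sketch; `model`, `train_loader`, and `num_epochs` are placeholders, and `model(batch)` is assumed to return one of the losses described below):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def fit(model, train_loader, num_epochs=1):
    """Fine-tuning loop with the hyperparameters listed above (Adam, lr 2e-5, 10% warm-up)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    num_training_steps = len(train_loader) * num_epochs
    num_warmup_steps = int(0.1 * num_training_steps)     # linear warm-up over 10% of training
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

    for _ in range(num_epochs):
        for batch in train_loader:                        # DataLoader with batch_size=16
            loss = model(batch)                           # assumed: forward pass returns the loss
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```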

Loss Functions

  • Type I loss: Siamese Network + Classification Loss
    • Trained on SNLI and MultiNLI
    • Classification Head: $\text{softmax}(W_t \cdot (u, v, |u - v|))$, where $(u, v, |u - v|)$ is the concatenation of the two embeddings and their element-wise absolute difference
      • $u, v$: pooled embeddings
      • $W_t \in \mathbb{R}^{3n \times k}$ ($n$: embedding dimension, $k$: number of labels)
    • Loss: Cross entropy loss
  • Type II loss: Siamese Network + Regression Loss
    • Trained on training set of the STS benchmark dataset
    • Regression Head: $\text{cosine}(u,v)$
    • Loss: MSE loss
  • Type III loss: Triplet Network + Triplet Loss
    • Trained on Wikipedia Sections Distinction dataset (Section 4.4)
    • Head: pooled embeddings
      • $s_a$: anchor sentence
      • $s_p$: positive example
      • $s_n$: negative example
    • Loss: $\max(\| s_a - s_p \| - \| s_a - s_n \| + \epsilon, 0)$, with $\|\cdot\|$ the Euclidean distance and margin $\epsilon$
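
The three objectives sketched in PyTorch (my paraphrase of the paper, not the reference implementation; variable names follow the notation above, and the default margin is the paper's $\epsilon = 1$):

```python
import torch
import torch.nn.functional as F

def type1_classification_loss(u, v, labels, W_t):
    """Siamese + classification: softmax(W_t (u, v, |u - v|)) with cross-entropy."""
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)   # (batch, 3n)
    logits = features @ W_t                                   # (batch, k) with k NLI labels
    return F.cross_entropy(logits, labels)

def type2_regression_loss(u, v, gold_scores):
    """Siamese + regression: MSE between cosine(u, v) and the gold similarity score."""
    return F.mse_loss(F.cosine_similarity(u, v, dim=-1), gold_scores)

def type3_triplet_loss(s_a, s_p, s_n, eps=1.0):
    """Triplet: anchor should be closer to the positive than to the negative by margin eps."""
    d_pos = torch.norm(s_a - s_p, dim=-1)                     # Euclidean distances
    d_neg = torch.norm(s_a - s_n, dim=-1)
    return torch.clamp(d_pos - d_neg + eps, min=0).mean()
```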

Some Personal Thoughts on Loss Functions

Strictly speaking, only the Type II and Type III losses could be recognized as contrastive losses, since minimizing them

  • Pulls the embeddings/representations of similar sentences together
  • Pushes the embeddings/representations of dissimilar sentences apart

The Type I loss relies on a matrix multiplication between $W_t$ and the embeddings. I don't think this loss is guaranteed to build good representations:

  • At inference, we directly compute the cosine similarity between the two embeddings
  • During training, the linear-probe matrix $W_t$ “helps” before the softmax
  • This may result in a mismatch between the training objective and the inference objective

Results

sbert-result

Spearman rank correlation $\rho$ between the cosine similarity of sentence representations and the gold labels for various Semantic Textual Similarity (STS) tasks.
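
For reference, the metric itself is simple to compute; a small sketch with scipy (assuming `emb1` and `emb2` are the embedding matrices of the two sides of each pair and `gold` the human similarity scores):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb1, emb2, gold):
    """Spearman rank correlation between cosine similarities and gold STS labels."""
    cos = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    return spearmanr(cos, gold).correlation
```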

SimCSE

  • Contrastive Loss / in-batch negatives
  • Unsupervised & Supervised training
  • Results
    • Evaluation on STS tasks
    • Unsupervised: +4.2% SOTA (CT-BERT / DeCLUTR-RoBERTa)
    • Supervised: +2.2% SOTA (CT-SBERT / SRoBERTa-whitening)
  • Huggingface - sup-simcse-roberta-large
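
A short usage sketch with the transformers library; I'm assuming the linked checkpoint is princeton-nlp/sup-simcse-roberta-large and that the pooler output is taken as the sentence embedding, as in the SimCSE repository's examples:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "princeton-nlp/sup-simcse-roberta-large"       # assumed checkpoint behind the link above
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

texts = ["A man is playing guitar.", "Someone plays an instrument."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    emb = model(**inputs).pooler_output               # (2, hidden)
print(torch.cosine_similarity(emb[0], emb[1], dim=0))
```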

Dataset

  • Unsupervised: $10^6$ randomly sampled sentences from English Wikipedia
  • Supervised: 314k examples from MNLI and SNLI datasets

Architecture

  • Backbone: pre-trained BERT or RoBERTa

Training

Contrastive Loss

  • Unsupervised: Cross-entropy loss with in-batch negatives
  • Supervised: Cross-entropy loss with hard negatives + in-batch negatives

simcse-arch

Left: SimCSE with in-batch negatives (see dashed arrow). Right: SimCSE with hard negatives and in-batch negatives (see dashed arrow).

Notations

  • $\tau$: temperature hyperparameter
  • $x_i, x_i^+, x_i^-$: anchor / positive / negative example
  • $h_i$: representation of $x_i$ / $h = f_\theta(x)$
  • $sim(\cdot, \cdot)$: cosine similarity

Unsupervised loss:

\[l = -\log \frac{ e^{sim(h_i, h_i^+) /\tau} }{ \sum_{j=1}^N e^{sim(h_i, h_j^+) /\tau} }\]

Supervised loss:

\[l = -\log \frac{ e^{sim(h_i, h_i^+) /\tau} }{ \sum_{j=1}^N \left( e^{sim(h_i, h_j^+) /\tau} + e^{sim(h_i, h_j^-) /\tau} \right) }\]
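
Both losses written out in PyTorch (my own sketch from the formulas above; `h`, `h_pos`, `h_neg` are already-pooled representations of $x_i$, $x_i^+$, $x_i^-$ for one batch, and $\tau = 0.05$ is the value used in the paper):

```python
import torch
import torch.nn.functional as F

def simcse_unsup_loss(h, h_pos, tau=0.05):
    """Unsupervised SimCSE: in-batch negatives; row i's positive is h_pos[i]."""
    sim = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau   # (N, N)
    labels = torch.arange(h.size(0))                    # positives sit on the diagonal
    return F.cross_entropy(sim, labels)

def simcse_sup_loss(h, h_pos, h_neg, tau=0.05):
    """Supervised SimCSE: entailments as positives, contradictions as hard negatives."""
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / tau  # (N, N)
    logits = torch.cat([sim_pos, sim_neg], dim=1)       # (N, 2N); denominator sums both terms
    labels = torch.arange(h.size(0))
    return F.cross_entropy(logits, labels)
```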

Contrastive Sample

  • Unsupervised Training
    • Anchor / Positive: the same sentence passed through the encoder twice with different dropout masks (data augmentation); see the sketch below
    • Negative: in-batch negatives
  • Supervised Training
    • Anchor / Positive: entailment pairs from the NLI datasets
    • Negative: hard negatives (contradiction pairs) from the datasets + in-batch negatives
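
The dropout “augmentation” is simply two forward passes of the same batch while dropout is active, yielding two slightly different representations per sentence (my own sketch; `encoder` is a hypothetical module returning pooled embeddings):

```python
def dropout_views(encoder, batch):
    """Create two views of the same sentences using nothing but dropout noise."""
    encoder.train()                  # keep dropout active for both passes
    h = encoder(**batch)             # first view  -> anchors   h_i
    h_pos = encoder(**batch)         # second view -> positives h_i^+
    return h, h_pos                  # feed into simcse_unsup_loss above
```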

Results

  • Outperforms SOTA

simcse-result

Sentence embedding performance on STS tasks.

Ablation

Data Augmentation

  • For unsupervised training, the question is how to construct positive pairs (similar sentence pairs)
  • Previous approaches: word deletion, reordering, and substitution
  • SimCSE: standard dropout on intermediate representations

simcse-data-aug

Comparison of data augmentations on STS-B development set (Spearman’s correlation).

Here we can see that data augmentation in NLP is very different from data augmentation in CV:

  • Cropping an image is not likely to change its meaning
  • Word deletion, reordering, and substitution could change meaning of a sentence

Quality of Contrastive Embeddings

simcse-embed-quality

Embedding alignment-uniformity plot. The color of each point reflects STS performance (Spearman's correlation).

OpenAI Text Embedding

Relationship to OpenAI Model Name

  • cpt-text-S/M/L/XL correspond to *-ada-*-001, *-babbage-*-001, *-curie-*-001, *-davinci-*-001
  • text-embedding-ada-002
    • Apart from an OpenAI blog post, there is no public information on how it was trained
    • According to the blog post, text-embedding-ada-002 generally outperforms all first-generation cpt- models on various tasks
    • Dimension = 1536 (between ada-001's dim = 1024 and babbage-001's dim = 2048)
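
For completeness, a minimal usage sketch of text-embedding-ada-002 through the OpenAI Python client (the exact call style depends on the client version; this follows the openai>=1.0 interface):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()                                     # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["A man is playing guitar.", "Someone plays an instrument."],
)
u, v = (np.array(d.embedding) for d in resp.data)     # each vector is 1536-dimensional
print(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```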

Architecture

  • backbone: Transformer encoder
  • Scale: 300M to 175B parameters

ada-arch

Training

  • Data
    • Text search: (text, text) pairs, Internet data
    • Code search: (text, code) pairs, extracted from open source code
  • Loss: contrastive loss with in-batch negatives
  • Initialization: pre-trained GPT (text-* or code-* model)
  • Large batch size is crucial to achieve good performance (per the paper's setup)

Results

  • Tasks: linear-probe classification, sentence similarity, and semantic search
  • Text search
    • MSMARCO passage ranking task
    • 23.4% relative improvement over the previous unsupervised SOTA
  • Code search
    • CodeSearchNet: find most relevant code snippet given a natural language query
    • 20.8% relative improvement over the SOTA
    • No performance improvement on code search when increasing the number of parameters
  • Outperforms SOTA

ada-code

Comparison of cpt-code on code search across 6 programming languages with CodeBERT and GraphCodeBERT.

cpt-text underperforms SimCSE on sentence similarity tasks. Refer to the discussion of the difference between sentence similarity and information retrieval at the beginning of the post.

ada-sts

Ablation

  • Effect of batch size
    • Refer to Table 9 of the paper
    • cpt-text-S trained with a small vs. a large batch size
      • Batch size=1536, MRR@10 = 71.4
      • Batch size=12288, MRR@10 = 84.7
    • A larger batch increases the chances of having hard negatives in a single batch
  • Training Behavior
    • As training steps increase
      • Performance on search and classification tasks increases
      • Performance on sentence similarity tasks decreases
    • May be related to the “sentence similarity is not a well-defined task” issue discussed at the beginning of the post