This post is mainly based on the original GPT-1, GPT-2, and GPT-3 papers.

Overview

GPT is an abbreviation for Generative Pre-trained Transformer. While BERT is a stack of Transformer encoders, GPT is a stack of Transformer decoders.

The architectural difference between BERT and GPT:

  • BERT: Masked Language Model (MLM)
    • A Masked Language Model is trained by predicting masked tokens
    • Cannot naturally perform next word prediction / text generation
  • GPT: Language Model
    • A Language Model is a conditional distribution over the vocabulary: $P(w_{t} \mid w_{t-1}, \dots , w_{t-k})$
    • Can perform next word prediction & text generation (see the sketch after this list)
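To make the "GPT is a language model" point concrete, here is a minimal sketch of greedy next-word generation from a conditional distribution. The tiny `toy_lm` lookup table is an illustrative stand-in for GPT's learned $P(w_t \mid w_{t-1}, \dots , w_{t-k})$; nothing here comes from an actual GPT implementation.

```python
def toy_lm(context):
    """Toy stand-in for P(w_t | w_{t-1}, ..., w_{t-k}), here with k = 2.
    Returns a dict {next word: probability} given the context so far."""
    table = {
        ("<s>",):        {"the": 1.0},
        ("<s>", "the"):  {"cat": 0.7, "dog": 0.3},
        ("the", "cat"):  {"sat": 1.0},
        ("cat", "sat"):  {"on": 1.0},
        ("sat", "on"):   {"the": 1.0},
        ("on", "the"):   {"mat": 0.8, "sofa": 0.2},
        ("the", "mat"):  {".": 1.0},
    }
    return table.get(tuple(context[-2:]), {".": 1.0})

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: repeatedly append the most likely next word.
    A masked LM like BERT has no natural equivalent of this loop."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = toy_lm(tokens)                 # conditional distribution over the vocabulary
        next_word = max(dist, key=dist.get)   # greedy choice (sampling also works)
        tokens.append(next_word)
        if next_word == ".":
            break
    return " ".join(tokens[1:])

print(generate(["<s>"]))   # -> "the cat sat on the mat ."
```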

What makes GPT a game changer is its unique approach toward multi-task learning.

Prior to GPT, ML research focused on the "pre-training + fine-tuning" paradigm. Examples of such models include ResNet and BERT. The problem with this paradigm is that it still requires collecting a task-specific dataset and an ML engineer to fine-tune the model. Removing the "fine-tuning" step would be desirable.

GPT took a different approach: few-shot learning (this is sometimes also referred to as meta-learning or in-context learning; see the footnote on page 4 of the GPT-3 paper for details). The researchers believe that a model can obtain a certain degree of understanding of the world solely through language modeling, i.e., learning the conditional distribution of natural language. The model can thus achieve multi-tasking by leveraging the statistical dependency between the task description and the desired output.

To perform a specific task with GPT, you prompt the model (i.e., establish the statistical dependency) with a special sequence of tokens or purely with natural language (e.g., "[French sentence], translate into English:"). This natural language prompting, combined with the multi-tasking ability, gives GPTs the capacity to interact seamlessly with humans. GPT-3's zero-shot reasoning capacity (e.g., common sense reasoning, doing arithmetic) indicates that it may have some degree of understanding of the world.

Figure: fine-tuning vs. few-shot learning

GPT Size and Capacity

  Model   Params   Layers   Model dimension   Few-shot capacity
  GPT-1   117M     12       768               Limited
  GPT-2   1.5B     48       1600              Lags behind supervised SOTA
  GPT-3   175B     96       12288             Competitive with supervised SOTA

Few-shot Capacity

  • GPT-1
    • Requires a prompt and some engineering on the output
    • Sentiment analysis
    • Multiple choice question answering
  • GPT-2
    • Requires a prompt in a certain format
    • Generally lags behind supervised SOTA
      • Reading Comprehension: $F_1 = 0.55$ (SOTA supervised BERT $F_1 = 0.89$)
      • Summarization: barely outperforms selecting 3 random sentences
      • Translation: 11.5 BLEU on French to English (SOTA unsupervised model 33.5 BLEU)
      • Closed Book Question Answering: 4.1% accuracy (human: 17%, open book systems: 30-50%)
  • GPT-3
    • Surpasses or is competitive with supervised SOTA on a wide range of tasks
    • Can perform tasks that require reasoning
      • Answering questions that require physical or scientific reasoning
      • Performing arithmetic
      • Using novel words in a sentence after seeing them defined only once

GPT-1

The first GPT paper still follows the pre-training + fine-tuning paradigm. Note that the Transformer paper, published in 2017, does not focus on multi-task learning and generalization. The original Transformer is trained specifically for machine translation (although the paper briefly mentions that the trained model can generalize to other tasks).

The authors of GPT-1 focus on multi-task learning / foundation models. Their goal is to demonstrate that SOTA multi-task learning can be achieved by generative pre-training of a language model on a diverse corpus of unlabeled text. GPT-1 outperforms discriminatively trained models that use architectures specifically crafted for each task on 9 of the 12 tasks studied.

Loss function

Language model loss:

\[L_u = \sum_t \log P(w_t \mid w_{t-1}, \dots , w_{t-k}; \theta)\]

Supervised loss:

\[L_s = \sum_i \log P(y_i \mid x_i; \theta)\]

where $x_i = (w_1, \dots , w_n)$ is the $i$-th input sequence of tokens and $y_i$ is its label.

GPT-1 is pre-trained with the language model loss $L_u$ and fine-tuned on the combined loss $L = L_s + \lambda L_u$.
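Below is a rough PyTorch sketch of how this combined objective could be computed during fine-tuning. The tensor shapes, the random stand-in logits, and the $\lambda$ value are illustrative only; a real model would produce `lm_logits` and `cls_logits` from its transformer and classification head.

```python
import torch
import torch.nn.functional as F

B, T, V, C = 4, 16, 1000, 2   # batch, sequence length, vocab size, num classes (illustrative)
lam = 0.5                     # weight of the auxiliary LM loss (illustrative value)

# Stand-ins for what the transformer + task head would actually output:
lm_logits  = torch.randn(B, T, V)        # next-token logits at every position
cls_logits = torch.randn(B, C)           # task logits read off the <e> (Extract) token
tokens     = torch.randint(V, (B, T))    # input token ids (LM targets)
labels     = torch.randint(C, (B,))      # task labels

# L_u: next-token prediction loss; position t predicts token t+1, hence the shift.
L_u = F.cross_entropy(lm_logits[:, :-1].reshape(-1, V), tokens[:, 1:].reshape(-1))

# L_s: supervised loss on the task head.
L_s = F.cross_entropy(cls_logits, labels)

# Combined fine-tuning loss: L = L_s + lambda * L_u
loss = L_s + lam * L_u
print(float(loss))
```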

The authors conducted ablation studies to demonstrate that using the language model loss as an auxiliary loss during fine-tuning helps learning by

  • Improving generalization of the supervised model
  • Accelerating convergence

Task-Specific Input Transformations

In multi-task learning, some tasks do not fit into the above supervised loss framework (e.g., for a similarity task, the input is a pair of sentences). Previous work introduced task-specific architectures to deal with different tasks, which requires a lot of engineering and is inefficient. Instead of changing the architecture, GPT-1 changes the input, i.e., it applies "Task-Specific Input Transformations":

  • All transformations require adding special tokens: <s> for Start, $ for Delim, and <e> for Extract
  • Classification: No change to input
  • Textual entailment: premise $ hypothesis
  • Semantic similarity: average of Text 1 $ Text 2 in both order
  • Multiple choice: Softmax over Context $ Answer i

Figure: GPT-1 task-specific input transformations

The goal of Task-Specific Input Transformations is to make minimal architectural changes during fine-tuning; a rough sketch of the transformations follows.
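Here is a rough sketch of these transformations as plain string manipulation. The token spellings <s>, $, and <e> follow the list above; the helper names and example sentences are my own.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def classification(text):
    return f"{START} {text} {EXTRACT}"

def entailment(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity(text1, text2):
    # Both orderings are encoded; the paper averages the two sequence representations.
    return [
        f"{START} {text1} {DELIM} {text2} {EXTRACT}",
        f"{START} {text2} {DELIM} {text1} {EXTRACT}",
    ]

def multiple_choice(context, answers):
    # One sequence per candidate answer; a linear head + softmax scores the candidates.
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]

print(entailment("A man is sleeping.", "A person is awake."))
# -> "<s> A man is sleeping. $ A person is awake. <e>"
```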

Model

  • Dataset: Book corpus (7000 unpublished books)
  • Vocabulary: Byte Pair Encoding (BPE) with 40,000 merges (a toy merge-learning sketch follows this list)
  • Activation: GELU
  • Regularization
    • Dropout: 0.1 on residual, embedding, and attention layers
    • Modified L2 regularization on all non-bias and non-gain weights
  • Batch size: 64 for training and 32 for fine-tuning
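Since BPE comes up repeatedly in the GPT papers, here is a toy sketch of how BPE merge rules are learned: repeatedly merge the most frequent adjacent symbol pair. GPT-1 learns 40,000 merges on its full corpus; the four-word toy corpus and the `learn_bpe` helper here are purely illustrative (and GPT-2/GPT-3 actually use a byte-level variant).

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: learn merge rules by repeatedly merging the most frequent adjacent pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)   # word -> symbol sequence counts
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():     # apply the merge everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], num_merges=3))
# -> [('l', 'o'), ('lo', 'w'), ('low', '</w>')]
```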

We will skip the results and the 12 tasks studied in this paper since they are pretty outdated.

Zero-shot Behaviors

The authors tried to understand why the above pre-training scheme improves the model's generalization ability. They argue that in order to become a better language model (i.e., minimize the $L_u$ loss), GPT-1 inherently needs to become skillful at some general NLP tasks (in plain language, it needs a decent understanding of the world).

As explained above, GPT-1 itself is a language model: it returns a conditional distribution over the next word given a sequence of tokens. To evaluate the model on general NLP tasks without fine-tuning, the authors had to formulate those tasks as next word prediction problems. Here is the series of "hacks" they devised:

  • Sentiment analysis (SST-2 dataset)
    • Append the token “very” to each example
    • Restrict the language model’s output distribution to 2 tokens, “positive” and “negative”
    • GPT thus produces a binary output
    • Classification = the output with the higher probability (see the sketch after this list)
  • Linguistic acceptability (CoLA dataset)
    • Classification = threshold on the target token’s log-probability assigned by GPT
  • Question answering (RACE dataset)
    • For each choice, concatenate: document, question, answer
    • Use GPT-1 to evaluate the log probability of the entire sequence
    • Choice = the one with the higher log probability
  • Question answering (DPRD dataset / Winograd schemas)
    • Substitute each of the two candidate referents to get 2 sentences
    • Use GPT-1 to evaluate the log probability of the rest of the sentence after the substitution token
    • Choice = the one with the higher log probability
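As an illustration of the sentiment "hack", here is a minimal sketch that appends "very", restricts the next-word distribution to "positive" and "negative", and picks the more probable token. The `next_word_logprobs` function is a hypothetical stand-in for querying GPT-1, not a real API; a real implementation would read the two vocabulary logits from the model.

```python
import math

def next_word_logprobs(prompt, candidates):
    """Hypothetical stand-in for GPT-1: return log P(candidate | prompt) per candidate.
    A real implementation would run the model once and read the two vocabulary logits."""
    # Fixed toy scores so the example is deterministic; a real model's scores depend on `prompt`.
    toy_logits = {"positive": 1.3, "negative": -0.2}
    logits = {c: toy_logits.get(c, 0.0) for c in candidates}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return {c: v - log_z for c, v in logits.items()}

def zero_shot_sentiment(sentence):
    # Append "very" and compare P("positive") vs. P("negative") for the next word.
    prompt = sentence + " very"
    scores = next_word_logprobs(prompt, ["positive", "negative"])
    return max(scores, key=scores.get)

print(zero_shot_sentiment("The movie was a delightful surprise."))   # -> "positive"
```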

Note that these zero-shot setups require heavy output engineering (e.g., extracting log-probabilities) for every task, which is not very useful in practice. Given subsequent developments, we now expect a zero-shot model to understand a natural language prompt and directly output human-interpretable content.

The results demonstrate that as pre-training progresses, the model’s zero-shot performance also increases.

Figure: zero-shot task performance improves over the course of GPT-1 pre-training

The First Prompt?

A prompt is a description of the task the model is supposed to accomplish, embedded in the model’s input.

Below is just my personal guess: the invention of the “prompt learning” strategy may be related to this type of ablation study. To evaluate zero-shot sentiment analysis performance, the authors attached the token “very” to the end of the input sentence. This approach appears rudimentary and looks like an early experimental “prompt”. By today’s standards, we might write a more directed prompt, such as “The previous sentence contains what emotion: positive or negative? A:”, to steer the model toward the sentiment dimension of the next word prediction task.

GPT-2

Unlike its predecessor GPT-1, GPT-2 ditches the “pre-training + fine-tuning” paradigm. The authors do not perform fine-tuning at all in their experiments, and the model is evaluated on zero-shot and few-shot tasks (although they mention in Section 6, Discussion, that they may try fine-tuning). There is little change to the model architecture; the paper mostly focuses on scaling up the model and the dataset.

The authors discuss task conditioning in the data and emphasize that the instruction of which task the model should perform can be embedded in the token sequence itself. Since there is a huge amount of text in “instruction-task” and “question-answer” formats online, the authors hypothesize that the model can achieve multi-task learning by optimizing only the language model loss on a wide range of texts. To test their hypothesis, the authors:

  • Scale up the model
  • Scale up the dataset

Results: improved performance in multi-task and zero-shot / few-shot learning. Since GPT-2’s perplexity on held-out WebText keeps improving steadily as model size grows, the authors believe that GPT-2 is still underfitting the data.

OpenAI did not release the dataset or the model at first, which stirred up a big discussion at the time (see Yannic Kilcher’s review).

Scaling Up the Dataset

What is Task Conditioning?

Task conditioning refers to a special clue provided to the model to indicate which task it should perform in multi-task learning.

  • In single-task learning, a model performs a task by estimating a conditional distribution: $P(output \mid input)$
  • In multi-task learning, a model performs a task by estimating a conditional distribution: $P(output \mid input, task)$

Previously, task conditioning was mostly implemented at the architectural level (e.g., multiple decoder heads, with each group of heads specialized for one task). However, GPT-2’s authors believe that task conditioning can be implemented at the data level. For example:

  • A translation task can be written as the sequence (translate to french, english text, french text)
  • A reading comprehension training example can be written as (answer the question, document, question, answer)

Previous Datasets

The authors point out that most previous work trains language models on a single domain of text (e.g., news articles, Wikipedia, fiction). Given their “task conditioning” and “multi-task learning” hypotheses, they wanted to collect a large and diverse dataset containing natural language demonstrations of tasks in as many domains and contexts as possible.

They present several naturally occurring demonstrations of English-to-French and French-to-English translation found throughout the WebText training set:

  • I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool].
  • “I hate the word ‘perfume,”’ Burr says. ‘It’s somewhat better in French: ‘parfum.’
  • If this sounds like a bit of a stretch, consider this question in French: As-tu aller au cinéma?, or Did you go to the movies?, which literally translates as Have-you to go to movies/theater?

The WebText Dataset

  • Web scrapes like Common Crawl have data quality issues and contain a large number of documents whose content is mostly unintelligible
  • The authors created the WebText dataset, which emphasizes document quality (web pages that have been curated/filtered by humans)
  • Approach
    • Scraped all outbound links from Reddit (a social media platform) that received at least 3 karma
    • Posts with high karma indicate that other users found the link interesting, educational, or just funny
    • Documents are further processed by de-duplication and some heuristic-based cleaning (a toy sketch follows this list)
    • Result: 8 million documents / 40 GB of text
    • All Wikipedia documents were removed due to overlap with evaluation tasks
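A toy sketch of this kind of pipeline is shown below. The 3-karma threshold, the Wikipedia filter, and de-duplication come from the description above; the minimum-length heuristic and all function names are simplified guesses, not the actual WebText code.

```python
import hashlib

def build_webtext_like_corpus(reddit_links, fetch_text, min_karma=3):
    """reddit_links: iterable of (url, karma); fetch_text: url -> document text.
    Returns de-duplicated documents from sufficiently upvoted links."""
    seen_hashes = set()
    corpus = []
    for url, karma in reddit_links:
        if karma < min_karma:          # keep only links other users found worthwhile
            continue
        if "wikipedia.org" in url:     # Wikipedia removed to avoid test-set overlap
            continue
        text = fetch_text(url)
        if len(text.split()) < 128:    # crude quality heuristic (illustrative)
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:      # exact de-duplication by content hash
            continue
        seen_hashes.add(digest)
        corpus.append(text)
    return corpus

# Tiny usage example with made-up links and texts
links = [("https://example.com/a", 5), ("https://example.com/b", 1)]
texts = {u: ("some document text " * 50) for u, _ in links}
print(len(build_webtext_like_corpus(links, texts.get)))   # -> 1 (low-karma link dropped)
```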

Scaling Up the Model

  • Vocabulary: 50,257
  • Context size: 512 -> 1024 tokens
  • Layers: 12 -> 48
  • Model dimension: 768 -> 1600
  • Num of parameters: 117M -> 1.5B
  • Batch size: 64 -> 512
  • Layer normalization was moved to the input of each sub-block, and an additional layer normalization was added after the final self-attention block
  • At initialization, the weights of residual layers are scaled by $1/\sqrt{N}$, where $N$ is the number of residual layers (see the sketch below)
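The last two bullets can be sketched as a simplified pre-norm transformer block in PyTorch. This is not GPT-2's actual code: the attention and MLP sub-blocks use stock `torch.nn` modules, the causal attention mask and the extra final layer norm are omitted, and only the residual-path output projections get the $1/\sqrt{N}$ scaling.

```python
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Simplified GPT-2-style block: LayerNorm is applied to the *input* of each
    sub-block (pre-norm) instead of after it as in the original Transformer."""
    def __init__(self, d_model=1600, n_heads=25, n_layers=48):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Scale the weights of layers feeding the residual stream by 1/sqrt(N),
        # where N is the number of residual layers.
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(1.0 / math.sqrt(n_layers))
            self.mlp[-1].weight.mul_(1.0 / math.sqrt(n_layers))

    def forward(self, x):
        h = self.ln1(x)                                        # pre-norm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]      # residual connection
        x = x + self.mlp(self.ln2(x))                          # pre-norm before MLP
        return x

x = torch.randn(2, 8, 1600)        # (batch, sequence, model dimension)
print(PreNormBlock()(x).shape)     # torch.Size([2, 8, 1600])
```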

Multi-Task & Few-Shot Learning

  • Reading Comprehension
    • Dataset: CoQA dataset
    • Prompt by: [document], [associated conversations], “A:”
    • GPT-2: $F_1 = 0.55$
    • SOTA supervised BERT: $F_1 = 0.89$
  • Summarization
    • Dataset: CNN and Daily Mail dataset
    • Prompt by: “TL;DR: “
    • The first 3 sentences among the 100 generated tokens are used as the summary
    • Barely outperforms selecting 3 random sentences from the article on ROUGE-1, -2, and -L
    • It is possible to prompt the model with natural language instead, but performance drops
  • Translation
    • Prompt by: [example english] = [example french], [target english] = (see the sketch after this list)
    • WMT-14 English-French: 5 BLEU
    • WMT-14 French-English: 11.5 BLEU (SOTA unsupervised model: 33.5 BLEU)
    • Note that non-English webpages were intentionally removed from the pre-training dataset
    • The training data contains only about 10MB of French text (approximately 500x smaller than the monolingual French corpora commonly used in prior research)
  • Closed Book Question Answering
    • Dataset: Natural Questions (evaluation metric: exact match)
    • Goal: test what knowledge is contained within the model
    • Since the model has no access to external documents, answers rely only on information stored in its parameters
    • GPT-2: 4.1% accuracy (Human: 17%, open book system: 30-50%)
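To make the translation conditioning concrete, here is a sketch of how such a prompt could be assembled; only the "[english] = [french]" format comes from the paper, and the example sentence pairs are made up.

```python
def translation_prompt(example_pairs, target_english):
    """Build a GPT-2-style translation prompt of the form
    'english sentence = french sentence' per line, ending with 'target english ='.
    The model is then expected to continue with the French translation."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{target_english} =")
    return "\n".join(lines)

examples = [
    ("good morning", "bonjour"),                  # made-up demonstration pairs
    ("thank you very much", "merci beaucoup"),
]
print(translation_prompt(examples, "see you tomorrow"))
# good morning = bonjour
# thank you very much = merci beaucoup
# see you tomorrow =
```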

Discussions

Text Generation

GPT-2 is able to generate surprisingly coherent, human-like passages of text when conditioned on a prompt.

Generalization vs Memorization

  • The WebText training set has only minor overlap with the evaluation datasets
    • Target datasets: PTB, WikiText-2, enwik8, text8, WikiText-103, 1BW
    • 8-gram overlap
      • WebText train vs. target dataset test: 0.88%-3.94% (average: 3.2%)
      • Target dataset train vs. target dataset test: 0.66%-13.19% (average: 5.9%)
  • Increasing model capacity => log-linear improvement in zero-shot performance
  • GPT-2 is still underfitting the data (see Figure below)

Figure: GPT-2 perplexity on WebText vs. model size

GPT-3

The authors hypothesize that higher few-shot performance can be achieved by further scaling up the model, because:

  • Increased model size brought improvements in text synthesis and/or downstream NLP tasks
  • Evidence suggests that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale
  • Notable models
    • GPT-1: 100M parameters
    • BERT: 300M parameters
    • GPT-2: 1.5B parameters
    • Megatron-LM: 8B parameters
    • T5: 11B parameters
    • T-NLG: 17B parameters

GPT-3 further scales up the GPT-2 model. The results are:

  • While GPT-2 significantly underperforms supervised models, GPT-3 sometimes even reaches competitiveness with prior state-of-the-art fine-tuning approaches
  • GPT-3 can perform tasks that require some degree of reasoning, for example: unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic
  • Qualitative evaluations find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans

However, there are still tasks where GPT-3 has poor few-shot performance:

  • Natural language inference: ANLI dataset
  • Reading comprehension: RACE dataset, QuAC dataset

Due to the length of the paper (40 pages of main text, plus a 23-page appendix), we will only highlight the important results in this post.

Model Architecture

  • Similar to GPT-2 (modified initialization, pre-normalization, and reversible tokenization)
  • Efficient attention implementation: alternating dense and locally banded sparse attention patterns
  • Context size: 1024 -> 2048 tokens
  • Layers: 48 -> 96
  • Model dimension: 1600 -> 12288
  • Num of parameters: 1.5B -> 175B
  • Batch size: 512 sequences -> 3.2M tokens
  • Optimization
    • Larger models can typically use a larger batch size, but require a smaller learning rate
    • Measure the gradient noise scale during training for choice of batch size
    • Clip the global norm of the gradient at 1.0 (a clipping sketch follows this list)
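The clipping step in the last bullet looks roughly like the following in PyTorch; the tiny linear model, the dummy loss, and the learning rate are placeholders, and the gradient-noise-scale measurement itself is not shown.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 128)                                 # placeholder for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5)   # illustrative learning rate

x = torch.randn(32, 128)
loss = model(x).pow(2).mean()                               # dummy loss for the sketch
loss.backward()

# Clip the global L2 norm of all gradients to 1.0 before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```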

Dataset

  • Common Crawl dataset: filtered + fuzzy deduplication (410B tokens)
  • Curated high-quality datasets
    • Expanded version of the WebText dataset (19B tokens)
    • Books1 and Books2 corpora (67B tokens)
    • English-language Wikipedia (3B tokens)
  • Test-set overlap and memorization issues are discussed in paper Section 2.2 and Section 4

Few-Shot Learning

Few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

Larger models => more efficient use of in-context information

Figure: in-context learning curves on the random symbol insertion/removal task, by model size

The steeper “in-context learning curves” for large models demonstrate improved ability to learn a task from contextual information

Task: removing random symbols from the word (paper Section 3.9.2)

  • A random punctuation or space character is inserted between each letter of a word
  • The model must output the original word
  • Example: s.u!c/c!e.s s i/o/n = succession
  • Evaluated with and without a natural language task description (see the sketch below)
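Below is a sketch of how such a few-shot prompt might be laid out, with and without a natural language description. The description wording, the demonstration words, and the `scramble` heuristic are my own; only the task format is from the paper.

```python
import random

def scramble(word, seed=0):
    """Insert a random punctuation or space character between each letter of a word."""
    rng = random.Random(seed)
    noise = ".,!/- "
    return "".join(c + rng.choice(noise) for c in word[:-1]) + word[-1]

def few_shot_prompt(examples, query, description=None):
    """Lay out K demonstrations of 'scrambled = original', optionally preceded by a
    natural language task description, and end with the scrambled query to solve."""
    lines = [description] if description else []
    lines += [f"{scramble(w, i)} = {w}" for i, w in enumerate(examples)]
    lines.append(f"{scramble(query, 99)} =")
    return "\n".join(lines)

demo_words = ["succession", "apple", "language"]
print(few_shot_prompt(demo_words, "transformer",
                      description="Remove the extra symbols to recover the original word."))
```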

Language Modeling

  • PTB:
    • Previous SOTA perplexity: 35.8
    • GPT-3 perplexity: 20.5
  • LAMBADA (long range dependencies in text)
    • Previous SOTA accuracy: 68.0
    • GPT-3 zero-shot accuracy: 76.2
    • GPT-3 few-shot accuracy: 86.4
  • Datasets where GPT-3 is below SOTA
    • HellaSwag (picking the best ending to a story or set of instructions)
    • StoryCloze (selecting the correct ending sentence for five-sentence stories)

Closed Book Question Answering

  • TriviaQA: GPT-3 few-shot outperforms the fine-tuned SOTA
  • WebQuestions: GPT-3 is competitive to SOTA
  • Natural Questions: GPT-3 underperforms SOTA

Figure: GPT-3 performance on TriviaQA vs. model size

GPT-3’s performance grows smoothly with model size on TriviaQA

Translation

  • Measured by BLEU
  • Translation into English: GPT-3 outperforms or is competitive with the fine-tuned supervised SOTA
  • Translation from English: GPT-3 significantly lags behind the fine-tuned supervised SOTA
  • Details: paper Section 3.3

Figure: GPT-3 translation performance (BLEU) vs. model size

GPT-3’s performance grows smoothly with model size on machine translation tasks

Winograd-Style Tasks

  • NLP tasks that involve determining which word a pronoun refers to
  • The pronoun is grammatically ambiguous but semantically unambiguous to a human
  • Winograd: GPT-3 few-shot is competitive with the fine-tuned SOTA
  • Winogrande (XL): GPT-3 few-shot lags behind the fine-tuned SOTA

Common Sense Reasoning

  • Tasks in physical or scientific reasoning
  • Distinct from sentence completion, reading comprehension, or broad knowledge question answering
  • PhysicalQA (PIQA)
    • Common sense questions about how the physical world works, intended as a probe of grounded understanding of the world
    • GPT-3 outperforms the fine-tuned SOTA
  • ARC
    • Multiple-choice questions collected from 3rd to 9th grade science exams
    • GPT-3 lags behind the fine-tuned SOTA

Reading Comprehension

  • GPT-3 generally underperforms the fine-tuned SOTA
  • GPT-3 performs best (within 3 points of the human baseline) on CoQA, a free-form conversational dataset
  • GPT-3 performs worst on QuAC, a dataset which requires modeling structured dialog acts

Natural Language Inference

  • Tasks that test the ability to understand the relationship between two sentences
  • GPT-3 largely underperforms the supervised SOTA on RTE and ANLI

Synthetic and Qualitative Tasks

  • 3-digit arithmetic: the largest models suddenly generalize
  • Word Scrambling: larger models lead to a smooth performance increase
  • SAT analogies (college entrance exam questions): larger models improve performance (>60% accuracy)
  • News Article Generation: human evaluators are unable to reliably identify model-generated articles (statistically, the null hypothesis cannot be rejected)
  • Learning and Using Novel Words: the model’s usage appears correct, or at least plausible
  • Correcting English Grammar
    • Few-shot prompt: “Poor English Input: [sentence]\n Good English Output: [sentence]”