This post is mainly based on the papers behind the models summarized in the table below, with a focus on LaMDA, PaLM, LLaMA, and Alpaca.

LLM     Developer  Scale  Year  Open Source
GPT-3   OpenAI     175B   2020  No
LaMDA   Google     137B   2022  No
PaLM    Google     540B   2022  No
LLaMA   Meta       65B    2023  Non-commercial
Alpaca  Stanford   7B     2023  Non-commercial

LaMDA

  • 137B parameters
  • Specialized for dialog: pre-trained on 1.56T words of public dialog data and web text
  • Goal: improve safety & factual grounding
    • Safety: fine-tuning with annotated data
    • Factual grounding: enabling the model to consult external knowledge sources
  • Dialog evaluation metrics: sensibleness, specificity, and interestingness (SSI)

Dataset

  • 2.97B documents, 1.12B dialogs, 13.39B dialog utterances. Total: 1.56T words

Architecture

  • Activation: gated-GELU
  • Attention: relative attention as described in T5
  • Vocabulary: 32K tokens, from SentencePiece

Evaluation

  • SSI
    • Sensibleness: measures whether a model’s responses make sense in context and do not contradict anything that was said earlier
    • Specificity: measures whether a response is specific to the given context
    • Interestingness: measures whether the response would “catch someone’s attention” or “arouse their curiosity”
    • Scored by human raters
  • Safety: see Appendix A.1
  • Groundedness
    • Groundedness: percentage of responses containing claims about the external world that can be supported by authoritative external sources, as a share of all responses containing such claims
    • Informativeness: percentage of responses that carry information about the external world that can be supported by known sources, as a share of all responses
    • Citation accuracy: percentage of model responses that cite the URLs of their sources, as a share of all responses with explicit claims about the external world (the sketch after this list makes the three denominators explicit)
  • Role-specific metrics
    • Helpfulness: responses that are informative and that the user also judges to be correct and useful (a subset of informative responses)
    • Role consistency: consistency with the definition of the agent’s role external to the conversation
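The groundedness-related metrics above differ mainly in what goes into the denominator. A minimal sketch, assuming each human-rated response carries three boolean labels (`has_claim`, `supported`, `cites_source`); the field and function names are illustrative, not from the LaMDA paper:

```python
def groundedness_metrics(responses: list[dict]) -> dict:
    """responses: [{"has_claim": bool, "supported": bool, "cites_source": bool}, ...]"""
    total = len(responses)
    with_claims = [r for r in responses if r["has_claim"]]
    supported = [r for r in with_claims if r["supported"]]
    return {
        # supported claims / responses that contain external-world claims
        "groundedness": len(supported) / max(1, len(with_claims)),
        # supported claims / all responses
        "informativeness": len(supported) / max(1, total),
        # responses citing their source URLs / responses with external-world claims
        "citation_accuracy": sum(r["cites_source"] for r in with_claims) / max(1, len(with_claims)),
    }
```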

PaLM

Dataset

  • Trained on a single pass over a 780B-token corpus
[Figure: PaLM training data]

Architecture

  • SwiGLU Activation: $(Swish(xW) \cdot xV )$ for the MLP intermediate activations
  • Parallel Layers
    • Standard: $y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))$
    • Parallel: $y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))$
    • ~15% faster training at large scale; small quality degradation at small model sizes (see the sketch after this list)
  • Multi-Query Attention: [Fast transformer decoding: One write-head is all you need]
  • RoPE Embeddings: [Roformer: Enhanced transformer with rotary position embedding]
  • Shared Input-Output Embeddings: the input embedding matrix and the output softmax weights are tied
  • No Biases: no biases in any of the dense kernels or layer norms, which improves training stability for large models
  • Vocabulary: 256K-token SentencePiece vocabulary [SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing]
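A minimal PyTorch sketch of the SwiGLU feed-forward and the parallel block formulation above. Module names, dimensions, and the use of `nn.MultiheadAttention`/`nn.LayerNorm` are illustrative assumptions; multi-query attention, RoPE, and the causal mask are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """MLP with Swish(xW) * xV as the intermediate activation; no biases, per PaLM."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.W(x)) * self.V(x))  # silu == Swish

class ParallelBlock(nn.Module):
    """y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)): attention and MLP
    read the same normalized input instead of being applied sequentially."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.mlp = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted
        return x + attn_out + self.mlp(h)
```

Because the attention and MLP input projections consume the same LayerNorm output, their matrix multiplications can be fused, which is where the ~15% training speedup comes from.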

Optimization

  • Weight initialization
    • Kernel weights: $W \sim N(0, 1/\sqrt{n_{in}})$, where $n_{in}$ is the input dimension (fan-in variance scaling)
    • Embedding weights: $E \sim N(0,1)$
  • Optimizer: Adafactor
  • Hyperparameters
    • Lr: $10^{-2}$ for the first 10,000 steps, then decayed as $1/\sqrt{k}$, where $k$ is the step number (sketched after this list)
    • Second-moment decay $\beta_2 = 1 - k^{-0.8}$ instead of a constant, which gives better estimates for rare embedding tokens
    • Gradient clipping: 1.0
  • Loss: LM loss + auxiliary z-loss ($10^{-4} \cdot \log^2 Z$, where $Z$ is the softmax normalizer) for stability
  • Sequence length: 2048
  • Batch size: 512 -> 1024 -> 2048
  • Dropout: 0.1
  • Determinism: training is fully reproducible via JAX + deterministic dataloader
  • Instability
    • ~20 loss spikes, despite gradient clipping
    • Sometimes happening late into training
    • Not observed when training the smaller models
    • Mitigation: re-start training ~100 steps before spike + skip ~200-500 data batches
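A short sketch of the learning-rate rule and the step-dependent second-moment decay described above; the function names are mine, and the exact $\beta_2$ exponent should be treated as an assumption:

```python
import math

def learning_rate(step: int, constant_steps: int = 10_000) -> float:
    """1e-2 for the first 10k steps, then 1/sqrt(k) with k = step number.
    The two pieces meet at k = 10,000, where 1/sqrt(k) = 1e-2."""
    return 1e-2 if step <= constant_steps else 1.0 / math.sqrt(step)

def beta2(step: int) -> float:
    """Step-dependent second-moment decay (assumed 1 - k^-0.8): a shorter
    averaging window early in training, which helps rarely seen embedding rows."""
    return 1.0 - step ** -0.8
```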

Memorization

  • Evaluation
    • Prompted the model with the first 50 tokens of a training span, then continued with greedy decoding
    • Measured how often the model produced a 50-token continuation that exactly matches the training example (sketched below)
  • The model exactly reproduces the continuation for ~2.4% of the evaluated training spans
  • Examples seen more than 500 times have a memorization rate of over 40%
  • Breakdown by corpus: code has the highest memorization rate (15%)
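A minimal sketch of the memorization check, assuming a HuggingFace-style `model` with a `generate` method; the helper name is illustrative:

```python
import torch

def is_memorized(model, prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> bool:
    """Prompt with the first 50 tokens of a training span, greedy-decode the next
    50 tokens, and check for an exact match with the true continuation."""
    with torch.no_grad():
        out = model.generate(
            prompt_ids.unsqueeze(0),            # (1, 50) prompt
            max_new_tokens=target_ids.numel(),  # 50-token continuation
            do_sample=False,                    # greedy decoding
        )
    continuation = out[0, prompt_ids.numel():]
    return torch.equal(continuation, target_ids)

# memorization rate = fraction of sampled training spans for which is_memorized(...) is True
```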

LLaMA

  • Possible to train SOTA models using publicly available datasets exclusively
  • LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B
  • LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10× smaller, and can run on a single GPU

Dataset

  • Preprocessing
    • CommonCrawl 2017-2020: preprocessed with the CCNet pipeline, which deduplicates the data at the line level
    • C4: deduplication and language identification
    • GitHub: public dataset from Google BigQuery
    • ArXiv: removed everything before the first section, as well as the bibliography, comments, and macros
    • Stack Exchange: kept the 28 largest websites, removed the HTML tags, and sorted answers by score
  • ~1.4T tokens after tokenization
  • For most training data, each token is used only once during training
  • For Wikipedia and Books domains, each token is used twice

[Figure: LLaMA training data]

Architecture

  • Pre-normalization
    • RMSNorm normalization function
    • Improves training stability
    • Normalizes the input of each transformer sub-layer instead of the output (see the sketch after this list)
  • Activation: SwiGLU
  • Rotary Embeddings
  • Tokenizer: BPE implementation from SentencePiece
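A small PyTorch sketch of RMSNorm and the pre-normalization placement described above (only the sub-layer input is normalized; the residual path is untouched). Shapes and names are illustrative:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by 1/RMS(x): no mean subtraction and no bias, unlike LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * inv_rms

def pre_norm_sublayer(x, norm, sublayer):
    """Pre-normalization: normalize the sub-layer *input*, then add the residual."""
    return x + sublayer(norm(x))
```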

Optimization

  • Optimizer: AdamW
  • Lr: cosine learning rate schedule, final LR = 10% of max LR (sketched after this list)
  • Gradient clipping: 1.0
  • Warmup: 2000 steps
  • Implementation: efficient causal multi-head attention from the xformers library
    • Does not store the attention weights or compute the scores in the masked part of the attention matrix
    • Manually reimplemented the backward function to save the activations that are expensive to compute, instead of recomputing them
  • Compute time for 65B model
    • ~380 tokens/sec/GPU on 2048 A100-80GB
    • 1.4T tokens takes approximately 21 days
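A sketch of the cosine schedule with warmup, plus a back-of-the-envelope check of the ~21-day figure (function and variable names are mine):

```python
import math

def cosine_lr(step: int, max_lr: float, total_steps: int, warmup_steps: int = 2000) -> float:
    """Linear warmup for 2000 steps, then cosine decay down to 10% of max_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = 0.1 * max_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Back-of-the-envelope check of the training time for the 65B model:
tokens_per_sec = 380 * 2048               # ~380 tokens/sec/GPU on 2048 A100-80GB
days = 1.4e12 / tokens_per_sec / 86_400
print(f"{days:.1f} days")                 # ~20.8 days, i.e. "approximately 21 days"
```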

Evaluation

  • Evaluated on 6 categories of common benchmarks: common sense reasoning, closed-book QA, reading comprehension, mathematical reasoning, code generation, and MMLU
  • Instruction Finetuning
    • Small improvement on MMLU (5-shot)
    • LLaMA-I 68.9% vs code-davinci-002 77.4%
    • Warning: the paper’s MMLU (5-shot) result differs from the Open LLM Leaderboard result

Alpaca

  • Fine-tuned from LLaMA-7B
  • Supervised learning on 52K instruction-following demonstrations generated from text-davinci-003

Data

  • Generated instruction-following demonstrations based on the self-instruct method
  • Process
    • Step 1: start from 175 human-written instruction-output pairs (the self-instruct seed set)
    • Step 2: prompt text-davinci-003 to generate more instructions, using the seed set as in-context examples (a hedged sketch follows this list)
  • Simplified version of the self-instruct generation pipeline (code on GitHub)
  • Results: 52K unique instructions and corresponding outputs (cost <$500 using the OpenAI API)
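A heavily hedged sketch of the generation step: sample a few seed pairs as in-context examples and ask text-davinci-003 for new instructions. The prompt wording, example count, and the legacy `openai.Completion` call are assumptions for illustration, not the actual Alpaca pipeline:

```python
import random
import openai  # legacy openai<1.0 Completion API, contemporaneous with text-davinci-003

def build_prompt(seed_tasks: list[dict], n_examples: int = 3) -> str:
    """Format a few human-written seed tasks as in-context examples."""
    examples = random.sample(seed_tasks, n_examples)
    lines = ["Come up with new instruction/output pairs in the same style:\n"]
    for i, task in enumerate(examples, 1):
        lines.append(f"{i}. Instruction: {task['instruction']}\n   Output: {task['output']}\n")
    lines.append(f"{n_examples + 1}. Instruction:")
    return "\n".join(lines)

def generate_batch(seed_tasks: list[dict]) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=build_prompt(seed_tasks),
        max_tokens=512,
        temperature=1.0,
    )
    return resp["choices"][0]["text"]  # parsed and deduplicated downstream
```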

Optimization

  • Techniques
    • Fully Sharded Data Parallel
    • Mixed precision training
  • Fine-tuning
    • Time: 3 hours on 8 A100-80GB
    • Cost: <$100
  • Hyperparameter
    • Batch size: 128
    • Lr: 2e-5
    • Epochs: 3
    • Max length: 512
    • Weight decay: 0
  • Addressing OOM
    • Naively, fine-tuning a 7B model requires about 7 × 4 × 4 = 112 GB of VRAM (weights, gradients, and two Adam moment buffers at 4 bytes per parameter; the arithmetic is spelled out after this list)
    • Parameter sharding: no redundant model copy is stored on any GPU
    • Turn on CPU offload for FSDP
    • LoRA: the base weights are frozen (no gradients or optimizer states for them), reducing the total memory footprint from 112 GB to about 7 × 4 = 28 GB
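The arithmetic behind the two footprints, assuming fp32 weights, gradients, and two Adam moment buffers, and ignoring activations and framework overhead:

```python
params = 7e9          # LLaMA-7B
bytes_per_value = 4   # fp32

# Full fine-tuning: weights + gradients + Adam first/second moments = 4 copies
full_ft_gb = params * bytes_per_value * 4 / 1e9
print(f"full fine-tuning: ~{full_ft_gb:.0f} GB")  # ~112 GB

# LoRA: only the frozen base weights are resident; adapter parameters and
# their optimizer states are comparatively negligible
lora_gb = params * bytes_per_value / 1e9
print(f"LoRA: ~{lora_gb:.0f} GB")                 # ~28 GB
```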

Evaluation

  • Human evaluation on the self-instruct evaluation set
  • Blind pairwise comparison: Alpaca wins 90 vs 89 comparisons against text-davinci-003
  • Alpaca’s answers are typically shorter than ChatGPT’s, reflecting text-davinci-003’s shorter outputs
  • Deficiencies: hallucination, toxicity, and stereotypes
  • Can be used to generate well-written outputs that spread misinformation