This post is mainly based on

Model scale:

  • GPT-3 (OpenAI): 175B
  • GPT-Neo (EleutherAI): 2.7B
  • GPT-NeoX (EleutherAI): 20B
  • OPT (Meta): 175B (research-only) / 66B (open source)
  • BLOOM: 176B
  • StableLM: 3-7B (open source) / 15-66B (pending)

GPT-NeoX

  • GPT-Neo Repo
  • GPT-NeoX Repo
  • Dense transformer
  • Hardware
    • Training: 12x clusters, each with 8x A100-40G
    • Fine-tuning: 4x A100-40G
    • Inference: 2x 3090 Ti or 1x A6000-48G (see the loading sketch after this list)
  • Evaluation
    • Competitive with similarly sized GPT-3 models on language modeling benchmarks
    • Better arithmetic performance
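
As a rough companion to the inference hardware above, here is a minimal sketch of loading GPT-NeoX-20B through Hugging Face transformers; the prompt and generation settings are illustrative, not from the original evaluation.

```python
# Minimal GPT-NeoX-20B inference sketch (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,  # FP16 weights take roughly 40 GB
    device_map="auto",          # shard across available GPUs (needs `accelerate`)
)

# Arithmetic prompt, echoing the benchmark note above.
inputs = tokenizer("Q: What is 12 * 34?\nA:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```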

OPT

OPT (Open Pre-trained Transformer) is an open-source approximation of GPT-3. The paper and repo give a detailed account of the training process, including compute requirements, human overhead, and ad-hoc optimization decisions.

  • “Open source”
    • Small models (125M - 66B) are released
    • 175B model is for non-commercial / research only
  • Detailed description of model training
    • Dataset / data cleaning
    • Optimization challenges (e.g., hardware failures, loss divergence)
  • Hardware intensive
    • Pre-training: 992 80GB A100 GPUs
    • Fine-tuning / deployment: 16 V100 GPUs
  • Repo and training log

Dataset

  • RoBERTa dataset
    • BookCorpus
    • Stories
    • CCNews v2
  • The Pile dataset
    • CommonCrawl
    • DM Mathematics: synthetically generated mathematics problems (repo)
    • Project Gutenberg: ebooks
    • HackerNews
    • OpenSubtitles: database of movie and TV subtitles
    • OpenWebText2: enhanced version of the original OpenWebTextCorpus scraped from Reddit
    • USPTO: US patent and trademark research datasets
    • Wikipedia
  • PushShift.io Reddit dataset
    • Pushshift.io corpus: historical Reddit data
  • Processing
    • Deduplication: MinHash LSH, removing documents with Jaccard similarity > 0.95 (see the sketch after this list)
    • Ad-hoc whitespace normalization
    • Reddit: convert the conversation trees into documents, keeping only the longest chain of comments in each thread
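
The deduplication step can be sketched with the `datasketch` library; the word 5-gram shingling, `num_perm` value, and toy corpus below are assumptions for illustration, not the exact OPT pipeline.

```python
# MinHash-LSH deduplication sketch at a 0.95 Jaccard threshold.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """Hash a document's word 5-gram shingles into a MinHash signature."""
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

corpus = [  # toy corpus; "b" is an exact duplicate of "a"
    ("a", "the quick brown fox jumps over the lazy dog again and again"),
    ("b", "the quick brown fox jumps over the lazy dog again and again"),
    ("c", "an entirely different document about language model training data"),
]

lsh = MinHashLSH(threshold=0.95, num_perm=128)
kept = []
for doc_id, text in corpus:
    sig = signature(text)
    if not lsh.query(sig):   # no near-duplicate indexed so far
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
print(kept)  # ["a", "c"]: the duplicate "b" is dropped
```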

Model

Largely based on the GPT-3 architecture (see the config sketch after this list):

  • Tokenizer: GPT-2 byte-level BPE
  • Dropout: 0.1 (no dropout on embeddings)
  • Activation: ReLU
  • Sequence length: 2048 tokens
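
These settings can be expressed with the Hugging Face `OPTConfig`; the width, depth, and head count below are 125M-scale values for illustration only (see the architecture table that follows for the actual per-model values).

```python
# GPT-3-style OPT configuration sketch (125M-scale dimensions, illustrative).
from transformers import OPTConfig, OPTForCausalLM

config = OPTConfig(
    vocab_size=50272,              # GPT-2 byte-level BPE vocabulary
    max_position_embeddings=2048,  # sequence length: 2048 tokens
    hidden_size=768,               # model width (illustrative)
    ffn_dim=3072,                  # feed-forward width, 4x hidden size
    num_hidden_layers=12,          # #L (illustrative)
    num_attention_heads=12,        # #H (illustrative)
    activation_function="relu",    # ReLU activation
    dropout=0.1,                   # 0.1 dropout
)
model = OPTForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # ~125M parameters
```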

[Figure: OPT model architecture table. #L: number of layers, #H: number of attention heads]

Optimization

  • Weight initialization
    • Follow Megatron-LM
    • Weights $\sim \mathcal{N}(0, 0.006)$ (standard deviation 0.006)
    • Output layers: standard deviation scaled by $1.0/\sqrt{2L}$, where $L$ is the number of layers
  • Bias initialization: 0
  • Optimizer
    • AdamW(0.9, 0.95)
    • Weight decay of 0.1
  • Learning rate
    • OPT-175B: warm up from 0 to the maximum LR over the first 2000 steps
    • Smaller models: warm up from 0 to the maximum LR over the first 375M tokens
    • Decay to 10% of the maximum LR over 300B tokens
  • Clip gradient norms at 1.0
  • Gradient pre-divide factor
    • Reduces the risk of over/underflow
    • Splits one division by $N$ into two divisions by $\sqrt{N}$ (illustrated in the sketch after this list)
  • Hardware
    • OPT-175B is trained on 992 80GB A100 GPUs
    • Adam state: FP32 (affordable because it is sharded across all hosts)
    • Model parameters: FP16
    • Dynamic loss scaling from Mixed Precision Training
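
Putting the recipe above together, here is a minimal PyTorch sketch of the initialization, AdamW settings, warmup/decay schedule, gradient clipping, and the $\sqrt{N}$ gradient pre-divide; the toy model, peak LR, and step counts are illustrative assumptions.

```python
import math
import torch
from torch import nn

L = 12             # number of layers (illustrative)
N = 992            # number of data-parallel workers (992 GPUs for OPT-175B)
peak_lr = 1.2e-4   # illustrative; OPT's peak LR depends on model size
warmup_steps, total_steps = 2000, 100_000  # illustrative step counts

def init_weights(module: nn.Module) -> None:
    """N(0, 0.006) init; output projections scaled by 1/sqrt(2L); zero bias."""
    if isinstance(module, nn.Linear):
        std = 0.006
        if getattr(module, "is_output_proj", False):
            std /= math.sqrt(2 * L)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))
model[2].is_output_proj = True  # mark the output projection (illustrative tag)
model.apply(init_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to 10% of the peak."""
    if step < warmup_steps:
        return step / warmup_steps
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max(1.0 - 0.9 * frac, 0.1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def pre_divided_average(grad: torch.Tensor) -> torch.Tensor:
    """Average over N workers as two divisions by sqrt(N): dividing before
    and after the all-reduce keeps intermediates in a safer numeric range
    than a single division by N (less overflow before, less underflow after)."""
    grad = grad / math.sqrt(N)
    # torch.distributed.all_reduce(grad)  # sum across workers (omitted here)
    return grad / math.sqrt(N)

# One illustrative step: forward, backward, clip at 1.0, update, schedule.
loss = model(torch.randn(4, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
scheduler.step()
```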

Training Processes

  • Training took over two months
  • Hardware Failures
    • At least 35 manual restarts and 70+ automatic restarts
    • During manual restarts, a series of diagnostic tests was run to detect problematic nodes
    • Flagged nodes were taken offline and training was resumed from the last saved checkpoint
    • Over 100 hosts were cycled out
  • Loss Divergences
    • Lowering the learning rate and restarting from an earlier checkpoint
    • Correlation between
      • Loss divergence
      • Dynamic loss scalar crashing to 0
      • L2-norm of the activations of the final layer spiking
    • Hence, the restart checkpoint is chosen such that the dynamic loss scalar was still in a “healthy” state (> 1.0); see the sketch after this list
  • Reducing Loss Divergences
    • Lowering gradient clipping from 1.0 to 0.3
    • Switching to vanilla SGD (reverting to AdamW after optimization plateaued)
    • Resetting the dynamic loss scalar
    • Switching to a newer version of Megatron-LM, a library for distributed transformer training (repo)
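
The divergence-recovery policy can be summarized in a short, self-contained sketch; the `Checkpoint` record and the halving of the learning rate are illustrative assumptions standing in for the paper's manual process.

```python
# Restart-from-healthy-checkpoint sketch for loss divergences.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int
    loss_scalar: float  # dynamic loss scale recorded at save time
    lr: float

def pick_restart(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Newest checkpoint saved while the loss scalar was still 'healthy' (> 1.0)."""
    healthy = [c for c in checkpoints if c.loss_scalar > 1.0]
    if not healthy:
        raise RuntimeError("no healthy checkpoint to restart from")
    return max(healthy, key=lambda c: c.step)

def handle_divergence(checkpoints: list[Checkpoint],
                      lr_factor: float = 0.5) -> Checkpoint:
    """On divergence (loss scalar crashing to 0, activation-norm spike),
    resume from a healthy checkpoint with a lowered learning rate."""
    ckpt = pick_restart(checkpoints)
    return Checkpoint(ckpt.step, ckpt.loss_scalar, ckpt.lr * lr_factor)

# Example: the scalar crashed by step 3000; resume from step 1000 at half LR.
history = [Checkpoint(1000, 4096.0, 1e-4),
           Checkpoint(2000, 0.5, 1e-4),
           Checkpoint(3000, 0.0, 1e-4)]
print(handle_divergence(history))  # Checkpoint(step=1000, ..., lr=5e-05)
```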

Evaluations

  • Multi-tasking: via prompting, following GPT-3 (see the prompt sketch after this list)
  • Aimed to re-implement the GPT-3 evaluation setting
  • Model scale vs. average zero-shot and multi-shot performance closely tracks GPT-3
  • Dialogue datasets: perplexity is competitive with the open-source supervised SOTA
  • Hate speech detection: outperforms GPT-3 Davinci
  • Intrasentence-level biases: competitive with GPT-3 Davinci
  • Stereotypical bias: competitive with GPT-3 Davinci
  • Responses to toxic language prompts
    • Higher toxicity rate than PaLM or Davinci
    • All three models become more likely to generate toxic continuations as prompt toxicity increases
  • Dialogue Safety
    • SaferDialogues: Ability to recover from explicit safety failures (graceful vs defensive response)
    • Safety Bench Unit Tests: generating offensive content, responding inappropriately to offensive content, and responding inappropriately in safety-critical situations
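
“Multi-tasking via prompting” means tasks are cast as text completion, either zero-shot or with in-context examples; the sentiment template below is an illustrative assumption in the style of GPT-3's evaluation, not OPT's exact harness.

```python
# Zero-shot vs. few-shot prompting sketch for a sentiment task (illustrative).
zero_shot = "Review: The movie was wonderful.\nSentiment:"

few_shot = (
    "Review: I hated every minute.\nSentiment: negative\n\n"
    "Review: A delightful surprise.\nSentiment: positive\n\n"
    "Review: The movie was wonderful.\nSentiment:"
)

# With any causal LM (e.g., the GPT-NeoX loading sketch earlier):
# inputs = tokenizer(few_shot, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=1)
print(zero_shot)
print(few_shot)
```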

Limitations

  • Following instructions
    • OPT-175B does not handle declarative instructions well
    • The model tends to generate a dialogue that begins with such an instruction rather than executing it
    • Possible solution: fine-tuning similar to InstructGPT
  • Repetitive behavior
    • OPT-175B tends to be repetitive and can easily get stuck in a loop
    • Possible solution: unlikelihood training or best-first decoding
  • Factually incorrect statements
    • Possible solution: retrieval-augmented models
  • Summary: premature for commercial deployment