This post is mainly based on

Model scale:

  • GPT-3 (OpenAI): 175B
  • GPT-Neo (EleutherAI): 2.7B
  • GPT-NeoX (EleutherAI): 20B
  • OPT (Meta): 175B (research-only) / 66B (open source)
  • BLOOM: 176B
  • StableLM: 3-7B (open source) / 15-66B (pending)

GPT-NeoX

  • GPT-Neo Repo
  • GPT-NeoX Repo
  • Dense transformer
  • Hardware
    • Training: 12x clusters, each with 8x A100-40G
    • Fine-tuning: 4x A100-40G
    • Inference: 2x 3090 Ti or 1x A6000-48G (see the loading sketch after this list)
  • Evaluation
    • Competitive with similarly sized GPT-3 models on language modeling benchmarks
    • Better arithmetic performance
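
As a rough companion to the inference hardware above, here is a minimal sketch of loading GPT-NeoX-20B through Hugging Face transformers; the prompt and generation settings are illustrative, not from the original evaluation.

```python
# Minimal GPT-NeoX-20B inference sketch (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,  # FP16 weights take roughly 40 GB
    device_map="auto",          # shard across available GPUs (needs `accelerate`)
)

# Arithmetic prompt, echoing the benchmark note above.
inputs = tokenizer("Q: What is 12 * 34?\nA:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```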

OPT

OPT (Open Pre-trained Transformer) is an open-source approximation of GPT-3. The paper and repo give a detailed account of the training process, including compute requirements, human overhead, and ad-hoc optimization decisions.

  • “Open source”
    • Small models (125M - 66B) are released
    • 175B model is for non-commercial / research only
  • Detailed description of model training
    • Dataset / data cleaning
    • Optimization challenges (e.g., hardware failures, loss divergence)
  • Hardware intensive
    • Pre-training: 992 80GB A100 GPUs
    • Fine-tuning / deployment: 16 V100 GPUs
  • Repo and training log

Dataset

  • RoBERTa dataset
    • BookCorpus
    • Stories
    • CCNews v2
  • The Pile dataset
    • CommonCrawl
    • DM Mathematics: synthetically generated mathematics problems (repo)
    • Project Gutenberg: ebooks
    • HackerNews
    • OpenSubtitles: database of movie and TV subtitles
    • OpenWebText2: enhanced version of the original OpenWebTextCorpus scraped from Reddit
    • USPTO: US patent and trademark research datasets
    • Wikipedia
  • PushShift.io Reddit dataset
    • Pushshift.io corpus: historical Reddit data
  • Processing
    • Deduplication: MinHash LSH, removing documents with Jaccard similarity > 0.95 (see the sketch after this list)
    • Ad-hoc whitespace normalization
    • Reddit: convert the conversation trees into documents, keeping only the longest chain of comments in each thread
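
The deduplication step can be sketched with the `datasketch` library; the word 5-gram shingling, `num_perm` value, and toy corpus below are assumptions for illustration, not the exact OPT pipeline.

```python
# MinHash-LSH deduplication sketch at a 0.95 Jaccard threshold.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """Hash a document's word 5-gram shingles into a MinHash signature."""
    words = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

corpus = [  # toy corpus; "b" is an exact duplicate of "a"
    ("a", "the quick brown fox jumps over the lazy dog again and again"),
    ("b", "the quick brown fox jumps over the lazy dog again and again"),
    ("c", "an entirely different document about language model training data"),
]

lsh = MinHashLSH(threshold=0.95, num_perm=128)
kept = []
for doc_id, text in corpus:
    sig = signature(text)
    if not lsh.query(sig):   # no near-duplicate indexed so far
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
print(kept)  # ["a", "c"]: the duplicate "b" is dropped
```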

Model

Largely based on the GPT-3 architecture (see the config sketch after this list):

  • Tokenizer: GPT-2 byte-level BPE
  • Dropout: 0.1 (no dropout on embeddings)
  • Activation: ReLU
  • Sequence length: 2048 tokens
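
These settings can be expressed with the Hugging Face `OPTConfig`; the width, depth, and head count below are 125M-scale values for illustration only (see the architecture table that follows for the actual per-model values).

```python
# GPT-3-style OPT configuration sketch (125M-scale dimensions, illustrative).
from transformers import OPTConfig, OPTForCausalLM

config = OPTConfig(
    vocab_size=50272,              # GPT-2 byte-level BPE vocabulary
    max_position_embeddings=2048,  # sequence length: 2048 tokens
    hidden_size=768,               # model width (illustrative)
    ffn_dim=3072,                  # feed-forward width, 4x hidden size
    num_hidden_layers=12,          # #L (illustrative)
    num_attention_heads=12,        # #H (illustrative)
    activation_function="relu",    # ReLU activation
    dropout=0.1,                   # 0.1 dropout
)
model = OPTForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # ~125M parameters
```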

[Figure: OPT model architecture table. #L: number of layers, #H: number of attention heads]

Optimization

  • Weight initialization
    • Follow Megatron-LM
    • Weights $\sim \mathcal{N}(0, 0.006)$ (standard deviation 0.006)
    • Output layers: standard deviation scaled by $1.0/\sqrt{2L}$, where $L$ is the number of layers
  • Bias initialization: 0
  • Optimizer
    • AdamW(0.9, 0.95)
    • Weight decay of 0.1
  • Learning rate
    • OPT-175B: warm up from 0 to the maximum LR over the first 2000 steps
    • Smaller models: warm up from 0 to the maximum LR over the first 375M tokens
    • Decay to 10% of the maximum LR over 300B tokens
  • Clip gradient norms at 1.0
  • Gradient pre-divide factor
    • Reduces the risk of over/underflow
    • Splits one division by $N$ into two divisions by $\sqrt{N}$ (illustrated in the sketch after this list)
  • Hardware
    • OPT-175B is trained on 992 80GB A100 GPUs
    • Adam state: FP32 (affordable because it is sharded across all hosts)
    • Model parameters: FP16
    • Dynamic loss scaling from Mixed Precision Training
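
Putting the recipe above together, here is a minimal PyTorch sketch of the initialization, AdamW settings, warmup/decay schedule, gradient clipping, and the $\sqrt{N}$ gradient pre-divide; the toy model, peak LR, and step counts are illustrative assumptions.

```python
import math
import torch
from torch import nn

L = 12             # number of layers (illustrative)
N = 992            # number of data-parallel workers (992 GPUs for OPT-175B)
peak_lr = 1.2e-4   # illustrative; OPT's peak LR depends on model size
warmup_steps, total_steps = 2000, 100_000  # illustrative step counts

def init_weights(module: nn.Module) -> None:
    """N(0, 0.006) init; output projections scaled by 1/sqrt(2L); zero bias."""
    if isinstance(module, nn.Linear):
        std = 0.006
        if getattr(module, "is_output_proj", False):
            std /= math.sqrt(2 * L)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))
model[2].is_output_proj = True  # mark the output projection (illustrative tag)
model.apply(init_weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR, then linear decay to 10% of the peak."""
    if step < warmup_steps:
        return step / warmup_steps
    frac = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return max(1.0 - 0.9 * frac, 0.1)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def pre_divided_average(grad: torch.Tensor) -> torch.Tensor:
    """Average over N workers as two divisions by sqrt(N): dividing before
    and after the all-reduce keeps intermediates in a safer numeric range
    than a single division by N (less overflow before, less underflow after)."""
    grad = grad / math.sqrt(N)
    # torch.distributed.all_reduce(grad)  # sum across workers (omitted here)
    return grad / math.sqrt(N)

# One illustrative step: forward, backward, clip at 1.0, update, schedule.
loss = model(torch.randn(4, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
scheduler.step()
```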

Training Processes

  • Training took over two months
  • Hardware Failures
    • At least 35 manual restarts and 70+ automatic restarts
    • During manual restarts, a series of diagnostic tests was run to detect problematic nodes
    • Flagged nodes were taken offline and training was resumed from the last saved checkpoint
    • Over 100 hosts were cycled out
  • Loss Divergences
    • Lowering the learning rate and restarting from an earlier checkpoint
    • Correlation between
      • Loss divergence
      • Dynamic loss scalar crashing to 0
      • L2-norm of the activations of the final layer spiking
    • Hence, the restart checkpoint is chosen such that the dynamic loss scalar was still in a “healthy” state (> 1.0); see the sketch after this list
  • Reducing Loss Divergences
    • Lowering gradient clipping from 1.0 to 0.3
    • Switching to vanilla SGD (reverting to AdamW after optimization plateaued)
    • Resetting the dynamic loss scalar
    • Switching to a newer version of Megatron-LM, a library for distributed transformer training (repo)
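
The divergence-recovery policy can be summarized in a short, self-contained sketch; the `Checkpoint` record and the halving of the learning rate are illustrative assumptions standing in for the paper's manual process.

```python
# Restart-from-healthy-checkpoint sketch for loss divergences.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int
    loss_scalar: float  # dynamic loss scale recorded at save time
    lr: float

def pick_restart(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Newest checkpoint saved while the loss scalar was still 'healthy' (> 1.0)."""
    healthy = [c for c in checkpoints if c.loss_scalar > 1.0]
    if not healthy:
        raise RuntimeError("no healthy checkpoint to restart from")
    return max(healthy, key=lambda c: c.step)

def handle_divergence(checkpoints: list[Checkpoint],
                      lr_factor: float = 0.5) -> Checkpoint:
    """On divergence (loss scalar crashing to 0, activation-norm spike),
    resume from a healthy checkpoint with a lowered learning rate."""
    ckpt = pick_restart(checkpoints)
    return Checkpoint(ckpt.step, ckpt.loss_scalar, ckpt.lr * lr_factor)

# Example: the scalar crashed by step 3000; resume from step 1000 at half LR.
history = [Checkpoint(1000, 4096.0, 1e-4),
           Checkpoint(2000, 0.5, 1e-4),
           Checkpoint(3000, 0.0, 1e-4)]
print(handle_divergence(history))  # Checkpoint(step=1000, ..., lr=5e-05)
```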

Evaluations

  • Multi-tasking: via prompting, following GPT-3 (see the prompt sketch after this list)
  • Aimed to re-implement the GPT-3 evaluation setting
  • Model scale vs. average zero-shot and multi-shot performance closely tracks GPT-3
  • Dialogue datasets: perplexity is competitive with the open-source supervised SOTA
  • Hate speech detection: outperforms GPT-3 Davinci
  • Intrasentence-level biases: competitive with GPT-3 Davinci
  • Stereotypical bias: competitive with GPT-3 Davinci
  • Responses to toxic language prompts
    • Higher toxicity rate than PaLM or Davinci
    • All three models become more likely to generate toxic continuations as prompt toxicity increases
  • Dialogue Safety
    • SaferDialogues: Ability to recover from explicit safety failures (graceful vs defensive response)
    • Safety Bench Unit Tests: generating offensive content, responding inappropriately to offensive content, and responding inappropriately in safety-critical situations
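
“Multi-tasking via prompting” means tasks are cast as text completion, either zero-shot or with in-context examples; the sentiment template below is an illustrative assumption in the style of GPT-3's evaluation, not OPT's exact harness.

```python
# Zero-shot vs. few-shot prompting sketch for a sentiment task (illustrative).
zero_shot = "Review: The movie was wonderful.\nSentiment:"

few_shot = (
    "Review: I hated every minute.\nSentiment: negative\n\n"
    "Review: A delightful surprise.\nSentiment: positive\n\n"
    "Review: The movie was wonderful.\nSentiment:"
)

# With any causal LM (e.g., the GPT-NeoX loading sketch earlier):
# inputs = tokenizer(few_shot, return_tensors="pt").to(model.device)
# outputs = model.generate(**inputs, max_new_tokens=1)
print(zero_shot)
print(few_shot)
```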

Limitations

  • Following instructions
    • OPT-175B does not handle declarative instructions well
    • The model tends to generate a dialogue that begins with such an instruction rather than executing it
    • Possible solution: fine-tuning similar to InstructGPT
  • Repetitive behavior
    • OPT-175B tends to be repetitive and can easily get stuck in a loop
    • Possible solution: unlikelihood training or best-first decoding
  • Factually incorrect statements
    • Possible solution: retrieval-augmented models
  • Summary: premature for commercial deployment