This post is mainly based on the SUPERNI, Self-Instruct, Alpaca, Vicuna, and GPT-4-LLM papers.

SUPERNI

  • Super-NaturalInstructions Dataset
    • 1616 diverse NLP tasks + expert-written instructions
    • 76 distinct task types: classification, extraction, infilling, sequence tagging, text rewriting, text composition, etc.
    • Contributed by 88 NLP practitioners in response to a public call
  • Results
    • Trained Tk-INSTRUCT (backbone: T5 model)
    • Tk-INSTRUCT outperforms existing instruction-following models (e.g., InstructGPT) by more than 9% on this benchmark despite being an order of magnitude smaller

superni-stats

Statistics of SUP-NATINST.

SUP-NATINST

  • A meta-dataset
    • A dataset of datasets
    • Consists of a wide variety of NLP tasks with their instructions
  • Extends NATINST dataset
    • 26x more tasks
    • Greater variety of task types
  • Instruction schema (see the JSON sketch after this list)
    • Definition: defines a given task in natural language / how to map input text to output text
    • Positive Examples: (inputs, output), with a short explanation
    • Negative Examples: (inputs, output), with a short explanation
  • Task instances
    • A unified format to organize the instances of all tasks
    • Each instance consists of a textual input and a list of acceptable textual outputs
    • Limit the number of instances in each task to 6.5K to avoid an imbalanced dataset
  • Data collection
    • Source
      • Existing public NLP datasets
      • Available intermediate annotations in crowdsourcing experiments
      • Synthetic tasks that can be communicated to an average human in a few sentences
    • JSON files via GitHub pull requests
    • Reviewed by automated checks and peers
  • Evaluation Setup
    • Evaluation tasks
      • A fixed, manually selected collection of 12 categories, representing 154 tasks
      • Sample a maximum of 100 instances for each task, which results in 15,310 testing instances in total
      • English cross-task generalization: 119 tasks
      • Cross-lingual cross-task generalization: 35 tasks
    • Evaluation Metrics
      • ROUGE-L
      • Human
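
Since tasks are contributed as JSON files, the instruction schema above maps directly onto a file format. Below is a minimal Python sketch of what a single task file might look like once loaded; the example task and field values are made up for illustration, and the real repository files carry extra metadata (contributors, source, categories, languages) that is omitted here.

```python
import json

# Hedged sketch of one SUP-NATINST task file after loading; the example task is
# made up, and field names are simplified from the schema described above.
task = {
    "Definition": [
        "Given a sentence, classify its sentiment as 'positive' or 'negative'."
    ],
    "Positive Examples": [
        {
            "input": "The movie was a delight from start to finish.",
            "output": "positive",
            "explanation": "The sentence expresses clear enjoyment, so the label is positive.",
        }
    ],
    "Negative Examples": [
        {
            "input": "The plot dragged and the acting was wooden.",
            "output": "positive",
            "explanation": "The sentence is critical, so labeling it positive is wrong.",
        }
    ],
    # Each instance pairs one textual input with a list of acceptable outputs;
    # tasks are capped at 6.5K instances to keep the dataset balanced.
    "Instances": [
        {"input": "I would happily watch it again.", "output": ["positive"]}
    ],
}

print(json.dumps(task, indent=2))
```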

superni-corr

Human evaluation vs. ROUGE-L for several methods. Human evaluation aligns quite well with the automatic metric. For details, see Section 6.2 and Appendix B of the paper.

Results

  • Model: Tk-INSTRUCT
  • Goal: Generalization to Unseen Tasks at Scale
    • Given: instruction $I_t$, dataset $(X_t, Y_t)$
    • Learn: $y = M(I_t, x)$ for $(x,y) \in (X_t, Y_t)$ (a prompt-construction sketch follows this list)
  • Models that leverage instructions show stronger generalization to unseen tasks.
  • A model fine-tuned on a diverse set of tasks outperforms InstructGPT and T0 by a large margin
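
To make the $y = M(I_t, x)$ formulation concrete, here is a minimal sketch (not the official Tk-INSTRUCT code) of flattening a task definition $I_t$, a few positive examples, and a new input $x$ into the single text prompt that a seq2seq backbone such as T5 maps to the output $y$; the exact encoding Tk-INSTRUCT uses may differ in details.

```python
def build_prompt(definition: str, pos_examples: list[dict], x: str, k: int = 2) -> str:
    """Flatten instruction + k in-context examples + new input into one prompt string."""
    parts = [f"Definition: {definition}"]
    for ex in pos_examples[:k]:                       # positive examples from the task file
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Input: {x}\nOutput:")              # the model generates y after "Output:"
    return "\n\n".join(parts)

prompt = build_prompt(
    "Given a sentence, classify its sentiment as 'positive' or 'negative'.",
    [{"input": "The movie was a delight.", "output": "positive"}],
    "I would happily watch it again.",
)
# `prompt` is then fed to the fine-tuned T5 model M to produce y.
print(prompt)
```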

superni-eval

The overall performance of different methods on unseen tasks in the SUP-NATINST test set, with ROUGE-L as the aggregated metric.

Scaling Trends of Generalization

  • More observed tasks: better generalization
    • Generalization performance grows log-linearly w.r.t. the number of observed tasks (see the note after this list)
  • More instances per task: not much impact
    • Generalization performance saturates when only 64 instances per task are used for training
  • Larger model: better generalization
    • Generalization performance grows log-linearly w.r.t. model parameter size
  • Trade-off
    • Increasing the diversity of training tasks is an alternative to scaling model sizes
    • T5-large model trained with 757 tasks $\approx$ T5-3B model trained with 128 tasks
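
For concreteness, "log-linear" here means that generalization performance is roughly linear in the logarithm of the scaled quantity. A hedged way to write the trend for panel (a) of the figure below, with $T$ the number of observed training tasks and $a, b$ fitted constants (not reported here), is

$$\text{ROUGE-L}(T) \approx a + b \log T,$$

and the same functional form describes the trend with model size in panel (c).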

superni-scaling

Scaling trends of model performance as a function of (a) the number of training tasks; (b) the number of instances per training task; (c) model size. The x-axes are in log scale. Performance increases with more observed tasks and larger models; the gain from more instances per task is limited.

Self-Instruct

  • A framework for generating instruction dataset
    • Generate: (instructions, input, output)
    • Filter: invalid or similar data
  • Finetune the original model with generated instruction dataset
  • Results on GPT-3
    • 33% absolute improvement over the original GPT-3 model when evaluated on SUPER-NATURALINSTRUCTIONS
    • Final model’s performance is comparable to InstructGPT001 / text-davinci-001

selfins-arch

Procedure: bootstrap

Starts with 175 manually-written seed tasks in the task pool

Step 1: Instruction Generation

  • In-context examples: 8 task instructions sampled from the task pool (see the sketch after this list)
  • 6 are human-written, 2 are model-generated
  • Prompt: Table 5
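
A hedged sketch of how the Step 1 prompt could be assembled; the actual template is in Table 5 of the paper, and the wording and helper name below are illustrative.

```python
import random

def build_instruction_prompt(human_pool: list[str], model_pool: list[str]) -> str:
    """Sample 8 in-context instructions (ideally 6 human-written + 2 model-generated)
    and ask the LM to continue the numbered list with new task instructions."""
    k_model = min(2, len(model_pool))        # early iterations may hold < 2 generated tasks
    demos = random.sample(human_pool, 8 - k_model) + random.sample(model_pool, k_model)
    lines = ["Come up with a series of tasks:"]
    lines += [f"Task {i}: {inst}" for i, inst in enumerate(demos, start=1)]
    lines.append(f"Task {len(demos) + 1}:")  # the LM's completion becomes a candidate instruction
    return "\n".join(lines)
```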

Step 2: Classification Task Identification

  • In-context examples: 31 instructions from the task pool (12 classification, 19 non-classification)
  • Prompt: Table 6

Step 3: Instance Generation

  • Input-first prompt (Table 7): generate the input first, then the output; used for non-classification tasks
  • Output-first prompt (Table 8): generate the possible class label first, then the input conditioned on it; used for classification tasks to avoid inputs skewed toward a single label

Step 4: Filtering and Postprocessing

  • A new instruction is added to the task pool only when its ROUGE-L similarity with every existing instruction is below 0.7 (see the filtering sketch after this list)
  • Exclude
    • Instructions that contain keywords LMs cannot handle (e.g., image, picture, graph)
    • Instances with the same input but different outputs
    • Invalid generations (e.g., the instruction is too long or too short)
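
A minimal sketch of the novelty filter, assuming the rouge_score package; the 0.7 threshold follows the paper, while the function name is ours.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def is_novel(candidate: str, task_pool: list[str], threshold: float = 0.7) -> bool:
    """Accept a candidate instruction only if its ROUGE-L similarity to every
    instruction already in the task pool is below the threshold."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < threshold
        for existing in task_pool
    )
```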

SELF-INSTRUCT Dataset

  • Statistics: see paper Table 1
  • Diversity: see paper Figure 4
  • Quality: see paper Table 2

Results

SUPERNI Evaluation Set

selfins-eval

Evaluation results on unseen tasks from SUPERNI. (1) SELF-INSTRUCT boosts GPT3 performance by a large margin (+33.1%); (2) SELF-INSTRUCT nearly matches the performance of InstructGPT001; (3) SELF-INSTRUCT further improves performance even when a large amount of labeled instruction data is present.

Human Evaluation

  • Grading
    • RATING-A: The response is valid and satisfying.
    • RATING-B: The response is acceptable but has minor errors or imperfections.
    • RATING-C: The response is relevant and responds to the instruction, but it has significant errors in the content. For example, GPT3 might generate a valid output first, but continue to generate other irrelevant things.
    • RATING-D: The response is irrelevant or completely invalid.

selfins-human

Performance of the GPT3 model and its instruction-tuned variants, evaluated by human experts on the 252 user-oriented instructions. Human evaluators are instructed to rate the models’ responses into four levels (A-D). The results indicate that $GPT3_{SELF-INST}$ outperforms all the other GPT3 variants trained on publicly available instruction datasets. Additionally, $GPT3_{SELF-INST}$ scores nearly as well as $InstructGPT_{001}$.

Scaling with Data Size

selfins-scaling

Human evaluation performance of $GPT3_{SELF-INST}$ models tuned with different numbers of instructions. The x-axis is in log scale.

Data Quality

  • Fix the instruction and input fields
  • Regenerate the output field of all instances using InstructGPT003

Alpaca

  • Base model: LLaMA-7B
  • Fine-tuning data: 52K instruction-following demonstrations generated with text-davinci-003 via the Self-Instruct pipeline
  • Results
    • Alpaca shows many behaviors similar to OpenAI’s text-davinci-003
    • Surprisingly small and cheap to reproduce: data generation cost under $500 via the OpenAI API, and fine-tuning took about 3 hours on 8 80GB A100s
    • Blind pairwise comparison: Alpaca wins 90 comparisons versus 89 for text-davinci-003
  • Limitations: hallucination, toxicity, and stereotypes

alpaca-arch

Alpaca training pipeline.

ShareGPT

  • An open-source Chrome extension that lets users share their ChatGPT conversations, effectively crowdsourcing ChatGPT data
  • Dataset: search huggingface/datasets

Vicuna

  • Base model: LLaMA-13B
  • Data
    • ~70K conversations from ShareGPT.com
    • Enhanced Alpaca’s training scripts to better handle multi-turn conversations and long sequences (max context length 512 -> 2048)
  • Cost: ~$1000 / 8 A100 GPUs in one day
  • Results
    • Eval: GPT-4
    • Quality of answers are based on helpfulness, relevance, accuracy, and detail
    • Findings (a judge-prompt sketch follows this list)
      • GPT-4 can produce highly consistent ranks and detailed explanations of why such scores are given
      • GPT-4 is not very good at judging coding/math tasks
    • Outperforms LLaMA and Alpaca in more than 90% of cases
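
A hedged sketch of a GPT-4-as-judge prompt in the spirit of this evaluation; it is not the official Vicuna template, and the criteria wording below is illustrative.

```python
def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model (e.g., GPT-4) to score two assistants' answers on
    helpfulness, relevance, accuracy, and level of detail."""
    return (
        "You are a helpful and impartial judge. Score the two AI assistants' answers "
        "to the question below on helpfulness, relevance, accuracy, and level of detail.\n\n"
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}\n\n"
        "First give each assistant a score from 1 to 10, then explain your reasoning."
    )
```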

GPT-4-LLM

  • Data: GPT-4 generated instruction
  • Model: Instruction-tuned LLaMA models and reward models
  • Evaluation
    • Human evaluation on three alignment criteria
    • Auto evaluation using GPT-4 feedback
    • ROUGE-L on Unnatural Instructions

Data

  • English Instruction-Following Data: reuse the 52K unique instructions from Alpaca, with responses regenerated by GPT-4 (an API sketch follows this list)
  • Chinese Instruction-Following Data: the same 52K instructions translated into Chinese, with answers generated by GPT-4 in Chinese
  • Comparison Data: GPT-4 is asked to rate responses from several models, providing preference signals for reward-model training
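
A minimal sketch of how responses for the 52K instructions could be regenerated with GPT-4 via the OpenAI chat API; the model name and prompt framing here are assumptions, not the paper’s exact setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def regenerate_answer(instruction: str, input_text: str = "") -> str:
    """Ask GPT-4 to answer one Alpaca-style instruction (illustrative prompt framing)."""
    prompt = instruction if not input_text else f"{instruction}\n\nInput: {input_text}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```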

Instruction Tuning

  • Supervised finetuning
    • Follow Alpaca’s training schedule
  • RLHF
    • Large-scale comparison data created by GPT-4
    • Train a reward model based on OPT 1.3B (a loss sketch follows this list)
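
A minimal sketch of a pairwise ranking loss of the kind typically used to train reward models on comparison data (InstructGPT-style); the exact objective in GPT-4-LLM may differ, and the scores below are dummy values.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Encourage the reward model to score the preferred response higher."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example with dummy scalar rewards for a batch of 4 comparisons:
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.0, 0.7])
print(pairwise_ranking_loss(chosen, rejected))  # lower loss when chosen > rejected
```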

Benchmarks

  • User-Oriented-Instructions-252
    • 252 instructions, motivated by 71 user-oriented applications such as Grammarly, StackOverflow, Overleaf, rather than well-studied NLP tasks
  • Vicuna-Instructions-80
    • A dataset of 80 challenging questions synthesized with GPT-4 that baseline models struggle with
  • Unnatural Instructions
    • 68,478 samples synthesized by text-davinci-002 using 3-shot in-context learning from 15 manually constructed examples

Alignment Criteria: HHH

  • Helpfulness: whether it helps humans achieve their goals. A model that can answer questions accurately is helpful
  • Honesty: whether it provides true information, and expresses its uncertainty to avoid misleading human users when necessary. A model that provides false information is not honest
  • Harmlessness: whether it does not cause harm to humans. A model that generates hate speech or promotes violence is not harmless