This post is mainly based on the following benchmark papers.

BIG-bench (2022)

  • Beyond the Imitation Game Benchmark (BIG-bench)
  • Task
    • Tasks that are believed to be beyond the capabilities of current language models
    • Large-scale, extremely difficult and diverse tasks
  • Dataset
    • 204 tasks, contributed by 450 authors across 132 institutions
    • Diverse tasks: linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development
    • Human expert raters and average human raters performed all tasks to provide baselines
  • Findings
    • Evaluates dense and sparse transformer models from Google and OpenAI across six orders of magnitude of model scale
    • Performance is similar across model classes
    • Tasks where performance scales linearly with size are mainly knowledge-based tasks that involve memorization
    • Tasks where performance exhibits “breakthrough” behavior at a critical scale often involve multiple steps / skills
    • Social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting

BIG-bench API

  • JSON Task (80% of all)
    • Contains a list of examples made up of inputs and targets
    • Evaluation
      • Comparing generated model outputs to targets using standard metrics (e.g., ROUGE)
      • Computing metrics on model-assigned probabilities (e.g., for multiple-choice questions); see the scoring sketch after this list
  • Programmatic Task (20% of all)
    • Written in Python and have the ability to interact directly with the model over multiple query rounds
    • The code that defines the task can query the model repeatedly (e.g., multi-round dialogue); see the programmatic-task sketch after this list
    • Evaluation: custom metrics
  • Benchmark Average Score
    • Specify a unique preferred metric for each task
    • Specify high and low scores for each task; raw scores are then normalized to this range (see the normalization sketch after this list)
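
As a rough illustration of the JSON format, a task is essentially some metadata plus a list of examples with inputs and targets. The sketch below is a hypothetical toy task in the spirit of the public BIG-bench schema (field names such as target_scores follow the repository, but treat the exact schema and the cond_log_prob model interface as assumptions); it also shows how a multiple-choice example can be graded from model-assigned probabilities.

```python
# A minimal, hypothetical JSON task in the spirit of BIG-bench
# (field names are illustrative; consult the repository for the exact schema).
json_task = {
    "name": "toy_multiple_choice",
    "description": "Pick the correct completion.",
    "metrics": ["multiple_choice_grade"],
    "examples": [
        {
            "input": "The capital of France is",
            # Each candidate target gets a weight: 1 marks the correct
            # answer(s), 0 marks the distractors.
            "target_scores": {"Paris": 1, "Lyon": 0, "Marseille": 0},
        },
    ],
}


def multiple_choice_grade(example, cond_log_prob):
    """Score one example: 1.0 if the model assigns the highest probability
    to a correct target, else 0.0.

    `cond_log_prob(context, continuation)` is an assumed model interface
    returning log P(continuation | context).
    """
    scores = {
        target: cond_log_prob(example["input"], target)
        for target in example["target_scores"]
    }
    best = max(scores, key=scores.get)
    return float(example["target_scores"][best] > 0)
```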
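
Programmatic tasks, by contrast, are Python classes that drive the interaction themselves. The sketch below is purely illustrative (the class shape, the evaluate_model method, and the model.generate_text interface are assumptions, not the exact bigbench API): the task code queries the model over several rounds and then computes a custom metric.

```python
class GuessingGameToyTask:
    """Hypothetical programmatic task: the task code interacts with the
    model over multiple query rounds and returns a custom score."""

    def evaluate_model(self, model, num_rounds=3):
        # `model.generate_text(prompt)` is an assumed text-in/text-out
        # interface; the real API may differ.
        transcript = "Let's play a guessing game. I am thinking of an animal."
        for _ in range(num_rounds):
            question = model.generate_text(transcript + "\nAsk a yes/no question:")
            answer = "yes" if "whiskers" in question.lower() else "no"  # toy oracle
            transcript += f"\nQ: {question}\nA: {answer}"
        final_guess = model.generate_text(transcript + "\nYour final guess:")
        # Custom metric: did the model guess the intended animal?
        return {"exact_match": float("cat" in final_guess.lower())}
```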
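
To aggregate scores across tasks with heterogeneous metrics, each task's preferred-metric score is mapped onto a common scale using the task's declared low and high scores. A minimal sketch of that normalization (my reading of the setup; details such as clipping are assumptions):

```python
def normalize_score(raw, low, high):
    """Map a raw preferred-metric score to a roughly 0-100 scale, where
    `low` is the score of a trivial/random baseline and `high` is a very
    good (e.g., expert-level) score."""
    return 100.0 * (raw - low) / (high - low)


# Example: multiple-choice accuracy of 0.6 on a 4-way task, where random
# guessing (low) scores 0.25 and a perfect model (high) scores 1.0.
print(normalize_score(0.6, low=0.25, high=1.0))  # ~46.7
```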

BIG-bench Lite

JSON tasks only, for less computationally intensive evaluation.

24 tasks included in BIG-bench Lite:

auto_debugging
bbq_lite_json
code_line_description
conceptual_combinations
conlang_translation
emoji_movie
formal_fallacies_...
hindu_knowledge
known_unknowns
language_identification
linguistics_puzzles
logic_grid_puzzle
logical_deduction
misconceptions_russian
novel_concepts
operators
parsinlu_reading_comprehension
play_dialog_same_or_different
repeat_copy_logic
strange_stories
strategyqa
symbol_interpretation
vitaminc_fact_verification
winowhy

“Breakthrough” Behavior

  • Tasks that see strong breakthrough behavior include those that are
    • Composite in nature
    • Require a model to apply several distinct skills or perform multiple steps to reach the correct answer
  • Examples
    • modified_arithmetic: applying a mathematical operator, defined in-context, to certain inputs
    • repeat_copy_logic
    • figure_of_speech_detection
    • codenames

[Figure: bbh-linear-breakthrough]

Left: Tasks that achieve the highest linearity scores see model performance improve predictably with scale.

Right: Tasks with high breakthroughness do not see model performance improve until the model reaches a critical scale. Around 5% of BIG-bench tasks see models achieve sudden score breakthroughs with increasing scale.

Social Bias

  • Social Bias: given a fixed context involving people, with potential completions, does a model show a systematic preference for members of one category over another, or for associating particular attributes with particular categories? (See the sketch after this list for one way to quantify such a preference.)
  • Bias typically increases with scale in settings with broad or ambiguous contexts
  • Bias can decrease with scale in settings with narrow, unambiguous contexts
  • Bias can potentially be steered through appropriately chosen prompting
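
One simple way to make this preference concrete is to compare the probability a model assigns to otherwise-identical completions that differ only in the group mentioned. The sketch below is a generic illustration of that idea, not the exact bias metric used in BIG-bench; the cond_log_prob model interface is an assumption.

```python
import math


def preference_gap(cond_log_prob, context, completion_a, completion_b):
    """Return P(A) - P(B) under the model for two completions of the same
    context that differ only in the social group mentioned.

    `cond_log_prob(context, completion)` is an assumed model interface
    returning log P(completion | context). A consistently positive (or
    negative) gap across many templates indicates a directional preference.
    """
    p_a = math.exp(cond_log_prob(context, completion_a))
    p_b = math.exp(cond_log_prob(context, completion_b))
    return p_a - p_b


# Example with an ambiguous context (illustrative only):
# gap = preference_gap(model.cond_log_prob,
#                      "The doctor walked into the room. ",
#                      "He greeted the patient.",
#                      "She greeted the patient.")
```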

AGIEval (2023)

  • Task
    • Human-centric standardized exams (college entrance exams, law school admission tests, math competitions)
  • Dataset
    • 8062 questions from 10 exams
    • Human-level reasoning and real-world relevance
    • Evaluation metrics: accuracy for multiple-choice questions; exact match and F1 for fill-in-the-blank questions (see the sketch after this list)
  • Findings
    • GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test
    • LLMs still show clear weaknesses in understanding, knowledge, reasoning, and calculation (detailed in the sections below)
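
For the fill-in-the-blank questions, exact match and token-level F1 are the standard metrics. The sketch below follows the common SQuAD-style definitions; AGIEval's exact answer-normalization rules are an assumption.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("x = 4", "x = 4"))     # 1.0
print(token_f1("the answer is 4", "4"))  # 0.4
```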

AGIEval vs MMLU

  • AGIEval: data sourced from high-standard and official human-centric exams
  • AGIEval is bilingual (English and Chinese)

[Figure: agieval-models]

Compared with average human performance, GPT-4 achieves better scores on the SAT, LSAT, and math competitions. Models are evaluated zero-shot with Chain-of-Thought (CoT) prompting. Human performance (top) refers to the performance of the top 1% of test takers, except for the lawyer qualification test, which uses the top 10%.

Overall Trend of LLM Capabilities

Capabilities

  • Understanding: accurately interpret the meaning of questions
  • Knowledge: proficiency in identifying correct knowledge
  • Reasoning: accurately execute multi-step reasoning processes
  • Calculation: basic calculation, combinatorial abstraction and variable substitutions

Strength

  • Good Understanding
  • Proficiency in Simple Reasoning and Deduction
  • Grasping General Reasoning Process

Weaknesses

  • Understanding
    • Difficulty with Variable Substitution
    • Difficulty with Complex Math Concepts and Symbols
    • Confusion with Similar Concepts
    • Difficulty with Handling Long Contexts
  • Knowledge
    • Occasional lack of commonsense or domain-specific knowledge
    • Difficulty with Identifying Correct Formulas
  • Reasoning
    • Difficulty with Strict Logical Deduction
      • Ignoring premise conditions
      • Misconstruing sufficient and necessary conditions
      • Making errors in logical chaining
    • Difficulty with Counterfactual Reasoning
      • Difficulty with generating alternative scenarios
      • Difficulty with evaluating hypothetical outcomes
      • Difficulty with exploring potential consequences based on varying assumptions
    • Struggles in Multi-hop Complex Reasoning
    • Establishing Incorrect Conclusions and Contradictory Reasoning
      • Sets an incorrect conclusion first and then generates contradictory reasoning based on that faulty foundation
    • Concealed Substitution of Concepts
    • Vulnerability to Contextual Disturbance
      • When the context is modified, the models may produce different deductions for the same condition
  • Calculation
    • Prone to Making Calculation Errors
    • Difficulty with Complex Variable Substitutions

MT-Bench (2023)

  • Task: systematically evaluate LLM alignment
    • Idea: use strong LLMs as judges to evaluate answers to open-ended questions (a rough sketch of the judging setup appears after this list)
    • Goal: develop a more scalable and automated LLM evaluation approach
  • Dataset
    • MT-bench: 80 high-quality multi-turn questions, 3K expert votes
    • Chatbot Arena: crowdsourced, open-ended interactions with two anonymous models; 30K votes
  • Findings
    • The GPT-4 judge matches human evaluations at an agreement rate exceeding 80%
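
As a rough sketch of the LLM-as-a-judge idea: the judge model is shown the question and two candidate answers and asked to pick the better one (or declare a tie), and its verdicts can then be compared with human votes to compute an agreement rate. The prompt wording and verdict labels below are illustrative, not MT-Bench's exact templates.

```python
JUDGE_PROMPT = """You are an impartial judge. Compare the two AI assistant
answers to the user question below and decide which is better.
Reply with exactly one of: [[A]], [[B]], or [[C]] for a tie.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""


def judge_pair(generate, question, answer_a, answer_b):
    """Ask a strong LLM to pick the better answer.

    `generate(prompt)` is an assumed text-completion interface."""
    verdict = generate(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    for label in ("[[A]]", "[[B]]", "[[C]]"):
        if label in verdict:
            return label
    return "[[C]]"  # fall back to a tie if the verdict cannot be parsed


def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of pairs on which the LLM judge and the human vote agree."""
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)
```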

For more details, please refer to this post.