This post is mainly based on the paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023).

Traditional NLP Evaluation Methods

  • BLEU
    • Machine translation evaluation metric
    • N-gram precision: measures the percentage of candidate n-grams that also occur in the reference translations (a worked sketch of both BLEU and ROUGE follows this list)
    • Brevity penalty: penalizes translations that are shorter than the reference
    • For more details, check this post
  • ROUGE: Recall-Oriented Understudy for Gisting Evaluation
    • Recall of n-grams: $\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in \text{ref}} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{\text{gram}_n \in \text{ref}} \text{Count}(\text{gram}_n)}$, i.e., the fraction of reference n-grams that also occur in the candidate
    • ROUGE-N: overlap of n-grams
    • ROUGE-L: longest common subsequence
    • ROUGE-S: overlap of skip-bigram
  • Multiple-choice questions
    • A standardized and consistent way to evaluate language understanding capabilities
    • Sample benchmark: MMLU
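
To make BLEU and ROUGE concrete, here is a minimal pure-Python sketch of modified n-gram precision with a brevity penalty (BLEU-style) and of n-gram recall and longest-common-subsequence recall (ROUGE-N / ROUGE-L style). The function names and toy sentences are illustrative only; real evaluations should use a standard implementation such as sacrebleu or rouge-score.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """BLEU-style score: geometric mean of modified n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        clipped_overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(clipped_overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N: recall of reference n-grams that also appear in the candidate."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(ref[gram], cand[gram]) for gram in ref)
    return overlap / max(sum(ref.values()), 1)


def rouge_l(candidate, reference):
    """ROUGE-L (recall form): longest common subsequence length over reference length."""
    m, k = len(candidate), len(reference)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][k] / max(k, 1)


candidate = "the cat sat on the mat".split()
reference = "the cat sat on the red mat".split()
print(bleu(candidate, reference), rouge_n(candidate, reference, 1), rouge_l(candidate, reference))
```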

A key challenge of traditional LLM benchmarks: they cannot effectively distinguish aligned (instruction-tuned) models from their base models.

LLM-as-a-Judge

  • Idea: use strong LLMs as judges to evaluate answers to open-ended questions
  • Goal: develop a more scalable and automated LLM evaluation approach
  • Agreement rate
    • The GPT-4 judge matches human evaluations at an agreement rate exceeding 80%
    • This is the same level as the human-human agreement rate
  • Human preference datasets
    • MT-bench: 80 high-quality multi-turn questions, 3K expert votes
    • Chatbot Arena: crowdsourced, open-ended interactions with two anonymous models, 30K votes

example

Multi-turn dialogues between a user and two AI assistants—LLaMA-13B (Assistant A) and Vicuna-13B (Assistant B)—initiated by a question from the MMLU benchmark and a follow-up instruction. GPT-4 is then presented with the context to determine which assistant answers better.

LLM-as-a-Judge

  • Types
    • Pairwise comparison: determine which answer is better or declare a tie (see the paper’s Figure 5; a prompt sketch follows this list)
    • Single answer grading: directly assign a score (1 to 10) to a single answer (see the paper’s Figure 6)
    • Reference-guided grading: grading with reference solutions (see the paper’s Figure 8)
  • Advantages
    • Scalability
    • Explainability
  • Limitations
    • Position, verbosity, and self-enhancement biases
    • Limited reasoning ability
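
A minimal sketch of how pairwise comparison can be wired up: put the question and both answers into a judge prompt, send it to a strong LLM, and parse the verdict. `call_llm` is a placeholder for whatever chat-completion client is used, and the prompt wording below is paraphrased rather than the paper's exact template.

```python
JUDGE_PROMPT = """[System] Please act as an impartial judge and evaluate the quality of the
responses provided by two AI assistants to the user question displayed below. Output your
final verdict strictly as "[[A]]" if assistant A is better, "[[B]]" if assistant B is better,
or "[[C]]" for a tie.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""


def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a strong LLM (e.g. GPT-4) and return its reply."""
    raise NotImplementedError


def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie' according to the judge's verdict."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question,
                                           answer_a=answer_a,
                                           answer_b=answer_b))
    if "[[A]]" in verdict:
        return "A"
    if "[[B]]" in verdict:
        return "B"
    return "tie"
```

Single answer grading and reference-guided grading only change the prompt: the former asks for a 1-10 score for one answer, and the latter additionally includes a reference solution.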

mt-bench-example

Sample multi-turn questions in MT-bench

Limitations: Position Bias

  • Approach: the LLM judge is asked to evaluate the same two responses with their positions swapped
  • A “Rename” variant renames the assistants in the default prompt to test whether the bias comes from positions or from names
  • Findings
    • All LLM judges exhibit strong position bias, and most favor the first position
    • Claude-v1 also favors the name “Assistant A”
    • GPT-4 is the most consistent judge, with a consistency of 60%

position-bias

Position bias of different LLM judges. Consistency is the percentage of cases where a judge gives consistent results when swapping the order of two assistants. “Biased toward first” is the percentage of cases when a judge favors the first answer. “Error” indicates wrong output formats.
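
The consistency and “biased toward first” percentages above can be computed mechanically: judge each pair twice with the answer order swapped and compare the two verdicts. A minimal sketch, assuming a `judge` callable with the same A/B/tie interface as the pairwise prompt sketched earlier:

```python
def position_bias_stats(examples, judge):
    """examples: list of (question, answer_a, answer_b) triples.
    judge(question, first, second) -> 'A' | 'B' | 'tie', where 'A'/'B'
    refer to the first/second answer as presented in the prompt."""
    consistent = biased_first = 0
    for question, ans_a, ans_b in examples:
        v1 = judge(question, ans_a, ans_b)   # original order
        v2 = judge(question, ans_b, ans_a)   # swapped order
        # Map the swapped verdict back to the original labels.
        v2_underlying = {"A": "B", "B": "A", "tie": "tie"}[v2]
        if v1 == v2_underlying:              # same underlying preference both times
            consistent += 1
        elif v1 == "A" and v2 == "A":        # always prefers whichever answer comes first
            biased_first += 1
    n = len(examples)
    return consistent / n, biased_first / n
```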

Limitations: Verbosity Bias

  • Approach: “repetitive list” attack
    • Select answers that contain a numbered list
    • Use GPT-4 to rephrase the list without adding any new information
    • For example, an answer with a 5-item list becomes one with 10 items: the original 5 plus 5 rephrased versions of them
  • Findings
    • All LLMs may be prone to verbosity bias
    • GPT-4 defends against the attack significantly better than the others

verbosity-bias

Failure rate under “repetitive list” attack for different LLM judges on 23 answers.
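
A sketch of the “repetitive list” attack itself; `call_llm` is again a placeholder for a strong LLM client, and the rephrasing prompt is paraphrased from the description above:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a strong LLM such as GPT-4."""
    raise NotImplementedError


REPHRASE_PROMPT = ("Rephrase the following numbered list without adding any new "
                   "information:\n\n{list_text}")


def repetitive_list_attack(answer_with_list: str) -> str:
    """Pad an answer by appending a rephrased copy of its own numbered list."""
    rephrased = call_llm(REPHRASE_PROMPT.format(list_text=answer_with_list))
    # The attacked answer carries the same information but is roughly twice as long;
    # a judge free of verbosity bias should not prefer it over the original.
    return answer_with_list + "\n" + rephrased
```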

Limitations: Self-Enhancement Bias

  • LLM judges may favor the answers generated by themselves
  • Approach: compare each model’s win rate under different LLM judges and under human judges
  • Findings
    • GPT-4 favors itself with a 10% higher win rate
    • Claude-v1 favors itself with a 25% higher win rate

Limitations: Lack of Capability in Grading Math and Reasoning Questions

  • LLM judges can grade math and reasoning answers incorrectly, even when they are capable of solving the underlying problem themselves, because they can be misled by the answers being judged

Addressing Limitations

  • Swapping positions (a sketch of this mitigation follows this list)
  • Few-shot judge
  • CoT and reference-guided judge
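
The first mitigation, swapping positions, can be expressed directly on top of the hypothetical `pairwise_judge` helper sketched earlier: call the judge twice with the answer order swapped and only declare a winner when both orderings agree (the conservative option of treating inconsistent results as ties).

```python
def debiased_pairwise_judge(question, answer_a, answer_b):
    """Position-swap mitigation: only declare a winner when both orderings agree."""
    v1 = pairwise_judge(question, answer_a, answer_b)     # 'A'/'B' refer to answer_a/answer_b
    v2 = pairwise_judge(question, answer_b, answer_a)     # 'A'/'B' refer to answer_b/answer_a
    v2_underlying = {"A": "B", "B": "A", "tie": "tie"}[v2]
    return v1 if v1 == v2_underlying else "tie"           # inconsistent verdicts => tie
```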

Multi-turn Judge

  • When comparing two assistants, the judge needs to handle 2 questions and 4 responses
  • Two designs
    • Design 1: breaking the two turns into two prompts
    • Design 2: displaying the complete conversations in a single prompt (sketched below)
  • Findings
    • Design 1 causes the LLM judge to struggle to locate the assistant’s previous response precisely
    • It is necessary to display the complete conversation so the LLM judge can better grasp the context
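
A sketch of Design 2: both complete two-turn conversations are laid out in a single judge prompt so the judge can see each assistant's first answer while grading the second. The template wording is illustrative, not the paper's exact prompt.

```python
MULTI_TURN_TEMPLATE = """[System] Please act as an impartial judge. Focus on the assistants'
answers to the second user question; the full conversations are shown for context.

<|Assistant A's conversation|>
User: {q1}
Assistant A: {a_turn1}
User: {q2}
Assistant A: {a_turn2}

<|Assistant B's conversation|>
User: {q1}
Assistant B: {b_turn1}
User: {q2}
Assistant B: {b_turn2}
"""


def build_multi_turn_prompt(q1, q2, a_turns, b_turns):
    """Design 2: display both complete conversations in one prompt."""
    return MULTI_TURN_TEMPLATE.format(q1=q1, q2=q2,
                                      a_turn1=a_turns[0], a_turn2=a_turns[1],
                                      b_turn1=b_turns[0], b_turn2=b_turns[1])
```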

Agreement Evaluation

  • Metrics
    • Agreement: the probability that a randomly selected judge of one type (e.g., an LLM) and a randomly selected judge of the other type (e.g., a human) agree on a randomly selected question
    • Average win rate: the average of one model’s win rates against all other models
    • Both metrics can be computed with or without including tie votes (a simplified sketch follows this list)
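
A simplified sketch of both metrics for the case of one vote per judge per question; ties are counted as non-wins here, which may differ from the paper's exact bookkeeping.

```python
from collections import defaultdict


def agreement(votes_judge1, votes_judge2):
    """Fraction of shared questions on which two judges give the same verdict.
    votes_*: dict mapping question_id -> 'A' | 'B' | 'tie'."""
    shared = votes_judge1.keys() & votes_judge2.keys()
    return sum(votes_judge1[q] == votes_judge2[q] for q in shared) / len(shared)


def average_win_rate(pairwise_results, model):
    """pairwise_results: list of (model_x, model_y, winner), winner in {model_x, model_y, 'tie'}.
    Returns the average of the model's win rates against each opponent."""
    wins, totals = defaultdict(int), defaultdict(int)
    for x, y, winner in pairwise_results:
        if model not in (x, y):
            continue
        opponent = y if x == model else x
        totals[opponent] += 1
        if winner == model:
            wins[opponent] += 1
    return sum(wins[o] / totals[o] for o in totals) / len(totals)
```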

High agreement on MT-bench

  • GPT-4 shows very high agreement with human experts in
    • Pairwise comparison
    • Single answer grading
  • The GPT-4 vs. human agreement under setup S2 (w/o ties) reaches 85%, higher than the human-human agreement (81%)
  • Conflict resolution
    • When a human’s choice deviated from GPT-4’s, the authors presented GPT-4’s judgment to the human and asked whether it was reasonable
    • Despite the differing views, humans deemed GPT-4’s judgments reasonable in 75% of cases and were even willing to change their original choice in 34% of cases
  • Conclusion
    • GPT-4 with single-answer grading matches both pairwise GPT-4 and human preferences very well
    • This means GPT-4 has a relatively stable internal rubric
    • Single-answer grading may sometimes perform slightly worse than pairwise comparison and give more tie votes

agreement-1

Agreement between two types of judges on MT-bench.

  • Row
    • G4-Pair: GPT-4 with pairwise comparison
    • G4-Single: GPT-4 with single-answer grading
    • Single-answer grading can be converted into pairwise comparisons for calculating the agreement (conversion sketched below)
  • Column
    • S1: includes non-tie, tie, and inconsistent (due to position bias) votes, counting inconsistent votes as ties
    • S2: only includes non-tie votes
  • Cell
    • The agreement between two random judges under each setup is denoted as “R=%”
    • Top value: agreement
    • Bottom gray value: #votes
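
The conversion mentioned above (single-answer grades to pairwise votes) is mechanical: compare the two scores and map equal scores to a tie. A small sketch under that assumption:

```python
def scores_to_pairwise(score_a: float, score_b: float) -> str:
    """Convert two single-answer grades (e.g. on a 1-10 scale) into a pairwise verdict."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


# Example: if G4-Single gives answer A a 9 and answer B a 7, this counts as an "A" vote
# when computing agreement against G4-Pair or against human pairwise votes.
assert scores_to_pairwise(9, 7) == "A"
```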

High agreement on Chatbot Arena

agreement-2

Agreement between two types of judges on Chatbot Arena.

  • G4-S: GPT-4 with single-answer grading
  • G4: GPT-4 with pairwise comparison
  • G3.5: GPT-3.5 with pairwise comparison
  • C: Claude with pairwise comparison
  • H: human