9 Comments
Meenakshi NavamaniAvadaiappan

No pacification where reality is to be resolved for the good 😊

Prashant Bajpayee

Thanks for the informative article.

Paul Iusztin

Always. More to come. Preparing a two-month series on the fundamentals of AI Agents.

vincentwu0730

Really well-written and insightful.

Paul Iusztin

Hamel is the king of AI evals.

Paul Iusztin

Your articles completely changed my AI evals game!

Kevin Kelly

Love this!

Parth Kapoor

Could you further explain this statement? I'd like to understand this better:

"To be statistically confident that your system has improved from an average score of 3.2 to 3.4 requires a far larger sample size than detecting a shift in a binary pass rate from 75% to 80%. You can spend weeks making changes without knowing if you're actually making progress or just seeing random fluctuations in your annotators' moods."

If we're comparing proportions (75% to 80%) with means (treating the Likert scale as interval rather than ordinal data), wouldn't the dispersion have to be pretty high for the Likert evals to need the larger sample size?

Mario Lazo

For me, the primary evaluation approach is to work with a real SME: domain expertise is key. Even if you make your evals binary, that does not free you from historical and recency bias.

You might be evaluating the wrong thing in the first place, looking at correlations and not causation!
