9 Comments
Meenakshi NavamaniAvadaiappan

No pacification where reality is to be resolved for the good 😊

Prashant Bajpayee

Thanks for the informative article.

Paul Iusztin

Always. More to come. Preparing a two-month series on the fundamentals of AI Agents.

vincentwu0730

Really well-written and insightful.

Paul Iusztin

Hamel is the king of AI evals.

Paul Iusztin

Your articles completely changed my AI evals game!

Kevin Kelly

Love this!

Parth Kapoor

Could you further explain this statement? I'd like to understand this better:

"To be statistically confident that your system has improved from an average score of 3.2 to 3.4 requires a far larger sample size than detecting a shift in a binary pass rate from 75% to 80%. You can spend weeks making changes without knowing if you're actually making progress or just seeing random fluctuations in your annotators' moods."

If we're comparing proportions (75% to 80%) with means (treating the Likert scale as interval rather than ordinal data), wouldn't the dispersion have to be pretty high for the Likert evals to need the larger sample size?

Mario Lazo

For me, the primary evaluation approach is to work with a real SME: domain expertise is key. Even if you make your evals binary, that does not free you from historical and recency bias.

You might be evaluating the wrong thing in the first place, looking at correlations and not causation!
