Could you further explain this statement? I'd like to understand this better:
"To be statistically confident that your system has improved from an average score of 3.2 to 3.4 requires a far larger sample size than detecting a shift in a binary pass rate from 75% to 80%. You can spend weeks making changes without knowing if you're actually making progress or just seeing random fluctuations in your annotators' moods."
If we're comparing proportions (75% to 80%) to means (converting Likert scale from Ordinal to Interval data), you'd have to have pretty high dispersion for sample size to be larger for Likert evals, no?
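A rough way to check the question above is to run the textbook normal-approximation sample-size formulas for both cases. This is only a sketch under stated assumptions: two independent groups of equal size, a two-sided test at alpha = 0.05 with 80% power, and Likert scores treated as interval data with the same standard deviation before and after. The function names and the example SD of 1.0 are illustrative and do not come from the article.

```python
import math
from statistics import NormalDist


def z_total(alpha=0.05, power=0.80):
    """Sum of the two-sided critical value and the power quantile."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group to detect a shift in pass rate from p1 to p2."""
    z = z_total(alpha, power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance_sum / (p2 - p1) ** 2)


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean shift of delta.

    Assumes both groups share the same standard deviation sd.
    """
    z = z_total(alpha, power)
    return math.ceil(2 * z ** 2 * sd ** 2 / delta ** 2)


def break_even_sd(p1, p2, delta):
    """SD at which the means test needs as many samples as the proportions test."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return abs(delta) * math.sqrt(variance_sum / 2) / abs(p2 - p1)


if __name__ == "__main__":
    # Binary pass rate: 75% -> 80%
    print("proportions:", n_per_group_proportions(0.75, 0.80))
    # Likert mean: 3.2 -> 3.4, with an assumed SD of 1.0
    print("means (sd=1.0):", n_per_group_means(0.2, 1.0))
    # SD above which the Likert comparison needs the larger sample
    print("break-even sd:", round(break_even_sd(0.75, 0.80, 0.2), 2))
```

Under these assumptions the binary comparison needs on the order of 1,090 samples per group, the Likert comparison with SD 1.0 needs roughly 390, and the two only break even at an SD of about 1.67. Since the largest possible SD on a 1-5 scale is 2, the dispersion would indeed have to be very high. The calculation does not account for annotator noise or for the ordinal nature of the scale, which may be what the quoted statement has in mind.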
For me the primary evaluation approach is to work with a real SME; domain expertise is key. Even if you make evals binary, that does not free you from historical and recency bias.
You might be evaluating the wrong thing in the first place - looking at correlations and not causation!
No pacification where reality to be resolved for the good 😊
Thanks for the informative article.
Always. More to come. I'm preparing a two-month series on the fundamentals of AI agents.
Really well-written and insightful.
Hamel is the king of AI evals.
Your articles completely changed my AI evals game!
Love this!