On "Tool Confusion": You mentioned the Gorilla benchmark shows models struggle when given more than one tool. I'm curious about your take on this: Is the core challenge truly the number of tools, or is it the agent's underlying reasoning and planning ability in complex, multi-step workflows? Newer benchmarks seem to focus more on testing this multi-turn reasoning capability.
On the 32k Token "Limit": You highlighted that model accuracy can drop significantly after 32,000 tokens, which is a critical warning for production systems. Since some of the latest models (like GPT-4o and Claude 3.5 Sonnet) are showing strong performance well beyond this point, how do you see this "soft limit" evolving? Is it a moving target that engineers need to constantly re-evaluate for each specific model they use?
These are all great questions. To be honest, these are all dimensions that you should be aware of, but it’s extremely hard to know how they perform in your particular use case.
That’s why building AI evaluations that test your features is critical. Benchmarks are good reference points, but they often don’t reflect the real world. That’s why I don’t bother with them; I build my own AI evaluations, test, make changes such as adding/removing tools, and reiterate.
So net net we are back to client-agent server-LLM architecture design considerations for systems optimization and user experience which are loved to be the good 😊
fantastic piece Paul!
An honor to hear that from you, man 🥂
On "Tool Confusion": You mentioned the Gorilla benchmark shows models struggle when given more than one tool. I'm curious about your take on this: Is the core challenge truly the number of tools, or is it the agent's underlying reasoning and planning ability in complex, multi-step workflows? Newer benchmarks seem to focus more on testing this multi-turn reasoning capability.
On the 32k Token "Limit": You highlighted that model accuracy can drop significantly after 32,000 tokens, which is a critical warning for production systems. Since some of the latest models (like GPT-4o and Claude 3.5 Sonnet) are showing strong performance well beyond this point, how do you see this "soft limit" evolving? Is it a moving target that engineers need to constantly re-evaluate for each specific model they use?
These are all great questions. To be honest, these are all dimensions you should be aware of, but it’s extremely hard to know how they play out in your particular use case.
That’s why building AI evaluations that test your features is critical. Benchmarks are good reference points, but they often don’t reflect the real world. Instead of relying on them, I build my own AI evaluations, test, make changes such as adding or removing tools, and iterate.
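To make that concrete, here is a minimal sketch of what such a homegrown eval could look like: a few hand-written test cases scored on whether the model calls the expected tool, so you can add or remove tools and re-run. The test cases, tool definitions, and model name are illustrative assumptions (not from the article), and the client follows the OpenAI Python SDK's chat-completions interface.

```python
# Minimal sketch of a homegrown eval loop for tool selection.
# The test cases, tools, and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Each case pairs a user query with the tool we expect the agent to pick.
TEST_CASES = [
    {"query": "What's the weather in Berlin tomorrow?", "expected_tool": "get_weather"},
    {"query": "Summarize the attached PDF report.", "expected_tool": "summarize_document"},
]

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "summarize_document",
            "description": "Summarize an uploaded document.",
            "parameters": {
                "type": "object",
                "properties": {"document_id": {"type": "string"}},
                "required": ["document_id"],
            },
        },
    },
]


def run_eval(model: str = "gpt-4o") -> float:
    """Return the fraction of test cases where the model called the expected tool."""
    hits = 0
    for case in TEST_CASES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["query"]}],
            tools=TOOLS,
        )
        calls = response.choices[0].message.tool_calls or []
        picked = calls[0].function.name if calls else None
        hits += picked == case["expected_tool"]
    return hits / len(TEST_CASES)


if __name__ == "__main__":
    print(f"Tool-selection accuracy: {run_eval():.0%}")
```

In practice you would swap in queries and tools from your own feature, track the score across runs, and only then decide whether pruning a tool or trimming context actually helps.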
A great read.
Thanks 🥰
So, net net, we are back to client-agent and server-LLM architecture design considerations for systems optimization and user experience, which is good to see 😊
Yes 🤟
😄
Thanks for the article!
Why USER_QUERY goes in the system prompt?
That’s what describes the user intent. It is not always required, but it often is.
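For illustration, here is a minimal sketch of what embedding the user query in the system prompt can look like. The template text, tags, and function name are assumptions for the example, not the article's exact prompt.

```python
# Minimal sketch: the user query is interpolated into the system prompt
# so the model sees the user's intent alongside its instructions.
# The template text and names are illustrative assumptions.
SYSTEM_PROMPT_TEMPLATE = """You are a helpful research assistant.

The user's intent is described by the following query:
<user_query>
{user_query}
</user_query>

Use the available tools only when they help answer this query."""


def build_system_prompt(user_query: str) -> str:
    return SYSTEM_PROMPT_TEMPLATE.format(user_query=user_query)


print(build_system_prompt("Compare GPT-4o and Claude 3.5 Sonnet on long-context accuracy."))
```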
Amazing article!
Thanks, Shah! This is the new format I would like to adopt for Decoding ML. What do you think?
I like the format. Just an FYI, this link from the article is broken: https://www.nature.com/articles/s41593-023-01496-2
This article is gold Paul! (and thanks a lot for the shoutout!)
Thanks, man 🥰 Haha, you deserve it 🤟