16 Comments
Paolo Perrone

Fantastic piece, Paul!

Paul Iusztin

An honor to hear that from you, man 🥂

Thanh Ng

On "Tool Confusion": You mentioned the Gorilla benchmark shows models struggle when given more than one tool. I'm curious about your take on this: Is the core challenge truly the number of tools, or is it the agent's underlying reasoning and planning ability in complex, multi-step workflows? Newer benchmarks seem to focus more on testing this multi-turn reasoning capability.  

On the 32k Token "Limit": You highlighted that model accuracy can drop significantly after 32,000 tokens, which is a critical warning for production systems. Since some of the latest models (like GPT-4o and Claude 3.5 Sonnet) are showing strong performance well beyond this point, how do you see this "soft limit" evolving? Is it a moving target that engineers need to constantly re-evaluate for each specific model they use?  

Paul Iusztin

These are all great questions. To be honest, these are all dimensions that you should be aware of, but it’s extremely hard to know how they perform in your particular use case.

That’s why building AI evaluations that test your own features is critical. Benchmarks are good reference points, but they often don’t reflect the real world. That’s why I don’t bother with them; I build my own AI evaluations, test, make changes such as adding or removing tools, and iterate.
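
A minimal sketch of that evaluate, change, and re-evaluate loop, assuming the feature under test is tool selection. Everything here is hypothetical and for illustration only: `EvalCase`, `accuracy`, `compare_tool_sets`, the tool names, and `keyword_agent`, which is a toy stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable

# Signature of the component under test: (query, available tools) -> chosen tool.
ToolSelector = Callable[[str, list[str]], str]


@dataclass
class EvalCase:
    query: str          # user request fed to the agent
    expected_tool: str  # tool a correct agent should pick


def accuracy(select_tool: ToolSelector, tools: list[str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the agent picks the expected tool."""
    if not cases:
        return 0.0
    hits = sum(select_tool(case.query, tools) == case.expected_tool for case in cases)
    return hits / len(cases)


def compare_tool_sets(
    select_tool: ToolSelector,
    tool_sets: dict[str, list[str]],
    cases: list[EvalCase],
) -> dict[str, float]:
    """Run the same eval cases against each tool configuration."""
    return {name: accuracy(select_tool, tools, cases) for name, tools in tool_sets.items()}


def keyword_agent(query: str, tools: list[str]) -> str:
    """Toy stand-in for an LLM: picks the first tool named in the query."""
    for tool in tools:
        if tool in query.lower():
            return tool
    return tools[0]  # fall back to the first tool when nothing matches


# Hypothetical eval cases; in practice these come from your own product's usage.
CASES = [
    EvalCase("search the web for LLM news", "search"),
    EvalCase("use the calculator to add 2 and 2", "calculator"),
    EvalCase("open the calendar for Friday", "calendar"),
]

if __name__ == "__main__":
    results = compare_tool_sets(
        keyword_agent,
        {
            "all_tools": ["search", "calculator", "calendar"],
            "no_calendar": ["search", "calculator"],
        },
        CASES,
    )
    for name, score in results.items():
        print(f"{name}: {score:.2f}")
```

Swapping `keyword_agent` for a function that calls your actual model lets you rerun the same cases after every change to the tool list or prompt.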

Darren

A great read.

Paul Iusztin

Thanks 🥰

Meenakshi NavamaniAvadaiappan

So, net-net, we are back to client-agent and server-LLM architecture design considerations for system optimization and user experience, which is a good thing 😊

Paul Iusztin

Yes 🤟

Meenakshi NavamaniAvadaiappan

😄

Mwbwyiv

Thanks for the article!

Why does USER_QUERY go in the system prompt?

Paul Iusztin

That’s what describes the user intent. It is not always required, but it often is.

Shah

Amazing article!

Paul Iusztin

Thanks, Shah! This is the new format I would like to adopt for Decoding ML. What do you think?

Shah

I like the format. Just an FYI: this link from the article is broken: https://www.nature.com/articles/s41593-023-01496-2

Miguel Otero Pedrido

This article is gold, Paul! (And thanks a lot for the shoutout!)

Paul Iusztin

Thanks, man 🥰 Haha, you deserve it 🤟
