5 Comments

I'm gaining a lot of knowledge about monolithic vs. microservice architectures, and I can mostly use a monolithic architecture to build an LLM or RAG application. Thank you for sharing such valuable content on architectures.

My pleasure, man. Glad you liked it. 🙏

Funny to see how the AI world is slowly running into the usual engineering issues.

Architecture, Scaling, Caching, …

I would not recommend anyone put the LLM directly into their service. I would recommend always treating it as an external service.

A lot of the points are true, but there are more. What if you want to test a different model? What about automated testing? Want to try it against the real OpenAI?

Use the OpenAI REST API as your boundary. Most LLM providers support it.
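To illustrate the point, here is a minimal sketch (stdlib only, with a hypothetical `chat_request` helper) of why an OpenAI-compatible boundary makes swapping backends trivial: the endpoint path and request body stay identical, and only the base URL changes between providers.

```python
import json

def chat_request(base_url: str, model: str, messages: list[dict]) -> tuple[str, bytes]:
    """Build the endpoint URL and JSON body for an OpenAI-style chat call.

    OpenAI, vLLM, Ollama, and most hosted providers all accept this
    same /chat/completions contract, so the base URL is the only
    provider-specific piece.
    """
    url = f"{base_url.rstrip('/')}/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, body

# Same code path, different backends -- only the base URL differs:
prod = chat_request("https://api.openai.com/v1", "gpt-4o-mini",
                    [{"role": "user", "content": "hi"}])
local = chat_request("http://localhost:8000/v1", "my-finetune",
                     [{"role": "user", "content": "hi"}])
```

This also makes automated testing easy: point the base URL at a local mock server in CI, and at the real OpenAI endpoint when you want an end-to-end check.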

Another big issue I'm seeing is the scalability of the LLM (the GPU). While a CPU with more threads can do more in parallel, a GPU is quite limited; you mainly scale by adding more of them.

Separating your service and the LLM has one big drawback: you can scale your services faster than the LLM.

So testing how a service-to-service setup handles a large number of requests becomes crucial.

All great observations. We could continue discussing this topic for 10 more articles 🔥

By the way, I am a big fan of a monolith + separation of concerns in your code base. But the thing is that most services, such as AWS Bedrock, Modal, Hugging Face, etc., are optimized to host just the LLM inference engine + the LLM.

Thus, you will naturally end up with at least two services: the LLM one + the business one.

I agree, I see it the same way. With AI you will have at least two. Or just one, and you pay per consumption (which just makes hosting it someone else's problem ;))
