90% of AIEs Are Dangerously Abstracted From Reality
Deploy your agents leveraging the backend-frontend architecture. How to evaluate an agentic RAG application.
This week’s topics:
Deploy your agents leveraging the backend-frontend architecture
90% of AI engineers are dangerously abstracted from reality
How to evaluate an agentic RAG application
Deploy your agents leveraging the backend-frontend architecture
Offline pipelines can be productionized straight from notebooks. But notebooks are trash when it comes to serving AI apps live to users.
That’s why we built a frontend-backend system in our PhiloAgents course.
Here's how it works:
Frontend (Phaser.js): Delivers an interactive game experience, streaming real-time dialogue with AI agents. It handles user input, rendering, and smooth UI updates.
Backend (FastAPI): Runs the core agent logic, handles incoming requests, manages session and agent state, and generates streamed responses.
This clear separation of concerns lets us scale, maintain, and evolve each component independently.
For example:
The frontend can be upgraded or replaced without touching the backend agent logic.
The backend can be scaled horizontally to handle more users without redeploying the UI.
The frontend and backend communicate through REST APIs for single-turn queries and WebSocket connections for continuous, low-latency streaming interactions.
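Here is a minimal sketch of what those two paths can look like in FastAPI; the endpoint names and agent helpers are hypothetical placeholders, not the exact PhiloAgents implementation:

```python
# Minimal FastAPI sketch of the two communication paths: REST for single-turn
# queries, WebSocket for token-by-token streaming. Endpoint paths and the
# placeholder agent helpers are hypothetical, not the exact PhiloAgents code.
from fastapi import FastAPI, WebSocket
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    message: str

async def answer_once(message: str) -> str:
    # Placeholder for the real agent call (e.g., an LLM served through Groq).
    return f"Echo: {message}"

async def stream_answer(message: str):
    # Placeholder async generator; the real one would yield LLM tokens.
    for token in f"Echo: {message}".split():
        yield token + " "

@app.post("/query")
async def single_turn(query: Query) -> dict:
    """Single-turn REST endpoint: one request in, one full response out."""
    return {"answer": await answer_once(query.message)}

@app.websocket("/ws/chat")
async def chat(websocket: WebSocket) -> None:
    """Persistent WebSocket: stream the agent's answer token by token."""
    await websocket.accept()
    while True:
        message = await websocket.receive_text()
        async for token in stream_answer(message):
            await websocket.send_text(token)
        await websocket.send_text("<END>")  # simple end-of-response marker
```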
Deployment is streamlined with Docker Compose:
→ Each service runs in its own isolated container on a shared Docker network.
Understanding the client–server flow is crucial…
The frontend keeps a persistent WebSocket connection open to the backend, allowing token-by-token streaming of agent responses and real-time interaction—just like talking to a human.
To power the backend, we rely on MongoDB for flexible, scalable document and vector storage, and Groq's API for fast, efficient LLM inference.
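To give a feel for the inference side, here is a hedged sketch of token streaming with Groq's Python client; the model name is an illustrative choice, and in the real app the tokens would be forwarded over the WebSocket instead of printed:

```python
# Sketch of token-by-token LLM streaming with the `groq` Python client,
# assuming GROQ_API_KEY is set in the environment. The model name is an
# illustrative choice, not a fixed recommendation.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # illustrative model choice
    messages=[{"role": "user", "content": "Who was Aristotle?"}],
    stream=True,
)

for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        print(token, end="", flush=True)  # in the app, these tokens go over the WebSocket
```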
This architecture is the foundation you must know to turn AI experiments into AI products.
If you want to build agents that don’t just work in theory but thrive in production, you MUST master this architecture.
And if you want a deep dive, we’ve just released the full design walkthrough in the PhiloAgents course. Check it out here ↓
90% of AI engineers are dangerously abstracted from reality (Affiliate)
They work with:
Prebuilt models
High-level APIs
Auto-magical cloud tools
But here’s the thing:
If you don’t understand how these tools actually work, you’ll always be guessing when something breaks.
That’s why the best AI engineers I know go deeper...
They understand:
How Git actually tracks changes.
How Redis handles memory.
How Docker isolates environments.
If you’re serious about engineering, you should go build the tools you use.
And it’s why I recommend CodeCrafters.io
You won’t just learn tools.
You’ll rebuild them (from scratch).
Git, Redis, Docker, Kafka, SQLite, Shell...
Step by step, test by test
In your favorite language (Rust, Python, Go, etc.)

It’s perfect for AI engineers who want to:
Level up their backend + system design skills
Reduce debugging time in production
Build apps that actually scale under load
And most importantly...
Stop being a model user
Start being a systems thinker
If I had to level up my engineering foundations today, CodeCrafters is where I’d start.
If you consider subscribing, use my affiliate link to support us and get 40% off on CodeCrafters.io ↓
P.S. We only promote tools we use or courses we would personally take.
How to evaluate an agentic RAG application
The most underestimated part of building LLM applications?
Evaluation.
Evaluation can take up to 80% of your development time (because it’s HARD)
Most people obsess over prompts.
They tweak models.
Tune embeddings.
But when it’s time to test whether the whole system actually works?
That’s where it breaks.
Especially in agentic RAG systems—where you’re orchestrating retrieval, reasoning, memory, tools, and APIs into one seamless flow.
Implementation might take a week.
Evaluation takes longer.
(And it’s what makes or breaks the product.)
Let’s clear up a common confusion:
LLM evaluation ≠ RAG evaluation.
LLM eval tests reasoning in isolation—useful, but incomplete.
In production, your model isn’t reasoning in a vacuum.
It’s pulling context from a vector DB, reacting to user input, and being shaped by memory + tools.
That’s why RAG evaluation takes a system-level view. It asks:
Did this app respond correctly, given the user input and the retrieved context?
Here’s how to break it down:
Step 1: Evaluate retrieval.
Are the retrieved docs relevant? Ranked correctly?
Use LLM judges to compute context precision and recall
If ranking matters, compute ranking metrics like NDCG and MRR (see the sketch after this list)
Visualize embeddings (e.g. UMAP)
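To make the ranking metrics concrete, here is a minimal, dependency-free sketch of precision@k and MRR, assuming you already have binary relevance labels for each retrieved document (the labels below are made up for illustration):

```python
# Minimal sketch of two retrieval-ranking metrics, assuming binary relevance
# labels for each retrieved document, in ranked order (hypothetical data).

def precision_at_k(relevance: list[int], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = relevance[:k]
    return sum(top_k) / k if k else 0.0

def mrr(relevance_per_query: list[list[int]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant doc."""
    reciprocal_ranks = []
    for relevance in relevance_per_query:
        rr = 0.0
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Each inner list marks retrieved docs as relevant (1) or not (0), in ranked order.
labels = [[1, 0, 1], [0, 0, 1], [0, 1, 0]]
print(precision_at_k(labels[0], k=3))  # ~0.67
print(mrr(labels))                     # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```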
Step 2: Evaluate generation.
Did the LLM ground its answer in the right info?
Use heuristics, LLM-as-a-judge, and contextual scoring.
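As a rough illustration of LLM-as-a-judge for groundedness, here is a sketch using Groq's Python client; the judge model, prompt, and 1–5 scale are assumptions for illustration, not the course's exact setup:

```python
# Sketch of an LLM-as-a-judge groundedness check, assuming the `groq` client
# and GROQ_API_KEY in the environment. Model name and rubric are illustrative.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def judge_groundedness(question: str, context: str, answer: str) -> str:
    prompt = (
        "You are an evaluator. Given the question, the retrieved context, and "
        "the answer, rate from 1 to 5 how well the answer is grounded in the "
        "context only. Reply with the number and a one-sentence justification.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge_groundedness(
    "Who wrote the Critique of Pure Reason?",
    "Immanuel Kant published the Critique of Pure Reason in 1781.",
    "Immanuel Kant wrote it, first published in 1781.",
))
```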
In practice, treat your app as a black box and log:
User query
Retrieved context
Model output
(Optional) Expected output
This lets you debug the whole system, not just the model.
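One simple way to capture those fields is to append each interaction as a JSON line; this is just a sketch of one possible logging schema, not the course's exact format:

```python
# Minimal sketch of black-box logging for RAG evaluation: one JSON line per
# interaction, with the fields listed above. Path and schema are assumptions.
import json
from datetime import datetime, timezone

def log_interaction(user_query: str, retrieved_context: list[str],
                    model_output: str, expected_output: str | None = None,
                    path: str = "rag_eval_log.jsonl") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_query": user_query,
        "retrieved_context": retrieved_context,
        "model_output": model_output,
        "expected_output": expected_output,  # optional ground truth
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```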
How many samples are enough?
5–10? Too few.
30–50? Good start.
400+? Now you’re capturing real patterns and edge cases.
Still, start with how many samples you have available, and keep expanding your evaluation split. It’s better to have an imperfect evaluation layer than nothing.
Also track latency, cost, throughput, and business metrics (like conversion or retention).
Some battle-tested tools:
RAGAS (retrieval-grounding alignment)
ARES (factual grounding)
Opik by Comet (end-to-end open-source eval + monitoring)
LangSmith, Langfuse, Phoenix (observability + tracing)
TL;DR:
Agentic systems are complex.
Success = making evaluation part of your design from Day 0.
We unpack this in full in Lesson 5 of the PhiloAgents course. Check it out ↓
Whenever you’re ready, there are 3 ways we can help you:
Perks: Exclusive discounts on our recommended learning resources
(books, live courses, self-paced courses and learning platforms).
The LLM Engineer’s Handbook: Our bestseller book on teaching you an end-to-end framework for building production-ready LLM and RAG applications, from data collection to deployment (get up to 20% off using our discount code).
Free open-source courses: Master production AI with our end-to-end open-source courses, which reflect real-world AI projects and cover everything from system architecture to data collection, training and deployment.
Images
If not otherwise stated, all images are created by the author.