A blueprint for scalable LLM systems
What makes building real-world multi-agent systems tough? Why embeddings are so powerful
This week’s topics:
Why embeddings are so powerful
What makes building real-world multi-agent systems tough?
A blueprint for designing scalable LLM systems: From Notebooks to production
Why embeddings are so powerful
Embeddings are the cornerstone of AI, ML, GenAI, LLMs, you name it. But why are they so powerful?
First, machine learning models work only with numerical values. This is not a problem when working with tabular data, as the data is often in numerical form or can quickly be processed into numbers.
Embeddings are powerful when we want to feed words, images or audio data into models.
For instance, when working with transformer models, you tokenize all your text input, where each token has an embedding associated with it.
The input to the transformer is a sequence of embeddings, which the dense layers of the neural network can easily interpret.
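Here is a minimal sketch of that idea, assuming PyTorch, a toy whitespace tokenizer, and an arbitrary embedding size of 16 (real models use subword tokenizers and far larger embedding tables):

```python
import torch
import torch.nn as nn

# Toy vocabulary and whitespace "tokenizer" (real models use subword tokenizers).
vocab = {"<pad>": 0, "embeddings": 1, "are": 2, "powerful": 3}
token_ids = torch.tensor([[vocab[t] for t in "embeddings are powerful".split()]])

# Each token ID is mapped to a dense vector: (1, 3) -> (1, 3, 16).
embed_dim = 16
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_dim)
x = embedding(token_ids)

# The sequence of embeddings is what the transformer layers actually consume.
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
print(encoder(x).shape)  # torch.Size([1, 3, 16])
```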
Based on this example, you can use embeddings to encode any categorical variable and feed it to an ML model.
But why not use other simpler methods, such as one-hot encoding, to encode the categorical variables?
When working with categorical variables with high cardinality, such as language vocabularies, you will suffer from the curse of dimensionality when using other classical methods.
For example, if your vocabulary has 10,000 tokens, a single token becomes a vector of length 10,000 after one-hot encoding. If the input sequence has N tokens, that results in N * 10,000 input values. With N >= 100, which is common for text, the input becomes too large to be practical.
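To make the numbers concrete, here is a rough comparison, assuming PyTorch and an illustrative embedding size of 256 (the exact dimension depends on the model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 10_000   # vocabulary size from the example above
seq_len = 100         # N tokens in the input sequence
embed_dim = 256       # illustrative embedding size (model-dependent)

token_ids = torch.randint(0, vocab_size, (seq_len,))

# One-hot: every token blows up to a 10,000-dimensional, mostly-zero vector.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
print(one_hot.shape)   # torch.Size([100, 10000]) -> 1,000,000 input values

# Embedding lookup: the same tokens become dense 256-dimensional vectors.
dense = nn.Embedding(vocab_size, embed_dim)(token_ids)
print(dense.shape)     # torch.Size([100, 256])   -> 25,600 input values
```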
Another issue with other classical methods that don’t suffer from the curse of dimensionality, such as hashing, is that you lose the semantic relationships between the vectors.
Thus, embedding your input reduces its dimensionality and condenses all of its semantic meaning into a dense vector.
This is an extremely popular technique for working with images.
A CNN encoder module maps the high-dimensional input (the image) into an embedding, which is later processed by a CNN decoder that performs the classification or regression task.
The image above shows a typical CNN encoder used to compute embeddings out of images.
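As a rough sketch of such an encoder, assuming PyTorch (the layer sizes here are arbitrary, not taken from the figure):

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Maps an image to a dense embedding via convolutions + global pooling."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global average pooling -> (B, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)   # project to the embedding size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.features(x).flatten(1))

image = torch.randn(1, 3, 224, 224)            # a dummy RGB image
print(CNNEncoder()(image).shape)               # torch.Size([1, 128])
```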
To learn more about embeddings, how they work and how they are created, consider reading our article:
What makes building real-world multi-agent systems tough? (Affiliate)
(Hint: It’s not the theory or tech stacks)
Turning concepts into functional, scalable systems that solve actual problems.
That's where many of us get stuck:
bridging the gap between learning about AI agents and building something that works in the real world.
Fortunately, I've found a solution:
The Build Multi-Agent Applications Bootcamp

It's a Bootcamp that runs from April 28 to June 13.
It's led by:
Amir Feizpour (10+ years in NLP)
Abhimanyu Anand (7+ years building AI products and applications across diverse industries)
And I absolutely love their approach.
The best and only way to learn AI engineering is by doing.
Well, this course is all about learning by doing.
Within 7 weeks, you’ll design, develop, deploy, and demo a fully functional agent-based application following a sprint-based approach, similar to how people work in the industry.
The most exciting part is that you can implement your own product idea and build it during the course (while the authors will provide 1:1 mentorship).
So, here’s what I’m most excited about:
Hands-on learning: You'll build a real MVP using LLM agents to automate a workflow of your choice (your MVP idea or theirs if you don’t have one)
Personalized support: Access 1:1 sessions with the instructors and dedicated TAs for real-time feedback and code reviews.
Industry-relevant: Learn from mentors with 10+ years in AI and product development.
Sprint-based structure: Simulate real-world product development with focused sprints, from research to implementation.
Project flexibility: Have your own project idea? Bring it! Or choose from predefined projects provided by the course.
By the end of the course, you’ll have a working app you can showcase in a public demo to your company or recruiters.
This means you'll also have the skills to confidently integrate multi-agent systems into your business or projects.
Ready to stop thinking about building multi-agent systems and start doing?
The next cohort starts on April 28th.
P.S. Use code PAUL for $100 off.
A blueprint for designing scalable LLM systems: From Notebooks to production
As a concrete example, we will fine-tune an LLM and do RAG on social media data, but the blueprint can easily be adapted to any data.
We have 4 core components.
We will follow the feature/training/inference (FTI) pipeline architecture.
1. Data collection pipeline
It is based on an ETL that:
Crawls your data from blogs and socials
Standardizes it
Loads it to a NoSQL database (e.g., MongoDB)
Because:
We work with text data, which is naturally unstructured
No analytics required
→ A NoSQL database fits like a glove.
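Here is a minimal sketch of such an ETL, assuming requests, BeautifulSoup, and pymongo; the URL, database, and field names are hypothetical placeholders:

```python
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

def extract(url: str) -> str:
    """Crawl one page and return its raw HTML."""
    return requests.get(url, timeout=10).text

def transform(html: str, url: str) -> dict:
    """Standardize the raw HTML into a clean, uniform document."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "source_url": url,
        "title": soup.title.string if soup.title else "",
        "content": soup.get_text(separator=" ", strip=True),
    }

def load(doc: dict) -> None:
    """Load the standardized document into the NoSQL database."""
    client = MongoClient("mongodb://localhost:27017")
    client["data_warehouse"]["articles"].insert_one(doc)

for url in ["https://example.com/blog-post"]:  # placeholder source
    load(transform(extract(url), url))
```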
2. Feature pipeline
It takes raw articles, posts, and code data points from the data warehouse (the NoSQL database populated by the data collection pipeline), processes them, and loads them into a logical feature store.
Let's focus on the logical feature store. As with any RAG-based system, a vector database is one of the central pieces of the infrastructure.
We directly use a vector database as a logical feature store.
Unfortunately, the vector database doesn't offer the concept of a training dataset.
To implement this, we will wrap the retrieved data into a versioned, tracked, and shareable MLOps artifact.
To conclude:
The training pipeline will use the instruct datasets as artifacts (offline)
The inference pipeline will query the vector DB for RAG (online)
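Here is a rough sketch of the feature pipeline's core step, chunking and embedding documents into the vector DB, assuming Qdrant and sentence-transformers (the actual stack may differ, and the chunking is deliberately naive):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
client = QdrantClient(":memory:")                # in-memory instance for the sketch

client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; production systems use smarter splitters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

raw_article = "Embeddings condense semantic meaning into dense vectors. " * 50
chunks = chunk(raw_article)
vectors = model.encode(chunks)

client.upsert(
    collection_name="articles",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": txt})
        for i, (vec, txt) in enumerate(zip(vectors, chunks))
    ],
)
```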
3. Training pipeline
It consumes instruct datasets from the feature store, fine-tunes an LLM with it, and stores the tuned LLM weights in a model registry.
More concretely, when a new instruct dataset is available in the logical feature store, we will trigger the training pipeline, consume the artifact, and fine-tune the LLM.
We run multiple experiments to find the best model and hyperparameters, using an experiment tracker to compare runs and select the winner.
After the experimentation phase, we store and reuse the best hyperparameters for continuous training (CT).
A testing pipeline is then triggered for a detailed analysis of the LLM candidate. If it passes, the model is tagged as accepted and deployed to production.
Our modular design lets us leverage an ML orchestrator to schedule and trigger the pipelines for CT.
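A stripped-down skeleton of the fine-tuning step could look like this, assuming Hugging Face transformers and datasets; the base model, file name, and dataset fields are placeholders, and a real pipeline would add LoRA/quantization, evaluation, experiment-tracker logging, and the push to the model registry:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "gpt2"                                   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# The instruct dataset artifact pulled from the logical feature store (placeholder file).
dataset = load_dataset("json", data_files="instruct_dataset.json")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["instruction"] + "\n" + ex["output"], truncation=True),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("llm-finetune/best")               # then register it in the model registry
```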
4. Inference pipeline
It is connected to the model registry and logical feature store. From the model registry, it loads a fine-tuned LLM, and from the logical feature store, it accesses the vector DB for RAG.
It receives client requests as queries through a REST API and answers them with RAG, using the fine-tuned LLM and the vector DB.
Everything is sent to a prompt monitoring system to analyze, debug, and understand the system.
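Here is a minimal sketch of such an inference endpoint, assuming FastAPI, Qdrant, and sentence-transformers; the model path, collection name, and prompt template are illustrative:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from transformers import pipeline

app = FastAPI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vector_db = QdrantClient("localhost", port=6333)
llm = pipeline("text-generation", model="llm-finetune/best")  # fine-tuned LLM from the registry

class Query(BaseModel):
    question: str

@app.post("/rag")
def rag(query: Query) -> dict:
    # 1. Retrieve the most relevant chunks from the vector DB.
    hits = vector_db.search(
        collection_name="articles",
        query_vector=embedder.encode(query.question).tolist(),
        limit=3,
    )
    context = "\n".join(hit.payload["text"] for hit in hits)
    # 2. Augment the prompt and generate an answer with the fine-tuned LLM.
    prompt = f"Context:\n{context}\n\nQuestion: {query.question}\nAnswer:"
    answer = llm(prompt, max_new_tokens=128)[0]["generated_text"]
    # 3. In production, also log the prompt and answer to the prompt monitoring system.
    return {"answer": answer}
```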
Consider learning more in our LLM Engineer’s Handbook ↓
Whenever you’re ready, there are 3 ways we can help you:
Perks: Exclusive discounts on our recommended learning resources
(books, live courses, self-paced courses and learning platforms).
The LLM Engineer’s Handbook: Our bestseller book on teaching you a framework for building production-ready LLM and RAG applications, from data collection to deployment (get up to 20% off using our discount code).
Free open-source courses: Master production AI with our end-to-end open-source courses, reflecting real-world AI projects and covering everything from system architecture to data collection, training and deployment.
Images
If not otherwise stated, all images are created by the author.