Ready for production ML? Here are the 4 pillars to build production ML systems
ML Platforms & MLOps Components. RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Decoding ML Notes
This week's topics:
Using an ML Platform is critical to integrating MLOps into your project
The 4 pillars to build production ML systems
RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Using an ML Platform is critical to integrating MLOps into your project
Here are 6 ML platform features you must know & use ↓
...and let's use Comet ML as a concrete example.
#1. Experiment Tracking
In your ML development phase, you generate lots of experiments.
Tracking and comparing the metrics between them is crucial in finding the optimal model & hyperparameters.
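For instance, here is a minimal sketch of what logging a run with Comet ML could look like (the project name, hyperparameters, and metric values are placeholders, not from a real project):

```python
from comet_ml import Experiment

# Assumes COMET_API_KEY is set in the environment; the project name is hypothetical.
experiment = Experiment(project_name="price-prediction")

# Log the hyperparameters of this run so it can be compared against other runs.
experiment.log_parameters({"learning_rate": 3e-4, "batch_size": 64, "epochs": 10})

# Inside the training loop, log the scalar metrics per epoch.
for epoch in range(10):
    train_loss, val_accuracy = 0.42, 0.87  # placeholders for real values
    experiment.log_metric("train_loss", train_loss, epoch=epoch)
    experiment.log_metric("val_accuracy", val_accuracy, epoch=epoch)

experiment.end()
```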
#2. Metadata Store
Its primary purpose is reproducibility.
To know how a model from a specific experiment was generated, you must know:
- the version of the code
- version of the dataset
- hyperparameters/config
- total compute
... and more
#3. Visualizations
Most of the time, along with the scalar metrics, you must log visual results, such as:
- images
- videos
- prompts
- t-SNE graphs
- 3D point clouds
... and more
#4. Artifacts
The most powerful feature out of them all.
An artifact is a versioned object that acts as an input or output for your job.
Everything can be an artifact (data, model, code), but the most common case is for your data.
Wrapping your assets around an artifact ensures reproducibility and shareability.
For example, you wrap your features into an artifact (e.g., features:3.1.2), which you can consume and share across multiple ML environments (development or continuous training).
Using an artifact to wrap your data allows you to quickly respond to questions such as "What data have I used to generate the model?" and "What version?"
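As an illustration, here is a hedged sketch of the producer/consumer flow using Comet ML's Artifact API (the artifact name "features", the file paths, and the pinned version are assumptions):

```python
from comet_ml import Artifact, Experiment

# Producer side: wrap the feature files into a versioned artifact.
experiment = Experiment(project_name="price-prediction")
features = Artifact(name="features", artifact_type="dataset")
features.add("data/features.parquet")        # local file to version
experiment.log_artifact(features)            # Comet assigns a version, e.g. features:3.1.2
experiment.end()

# Consumer side (e.g., the continuous training pipeline): pull a pinned version,
# which answers "what data have I used to generate the model, and what version?".
consumer = Experiment(project_name="price-prediction")
logged = consumer.get_artifact("features", version_or_alias="3.1.2")
logged.download("artifacts/")
consumer.end()
```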
#5. Model Registry
The model registry is the ultimate way to version your models and make them accessible to all your services.
For example, your continuous training pipeline will log the weights as an artifact into the model registry after it trains the model.
You label this model as "v:1.1.5:staging" and prepare it for testing. If the tests pass, you mark it as "v:1.1.5:production" and trigger the CI/CD pipeline to deploy it to production.
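A rough sketch of that training-side step with Comet ML could look like this (the model name and file path are assumptions, and the exact registration arguments depend on your comet_ml version; promoting the version from "Staging" to "Production" is then done in the registry):

```python
from comet_ml import Experiment

experiment = Experiment(project_name="price-prediction")

# After training, log the weights and register them in the model registry.
experiment.log_model("price-predictor", "models/weights.pt")
experiment.register_model("price-predictor")  # later promoted Staging -> Production in the registry
experiment.end()
```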
#6. Webhooks
Webhooks let you integrate the Comet model registry with your CI/CD pipeline.
For example, when the model status changes from "Staging" to "Production," a POST request triggers a GitHub Actions workflow to deploy your new model.
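For example, a small Python handler on the receiving end could forward that event to GitHub Actions through the repository_dispatch API (the repository, token variable, and payload fields below are hypothetical):

```python
import os

import requests


def on_model_status_change(payload: dict) -> None:
    """Handle the JSON body POSTed by the model registry webhook."""
    if payload.get("status") != "Production":
        return  # only react to promotions to production

    # GitHub's repository_dispatch endpoint starts any Actions workflow
    # that listens for the "deploy-model" event type.
    requests.post(
        "https://api.github.com/repos/acme/ml-serving/dispatches",  # hypothetical repo
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "event_type": "deploy-model",
            "client_payload": {
                "model": payload.get("model_name"),
                "version": payload.get("version"),
            },
        },
        timeout=10,
    )
```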
↳ Check out Comet to learn more
The 4 pillars to build production ML systems
Before building a production-ready system, it is critical to consider a set of questions that will later determine the nature of your ML system architecture.
Here are the 4 pillars that you always have to consider before designing any system ↓
↳ Data
- What data types do you have? (e.g., tabular data, images, text, etc.)
- What does the data look like? (e.g., for text data, is it in a single language or multiple?)
- How do you collect the data?
- At what frequency do you have to collect the data?
- How do you collect labels for the data? (crucial for how you plan to evaluate and monitor the model in production)
↳ Throughput
- What are the throughput requirements? You must know at least the throughput's minimum, average, and maximum statistics.
- How many requests must the system handle simultaneously? (1, 10, 1k, 1 million, etc.)
↳ Latency
- What are the latency requirements? (1 millisecond, 10 milliseconds, 1 second, etc.)
- Throughput vs. latency trade-off
- Accuracy vs. speed trade-off
↳ Infrastructure
- Batch vs. real-time architecture (closely related to the throughput vs. latency trade-off)
- How should the system scale? (e.g., based on CPU workload, # of requests, queue size, data size, etc.)
- Cost requirements
.
Do you see how we shifted the focus from model performance towards how it is integrated into a more extensive system?
When building production-ready ML, the model's accuracy is no longer the holy grail but a bullet point in a grander scheme.
.
To summarize, the 4 pillars to keep in mind before designing an ML architecture are:
- Data
- Throughput
- Latency
- Infrastructure
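One way to make these pillars actionable is to write them down as an explicit requirements spec before touching the architecture. A minimal sketch (all values below are invented examples, not recommendations):

```python
from dataclasses import dataclass


@dataclass
class SystemRequirements:
    # Data
    data_types: tuple[str, ...]      # e.g., ("tabular", "text")
    label_collection: str            # how ground truth is gathered
    # Throughput (requests per second)
    min_rps: int
    avg_rps: int
    max_rps: int
    # Latency
    p99_latency_ms: int
    # Infrastructure
    serving_mode: str                # "batch" or "real-time"
    scaling_trigger: str             # e.g., "queue size", "CPU load"


requirements = SystemRequirements(
    data_types=("text",),
    label_collection="delayed user feedback",
    min_rps=1,
    avg_rps=50,
    max_rps=1_000,
    p99_latency_ms=200,
    serving_mode="real-time",
    scaling_trigger="# of requests",
)
```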
RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Let's find out ↓
RAG is a popular strategy when building LLMs to add external data to your prompt.
=== Problem ===
Working with LLMs has 3 main issues:
1. The world moves fast
LLMs learn an internal knowledge base. However, the issue is that its knowledge is limited to its training dataset.
The world moves fast. New data flows on the internet every second. Thus, the model's knowledge base can quickly become obsolete.
One solution is to fine-tune the model every minute or day...
If you have a few billion dollars to spend, go for it.
2. Hallucinations
An LLM is full of testosterone and likes to be blindly confident.
Even if the answer looks 100% legit, you can never fully trust it.
3. Lack of reference links
It is hard to trust the response of the LLM if we can't see the source of its decisions.
Especially for important decisions (e.g., health, financials)
=== Solution ===
↳ Surprise! It is RAG.
1. Avoid fine-tuning
Using RAG, you use the LLM as a reasoning engine and the external knowledge base as the main memory (e.g., vector DB).
The memory is volatile, so you can quickly introduce or remove data.
2. Avoid hallucinations
By forcing the LLM to answer solely based on the given context, the LLM will provide an answer as follows:
- use the external data to respond to the user's question if it contains the necessary insights
- "I don't know" if not
3. Add reference links
Using RAG, you can easily track the source of the data and highlight it to the user.
=== How does RAG work? ===
Let's say we want to use RAG to build a financial assistant.
What do we need?
- a data source with historical and real-time financial news (e.g., Alpaca)
- a stream processing engine (e.g., Bytewax)
- an encoder-only model for embedding the docs (e.g., pick one from `sentence-transformers`)
- a vector DB (e.g., Qdrant)
How does it work?
↳ On the feature pipeline side:
1. using Bytewax, you ingest the financial news and clean them
2. you chunk the news documents and embed them
3. you insert the embeddings of the docs along with their metadata (e.g., the initial text, source_url, etc.) into Qdrant
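Here is a hedged sketch of steps 1-3, leaving out the Bytewax plumbing and focusing on chunking, embedding, and inserting into Qdrant (the collection name, chunk size, and cleaning logic are assumptions):

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim encoder-only model
client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="financial_news",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


def ingest(doc: dict) -> None:
    text = doc["text"].strip()                                    # step 1: minimal cleaning
    chunks = [text[i:i + 500] for i in range(0, len(text), 500)]  # step 2: naive chunking
    vectors = encoder.encode(chunks)                              # step 2: embedding
    client.upsert(                                                # step 3: vectors + metadata
        collection_name="financial_news",
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=vector.tolist(),
                payload={"text": chunk, "source_url": doc["source_url"]},
            )
            for chunk, vector in zip(chunks, vectors)
        ],
    )
```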
↳ On the inference pipeline side:
4. the user question is embedded (using the same embedding model)
5. using this embedding, you extract the top K most similar news documents from Qdrant
6. along with the user question, you inject the necessary metadata from the extracted top K documents into the prompt template (e.g., the text of documents & its source_url)
7. you pass the whole prompt to the LLM for the final answer
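And a hedged sketch of steps 4-7 (the `llm` callable and the prompt wording are placeholders for whatever model and template you actually use):

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # must match the feature pipeline's model
client = QdrantClient(url="http://localhost:6333")


def answer(question: str, llm, top_k: int = 3) -> str:
    query_vector = encoder.encode(question).tolist()              # step 4: embed the question
    hits = client.search(                                         # step 5: top K similar chunks
        collection_name="financial_news",
        query_vector=query_vector,
        limit=top_k,
    )
    context = "\n\n".join(                                        # step 6: inject text + source_url
        f"{hit.payload['text']}\n(source: {hit.payload['source_url']})" for hit in hits
    )
    prompt = (
        'Answer ONLY from the context below. If the answer is not there, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)                                            # step 7: final answer
```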