DML: Why & what do you need a streaming pipeline when implementing RAG in your LLM applications?
Lesson 3 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
Table of Contents:
RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?
What do you need to implement a streaming pipeline for a financial assistant?
Previous Lessons:
↳🔗 Check out the Hands-on LLMs course and support it with a ⭐.
#1. RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Let's find out ↓
RAG is a popular strategy for adding external data to your prompt when building LLM applications.
=== Problem ===
Working with LLMs has 3 main issues:
1. The world moves fast
An LLM learns an internal knowledge base. However, the issue is that its knowledge is limited to its training dataset.
The world moves fast. New data flows on the internet every second. Thus, the model's knowledge base can quickly become obsolete.
One solution is to fine-tune the model every minute or day...
If you have a few billion dollars to spare, go for it.
2. Hallucinations
An LLM is full of testosterone and likes to be blindly confident.
Even if the answer looks 100% legit, you can never fully trust it.
3. Lack of reference links
It is hard to trust the response of the LLM if we can't see the source of its decisions.
Especially for important decisions (e.g., health, financials)
=== Solution ===
Surprise! It is RAG.
1. Avoid fine-tuning
Using RAG, you use the LLM as a reasoning engine and the external knowledge base as the main memory (e.g., vector DB).
The memory is volatile, so you can quickly introduce or remove data.
2. Avoid hallucinations
By forcing the LLM to answer solely based on the given context, it will:
- use the external data to respond to the user's question if it contains the necessary insights
- reply "I don't know" if not
(a minimal prompt sketch follows this list)
3. Add reference links
Using RAG, you can easily track the source of the data and highlight it to the user.
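To make point 2 concrete, here is a minimal sketch of what a context-enforcing prompt could look like. The template wording and the `build_prompt` helper are illustrative assumptions, not the exact prompt used in the course:

```python
# Illustrative prompt template (an assumption, not the course's exact wording)
# that forces the LLM to answer only from the retrieved context.
PROMPT_TEMPLATE = """You are a financial assistant.
Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question:
{question}

Answer:"""


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved documents and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```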
=== How does RAG work? ===
Let's say we want to use RAG to build a financial assistant.
What do we need?
- a data source with historical and real-time financial news (e.g. Alpaca)
- a stream processing engine (e.g., Bytewax)
- an encoder-only model for embedding the documents (e.g., pick one from `sentence-transformers`)
- a vector DB (e.g., Qdrant)
How does it work?
↳ On the feature pipeline side:
1. using Bytewax, you ingest the financial news and clean it
2. you chunk the news documents and embed them
3. you insert the embeddings of the docs, along with their metadata (e.g., the initial text, source_url, etc.), into Qdrant
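Here is a minimal sketch of steps 1-3 using `sentence-transformers` and the `qdrant-client` Python SDK. The model name, collection name, and naive chunking are assumptions made to keep the example short; in the course, the same logic runs as Bytewax operators over the live news stream rather than as a plain script:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Assumptions: the model, collection name, and naive chunking are illustrative.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small encoder, 384-dim embeddings
client = QdrantClient(":memory:")  # swap for your real Qdrant URL

client.create_collection(
    collection_name="financial_news",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; a real pipeline splits on tokens/sentences."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def ingest(news_text: str, source_url: str) -> None:
    """Clean -> chunk -> embed -> upsert one news document into Qdrant (steps 1-3)."""
    cleaned = " ".join(news_text.split())  # minimal cleaning for the sketch
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedder.encode(piece).tolist(),
            payload={"text": piece, "source_url": source_url},
        )
        for piece in chunk(cleaned)
    ]
    client.upsert(collection_name="financial_news", points=points)
```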
↳ On the inference pipeline side:
4. the user's question is embedded (using the same embedding model)
5. using this embedding, you retrieve the top K most similar news documents from Qdrant
6. along with the user's question, you inject the necessary metadata from the extracted top K documents into the prompt template (e.g., the documents' text & their source_url)
7. you pass the whole prompt to the LLM for the final answer
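And a matching sketch for steps 4-7, reusing the `embedder`, `client`, and `build_prompt` objects defined in the sketches above. The `call_llm` helper is a hypothetical placeholder for whichever LLM client you use:

```python
def answer(question: str, top_k: int = 3) -> str:
    """Embed the question, retrieve the top-K chunks, build the prompt, call the LLM (steps 4-7)."""
    query_vector = embedder.encode(question).tolist()  # step 4: same embedding model

    # Step 5: nearest-neighbour search over the news embeddings stored in Qdrant.
    hits = client.search(
        collection_name="financial_news",
        query_vector=query_vector,
        limit=top_k,
    )

    # Step 6: inject the retrieved text + metadata into the prompt template.
    context = "\n\n".join(
        f"{hit.payload['text']}\n(source: {hit.payload['source_url']})" for hit in hits
    )
    prompt = build_prompt(context=context, question=question)

    # Step 7: pass the whole prompt to the LLM for the final answer.
    return call_llm(prompt)  # hypothetical helper wrapping your LLM client
```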
↳🔗 Check out the Hands-on LLMs course to see this in action.
#2. Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?
The quality of your RAG implementation is as good as the quality & freshness of your data.
Thus, depending on your use case, you have to ask:
"How fresh does my data from the vector DB have to be to provide accurate answers?"
But for the best user experience, the data has to be as fresh as possible, aka real-time data.
For example, when implementing a financial assistant, being aware of the latest financial news is critical. A new piece of information can completely change the course of your strategy.
Hence, when implementing RAG, one critical aspect is to have your vector DB synced with all your external data sources in real-time.
A batch pipeline will work if your use case accepts a particular delay (e.g., one hour, one day, etc.).
But with tools like Bytewax, building streaming applications becomes much more accessible. So why not aim for the best?
#3. What do you need to implement a streaming pipeline for a financial assistant?
- A financial news data source exposed through a web socket (e.g., Alpaca)
- A Python stream-processing framework. For example, Bytewax is built in Rust for efficiency and exposes a Python interface for ease of use - you no longer need the Java ecosystem to implement real-time pipelines (see the dataflow sketch at the end of this list).
- A Python package to process, clean, and chunk documents. `unstructured` offers a rich set of features that makes parsing HTML documents extremely convenient.
- An encoder-only language model that maps your chunked documents into embeddings. `sentence-transformers` is well integrated with HuggingFace and has a huge list of models of various sizes.
- A vector DB, where to insert your embeddings and their metadata (e.g., the embedded text, the source_url, the creation date, etc.). For example, Qdrant provides a rich set of features and a seamless experience.
- A way to deploy your streaming pipeline. Docker + AWS will never disappoint you.
- A CI/CD pipeline for continuous tests & deployments. GitHub Actions is a great serverless option with a rich ecosystem.
This is what you need to build & deploy a streaming pipeline solely in Python 🔥
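To show the shape of the streaming side, here is a rough Bytewax dataflow skeleton. It is a sketch under assumptions: the operator style follows the older (~0.16) Bytewax API (newer releases restructured the operators into `bytewax.operators`), the input is a static test source standing in for the Alpaca WebSocket stream, and the output just prints to stdout instead of upserting to Qdrant:

```python
import re

from bytewax.dataflow import Dataflow
from bytewax.testing import TestingInput, run_main
from bytewax.connectors.stdio import StdOutput

# Static stand-in for the Alpaca WebSocket news stream.
fake_news = [
    {"text": "<p>Company X beats earnings estimates ...</p>", "source_url": "https://example.com/1"},
]


def clean(doc: dict) -> dict:
    """Minimal cleaning: strip HTML tags & extra whitespace (the course uses `unstructured` for this)."""
    doc["text"] = " ".join(re.sub(r"<[^>]+>", " ", doc["text"]).split())
    return doc


def chunk(doc: dict) -> list[dict]:
    """Split one document into fixed-size chunks, keeping the metadata."""
    size = 500
    return [
        {"text": doc["text"][i : i + size], "source_url": doc["source_url"]}
        for i in range(0, len(doc["text"]), size)
    ]


flow = Dataflow()
flow.input("news", TestingInput(fake_news))  # swap for a real Alpaca WebSocket input
flow.map(clean)
flow.flat_map(chunk)
# In the real pipeline, you would embed each chunk here (flow.map(...))
# and write it to Qdrant via a custom output sink.
flow.output("stdout", StdOutput())

if __name__ == "__main__":
    run_main(flow)
```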
↳🔗 Check out the Hands-on LLMs course to see this in action.
That's it for today!
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for Lesson 4 of the Hands-On LLMs series 🔥
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).