DML: Why & what do you need a streaming pipeline when implementing RAG in your LLM applications?
Lesson 3 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
Table of Contents:
RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?
What do you need to implement a streaming pipeline for a financial assistant?
Previous Lessons:
↳🔗 Check out the Hands-on LLMs course and support it with a ⭐.
#1. RAG: What problems does it solve, and how is it integrated into LLM-powered applications?
Let's find out ↓
RAG is a popular strategy for adding external data to your prompt when building LLM applications.
=== Problem ===
Working with LLMs has 3 main issues:
1. The world moves fast
An LLM learns an internal knowledge base. However, the issue is that its knowledge is limited to its training dataset.
The world moves fast. New data flows on the internet every second. Thus, the model's knowledge base can quickly become obsolete.
One solution is to fine-tune the model every minute or day...
If you have a few billion dollars to spare, go for it.
2. Hallucinations
An LLM is full of testosterone and likes to be blindly confident.
Even if the answer looks 100% legit, you can never fully trust it.
3. Lack of reference links
It is hard to trust the response of the LLM if we can't see the source of its decisions.
Especially for important decisions (e.g., health, financials)
=== Solution ===
Surprise! It is RAG.
1. Avoid fine-tuning
Using RAG, you use the LLM as a reasoning engine and the external knowledge base as the main memory (e.g., vector DB).
The memory is volatile, so you can quickly introduce or remove data.
2. Avoid hallucinations
By forcing the LLM to answer solely based on the given context, it will:
- use the external data to respond to the user's question if it contains the necessary insights
- reply "I don't know" if not
(a minimal prompt sketch follows this list)
3. Add reference links
Using RAG, you can easily track the source of the data and highlight it to the user.
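To make point 2 concrete, here is a minimal sketch of what a context-enforcing prompt could look like. The template wording and the `build_prompt` helper are illustrative assumptions, not the exact prompt used in the course:

```python
# Illustrative prompt template (an assumption, not the course's exact wording)
# that forces the LLM to answer only from the retrieved context.
PROMPT_TEMPLATE = """You are a financial assistant.
Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question:
{question}

Answer:"""


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved documents and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```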
=== How does RAG work? ===
Let's say we want to use RAG to build a financial assistant.
What do we need?
- a data source with historical and real-time financial news (e.g. Alpaca)
- a stream processing engine (e.g., Bytewax)
- an encoder-only model for embedding the documents (e.g., pick one from `sentence-transformers`)
- a vector DB (e.g., Qdrant)
How does it work?
↳ On the feature pipeline side:
1. using Bytewax, you ingest the financial news and clean it
2. you chunk the news documents and embed them
3. you insert the embeddings of the docs, along with their metadata (e.g., the initial text, source_url, etc.), into Qdrant
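Here is a minimal sketch of steps 1-3 using `sentence-transformers` and the `qdrant-client` Python SDK. The model name, collection name, and naive chunking are assumptions made to keep the example short; in the course, the same logic runs as Bytewax operators over the live news stream rather than as a plain script:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Assumptions: the model, collection name, and naive chunking are illustrative.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small encoder, 384-dim embeddings
client = QdrantClient(":memory:")  # swap for your real Qdrant URL

client.create_collection(
    collection_name="financial_news",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size chunking; a real pipeline splits on tokens/sentences."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def ingest(news_text: str, source_url: str) -> None:
    """Clean -> chunk -> embed -> upsert one news document into Qdrant (steps 1-3)."""
    cleaned = " ".join(news_text.split())  # minimal cleaning for the sketch
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedder.encode(piece).tolist(),
            payload={"text": piece, "source_url": source_url},
        )
        for piece in chunk(cleaned)
    ]
    client.upsert(collection_name="financial_news", points=points)
```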
↳ On the inference pipeline side:
4. the user's question is embedded (using the same embedding model)
5. using this embedding, you retrieve the top K most similar news documents from Qdrant
6. along with the user's question, you inject the necessary metadata from the extracted top K documents into the prompt template (e.g., the documents' text & their source_url)
7. you pass the whole prompt to the LLM for the final answer
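And a matching sketch for steps 4-7, reusing the `embedder`, `client`, and `build_prompt` objects defined in the sketches above. The `call_llm` helper is a hypothetical placeholder for whichever LLM client you use:

```python
def answer(question: str, top_k: int = 3) -> str:
    """Embed the question, retrieve the top-K chunks, build the prompt, call the LLM (steps 4-7)."""
    query_vector = embedder.encode(question).tolist()  # step 4: same embedding model

    # Step 5: nearest-neighbour search over the news embeddings stored in Qdrant.
    hits = client.search(
        collection_name="financial_news",
        query_vector=query_vector,
        limit=top_k,
    )

    # Step 6: inject the retrieved text + metadata into the prompt template.
    context = "\n\n".join(
        f"{hit.payload['text']}\n(source: {hit.payload['source_url']})" for hit in hits
    )
    prompt = build_prompt(context=context, question=question)

    # Step 7: pass the whole prompt to the LLM for the final answer.
    return call_llm(prompt)  # hypothetical helper wrapping your LLM client
```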
↳🔗 Check out the Hands-on LLMs course to see this in action.
#2. Why do you need a streaming pipeline instead of a batch pipeline when implementing RAG in your LLM applications?
The quality of your RAG implementation is as good as the quality & freshness of your data.
Thus, depending on your use case, you have to ask:
"How fresh does my data from the vector DB have to be to provide accurate answers?"
But for the best user experience, the data has to be as fresh as possible, aka real-time data.
For example, when implementing a financial assistant, being aware of the latest financial news is critical. A new piece of information can completely change the course of your strategy.
Hence, when implementing RAG, one critical aspect is to have your vector DB synced with all your external data sources in real-time.
A batch pipeline will work if your use case accepts a particular delay (e.g., one hour, one day, etc.).
But with tools like Bytewax, building streaming applications becomes much more accessible. So why not aim for the best?
#3. What do you need to implement a streaming pipeline for a financial assistant?
- A financial news data source exposed through a web socket (e.g., Alpaca)
- A Python stream-processing framework. For example, Bytewax is built in Rust for efficiency and exposes a Python interface for ease of use - you no longer need the Java ecosystem to implement real-time pipelines (see the dataflow sketch at the end of this list).
- A Python package to process, clean, and chunk documents. `unstructured` offers a rich set of features that makes parsing HTML documents extremely convenient.
- An encoder-only language model that maps your chunked documents into embeddings. `sentence-transformers` is well integrated with HuggingFace and has a huge list of models of various sizes.
- A vector DB, where to insert your embeddings and their metadata (e.g., the embedded text, the source_url, the creation date, etc.). For example, Qdrant provides a rich set of features and a seamless experience.
- A way to deploy your streaming pipeline. Docker + AWS will never disappoint you.
- A CI/CD pipeline for continuous tests & deployments. GitHub Actions is a great serverless option with a rich ecosystem.
This is what you need to build & deploy a streaming pipeline solely in Python 🔥
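To show the shape of the streaming side, here is a rough Bytewax dataflow skeleton. It is a sketch under assumptions: the operator style follows the older (~0.16) Bytewax API (newer releases restructured the operators into `bytewax.operators`), the input is a static test source standing in for the Alpaca WebSocket stream, and the output just prints to stdout instead of upserting to Qdrant:

```python
import re

from bytewax.dataflow import Dataflow
from bytewax.testing import TestingInput, run_main
from bytewax.connectors.stdio import StdOutput

# Static stand-in for the Alpaca WebSocket news stream.
fake_news = [
    {"text": "<p>Company X beats earnings estimates ...</p>", "source_url": "https://example.com/1"},
]


def clean(doc: dict) -> dict:
    """Minimal cleaning: strip HTML tags & extra whitespace (the course uses `unstructured` for this)."""
    doc["text"] = " ".join(re.sub(r"<[^>]+>", " ", doc["text"]).split())
    return doc


def chunk(doc: dict) -> list[dict]:
    """Split one document into fixed-size chunks, keeping the metadata."""
    size = 500
    return [
        {"text": doc["text"][i : i + size], "source_url": doc["source_url"]}
        for i in range(0, len(doc["text"]), size)
    ]


flow = Dataflow()
flow.input("news", TestingInput(fake_news))  # swap for a real Alpaca WebSocket input
flow.map(clean)
flow.flat_map(chunk)
# In the real pipeline, you would embed each chunk here (flow.map(...))
# and write it to Qdrant via a custom output sink.
flow.output("stdout", StdOutput())

if __name__ == "__main__":
    run_main(flow)
```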
↳🔗 Check out the Hands-on LLMs course to see this in action.
That's it for today!
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for Lesson 4 of the Hands-On LLMs series 🔥
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).