DML: Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications
Hello there, I am Paul Iusztin 👋🏼
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
This week's ML & MLOps topics:
Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications
Story: If anyone told you that ML or MLOps is easy, they were right. A simple trick I learned the hard way.
This week's newsletter is shorter than usual, but I have some great news 🔥
Next week, within the Decoding ML newsletter, I will start a step-by-step series based on the Hands-On LLMs course I am developing.
By the end of this series, you will know how to design, build, and deploy a financial assistant powered by LLMs.
…all of this for FREE inside the Decoding ML newsletter.
↳🔗 Check out the Hands-On LLMs course GitHub page and give it a star to stay updated with our progress.
#1. Synced Vector DBs - A Guide to Streaming Pipelines for Real-Time RAG in Your LLM Applications
To successfully use RAG in your LLM applications, your vector DB must constantly be updated with the latest data.
Here is how you can implement a streaming pipeline to keep your vector DB in sync with your datasets ↓
.
RAG is a popular strategy for building LLM applications that adds context about your private datasets to the prompt.
Leveraging your domain data using RAG provides 2 significant benefits:
- you don't need to fine-tune your model as often (or at all)
- you avoid hallucinations, since the answers are grounded in retrieved context
.
On the bot side, to implement RAG, you have to (steps #1-#2 belong to the ingestion pipeline, described below):
3. Embed the user's question using an embedding model (e.g., BERT). Use the embedding to query your vector DB and find the most similar vectors using a distance function (e.g., cosine similarity).
4. Get the top N closest vectors and their metadata.
5. Attach the metadata of the top N vectors + the chat history to the input prompt.
6. Pass the prompt to the LLM.
7. Insert the user question + assistant answer into the chat history.
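Steps 3-5 can be sketched in a few lines of plain Python. This is a minimal, self-contained illustration: the `embed` function and the in-memory `VECTOR_DB` are hypothetical stand-ins for a real embedding model (e.g., BERT) and a real vector DB (e.g., Qdrant), not their actual APIs.

```python
import math

# Toy in-memory "vector DB": (vector, metadata) pairs.
# In production, the vectors come from an embedding model
# and live in a real vector DB such as Qdrant.
VECTOR_DB = [
    ([1.0, 0.0], {"text": "Stocks rallied on strong earnings."}),
    ([0.0, 1.0], {"text": "The central bank raised interest rates."}),
    ([0.7, 0.7], {"text": "Tech shares led the market higher."}),
]

def cosine(a: list[float], b: list[float]) -> float:
    # Distance function used to rank vectors (step 3).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed(question: str) -> list[float]:
    # Hypothetical stand-in for a real embedding model.
    return [1.0, 0.1] if "stock" in question.lower() else [0.1, 1.0]

def build_prompt(question: str, chat_history: list[str], top_n: int = 2) -> str:
    q_vec = embed(question)
    # Steps 3-4: query the DB and keep the top N most similar vectors.
    hits = sorted(VECTOR_DB, key=lambda item: cosine(q_vec, item[0]), reverse=True)[:top_n]
    context = "\n".join(meta["text"] for _, meta in hits)
    # Step 5: attach the retrieved metadata + chat history to the prompt.
    history = "\n".join(chat_history)
    return f"Context:\n{context}\n\nHistory:\n{history}\n\nQuestion: {question}"

prompt = build_prompt("What moved stocks today?", chat_history=[])
```

The resulting `prompt` is what you pass to the LLM in step 6; step 7 is just appending the question and answer to `chat_history`.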
.
But the question is, how do you keep your vector DB up to date with the latest data?
↳ You need a real-time streaming pipeline.
How do you implement it?
You need 2 components:
↳ A stream processing framework. For example, Bytewax is built in Rust for efficiency and exposes a Python interface for ease of use - you don't need Java to implement real-time pipelines anymore.
🔗 Bytewax
↳ A vector DB. For example, Qdrant provides a rich set of features and a seamless experience.
🔗 Qdrant
.
Here is an example of how to implement a streaming pipeline for financial news ↓
#1. Financial news data source (e.g., Alpaca):
To populate your vector DB, you need a historical API (e.g., a RESTful API) that lets you ingest data in batch mode over a desired [start_date, end_date] range. You can tweak the number of workers to parallelize this step as much as possible.
→ You run this once, in the beginning.
You also need the data exposed through a web socket to ingest news in real time, so you can listen for news and insert it into your vector DB as soon as it is available.
→ Listens 24/7 for financial news.
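The two ingestion modes can be sketched as two generators that feed the same downstream processing. Everything here is a hypothetical stand-in - the real Alpaca REST and web-socket clients have different APIs; the point is only the batch-backfill vs. 24/7-stream split.

```python
from datetime import date

# Hypothetical stand-in for the historical REST API's data.
HISTORY = [
    {"date": date(2023, 10, 1), "headline": "Q3 earnings beat estimates"},
    {"date": date(2023, 10, 2), "headline": "Oil prices spike"},
]

def batch_source(start: date, end: date):
    # Runs once at startup: backfills the vector DB over [start, end].
    for item in HISTORY:
        if start <= item["date"] <= end:
            yield item

def stream_source(socket_messages):
    # Runs 24/7: each web-socket message is ingested as it arrives.
    for msg in socket_messages:
        yield msg

# Both sources feed the exact same downstream processing steps.
ingested = list(batch_source(date(2023, 10, 1), date(2023, 10, 1)))
ingested += list(stream_source([{"date": date(2023, 10, 3), "headline": "Fed holds rates"}]))
```

In a real pipeline, `stream_source` would wrap a long-lived web-socket connection instead of a finite list.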
#2. Build the streaming pipeline using Bytewax:
Implement 2 input connectors for the 2 different types of APIs: RESTful API & web socket.
The rest of the steps can be shared between both connectors ↓
- Clean the financial news documents.
- Chunk the documents.
- Embed the documents (e.g., using BERT).
- Insert the embedded documents + their metadata into the vector DB (e.g., Qdrant).
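The shared clean → chunk → embed → insert steps above can be sketched as plain functions. This is a toy sketch, not Bytewax or Qdrant code: `embed` is a trivial stand-in for BERT, and `vector_db` is an in-memory list standing in for Qdrant.

```python
def clean(doc: str) -> str:
    # Normalize whitespace; real cleaning is more involved (HTML, boilerplate, etc.).
    return " ".join(doc.split())

def chunk(doc: str, size: int = 5) -> list[str]:
    # Split into fixed-size word chunks so each fits the embedding model's input.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real model like BERT.
    return [float(len(text)), float(text.count(" "))]

vector_db: list[dict] = []  # in-memory stand-in for Qdrant

def insert(chunk_text: str, source: str) -> None:
    # Store the embedding together with its metadata (payload).
    vector_db.append({
        "vector": embed(chunk_text),
        "payload": {"text": chunk_text, "source": source},
    })

# The same steps run for documents from either connector
# (RESTful API or web socket).
raw = "  Markets   closed higher today after strong tech earnings lifted sentiment  "
for c in chunk(clean(raw)):
    insert(c, source="historical-api")
```

In Bytewax, each of these functions would become a step in the dataflow, applied to every document flowing in from the connectors.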
#3-7. When the users ask a financial question, you can leverage RAG with an up-to-date vector DB to search for the latest news in the industry.
#Story. If anyone told you that ML or MLOps is easy, they were right. A simple trick I learned the hard way.
If anyone told you that ML or MLOps is easy, they were right.
Here is a simple trick that I learned the hard way ↓
If you are in this domain, you already know that everything changes fast:
- a new tool every month
- a new model every week
- a new project every day
You know what I did? I stopped caring about all these changes and switched my attention to the real gold.
Which is → "Focus on the fundamentals."
.
Let me explain ↓
When you constantly chase the latest models (aka FOMO), you will only gain a shallow understanding of that new information (unless you are a genius or already deep into that niche).
But the joke's on you. In reality, most of what you think you need to know, you don't.
So you won't use what you learned, and you'll forget most of it after 1-2 months.
What a waste of time, right?
.
But...
If you master the fundamentals of the topic you want to learn...
For example, for deep learning, you have to know:
- how models are built
- how they are trained
- groundbreaking architectures (ResNet, UNet, Transformers, etc.)
- parallel training
- deploying a model, etc.
...when in need (e.g., you just moved on to a new project), you can easily pick up the latest research.
Thus, after you have laid the foundation, it is straightforward to learn SoTA approaches when needed (if needed).
Most importantly, what you learn will stick with you, and you will have the flexibility to jump from one project to another quickly.
.
I am also guilty. I used to FOMO into all kinds of topics until I was honest with myself and admitted I was no Leonardo da Vinci.
But here is what I did that worked well:
- building projects
- replicating the implementations of famous papers
- teaching the subject I want to learn
... and, most importantly, taking my time to relax and internalize the information.
To conclude:
- learn ahead only the fundamentals
- learn the latest trend only when needed
That's it for today!
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for the beginning of the Hands-On LLMs series 🔥
Paul
Whenever youโre ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).