DML: How to implement a streaming pipeline to populate a vector DB for real-time RAG?
Lesson 4 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
Table of Contents:
What is Bytewax?
Why have vector DBs become so popular? Why are they so crucial for most ML applications?
How to implement a streaming pipeline to populate a vector DB for real-time RAG?
Previous Lessons:
↳ 🔗 Check out the Hands-on LLMs course and support it with a ⭐.
#1. What is Bytewax?
Are you afraid of writing streaming pipelines? Or do you think they are hard to implement?
I did until I discovered Bytewax 🐝. Let me show you ↓
Bytewax 🐝 is an open-source stream processing framework that:
- is built in Rust for performance
- has Python 🐍 bindings for ease of use
... so for all the Python fanatics out there, no more JVM headaches for you.
Jokes aside, here is why Bytewax 🐝 is so powerful ↓
- Bytewax local setup is plug-and-play
- can quickly be integrated into any Python project (you can go wild -- even use it in Notebooks)
- can easily be integrated with other Python packages (NumPy, PyTorch, HuggingFace, OpenCV, SkLearn, you name it)
- out-of-the-box connectors for Kafka, local files, or you can quickly implement your own
- CLI tool to easily deploy it to K8s, AWS, or GCP.
For example (look at the image below):
1. We defined a streaming app in a few lines of code.
2. We ran the streaming app with one command.
The thing is that I worked with Kafka Streams (in Kotlin) for one year.
I loved & understood the power of building streaming applications. The only thing that stood in my way was, well... Java.
I have nothing against Java; it is a powerful language. However, building an ML application in Java + Python takes much more time because of the friction of integrating the two.
...and that's where Bytewax 🐝 kicks in.
We used Bytewax 🐝 to build the streaming pipeline for the Hands-on LLMs course and loved it.
#2. Why have vector DBs become so popular? Why are they so crucial for most ML applications?
In the world of ML, everything can be represented as an embedding.
A vector DB is an intelligent way to use your data embeddings as an index and perform fast and scalable searches between unstructured data points.
Simply put, a vector DB allows you to find matches between anything and anything (e.g., use an image as a query to find similar pieces of text, video, other images, etc.).
In a nutshell, this is how you can integrate a vector DB in real-world scenarios ↓
Using various DL techniques, you can project your data points (images, videos, text, audio, user interactions) into the same vector space (aka the embeddings of the data).
You load the embeddings along with a payload (e.g., a URL to the image, the creation date, the image description, properties, etc.) into the vector DB, where the data will be indexed by the:
- vector
- payload
- text within the payload
Now that the embedding indexes your data, you can query the vector DB by embedding any data point.
For example, you can query the vector DB with an image of your cat and use a filter to retrieve only "black" cats.
To do so, you must embed the image using the same model you used to embed the data within your vector DB. Afterward, you query the database using a given distance (e.g., the cosine distance between two vectors) to find similar embeddings.
These similar embeddings have attached to them their payload that contains valuable information such as the URL to an image, a URL to a site, an ID of a user, a chapter from a book about the cat of a witch, etc.
Using this technique, I used Qdrant to implement RAG for a financial assistant powered by LLMs.
But vector DBs go beyond LLMs & RAG.
Here is a list of what you can build using vector DBs (e.g., Qdrant):
- similar image search
- semantic text search (instead of plain text search)
- recommender systems
- RAG for chatbots
- anomaly detection
↳ 🔗 Check out Qdrant's guides and tutorials to learn more about vector DBs.
#3. How to implement a streaming pipeline to populate a vector DB for real-time RAG?
This is how you can implement a streaming pipeline to populate a vector DB to do RAG for a financial assistant powered by LLMs.
In a previous post, I covered why you need a streaming pipeline over a batch pipeline when implementing RAG.
Now, we will focus on the how, aka the implementation details ↓
Note: All the following steps are wrapped in Bytewax functions and connected in a single streaming pipeline.
Extract financial news from Alpaca
You need 2 types of inputs:
1. A WebSocket API to listen to financial news in real-time. This will be used to listen 24/7 for new data and ingest it as soon as it is available.
2. A RESTful API to ingest historical data in batch mode. When you deploy a fresh vector DB, you must populate it with data between a given range [date_start; date_end].
You wrap the ingested HTML document and its metadata in a `pydantic` NewsArticle model to validate its schema.
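A minimal sketch of such a model (the field names are illustrative, not the course's exact schema):

```python
from datetime import datetime

from pydantic import BaseModel


class NewsArticle(BaseModel):
    """Validates the schema of an ingested HTML news document."""

    article_id: str
    headline: str
    content: str  # raw HTML body
    source_url: str
    published_at: datetime


# pydantic coerces and validates the raw ingested fields on construction.
article = NewsArticle(
    article_id="42",
    headline="Markets rally",
    content="<p>Stocks surged today...</p>",
    source_url="https://example.com/news/42",
    published_at="2023-11-02T09:00:00",
)
print(article.headline)
```

If a field is missing or has the wrong type, pydantic raises a `ValidationError`, so malformed documents are rejected at the edge of the pipeline instead of corrupting the vector DB.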
Regardless of the input type, the ingested data is the same. Thus, the following steps are the same for both data inputs โ
Parse the HTML content
As the ingested financial news is in HTML, you must extract the text from particular HTML tags.
`unstructured` makes it as easy as calling `partition_html(document)`, which will recursively return the text within all essential HTML tags.
The parsed NewsArticle model is mapped into another `pydantic` model to validate its new schema.
The elements of the news article are the headline, summary and full content.
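The real pipeline simply calls `unstructured`'s `partition_html`; as a rough stdlib-only illustration of what that tag-based text extraction does, here is a sketch (the tag list and sample HTML are made up):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Rough approximation of `partition_html`: keep only the text
    inside essential content tags, dropping scripts, styles, etc."""

    CONTENT_TAGS = {"h1", "h2", "p", "li"}

    def __init__(self):
        super().__init__()
        self._tag_stack = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self._tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self._tag_stack and self._tag_stack[-1] == tag:
            self._tag_stack.pop()

    def handle_data(self, data):
        # Only keep text whose enclosing tag is a content tag.
        if self._tag_stack and self._tag_stack[-1] in self.CONTENT_TAGS:
            text = data.strip()
            if text:
                self.texts.append(text)


parser = TextExtractor()
parser.feed(
    "<html><h1>Markets rally</h1>"
    "<script>var x = 1;</script>"
    "<p>Stocks surged today.</p></html>"
)
print(parser.texts)  # ['Markets rally', 'Stocks surged today.']
```

`partition_html` does this recursively and far more robustly, which is why the course delegates the whole step to one function call.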
Clean the text
Now we have a bunch of text that has to be cleaned. Again, `unstructured` makes things easy. By calling a few functions, we clean:
- the dashes & bullets
- extra whitespace & trailing punctuation
- non-ASCII characters
- invalid quotes
Finally, we standardize everything to lowercase.
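A rough, pure-Python sketch of these cleaning steps (the course uses `unstructured`'s cleaners for this; the regexes below only approximate them):

```python
import re


def clean_text(text: str) -> str:
    """Approximate sketch of the cleaning steps listed above."""
    text = re.sub(r"[-\u2013\u2014]+", " ", text)       # dashes
    text = re.sub(r"[\u2022\u25cf\u00b7]", " ", text)   # bullets
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # invalid quotes
    text = text.encode("ascii", "ignore").decode()      # non-ASCII characters
    text = re.sub(r"\s+", " ", text).strip()            # extra whitespace
    text = text.rstrip(".,;:!?")                        # trailing punctuation
    return text.lower()                                 # standardize to lowercase


print(clean_text("• Stocks — surged “today”!!  "))  # → stocks surged "today"
```

Clean, normalized text matters because the same string cleaned two different ways would produce two different embeddings for identical content.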
Chunk the text
As the text can exceed the context window of the embedding model, we have to chunk it.
Yet again, `unstructured` provides a valuable function that splits the text based on the tokenized text and expected input length of the embedding model.
This strategy is naive, as it doesn't consider the text's structure, such as chapters, paragraphs, etc. As the news is short, this is not an issue, but LangChain provides a `RecursiveCharacterTextSplitter` class that does that if required.
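To show what "naive" chunking means concretely, here is a sliding-window sketch; it splits on whitespace as a stand-in for the embedding model's real tokenizer, and the window/overlap sizes are illustrative:

```python
def chunk_text(text: str, max_tokens: int = 128, overlap: int = 16) -> list[str]:
    """Naive sliding-window chunking: fixed-size windows with a small
    overlap so sentences cut at a boundary appear in both chunks."""
    tokens = text.split()  # placeholder for the model's tokenizer
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # last window already reached the end of the text
    return chunks


words = " ".join(f"w{i}" for i in range(300))
chunks = chunk_text(words, max_tokens=128, overlap=16)
print(len(chunks))  # 3
```

Each chunk stays under the embedding model's context window, at the cost of ignoring paragraph and section boundaries, which is exactly the trade-off described above.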
Embed the chunks
You pass all the chunks through an encoder-only model.
We used `all-MiniLM-L6-v2` from `sentence-transformers`, a small model that can run on a CPU and outputs a 384-dimensional embedding.
But based on the size and complexity of your data, you might need more complex and bigger models.
Load the data into the Qdrant vector DB
Finally, you insert the embedded chunks and their metadata into the Qdrant vector DB.
The metadata contains the embedded text, the source_url and the publish date.
↳ 🔗 Check out the Hands-on LLMs course to see this in action.
That's it for today!
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for Lesson 5 of the Hands-On LLMs series 🔥
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).