Real-time feature pipelines for RAG
RAG hybrid search with transformers-based sparse vectors. CDC tech stack for event-driven architectures.
This week's topics:
CDC tech stack for event-driven architectures
Real-time feature pipelines with CDC
RAG hybrid search with transformers-based sparse vectors
CDC tech stack for event-driven architectures
Here is the tech stack used to build a Change Data Capture (CDC) component for implementing an event-driven architecture in our LLM Twin course.
What is Change Data Capture (CDC)?
The purpose of CDC is to capture insertions, updates, and deletions applied to a database and to make this change data available in a format easily consumable by downstream applications.
Why do we need the CDC pattern?
- Real-time Data Syncing
- Efficient Data Pipelines
- Minimized System Impact
- Event-Driven Architectures
What do we need for an end-to-end implementation of CDC?
We will take the tech stack used in our LLM Twin course as an example, where...
... we built a feature pipeline to gather cleaned data for fine-tuning and chunked & embedded data for RAG
Everything will be done only in Python!
Here they are
↓↓↓
1. The source database: MongoDB (it also works for most databases, such as MySQL, PostgreSQL, Oracle, etc.)
2. A tool to monitor the transaction log: MongoDB Watcher (Debezium is also a popular & scalable solution)
3. A distributed queue: RabbitMQ (another popular option is Kafka, but it was overkill in our use case)
4. A streaming engine: Bytewax (a great streaming engine for the Python ecosystem)
5. A destination database: Qdrant (this works with any other database, but we needed a vector DB to store our data for fine-tuning and RAG)
For example, here is how a CREATE operation will be processed:
1. Write a post to the MongoDB warehouse
2. A "create" operation is logged in the transaction log of Mongo
3. The MongoDB watcher captures this and emits it to the RabbitMQ queue (see the example change event after this list)
4. The Bytewax streaming pipeline reads the event from the queue
5. It cleans, chunks, and embeds it right away - in real time!
6. The cleaned & embedded version of the post is written to Qdrant
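Concretely, the change event that the watcher picks up in step 3 looks roughly like the dictionary below. The top-level field names follow MongoDB's change stream event format, but the database, collection, and post contents are illustrative placeholders, not the course's actual schema.

```python
# Roughly what MongoDB's change stream hands to the watcher for an insert.
# Top-level field names follow MongoDB's change event format; the namespace
# and document contents are made-up placeholders.
change_event = {
    "operationType": "insert",  # the CRUD operation that was captured
    "ns": {"db": "llm_twin", "coll": "posts"},  # hypothetical namespace
    "documentKey": {"_id": "66b1f1c2e4b0f2a5d8c9e111"},
    "fullDocument": {  # the newly written post itself
        "_id": "66b1f1c2e4b0f2a5d8c9e111",
        "author_id": "paul",
        "content": "Raw post text that will be cleaned, chunked & embedded...",
    },
}
```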
Real-time feature pipelines with CDC
How to implement CDC to sync your data warehouse and feature store using a RabbitMQ queue and a Bytewax streaming engine ↓
First, let's understand where you need to implement the Change Data Capture (CDC) pattern:
CDC is used when you want to sync 2 databases.
The destination can be a complete replica of the source database (e.g., one for transactional and the other for analytical applications)
...or you can process the data from the source database before loading it to the destination DB (e.g., retrieve various documents and chunk & embed them for RAG).
That's what I am going to show you:
How to use CDC to sync a MongoDB & Qdrant vector DB to streamline real-time documents that must be ready for fine-tuning LLMs and RAG.
MongoDB is our data warehouse.
Qdrant is our logical feature store.
Here is the implementation of the CDC pattern:
1. Use Mongo's watch() method to listen for CRUD transactions.
2. For example, on a CREATE operation, along with saving it to Mongo, the watch() method will trigger a change and return a JSON with all the information.
3. We standardize the JSON into our desired structure.
4. We stringify the JSON and publish it to the RabbitMQ queue (a code sketch of these four steps follows below).
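Here is a minimal sketch of those four steps, assuming a MongoDB replica set and a RabbitMQ broker running locally and using pymongo and pika as clients. The database, collection, and queue names are hypothetical placeholders, not the course's actual code.

```python
import json

import pika  # RabbitMQ client
from pymongo import MongoClient

# Connect to the source database (change streams require a replica set).
mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["llm_twin"]["posts"]  # hypothetical db & collection names

# Connect to the RabbitMQ queue that feeds the streaming pipeline.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="mongo_changes", durable=True)  # hypothetical queue name

# 1. Listen for CRUD transactions via the change stream.
with collection.watch() as stream:
    for change in stream:
        # 2. watch() yields a JSON-like document describing the operation.
        # 3. Standardize it into the structure the feature pipeline expects.
        event = {
            "type": change["operationType"],
            "entry_id": str(change["documentKey"]["_id"]),
            "data": change.get("fullDocument", {}),
        }
        # 4. Stringify the JSON and publish it to the RabbitMQ queue.
        channel.basic_publish(
            exchange="",
            routing_key="mongo_changes",
            body=json.dumps(event, default=str),
        )
```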
How do we scale?
→ You can use Debezium instead of Mongo's watch() method for scaling up the system, but the idea remains the same.
→ You can swap RabbitMQ with Kafka, but RabbitMQ can get you far.
Now, what happens on the other side of the queue?
You have a Bytewax streaming pipeline, written 100% in Python, that:
5. Listens in real time to new messages from the RabbitMQ queue
6. Cleans, chunks, and embeds the events on the fly
7. Loads the data to Qdrant for LLM fine-tuning & RAG (see the dataflow sketch after this list)
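To make that concrete, here is a runnable toy version of such a dataflow, assuming bytewax >= 0.18. The cleaning, chunking, and embedding functions are deliberately trivial stand-ins, and TestingSource / StdOutSink replace the custom RabbitMQ source and Qdrant sink you would write for the real pipeline.

```python
import json

import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource, run_main

# Stand-in messages; the real pipeline would read these from RabbitMQ.
FAKE_QUEUE = [json.dumps({"type": "insert", "data": {"content": "  Some RAW post text!  "}})]

def clean(message: str) -> str:
    """Parse the queued JSON and normalize the raw text (toy cleaning step)."""
    payload = json.loads(message)
    return " ".join(payload["data"]["content"].split()).lower()

def chunk(text: str, size: int = 16) -> list[str]:
    """Split the cleaned text into fixed-size chunks (toy chunking step)."""
    return [text[i : i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> tuple[str, list[float]]:
    """Fake embedding; swap in a real embedding model here."""
    return chunk_text, [float(len(chunk_text))]

flow = Dataflow("cdc_feature_pipeline")
events = op.input("read_queue", flow, TestingSource(FAKE_QUEUE))  # 5. read the queue
cleaned = op.map("clean", events, clean)                          # 6. clean...
chunks = op.flat_map("chunk", cleaned, chunk)                     #    ...chunk...
embedded = op.map("embed", chunks, embed)                         #    ...embed
op.output("write", embedded, StdOutSink())                        # 7. Qdrant in the real pipeline

run_main(flow)
```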
Do you want to check out the full code?
...or even an entire article about CDC?
The CDC component is part of the LLM Twin FREE course, made by Decoding ML.
↓↓↓
Lesson 3: Change Data Capture: Enabling Event-Driven Architectures
GitHub
RAG hybrid search with transformers-based sparse vectors
Hybrid search is standard in advanced RAG systems. The trick is to compute the suitable sparse vectors for it. Here is an article that shows how to use SPLADE to compute sparse vectors using transformers and integrate them into a hybrid search algorithm using Qdrant.
Why bother with sparse vectors when we have dense vectors (embeddings)?
Sparse vectors represent data by highlighting only the most relevant features (like keywords), significantly reducing memory usage compared to dense vectors.
Also, sparse vectors are great at matching specific keywords, which is why they work so well in combination with dense vectors, which capture semantic similarity but miss particular words.
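As a rough illustration of how such sparse vectors can be computed with transformers, here is a condensed sketch using a public SPLADE checkpoint from the Hugging Face Hub; the exact model and post-processing used in the article may differ.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# A public SPLADE checkpoint; the article may use a different one.
model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def compute_sparse_vector(text: str) -> dict[int, float]:
    """Return a {vocabulary_token_id: weight} dict with only non-zero weights."""
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**tokens).logits  # shape: (1, seq_len, vocab_size)
    # SPLADE activation: log(1 + ReLU(logits)), masked so padding does not
    # contribute, then max-pooled over the sequence dimension.
    weights = torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1)
    sparse = weights.max(dim=1).values.squeeze(0)  # shape: (vocab_size,)
    indices = sparse.nonzero().squeeze(-1)
    return {int(i): float(sparse[i]) for i in indices}

vec = compute_sparse_vector("What is change data capture in MongoDB?")
# Only a handful of keyword-like tokens carry non-zero weight.
```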
The article highlights:
- Sparse vs. dense vectors
- How SPLADE works: the SPLADE model leverages sparse vectors to perform better than traditional methods like BM25 by computing term weights with transformer architectures.
- Why SPLADE works: it expands terms based on context rather than just frequency, offering a more nuanced understanding of content relevancy.
- How to implement hybrid search using SPLADE with Qdrant: step-by-step code (a condensed sketch follows below)
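For a flavor of what the Qdrant side looks like, here is a minimal, self-contained sketch assuming a recent qdrant-client (>= 1.10) with the Query API; the collection name, vector names, and hard-coded vectors are placeholders, and the article's actual code may be organized differently. The sparse vector would come from a SPLADE function like the one sketched earlier, the dense vector from your embedding model.

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # in-memory instance, just for illustration

# A collection with one dense and one sparse named vector.
client.create_collection(
    collection_name="posts",
    vectors_config={"dense": models.VectorParams(size=384, distance=models.Distance.COSINE)},
    sparse_vectors_config={"splade": models.SparseVectorParams()},
)

# Tiny hard-coded vectors keep the example self-contained; in practice they
# come from your dense embedding model and the SPLADE sparse encoder.
dense_vec = [0.1] * 384
sparse_vec = models.SparseVector(indices=[10, 42], values=[0.8, 0.3])

client.upsert(
    collection_name="posts",
    points=[
        models.PointStruct(
            id=1,
            vector={"dense": dense_vec, "splade": sparse_vec},
            payload={"content": "Change Data Capture syncs two databases."},
        )
    ],
)

# Hybrid search: prefetch candidates with each vector type, then fuse the two
# result lists with Reciprocal Rank Fusion (RRF).
hits = client.query_points(
    collection_name="posts",
    prefetch=[
        models.Prefetch(query=sparse_vec, using="splade", limit=20),
        models.Prefetch(query=dense_vec, using="dense", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
)
```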
Here is the article
↓↓↓
Sparse Vectors in Qdrant: Pure Vector-based Hybrid Search
Images
If not otherwise stated, all images are created by the author.