DML: This is what you need to build an inference pipeline for a financial assistant powered by LLMs, vector DBs and LLMOps
Lesson 9 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋
Within this newsletter, I will help you decode complex topics about ML & MLOps, one week at a time 🔥
This is the last lesson within the Hands-on LLMs series... but certainly not our last MLE & MLOps series. We are cooking up some exciting stuff. I hope you had fun and learned a lot during this series.
Now, let's see how to glue everything we have done so far into the inference pipeline. Enjoy!
Table of Contents:
Inference pipeline video lesson
What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs?
How can you build & deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices?
Previous Lessons:
Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?
Lesson 7: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?
Lesson 8: 7-steps on how to fine-tune an open-source LLM to create your real-time financial advisor
↳🔗 Check out the Hands-on LLMs course and support it with a ⭐.
#1. Inference pipeline video lesson
We released the final video lesson of the Hands-on LLMs FREE course, which will teach you how to build & deploy an inference pipeline for a financial advisor using LangChain, LLMOps, and vector DBs.
Here are the key topics covered in the video lesson made by Pau Labarta and me ↓
1. Overview of the architecture of the inference pipeline and how to apply LLMOps good practices
2. How to build a RAG agent from scratch using LangChain: ContextExtractorChain + FinancialBotQAChain
3. How to attach a callback class to log input prompts and LLM answers to Comet LLMOps
4. Setting up and running the code locally
5. Deploying the inference pipeline to Beam as a RESTful API
.
Curious?
Check out the video lesson Pau and I did ↓

#2. What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs?
Here are its 7 key components ↓
1. vector DB populated with financial news: This is the output of the feature pipeline. More concretely, a Qdrant vector DB populated with chunks of financial news from Alpaca. During the inference pipeline, we will use it to query valuable chunks of information and do RAG.
2. embedding language model: To embed the user question and query the vector DB, you need the same embedding model used in the feature pipeline, more concretely `all-MiniLM-L6-v2` from `sentence-transformers`. Using the same encoder-only model is crucial, as the query vector and vector DB index vectors have to be in the same space (see the retrieval sketch after this list).
3. fine-tuned open-source LLM: The output of the training pipeline is a Falcon-7B model fine-tuned on financial tasks.
4. model registry: The fine-tuned model will be shared between the training & inference pipelines through Comet's model registry. By doing so, you entirely decouple the two components, and the model can easily be shared under specific environments (e.g., staging, prod) and versions (e.g., v1.0.1).
5. a framework for LLM applications: You need LangChain, as your LLM framework, to glue all the steps together, such as querying the vector DB, storing the history of the conversation, creating the prompt, and calling the LLM. LangChain provides out-of-the-box solutions to chain all these steps together quickly.
6. deploy the LLM app as a RESTful API: One of the final steps is to deploy your awesome LLM financial assistant under a RESTful API. You can quickly do this using Beam as your serverless infrastructure provider. Beam specializes in DL. Thus, it offers quick ways to load your LLM application on GPU machines and expose it under a RESTful API.
7. prompt monitoring: The last step is to add eyes on top of your system. You can do this using Comet's LLMOps features, which allow you to track & monitor all the prompts & responses of the system.
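To make components 1 and 2 concrete, here is a minimal retrieval sketch, assuming a local Qdrant instance; the collection name and the `"text"` payload field are hypothetical placeholders, not the course's exact values:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# The same encoder-only model the feature pipeline used to index the news.
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Placeholder connection details -- adapt to your setup.
qdrant = QdrantClient(url="http://localhost:6333")

def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    """Embed the user question and fetch the most similar news chunks."""
    query_vector = embedding_model.encode(question).tolist()
    hits = qdrant.search(
        collection_name="alpaca_financial_news",  # hypothetical collection name
        query_vector=query_vector,
        limit=top_k,
    )
    return [hit.payload["text"] for hit in hits]  # assumes a "text" payload field
```

The returned chunks become the context that is later injected into the prompt (component 5 wires this step into the chain).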
↳🔗 Check out how these components work together in our Hands-on LLMs free course.
#3. How can you build & deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices?
How can you build & deploy an inference pipeline for a real-time financial advisor with LangChain, powered by LLMs & vector DBs, while considering good LLMOps practices?
.
As a quick reminder from previous posts, here is what we already have:
- a Qdrant vector DB populated with financial news (the output of the feature pipeline)
- fine-tuned Falcon-7B LoRA weights stored in Comet's model registry (the output of the training pipeline)
The Qdrant vector DB is accessed through a Python client.
A specific version of the Falcon-7B LoRA weights is downloaded from Comet's model registry and loaded in memory using QLoRA.
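For illustration, here is a hedged sketch of that loading step; the workspace, registry name, version, and base-model ID below are placeholders, not the course's exact values:

```python
import torch
from comet_ml import API
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Download a specific version of the LoRA weights from Comet's model registry.
api = API()
api.download_registry_model(
    workspace="my-workspace",                       # placeholder
    registry_name="financial-assistant-falcon-7b",  # placeholder
    version="1.0.1",
    output_path="./lora_weights",
)

# Load the base Falcon-7B in 4-bit (QLoRA-style) and attach the LoRA adapter.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",  # assumed base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./lora_weights")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
```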
The goal of the inference pipeline is to use LangChain to glue the 2 components into a single `FinancialAssistant` entity.
.
The `FinancialAssistant` entity is deployed in a request-response fashion under a RESTful API. We used Beam to deploy it quickly under a serverless web endpoint.
Deploying any model as a RESTful API using Beam is as easy as writing the following Python decorator:

```python
@financial_bot.rest_api(keep_warm_seconds=300, loader=load_bot)
def run(**inputs):
    ...
```
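The decorator alone hides some setup. Below is a hedged sketch of what the surrounding pieces might look like; the `App`/`Runtime` configuration, GPU type, and the `FinancialAssistant` import path are assumptions, not the course's exact code:

```python
from beam import App, Image, Runtime

# Hypothetical Beam app definition -- name, GPU, and packages are assumptions.
financial_bot = App(
    name="financial_bot",
    runtime=Runtime(
        gpu="T4",
        image=Image(python_packages=["transformers", "peft", "langchain", "qdrant-client"]),
    ),
)

def load_bot():
    # Runs once per container, so the heavy lifting (loading the LLM, the
    # embedding model, and the Qdrant client) happens before any request.
    from financial_bot import FinancialAssistant  # hypothetical import path
    return FinancialAssistant()

@financial_bot.rest_api(keep_warm_seconds=300, loader=load_bot)
def run(**inputs):
    # Beam passes the loader's return value to the handler via "context".
    bot = inputs["context"]
    return {"answer": bot.answer(inputs["question"])}
```

With `keep_warm_seconds=300`, Beam keeps the container warm for 5 minutes after each request, so consecutive calls skip the expensive model-loading step.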
Now, let's understand the flow of the `FinancialAssistant` chain ↓
1. Clean the user's input prompt and use a pre-trained `all-MiniLM-L6-v2` encoder-only model to embed it (the same model used to populate the vector DB).
2. Using the embedded user input, query the Qdrant vector DB and extract the top 3 most similar financial news chunks based on the cosine similarity distance.
→ These 2 steps were necessary to do RAG. If you don't know how RAG works, check out Lesson 3.
3. Build the final prompt using a `PromptTemplate` class (the same one used for training) that formats the following components:
- a system prompt
- the user's input prompt
- the financial news context
- the chat history
4. Now that our prompt contains all the necessary data, we pass it to the fine-tuned Falcon-7B LLM for the final answer (the sketch after these steps shows how the whole chain is glued together).
The input prompt and LLM answer will be logged and monitored by Comet LLMOps.
5. You can get the answer in one shot or use the `TextIteratorStreamer` class (from HuggingFace) to stream it token-by-token.
6. Store the user's input prompt and LLM answer in the chat history.
7. Pass the final answer to the client.
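As mentioned in step 4, here is a minimal sketch of how these steps could be glued together with LangChain's `SequentialChain`, reusing the `ContextExtractorChain` and `FinancialBotQAChain` classes from the video lesson. The import path, constructor arguments, and input variable names are illustrative assumptions, and `embedding_model`, `qdrant`, `llm_pipeline`, and `prompt_template` are the objects built earlier:

```python
import comet_llm
from langchain.chains import SequentialChain

# Custom chains from the course repo -- hypothetical import path.
from financial_bot.chains import ContextExtractorChain, FinancialBotQAChain

# Steps 1 & 2: clean + embed the question and retrieve the top-k news chunks.
context_chain = ContextExtractorChain(
    embedding_model=embedding_model,  # all-MiniLM-L6-v2
    vector_store=qdrant,              # Qdrant client
    top_k=3,
)
# Steps 3 & 4: build the prompt from its components and call Falcon-7B.
qa_chain = FinancialBotQAChain(hf_pipeline=llm_pipeline, template=prompt_template)

assistant_chain = SequentialChain(
    chains=[context_chain, qa_chain],
    input_variables=["about_me", "question", "chat_history"],  # assumed names
    output_variables=["answer"],
)

question = "Should I buy Tesla shares this quarter?"
answer = assistant_chain.run(
    about_me="I am a 30-year-old investor.", question=question, chat_history=[]
)

# Step 4 (monitoring): log the prompt/answer pair to Comet's LLMOps dashboard.
comet_llm.log_prompt(prompt=question, output=answer)
```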
Note: You can use the `TextIteratorStreamer` class & wrap the `FinancialAssistant` under a WebSocket (instead of the RESTful API) to stream the answer of the bot token by token, similar to what you see in the ChatGPT interface.
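For the streaming variant from the note above, a minimal sketch with Hugging Face's `TextIteratorStreamer` could look like this, assuming the `model` and `tokenizer` loaded earlier; generation runs in a background thread while tokens are consumed (and, e.g., pushed over a WebSocket) as they arrive:

```python
from threading import Thread

from transformers import TextIteratorStreamer

# skip_prompt=True yields only newly generated tokens, not the input prompt.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Should I buy Tesla shares this quarter?", return_tensors="pt").to(model.device)

# model.generate() blocks until done, so run it in a background thread and
# iterate over the streamer to consume tokens as they are produced.
thread = Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256})
thread.start()
for token in streamer:
    print(token, end="", flush=True)  # or: await websocket.send_text(token)
thread.join()
```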
↳🔗 Check out the Hands-on LLMs course and support it with a ⭐.
That's it for today!
With this, we concluded the Hands-on LLMs series. I hope you enjoyed it 🔥
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).