DML: This is what you need to build an inference pipeline for a financial assistant powered by LLMs, vector DBs and LLMOps
Lesson 9 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋🏼
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
This is the last lesson within the Hands-on LLMs series... but certainly not the last MLE & MLOps series; we are cooking up some exciting stuff. I hope you had fun and learned a lot along the way.
Now, let's see how to glue everything we have done so far under the inference pipeline. Enjoy! 🧁
Table of Contents:
Inference pipeline video lesson
What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs?
How can you build & deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices?
Previous Lessons:
Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?
Lesson 7: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?
Lesson 8: 7-steps on how to fine-tune an open-source LLM to create your real-time financial advisor
↳🔗 Check out the Hands-on LLMs course and support it with a ⭐.
#1. Inference pipeline video lesson
We **released the final video lesson** of the **Hands-on LLMs** FREE course, which will teach you how to **build & deploy an inference pipeline** for a financial advisor using **LangChain**, **LLMOps**, and **vector DBs**.
Here are the key topics covered in the video lesson that Pau Labarta and I made ↓
1. Overview of the architecture of the inference pipeline and how to apply LLMOps good practices
2. How to build a RAG agent from scratch using LangChain: ContextExtractorChain + FinancialBotQAChain
3. How to attach a callback class to log input prompts and LLM answers to Comet LLMOps (a minimal sketch follows this list)
4. Setting up and running the code locally
5. Deploying the inference pipeline to Beam as a RESTful API
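To give you a feel for topic 3, here is a minimal sketch of a LangChain callback that logs prompt/answer pairs to Comet's LLM tooling. The class name is made up and the handler used in the course differs; `comet_llm.log_prompt` is the public logging call, assuming your Comet credentials are set in the environment.

```python
import comet_llm
from langchain.callbacks.base import BaseCallbackHandler


class CometLLMMonitoringHandler(BaseCallbackHandler):
    """Hypothetical callback that logs every prompt/answer pair to Comet."""

    def on_llm_start(self, serialized, prompts, **kwargs):
        # Keep the prompts around so they can be paired with the answers later.
        self._prompts = prompts

    def on_llm_end(self, response, **kwargs):
        # `response.generations` holds one list of generations per input prompt.
        for prompt, generations in zip(self._prompts, response.generations):
            comet_llm.log_prompt(prompt=prompt, output=generations[0].text)
```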
Curious?
Check out the video lesson Pau Labarta and I did ↓
#2. What do you need to build an inference pipeline for a financial assistant powered by LLMs and vector DBs?
Here are its 7 key components ↓
1. **vector DB populated with financial news**: This is the output of the feature pipeline. More concretely, a Qdrant vector DB populated with chunks of financial news from Alpaca. During the inference pipeline, we will use it to query valuable chunks of information and do RAG.
2. **embedding language model**: To embed the user question and query the vector DB, you need the same embedding model used in the feature pipeline, more concretely `all-MiniLM-L6-v2` from `sentence-transformers`. Using the same encoder-only model is crucial, as the query vector and the vector DB's index vectors have to live in the same vector space (see the sketch after this list).
3. **fine-tuned open-source LLM**: The output of the training pipeline is a Falcon-7B model fine-tuned on financial tasks.
4. **model registry**: The fine-tuned model will be shared between the training & inference pipelines through Comet’s model registry. By doing so, you decouple the two components entirely, and the model can easily be shared under specific environments (e.g., staging, prod) and versions (e.g., v1.0.1).
5. **a framework for LLM applications**: You need LangChain, as your LLM framework, to glue all the steps together, such as querying the vector DB, storing the history of the conversation, creating the prompt, and calling the LLM. LangChain provides out-of-the-box solutions to chain all these steps together quickly.
6. **deploy the LLM app as a RESTful API**: One of the final steps is to deploy your awesome LLM financial assistant under a RESTful API. You can quickly do this using Beam as your serverless infrastructure provider. Beam specializes in DL. Thus, it offers quick ways to load your LLM application on GPU machines and expose it under a RESTful API.
7. **prompt monitoring**: The last step is to add eyes on top of your system. You can do this using Comet’s LLMOps features, which allow you to track & monitor all the prompts & responses of the system.
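Here is the sketch promised above for components 1 and 2: embed the user question with the same `all-MiniLM-L6-v2` model and query Qdrant for the most similar news chunks. The collection name, URL, and payload field are assumptions, not the course's exact values.

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# The same encoder-only model used by the feature pipeline to index the news.
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical connection details; the course reads them from the environment.
qdrant = QdrantClient(url="http://localhost:6333")


def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    # Embed the question in the same vector space as the indexed news chunks.
    query_vector = embedding_model.encode(question).tolist()
    hits = qdrant.search(
        collection_name="alpaca_financial_news",  # assumed collection name
        query_vector=query_vector,
        limit=top_k,
    )
    return [hit.payload["text"] for hit in hits]  # assumed payload field
```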
↳🔗 Check out how these components work together in our Hands-on LLMs free course.
#3. How can you build & deploy an inference pipeline for a real-time financial advisor while considering good LLMOps practices?
How can you **build & deploy an inference pipeline** for a real-time financial advisor with **LangChain**, powered by **LLMs & vector DBs**, while considering **good LLMOps practices**?
As a quick reminder from previous posts, here is what we already have:
- a Qdrant vector DB populated with financial news (the output of the feature pipeline)
- fine-tuned Falcon-7B LoRA weights stored in Comet’s model registry (the output of the training pipeline)
The Qdrant vector DB is accessed through a Python client.
A specific version of the Falcon-7B LoRA weights is downloaded from Comet’s model registry and loaded in memory using QLoRA.
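As a rough sketch of that step (with placeholder workspace, model, and version names; the base checkpoint and the course's helper code may differ), downloading the LoRA weights from Comet and loading them on top of a 4-bit Falcon-7B could look like this:

```python
import torch
from comet_ml import API
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Download a specific version of the fine-tuned LoRA weights (placeholder names).
model_asset = API().get_model(workspace="my-workspace", model_name="financial_assistant")
model_asset.download(version="1.0.1", output_folder="./lora_weights")

# Load the base Falcon-7B checkpoint quantized to 4 bits (QLoRA-style inference).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")

# Attach the fine-tuned LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(base_model, "./lora_weights")
```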
The goal of the inference pipeline is to use LangChain to glue the 2 components into a single `FinancialAssistant` entity.
The `FinancialAssistant` entity is deployed in a request-response fashion under a RESTful API. We used Beam to deploy it quickly under a serverless web endpoint.
Deploying any model as a RESTful API with Beam is as easy as writing the following Python decorator:

```python
@financial_bot.rest_api(keep_warm_seconds=300, loader=load_bot)
def run(**inputs):
    ...
```
Now let’s understand the flow of the `FinancialAssistant` chain ↓
1. Clean the user’s input prompt and use a pre-trained “all-MiniLM-L6-v2” encoder-only model to embed it (the same LM used to populate the vector DB).
2. Using the embedded user input, query the Qdrant vector DB and extract the top 3 most similar financial news chunks based on the cosine similarity distance.
→ These 2 steps were necessary to do RAG. If you don’t know how RAG works, check out Lesson 3.
3. Build the final prompt using a `PromptTemplate` class (the same one used for training) that formats the following components (a sketch follows this list):
- a system prompt
- the user’s input prompt
- the financial news context
- the chat history
4. Now that our prompt contains all the necessary data, we pass it to the fine-tuned Falcon-7B LLM for the final answer.
The input prompt and LLM answer will be logged and monitored by Comet LLMOps.
5. You can get the answer in one shot or use the `TextIteratorStreamer` class (from HuggingFace) to stream it token-by-token.
6. Store the user’s input prompt and LLM answer in the chat history.
7. Pass the final answer to the client.
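Putting steps 3 through 7 together, and reusing the `retrieve_context`, `tokenizer`, and `model` objects from the sketches above, a stripped-down version of the chain could look like this; the template wording and the system prompt are illustrative, not the exact ones used for training:

```python
import comet_llm
from langchain.prompts import PromptTemplate

# Illustrative prompt template; the course reuses the exact template from training.
template = PromptTemplate(
    input_variables=["system_prompt", "news_context", "chat_history", "question"],
    template=(
        "{system_prompt}\n"
        "### Context:\n{news_context}\n"
        "### Chat history:\n{chat_history}\n"
        "### Question:\n{question}\n"
        "### Answer:\n"
    ),
)

chat_history: list[tuple[str, str]] = []


def answer(question: str) -> str:
    # Steps 1-2: embed the question and retrieve the top 3 news chunks (RAG).
    news_context = "\n".join(retrieve_context(question, top_k=3))

    # Step 3: build the final prompt from the system prompt, context, history & question.
    prompt = template.format(
        system_prompt="You are a helpful financial assistant.",
        news_context=news_context,
        chat_history="\n".join(f"User: {q}\nAssistant: {a}" for q, a in chat_history),
        question=question,
    )

    # Step 4: generate the answer with the fine-tuned Falcon-7B model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    response = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

    # Log the prompt/answer pair to Comet for monitoring.
    comet_llm.log_prompt(prompt=prompt, output=response)

    # Step 6: store the exchange in the chat history.
    chat_history.append((question, response))

    # Step 7: return the final answer to the client.
    return response
```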
Note: You can use the `TextIteratorStreamer` class & wrap the `FinancialAssistant` under a WebSocket (instead of the RESTful API) to stream the bot's answer token by token, similar to what you see in the ChatGPT interface.
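For the streaming variant, here is a minimal sketch using Hugging Face's `TextIteratorStreamer` (reusing the `tokenizer` and `model` from above); pushing each chunk over the WebSocket is left to the serving layer.

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_answer(prompt: str):
    # The streamer yields decoded text chunks as soon as the model produces them.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Run generation in a background thread so tokens can be consumed immediately.
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=256)
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    for token in streamer:
        # e.g., send each chunk to the client over the WebSocket connection.
        yield token
```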
↳🔗 Check out the Hands-on LLMs course and support it with a ⭐.
That’s it for today 👾
With this, we concluded the Hands-On LLMs series. I hope you enjoyed it 🔥
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
Paul
Whenever you’re ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).