DML: How to design an LLM system for a financial assistant using the 3-pipeline design
Lesson 1 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋🏼
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
As promised, starting this week, we will begin the series based on the Hands-on LLMs FREE course.
Note that this is not the course itself. It is an overview for busy people, focused only on the key aspects.
The entire course will soon be available on 🔗 GitHub.
Table of Contents:
What is the 3-pipeline design
How to apply the 3-pipeline design in architecting a financial assistant powered by LLMs
The tech stack used to build an end-to-end LLM system for a financial assistant
As the Hands-on LLMs course is still a work in progress, we want to keep you updated on our progress ↓
↳ Thus, we opened up the discussions tab under the course's GitHub Repository, where we will keep you updated with everything that is happening.
Also, if you have any ideas, suggestions, questions or want to chat, we encourage you to create a "new discussion".
❗ We want the course to fill your real needs ❗
↳ Hence, if your suggestion fits well with our hands-on course direction, we will consider implementing it.
Check it out and leave a ⭐ if you like what you see:
↳🔗 Hands-on LLMs course
#1. What is the 3-pipeline design
We all know how messy ML systems can get. That is where the 3-pipeline architecture kicks in.
The 3-pipeline design is a way to bring structure & modularity to your ML system and improve your MLOps processes.
This is how ↓
=== Problem ===
Despite advances in MLOps tooling, transitioning from prototype to production remains challenging.
In 2022, only 54% of models made it into production. Ouch.
So what happens?
Sometimes the model is not mature enough, sometimes there are some security risks, but most of the time...
...the architecture of the ML system is built with research in mind, or the ML system becomes a massive monolith that is extremely hard to refactor from offline to online.
So, good processes and a well-defined architecture are as crucial as good tools and models.
=== Solution ===
The 3-pipeline architecture.
First, let's understand what the 3-pipeline design is.
It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components:
1. the feature pipeline
2. the training pipeline
3. the inference pipeline
...also known as the Feature/Training/Inference (FTI) architecture.
.
#1. The feature pipeline transforms your data into features & labels, which are stored and versioned in a feature store.
#2. The training pipeline ingests a specific version of the features & labels from the feature store and outputs the trained models, which are stored and versioned inside a model registry.
#3. The inference pipeline takes a given version of the features and trained models and outputs the predictions to a client.
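To make the interface concrete, here is a minimal, runnable sketch of the three entry points. Everything in it is hypothetical: in-memory dicts stand in for a real feature store and model registry, and trivial arithmetic stands in for the actual transformations, training, and predictions.

```python
# An illustrative sketch of the FTI interfaces (all names are hypothetical;
# in-memory dicts stand in for a real feature store and model registry).
feature_store: dict = {}
model_registry: dict = {}

def feature_pipeline(raw_data: list[float]) -> None:
    # Transform raw data into features and store them under a version.
    features = [x * 2 for x in raw_data]  # stand-in transformation
    feature_store["features_v1"] = features

def training_pipeline(features_version: str) -> None:
    # Ingest a specific feature version; output a versioned "model".
    features = feature_store[features_version]
    model = sum(features) / len(features)  # stand-in for training
    model_registry["model_v1"] = model

def inference_pipeline(features_version: str, model_version: str) -> list[float]:
    # Combine a feature version and a model version into predictions.
    features = feature_store[features_version]
    model = model_registry[model_version]
    return [model + x for x in features]  # stand-in for predict

feature_pipeline([1.0, 2.0, 3.0])
training_pipeline("features_v1")
print(inference_pipeline("features_v1", "model_v1"))
```

The point is the contract: the pipelines never call each other directly; they communicate only through the versioned artifacts in the feature store and the model registry.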
.
This is why the 3-pipeline design is so beautiful:
- it is intuitive
- it brings structure, as on a higher level, all ML systems can be reduced to these 3 components
- it defines a transparent interface between the 3 components, making it easier for multiple teams to collaborate
- the ML system is built with modularity in mind from the start
- the 3 components can easily be divided between multiple teams (if necessary)
- every component can use the best stack of technologies available for the job
- every component can be deployed, scaled, and monitored independently
- the feature pipeline can easily be either batch, streaming or both
But the most important benefit is that...
...by following this pattern, you can be confident that your ML model will move out of your notebooks into production.
What do you think about the 3-pipeline architecture? Have you used it?
If you want to know more about the 3-pipeline design, I recommend this awesome article from Hopsworks ↓
↳🔗 From MLOps to ML Systems with Feature/Training/Inference Pipelines
#2. How to apply the 3-pipeline design in architecting a financial assistant powered by LLMs
Building ML systems is hard, right? Wrong.
Here is how the 3-pipeline design can make architecting the ML system for a financial assistant easy ↓
.
I already covered the concepts of the 3-pipeline design in my previous post, but here is a quick recap:
"""
It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components:
1. the feature pipeline
2. the training pipeline
3. the inference pipeline
...also known as the Feature/Training/Inference (FTI) architecture.
"""
.
Now, let's see how you can use the FTI architecture to build a financial assistant powered by LLMs ↓
#1. Feature pipeline
The feature pipeline is designed as a streaming pipeline, deployed to AWS, that extracts real-time financial news from Alpaca and:
- cleans and chunks the news documents
- embeds the chunks using an encoder-only LM
- loads the embeddings + their metadata into a vector DB
In this architecture, the vector DB acts as the feature store.
The vector DB will stay in sync with the latest news to attach real-time context to the LLM using RAG.
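As a rough illustration of the embed & load steps, here is a hedged sketch. The chunk format, embedding model, and collection name are my assumptions, and the real pipeline streams documents through Bytewax instead of batching them like this:

```python
# A minimal sketch of the embed & load steps of the feature pipeline.
# Assumptions: chunk dicts with a "text" field, an "all-MiniLM-L6-v2" encoder,
# and an already-created "financial_news" Qdrant collection.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # encoder-only LM
client = QdrantClient(url="http://localhost:6333")

def ingest(chunks: list[dict]) -> None:
    """Embed cleaned news chunks and upsert them + their metadata into the vector DB."""
    vectors = embedder.encode([c["text"] for c in chunks])
    points = [
        PointStruct(id=i, vector=vector.tolist(), payload=chunk)
        for i, (vector, chunk) in enumerate(zip(vectors, chunks))
    ]
    client.upsert(collection_name="financial_news", points=points)
```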
#2. Training pipeline
The training pipeline is split into 2 main steps:
↳ Q&A dataset semi-automated generation step
It takes the vector DB (feature store) and a set of predefined questions (manually written) as input.
Then you:
- use RAG to inject the context alongside the predefined questions
- use a large & powerful model, such as GPT-4, to generate the answers
- save the generated dataset under a new version
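Here is a sketch of what a single generation call could look like (the prompt and the function are illustrative assumptions, not the course's actual code):

```python
# Hypothetical sketch of generating one training sample with a teacher model.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(question: str, context: str) -> str:
    """Answer a predefined question using the RAG context retrieved from the vector DB."""
    prompt = (
        "You are a financial expert. Answer the question using only the news "
        f"context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```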
↳ Fine-tuning step
- download a pre-trained LLM from Huggingface
- load the LLM using QLoRA
- preprocess the generated Q&A dataset into the format expected by the LLM
- fine-tune the LLM
- push the best QLoRA weights (model) to a model registry
- deploy it using a serverless solution as a continuous training pipeline
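Here is a minimal sketch of the QLoRA loading step (the base model and hyperparameters are my assumptions; the full fine-tuning loop lives in the course):

```python
# A hedged sketch of loading an LLM with QLoRA (4-bit base + LoRA adapters).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                     # assumed base LLM
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable
model.print_trainable_parameters()
```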
#3. Inference pipeline
The inference pipeline is the financial assistant that the clients actively use.
It uses the vector DB (feature store) and QLoRA weights (model) from the model registry in the following way:
- download the pre-trained LLM from Huggingface
- load the LLM using the fine-tuned QLoRA weights
- connect the LLM and vector DB into a chain
- use RAG to add relevant financial news from the vector DB
- deploy it using a serverless solution under a RESTful API
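To make the wiring concrete, here is a hedged sketch using the classic LangChain API (the base model, adapter path, and collection name are assumptions, not the course's actual values):

```python
# A minimal sketch of the inference pipeline: pre-trained LLM + QLoRA weights
# + Qdrant, chained together with LangChain for RAG.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Qdrant
from peft import PeftModel
from qdrant_client import QdrantClient
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# 1. Download the pre-trained LLM and load the fine-tuned QLoRA weights on top.
base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", device_map="auto")
model = PeftModel.from_pretrained(base, "registry/financial-assistant-qlora")  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
llm = HuggingFacePipeline(
    pipeline=pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
)

# 2. Connect the LLM and the vector DB into a chain that does RAG.
vector_db = Qdrant(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="financial_news",
    embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=vector_db.as_retriever())

print(chain.run("What moved tech stocks today?"))
```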
Here are the main benefits of using the FTI architecture:
- it defines a transparent interface between the 3 modules
- every component can use different technologies to implement and deploy the pipeline
- the 3 pipelines are loosely coupled through the feature store & model registry
- every component can be scaled independently
See this architecture in action in my 🔗 Hands-on LLMs FREE course.
#3. The tech stack used to build an end-to-end LLM system for a financial assistant
The tools are divided based on the 3-pipeline (aka FTI) architecture:
=== Feature Pipeline ===
What do you need to build a streaming pipeline?
✓ streaming processing framework: Bytewax (brings the speed of Rust into our beloved Python ecosystem)
✓ parse, clean, and chunk documents: unstructured
✓ validate document structure: pydantic
✓ encoder-only language model: HuggingFace sentence-transformers, PyTorch
✓ vector DB: Qdrant
✓ deploy: Docker, AWS
✓ CI/CD: GitHub Actions
=== Training Pipeline ===
What do you need to build a fine-tuning pipeline?
✓ pretrained LLM: HuggingFace Hub
✓ parameter-efficient fine-tuning: peft (= LoRA)
✓ quantization: bitsandbytes (= QLoRA)
✓ training: HuggingFace transformers, PyTorch, trl
✓ distributed training: accelerate
✓ experiment tracking: Comet ML
✓ model registry: Comet ML
✓ prompt monitoring: Comet ML
✓ continuous training serverless deployment: Beam
=== Inference Pipeline ===
What do you need to build a financial assistant?
✓ framework for developing applications powered by language models: LangChain
✓ model registry: Comet ML
✓ inference: HuggingFace transformers, PyTorch, peft (to load the LoRA weights)
✓ quantization: bitsandbytes
✓ distributed inference: accelerate
✓ encoder-only language model: HuggingFace sentence-transformers
✓ vector DB: Qdrant
✓ prompt monitoring: Comet ML
✓ RESTful API serverless service: Beam
.
As you can see, some tools overlap between the FTI pipelines, but not all.
This is the beauty of the 3-pipeline design, as every component represents a different entity for which you can pick the best stack to build, deploy, and monitor.
You can go wild and use TensorFlow in one of the components if you want your colleagues to hate you 😂
See the tools in action in my 🔗 Hands-on LLMs FREE course.
That's it for today 👾
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for Lesson 2 of the Hands-on LLMs series 🔥
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).