DML: What do you need to fine-tune an open-source LLM to create your financial advisor?
Lesson 6 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin.
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time.
Table of Contents:
The difference between encoders, decoders, and encoder-decoder LLMs.
You must know these 3 main stages of training an LLM to train your own LLM on your proprietary data.
What do you need to fine-tune an open-source LLM to create your own financial advisor?
↳ Check out the Hands-on LLMs course and support it with a ⭐.
#1. The difference between encoders, decoders, and encoder-decoder LLMs
Let's see when to use each architecture ↓
Embeddings are everywhere: both encoders and decoders use self-attention layers to encode word tokens into embeddings.
The devil is in the details. Let's clarify them ↓
The Original Transformer
It is an encoder-decoder setup. The encoder processes the input text and hands off its understanding as embeddings to the decoder, which will generate the final output.
The key difference between an encoder & a decoder is in how each processes its inputs & outputs.
=== Encoders ===
The role of an encoder is to extract relevant information from the whole input and encode it into an embedding (e.g., BERT, RoBERTa).
Within the "Multi-head attention" of the transformer, all the tokens are allowed to speak to each other.
A token at position t can talk to all other tokens: the previous tokens [0, t-1] and the future tokens [t+1, T]. This means that the attention mask is computed over the whole sequence.
Thus, because the encoder processes the whole input, it is helpful for classification tasks (e.g., sentiment analysis) and creates embeddings for clustering, recommender systems, vector DB indexes, etc.
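For intuition, here is a minimal sketch (assuming the `transformers` and `torch` packages) of using an encoder such as BERT to turn a sentence into a single embedding you could feed to a classifier, a clustering algorithm, or a vector DB:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I love this stock!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Every token attends to every other token, so each hidden state already
# "sees" the whole input. Mean-pooling them gives one sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
```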
=== Decoders ===
On the flip side, if you want to generate text, use decoder-only models (e.g., GPT family).
Only the current and previous tokens (not the whole input) are used to predict the next token.
Within the "Masked Multi-head attention," the future positions are masked to maintain the autoregressive property of the decoding process.
For example, within the "Masked Multi-head attention," instead of all the tokens talking to each other, a token at position t will have access only to previous tokens at positions t-1, t-2, t-3, ..., 0.
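Here is a minimal sketch of that causal mask in plain PyTorch (the attention scores are illustrative, not from a real model): future positions are set to -inf before the softmax, so a token can never attend forward.

```python
import torch

T = 5                                           # sequence length
scores = torch.randn(T, T)                      # raw attention scores (illustrative)
future = torch.ones(T, T).triu(diagonal=1).bool()  # True above the diagonal = future positions
scores = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(scores, dim=-1)         # row t puts zero weight on tokens > t
```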
=== Encoder-decoder ===
This technique is used when you have to understand the entire input sequence (encoder) and the previously generated sequence (decoder -> autoregressive).
Typical use cases are text translation & summarization (the original transformer was built for text translation), where the output heavily relies on the input.
Why? Because the decoding step always has to be conditioned by the encoded information. Through what is known as cross-attention, the decoder queries the encoder's output to guide the decoding process.
For example, when translating English to Spanish, every Spanish token predicted is conditioned by the previously predicted Spanish tokens & the entire English sentence.
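As a sketch, this is how an encoder-decoder translation model runs with the `transformers` package (the public MarianMT English-to-Spanish checkpoint is used as an example; it also needs `sentencepiece` installed):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The market closed higher today.", return_tensors="pt")
# The encoder embeds the whole English sentence; the decoder generates
# Spanish tokens autoregressively, querying the encoder output through
# cross-attention at every step.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```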
To conclude...
- a decoder takes as input previous tokens and predicts the next one (in an autoregressive way)
- by dropping the "Masked" logic from the "Masked Multi-head attention," you process the whole input, transforming the decoder into an encoder
- if you hook the encoder to the decoder through a cross-attention layer, you have an encoder-decoder architecture
#2. You must know these 3 main stages of training an LLM to train your own LLM on your proprietary data
# Stage 1: Pretraining for completion
You start with a bare, randomly initialized LLM.
This stage aims to teach the model to spit out tokens. More concretely, based on previous tokens, the model learns to predict the next token with the highest probability.
For example, your input to the model is "The best programming language is ___", and it will answer, "The best programming language is Rust."
Intuitively, at this stage, the LLM learns to speak.
Data: >1 trillion tokens (~= 15 million books). The data quality doesn't have to be great; hence, you can scrape it from the internet.
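A minimal sketch of the next-token objective (GPT-2 stands in here for any decoder-only LLM; assumes the `transformers` and `torch` packages): by setting the labels equal to the inputs, the model is trained to predict token t+1 from tokens 0..t.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The best programming language is", return_tensors="pt")
# labels == input_ids -> standard next-token prediction (causal LM) loss
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # cross-entropy of the next-token predictions
```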
# Stage 2: Supervised fine-tuning (SFT) for dialogue
You start with the pretrained model from stage 1.
This stage aims to teach the model to respond to the user's questions.
For example, without this step, when prompting: "What is the best programming language?", it has a high probability of creating a series of questions such as: "What is MLOps? What is MLE? etc."
As the model mimics the training data, you must fine-tune it on Q&A (questions & answers) data to align the model to respond to questions instead of predicting the following tokens.
After the fine-tuning step, when prompted, "What is the best programming language?", it will respond, "Rust".
Data: 10K - 100K Q&A examples
Note: After aligning the model to respond to questions, you can further fine-tune it on Q&A data for a specific use case (single-task fine-tuning) to specialize the LLM.
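For intuition, a single SFT sample could look like the sketch below (the prompt template is an assumption; adapt it to whatever format your model expects):

```python
sample = {
    "question": "What is the best programming language?",
    "answer": "Rust.",
}

# The sample is rendered into a prompt and trained on with the usual
# next-token objective, so the model learns to answer the question
# instead of continuing with more questions.
prompt = (
    "### Question:\n"
    f"{sample['question']}\n\n"
    "### Answer:\n"
    f"{sample['answer']}"
)
```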
# Stage 3: Reinforcement learning from human feedback (RLHF)
Demonstration data tells the model what kind of responses to give but doesn't tell the model how good or bad a response is.
The goal is to align your model with user feedback (what users liked or didn't like) to increase the probability of generating answers that users find helpful.
RLHF is split in 2:
1. Using the LLM from stage 2, train a reward model to act as a scoring function using (prompt, winning_response, losing_response) samples (= comparison data). The reward model learns to maximize the difference between the scores of the winning and losing responses. After training, it outputs a reward for any (prompt, response) tuple.
Data: 100K - 1M comparisons
2. Use an RL algorithm (e.g., PPO) to fine-tune the LLM from stage 2. Here, you will use the reward model trained above to score every (prompt, response) pair. The RL algorithm aligns the LLM to generate responses with higher rewards, increasing the probability of generating answers that users liked.
Data: 10K - 100K prompts
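To make the comparison data concrete, here is a sketch of one sample (the field names are illustrative, not a specific library's schema):

```python
comparison_sample = {
    "prompt": "Should I put all my savings into a single stock?",
    "winning_response": "Concentrating all your savings in one stock is risky; "
                        "diversifying across assets usually reduces that risk.",
    "losing_response": "Yes, go all in.",
}

# The reward model is trained so that:
#   reward(prompt, winning_response) > reward(prompt, losing_response)
# PPO then fine-tunes the LLM to generate responses the reward model scores highly.
```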
Note: Post inspired by Chip Huyen's "RLHF: Reinforcement Learning from Human Feedback" article.
#3. What do you need to fine-tune an open-source LLM to create your own financial advisor?
This is the LLM fine-tuning kit you must know ↓
Dataset
The key component of any successful ML project is the data.
You need a 100 - 1000 sample Q&A (questions & answers) dataset with financial scenarios.
The best approach is to hire a bunch of experts to create it manually.
But, for a PoC, that might get expensive & slow.
The good news is that a method called "finetuning with distillation" exists.
In a nutshell, this is how it works: "Use a big & powerful LLM (e.g., GPT-4) to generate your fine-tuning data. Afterward, use this data to fine-tune a smaller model (e.g., Falcon 7B)."
For specializing smaller LLMs on specific use cases (e.g., financial advisors), this is an excellent method to kick off your project.
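A minimal sketch of the idea, assuming the `openai` Python package (>= 1.0) and an API key in the environment; the prompts and the topic are placeholders you would adapt to your financial scenarios:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pair(topic: str) -> str:
    """Ask the teacher LLM for one financial Q&A training example."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert financial advisor."},
            {"role": "user", "content": f"Write one question a retail investor "
                                        f"might ask about {topic}, then answer it."},
        ],
    )
    return response.choices[0].message.content

print(generate_qa_pair("index funds"))
```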
Pre-trained open-source LLM
You rarely (if ever) want to start training your LLM from scratch.
Why? Because you need trillions of tokens & millions of $$$ in compute power.
You want to fine-tune your LLM on your specific task.
The good news is that you can find a plethora of open-source LLMs on HuggingFace (e.g., Falcon, LLaMA, etc.).
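Loading one is only a few lines with `transformers` (a sketch using the Falcon-7B checkpoint; `device_map="auto"` assumes the `accelerate` package is installed):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # spread the layers across the available GPUs
)
```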
Parameter-efficient fine-tuning
As LLMs are big... duh...
... they don't fit on a single GPU.
As you want only to fine-tune the LLM, the community invented clever techniques that quantize the LLM (to fit on a single GPU) and fine-tune only a set of smaller adapters.
One popular approach is QLoRA, which can be implemented using HF's `peft` Python package.
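A minimal QLoRA sketch using `transformers`, `peft`, and `bitsandbytes` (the Falcon-7B checkpoint and the hyperparameters are illustrative assumptions, not the course's exact configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. Load the base LLM quantized to 4-bit so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
)

# 2. Attach small LoRA adapters; only these are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the full model
```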
MLOps
As you want your project to get to production, you have to integrate the following MLOps components:
- experiment tracker to monitor & compare your experiments
- model registry to version & share your models between the FTI pipelines
- prompts monitoring to debug & track complex chains
↳ All of them are available on ML platforms, such as Comet ML.
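As a sketch of what that looks like with Comet ML's Python SDK (the project name and logged values are placeholders):

```python
from comet_ml import Experiment

experiment = Experiment(project_name="hands-on-llms")  # reads COMET_API_KEY from the env

# Experiment tracking: hyperparameters & metrics.
experiment.log_parameters({"base_model": "tiiuae/falcon-7b", "lora_r": 16})
for step, loss in enumerate([2.1, 1.7, 1.4]):  # stand-in for the real training loop
    experiment.log_metric("train/loss", loss, step=step)

# Model registry: version & share the trained adapters between the FTI pipelines.
experiment.log_model("financial-advisor-falcon-7b", "./lora_adapters")
experiment.end()
```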
Compute platform
The most common approach is to train your LLM on your on-prem Nvidia GPU cluster or to rent GPUs from cloud providers such as AWS, Paperspace, etc.
But what if I told you that there is an easier way?
There is! It is called serverless.
For example, Beam is a GPU serverless provider that makes deploying your training pipeline as easy as decorating your Python function with `@app.run()`.
Along with ease of deployment, you can easily add your training code to your CI/CD to add the final piece of the MLOps puzzle, called CT (continuous training).
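A sketch of that decorator pattern (the imports and parameters follow an older Beam SDK and may differ in current versions):

```python
from beam import App, Image, Runtime

app = App(
    name="train-financial-advisor",
    runtime=Runtime(
        gpu="A10G",
        memory="32Gi",
        image=Image(python_packages=["transformers", "peft", "bitsandbytes"]),
    ),
)

@app.run()
def train():
    # The QLoRA fine-tuning code from above would live here and run on
    # Beam's serverless GPU infrastructure.
    ...
```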
↳ Beam
↳ To see all these components in action, check out our FREE Hands-on LLMs course & give it a ⭐
That's it for today!
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
...and see you next week for Lesson 7 of the Hands-On LLMs series.
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).