DML: 7 steps to fine-tune an open-source LLM to create your real-time financial advisor
Lesson 8 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin.
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time.
Table of Contents:
What is Beam? How does serverless make deploying ML models easy?
7 tips you must know to reduce the VRAM consumption of your LLMs during training
7 steps to fine-tune an open-source LLM to create your real-time financial advisor
Previous Lessons:
Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?
Lesson 7: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?
Check out the Hands-on LLMs course and support it with a ⭐.
#1. What is Beam? How does serverless make deploying ML models easy?
Deploying & managing ML models is hard, especially when running your models on GPUs.
But serverless makes things easy.
Using Beam as your serverless provider, deploying & managing ML models can be as easy as ↓
Define your infrastructure & dependencies
In a few lines of code, you define the application that contains:
- the requirements of your infrastructure, such as the CPU, RAM, and GPU
- the dependencies of your application
- the volumes from where you can load your data and store your artifacts
Deploy your jobs
Using the Beam application, you can quickly decorate your Python functions to:
- run them once on the given serverless application
- put your task/job in a queue to be processed or even schedule it using a CRON-based syntax
- even deploy it as a RESTful API endpoint
As you can see in the image below, you can have one central function for training or inference, and with minimal effort, you can switch between all these deployment methods.
Also, you don't have to bother at all with managing the infrastructure on which your jobs run. You specify what you need, and Beam takes care of the rest.
By doing so, you can focus directly on your application and stop caring about the infrastructure.
This is the power of serverless!
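To make this concrete, here is a minimal sketch of what such a Beam application can look like. It is based on the Beam Python SDK used throughout the course (App, Runtime, Image, Volume and the @app.run() / @app.rest_api() decorators); the exact names, arguments and resource values are illustrative and may differ in newer SDK versions.

```python
from beam import App, Image, Runtime, Volume

# Define the infrastructure & dependencies of the application.
app = App(
    name="train_financial_llm",
    runtime=Runtime(
        cpu=4,
        memory="32Gi",
        gpu="A10G",  # the GPU your job needs
        image=Image(python_packages=["transformers", "peft", "trl", "comet-ml"]),
    ),
    volumes=[Volume(name="qa_dataset", path="./qa_dataset")],  # data & artifacts
)


@app.run()  # run the function once on the serverless infrastructure
def train():
    ...  # your training logic


@app.rest_api()  # or expose the same logic as a RESTful API endpoint
def predict(**inputs):
    ...  # your inference logic
```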
Check out Beam to learn more.
#2. 7 tips you must know to reduce the VRAM consumption of your LLMs during training
Here are 7 tips you must know to reduce the VRAM consumption of your LLMs during training so you can fit them on x1 GPU.
When training LLMs, one of the pain points is to have enough VRAM on your system.
The good news is that the gods of DL are with us, and there are methods to lower your VRAM consumption without a significant impact on your performance ↓
1. Mixed-precision: During training, you use both FP32 and FP16 in the following way: "FP32 weights" -> "FP16 weights" -> "FP16 gradients" -> "FP32 gradients" -> "Update weights" -> "FP32 weights" (and repeat). As you can see, the forward & backward passes are done in FP16, and only the optimization step is done in FP32, which reduces both the VRAM usage and the runtime.
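As a reference, here is a minimal mixed-precision training loop sketched with PyTorch's AMP utilities; the model, dataloader and optimizer are assumed to already exist.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss; master weights stay in FP32

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss  # forward pass runs in FP16
    scaler.scale(loss).backward()   # backward pass in FP16, with loss scaling
    scaler.step(optimizer)          # optimizer step on the FP32 master weights
    scaler.update()
```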
2. Lower-precision: All your computations are done in FP16 instead of FP32. But the key is using bfloat16 ("Brain Floating Point"), a numerical representation Google developed for deep learning. It allows you to represent very large and very small numbers, avoiding overflow or underflow.
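For example, with HuggingFace Transformers you can load the weights directly in BF16 (the model name is just an example); no gradient scaler is needed because BF16 has the same exponent range as FP32.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model in BF16 instead of FP32, halving the memory taken by the weights.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.bfloat16,
)
```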
3. Reducing the batch size: This one is straightforward. Fewer samples per training iteration result in smaller VRAM requirements. The downside of this method is that you can't go too low with your batch size without impacting your model's performance.
4. Gradient accumulation: It is a simple & powerful trick to increase your batch size virtually. You compute the gradients for "micro" batches (forward + backward passes). Once the accumulated gradients reach the given "virtual" target, the model weights are updated with the accumulated gradients. For example, you have a batch size of 4 and a micro-batch size of 1. Then, the forward & backward passes will be done using only x1 sample, and the optimization step will be done using the aggregated gradient of the 4 samples.
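A bare-bones version of the trick looks like this (the dataloader is assumed to yield micro-batches of 1 sample):

```python
accumulation_steps = 4  # "virtual" batch size = micro-batch size x 4

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps  # scale so the gradients average out
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # update once per "virtual" batch
        optimizer.zero_grad()
```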
5. Use a stateless optimizer: Adam is the most popular optimizer. It is one of the most stable optimizers, but the downside is that it stores 2 additional values (a mean & variance) for every model parameter. If you use a stateless optimizer, such as SGD, you drop that extra state, reducing the memory taken by the parameters + optimizer state by roughly 2/3, which is significant for LLMs.
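In PyTorch, the swap is a one-liner (the learning rate is illustrative):

```python
import torch

# AdamW stores 2 extra FP32 tensors (mean & variance) per model parameter:
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# SGD keeps no per-parameter state, so the optimizer memory disappears:
optimizer = torch.optim.SGD(model.parameters(), lr=2e-4)
```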
6. Gradient (or activation) checkpointing: It drops specific activations during the forward pass and recomputes them during the backward pass. Thus, it eliminates the need to hold all activations simultaneously in VRAM. This technique reduces VRAM consumption but makes training slower.
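With HuggingFace Transformers models, enabling it is a single call; for a custom PyTorch module you can wrap the expensive blocks yourself (transformer_block and hidden are placeholders for your own layer and tensor):

```python
# HuggingFace Transformers: recompute activations during the backward pass.
model.gradient_checkpointing_enable()

# Plain PyTorch equivalent for a custom module:
from torch.utils.checkpoint import checkpoint

hidden = checkpoint(transformer_block, hidden)  # activations recomputed on backward
```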
7. CPU parameter offloading: As the name suggests, the parameters that do not fit on your GPU's VRAM are loaded onto the CPU. Intuitively, you can see it as a form of model parallelism between your GPU & CPU.
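One way to do this during training is DeepSpeed's ZeRO-Offload, configured through the HuggingFace Trainer; the config below is a minimal, illustrative sketch, not the course's exact setup.

```python
from transformers import TrainingArguments

# ZeRO stage 3 shards the parameters and offloads what does not fit to CPU RAM.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},      # parameters spill over to the CPU
        "offload_optimizer": {"device": "cpu"},  # optimizer state lives on the CPU too
    },
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(output_dir="output", deepspeed=ds_config)
```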
Most of these methods are orthogonal, so you can combine them and drastically reduce your VRAM requirements during training.
#3. 7 steps to fine-tune an open-source LLM to create your real-time financial advisor
In the past weeks, we covered why you have to fine-tune an LLM and what resources & tools you need:
- Q&A dataset
- pre-trained LLM (Falcon 7B) & QLoRA
- MLOps: experiment tracker, model registry, prompt monitoring (Comet ML)
- compute platform (Beam)
Now, let's see how you can hook all of these pieces together into a single fine-tuning module ↓
1. Load the Q&A dataset
Our Q&A samples have the following keys: "about_me", "user_context", "question" and "answer".
For task-specific fine-tuning, you need only 100-1000 samples. Thus, you can directly load the whole JSON in memory.
After, you map the samples to a list of Python dataclasses to validate the structure & types of the ingested instances.
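A compact sketch of this step (the QASample class and the file name are illustrative):

```python
import json
from dataclasses import dataclass


@dataclass
class QASample:
    about_me: str
    user_context: str
    question: str
    answer: str


with open("qa_dataset.json") as f:
    raw_samples = json.load(f)  # a few hundred samples fit easily in memory

# Building the dataclass raises an error if a key is missing or unexpected.
samples = [QASample(**sample) for sample in raw_samples]
```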
2. Preprocess the Q&A dataset into prompts
The first step is to use unstructured to clean every sample by removing redundant characters.
After, as every sample consists of multiple fields, you must map it to a single piece of text, also known as the prompt.
To do so, you define a PromptTemplate class to manage all your prompts. You will use it to map all the sample keys to a prompt using a Python f-string.
The last step is to map the list of Python dataclasses to a HuggingFace dataset and map every sample to a prompt, as discussed above.
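Here is a sketch of the preprocessing step; the cleaning function comes from the unstructured package, while the prompt layout and helper names are illustrative (the course wraps this logic in a PromptTemplate class).

```python
from dataclasses import asdict

from datasets import Dataset
from unstructured.cleaners.core import clean

PROMPT_TEMPLATE = (
    "About me: {about_me}\n"
    "Context: {user_context}\n"
    "Question: {question}\n"
    "Answer: {answer}"
)


def to_prompt(sample: dict) -> dict:
    # Remove redundant whitespace & dashes from every field, then build the prompt.
    cleaned = {key: clean(value, extra_whitespace=True, dashes=True) for key, value in sample.items()}
    return {"prompt": PROMPT_TEMPLATE.format(**cleaned)}


# `samples` is the list of dataclasses from step 1.
dataset = Dataset.from_list([asdict(sample) for sample in samples]).map(to_prompt)
```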
3. Load the LLM using QLoRA
Load a pretrained Falcon 7B LLM by passing a bitsandbytes quantization configuration that loads all the weights in 4-bit precision.
Then, using LoRA, you freeze the weights of the original Falcon LLM and attach a set of trainable adapters to it.
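Sketched with transformers, bitsandbytes and peft (the LoRA hyperparameters are illustrative):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config for the base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized base model and attach trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```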
4. Fine-tuning
The trl Python package makes this step extremely simple.
You pass the training arguments, the dataset and the model to the SFTTrainer class and call the train() method.
One crucial aspect is configuring an experiment tracker, such as Comet ML, to log the loss and other vital metrics & artifacts.
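A minimal sketch of this step (argument names reflect the trl/transformers versions available when the course was written; hyperparameters are illustrative):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="falcon-7b-financial-advisor",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    logging_steps=10,
    bf16=True,
    report_to="comet_ml",  # send the loss & other metrics to Comet ML
)

trainer = SFTTrainer(
    model=model,            # the QLoRA model from step 3
    train_dataset=dataset,  # the prompt dataset from step 2
    args=training_args,
    dataset_text_field="prompt",
    max_seq_length=1024,
)
trainer.train()
```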
5. Push the best model to the model registry
One of the final steps is to attach a callback to the SFTTrainer class that runs when the training ends to push the model with the lowest loss to the model registry as the new production candidate.
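A hedged sketch of such a callback using transformers' TrainerCallback and Comet ML's Python SDK; the callback name and registry calls are illustrative, not the course's exact implementation.

```python
import comet_ml
from transformers import TrainerCallback


class PushBestModelCallback(TrainerCallback):  # hypothetical name
    """When training ends, push the checkpoint with the lowest loss to the model registry."""

    def on_train_end(self, args, state, control, **kwargs):
        best_checkpoint = state.best_model_checkpoint  # requires load_best_model_at_end=True
        experiment = comet_ml.get_global_experiment()
        experiment.log_model("financial-advisor-falcon-7b", best_checkpoint)
        experiment.register_model("financial-advisor-falcon-7b")


trainer.add_callback(PushBestModelCallback())
```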
6. Evaluate the new production candidate
Evaluating generative AI models can be pretty tricky.
You can run the LLM on the test set and log the prompts & answers to Comet ML's monitoring system to check them manually.
If the provided answers are valid, you will manually release the new model from the model registry dashboard to replace the old LLM.
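For example, with the comet_llm package you can log every test prompt & answer so they show up in Comet's prompt monitoring dashboard (the variable names and metadata are illustrative):

```python
import comet_llm

# test_prompts / generated_answers are placeholders for the outputs of your test run.
for prompt, answer in zip(test_prompts, generated_answers):
    comet_llm.log_prompt(
        prompt=prompt,
        output=answer,
        metadata={"model_version": "falcon-7b-qlora-candidate"},
    )
```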
7. Deploy to Beam
It is as easy as wrapping the training & inference functions (or classes) with a Python "@app.run()" decorator.

Check out the Hands-on LLMs course and support it with a ⭐.
That's it for today.
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for Lesson 9, the last lesson of the Hands-on LLMs series.
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).