The ultimate LLM fine-tuning pipeline
Master production-ready fine-tuning with AWS SageMaker, Unsloth, and MLOps best practices
The new awesome list on building LLM apps (Affiliate)
If you are into LLMs and feel lost (like most of us), here is a fantastic repository that curates the top tools, blogs and tutorials.
The LLM space is complicated and moves fast...
Thus, it can often make us feel lost and confused.
That's why Li Yin started an excellent initiative by creating the LLM-engineer-handbook repository, which curates and aggregates the top tools, blogs, tutorials, books and social accounts on LLM topics such as training, fine-tuning, serving and building LLM apps.
Let's make this the go-to place for finding resources for building LLM apps!
P.S. The repository's name has nothing to do with my book, but the book is featured on her list.
Great news! We remastered our LLM Twin open-source course.
We drastically simplified the code, fixed all the bugs, and introduced the following:
Fine-tuning with Unsloth and AWS SageMaker (Lesson 7)
LLM & RAG evaluation with Opik (Lesson 8)
Deploying the fine-tuned LLM to AWS SageMaker (Lesson 9)
Prompt monitoring & evaluation pipelines with Opik (Lesson 10).
You can find all the new lessons on GitHub ←
Let’s kick it off with Lesson 7.
Lesson 7: 8B Parameters, 1 GPU, No Problems: The ultimate LLM fine-tuning pipeline
This lesson will show you how to fine-tune open-source LLMs from Hugging Face using Unsloth, TRL, AWS SageMaker and Comet ML so you can:
Apply MLOps best practices using Hugging Face and Comet ML;
Use VRAM optimally during fine-tuning with Unsloth and TRL;
Operationalize your training pipelines using AWS SageMaker.
We will primarily focus on engineering scalable and reproducible fine-tuning pipelines (using LLMOps and SWE best practices) rather than digging into fine-tuning techniques.
We will stick to what usually works for fine-tuning, such as using LoRA for supervised fine-tuning (SFT).
Table of Contents
Loading the training dataset from the data registry
Digging into SFT using Unsloth, TRL and Comet ML
Saving the fine-tuned LLM to a model registry
Scaling fine-tuning with AWS SageMaker
Running the training pipeline on AWS SageMaker
🔗 Consider checking out the GitHub repository [1] and supporting us with a ⭐️
1. Loading the training dataset from the data registry
In Lesson 6, we taught you how to generate an instruct fine-tuning dataset from raw custom data collected from various socials.
Ultimately, we stored and versioned the fine-tuning dataset into a data registry powered by Comet ML. The data registry uses artifacts to track large files and metadata such as tags, versions, and dataset size.
You can observe all the available artifacts from Comet ML in Figure 2.
Also, we made our artifacts publicly available, so you can take a look, play around with them, and even use them to fine-tune the LLM in case you don’t want to compute them yourself:
For example, in Figure 3, you can observe what our articles-instruct-dataset artifact looks like. It has three versions available, the latest being 12.0.0.
By versioning your fine-tuning data, you ensure lineage, which means you always know what data your model was trained on. This is a critical aspect of reproducibility, one of the pillars of MLOps.
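As a quick refresher from Lesson 6, logging such a versioned artifact to Comet ML looks roughly like the sketch below (the file name and metadata are illustrative, not the exact course code):
from comet_ml import Artifact, Experiment

experiment = Experiment()
artifact = Artifact(
    name="articles-instruct-dataset",       # the artifact shown in Figure 3
    artifact_type="dataset",
    metadata={"num_samples": 300},          # illustrative metadata
)
artifact.add("articles-instruct-dataset-train.json")  # hypothetical local file
experiment.log_artifact(artifact)
experiment.end()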
How can we work with these artifacts?
If you are familiar with Hugging Face datasets, you will see that Comet ML artifacts are similar. Conceptually, they are the same thing, but Comet allows you to quickly build a private data registry on top of your own data.
Let’s dig into the code to see how they work.
from pathlib import Path


class DatasetClient:
    def __init__(
        self,
        output_dir: Path = Path("./finetuning_dataset"),
    ) -> None:
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)
First, we define a DatasetClient class. It creates a dedicated directory for storing our downloaded datasets.
def download_dataset(self, dataset_id: str, split: str = "train") -> Dataset:
    assert split in ["train", "test"], "Split must be either 'train' or 'test'"

    if "/" in dataset_id:
        tokens = dataset_id.split("/")
        assert (
            len(tokens) == 2
        ), f"Wrong format for the {dataset_id}. It should contain at most one '/' character, following this template: 'comet_ml_workspace/comet_ml_artifact_name'"

        workspace, artifact_name = tokens
        experiment = Experiment(workspace=workspace)
    else:
        artifact_name = dataset_id
        experiment = Experiment()

    artifact = self._download_artifact(artifact_name, experiment)
    asset = self._artifact_to_asset(artifact, split)
    dataset = self._load_data(asset)

    experiment.end()

    return dataset
This is our primary entry point method — a high-level interface that orchestrates the entire dataset download process. It handles workspace parsing, validates inputs, and coordinates the three main steps: downloading, asset extraction, and data loading.
def _download_artifact(self, artifact_name: str, experiment) -> Artifact:
    try:
        logged_artifact = experiment.get_artifact(artifact_name)
        artifact = logged_artifact.download(self.output_dir)
    except Exception as e:
        print(f"Error retrieving artifact: {str(e)}")
        raise

    print(f"Successfully downloaded {artifact_name} at location {self.output_dir}")

    return artifact
This section manages the actual download of artifacts from Comet ML. It includes error handling and logging to ensure smooth data retrieval operations.
def _artifact_to_asset(self, artifact: Artifact, split: str) -> ArtifactAsset:
    if len(artifact.assets) == 0:
        raise RuntimeError("Artifact has no assets")
    elif len(artifact.assets) != 2:
        raise RuntimeError(
            f"Artifact has {len(artifact.assets)} assets, which is invalid. It should have only 2."
        )

    print(f"Picking split = '{split}'")
    asset = [asset for asset in artifact.assets if split in asset.logical_path][0]

    return asset
Here, we handle the validation and extraction of specific dataset splits (train/test) from our artifacts. It ensures we work with the correct data partitions and maintains data integrity.
def _load_data(self, asset: ArtifactAsset) -> Dataset:
    data_file_path = asset.local_path_or_data
    with open(data_file_path, "r") as file:
        data = json.load(file)

    dataset_dict = {k: [str(d[k]) for d in data] for k in data[0].keys()}
    dataset = Dataset.from_dict(dataset_dict)
    print(
        f"Successfully loaded dataset from artifact, num_samples = {len(dataset)}",
    )

    return dataset
The final piece transforms our raw data into a Hugging Face Dataset object, which is well supported within the LLM tooling ecosystem, including TRL, which we will use for fine-tuning.
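Putting the class together, a typical call (using the dataset ID from our settings) would look like this:
# Minimal usage sketch; assumes the Comet ML credentials are set as environment variables.
client = DatasetClient(output_dir=Path("./finetuning_dataset"))
dataset = client.download_dataset(
    dataset_id="decodingml/articles-instruct-dataset", split="train"
)
print(dataset)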
What does our data look like?
We have ~300 training samples stored in our Comet ML artifacts that follow the structure below:
[
...
{
"instruction": "Describe the old architecture of the RAG feature pipeline and its robust design principles.",
"content": "Our goal is to help enterprises put vectors at the center of their\n> data & compute infrastructure, to build smarter and more reliable\n> software._\n\nTo conclude, Superlinked is a framework that puts the vectors in the center of\ntheir universe and allows you to:\n\n * chunk and embed embeddings;\n\n * store multi-index vectors in a vector DB;\n\n * do complex vector search queries on top of your data..."
},
...
]
300 samples are not enough for SFT. Usually, you need somewhere between 10k and 100k instruct-answer pairs.
However, they are sufficient to teach you an end-to-end LLM architecture that can easily support datasets of 100k samples if you want to adapt it to your needs.
2. Digging into SFT using Unsloth, TRL and Comet ML
The next step is to define our fine-tuning strategy. We will do only an SFT step using LoRA to keep it simple and cost-effective.
We will use Unsloth and TRL to define our fine-tuning script.
Unsloth is the new kid on the block for fine-tuning LLMs, making training 2x faster and 60% more memory-efficient than using Hugging Face directly.
This translates to faster experiments, which means more iterations, feedback, and novelty with lower costs.
Also, we will use Comet ML as our experiment tracker to log all our training metrics between multiple experiments, compare them, and pick the best one to push to production.
🔗 See a concrete example of an experiment tracker by checking out one of our experiments ←
Now, let’s dig into the code. Unsloth and TRL make it straightforward.
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""
We will use the Alpaca format, which is expected by Llama models, to format our instruct dataset into prompts.
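To make the template concrete, here is roughly what one formatted training sample would look like, reusing the instruction from the data example above (the EOS token shown is an assumption; in the training script it is read from the tokenizer):
# Illustrative only: instruction and response are shortened from the earlier data example.
sample = ALPACA_TEMPLATE.format(
    "Describe the old architecture of the RAG feature pipeline and its robust design principles.",
    "Superlinked is a framework that puts the vectors in the center of their universe...",
) + "<|end_of_text|>"  # assumed EOS token; the script uses tokenizer.eos_token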
def finetune(
    model_name: str,
    output_dir: str,
    dataset_id: str,
    max_seq_length: int = 2048,
    load_in_4bit: bool = False,
    lora_rank: int = 32,
    lora_alpha: int = 32,
    lora_dropout: float = 0.0,
    target_modules: List[str] = [
        "q_proj", "k_proj", "v_proj",
        "up_proj", "down_proj", "o_proj",
        "gate_proj",
    ],
    chat_template: str = "chatml",
    learning_rate: float = 3e-4,
    num_train_epochs: int = 3,
    per_device_train_batch_size: int = 2,
    gradient_accumulation_steps: int = 8,
    is_dummy: bool = True,
) -> tuple:
Next, we define the fine-tuning function and its parameters, including model configurations, LoRA parameters, and training hyperparameters.
model, tokenizer = load_model(
    model_name, max_seq_length, load_in_4bit,
    lora_rank, lora_alpha, lora_dropout,
    target_modules, chat_template,
)
EOS_TOKEN = tokenizer.eos_token
print(f"Setting EOS_TOKEN to {EOS_TOKEN}")

if is_dummy:
    num_train_epochs = 1
    print(f"Training in dummy mode. Setting num_train_epochs to '{num_train_epochs}'")
    print("Training in dummy mode. Reducing dataset size to '400'.")
Next, we load the model and tokenizer and handle dummy mode settings for quick testing.
def format_samples_sft(examples):
    text = []
    for instruction, output in zip(
        examples["instruction"], examples["content"], strict=False
    ):
        message = ALPACA_TEMPLATE.format(instruction, output) + EOS_TOKEN
        text.append(message)

    return {"text": text}
This inner function handles the formatting of training examples into the desired template structure.
dataset_client = DatasetClient()
custom_dataset = dataset_client.download_dataset(dataset_id=dataset_id)
static_dataset = load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]")
dataset = concatenate_datasets([custom_dataset, static_dataset])

if is_dummy:
    dataset = dataset.select(range(400))
print(f"Loaded dataset with {len(dataset)} samples.")

dataset = dataset.map(
    format_samples_sft, batched=True, remove_columns=dataset.column_names
)
dataset = dataset.train_test_split(test_size=0.05)

print("Training dataset example:")
print(dataset["train"][0])
Next, we handle dataset loading, combining custom and static datasets, and preprocessing the data.
As we don’t have enough fine-tuning data, we enrich our custom dataset with a standard fine-tuning dataset to keep the SFT training step stable and avoid breaking the model.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        per_device_eval_batch_size=per_device_train_batch_size,
        warmup_steps=10,
        output_dir=output_dir,
        report_to="comet_ml",
        seed=0,
    ),
)

trainer.train()

return model, tokenizer
This final section sets up the SFT (Supervised Fine-Tuning) trainer with all necessary parameters and executes the training process.
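Putting it all together, a call to the function could look like this (the model and dataset IDs are the ones used throughout the lesson; the output directory name is illustrative):
# Hypothetical invocation for a quick smoke test before a full training run.
model, tokenizer = finetune(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    output_dir="output",
    dataset_id="decodingml/articles-instruct-dataset",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=3e-4,
    is_dummy=True,  # train on a reduced dataset for a single epoch
)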
Enabling experiment tracking with Comet ML is as simple as setting the report_to="comet_ml" parameter in the TrainingArguments class and having the `COMET_API_KEY`, `COMET_WORKSPACE` and `COMET_PROJECT` environment variables loaded into memory.
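If you run the script outside SageMaker, a minimal sketch of wiring these variables yourself (with placeholder values) would be:
import os

# Placeholder values; in the course they come from the .env file via the settings object.
os.environ["COMET_API_KEY"] = "<your-comet-api-key>"
os.environ["COMET_WORKSPACE"] = "<your-comet-workspace>"
os.environ["COMET_PROJECT_NAME"] = "llm-twin"  # the SageMaker script maps COMET_PROJECT to this variable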
Let’s dig further into how the model is defined using Unsloth.
def load_model(
    model_name: str,
    max_seq_length: int,
    load_in_4bit: bool,
    lora_rank: int,
    lora_alpha: int,
    lora_dropout: float,
    target_modules: List[str],
    chat_template: str,
) -> tuple:
The load_model function takes several essential parameters:
model_name: The identifier of the pre-trained model (e.g., "meta-llama/Meta-Llama-3.1-8B")
max_seq_length: Maximum sequence length for input tokens
load_in_4bit: Boolean flag for 4-bit quantization
lora_rank, lora_alpha, lora_dropout: LoRA (Low-Rank Adaptation) parameters
target_modules: List of model layers to apply LoRA to
chat_template: The conversation format template to use
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
)
This step from the load_model() function loads the pre-trained model and its tokenizer using Unsloth’s FastLanguageModel.
The load_in_4bit parameter is particularly interesting as it enables 4-bit quantization, significantly reducing the model’s memory footprint while maintaining good performance.
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=target_modules,
)
Here’s where the magic of LoRA happens. Instead of fine-tuning all model parameters, LoRA adds small trainable rank decomposition matrices to specific layers (defined in target_modules). This makes fine-tuning much more efficient in terms of memory and computation.
lora_rank (r): Determines the rank of the LoRA update matrices.
lora_alpha: Scaling factor for the LoRA updates.
lora_dropout: Adds regularization to prevent overfitting.
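To build intuition for how lightweight these adapters are, here is a rough back-of-the-envelope sketch (the 4096x4096 projection size is an assumption based on the Llama 3.1 8B architecture):
# LoRA replaces the update of a frozen weight W (d_out x d_in) with B @ A,
# where A is (r x d_in) and B is (d_out x r), adding r * (d_in + d_out) parameters.
d_in, d_out, r = 4096, 4096, 32   # assumed q_proj shape for Llama 3.1 8B
lora_params = r * (d_in + d_out)  # 262,144 trainable parameters for this layer
frozen_params = d_in * d_out      # ~16.8M frozen parameters in the original matrix
print(f"{lora_params:,} trainable vs {frozen_params:,} frozen")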
tokenizer = get_chat_template(
    tokenizer,
    chat_template=chat_template,
)
Finally, we configure the tokenizer with a specific chat template. This ensures that the model understands the structure of conversations during training and inference. Standard templates include “chatml” (ChatML format) or other custom formats.
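For reference, ChatML wraps each conversation turn in special tokens; a formatted exchange looks roughly like this (content shortened for illustration):
# Rough illustration of the structure a ChatML chat template produces.
chatml_example = (
    "<|im_start|>user\n"
    "Describe the old architecture of the RAG feature pipeline.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Superlinked is a framework that puts vectors at the center of their universe...<|im_end|>\n"
)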
This loading pipeline is crucial for efficient fine-tuning because it:
Enables memory-efficient training through 4-bit quantization.
Implements LoRA for parameter-efficient fine-tuning.
Ensures consistent conversation formatting through chat templates.
Using this approach, you can fine-tune LLMs on consumer-grade hardware while achieving excellent results.
To dig deeper into the theory of fine-tuning with LoRA, consider checking out this article written by Maxime: Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth [2].
3. Saving the fine-tuned LLM to a model registry
Just as we store, track and version our data in a data registry, we have to do the same for our fine-tuned model by pushing it to a model registry.
A common strategy when working with open-source models is to use the Hugging Face model registry to store and share your models, which we will also do in this lesson.
base_model_suffix = args.base_model_name.split("/")[-1]
sft_output_model_repo_id = f"{huggingface_workspace}/LLMTwin-{base_model_suffix}"

save_model(
    model,
    tokenizer,
    "model_sft",
    push_to_hub=True,
    repo_id=sft_output_model_repo_id,
)
First, we compute the output model ID based on our Hugging Face workspace (e.g., pauliusztin) and the new model name. For simplicity, we prefixed the base model name with "LLMTwin".
def save_model(
    model: Any,
    tokenizer: Any,
    output_dir: str,
    push_to_hub: bool = False,
    repo_id: Optional[str] = None,
) -> None:
    model.save_pretrained_merged(output_dir, tokenizer, save_method="merged_16bit")

    if push_to_hub and repo_id:
        model.push_to_hub_merged(repo_id, tokenizer, save_method="merged_16bit")
We save the model locally and push it to Hugging Face, as seen in Figure 3.
Further, you can load a specific version of the model from the model registry for evaluation or serving.
Almost all ML platforms offer a model registry, such as Comet, W&B, Neptune and more, but HuggingFace is a common choice.
The beauty of model registries is that, in case you haven't fine-tuned your own LLM Twin, you can use ours to finish the course:
→ Link to our pauliusztin/LLMTwin-Meta-Llama-3.1-8B model.
→ Full code: the finetune.py script.
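If you want to pull that model yourself for evaluation or serving, a minimal sketch using the standard Hugging Face API would be:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading sketch; the repo ID is the example model referenced above.
repo_id = "pauliusztin/LLMTwin-Meta-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)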
4. Scaling fine-tuning with AWS SageMaker
So far, we have walked you through the fine-tuning script. A standard approach is to run it locally, on Google Colab, or in similar Notebook-based environments, but what if we want to scale or automate the training?
A 7–8B LLM could fit on a Google Colab machine while using LoRA/QLoRA, but it can get trickier for larger models.
Another issue is that Google Colab works fine with small open-source datasets, but what if you work with terabytes or petabytes of data?
Here is where tools such as AWS SageMaker kick in. They allow you to hook your fine-tuning script to GPU clusters running on AWS and provide robust access to datasets of various sizes (public or private) powered by S3 (you could host your Comet ML artifacts on S3).
Code-wise, SageMaker makes it easy to set everything up, as seen in the code snippet below, where we:
Locate the requirements.txt file with the Python dependencies used for training.
Grab your Hugging Face username.
Define the SageMaker job using a wrapper dedicated to training jobs that use Hugging Face. These are Docker images preinstalled with the transformers and torch libraries.
Kick off the training.
Beautiful and easy.
from pathlib import Path

from huggingface_hub import HfApi
from sagemaker.huggingface import HuggingFace

# `settings` and `logger` come from the project's configuration and logging modules.
finetuning_dir = Path(__file__).resolve().parent
finetuning_requirements_path = finetuning_dir / "requirements.txt"


def run_finetuning_on_sagemaker(
    num_train_epochs: int = 3,
    per_device_train_batch_size: int = 2,
    learning_rate: float = 3e-4,
    is_dummy: bool = False,
) -> None:
    if not finetuning_requirements_path.exists():
        raise FileNotFoundError(
            f"The file {finetuning_requirements_path} does not exist."
        )

    api = HfApi()
    user_info = api.whoami(token=settings.HUGGINGFACE_ACCESS_TOKEN)
    huggingface_user = user_info["name"]
    logger.info(f"Current Hugging Face user: {huggingface_user}")

    hyperparameters = {
        "base_model_name": settings.HUGGINGFACE_BASE_MODEL_ID,
        "dataset_id": settings.DATASET_MODEL_ID,
        "num_train_epochs": num_train_epochs,
        "per_device_train_batch_size": per_device_train_batch_size,
        "learning_rate": learning_rate,
        "model_output_huggingface_workspace": huggingface_user,
    }
    if is_dummy:
        hyperparameters["is_dummy"] = True

    # Create the HuggingFace SageMaker estimator: a Docker image preinstalled
    # with the transformers and torch libraries.
    huggingface_estimator = HuggingFace(
        entry_point="finetune.py",
        source_dir=str(finetuning_dir),
        instance_type="ml.g5.2xlarge",
        instance_count=1,
        role=settings.AWS_ARN_ROLE,
        transformers_version="4.36",
        pytorch_version="2.1",
        py_version="py310",
        hyperparameters=hyperparameters,
        requirements_file=finetuning_requirements_path,
        environment={
            "HUGGING_FACE_HUB_TOKEN": settings.HUGGINGFACE_ACCESS_TOKEN,
            "COMET_API_KEY": settings.COMET_API_KEY,
            "COMET_WORKSPACE": settings.COMET_WORKSPACE,
            "COMET_PROJECT_NAME": settings.COMET_PROJECT,
        },
    )

    # Start the training job on SageMaker.
    huggingface_estimator.fit()
The hyperparameters dictionary will be sent to the fine-tuning script as CLI arguments, while the environment dictionary will be set as environment variables. That’s why we send only the credentials through the environment argument.
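On the training-job side, those hyperparameters arrive as plain CLI flags, so the entry point can parse them with something like the simplified, assumed snippet below (the real finetune.py may differ):
import argparse

# Simplified sketch of how finetune.py could consume the SageMaker hyperparameters.
parser = argparse.ArgumentParser()
parser.add_argument("--base_model_name", type=str)
parser.add_argument("--dataset_id", type=str)
parser.add_argument("--num_train_epochs", type=int, default=3)
parser.add_argument("--per_device_train_batch_size", type=int, default=2)
parser.add_argument("--learning_rate", type=float, default=3e-4)
parser.add_argument("--model_output_huggingface_workspace", type=str)
parser.add_argument(
    "--is_dummy", type=lambda v: str(v).lower() in ("true", "1"), default=False
)
args = parser.parse_args()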
As we train an 8B LLM, we managed to fit the training on a single "ml.g5.2xlarge" instance, which has a single NVIDIA A10G GPU with 24 GB of VRAM and costs ~$2/hour.
But the catch is that this is possible only because we fine-tune using Unsloth, which reduces our memory consumption. Without it, we could fit the training job only on an "ml.g5.12xlarge" instance with 4x A10G GPUs, which costs ~$9/hour.
So, yes, Unsloth is incredible!
That is a 77.77% reduction in costs (and we are not even considering that Unsloth experiments run faster due to the framework itself and the lower I/O overhead of using a single GPU).
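The arithmetic behind that number:
single_gpu_cost, multi_gpu_cost = 2.0, 9.0                     # approximate $/hour figures from above
savings = (multi_gpu_cost - single_gpu_cost) / multi_gpu_cost
print(f"{savings:.2%} cost reduction")                         # ~77.78%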
More on SageMaker pricing ←
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=ROOT_DIR, env_file_encoding="utf-8")

    HUGGINGFACE_BASE_MODEL_ID: str = "meta-llama/Meta-Llama-3.1-8B"
    HUGGINGFACE_ACCESS_TOKEN: str | None = None

    COMET_API_KEY: str | None = None
    COMET_WORKSPACE: str | None = None
    COMET_PROJECT: str = "llm-twin"

    DATASET_ID: str = "decodingml/articles-instruct-dataset"

    # AWS Authentication
    AWS_REGION: str = "eu-central-1"
    AWS_ACCESS_KEY: str | None = None
    AWS_SECRET_KEY: str | None = None
    AWS_ARN_ROLE: str | None = None
In our settings object (populated from the .env file), we have to set the base model ID, the dataset ID (loaded from Comet ML artifacts), and the credentials for Hugging Face, Comet ML, and AWS.
→ Full code: the run_on_sagemaker.py script
5. Running the training pipeline on AWS SageMaker
To run the fine-tuning job, first, you must create an IAM execution role used by AWS SageMaker to access other AWS resources. This is standard practice when working with SageMaker.
make create-sagemaker-execution-role
You must add this to your .env file as your AWS_ARN_ROLE env var. Thus, your .env file should look something like this:
QDRANT_APIKEY=...
# AWS Authentication
AWS_ARN_ROLE=...
AWS_REGION=eu-central-1
AWS_ACCESS_KEY=...
AWS_SECRET_KEY=...
Then, you can kick off a dummy training that uses less data and epochs by running:
make start-training-pipeline-dummy-mode
And the entire training by running:
make start-training-pipeline
After you call any training commands, your CLI should look similar to Figure 5.
After the EC2 machine is provisioned for your training job, your CLI should look similar to Figure 6.
After the training image is downloaded, your requirements are installed from the requirements.txt file, and the fine-tuning script starts running.
Conclusion
In this lesson, you got your hands dirty fine-tuning an open-source LLM from Hugging Face using Unsloth and TRL.
Also, you’ve learned why storing, versioning and using your data from a data registry (e.g., using Comet ML artifacts) is critical for reproducibility.
Ultimately, you’ve seen how easy it is to automate your training processes using AWS SageMaker.
Continue the course with Lesson 8 on evaluating the fine-tuned LLM and RAG pipeline using Opik.
Our LLM Engineer’s Handbook inspired the open-source LLM Twin course.
Consider supporting our work by getting our book to learn a complete framework for building and deploying production LLM & RAG systems — from data to deployment.
Perfect for practitioners who want both theory and hands-on expertise, connecting the dots between DE, research, MLE and MLOps:
References
Literature
[1] Decodingml. (n.d.). GitHub — decodingml/llm-twin-course. GitHub. https://github.com/decodingml/llm-twin-course
[2] Maxime Labonne (2024), Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth, Maxime Labonne's blog. https://mlabonne.github.io/blog/posts/2024-07-29_Finetune_Llama31.html
Images
If not otherwise stated, all images are created by the author.