10 concepts to know when working with LLMs
10 key LLM concepts. Qualitative LLM evaluation. Quick setup for LLM fine-tuning pipelines.
Decoding ML Notes
This week’s topics:
10 concepts to know when working with LLMs
One quick method for qualitative LLM evaluation
Quick setup for an LLM fine-tuning pipeline
10 concepts to know when working with LLMs
Working with LLMs involves knowledge of various concepts and methods. Here’s a list of 10 key ones:
1. LoRA (Low-Rank Adaptation)
Definition: A technique to adapt pre-trained language models to new tasks or domains by adding trainable low-rank update matrices to the model’s frozen weights.
Benefits: Efficient fine-tuning that requires fewer computational resources and reduces the risk of overfitting.
2. PEFT (Parameter-Efficient Fine-Tuning)
Definition: A method that focuses on fine-tuning a subset of parameters rather than the entire model.
Benefits: Reduces the computational cost and memory usage, making it feasible to fine-tune large models on smaller devices.
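In practice, concepts 1 and 2 usually show up together: the Hugging Face peft library implements LoRA as one of its parameter-efficient fine-tuning methods. A minimal sketch, where the model name and hyperparameters are illustrative rather than recommendations:

# Attach LoRA adapters to a pre-trained model with peft; only the small
# adapter matrices are trained, the base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # which weight matrices receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters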
3. RAG (Retrieval-Augmented Generation)
Definition: Combines retrieval-based and generation-based methods by using a retrieval system to fetch relevant documents which are then used to generate more accurate responses.
Benefits: Improves the quality of generated text by grounding it in factual information from retrieved documents.
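To make the retrieve-then-generate loop concrete, here is a schematic sketch; vector_store and llm are hypothetical placeholders for whatever retriever and generation client you actually use:

# Schematic RAG: retrieve relevant documents, then ground the answer in them.
def answer_with_rag(question: str, vector_store, llm, k: int = 3) -> str:
    # 1. Fetch the k most relevant documents for the question
    documents = vector_store.search(query=question, top_k=k)
    context = "\n\n".join(doc.text for doc in documents)

    # 2. Generate an answer grounded in the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.generate(prompt)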
4. MoE (Mixture of Experts)
Definition: An architecture that dynamically routes input tokens to different "expert" models, each specialized in different aspects of language.
Benefits: Allows for training of much larger models without a linear increase in computational cost, improving both scalability and efficiency.
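A toy top-1 router makes the routing idea tangible. Production MoE layers add top-k routing, load balancing, and capacity limits, so treat this PyTorch snippet purely as an illustration:

# Toy Mixture-of-Experts layer: a router scores the experts per token and each
# token is processed only by its highest-scoring expert.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (num_tokens, d_model)
        weights = self.router(x).softmax(dim=-1)     # routing probabilities
        top_expert = weights.argmax(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_expert == i                   # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask]) * weights[mask, i].unsqueeze(-1)
        return out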
5. Quantization
Definition: The process of reducing the precision of the model’s weights from floating-point to lower bit-width (e.g., 8-bit or 4-bit).
Benefits: Reduces the model size and inference latency, making deployment on edge devices and mobile hardware more feasible.
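For example, with transformers and bitsandbytes a model can be loaded directly in 4-bit. The model name and settings below are illustrative and assume a CUDA GPU is available:

# Load a causal LM in 4-bit NF4 precision to shrink its memory footprint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)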
6. Knowledge Distillation
Definition: A technique where a smaller model (student) is trained to replicate the behavior of a larger model (teacher).
Benefits: Produces compact and efficient models that maintain performance levels comparable to larger models.
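The core of most distillation setups is a loss that pushes the student toward the teacher’s softened output distribution. A minimal sketch; real training usually also mixes in the standard cross-entropy loss on the ground-truth labels:

# KL-divergence distillation loss between teacher and student logits.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2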
7. Zero-Shot Learning
Definition: The ability of a model to make predictions on tasks it hasn’t been explicitly trained on.
Benefits: Enhances the model's versatility and usability across various unseen tasks without additional training.
8. Prompt Engineering
Definition: The practice of designing input prompts to elicit the desired response from a language model.
Benefits: Improves the model’s output quality and relevance by carefully crafting the instructions or questions posed to it.
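For instance, the same request can be rewritten with an explicit role, audience, and output format; the wording below is only an illustration:

# A vague prompt vs. an engineered one: the role, constraints, and format are
# what prompt engineering tunes.
vague_prompt = "Write about vector databases."

engineered_prompt = """You are a technical writer for a machine learning blog.
Write a 150-word LinkedIn post explaining how vector databases work.
Audience: ML engineers. Tone: practical, no hype.
End with one question that invites discussion."""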
9. Fine-Tuning
Definition: The process of adapting a pre-trained model to a specific task by training it further on task-specific data.
Benefits: Customizes the model for particular applications, enhancing accuracy and relevance.
10. Prompt Monitoring
Definition: Refers to the process of systematically observing, analyzing, and adjusting the prompts used in interactions with an LLM.
Benefits: Helps maintain consistency, identify bias, and track how different prompts affect the quality of the LLM’s responses over time.
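As an example, the comet-llm package (the same Comet LLM tooling referenced later in this issue) exposes a simple logging call; the metadata fields below are illustrative, and the API key is assumed to be configured in the environment:

# Log a prompt/response pair so it can be inspected and compared later.
import comet_llm

comet_llm.log_prompt(
    prompt="Summarize the following article: ...",
    output="The article discusses ...",
    metadata={"model": "my-fine-tuned-llm", "prompt_version": "v2"},
)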
One quick method for qualitative LLM evaluation
LLM evaluation is a crucial process used to assess the performance and capabilities of the models. It involves a series of tests and analyses to determine how well the model understands, interprets, and generates human-like text.
Ensuring the effectiveness and safety of LLMs in practical applications should be a mandatory goal. As these models become more integrated into diverse sectors, evaluating them to reduce hallucinations, improve accuracy, and ensure ethical use is crucial.
Qualitative evaluation
Qualitative evaluation involves human-in-the-loop judgment or larger models assessing aspects like relevance, coherence, creativity, and appropriateness of the content.
This type of evaluation provides insights that quantitative metrics might miss.
Human Review:
Having domain experts or general users review the generated content to assess its quality based on criteria such as coherence, fluency, relevance, and creativity.
Human-in-the-loop (RLHF):
Humans rate the quality of model outputs, and this feedback is used to fine-tune the model through reinforcement learning techniques (Reinforcement Learning from Human Feedback).
LLM-based Evaluation:
Using a larger general-knowledge model to evaluate the fine-tuned model’s behavior.
Here’s a quick flow to evaluate fine-tuned LLMs on custom datasets for the downstream task of content generation.
User Provides a Query
query: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how they work.
The query is passed to the fine-tuned LLM for inference, and the response is captured.
response: Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches. At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors. These vectors are derived from the data itself, typically through techniques like hashing or embedding. From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches. By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets. Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>
Defining the evaluation template
""" You are an AI assistant and your task is to evaluate the output generated by another LLM. You need to follow these steps: Step 1: Analyze the user query: {query} Step 2: Analyze the response: {output} Step 3: Evaluate the generated response based on the following criteria and provide a score from 1 to 5 along with a brief justification for each criterion: Evaluation: Relevance - [score] [1 sentence justification why relevance = score] Coherence - [score] [1 sentence justification why coherence = score] Conciseness - [score] [1 sentence justification why conciseness = score] """
Fill the evaluation template with the query (user query) and response (the fine-tuned LLM's response).
Pass the prompt to GPT-3.5-Turbo to evaluate the quality of the generated response.
Log the resulting Prompt and Evaluation chain to Comet ML LLM.
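Putting these steps together, a minimal sketch of the evaluation chain could look like the snippet below, assuming OPENAI_API_KEY and the Comet credentials are set in the environment and the template string from above is stored in template:

# Fill the evaluation template, score the response with GPT-3.5-Turbo,
# and log the prompt/evaluation pair to Comet LLM.
import comet_llm
from openai import OpenAI

client = OpenAI()

def evaluate(query: str, output: str, template: str) -> str:
    prompt = template.format(query=query, output=output)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    evaluation = completion.choices[0].message.content

    # Keep a record of the full chain so scores can be tracked over time
    comet_llm.log_prompt(prompt=prompt, output=evaluation)
    return evaluation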
For a full walkthrough implementation and detailed insights, check the full article on how to effectively evaluate fine-tuned LLMs ↆ
Quick setup for an LLM fine-tuning pipeline
Foundation models know a lot about a lot, but for production, we need models that know a lot about a little.
Thanks to the computational efficiency and flexibility that LoRA brings to the fine-tuning process, optimizing LLMs for custom downstream tasks has become easier than ever.
Here’s a quick setup to fine-tune a Mistral-7B-Instruct model on a custom instruct dataset, using Qwak to handle the heavy lifting.
What is Qwak
An ML engineering platform that simplifies the process of building, deploying, and monitoring machine learning models, bridging the gap between data scientists and engineers.
Key points within the ML Lifecycle that Qwak solves:
Deploying and iterating on your models faster
Testing, serializing, and packaging your models using a flexible build mechanism
Deploying models as REST endpoints or streaming applications
Gradually deploying and A/B testing your models in production
Build and Deployment versioning
Selective GPU Instance Pooling and Scheduling
Qwak Project Setup
[QwakNewModelBuild]
|--- main/
| |- __init__.py
| |- requirements.txt
| |- model.py
|--- tests/
| |- __init__.py
| |- unit_tests.py
|
|--- test_local_model.py
# intended to test the model with `run_local` on your machine to validate it before pushing to qwak
|--- test_live_model.py
# code to test the model in the process of Running Tests from above.
# Basically involves a `qwak_inference.RealTimeClient` class that wraps your model and passes a dummy input through it.
Key points from the project setup:
Under the main folder is the core functionality of the model.
requirements.txt - list of requirements for the build.
model.py - implementation of the fine-tuning loop, wrapping the workflow as a QwakModel (a condensed sketch follows below).
The tests folder contains any Unit Tests/Integration Tests that we define.
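The heart of model.py is the fine-tuning loop itself. A condensed sketch of what it might contain, using transformers, peft, and trl; argument names have shifted across trl versions, the dataset path and hyperparameters are placeholders, and the QwakModel wrapping plus Comet logging are covered in the full article:

# QLoRA-style fine-tuning of Mistral-7B-Instruct on a custom instruct dataset.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)

# Placeholder dataset: one formatted instruction/response string per row
dataset = load_dataset("json", data_files="instruct_dataset.json", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    dataset_text_field="text",           # column with the formatted training text
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="finetuned-mistral",
                           num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=4),
)
trainer.train()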
To streamline the deployment, one could use a Qwak YAML configuration file to define the build:
build_env:
  docker:
    assumed_iam_role_arn: null
    base_image: public.ecr.aws/qwak-us-east-1/qwak-base:0.0.13-gpu
    cache: true
    env_vars:
      - HUGGINGFACE_ACCESS_TOKEN="your-hf-token"
      - COMET_API_KEY="your-comet-key"
      - COMET_WORKSPACE="comet-workspace"
      - COMET_PROJECT="comet-project"
    no_cache: false
    params: []
    push: true
  python_env:
    dependency_file_path: finetuning/requirements.txt
    git_credentials: null
    git_credentials_secret: null
    poetry: null
    virtualenv: null
  remote:
    is_remote: true
    resources:
      cpus: null
      gpu_amount: null
      gpu_type: null
      instance: gpu.a10.2xl
      memory: null
build_properties:
  branch: finetuning
  build_id: null
  model_id: "your-model-name"
  model_uri:
    dependency_required_folders: []
    git_branch: master
    git_credentials: null
    git_credentials_secret: null
    git_secret_ssh: null
    main_dir: finetuning
    uri: .
  tags: []
deploy: false
deployment_instance: null
post_build: null
pre_build: null
purchase_option: null
step:
  tests: true
  validate_build_artifact: true
  validate_build_artifact_timeout: 120
verbose: 0
Here we specify:
The base image to download and install once the GPU instance is provisioned
Where to get the requirements from
The environment variables to set
The model implementation to build (model_id and main_dir)
Whether to run tests and validate the build artifacts
Once the fine-tuning loop is implemented, the requirements are specified and the build configuration is defined, one can quickly deploy the fine-tuning workflow to Qwak.
For the full walkthrough, make sure to check the detailed article on Medium ↆ
Images
If not otherwise stated, all images are created by the author.