Best practices when evaluating fine-tuned LLM models
How to evaluate a custom fine-tuned model, leveraging GPT-3.5-Turbo and custom qualitative evaluation templates, while monitoring prompts and chains using Comet ML LLM.
→ the 8th out of 11 lessons of the LLM Twin free course
What is your LLM Twin? It is an AI character that writes like yourself by incorporating your style, personality, and voice into an LLM.
Why is this course different?
By finishing the “LLM Twin: Building Your Production-Ready AI Replica” free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps good practices.
Why should you care? 🫵
→ No more isolated scripts or Notebooks! Learn production ML by building and deploying an end-to-end production-grade LLM system.
More details on what you will learn within the LLM Twin course, here 👈
Latest Lessons of the LLM Twin Course
Lesson 5: The 4 Advanced RAG Algorithms You Must Know to Implement
→ RAG System, Qdrant, Query Expansion, Self Query, Filtered vector search
Lesson 6: The Role of Feature Stores in Fine-Tuning LLMs
→ Custom Dataset Generation, Artifact Versioning, GPT3.5-Turbo Distillation, Qdrant
Lesson 7: How to fine-tune LLMs on custom datasets at Scale using Qwak and CometML
→ QLoRA, PEFT, Fine-tuning Mistral-7B-Instruct on a custom dataset, Qwak, Comet ML
Lesson 8: Best practices when evaluating fine-tuned LLM models
In Lesson 8, we’ll focus on common evaluation methods for the various tasks LLMs perform. For our specific use case, content generation, we’ll rely on human-in-the-loop review and use a larger model to assess coherence and quantify other qualitative metrics for our LLM’s generations.
It is important to differentiate between evaluating an LLM model in isolation and evaluating an LLM-based system.
During LLM evaluation, we focus only on how well our fine-tuned model generates content and how coherent that generation is.
Here’s what we’re going to learn in this lesson:
Common LLM evaluation methods for different LLM tasks.
Composing evaluation prompt templates for specific use cases.
Prompt, Chain Monitoring, and CometLLM integration.
The LLM-Twin model evaluation workflow.
Table of Contents:
What is LLM evaluation?
Evaluation Techniques
How we evaluate our LLM-Twin Model
Comet ML Prompt Monitoring
Conclusion
1. What is LLM evaluation?
LLM evaluation is a crucial process used to assess the performance and capabilities of the models. It involves a series of tests and analyses to determine how well the model understands, interprets, and generates human-like text.
Due to the generative nature of LLMs, the evaluation processes for these models involve both quantitative and qualitative assessments.
LLM Evaluation vs RAG Evaluation ✅
LLM evaluation focuses on the model’s ability to generate coherent, relevant, and contextually appropriate text based solely on its pre-trained knowledge.
This involves assessing metrics such as fluency, coherence, relevance, and adherence to the given prompts.
RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.
Metrics for RAG models often include precision and recall of the retrieval process, as well as the overall coherence and relevance of the augmented generation.
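To make the retrieval side concrete, here is a minimal, hypothetical sketch of precision@k and recall@k for a single query. The helper function and the document IDs are illustrative only and are not part of the LLM Twin codebase.
# Hypothetical helper: retrieval precision@k and recall@k for one query.
def precision_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> tuple[float, float]:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: 2 of the top-3 retrieved chunks are actually relevant.
print(precision_recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))  # (0.67, 0.67)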
2. Evaluation Techniques
Let’s split these techniques by their intended use case, into Quantitative and Qualitative metrics.
Quantitative evaluation
Involves statistical measures to assess the accuracy, fluency, and other aspects of the generated text.
BLEU - compares the n-gram overlap between the generated text and a reference text.
ROUGE - measures the overlap of n-grams, longest common subsequence, and word sequences between the generated text and reference texts.
Perplexity - how well the model predicts the next word based on the previous context.
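As a rough illustration of the first two metrics, here is a minimal sketch using the nltk and rouge_score packages. These libraries are assumed to be installed and are not part of the course code; the reference and candidate strings are made up.
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "vector databases index numerical embeddings for fast similarity search"
candidate = "vector databases store embeddings to enable fast similarity search"

# BLEU: n-gram overlap between the candidate and the (tokenized) reference.
bleu = sentence_bleu([reference.split()], candidate.split())

# ROUGE: unigram overlap (rouge1) and longest common subsequence (rougeL).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f} | ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")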
Qualitative evaluation
Qualitative evaluation involves human-in-the-loop judgment or larger models assessing aspects like relevance, coherence, creativity, and appropriateness of the content. This type of evaluation provides insights that quantitative metrics might miss.
Human Review: domain experts or general users review the generated content and assess its quality based on criteria such as coherence, fluency, relevance, and creativity.
Human-in-the-loop (RLHF): humans rate the quality of model outputs, and this feedback is used to fine-tune the model through reinforcement learning (Reinforcement Learning from Human Feedback).
LLM-based Evaluation: a larger, general-knowledge model is used to evaluate the model’s behavior.
❌ Why don’t BLEU & ROUGE work for our use case?
They measure n-gram overlap. The generated content might vary widely in wording while still reflecting the user’s query.
They lack semantic understanding. They don’t help evaluate the depth, coherence, or originality of the content.
They are weak on creativity. They can’t quantify stylistic elements or the overall human-like quality (see the short illustration below).
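A quick illustration of the n-gram problem, reusing nltk from the earlier sketch: two answers with the same meaning but different wording score close to zero, even though both are perfectly valid.
from nltk.translate.bleu_score import sentence_bleu

a = "Vector databases make similarity search fast by indexing embeddings."
b = "By indexing embeddings, a vector DB speeds up nearest-neighbour lookups."

# Almost no shared n-grams, so BLEU collapses despite the equivalent meaning.
print(sentence_bleu([a.split()], b.split()))  # close to 0.0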
3. How we evaluate our LLM-Twin Model
We aim to verify whether our fine-tuned model can generate contextually accurate posts/articles that reflect the provided query.
Within this LLM evaluation stage, we’ll focus on this section of the LLM Twin system design 👇. For a recap of the LLM Twin system design, check 👉 Lesson 1.
Here’s the workflow overview:
1. Define the evaluation prompt template
2. Define the user query
3. Generate content based on the user query
4. Populate the evaluation template
5. Use GPT-3.5-Turbo to evaluate
6. Log the evaluation prompt to Comet ML LLM
The Evaluation Prompt Template
from abc import ABC, abstractmethod

from langchain.prompts import PromptTemplate
from pydantic import BaseModel


class BasePromptTemplate(ABC, BaseModel):
    @abstractmethod
    def create_template(self, *args) -> PromptTemplate:
        pass


class LLMEvaluationTemplate(BasePromptTemplate):
    prompt: str = """
    You are an AI assistant and your task is to evaluate the output generated by another LLM.
    You need to follow these steps:
    Step 1: Analyze the user query: {query}
    Step 2: Analyze the response: {output}
    Step 3: Evaluate the generated response based on the following criteria and provide a score from 1 to 5 along with a brief justification for each criterion:
    Evaluation:
    Relevance - [score]
    [1 sentence justification why relevance = score]
    Coherence - [score]
    [1 sentence justification why coherence = score]
    Conciseness - [score]
    [1 sentence justification why conciseness = score]
    """

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["query", "output"])
Unpacking this template: given a user query and the generated response from our fine-tuned model, the evaluation model should analyze the (query, response) pair and rank their relationship on 3 criteria:
Relevance - how well the generated content aligns with the user query.
Coherence - how logically and smoothly the generated text flows.
Conciseness - how compact the generated text is, free from unnecessary or redundant words.
For each of these criteria, we ask the larger LLM (GPT-3.5-Turbo) for a score on a 1–5 scale. Here’s the evaluation functionality:
def eval(query: str, output: str) -> str:
    # Populate the evaluation template with the (query, output) pair.
    evaluation_template = templates.LLMEvaluationTemplate()
    prompt_template = evaluation_template.create_template()

    # Use GPT-3.5-Turbo as the judge model.
    model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY)
    chain = GeneralChain.get_chain(
        llm=model, output_key="llm_eval", template=prompt_template
    )

    response = chain.invoke({"query": query, "output": output})

    return response["llm_eval"]
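A hedged usage sketch of the function above: the query and answer strings are placeholders, whereas in the course the answer comes from the fine-tuned Mistral-7B model deployed on Qwak (see the workflow below).
query = "Could you please draft a LinkedIn post discussing Vector Databases?"
answer = "Vector databases index embeddings to enable fast similarity search..."  # placeholder generation

evaluation_report = eval(query=query, output=answer)
print(evaluation_report)  # GPT-3.5-Turbo's 1-5 scores and justifications, following the template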
The functional evaluation workflow👇:
...
prompt_template = self.template.create_template(enable_rag=enable_rag)
prompt_template_variables = {
    "question": query,
}
prompt = prompt_template.format(question=query)

# Call the fine-tuned model deployed on Qwak.
input_ = pd.DataFrame([{"instruction": prompt}]).to_json()
response: list[dict] = self.qwak_client.predict(input_)
answer = response[0]["content"][0]

if enable_evaluation is True:
    # Ask GPT-3.5-Turbo to score the (query, answer) pair.
    evaluation_result = evaluate_llm(query=query, output=answer)

if enable_monitoring is True:
    # Log the prompt, the answer and the evaluation result to Comet ML LLM.
    metadata = {"llm_evaluation_result": evaluation_result}
    self.prompt_monitoring_manager.log(
        prompt=prompt,
        prompt_template=prompt_template.template,
        prompt_template_variables=prompt_template_variables,
        output=answer,
        metadata=metadata,
    )
    self.prompt_monitoring_manager.log_chain(
        query=query, response=answer, eval_output=evaluation_result
    )
To check the full implementation, visit our 👉 LLM Twin Repository.
Note that here we’re using the model deployed in the previous lesson, within the training pipeline, as the model under evaluation. The model we’ll deploy to production within the inference pipeline will be a separate instance, selected as the best candidate based on our evaluation results.
Points from this implementation:
We pass the query to our fine-tuned Mistral-7B model deployed on Qwak.
We get the response and pass the (query, response) pair to evaluation.
The evaluation template is populated and sent to GPT-3.5-Turbo.
The resulting prompt is logged to the Comet ML LLM feature.
Here’s an example:
query:
------
Could you please draft a LinkedIn post discussing Vector Databases?
I`m particularly interested in how do they work.
response:
---------
Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches.
At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors.
These vectors are derived from the data itself, typically through techniques like hashing or embedding.
From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches.
By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets.
Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>"
Next, you can see the logs from our Evaluation Chain.
> Entering new LLMChain chain...
...
> Finished chain.
Step 1: Analyze the user query:
The user is requesting a LinkedIn post draft that discusses Vector Databases, with a focus on their functionality.
Step 2: Analyze the response:
The response generated by the other LLM provides an answer that explains vector databases, how they represent data, their similarity to search engines, and touches on the process of indexing and searching within these databases.
Step 3: Evaluate the generated response:
- Relevance [4] - The output is highly relevant as it directly addresses the user's interest in vector databases and how they work.
- Coherence [5] - The output is coherent as it presents a logical flow of information regarding vector databases.
- Conciseness [4] - The output is fairly concise, delivering a good amount of information in a compact format suitable for a LinkedIn post.
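Because the template pins down the output format, the scores can be parsed back into numbers and aggregated across many test queries. Here is a minimal, hypothetical parsing sketch (not part of the course code) that assumes GPT-3.5-Turbo keeps the `Criterion [score]` pattern shown above.
import re

# Hypothetical helper: extract the 1-5 scores from the evaluation text.
def parse_scores(evaluation: str) -> dict[str, int]:
    pattern = r"(Relevance|Coherence|Conciseness)\s*-?\s*\[(\d)\]"
    return {criterion.lower(): int(score) for criterion, score in re.findall(pattern, evaluation)}

# `evaluation_result` is the string returned by evaluate_llm in the workflow above.
print(parse_scores(evaluation_result))  # e.g. {'relevance': 4, 'coherence': 5, 'conciseness': 4}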
4. Comet ML Prompt Monitoring
Apart from its rich feature set for experiment tracking, Comet ML LLM also offers useful features for monitoring your LLM-based applications.
Why monitor prompts?
Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses, maintaining accuracy and coherence in user interactions, while also allowing the ML engineers maintaining the project to identify bias or hallucinations early on and work on fixing them.
Why is it a best practice? 👇
By logging and inspecting multiple sets of resulting prompts, you can extract insights and aggregate them into generalized metrics.
Useful for RLHF analysis.
Useful for inspecting a full chain, alongside its metadata, processing time, and the chain stages being executed.
Below, you’ll find an example of a chain + prompt monitoring dashboard from Comet ML LLM:
To log prompts to Comet LLM, we use the following straightforward implementation:
...
@classmethod
def log(
    cls,
    prompt: str,
    output: str,
    prompt_template: str | None = None,
    prompt_template_variables: dict | None = None,
    metadata: dict | None = None,
) -> None:
    comet_llm.init()

    # Always attach the model type to the logged metadata.
    metadata = metadata or {}
    metadata = {
        "model": settings.MODEL_TYPE,
        **metadata,
    }

    comet_llm.log_prompt(
        workspace=settings.COMET_WORKSPACE,
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        prompt=prompt,
        prompt_template=prompt_template,
        prompt_template_variables=prompt_template_variables,
        output=output,
        metadata=metadata,
    )
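A hedged usage sketch of this method: the class name PromptMonitoringManager is assumed from the prompt_monitoring_manager attribute in the workflow snippet earlier, and the variables are reused from that snippet; check the repository for the exact names.
PromptMonitoringManager.log(
    prompt=prompt,
    output=answer,
    prompt_template=prompt_template.template,
    prompt_template_variables={"question": query},
    metadata={"llm_evaluation_result": evaluation_result},
)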
To log chains, we have to log each chain step in order. In the example below, we start the chain with {"user_query": query} and link each subsequent stage using comet_llm.Span, where a stage’s inputs should match the outputs of the previous stage.
We would have a chain like this:
🆕 INPUT -> 🆗 TWIN_RESPONSE -> 🆗 GPT3.5-EVAL -> 🔚END
@classmethod
def log_chain(cls, query: str, response: str, eval_output: str) -> None:
    comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")
    comet_llm.start_chain(
        inputs={"user_query": query},
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        workspace=settings.COMET_WORKSPACE,
    )

    # Stage 1: the fine-tuned twin generates the response from the user query.
    with comet_llm.Span(
        category="twin_response",
        inputs={"user_query": query},
    ) as span:
        span.set_outputs(outputs=response)

    # Stage 2: GPT-3.5-Turbo evaluates the twin's response; its inputs match
    # the outputs of the previous stage, and its outputs are the evaluation.
    with comet_llm.Span(
        category="gpt3.5-eval",
        inputs={"twin_response": response},
    ) as span:
        span.set_outputs(outputs=eval_output)

    comet_llm.end_chain(outputs={"response": response, "eval_output": eval_output})
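And a hedged usage sketch, mirroring the workflow snippet from earlier (the class name and variables are reused from that snippet and are assumptions, not part of this excerpt):
PromptMonitoringManager.log_chain(
    query=query, response=answer, eval_output=evaluation_result
)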
For more details on structuring and logging chains on Comet ML LLM, check
🔗 Comet ML Chain Logging
Conclusion
Here’s what we’ve learned in Lesson 8:
We’ve described common evaluation metrics, quantitative and qualitative.
We have exemplified a common evaluation approach, using a larger model (GPT-3.5-Turbo) to assess and rank our model’s responses based on relevance, coherence, and conciseness.
How to log prompts and chains to the Comet ML LLM feature.
How to define efficient evaluation prompt templates for content generation.
In Lesson 9, we’ll cover the process of building the inference RAG pipeline. We’ll connect the various components of the LLM-Twin system, such as the Qdrant vector DB and the Qwak inference pipeline, and prepare the system for a complete deployment. See you there!
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Next Steps
Step 1
This is just the short version of Lesson 8 on Best practices when evaluating fine-tuned LLM models
→ For…
The full implementation.
Discussion on detailed code
In-depth walkthrough of how common evaluation methods work.
More insights on how to use Comet ML LLM
Check out the full version of Lesson 8 on our Medium publication. It’s still FREE:
Step 2
→ Check out the LLM Twin GitHub repository and try it yourself 🫵
Nothing compares with getting your hands dirty and building it yourself!
Images
If not otherwise stated, all images are created by the author.
Hi!
How do you make sure the evaluation chain will always format the answer in the expected format?
I assume that you need the answer in a specific structure in order to parse it and exploit the result.
Four Methods for Evaluating Large Language Models:
1. Fast, Cheap, and Simple: Using metrics like ROUGE (for summarization) or BLEU (for translation) to evaluate LLMs allows us to quickly and automatically assess most of the generated text.
2. Benchmark Datasets (Broad Coverage): these extensive question-and-answer sets cover a wide range of topics, enabling us to score LLMs quickly and inexpensively.
3. LLM Self-Evaluation: This method is quick and easy to implement but can be costly to run. It is useful when the evaluation task is easier than the original task itself.
4. Human Expert Evaluation: Arguably the most reliable, but the slowest and most expensive to implement, especially when requiring highly skilled human experts.