How to evaluate a custom fine-tuned model, leveraging GPT-3.5-Turbo and custom qualitative evaluation templates, while monitoring prompts and chains using Comet ML LLM.
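For the monitoring part, here is a minimal sketch of logging an evaluation prompt and its output to Comet ML LLM, assuming the `comet_llm` Python package and a configured Comet API key; the prompt text, output, and metadata fields are illustrative, not the exact ones used in the course.

```python
import comet_llm

# Log one (prompt, output) pair so each evaluation call can be inspected
# later in the Comet ML LLM UI. All values shown here are illustrative.
comet_llm.log_prompt(
    prompt="Rate the relevance and cohesiveness of the response to the query ...",
    output='{"relevance": 4, "cohesiveness": 5}',
    metadata={"model": "gpt-3.5-turbo", "step": "llm_evaluation"},
)
```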
Hi!
How do you make sure the evaluation chain will **always** format the answer in the expected format?
I assume you need the answer in a specific structure in order to parse it and use the result.
Hi Ahmed,
The evaluation chain is composed as follows:
query -> inference with the fine-tuned model
inference -> response
(query, response) -> fill evaluation prompt template
evaluation prompt template -> GPT-3.5-Turbo
GPT-3.5-Turbo -> evaluation response
The response will be a generated "something" (a post, an article, etc.). We then use this response together with the query to populate the evaluation prompt template and ask GPT-3.5-Turbo to rate how well the response fits the query, using relevance and cohesiveness as metrics.
In short, we fill the evaluation template with the query and response and let GPT-3.5-Turbo grade our fine-tuned LLM's output.
Let me know if that answers your question
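To make the last two steps concrete (and to address the formatting question above), here is a minimal sketch assuming the official `openai` Python client; the template wording, the 1-5 scale, and the JSON-output instruction are illustrative choices rather than the course's exact implementation. Instructing the model to return a JSON object (and, on GPT-3.5-Turbo versions that support it, passing `response_format={"type": "json_object"}`) is one common way to keep the evaluation response parseable.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative evaluation prompt template, filled with (query, response).
EVALUATION_TEMPLATE = """You are an evaluator.
Given the user query and the generated response below, rate the response
on relevance and cohesiveness from 1 to 5.
Return ONLY a JSON object of the form {{"relevance": <int>, "cohesiveness": <int>}}.

Query: {query}
Response: {response}
"""


def evaluate_response(query: str, response: str) -> dict:
    prompt = EVALUATION_TEMPLATE.format(query=query, response=response)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
        response_format={"type": "json_object"},  # nudge the model toward valid JSON
    )
    # Parse the evaluation response; json.loads fails loudly if the format drifts.
    return json.loads(completion.choices[0].message.content)


# Example usage with the fine-tuned model's output:
# scores = evaluate_response(query="Write a short post about MLOps",
#                            response=generated_post)
# print(scores)  # e.g. {"relevance": 4, "cohesiveness": 5}
```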
Four Methods for Evaluating Large Language Models:
1. Fast, Cheap, and Simple: Metrics like ROUGE (for summarization) or BLEU (for translation) compare generated text against reference outputs, letting us score most of the generated text quickly and automatically (see the sketch after this list).
2. Broad Coverage: Benchmark question-and-answer sets that cover a wide range of topics enable us to score LLMs quickly and inexpensively.
3. LLM Self-Evaluation: This method is quick and easy to implement but can be costly to run. It is useful when the evaluation task is easier than the original task itself.
4. Human Expert Evaluation: Arguably the most reliable, but the slowest and most expensive to implement, especially when requiring highly skilled human experts.
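As a quick illustration of the first method, here is a minimal sketch using the Hugging Face `evaluate` library to compute ROUGE and BLEU against reference texts; the example strings are illustrative.

```python
import evaluate  # pip install evaluate rouge_score

# Load the two classic overlap metrics mentioned above.
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["The model summarizes the article about MLOps pipelines."]
references = ["The article about MLOps pipelines is summarized by the model."]

# ROUGE takes one reference string per prediction.
print(rouge.compute(predictions=predictions, references=references))

# BLEU takes a list of reference strings per prediction.
print(bleu.compute(predictions=predictions, references=[references])["bleu"])
```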