3 Comments

Hi!

How do you make sure the evaluation chain will **always** format the answer in the expected format?

I assume that you need the answer in a specific structure in order to parse it and exploit the result.


Hi Ahmed,

The evaluation chain is composed as follows:

query -> inference with finetuned model

inference -> response

(query, response) -> fill evaluation prompt template

evaluation prompt template -> GPT3.5-Turbo

GPT3.5-Turbo -> evaluation response

The answer will be some generated text (a post, an article, etc.). We then use this answer, together with the original query, to fill the evaluation prompt template and ask GPT3.5-Turbo to rate how well the response matches the query on metrics such as relevance and cohesiveness. In other words, GPT acts as a judge that marks our finetuned LLM's responses.
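For concreteness, here is a minimal sketch of that chain in Python, assuming the official `openai` client. The prompt wording, the `finetuned_model_generate` placeholder, and the JSON score format are illustrative assumptions, not the exact implementation from the article; the JSON parsing fallback also addresses the original question, since the judge is not guaranteed to follow the requested format every time.

```python
import json
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical placeholder for the finetuned model's inference step.
def finetuned_model_generate(query: str) -> str:
    raise NotImplementedError("call your finetuned LLM here")

# Illustrative evaluation prompt template (not the article's exact wording).
EVALUATION_TEMPLATE = """You are an evaluator. Rate the RESPONSE against the QUERY.
Return only JSON of the form {{"relevance": 1-5, "cohesiveness": 1-5}}.

QUERY: {query}
RESPONSE: {response}
"""

def evaluate(query: str) -> dict:
    # 1. query -> inference with the finetuned model -> response
    response = finetuned_model_generate(query)

    # 2. (query, response) -> fill the evaluation prompt template
    prompt = EVALUATION_TEMPLATE.format(query=query, response=response)

    # 3. evaluation prompt -> GPT-3.5-Turbo -> evaluation response
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # lower variance in the judge's scores
        messages=[{"role": "user", "content": prompt}],
    )
    raw = completion.choices[0].message.content

    # 4. Parse the scores; keep the raw text if the judge ignored the format.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"raw_evaluation": raw}
```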

Let me know if that answers your question.


Four Methods for Evaluating Large Language Models:

1. Fast, Cheap, and Simple: Metrics like ROUGE (for summarization) or BLEU (for translation) let us assess most generated text quickly and automatically (see the sketch after this list).

2. Broad Coverage: Extensive question-and-answer benchmark sets cover a wide range of topics, enabling us to score LLMs quickly and inexpensively.

3. LLM Self-Evaluation: This method is quick and easy to implement but can be costly to run. It is useful when the evaluation task is easier than the original task itself.

4. Human Expert Evaluation: Arguably the most reliable, but the slowest and most expensive to implement, especially when requiring highly skilled human experts.
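As a small illustration of method 1, here is a sketch of computing ROUGE scores, assuming the `rouge-score` package (pip install rouge-score); the reference and candidate texts are made up for the example.

```python
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# ROUGE-1 compares unigram overlap; ROUGE-L compares the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    # Each score exposes precision, recall, and F-measure components.
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```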
