Discussion about this post

Ahmed Besbes

Hi!

How do you make sure the evaluation chain will **always** format the answer in the expected format?

I assume you need the answer in a specific structure in order to parse it and use the result.
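
One common way to keep the evaluator's output parseable is to ask for JSON, validate it, and retry on failure. The sketch below illustrates that general pattern; it is not necessarily what this post's chain does, and `call_llm`, `EXPECTED_KEYS`, and the retry logic are assumptions for illustration.

```python
# Minimal sketch: validate the evaluator's reply against a fixed schema,
# and re-ask the model when parsing fails.
import json
import re

EXPECTED_KEYS = {"score", "reasoning"}  # assumed schema for illustration

def parse_evaluation(raw_text: str) -> dict:
    """Extract the first JSON object from the reply and check its keys."""
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    data = json.loads(match.group(0))
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

def evaluate_with_retries(call_llm, prompt: str, max_retries: int = 3) -> dict:
    """`call_llm` is a hypothetical str -> str wrapper around the evaluation chain."""
    last_error = None
    for _ in range(max_retries):
        reply = call_llm(prompt)
        try:
            return parse_evaluation(reply)
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc  # re-ask; optionally feed the error back into the prompt
    raise RuntimeError(f"evaluator never produced valid output: {last_error}")
```
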

Meng Li

Four Methods for Evaluating Large Language Models:

1. Fast, Cheap, and Simple: Automatic metrics like ROUGE (for summarization) or BLEU (for translation) let us score most generated text quickly and automatically (see the ROUGE sketch after this list).

2. Broad-Coverage Benchmarks: Large question-and-answer benchmark sets cover a wide range of topics, enabling us to score LLMs quickly and inexpensively.

3. LLM Self-Evaluation: This method is quick and easy to implement but can be costly to run. It is useful when the evaluation task is easier than the original task itself (see the judge sketch after this list).

4. Human Expert Evaluation: Arguably the most reliable, but the slowest and most expensive to implement, especially when requiring highly skilled human experts.
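
For method 1, a quick sketch of computing ROUGE in Python, assuming Google's `rouge-score` package is installed (`pip install rouge-score`); the reference and candidate strings are made-up examples:

```python
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."        # human-written summary
candidate = "A cat was sitting on the mat."  # model-generated summary

# ROUGE-1 compares unigrams, ROUGE-L compares longest common subsequences.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # score(target, prediction)

for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```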
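
For method 3, a minimal sketch of LLM self-evaluation (LLM-as-judge); `call_llm` is a hypothetical wrapper around whichever model API you use, and the 1-to-5 scale is an assumption for illustration:

```python
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def judge(call_llm, question: str, answer: str) -> int:
    """Ask a model to grade an answer and return the parsed score."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip().split()[0])  # expect the score as the first token
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score
```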
