2 Key LLMOps Concepts
How to monitor LLM & RAG applications. Evaluate your RAG like a pro. Learn about the memory and compute requirements of LLMs.
Decoding ML Notes
This week's topics:
A powerful framework to evaluate RAG pipelines
Why do LLMs require so much VRAM?
LLMOps Chain Monitoring
One framework to evaluate your RAG - RAGAs
Building a RAG pipeline is fairly simple. You just need a vector-DB knowledge base, an LLM to process your prompts, and additional logic for the interactions between these modules.
However, reaching a satisfying performance level poses its own challenges because of the "separate" components:
- Retriever - queries the knowledge DB and retrieves additional context that matches the user's query.
- Generator - wraps the LLM module and generates an answer based on the context-augmented prompt.
When evaluating a RAG pipeline, we must evaluate both components separately and together. A minimal sketch of the two components is shown below.
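To make the two components concrete, here is a minimal, framework-agnostic sketch; `vector_db` and `llm` are hypothetical clients standing in for whatever vector store and model provider you actually use.

```python
# Illustrative only: `vector_db` and `llm` are hypothetical clients that stand in
# for your actual vector store (e.g., Qdrant) and LLM provider.
def rag_answer(query: str, vector_db, llm, top_k: int = 3) -> str:
    # Retriever: fetch the chunks most similar to the user's query.
    chunks = vector_db.search(query, limit=top_k)

    # Build the context-augmented prompt.
    context = "\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # Generator: the LLM produces the final answer from the augmented prompt.
    return llm.generate(prompt)
```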
What is RAGAs?
A framework that helps you evaluate your Retrieval Augmented Generation (RAG) pipelines. One of the core concepts of RAGAs is Metric-Driven Development (MDD), a product development approach that relies on data to make well-informed decisions.
What metrics does RAGAs expose?
For the Retrieval Stage:
- Context Precision - evaluates the precision of the context used to generate an answer, ensuring relevant information is selected from the context.
- Context Relevancy - measures how relevant the selected context is to the question.
- Context Recall - measures whether all the relevant information required to answer the question was retrieved.
- Context Entities Recall - evaluates the recall of entities within the context, ensuring that no important entities are overlooked.
For the Generation Stage:
- Faithfulness - measures how accurately the generated answer reflects the source content, ensuring the generated content is truthful and reliable.
- Answer Relevance - validates that the response directly addresses the user's query.
- Answer Semantic Similarity - measures how semantically aligned the generated content is with the expected response.
- Answer Correctness - focuses on fact-checking, assessing the factual accuracy of the generated answer.
How to evaluate using RAGAs?
1. Prepare your questions, answers, contexts, and ground_truths.
2. Compose a Dataset object.
3. Select the metrics you care about.
4. Run the evaluation.
5. Monitor the scores or log the entire evaluation chain to a platform like Comet ML.
A minimal sketch of these steps is shown below.
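Here is a small, illustrative sketch of the workflow using the open-source ragas and datasets packages. The exact imports and column names (e.g., ground_truth vs. ground_truths) vary between ragas versions, a judge LLM (e.g., an OpenAI API key) must be configured for the metrics to run, and the sample data below is made up.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# 1. Prepare questions, answers, contexts, and ground truths (toy data here).
samples = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],
}

# 2. Compose a Dataset object.
dataset = Dataset.from_dict(samples)

# 3. Select metrics and 4. run the evaluation (requires a judge LLM configured).
results = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

# 5. Monitor the scores or push them to an experiment tracker such as Comet ML.
print(results)
```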
For a full end-to-end workflow of RAGAs evaluation in practice, I've described it in this LLM-Twin Course article:
Why are LLMs so Memory-hungry?
LLMs require lots of GPU memory; let's see why that's the case.
What is an LLM parameter?
LLMs, like Mistral 7B or Llama3-8B, have billions of parameters. Each parameter is a weight that is stored and accessed during computation.
How much GPU VRAM is required? These are the most popular precision formats that LLMs are trained in:
- FP32 - 32-bit floating point
- FP16 / BF16 - 16-bit floating point
Most setups use mixed precision, e.g., matmuls in BF16 and accumulations in FP32, as in the sketch below.
For this example, we'll use half-precision BF16.
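As a quick illustration of mixed precision (my own example, not from the article), PyTorch's autocast runs matmuls in BF16 while the stored weights and accumulations stay in FP32:

```python
import torch

# Illustrative mixed-precision forward pass (assumes a CUDA GPU): the linear
# layer's weights are stored in FP32, but the matmul inside the autocast
# region executes in BF16.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(model.weight.dtype)  # torch.float32
print(y.dtype)             # torch.bfloat16
```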
Here's a deeper dive on this topic:
- Google BFloat16
- LLMs Precision Benchmark
Let's calculate the VRAM required. As 1 byte = 8 bits, we've got:
- FP32 = 32 bits = 4 bytes
- FP16 / BF16 = 16 bits = 2 bytes
Now, for a 7B model, we would require:
VRAM = 7 * 10^9 (billion) * 2 bytes = 14 * 10^9 bytes
Knowing that 1GB = 10^9 bytes, we get 14GB as the VRAM required to load a 7B model for inference in half-precision BF16.
This is purely for loading the parameters.
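Here is a tiny, illustrative helper (my own names, not a library API) that reproduces this arithmetic for any parameter count and precision:

```python
# Bytes per parameter for common precision formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weights_vram_gb(n_params: float, precision: str = "bf16") -> float:
    """VRAM needed just to load the weights, in GB (1 GB = 10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

print(weights_vram_gb(7e9, "bf16"))  # 14.0 GB for a 7B model in BF16
print(weights_vram_gb(7e9, "fp32"))  # 28.0 GB in full FP32
```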
Ever encountered the CUDA OOM error, e.g. "Tried to allocate +56MB ..." when running inference? Here's the most plausible cause:
No GPU VRAM is left for the activations. Let's figure out the activation size required, using Llama2-7B as an example.
Activations are a combination of the following model parameters:
- Context Length (N)
- Hidden Size (H)
- Precision (P)
After a quick look at the Llama2-7B model configuration, we get these values (a back-of-the-envelope estimate follows below):
- Context Length (N) = 4096 tokens
- Hidden Size (H) = 4096 dims
- Precision (P) = BF16 = 2 bytes
Llama2-7B Model Params: shorturl.at/CWOJ9
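To show how these values combine, here is my own back-of-the-envelope sketch (not the article's exact formula) of one major activation-memory component during inference, the KV cache, which stores keys and values for every layer and token:

```python
# Rough KV-cache estimate for a decoder-only model (illustrative only).
# The factor 2 accounts for storing both keys (K) and values (V).
def kv_cache_gb(n_layers: int, context_len: int, hidden_size: int,
                bytes_per_value: int = 2) -> float:
    return 2 * n_layers * context_len * hidden_size * bytes_per_value / 1e9

# Llama2-7B: 32 layers, N = 4096 tokens, H = 4096 dims, BF16 (2 bytes).
print(kv_cache_gb(n_layers=32, context_len=4096, hidden_size=4096))  # ~2.15 GB per sequence
```

Multiply that by the batch size and it becomes clear how activations can exhaust whatever VRAM is left after loading the weights.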
Consult this interactive LLM VRAM calculator to check the different memory segments reserved when running inference or training LLMs:
Inference/Training VRAM Calculator
For training, things are a little different, as more factors come into play; memory is also allocated for (see the rough estimate after this list):
- Full activations, considering the number of attention heads and the number of layers
- Optimizer states, which differ based on the optimizer type
- Gradients
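As a rough, illustrative rule of thumb (mine, not from the article), mixed-precision training with Adam needs on the order of 16 bytes per parameter before activations are even counted:

```python
# Approximate bytes per parameter for mixed-precision Adam training:
#   2  BF16 weights
#   2  BF16 gradients
#   4  FP32 master weights
#   4  FP32 Adam first moment
#   4  FP32 Adam second moment
BYTES_PER_PARAM_TRAINING = 2 + 2 + 4 + 4 + 4  # = 16 bytes/param

def training_vram_gb(n_params: float) -> float:
    return n_params * BYTES_PER_PARAM_TRAINING / 1e9

print(training_vram_gb(7e9))  # ~112 GB for a 7B model, excluding activations
```

This is exactly why parameter-efficient methods like PEFT and QLoRA are so popular for fine-tuning.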
Here's a tutorial on PEFT and QLoRA fine-tuning in action:
Other Resources:
- Model Anatomy: shorturl.at/nJeu0
- VRAM for Serving: shorturl.at/9UPBE
- LLM VRAM Explorer: shorturl.at/yAcTU
One key LLMOps concept - Chain Monitoring
In traditional ML systems, it is easier to trace a problem back to its source than in generative AI systems built on LLMs. When working with LLMs, their generative nature can lead to complex and sometimes unpredictable behavior.
A solution for that?
"Log prompts or entire chains with representative metadata when testing/evaluating your LLM." One platform that I like and have been using for this task is CometML LLM.
Here are a few cases where it proves beneficial:
For Summarization Tasks
Here the query is the larger text to summarize and the LLM's response is the summary. You could calculate the ROUGE score between query and response inline and add it to the metadata field, then compose a JSON payload with the query, response, and rouge_score and log it to Comet ML, as in the sketch below.
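A minimal, illustrative sketch of that idea using the comet_llm package (argument names may differ between versions, the ROUGE computation assumes the rouge_score package, and the texts and project name are placeholders):

```python
import comet_llm
from rouge_score import rouge_scorer

document = "<the long text sent to the LLM>"   # placeholder
summary = "<the LLM's generated summary>"      # placeholder

# Compute an inline ROUGE-L score between the source text and the summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(document, summary)["rougeL"].fmeasure

# Log the prompt/response pair plus metadata to Comet ML's LLM module.
comet_llm.log_prompt(
    prompt=document,
    output=summary,
    metadata={"task": "summarization", "rouge_l": rouge_l},
    project="llm-monitoring",  # assumed project name
)
```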
For Q&A Tasks
Here, you could log the Q&A pairs separately, or even add an evaluation step that uses a larger model to judge the response. Each pair would be composed of Q, A, GT (ground truth), and a True/False flag to mark the evaluation.
For Generation Tasks
You could log the query and response, and append a few qualitative metrics (e.g. relevance, cohesiveness) to the metadata.
For RAG
If you have complex chains within your RAG application, you could log the prompt structures (sys_prompt, query) and the LLM responses, and track the chain execution step by step.
For NER
You could define the entity fields and log the query, response, entities_list, and extracted_entities in the same prompt payload.
For Vision Transformers
CometML LLM also allows you to log images associated with a prompt or a chain. If you're working with GPT-4 Vision, for example, you could log the query and the associated image in the same payload.
Also, besides the actual prompt payload, you could inspect the processing time for each step of a chain.
For example, a 3-step chain in a RAG application might query the vector DB, compose the prompt, and pass it to the LLM. When logging the chain to CometML, you can see the processing time per chain step, as in the sketch below.
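An illustrative sketch of logging such a 3-step chain with comet_llm (vector_db and llm are hypothetical clients, and the chain API may differ between comet_llm versions):

```python
import comet_llm

def answer_and_log(query: str, vector_db, llm) -> str:
    # Each Span is logged with its own inputs, outputs, and duration.
    comet_llm.start_chain(inputs={"user_query": query}, project="rag-monitoring")

    with comet_llm.Span(category="retrieval", inputs={"query": query}) as span:
        chunks = vector_db.search(query, limit=3)        # step 1: query the vector DB
        span.set_outputs(outputs={"n_chunks": len(chunks)})

    with comet_llm.Span(category="prompt_composition", inputs={"query": query}) as span:
        context = "\n".join(chunk.text for chunk in chunks)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        span.set_outputs(outputs={"prompt": prompt})     # step 2: compose the prompt

    with comet_llm.Span(category="llm_call", inputs={"prompt": prompt}) as span:
        answer = llm.generate(prompt)                    # step 3: call the LLM
        span.set_outputs(outputs={"answer": answer})

    comet_llm.end_chain(outputs={"answer": answer})
    return answer
```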
To set it up, you'll need (a minimal setup sketch follows below):
- The CometML pip package
- A CometML API key
- A workspace name and a project name
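A minimal setup sketch, assuming the comet_llm package (the exact package name and init arguments may differ depending on your Comet setup; the credentials are placeholders):

```python
# pip install comet-llm   # assumed package name; check Comet's docs for your setup
import comet_llm

# Configure credentials once per environment; these values are placeholders.
comet_llm.init(
    api_key="YOUR_COMET_API_KEY",
    workspace="your-workspace",
    project="your-llm-project",
)
```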
I've used this approach when evaluating a fine-tuned LLM on a custom instruction dataset. For a detailed walkthrough:
Images
If not otherwise stated, all images are created by the author.