DML: What is the difference between your ML development and continuous training environments?
3 techniques you must know to evaluate your LLMs quickly. Experimentation vs. continuous training environments.
Hello there, I am Paul Iusztin 👋🏼
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
This week's ML & MLOps topics:
3 techniques you must know to evaluate your LLMs quickly
What is the difference between your ML development and continuous training environments?
Story: Job roles tell you there is just one type of MLE, but there are actually 3.
But first, I want to let you know that after 1 year of making content, I finally decided to share my content on Twitter/X.
I made this decision because everybody has a different way of reading and interacting with their socials.
...and I want everyone to enjoy my content on their favorite platform.
I even bought that stu*** blue tick to show that I am serious about this.
So...
If you like my content and you are a Twitter/X person ↓
↳🔗 Follow me at @iusztinpaul
#1. 3 techniques you must know to evaluate your LLMs quickly
Manually testing the output of your LLMs is a tedious and painful process โ you need to automate it.
In generative AI, most of the time, you cannot leverage standard metrics.
Thus, the real question is, how do you evaluate the outputs of an LLM?
Depending on your problem, here is what you can do ↓
#1. Structured answers - you know exactly what you want to get
Even if you use an LLM to generate text, you can ask it to generate a response in a structured format (e.g., JSON) that can be parsed.
You know exactly what you want (e.g., a list of products extracted from the user's question).
Thus, you can easily compare the generated and ideal answers using classic approaches.
For example, when extracting the list of products from the user's input, you can do the following (see the sketch below):
- check if the LLM outputs a valid JSON structure
- use a classic method to compare the generated and real answers
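To make this concrete, here is a minimal Python sketch of that check. The evaluate_structured_answer helper, the "products" key, and the precision/recall comparison are illustrative assumptions, not a prescribed implementation:

```
import json


def evaluate_structured_answer(llm_output: str, ideal_products: list[str]) -> dict:
    """Check that the LLM returned valid JSON and compare it to the ideal answer."""
    # 1. Check if the LLM outputs a valid JSON structure with the expected key.
    try:
        products = json.loads(llm_output)["products"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"valid_json": False, "precision": 0.0, "recall": 0.0}

    # 2. Use a classic method (set overlap) to compare the generated and real answers.
    generated, ideal = set(products), set(ideal_products)
    overlap = len(generated & ideal)
    return {
        "valid_json": True,
        "precision": overlap / len(generated) if generated else 0.0,
        "recall": overlap / len(ideal) if ideal else 0.0,
    }


# Example usage:
print(evaluate_structured_answer('{"products": ["iPhone 15", "AirPods"]}', ["iPhone 15"]))
# -> {'valid_json': True, 'precision': 0.5, 'recall': 1.0}
```

Because the output is structured, any classic comparison (exact match, set overlap, edit distance) works; no LLM judge is needed here.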
#2. No "right" answer (e.g., generating descriptions, summaries, etc.)
When generating sentences, the LLM can use different styles, words, etc. Thus, traditional metrics (e.g., the BLEU score) are too rigid to be useful.
You can leverage another LLM to test the output of your initial LLM. The trick is in what questions to ask.
When testing LLMs, you won't have the big test split you are used to. A set of 10-100 tricky examples usually does the job (so it won't be costly).
Here, we have 2 sub-scenarios:
↳ 2.1. When you don't have an ideal answer to compare the answer to (you don't have ground truth)
You don't have access to an expert to write an ideal answer for a given question to compare it to.
Based on the initial prompt and generated answer, you can compile a set of questions and pass them to an LLM. Usually, these are Y/N questions that you can easily quantify to check the validity of the generated answer.
This is known as "Rubric Evaluation".
For example:
"""
- Is there any disagreement between the response and the context? (Y or N)
- Count how many questions the user asked. (output a number)
...
"""
This strategy is intuitive, as you can ask the LLM any question you are interested in as long as it can output a quantifiable answer (Y/N or a number).
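Here is one way Rubric Evaluation could be wired up in Python. The call_llm helper and the naive parsing logic are hypothetical placeholders for whatever LLM client you use, and the sketch assumes the evaluator answers each question on its own line:

```
# Minimal Rubric Evaluation sketch: ask an evaluator LLM only quantifiable
# (Y/N or numeric) questions about the generated answer.
# `call_llm` is a hypothetical helper that sends a prompt to your LLM of choice
# and returns its text response.

RUBRIC_PROMPT = """You are an evaluator. Based on the context, the user question and
the generated answer, reply to each question below on its own line, exactly as asked.

Context: {context}
User question: {question}
Generated answer: {answer}

- Is there any disagreement between the response and the context? (Y or N)
- Count how many questions the user asked. (output a number)
"""


def rubric_evaluation(context: str, question: str, answer: str, call_llm) -> dict:
    response = call_llm(
        RUBRIC_PROMPT.format(context=context, question=question, answer=answer)
    )
    # Keep only the non-empty lines; we assume one bare answer per question.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return {
        "disagreement": lines[0].upper().startswith("Y"),
        "user_question_count": int(lines[1]),
    }


# Example with a stubbed LLM client:
fake_llm = lambda prompt: "N\n2"
print(rubric_evaluation("ctx", "Which laptop and which phone should I buy?", "Buy X.", fake_llm))
# -> {'disagreement': False, 'user_question_count': 2}
```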
↳ 2.2. When you do have an ideal answer to compare the response to (you have ground truth)
When you can access an answer manually created by a group of experts, things are easier.
You will use an LLM to compare the generated and ideal answers based on semantics, not structure.
For example:
"""
(A) The submitted answer is a subset of the expert answer and entirely consistent.
...
(E) The answers differ, but these differences don't matter.
"""
#2. What is the difference between your ML development and continuous training environments?
They might do the same thing, but their design is entirely different ↓
ML Development Environment
At this point, your main goal is to ingest the raw and preprocessed data through versioned artifacts (or a feature store), analyze it, and run as many experiments as possible to find the best:
- model
- hyperparameters
- augmentations
Based on your business requirements, you must maximize some specific metrics, find the best latency-accuracy trade-offs, etc.
You will use an experiment tracker to compare all these experiments.
After you settle on the best one, the output of your ML development environment will be:
- a new version of the code
- a new version of the configuration artifact
Here is where the research happens. Thus, you need flexibility.
That is why we decouple it from the rest of the ML systems through artifacts (data, config, & code artifacts).
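For example, the experimentation loop might look like the sketch below. MLflow is used purely as an example of an experiment tracker and train_and_evaluate() is a stub; neither is prescribed by this article:

```
# Minimal sketch of the ML development loop around an experiment tracker.
# MLflow is just one example of a tracker (an assumption, not a requirement),
# and train_and_evaluate() is a stub standing in for your real training code.
import mlflow


def train_and_evaluate(config: dict) -> dict:
    # Stand-in for real training; returns dummy metrics so the sketch runs.
    return {"accuracy": 0.91, "latency_ms": 12.0}


EXPERIMENTS = [
    {"model": "resnet50", "learning_rate": 1e-3, "augmentation": "none"},
    {"model": "resnet50", "learning_rate": 1e-4, "augmentation": "mixup"},
]

for config in EXPERIMENTS:
    with mlflow.start_run():
        mlflow.log_params(config)  # model choice, hyperparameters, augmentations
        metrics = train_and_evaluate(config)
        mlflow.log_metric("accuracy", metrics["accuracy"])
        mlflow.log_metric("latency_ms", metrics["latency_ms"])

# The winning run's code version + config artifact are what you hand over to
# the continuous training environment.
```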
Continuous Training Environment
Here is where you want to take the data, code, and config artifacts and (as sketched below):
- train the model on all the required data
- output a staging versioned model artifact
- test the staging model artifact
- if the test passes, label it as the new production model artifact
- deploy it to the inference services
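As a rough illustration of this flow, here is a toy sketch of the staging -> production promotion logic around a model registry. The in-memory ModelRegistry class and the hard-coded metrics are assumptions made for the example; a real setup would use a proper registry (e.g., MLflow, SageMaker, Vertex AI) and real training/testing pipelines:

```
# Toy sketch of the staging -> production promotion step of continuous training.
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    versions: dict = field(default_factory=dict)  # version -> test metrics
    production: str | None = None                 # currently deployed version

    def register_staging(self, version: str, metrics: dict) -> None:
        self.versions[version] = metrics

    def promote_to_production(self, version: str) -> None:
        self.production = version


def continuous_training_step(registry: ModelRegistry, version: str) -> bool:
    # 1. Train on all the required data (stubbed) and register a staging artifact.
    staging_metrics = {"accuracy": 0.93}  # stand-in for the real test report
    registry.register_staging(version, staging_metrics)

    # 2. Test the staging artifact against the current production model.
    prod_metrics = registry.versions.get(registry.production, {"accuracy": 0.0})
    if staging_metrics["accuracy"] <= prod_metrics["accuracy"]:
        return False  # keep the old production model

    # 3. Label it as the new production model; the CD pipeline would then
    #    deploy this version to the inference services.
    registry.promote_to_production(version)
    return True


registry = ModelRegistry()
print(continuous_training_step(registry, version="v2"), registry.production)  # -> True v2
```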
A common strategy is to build a CI/CD pipeline (e.g., using GitHub Actions) that:
- builds a Docker image from the code artifact (e.g., triggered manually or when a new artifact version is created)
- starts the training pipeline inside the Docker container, which pulls the feature and config artifacts and outputs the staging model artifact
- lets you manually review the training report -> if everything went fine, you manually trigger the testing pipeline
- lets you manually review the testing report -> if everything worked fine (e.g., the model is better than the previous one), you manually trigger the CD pipeline that deploys the new model to your inference services
Note how the model registry helps you quickly decouple all the components.
Also, because training and testing metrics are not always black & white, it is tough to 100% automate the CI/CD pipeline.
Thus, you need a human in the loop when deploying ML models.

To conclude...
The ML development environment is where you do your research to find better models:
- input: data artifact
- output: code & config artifacts
The continuous training environment is used to train & test the production model at scale:
- input: data, code, config artifacts
- output: model artifact
This is not a fixed solution, as ML systems are still an open question.
But if you want to see this strategy in action ↓
↳🔗 Check out my FREE course, The Full Stack 7-Steps MLOps Framework.
Story: Job roles tell you there is just one type of MLE, but there are actually 3
Here they are ↓
These are the 3 ML engineering personas I found while working with different teams in the industry:
#1. Researchers undercover
They like to stay in touch with the latest papers, understand the architecture of models, optimize them, run experiments, etc.
They are great at picking the best models but not that great at writing clean code and scaling the solution.
#2. SWE undercover
They pretend they read papers but don't (maybe only when they have to). They are more concerned with writing modular code and data quality than the latest hot models. Usually, these are the "data-centric" people.
They are great at writing clean code & processing data at scale but lack deep mathematical skills to develop complex DL solutions.
#3. MLOps freaks
They ultimately don't care about the latest research & hot models. They are more into the latest MLOps tools and building ML systems. They love to automate everything and use as many tools as possible.
Great at scaling the solution and building ML pipelines, but not great at running experiments & tweaking ML models. They love to treat the ML model as a black box.
I started as #1, until I realized I hated it. Now I am a mix of:
- #1: 20%
- #2: 40%
- #3: 40%
But that doesn't mean one is better - these types are complementary.
A great ML team should have at least one of each persona.
What do you think? Did I get it right?
That's it for today!
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).