PI #008: Your Data Validation Guide in Under 3 Minutes
A wise man said: data validation shouldn't be hard & you should validate everything!
Hello there, I am Paul Iusztin, and in this newsletter I will deliver your weekly piece of MLE & MLOps wisdom straight to your inbox 🔥
This week we will cover:
How to use GE to validate your data
Why you should consider validating your data in multiple points of your pipeline
#1. How to use GE to validate your data
Data validation shouldn't be hard.
Here is your data validation guide in under 2 minutes.
Data validation ensures the integrity and quality of the data that is automatically ingested into your ML system.
Thus, implementing your data validation layer is crucial in any successful ML system.
🧙🏼‍♂️ Great Expectations (GE) makes everything straightforward.
With GE, you stack multiple ExpectationConfiguration objects, where each object checks a single rule on a single feature.
For example:
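# Assumes a pre-1.0 release of Great Expectations; newer releases changed this API.
from great_expectations.core import ExpectationConfiguration

ExpectationConfiguration(
    expectation_type="expect_column_distinct_values_to_be_in_set",
    kwargs={"column": "area", "value_set": [0, 1, 2]},
)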
This expectation checks that the "area" feature contains only the values 0, 1, or 2.
The most common checks you have to do are for the following:
- the schema of the table;
- the type of each column;
- the values of each column: an interval for continuous variables or an expected set for discrete variables;
- null values.
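To make the four kinds of checks above concrete, here is a minimal, library-free sketch in plain Python. It is not the Great Expectations API; it only illustrates what each rule verifies. The "area" column comes from the example above, while the "consumption" column, its range, and the function name are hypothetical choices made for illustration.

```python
# Library-free sketch of the common checks; this is NOT the Great Expectations API.
# "consumption", its range, and validate_rows are hypothetical, for illustration only.
EXPECTED_SCHEMA = {"area": int, "consumption": float}
ALLOWED_AREAS = {0, 1, 2}
CONSUMPTION_RANGE = (0.0, 10_000.0)


def validate_rows(rows):
    """Return a dict mapping each check name to True (passed) or False (failed)."""
    low, high = CONSUMPTION_RANGE
    return {
        # Schema: every row has exactly the expected columns.
        "schema": all(set(row) == set(EXPECTED_SCHEMA) for row in rows),
        # Types: every non-null value has the expected Python type.
        "types": all(
            isinstance(row[col], col_type)
            for row in rows
            for col, col_type in EXPECTED_SCHEMA.items()
            if row.get(col) is not None
        ),
        # Discrete variable: values must belong to an expected set.
        "area_in_set": all(
            row.get("area") in ALLOWED_AREAS
            for row in rows
            if row.get("area") is not None
        ),
        # Continuous variable: values must fall inside an interval.
        "consumption_in_range": all(
            low <= row["consumption"] <= high
            for row in rows
            if isinstance(row.get("consumption"), (int, float))
        ),
        # Null values: no expected column may be missing or None.
        "no_nulls": all(
            row.get(col) is not None for row in rows for col in EXPECTED_SCHEMA
        ),
    }
```

Each entry plays the role of one rule applied to one feature, so a failed run tells you which rule broke rather than just that something broke.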
After you run your GE validation suite, you will get a success %.
Based on the success % you can make various decisions, such as:
🟢 == 100% - ingest the data without an alert
🟡 >= 90% - ingest the data with an alert
🔴 < 90% - drop the data with an error
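The thresholds above can be sketched as a small decision function. This is only an illustration: the function names and the returned labels are hypothetical, and the 90% cut-off is the example value from the list, not a universal rule.

```python
def success_percent(results):
    """Share of passed checks, in percent, given a non-empty list of booleans."""
    if not results:
        raise ValueError("at least one check result is required")
    return 100.0 * sum(results) / len(results)


def decide_ingestion(percent):
    """Map a success % to an action, using the example thresholds from the text."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent == 100.0:
        return "ingest"  # green: no alert
    if percent >= 90.0:
        return "ingest_with_alert"  # yellow: keep the data, notify someone
    return "drop_with_error"  # red: reject the batch
```

For example, if 9 of 10 checks pass, `decide_ingestion(success_percent([True] * 9 + [False]))` lands in the "ingest with an alert" branch.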
P.S. Using GE + Hopsworks as your Feature Store makes everything even simpler 🔥
So remember...
GE makes implementing your data validation layer straightforward.
You have to check every feature for a given set of rules.
Based on the success % you have to take various actions.
#2. Why you should consider validating your data in multiple points of your pipeline
A wise man said: validate everything!
You have surely heard that data validation is good...
but where should we validate the data? Everywhere!
That might be an overstatement, but let me explain.
When the outputs of an ML model are poor, there could be 1000+ reasons why that happened.
But even if you know that the issue is data related...
Narrowing down to the actual function that messed up everything is extremely hard.
Thus, by adding data validation before & after:
- the ingestion ETL;
- the data engineering pipeline;
- the feature engineering pipeline;
you might add some redundancy, but this will make scanning for errors extremely easy.
Imagine that you have a data validation check only after the FE pipeline. If that fails, you know it failed but don't know where it failed.
If the system is small, that is not an issue, but imagine you have 100+ transformations spread across multiple teams...
🥲 Finding the right error might take you hours or even days.
By adding multiple data validation points in your system, you can quickly answer the question: "Where did the system fail?"
Thus, by adding data validation in multiple places, you automatically slice the pipeline, making it easy to diagnose.
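Here is a minimal sketch of that slicing idea in plain Python. Every stage is validated before and after it runs, so the error names the exact checkpoint that failed. All names are hypothetical: the three stage functions are stand-ins, the feature engineering one contains a deliberate bug, and the single "area" rule stands in for a full validation suite.

```python
class ValidationError(Exception):
    """Raised by a checkpoint; carries the name of the checkpoint that failed."""

    def __init__(self, checkpoint, bad_rows):
        super().__init__(f"{checkpoint}: {len(bad_rows)} invalid row(s)")
        self.checkpoint = checkpoint
        self.bad_rows = bad_rows


def validate(checkpoint, rows, is_valid):
    """Raise ValidationError if any row breaks the rule."""
    bad_rows = [row for row in rows if not is_valid(row)]
    if bad_rows:
        raise ValidationError(checkpoint, bad_rows)


def run_with_checkpoints(rows, stages, is_valid):
    """Run each stage with a validation checkpoint before and after it."""
    for stage_name, stage_fn in stages:
        validate(f"before {stage_name}", rows, is_valid)
        rows = stage_fn(rows)
        validate(f"after {stage_name}", rows, is_valid)
    return rows


# Hypothetical stages; feature_engineering has a deliberate bug for the demo.
def ingestion_etl(rows):
    return [dict(row) for row in rows]


def data_engineering(rows):
    return [{**row, "is_area_zero": row["area"] == 0} for row in rows]


def feature_engineering(rows):
    return [{**row, "area": row["area"] + 5} for row in rows]  # bug: corrupts "area"


def area_is_valid(row):
    return row.get("area") in {0, 1, 2}


STAGES = [
    ("ingestion ETL", ingestion_etl),
    ("data engineering pipeline", data_engineering),
    ("feature engineering pipeline", feature_engineering),
]


def find_failing_checkpoint(rows):
    """Return the name of the first failing checkpoint, or None if all pass."""
    try:
        run_with_checkpoints(rows, STAGES, area_is_valid)
    except ValidationError as error:
        return error.checkpoint
    return None
```

The "after" check of one stage and the "before" check of the next look at the same data, which is the redundancy mentioned earlier. In exchange, a failure reported at "after feature engineering pipeline" with a clean "before" check points straight at that stage.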
Note that this is just an example. Your data infrastructure might look different.
But the fundamental idea remains the same. Add data validation in all the essential points of your data pipelines to quickly slice and dice the upcoming errors.
If you want a hands-on example of using GE to validate your data, check out my article: Ensuring Trustworthy ML Systems With Data Validation and Real-Time Monitoring.
These are this week's tips & tricks about data validation.
See you next Thursday at 9:00 am CET.
Have a fantastic weekend!
Paul
Whenever you're ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: here, I approach in-depth topics about designing and productionizing ML systems using MLOps.