#1. Your Best Friend - ML Monitoring
Part 1: Introduction to ML Monitoring: Why, When, and How.
You finally deployed your model. Yay!
Now, you can sit and relax.
Unfortunately, not so fast!
After you deploy your model, it is subject to 4 main points of failure:
service health
model performance
data quality and integrity
data and concept drift
In this newsletter, we will focus on model performance and data & concept drift.
Let’s start at the beginning.
To compute model performance, aka metrics, the first step is to acquire your ground truth.
While your model is in production, a common issue is that you don't have your ground truth immediately or at all.
Let's investigate 3 types of ground truths you encounter while in production 👇
1. Real-time Ground Truth
This is the ideal scenario where you can easily access your actuals.
For example, when you recommend an ad and the consumer either clicks it or doesn't.
Or when you estimate food delivery times. When the food arrives, you can quickly calculate the performance.
2. Delayed Ground Truth
In this case, you will eventually access the ground truths. But, unfortunately, by then it will be too late to react adequately.
For example, you want to predict whether a person should be granted a loan, and you will only know the actual outcome once the loan is paid back.
3. No Ground Truth
This is the worst scenario, as you can't automatically collect any GT. Usually, in these cases, you have to hire human annotators if you need any actuals.
For example, you have an object detector running in production. No feedback from the environment will provide you with the GT at any point.
☢️ Be careful about bias in your GT.
Let me explain. In the example where you predict if someone can receive a loan, you will collect GT only from people your model considers credit-worthy.
Thus, your ground truths will likely be overly optimistic about people's ability to repay.
Do you see the importance of the data you use to train your model?
How to handle cases when you don't have GT?
This is where data and concept drift kick in.
You can use drifts as a proxy for your model's performance.
Because the model's inputs and predictions are always available in real time, you can compute drifts continuously and know when to send warnings and alerts.
Computing drifts is the key to a robust ML monitoring system. Let’s dig deeper 👇
Data and Concept Drifts
Using data and concept drift, we can detect events such as:
the outside world drastically changes (e.g., the COVID-19 pandemic)
the meaning of a field changes
the system naturally evolves and features shift
a drastic increase in volume
features get switched, etc.
Intuitively, when a drift happens, the world (data) your model is used to changes. Thus, the model will behave unexpectedly if it doesn't adapt to the change (e.g., by retraining).
How do we compute data and concept drift?
Mainly using statistical distance checks such as:
PSI (Population Stability Index)
KL (Kullback-Leibler) Divergence
JS (Jensen-Shannon) Divergence
KS (Kolmogorov-Smirnov) Tests
EMD (Earth Mover's Distance)
Using these methods, we can check for changes (aka drifts) in the:
model inputs
model outputs
actuals (aka ground truth)
How do we pick the reference distribution against which we watch for drifts?
Fixed Window
It can include a slice from your training, validation, and test data.
Moving Window
Your reference distribution can be a window from the past, e.g., last week.
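To make this concrete, here is a minimal sketch of both options in Python, assuming a hypothetical pandas DataFrame of production predictions with a timestamp column (all file and column names are placeholders):

```python
import pandas as pd

# Hypothetical production log: one row per prediction, with a "timestamp" column.
prod_df = pd.read_parquet("production_predictions.parquet")

# Fixed window: a slice of the training/validation/test data, frozen at training time.
fixed_reference = pd.read_parquet("training_sample.parquet")

# Moving window: last week's production data is the reference,
# compared against the current week's data.
now = pd.Timestamp.now(tz="UTC")
one_week, two_weeks = pd.Timedelta(days=7), pd.Timedelta(days=14)

moving_reference = prod_df[prod_df["timestamp"].between(now - two_weeks, now - one_week)]
current_window = prod_df[prod_df["timestamp"] >= now - one_week]
```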
Now let's take a deeper look at what we can analyze:
1. Model Inputs
They can drift suddenly, gradually, or on a recurring basis.
For every feature, you can compare:
The training vs. production distribution
Production time window A vs. B
⚠️ Note that feature drifts are not always correlated with performance loss.
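As a quick illustration, here is a minimal sketch of a per-feature check on synthetic data, using the KS test (one of the methods covered below); the feature names and the drift are made up:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic example: training reference vs. one production window, same columns.
train_df = pd.DataFrame({"age": rng.normal(35, 8, 5_000),
                         "income": rng.normal(50_000, 12_000, 5_000)})
prod_df = pd.DataFrame({"age": rng.normal(39, 8, 5_000),        # drifted feature
                        "income": rng.normal(50_000, 12_000, 5_000)})

# Check every feature independently against the reference distribution.
for feature in train_df.columns:
    _, p_value = ks_2samp(train_df[feature], prod_df[feature])
    if p_value < 0.05:
        print(f"Possible drift in feature '{feature}' (p={p_value:.4f})")
```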
2. Model Outputs
Again, you can analyze the following:
the training vs. production prediction distribution
the prediction distribution between production windows A and B
⚡ Note that the analysis is done on the predictions, not the ground truth.
3. Actuals / Ground Truth
Analyze:
Actuals distribution for training vs. production
Prediction vs. actual distributions in production
⚡ Note that I never mentioned the word metric here.
The 5 methods you need to know to measure data & concept drift in your ML monitoring system.
One of the main methods to detect drifts is to use statistical distances.
Statistical distances are used to quantify the distance between two distributions.
In ML observability, you usually measure the distribution distance between:
training <-> production
past production <-> present production
⚡ When picking a method, look at the following:
the sample size it works best with
the level of sensitivity it has
Now let's see which are the top 5 most used statistical distances 👇
1. PSI
supports both numeric and categorical features
popular in the finance industry for measuring the inputs of the model
its value ranges from 0 to +Inf
it has a set of popular thresholds:
< 0.1 -> OK
< 0.2 -> Investigate
>= 0.2 -> Alert
PSI has low sensitivity / detects only major changes
you have to define bins -> the sample size does not affect it
it is symmetric
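For reference, here is a minimal NumPy sketch of PSI, assuming the bins are derived from the reference (e.g., training) sample; treat it as an illustration, not a production implementation:

```python
import numpy as np

def psi(reference: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference and a production sample."""
    # Bin edges come from the reference distribution (values outside them are ignored).
    edges = np.histogram_bin_edges(reference, bins=n_bins)

    # Relative frequency of each bin in both samples.
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)

    # Avoid log(0) / division by zero for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)

    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

# Usual thresholds: < 0.1 -> OK, < 0.2 -> investigate, >= 0.2 -> alert
```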
2. KL Divergence
supports both numeric and categorical features
it is the relative entropy between a distribution and the reference
its value ranges from 0 to +Inf
the higher the score, the more different they are
KL has low sensitivity / detects only significant changes
you have to define bins -> the sample size does not affect it
it is not symmetric
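A minimal sketch with SciPy on synthetic data, assuming both samples are binned on the same edges (taken from the reference):

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=10_000)    # e.g., training feature values
production = rng.normal(0.3, 1.2, size=10_000)   # e.g., drifted production values

# Bin both samples on the same edges; add a small constant to avoid empty bins.
edges = np.histogram_bin_edges(reference, bins=10)
q = np.histogram(reference, bins=edges)[0] + 1e-6    # reference counts
p = np.histogram(production, bins=edges)[0] + 1e-6   # production counts

# KL(P || Q): relative entropy of production vs. the reference.
# Note the asymmetry: entropy(p, q) != entropy(q, p).
print(entropy(p, q))
```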
3. JS Divergence
supports both numeric and categorical features
it is based on KL Divergence
its value ranges from 0 to 1
it is symmetric
it is slightly more sensitive than PSI and KL divergence
you have to define bins -> the sample size does not affect it
it can easily be interpreted as it ranges between [0, 1] and is symmetric
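A minimal sketch with SciPy on synthetic data; note that jensenshannon() returns the JS distance (the square root of the divergence), so we square it:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=10_000)
production = rng.normal(0.5, 1.0, size=10_000)

# Same binning trick as for PSI / KL: shared edges taken from the reference sample.
edges = np.histogram_bin_edges(reference, bins=10)
q = np.histogram(reference, bins=edges)[0]
p = np.histogram(production, bins=edges)[0]

# base=2 keeps the divergence in [0, 1]; the result is symmetric in p and q.
js_divergence = jensenshannon(p, q, base=2) ** 2
print(js_divergence)
```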
4. EMD
supports only numerical features
it shows the absolute value of the drift (the other methods cancel out drifts in different directions).
its value is feature-dependent, so it is good practice to normalize it by the feature's standard deviation
its value ranges from 0 to +Inf
works well with large sample sizes
tends to be more sensitive than PSI and JS but not too sensitive
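A minimal sketch with SciPy on synthetic data; EMD works directly on raw samples, and we normalize by the reference standard deviation as suggested above:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)
reference = rng.normal(50.0, 10.0, size=10_000)    # e.g., training feature values
production = rng.normal(55.0, 10.0, size=10_000)   # e.g., shifted production values

# EMD works directly on the raw samples, no binning required.
emd = wasserstein_distance(reference, production)

# The raw value is in the feature's units, so normalize by the reference std
# to make values comparable across features.
print(emd / reference.std())
```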
5. KS Test
it works with numerical features
this one is different, as it is a nonparametric statistical test
the null hypothesis is that the two samples come from the same distribution
thus a common approach is to check if p-value < 0.05 -> you detected drift
it is very sensitive
its sensitivity increases with the sample size
usually works well with sample sizes < 1000, where you want to detect any slight deviation
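A minimal sketch with SciPy on a small synthetic sample, using the common p-value < 0.05 rule mentioned above:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, size=800)     # KS works best with smaller samples
production = rng.normal(0.2, 1.0, size=800)

# Null hypothesis: both samples come from the same distribution.
statistic, p_value = ks_2samp(reference, production)
if p_value < 0.05:
    print(f"Drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
else:
    print(f"No drift detected (p={p_value:.4f})")
```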
⚡ One final note is that drifts are not always correlated with performance loss.
Thus, it is good practice to have two thresholds:
one for warnings -> investigate
one for alarms -> retrain
If retraining is costly, you should be careful when picking the alarm threshold.
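As a sketch, the two-threshold logic can be as simple as the snippet below; the default values are the popular PSI thresholds, and you would tune them to your own drift metric and retraining cost:

```python
def drift_status(score: float, warn_at: float = 0.1, alarm_at: float = 0.2) -> str:
    """Map a drift score (e.g., PSI) to a monitoring action."""
    if score >= alarm_at:
        return "ALARM -> retrain"
    if score >= warn_at:
        return "WARNING -> investigate"
    return "OK"

print(drift_status(0.15))  # WARNING -> investigate
```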
Conclusion
Thank you for reading my first newsletter! This means a lot to me.
This week you learned the following:
why you need to monitor your model after it is deployed.
what types of ground truth you have available to compute metrics
how to use data & concept drifts when you don’t have access to ground truth
how to compute data & concept drifts for structured data.
What challenges have you encountered after deploying your model in production? Leave your thoughts in the comments 👇
Next week we will discuss monitoring unstructured data, such as images or text.
Thus, check your inbox on Thursday at 9:00 am CET to learn more.
Have a great weekend!
💡 My goal is to help machine learning engineers level up in designing and productionizing ML systems. Follow me on LinkedIn and Medium for more insights!
Creating content takes me a lot of time. To support me, you can:
Join Medium through my referral link - It won't cost you any additional $$$.
Thank you ✌🏼 !