How to ensure your models are fail-safe in production?
Effectively monitor model serving stacks at scale, extracting key insights from their behaviour under large loads.
Even before deploying an ML system, engineers need accurate insight into how it will perform under load, both locally and at scale, so they can identify bottlenecks and unexpected behaviors before they surface in production.
I've run into this myself: I tested and deployed a Computer Vision solution, only to see the time needed to process 24 hours of video jump dramatically between my local tests and the pre-production/validation environment.
In this article, I'll cover a pipeline for efficient monitoring of your ML models deployed within NVIDIA's Triton Inference Server.
Here's what we'll learn today:
How to set up a performance monitoring pipeline
Containers
Configuration Files
Docker Compose
Metrics scraping configuration
Adding the Prometheus targets
Adding the Grafana datasource
Health check scraping targets
Creating dashboards
Add Grafana Panels for GPU/Triton metrics.
Table of Contents
Preparing docker-compose file
Metrics scraping configuration
Creating dashboards
Visualizations
#1. Docker Compose Monitoring
Let's start by explaining what each service does, and then prepare the docker-compose file that encapsulates and runs all of these services. We have the following:
cAdvisor, which scrapes RAM/CPU usage per container.
Triton Inference Server, which serves ML models and yields GPU-specific metrics.
Prometheus, which is the binding between the metrics generators and the consumer.
Grafana, which is the consumer endpoint for visualization.
Let's inspect the docker-compose-monitoring.yaml file.
# cat docker-compose-monitoring.yaml
version: '3.4'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "${PROMETHEUS_PORT}:${PROMETHEUS_PORT}"
    container_name: prometheus
    restart: always
    volumes:
      - "${MONITORING_CONFIGURATIONS}/prometheus.monitoring.yml:/etc/prometheus/prometheus.monitoring.yml"
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command:
      - "--config.file=/etc/prometheus/prometheus.monitoring.yml"
      - "--enable-feature=expand-external-labels"
    depends_on:
      - cadvisor
    networks:
      monitor-net:
        ipv4_address: ${PROM_IP}

  grafana:
    image: grafana/grafana-enterprise:8.2.0
    container_name: grafana
    ports:
      - "${GRAFANA_PORT}:${GRAFANA_PORT}"
    volumes:
      - ${MONITORING_CONFIGURATIONS}/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PWD}
      - GF_SECURITY_ADMIN_USER=${GRAFANA_USER}
    networks:
      monitor-net:
        ipv4_address: ${GRAFANA_IP}

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: always
    ports:
      - "${CADVISOR_PORT}:${CADVISOR_PORT}"
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - "/etc/timezone:/etc/timezone:ro"
      - "/:/rootfs:ro"
      - "/var/run:/var/run:rw"
      - "/sys:/sys:ro"
      - "/var/lib/docker:/var/lib/docker:ro"
    networks:
      monitor-net:
        ipv4_address: ${CADVISOR_IP}

  triton_server:
    container_name: tis2109
    image: nvcr.io/nvidia/tritonserver:21.09-py3
    privileged: true
    ports:
      - "8002:8002"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    volumes:
      - ${TRITON_MODELS_REPOSITORY}:/models
    command: ["tritonserver", "--model-repository=/models", "--strict-model-config=false"]
    networks:
      monitor-net:
        ipv4_address: ${TRITON_IP}

networks:
  monitor-net:
    driver: bridge
    internal: false
    ipam:
      driver: default
      config:
        - subnet: ${SUBNET}
          gateway: ${GATEWAY}
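One assumption baked into the triton_server service is that the host already has the NVIDIA drivers and the NVIDIA Container Toolkit installed; otherwise the GPU reservation in the deploy section will fail. A quick sanity check (assuming a Docker version that supports the --gpus flag) is to run nvidia-smi through the same image we're about to deploy:

# Should print your GPU(s); if it errors out, fix the NVIDIA Container Toolkit setup first
docker run --rm --gpus all nvcr.io/nvidia/tritonserver:21.09-py3 nvidia-smi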
As you may have noticed, this compose file references several ${VAR} placeholders. These are resolved automatically from a .env file, which keeps the flow aligned with best practices for local development and CI/CD pipelines.
Now let's see what's in the .env file:
# == Monitoring vars ==
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
CADVISOR_PORT=8080
MONITORING_CONFIGURATIONS=<path_to_your_configuration_files>
# == Credentials ==
GRAFANA_PWD=admin
GRAFANA_USER=admin
# == TIS vars ==
TRITON_MODELS_REPOSITORY=<path_to_your_triton_model_repository>
# == Underlying network ==
SUBNET=172.17.0.0/16
GATEWAY=172.17.0.1
# == Subnet IP's ==
TRITON_IP=172.17.0.3
CADVISOR_IP=172.17.0.4
PROM_IP=172.17.0.5
GRAFANA_IP=172.17.0.6
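If you want to double-check that these values are picked up correctly before starting anything, docker compose can render the fully interpolated file for you (assuming the .env file sits next to the compose file):

# Renders the compose file with every ${VAR} substituted from .env
docker compose --env-file .env -f docker-compose-monitoring.yaml config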
Most of these variables are self-explanatory, but there are two key ones we need to take a closer look at:
MONITORING_CONFIGURATIONS
This should point to a folder with the following structure:
.
└── monitoring
    ├── datasources
    │   └── datasources.yml
    └── prometheus.monitoring.yml
TRITON_MODELS_REPOSITORY
The structure of your model repository should look like this:
model_repository
└── prod_client1_encoder
    ├── 1
    │   └── resnet50.engine
    └── config.pbtxt
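For reference, here is a minimal sketch of what the config.pbtxt for this model could look like. The input/output names, shapes, and max_batch_size below are hypothetical placeholders that must match your actual resnet50.engine bindings, and since the file isn't named model.plan, we point Triton at it explicitly:

# config.pbtxt - illustrative only; adjust names/dims to your engine
name: "prod_client1_encoder"
platform: "tensorrt_plan"
default_model_filename: "resnet50.engine"
max_batch_size: 8
input [
  {
    name: "input"          # hypothetical binding name
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"         # hypothetical binding name
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

Note that the compose command starts Triton with --strict-model-config=false, which lets Triton auto-complete parts of the model configuration for supported backends.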
If you're not familiar with NVIDIA Triton Inference Server, no problem: I've covered it from 0 to 100 in a separate article.
The prometheus.monitoring.yml is where we will add the targets (containers) we wish to get metrics from.
The datasources.yml is where we'll register Prometheus as a data source for Grafana dashboards, so that it's already available when you open the Grafana web UI.
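To make that concrete, here is a minimal sketch of a datasources.yml that provisions Prometheus automatically, using the container name and port from our compose file (tweak them if yours differ):

# datasources/datasources.yml - Grafana data source provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

With this in place, the manual "Add data source" step described later becomes optional, since Grafana loads the data source at startup.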
#2. Metrics scraping configuration
Let's go ahead and configure the Prometheus targets by filling in the prometheus.monitoring.yml file.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['172.17.0.5:9090']

  - job_name: 'triton-server'
    static_configs:
      - targets: ['172.17.0.3:8002']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['172.17.0.4:8080']
We have three targets:
prometheus - scraping Prometheus itself might seem odd, but it's a best practice for healthy monitoring: with thousands of metrics flowing in, Prometheus can become a bottleneck itself, so it's useful to track its own resource usage.
triton-server - this one is crucial, as it sits at the core of the deep learning stack, serving and managing our ML models. Triton exposes a built-in Prometheus endpoint on port 8002 that reports various metrics across the inference process.
cAdvisor - to get CPU/RAM usage across containers in this deployment.
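Before starting anything, you can optionally validate this file with promtool, which ships inside the prom/prometheus image (assuming the file sits in your current directory):

# Validate the Prometheus configuration without starting the stack
docker run --rm \
  -v "$(pwd)/prometheus.monitoring.yml:/etc/prometheus/prometheus.monitoring.yml:ro" \
  --entrypoint promtool \
  prom/prometheus:latest check config /etc/prometheus/prometheus.monitoring.yml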
With all of these configured, we can bring the stack up and check for any issues.
Let's start the containers:
docker compose -f docker-compose-monitoring.yaml up
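Before opening the Prometheus UI, you can confirm that Triton is actually exposing metrics by querying its metrics endpoint directly (port 8002, as mapped in the compose file, assuming you run this on the same host):

# A healthy Triton server returns plain-text Prometheus samples prefixed with nv_
curl -s localhost:8002/metrics | grep "^nv_" | head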
Let's inspect the Prometheus targets:
Open your web browser and enter the full Prometheus URL (IP:9090).
Go to Status → Targets.
Check that each target from the scraping config is healthy (green); the same check can be done from the command line, as shown below.
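If you prefer the terminal, Prometheus exposes the same information over its HTTP API via the /api/v1/targets endpoint (again assuming you're on the host where the stack runs):

# Lists each scrape target together with its health ("up" means healthy)
curl -s localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"(scrapeUrl|health)"'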
Once we've verified these, we can go ahead and create our dashboard in Grafana.
#3. Creating dashboards
To access the Grafana web UI, open your browser and go to `localhost:3000`, where 3000 is the port on which we've exposed the Grafana container.
If you're running this as part of a stack on a cloud deployment or a dedicated server, you can access the web UI using the machine's public IP.
If you're prompted with a login page, just use `admin` for both the username and password fields. Stronger credentials are of course recommended, but that falls outside the scope of this article.
Once you've opened the Grafana web UI, we'll have to do the following:
Point the data source to our Prometheus metrics endpoint
Create a new dashboard
Add charts to aggregate/visualize the metrics we're interested in
#3.1 Prometheus DataSource
In the left panel, go to the gear icon (Settings) and select Data Sources.
You'll be taken to the data sources view.
Click on "Add data source" and, under Time series databases, select `Prometheus`. As you may have seen, Grafana supports many integrations for metrics collection; in this article we'll use Prometheus.
You'll then be asked for the data source settings, where you have to add the URL of your Prometheus endpoint. In our docker-compose deployment we will use `http://prometheus:9090`, following the template `http://container_name:container_port`.
If you've reached this point, adding the data source is complete. Let's continue with creating a dashboard.
#3.2 Creating a Grafana Dashboard
In the left panel, go to the "+" sign and select `Dashboard`. You'll be redirected to a new dashboard page with a pre-defined panel group. As we're building everything from scratch, we'll use only `Empty Panels`, which we'll set up to display key metrics.
Here's the blueprint we'll follow for one example panel:
Add a query and define the PromQL (Prometheus query language) expression
Configure the visualization type, graph style, and legend
Below is a view of an empty panel. Next, we'll go ahead and populate it.
This will be a `time series` panel, and we'll display the time it took the Triton Inference Server to perform an inference request. Here's the query that composes this metric:
(irate(nv_inference_compute_infer_duration_us{job="triton-server"}[$__rate_interval]) / 1000) / irate(nv_inference_request_success{job="triton-server"}[$__rate_interval])
Important:
Since we've decoupled inference from the actual application code, we have a client-server communication protocol. In this case, a complete inference request consists of:
I/O of input data from client → server
Inference on the data
I/O of output results from server → client
With the metric nv_inference_compute_infer_duration_us, we measure only the inference-on-data step within the Triton Server.
Now, let's unpack it:
nv_inference_compute_infer_duration_us - the cumulative time, in microseconds, that Triton has spent on the compute (inference) step.
nv_inference_request_success - the number of successful inference requests counted by the Triton Server.
We apply irate() over $__rate_interval to both counters and divide them, so the panel tracks how the average inference time per successful request evolves over time.
Below you can see how the query looks. You might also notice that under Legend we have `{{model}}-{{version}}`.
This makes the chart's legend display each model plus its deployment version within the Triton Server.
For the panel settings on the right, you can configure:
Chart type - straight, curved, or stepped lines
Metrics range - select the unit (e.g. milliseconds (ms)) and define a lower bound (e.g. 0) and an upper bound (e.g. 100 ms)
Custom text - to display as the legend or in other fields
#3.3 Complete Visualization
Based on the flow above, one could create the rest of the charts.
I could go over all of them, but that would stretch this article out quite a bit.
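That said, here are a few more queries in the same spirit that I find useful as starting points. The metric names come from Triton's and cAdvisor's exports; treat the label values (the job name, the container names) as placeholders to adapt to your own scrape config:

# End-to-end request latency (ms), including input/output handling
(irate(nv_inference_request_duration_us{job="triton-server"}[$__rate_interval]) / 1000) / irate(nv_inference_request_success{job="triton-server"}[$__rate_interval])

# Time spent queueing per request (ms) - an early signal that you need more model instances
(irate(nv_inference_queue_duration_us{job="triton-server"}[$__rate_interval]) / 1000) / irate(nv_inference_request_success{job="triton-server"}[$__rate_interval])

# Inference throughput (successful requests per second)
rate(nv_inference_request_success{job="triton-server"}[$__rate_interval])

# GPU utilization as reported by Triton (gauge between 0 and 1)
nv_gpu_utilization{job="triton-server"}

# Per-container memory usage from cAdvisor
container_memory_usage_bytes{name=~"tis2109|prometheus|grafana|cadvisor"}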
Below, you'll find an example of what I'm using to study SLOs/SLIs/SLAs for the inference performance of the AI stack.
This also helps with planning a scaling strategy, whether that means adding multiple replicas of a model or scaling out to multiple machines running the inference serving framework.
If you're not familiar with the SLO/SLI/SLA terms, check them out here:
https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli
The same goes for the key metrics to monitor in your Triton Server deployment:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html
I'll be covering these in more depth in the future, but for now these resources will do.
Conclusion
In this article, we went over how to set up and build a performance monitoring stack for our ML Application and Model Serving Framework using Prometheus and Grafana.
We went over preparing the docker-compose file, deploying the stack, configuring sources, and creating dashboard panels.
Following this tutorial, you'll be able to structure and deploy a monitoring pipeline for your ML application, either as a single deployment in a testing environment or as an aggregator dashboard (e.g. in a cloud setting) that combines multiple input sources into a single consumer point from which you monitor the entire stack.