5 Tools to monitor the performance of your Deep Learning Stack!
Introducing DML Notes, plus an overview of popular vision foundation models, a toolset for vision data engineering, and tools for monitoring the performance of DL pipelines.
Decoding ML Notes
Hey everyone, Happy Saturday!
Today marks the beginning of an exciting new journey.
We're introducing "DML Notes", a weekly series dedicated to covering summaries, tips and tricks, and advice on ML & MLOps engineering in a short-form format.
Every Saturday, "DML Notes" will bring you a concise, easy-to-digest roundup of the most significant concepts and practices in MLE, Deep Learning, and MLOps. We aim to craft this series to enhance your understanding and keep you updated while respecting your time (2-3 minutes) and curiosity.
Let's start with the first iteration and cover a few key elements of working with Deep Learning Vision Systems.
This week's topics:
5 Vision Foundation Models to keep an eye on!
Top 10 FFMPEG commands when working with image/video!
5 Tools to monitor the performance of your Deep Learning Stack!
5 Vision Foundation Models
With LLMs in the spotlight, let's not forget the foundation models for vision, which use the same Transformer + Attention mechanisms but, instead of text, process image patches as tokens.
Here are the Top 5:
SAM (Segment Anything)
From: Meta AI
Used For: Semantic Segmentation
Input: Images

CLIP (Contrastive Language-Image Pretraining)
From: OpenAI
Used For: Embedding Extraction
Input: Images + Text

DINOv2
From: Meta AI
Used For: Embedding Extraction
Input: Images

OWL-ViT (Vision Transformer for Open-World Localization)
From: Google Research
Used For: Zero-Shot Object Detection
Input: Images + Text

DETR (Detection Transformer)
From: Meta AI
Used For: Object Detection
Input: Images
What is Embedding Extraction - projecting the image into a pre-learned latent feature space, yielding a compact and accurate representation of the image's content.
What is Zero-Shot Object Detection - zero-shot means the model requires no prior training on the specific label set it is queried with. Such a model can find objects it has never seen in its training set.
What is Semantic Segmentation - it can be thought of as per-pixel classification. Each pixel is assigned an object label, yielding accurate delineations of the different objects in an image.
Some of these models were used as vision encoders in projects like LLaVA (Vision Transformer + LLaMA), a multi-modal text + image model that can process and understand images - similar to GPT-4.
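To make the "image patches as tokens" idea concrete, here is a minimal NumPy sketch of how a ViT-style model turns an image into a sequence of patch tokens (16x16 patches on a 224x224 image; the learned linear projection and position embeddings that follow in a real ViT are omitted):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    # Split an (H, W, C) image into a sequence of flattened patches,
    # one "token" per row, shaped (num_patches, patch * patch * C).
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (H/p, W/p, p, p, C)
    return grid.reshape(-1, patch * patch * c)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14x14 patch tokens, 16*16*3 values each
```

Each of those 196 rows plays the role a word token plays in an LLM: after a learned projection, they are fed to the same attention layers.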
Top 10 FFMPEG commands you must know!
As a Computer Vision engineer, I use FFMPEG 90% of the time when I'm working with data. Be it to generate datasets, split/merge videos and images, or inspect metadata - I can do everything related to media data manipulation with it.
Here are 10 commands to get you started:
1. Basic conversion: ffmpeg -i input.mp4 output.avi
2. Extracting frames: ffmpeg -i video.mp4 -r 1/1 $filename%03d.bmp
3. Resizing videos: ffmpeg -i input.mp4 -vf scale=320:240 output.mp4
4. Adjusting framerate: ffmpeg -i input.mp4 -r 30 output.mp4
5. Trimming videos: ffmpeg -i input.mp4 -ss 00:00:10 -to 00:00:20 -c copy output.mp4
6. Compressing videos: ffmpeg -i input.mp4 -vcodec h264 -acodec mp2 output.mp4
7. Adjusting aspect ratio: ffmpeg -i input.mp4 -aspect 1.7777 output.mp4
8. Extracting audio: ffmpeg -i video.mp4 -q:a 0 -map a audio.mp3
9. Playing a remote video: ssh [HOST]@[IP] "ffmpeg -i [REMOTE_PATH] -c copy -f nut pipe:1" | ffplay -i pipe:0
10. Turning images into a video: ffmpeg -framerate 1 -i img%03d.jpg -c:v libx264 -r 30 -pix_fmt yuv420p out.mp4
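Commands like #2 often end up inside dataset-generation scripts. A minimal sketch of building that frame-extraction command from Python (the file names here are illustrative placeholders):

```python
import shlex

def extract_frames_cmd(video: str, out_pattern: str, rate: str = "1/1") -> list:
    # Mirrors command 2 above: dump frames from `video` at `rate` fps
    # into numbered files matching `out_pattern` (e.g. frame001.bmp, ...).
    return ["ffmpeg", "-i", video, "-r", rate, out_pattern]

cmd = extract_frames_cmd("video.mp4", "frame%03d.bmp")
print(shlex.join(cmd))  # ffmpeg -i video.mp4 -r 1/1 frame%03d.bmp
```

Passing the argv list to subprocess.run(cmd, check=True) executes it without shell-quoting pitfalls around spaces or % signs in paths.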
5 Tools to monitor the performance of your Deep Learning Stack!
This is the toolset that I use frequently to set up a performance monitoring pipeline for the Computer Vision Edge stacks I deploy. It allows us to identify various issues in the flow, like:
Idle GPU time - spot when the GPU sits idle with nothing to process.
Inference times - especially important in multi-batch prediction, where you can see how stressed the system becomes.
CPU/RAM usage - especially important in real-time video processing applications. Video reading and decoding eat a lot of CPU, and you have to pay close attention to memory management when working with video frames to avoid memory build-up or leaks.
Hereโs an overview of the tooling:
1. cAdvisor
Used to scrape/monitor individual container metrics.
2. Prometheus
Configured to monitor and scrape cAdvisor metrics (CPU/RAM) and Triton Inference Server GPU metrics, like the following:
- Latency - the average time it takes to complete a request
- QPS (queries per second) - a useful metric for measuring the speed of the model when performing inference
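As a sketch, the scrape section of a prometheus.yml for this setup might look like the following (the service names and ports are assumptions based on common defaults: cAdvisor on 8080, Triton metrics on 8002):

```yaml
scrape_configs:
  - job_name: cadvisor          # per-container CPU/RAM metrics
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: triton            # GPU and inference metrics
    static_configs:
      - targets: ["triton:8002"]
```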
3. Docker Compose
Used to wrap up and contain all these services.
Itโs easy to manage and allows you to control the actions for each container at once.
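A hedged sketch of what the compose file wiring these services together could look like (image tags, ports, and the mounted config path are illustrative assumptions, not a tested deployment):

```yaml
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.01-py3   # illustrative tag
    command: tritonserver --model-repository=/models
    ports: ["8000:8000", "8001:8001", "8002:8002"] # 8002 = metrics
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports: ["8080:8080"]
  prometheus:
    image: prom/prometheus:latest
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```

A single docker compose up -d then starts (and docker compose down stops) the whole monitoring stack at once.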
4. Grafana
The visualization dashboard and the metrics-consuming endpoint. UI panels can be defined and saved as a .json file to be shared or re-imported into other Grafana configurations.
A few recommended metrics to monitor:
- CPU/RAM usage per container
- GPU usage %
- GPU Active Memory/IDLE
- Inference throughput and batching frequency.
5. Triton Inference Server
Used as the model-serving framework. A big advantage is the integrated Prometheus metrics port (:8002), which exposes a multitude of GPU-specific metrics.
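To see what those metrics look like, here is a sketch that parses a couple of sample lines in the Prometheus text format that Triton serves on :8002 (the values are made up; nv_gpu_utilization and nv_inference_request_success are examples of the metric names Triton exposes):

```python
import re

# Sample lines in Prometheus text exposition format (values are made up).
SAMPLE = """\
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.45
nv_inference_request_success{model="my_model",version="1"} 1302
"""

METRIC = re.compile(r'^(?P<name>[A-Za-z_:][\w:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)$')

def parse_metrics(text: str) -> dict:
    # Map metric name -> value, skipping comment lines and ignoring labels.
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = METRIC.match(line)
        if m:
            out[m.group("name")] = float(m.group("value"))
    return out

print(parse_metrics(SAMPLE))
```

In practice you would fetch the raw text with an HTTP GET to http://<host>:8002/metrics; Prometheus does that scraping for you, so a hand-rolled parser like this is mainly useful for quick debugging.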
Missed the post on NVIDIA Triton?
I've covered Triton Inference Server in depth, from environment setup through model deployment to running inference on an image of a pizza.
NVIDIA Triton Inference Server
Donโt hesitate to share your thoughts - we would love to hear them.
Remember, when ML looks encoded - we'll help you decode it.
From Decoding ML, every Thursday and Saturday!