Reduce your PyTorch code latency by 82%
How not to optimize the inference of your DL models. Computer science is dead.
Decoding ML Notes
This week's topics:
Reduce the latency of your PyTorch code by 82%
How I failed to optimize the inference of my DL models
Computer science is dead
New book on engineering end-to-end LLM systems, from data collection and fine-tuning to LLMOps (deployment, monitoring).
I kept this one a secret, but in the past months, in collaboration with Packt, Alex Vesa and Maxime Labonne, we started working on the LLM Engineer's Handbook.
A book that will walk you through everything you need to know to build a production-ready LLM project.
I am a big advocate of learning with hands-on examples while being anchored in real-world use cases.
That is why this is not the standard theoretical book.
While reading the book, you will learn to build a complex LLM project: an LLM Twin. Meanwhile, the theoretical aspects will back everything up, so you understand why we make certain decisions.
However, our ultimate goal is to present a framework that can be applied to most LLM projects.
.
Here is a sneak peek of what you will learn within the LLM Engineer's Handbook:
- collect unstructured data
- create instruction datasets from raw data to fine-tune LLMs
- SFT techniques such as LoRA and QLoRA
- LLM evaluation techniques
- Preference alignment using DPO
- inference optimization methods (key optimization, model parallelism, quantization, attention mechanisms)
- advanced RAG algorithms using LangChain as our LLM framework and Qdrant as our vector DB
- design LLM systems using the FTI architecture
- use AWS SageMaker to fine-tune and deploy open-source LLMs
- use ZenML to orchestrate all the pipelines and track the data as artifacts
- LLMOps patterns such as CT/CI/CD pipelines, model registries and using Comet for experiment tracking and prompt monitoring
.
The book is still a work in progress, but we are very excited about it!
Thank you, Packt, for making this possible and Maxime and Alex for this remarkable collaboration.
If you are curious, you can currently pre-order it from Amazon. The whole book should be released by the end of September 2024.
---
LLM Engineer's Handbook: Master the art of engineering Large Language Models from concept to production
Reduce the latency of your PyTorch code by 82%
This is how I reduced the latency of my PyTorch code by 82%, using only Python & PyTorch. NO fancy tools involved!
The problem?
During inference, I was running 5 DL models on ~25k images at once.
The script took around 4 hours to run.
The problem is that this isn't a batch job that runs overnight...
Various people across the company required it to run in "real-time" multiple times a day.
The solution?
The first thing that might come to your mind is to start using some fancy optimizer (e.g., TensorRT).
Even though that should be done at some point...
First, you should ask yourself (a minimal profiling sketch follows this list):
- I/O bottlenecks: reading & writing images
- preprocessing & postprocessing - can it be parallelized?
- are the CUDA cores used at their maximum potential?
- is the bandwidth between the CPU & GPU throttled?
- can we move more computation to the GPU?
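To answer questions like these, a quick profiling pass is usually enough. Here is a minimal sketch (placeholder model and random data, not my actual script) that uses torch.profiler to show whether the CPU-side code or the GPU kernels dominate; pairing it with nvidia-smi answers the CUDA utilization question.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch -- swap in your own models and preprocessed images.
model = torch.nn.Conv2d(3, 16, 3).cuda().eval()
batch = torch.rand(64, 3, 224, 224)

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(batch.cuda())
    torch.cuda.synchronize()  # make sure all GPU work is captured

# Sort by CUDA time to see where the script actually spends its time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```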
That being said...
Here is how I decreased the latency of the script by 82%:
↓↓↓
1. Batched the inference samples
Batching is not only valuable for training; it is also a powerful way to speed up your inference time.
Otherwise, you waste your GPU CUDA cores.
Instead of passing one sample at a time through the models, I now process 64 at once.
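To make this concrete, here is a minimal sketch of the idea, with a placeholder model and random tensors standing in for the preprocessed images (not my production code):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3).cuda().eval()   # placeholder for one of the DL models
images = torch.rand(1_024, 3, 224, 224)           # placeholder for the preprocessed images
BATCH_SIZE = 64

predictions = []
with torch.no_grad():
    # One forward pass per 64 samples instead of one forward pass per sample.
    for chunk in images.split(BATCH_SIZE):
        predictions.append(model(chunk.cuda()).cpu())
```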
2. Leveraged PyTorch's DataLoader
This has 2 main advantages:
- parallel data loading & preprocessing on multiple processes (NOT threads)
- copying your input images directly into the pinned memory (avoid a CPU -> CPU copy operation)
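A sketch of that setup, with a hypothetical ImageDataset standing in for the real image-reading and preprocessing logic:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    """Hypothetical dataset: reads & preprocesses one image per index."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Replace with the real disk read + preprocessing (decode, resize, normalize).
        return torch.rand(3, 224, 224)

loader = DataLoader(
    ImageDataset(paths=[f"img_{i}.jpg" for i in range(25_000)]),
    batch_size=64,
    num_workers=8,     # loading & preprocessing run in parallel worker processes
    pin_memory=True,   # batches are copied into pinned (page-locked) memory
)

with torch.no_grad():
    for batch in loader:
        batch = batch.cuda(non_blocking=True)  # async host-to-GPU copy, enabled by pinned memory
        # ... run the models on `batch` ...
```

Note that with num_workers > 0 you may need the usual `if __name__ == "__main__":` guard, depending on your platform's multiprocessing start method.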
3. Moved as much of the postprocessing as possible to the GPU
I saw that the tensors were moved to the CPU too early and mapped to NumPy arrays.
I refactored the code to keep it on the GPU as much as possible, which had 2 main advantages:
- tensors are processed faster on the GPU
- at the end of the logic, I had smaller tensors, resulting in smaller transfers between the CPU & GPU
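As an illustration only (my real postprocessing was different), here is the shape of that refactor for a classification-style output, with a made-up 0.5 confidence threshold:

```python
import torch

def postprocess(logits: torch.Tensor):
    # Everything below runs on the GPU: softmax, argmax, filtering.
    probs = logits.softmax(dim=1)
    scores, labels = probs.max(dim=1)
    keep = scores > 0.5                # filtering shrinks the tensors on-device...
    scores, labels = scores[keep], labels[keep]

    # ...so the device-to-host transfer at the very end is much smaller.
    return scores.cpu().numpy(), labels.cpu().numpy()
```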
4. Multithreading for all my I/O write operations
For I/O bottlenecks, using Python threads is extremely powerful.
I moved all my writes under a ThreadPoolExecutor, batching my write operations.
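A minimal sketch of that pattern; the write_batch helper, the np.save serialization, and the dummy results list are stand-ins for the real write logic:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def write_batch(arrays: list, paths: list) -> None:
    # Replace np.save with your real serialization (cv2.imwrite, PIL, etc.).
    for array, path in zip(arrays, paths):
        np.save(path, array)

# Dummy stand-in: in practice, `results` comes out of the inference loop.
results = [
    ([np.zeros((224, 224, 3), dtype=np.uint8)], ["out_0.npy"]),
    ([np.zeros((224, 224, 3), dtype=np.uint8)], ["out_1.npy"]),
]

# Disk I/O releases the GIL, so plain Python threads overlap the writes with GPU work.
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(write_batch, arrays, paths) for arrays, paths in results]
    for future in futures:
        future.result()  # re-raises any exception from the worker threads
```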
.
Note that I used only good old Python & PyTorch code.
→ When the code is poorly written, no tool can save you.
Only now is the time to add fancy tooling, such as TensorRT.
.
So remember...
To optimize the PyTorch code by 82%, I:
1. Batched the inference samples
2. Leveraged PyTorch's DataLoader
3. Moved as much of the postprocessing as possible to the GPU
4. Used multithreading for all my I/O write operations
What other methods do you have in mind? Leave them in the comments ↓
How I failed to optimize the inference of my DL models
This is how I FAILED to optimize the inference of my DL models when running them on an Nvidia GPU. Let me tell you what to avoid ↓
I had a simple task: reduce the latency of the DL models used in production.
We had 4 DL models that were running on Nvidia GPUs.
After a first look at the inference code, I saw that the inputs to the models weren't batched.
We were processing one sample at a time.
I said to myself: "Ahaa! That's it. I cracked it. We just have to batch as many samples as possible, and we are done."
So, I did just that...
After 2-3 days of work adding the extra batch dimension to the PyTorch preprocessing & postprocessing code, I realized I WAS WRONG.
Here is why
↓↓↓
We were using Nvidia GPUs from the A family (A6000, A5000, etc.).
As these GPUs have a lot of memory (>40GB), I managed to max out the VRAM and squeeze a batch of 256 images onto the GPU.
Relative to using a "batch = 1", it was faster, but not A LOT FASTER, as I had expected.
Then I tried batches of 128, 64, 32, 16, and 8.
...and realized that any batch size above 16 was running slower than a batch of 16.
→ A batch of 16 was the sweet spot.
But that is not good, as I was using only ~10% of the VRAM...
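Roughly, the sweep looked like the sketch below; the model and data are placeholders, not the production pipeline, but the pattern of timing each batch size end to end is the same:

```python
import time
import torch

model = torch.nn.Conv2d(3, 16, 3).cuda().eval()       # placeholder model
images = torch.rand(1_024, 3, 224, 224).pin_memory()  # placeholder images in pinned memory

for batch_size in (8, 16, 32, 64, 128, 256):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for chunk in images.split(batch_size):
            model(chunk.cuda(non_blocking=True))
    torch.cuda.synchronize()                            # wait for all GPU work before timing
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:4d} -> {len(images) / elapsed:8.1f} images/s")
```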
Why is that?
The Nvidia A family of GPUs is known for:
- having a lot of VRAM
- not being very fast (the memory transfer between the CPU & GPU and the number of CUDA cores aren't that great)
That being said, my program was throttled.
Even if my GPU could handle much more memory-wise, the memory transfer & processing speeds weren't keeping up.
In the end, it was a good optimization: ~75% faster
But the lesson of this story is:
→ ALWAYS KNOW YOUR HARDWARE
Most probably, running a bigger batch on an A100 or V100 wouldn't have the same problem.
I plan to try that.
But that is why...
→ You ALWAYS have to optimize the parameters of your system based on your hardware!
In theory, I knew this, but it is completely different when you encounter it in production.
Let me know in the comments if you want more similar stories on "DO NOTs" from my experience.
Computer science is dead
Computer science is dead. Do this instead.
In a recent talk, Jensen Huang, CEO of Nvidia, said that kids shouldn't learn programming anymore.
He said that until now, most of us thought that everyone should learn to program at some point.
But actually, the opposite is true.
With the rise of AI, nobody should have to, or need to, learn to program anymore.
He highlights that with AI tools, the technology divide between non-programmers and engineers is closing.
.
As an engineer, my ego is hurt; my first reaction is to say that it is stupid.
But after thinking about it more thoroughly, I tend to agree with him.
After all, even now, almost anybody can work with AI.
This probably won't happen in the next 10 years, but at some point, it 100% will.
At some point, we will ask our AI companion to write a program that does X for us or whatever.
But, I think this is a great thing, as it will give us more time & energy to focus on what matters, such as:
- solving real-world problems (not just tech problems)
- moving to the next level of technology (bioengineering, interplanetary colonization, etc.)
- thinking about the grand scheme of things
- being more creative
- spending more time connecting with our families
- spending more time taking care of ourselves
I personally think it is a significant step for humanity.
.
What do you think?
As an engineer, do you see your job still present in the next 10+ years?
Here is the full talk
---
Images
If not otherwise stated, all images are created by the author.