Reduce your PyTorch code latency by 82%
How not to optimize the inference of your DL models. Computer science is dead.
Decoding ML Notes
This week's topics:
Reduce the latency of your PyTorch code by 82%
How I failed to optimize the inference of my DL models
Computer science is dead
New book on engineering end-to-end LLM systems, from data collection and fine-tuning to LLMOps (deployment, monitoring).
I kept this one a secret, but in the past months, in collaboration with Packt, Alex Vesa and Maxime Labonne, we started working on the LLM Engineer's Handbook.
A book that will walk you through everything you need to know to build a production-ready LLM project.
I am a big advocate of learning with hands-on examples while being anchored in real-world use cases.
That is why this is not the standard theoretical book.
While reading the book, you will learn to build a complex LLM project: an LLM Twin. Meanwhile, the theoretical aspects will back everything up, so you understand why we make certain decisions.
However, our ultimate goal is to present a framework that can be applied to most LLM projects.
.
Here is a sneak peek of what you will learn within the LLM Engineer's Handbook:
- collect unstructured data
- create instruction datasets from raw data to fine-tune LLMs
- SFT techniques such as LoRA and QLoRA
- LLM evaluation techniques
- Preference alignment using DPO
- inference optimization methods (key optimization, model parallelism, quantization, attention mechanisms)
- advanced RAG algorithms using LangChain as our LLM framework and Qdrant as our vector DB
- design LLM systems using the FTI architecture
- use AWS SageMaker to fine-tune and deploy open-source LLMs
- use ZenML to orchestrate all the pipelines and track the data as artifacts
- LLMOps patterns such as CT/CI/CD pipelines, model registries and using Comet for experiment tracking and prompt monitoring
.
The book is still a work in progress, but we are very excited about it!
Thank you, Packt, for making this possible and Maxime and Alex for this remarkable collaboration.
If you are curious, you can currently pre-order it from Amazon. The whole book should be released by the end of September 2024.
---
LLM Engineer's Handbook: Master the art of engineering Large Language Models from concept to production
Reduce the latency of your PyTorch code by 82%
This is how I reduced the latency of my PyTorch code by 82%, using only Python & PyTorch. NO fancy tools involved!
The problem?
During inference, I was running 5 DL models on ~25k images at once.
The script took around 4 hours to run.
The problem is that this isn't a batch job that runs overnight...
Various people across the company required it to run in "real-time" multiple times a day.
The solution?
The first thing that might come to your mind is to start using some fancy optimizer (e.g., TensorRT).
Even though that should be done at some point...
First, you should ask yourself (a minimal profiling sketch follows this list):
- I/O bottlenecks: reading & writing images
- preprocessing & postprocessing - can it be parallelized?
- are the CUDA cores used at their maximum potential?
- is the bandwidth between the CPU & GPU throttled?
- can we move more computation to the GPU?
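To answer questions like these, a quick profiling pass is usually enough. Here is a minimal sketch (placeholder model and random data, not my actual script) that uses torch.profiler to show whether the CPU-side code or the GPU kernels dominate; pairing it with nvidia-smi answers the CUDA utilization question.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and batch -- swap in your own models and preprocessed images.
model = torch.nn.Conv2d(3, 16, 3).cuda().eval()
batch = torch.rand(64, 3, 224, 224)

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(batch.cuda())
    torch.cuda.synchronize()  # make sure all GPU work is captured

# Sort by CUDA time to see where the script actually spends its time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```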
That being said...
Here is how I decreased the latency of the script by 82%:
↓↓↓
1. Batched the inference samples
Batching is not only valuable for training; it is also a powerful way to speed up your inference time.
Otherwise, you waste your GPU CUDA cores.
Instead of passing one sample at a time through the models, I now process 64 at once.
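To make this concrete, here is a minimal sketch of the idea, with a placeholder model and random tensors standing in for the preprocessed images (not my production code):

```python
import torch

model = torch.nn.Conv2d(3, 16, 3).cuda().eval()   # placeholder for one of the DL models
images = torch.rand(1_024, 3, 224, 224)           # placeholder for the preprocessed images
BATCH_SIZE = 64

predictions = []
with torch.no_grad():
    # One forward pass per 64 samples instead of one forward pass per sample.
    for chunk in images.split(BATCH_SIZE):
        predictions.append(model(chunk.cuda()).cpu())
```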
2. Leveraged PyTorch's DataLoader
This has 2 main advantages:
- parallel data loading & preprocessing on multiple processes (NOT threads)
- copying your input images directly into the pinned memory (avoid a CPU -> CPU copy operation)
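A sketch of that setup, with a hypothetical ImageDataset standing in for the real image-reading and preprocessing logic:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    """Hypothetical dataset: reads & preprocesses one image per index."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Replace with the real disk read + preprocessing (decode, resize, normalize).
        return torch.rand(3, 224, 224)

loader = DataLoader(
    ImageDataset(paths=[f"img_{i}.jpg" for i in range(25_000)]),
    batch_size=64,
    num_workers=8,     # loading & preprocessing run in parallel worker processes
    pin_memory=True,   # batches are copied into pinned (page-locked) memory
)

with torch.no_grad():
    for batch in loader:
        batch = batch.cuda(non_blocking=True)  # async host-to-GPU copy, enabled by pinned memory
        # ... run the models on `batch` ...
```

Note that with num_workers > 0 you may need the usual `if __name__ == "__main__":` guard, depending on your platform's multiprocessing start method.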
3. Moved as much of the postprocessing as possible to the GPU
I saw that the tensors were moved to the CPU too early and mapped to NumPy arrays.
I refactored the code to keep it on the GPU as much as possible, which had 2 main advantages:
- tensors are processed faster on the GPU
- at the end of the logic, I had smaller tensors, resulting in smaller transfers between the CPU & GPU
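As an illustration only (my real postprocessing was different), here is the shape of that refactor for a classification-style output, with a made-up 0.5 confidence threshold:

```python
import torch

def postprocess(logits: torch.Tensor):
    # Everything below runs on the GPU: softmax, argmax, filtering.
    probs = logits.softmax(dim=1)
    scores, labels = probs.max(dim=1)
    keep = scores > 0.5                # filtering shrinks the tensors on-device...
    scores, labels = scores[keep], labels[keep]

    # ...so the device-to-host transfer at the very end is much smaller.
    return scores.cpu().numpy(), labels.cpu().numpy()
```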
4. Multithreading for all my I/O write operations
For I/O bottlenecks, using Python threads is extremely powerful.
I moved all my writes under a ThreadPoolExecutor, batching my write operations.
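A minimal sketch of that pattern; the write_batch helper, the np.save serialization, and the dummy results list are stand-ins for the real write logic:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def write_batch(arrays: list, paths: list) -> None:
    # Replace np.save with your real serialization (cv2.imwrite, PIL, etc.).
    for array, path in zip(arrays, paths):
        np.save(path, array)

# Dummy stand-in: in practice, `results` comes out of the inference loop.
results = [
    ([np.zeros((224, 224, 3), dtype=np.uint8)], ["out_0.npy"]),
    ([np.zeros((224, 224, 3), dtype=np.uint8)], ["out_1.npy"]),
]

# Disk I/O releases the GIL, so plain Python threads overlap the writes with GPU work.
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(write_batch, arrays, paths) for arrays, paths in results]
    for future in futures:
        future.result()  # re-raises any exception from the worker threads
```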
.
Note that I used only good old Python & PyTorch code.
→ When the code is poorly written, no tool can save you.
Only now is the time to add fancy tooling, such as TensorRT.
.
So remember...
To optimize the PyTorch code by 82%, I:
1. Batched the inference samples
2. Leveraged PyTorch's DataLoader
3. Moved as much of the postprocessing as possible to the GPU
4. Used multithreading for all my I/O write operations
What other methods do you have in mind? Leave them in the comments ↓
How I failed to optimize the inference of my DL models
This is how I FAILED to optimize the inference of my DL models when running them on an Nvidia GPU. Let me tell you what to avoid ↓
I had a simple task: reduce the latency of the DL models used in production.
We had 4 DL models that were running on Nvidia GPUs.
After a first look at the inference code, I saw that the inputs to the models weren't batched.
We were processing one sample at a time.
I said to myself: "Ahaa! That's it. I cracked it. We just have to batch as many samples as possible, and we are done."
So, I did just that...
After 2-3 days of work adding the extra batch dimension to the PyTorch preprocessing & postprocessing code, I realized I WAS WRONG.
Here is why
↓↓↓
We were using Nvidia GPUs from the A family (A6000, A5000, etc.).
As these GPUs have a lot of memory (>40GB), I managed to max out the VRAM and squeeze a batch of 256 images onto the GPU.
Relative to using a "batch = 1", it was faster, but not A LOT FASTER, as I had expected.
Then I tried batches of 128, 64, 32, 16, and 8.
...and realized that any batch size above 16 was running slower than a batch of 16.
→ A batch of 16 was the sweet spot.
But that is not good, as I was using only ~10% of the VRAM...
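Roughly, the sweep looked like the sketch below; the model and data are placeholders, not the production pipeline, but the pattern of timing each batch size end to end is the same:

```python
import time
import torch

model = torch.nn.Conv2d(3, 16, 3).cuda().eval()       # placeholder model
images = torch.rand(1_024, 3, 224, 224).pin_memory()  # placeholder images in pinned memory

for batch_size in (8, 16, 32, 64, 128, 256):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for chunk in images.split(batch_size):
            model(chunk.cuda(non_blocking=True))
    torch.cuda.synchronize()                            # wait for all GPU work before timing
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:4d} -> {len(images) / elapsed:8.1f} images/s")
```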
Why is that?
The Nvidia A family of GPUs is known for:
- having a lot of VRAM
- not being very fast (the memory transfer between the CPU & GPU and the number of CUDA cores aren't that great)
That being said, my program was throttled.
Even if my GPU could handle much more memory-wise, the memory transfer & processing speeds weren't keeping up.
In the end, it was a good optimization: ~75% faster
But the lesson of this story is:
→ ALWAYS KNOW YOUR HARDWARE
Most probably, running a bigger batch on an A100 or V100 wouldn't have the same problem.
I plan to try that.
But that is why...
→ You ALWAYS have to optimize the parameters of your system based on your hardware!
In theory, I knew this, but it is completely different when you encounter it in production.
Let me know in the comments if you want more similar stories on "DO NOTs" from my experience.
Computer science is dead
Computer science is dead. Do this instead.
In a recent talk, Jensen Huang, CEO of Nvidia, said that kids shouldn't learn programming anymore.
He said that until now, most of us thought that everyone should learn to program at some point.
But actually, the opposite is true.
With the rise of AI, nobody should have to, or need to, learn to program anymore.
He highlights that with AI tools, the technology divide between non-programmers and engineers is closing.
.
As an engineer, my ego is hurt; my first reaction is to say that it is stupid.
But after thinking about it more thoroughly, I tend to agree with him.
After all, even now, almost anybody can work with AI.
This probably won't happen in the next 10 years, but at some point, it 100% will.
At some point, we will ask our AI companion to write a program that does X for us or whatever.
But, I think this is a great thing, as it will give us more time & energy to focus on what matters, such as:
- solving real-world problems (not just tech problems)
- moving to the next level of technology (bioengineering, interplanetary colonization, etc.)
- thinking about the grand scheme of things
- being more creative
- spending more time connecting with our families
- spending more time taking care of ourselves
I personally think it is a significant step for humanity.
.
What do you think?
As an engineer, do you see your job still present in the next 10+ years?
Here is the full talk
---
Images
If not otherwise stated, all images are created by the author.