Does your model take too long to run inference?
Compiling ML models. Running models in C++/Java/C#. The fastest inference engines out there.
Decoding ML Notes
Hey everyone, Happy Saturday!
We're sharing "DML Notes" every week.
This series offers quick summaries, helpful hints, and advice about Machine Learning & MLOps engineering in a short and easy-to-understand format.
Every Saturday, "DML Notes" will bring you a concise, easy-to-digest roundup of the most significant concepts and practices in MLE, GenAI, Deep Learning, and MLOps.
We aim to craft this series to enhance your understanding and keep you updated while respecting your time (2-3 minutes) and curiosity.
Introducing DML Chat
After DML Notes, the DML Chat is the second add-on that we're excited about.
Our vision for it is to be a place where we can get in touch with you and answer all of your questions about the articles we're posting and the challenges you're facing.
Apart from that, we'll post short prompts, thoughts, and the updates that come our way, and you can jump into the discussion right away.
To keep our time manageable, we'll only allow subscribers who support our writing to initiate conversations.
Don't worry, you can still follow the threads and learn from our awesome articles.
The DML Team
This week's topics:
TorchScript: How PyTorch models can be compiled for faster inference.
ONNXRuntime: Run optimized models in C++/Java/C#
TensorRT: The fastest framework for running Deep Learning / LLM models out there.
TorchScript: How PyTorch models can be compiled for faster inference.
PyTorch, by default, runs in Eager Mode with AutoGrad, and the Python interpreter can be plugged in quite easily to debug and iterate fast. However, this mode is too slow for production inference; that task is usually abstracted away and handled by an inference engine or executed directly in C++.
Here's how to compile a PyTorch model using the TorchScript + JIT compiler stack.
Evaluation Mode
This is a plain PyTorch model definition, the one we're all familiar with, with .eval() mode set. Evaluation mode turns off Dropout and modifies Normalisation layers (e.g. BatchNorm uses its running statistics); AutoGrad itself is disabled separately, via torch.no_grad().
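As a minimal sketch, assuming a tiny hypothetical model (SmallNet is a stand-in for whatever architecture you actually trained):

```python
import torch
import torch.nn as nn

# Hypothetical model, stands in for your real architecture.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)
        self.drop = nn.Dropout(0.2)

    def forward(self, x):
        return self.fc(self.drop(x))

model = SmallNet()
model.eval()  # disables Dropout, switches Normalisation layers to inference behaviour

with torch.no_grad():  # AutoGrad is turned off at call time
    out = model(torch.randn(1, 16))
```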
Converting to TorchScript
Using the compiler stack, this can be done in two ways: by using either the scripting module or the tracing module.
Via scripting, you're translating model code directly into TorchScript.
Via tracing, the flow of tensor operations from AutoGrad is recorded in a "graph".
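As a rough sketch of both paths, reusing the hypothetical SmallNet from above:

```python
import torch

model = SmallNet().eval()

# Scripting: the Python source of forward() is compiled to TorchScript,
# so data-dependent control flow is preserved.
scripted = torch.jit.script(model)

# Tracing: an example input is pushed through the model and the executed
# tensor operations are recorded into a static graph.
traced = torch.jit.trace(model, torch.randn(1, 16))
```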
Intermediate Representation (IR)
The TorchScript model is now in a language-agnostic format that details the computation graph, ready to be optimized and consumed by the JIT.
The generated IR code is statically typed and verbose, such that it defines each instruction performed, much like Assembly code.
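Continuing the sketch, you can inspect both views of the compiled module directly:

```python
print(traced.graph)  # the statically typed IR, one instruction per line
print(traced.code)   # the equivalent TorchScript (Python-like) source
```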
Optimizations
At this stage, the generated IR graph is optimized through different tactics:
- Constant folding
- Removing redundancies
- Considering specific kernels for the targeted hardware (GPU, TPU, CPU)
Serialization
The last step is compiling and saving the model, so that you can deserialize it in C++, define a tensor, populate it, run the forward pass, and process the outputs.
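A minimal sketch of that round trip (file names are placeholders); in production, the saved archive would typically be loaded from C++ with libtorch's torch::jit::load:

```python
import torch

# Serialize the compiled module: the archive bundles code, parameters and attributes.
torch.jit.save(traced, "model.pt")

# Deserialize it later, here in Python (or in C++ via torch::jit::load("model.pt")).
loaded = torch.jit.load("model.pt")

with torch.no_grad():
    out = loaded(torch.randn(1, 16))
```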
ONNXRuntime: Run optimized models in C++/Java/C#
ONNX Runtime is an open-source project that provides a performance-focused engine for ONNX (Open Neural Network Exchange) models. It enables models trained in various frameworks to be converted to the ONNX format and then efficiently run.
ONNX is a standard format for representing neural networks. The key lies in its IR (intermediate representation) format, which is a "close-to-hardware" representation of the network, such that other frameworks (PyTorch, TensorFlow, Caffe) can interoperate with it.
Here's what happens under the hood:
Source to ONNX
Assign .eval() mode, define the input/output tensor names and shapes, and run torch.onnx.export().
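A sketch of that export step, again with the hypothetical SmallNet; the tensor names and the dynamic batch axis are illustrative choices, not requirements:

```python
import torch

model = SmallNet().eval()
dummy = torch.randn(1, 16)  # example input used to record the graph

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # allow variable batch size
)
```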
Computational Graph
On the forward pass, tensor operations are recorded.
A computational graph is constructed, and the weights and parameters are tied to their respective graph operations.
Then, they are converted and mapped to ONNX operations.
The model is now saved as a graph of proto objects.
ONNX Runtime (ORT)
When loading the .onnx file, the proto graph is parsed and converted into an in-memory representation.
Graph Optimizations
Given a selected execution provider (CUDA, CPU), ORT applies a series of provider-independent optimizations.
These are divided into 3 levels (selectable in code, as sketched after the list):
- L1 Basic (node folding)
- L2 Extended (GELU, MatMul fusion)
- L3 Layout (CPU only, NCHW to NCHWc)
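In ONNX Runtime's Python API, the level is chosen through SessionOptions; a minimal sketch:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# How far ORT may rewrite the graph: ORT_ENABLE_BASIC (L1),
# ORT_ENABLE_EXTENDED (L2), or ORT_ENABLE_ALL (adds the layout optimizations).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

session = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```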
Partitioning the Graph
This process splits the graph into a set of subgraphs based on the available Execution Providers.
Since ORT can run sub-graphs in parallel or distributed fashion, this yields a performance boost.
Assign Subgraphs
Each subgraph is reduced to a single fused operator, using the provider's Compile() method, which wraps it as a custom operator, also called a kernel. CPU is the default provider and is used as a fallback measure.
Inference
The created engine can then be used for inference.
The created kernels process the received data according to the defined graph order of subgraphs and yield the outputs, which can be processed further.
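End to end, a minimal inference call with a CUDA-first provider list (falling back to CPU) could look like this; the tensor name matches the export sketch above:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # ORT falls back to CPU
)

x = np.random.randn(8, 16).astype(np.float32)
outputs = session.run(None, {"input": x})  # None -> return all graph outputs
print(outputs[0].shape)
```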
TensorRT: The fastest inference framework for Deep Learning / LLMs out there.
TensorRT is NVIDIA's high-performance deep learning inference optimizer and runtime library for production environments.
It's designed to accelerate deep learning inference on NVIDIA GPUs, providing lower latency and higher throughput for deep learning applications.
Apart from specialized hardware (e.g. Groq LPUs) optimized for matmul in Transformers, TensorRT provides the fastest inference times across Deep Learning model architectures, including Transformers and LLMs.
Here's what happens when converting a model from ONNX to a TensorRT engine:
Parsing:
The first step involves parsing the ONNX model file.
TensorRT's parser supports ONNX models and interprets the various layers, weights, and inputs defined in the ONNX format.
This step translates the high-level network definition into an internal representation that TensorRT can work with.
Layer Fusion:
TensorRT performs layer fusion optimizations during this phase.
It combines multiple layers and operations into a single, more efficient kernel. This reduces the overhead of launching multiple kernels and improves the execution speed of the network.
Precision Calibration:
TensorRT offers precision calibration to optimize model execution.
It can convert floating-point operations (like FP32) into lower precision operations (like FP16 or INT8) to increase inference speed.
During this step, TensorRT ensures that the precision reduction does not significantly impact the accuracy of the model.
Kernel Autotuning:
TensorRT selects the most efficient algorithms and kernels for the target GPU architecture.
It evaluates various implementation strategies for each layer of the network on the specific hardware and selects the fastest option.
Memory Optimization:
TensorRT optimizes memory usage by analyzing the network's graph to minimize memory footprint during inference.
It reuses memory between layers when possible and allocates memory efficiently for both weights and intermediate tensors.
Serialization:
Once the model has been optimized, TensorRT serializes the optimized network into an engine file.
This engine file is a binary representation of the optimized model.
It can be loaded and executed on compatible NVIDIA GPUs for inference.
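A sketch of that build-and-serialize flow with the TensorRT Python API (TensorRT 8.x-style calls; file names are placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parsing: read the ONNX graph into TensorRT's internal network definition.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Precision: allow FP16 kernels where they keep accuracy within tolerance.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Layer fusion, kernel autotuning and memory planning happen inside this call.
engine_bytes = builder.build_serialized_network(network, config)

# Serialization: persist the optimized engine for later deserialization at runtime.
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```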
Inference:
The final step is running inference with the TensorRT engine.
The application loads the serialized engine file, prepares the input data, and executes the model to obtain predictions.
TensorRT engines are designed for high-performance inference, significantly reducing latency and increasing throughput compared to running the original ONNX model directly.