Upskill your LLM knowledge base with these tools.
Speed up your LLM inference and dissect the attention mechanism with step-by-step animations.
Decoding ML Notes
The LLM-Twin Course development has taken off!
Come aboard and learn how to design, build, and implement an end-to-end LLM replica by following along, step by step and hands-on, with the development of data pipelines, ingestion, LLM fine-tuning, serving, monitoring, and more.
The first 2 of 11 lessons are out; make sure to check them out here:
Lesson 1: An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin
Lesson 2: The Importance of Data Pipelines in the Era of Generative AI
This week's topics:
Fast inference on LLMs
Visualize the attention mechanism
A commonly misunderstood CUDA issue!
Fast inference on LLMs
For the past few years, LLMs have been a hot topic: new models, RAG, new papers, the rise of open-source models, and more.
The attention mechanism is easy to understand but "hungry" to compute, so multiple methods aim to bridge the performance gap in model serving.
Here are the top 4 LLM inference solutions:
vLLM
A fast and easy-to-use library for LLM inference and serving.
Key aspects are:
↳ is open-source
↳ state-of-the-art serving throughput
↳ fast model execution with optimized CUDA kernels/graphs
↳ efficient memory management using PagedAttention
↳ support for AMD GPUs (ROCm)
↳ deployment support with NVIDIA Triton, KServe, Docker
Get started: shorturl.at/nAFPW
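If you want a feel for how little code vLLM needs, here is a minimal offline-inference sketch (the model name is just an example; any compatible HuggingFace checkpoint works):

```python
from vllm import LLM, SamplingParams

# Load an open-source model; vLLM handles batching and PagedAttention internally.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Sampling parameters control decoding (temperature, max new tokens, etc.).
params = SamplingParams(temperature=0.8, max_tokens=128)

# Generate completions for a batch of prompts (a single prompt here).
outputs = llm.generate(["Explain the attention mechanism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```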
TensorRT-LLM
A library that accelerates and optimizes the inference performance of the latest LLMs.
Key aspects are:
↳ is open-source
↳ built on a strong TensorRT foundation
↳ leverages custom, optimized CUDA kernels for transformers
↳ enhances customization
↳ supports various optimizations (quantization, tensor parallelism)
↳ takes advantage of the NVIDIA toolkit (perf-analyzer, Triton)
Get started: shorturl.at/dluMX
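Recent TensorRT-LLM releases also ship a high-level Python LLM API. A rough sketch, assuming a recent version (the exact API has evolved across releases, and the model name is just an example):

```python
from tensorrt_llm import LLM, SamplingParams  # high-level LLM API in recent releases

# Engine building/optimization happens under the hood when the model loads.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(temperature=0.8, max_tokens=128)

# Generate and print completions for a batch of prompts.
for output in llm.generate(["What is PagedAttention?"], params):
    print(output.outputs[0].text)
```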
Ollama
A tool that allows you to run open-source language models locally.
Key aspects are:
↳ multi-modal model support
↳ optimizes setup and configuration details, including GPU usage
↳ bundles weights, configuration, and data into a single Modelfile package
Get started: shorturl.at/dGZ46
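Once the Ollama server is running and a model is pulled, you can hit its local REST API from any language. A minimal sketch in Python, assuming the default port and a locally pulled `llama2` tag:

```python
import requests

# Ollama exposes a local REST API (default port 11434) once `ollama serve`
# is running and a model has been pulled, e.g. `ollama pull llama2`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",               # any locally pulled model tag
        "prompt": "Why is the sky blue?",
        "stream": False,                 # single JSON response instead of a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```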
Chat with RTX
A solution from NVIDIA that allows users to build their own personalized chatbot experience.
Key aspects are:
↳ emphasizes a no-code, ChatGPT-like interface
↳ one can connect custom documents, videos, notes, and PDFs
↳ easy to set up RAG (Retrieval-Augmented Generation)
↳ support for the latest LLMs
↳ leverages TensorRT-LLM and RTX acceleration
↳ downloadable installer (35 GB), with Mistral and LLaMA 7B versions out of the box
Get started: shorturl.at/ekuK6
Visualize the attention mechanism
LLMs are complex; the key to understanding how they process text is the attention mechanism.
Here are 3 tools to help you interactively visualize attention:
AttentionViz: shorturl.at/DSY58
↳ configurable number of heads
↳ configurable number of layers
↳ has ViT, BERT, GPT2 included
↳ 2D visualization + 3D zoom-ins on selected layers
PyTorch MM: shorturl.at/lqJQY
↳ custom operations
↳ extensible in graph-like fashion
↳ has GPT2-nano, LoRA technique included
↳ 3D visualization
BBycroft: shorturl.at/ivCR1
↳ inspect step-by-step 1-token prediction
↳ has GPT2-small, GPT3, GPT-nano, GPT2-XL included
↳ straightforward to use
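If you also want to inspect attention weights programmatically, here is a minimal sketch using HuggingFace transformers and matplotlib to render one head's attention matrix as a heatmap (the layer/head choice is arbitrary):

```python
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

# Load GPT-2 and ask it to return attention weights alongside hidden states.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shaped (batch, heads, seq, seq).
attn = outputs.attentions[0][0, 0].detach().numpy()  # layer 0, head 0

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title("GPT-2 attention, layer 0, head 0")
plt.show()
```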
A commonly misunderstood CUDA issue!
The problem was that nvidia-smi was showing a different GPU device order compared to Docker or Python. Thus, errors regarding disjoint memory regions appeared.
Here's the trick:
System Layer
nvidia-smi works at the system level and orders GPUs respecting the top-down order in which the physical video cards are inserted into the PCI Express slots on the motherboard.
Software Layer
At this layer, Python, Docker, or any other program sees the GPUs in "FASTEST_FIRST" order by default, meaning it will put the GPU with the highest CC (CUDA compute capability) on the first index.
The solution is to make the applications at the Software Layer respect the System Layer ordering by setting the environment variable:
CUDA_DEVICE_ORDER = "PCI_BUS_ID"
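As a minimal illustration in PyTorch (the variable must be set before CUDA is initialized, i.e. before the first `import torch` or any GPU call, otherwise it has no effect):

```python
import os

# Must be set BEFORE the first CUDA initialization (i.e. before importing
# torch or any library that touches the GPU), otherwise it is ignored.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch

# Device indices now match the order reported by nvidia-smi.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

Alternatively, export it in the shell or Dockerfile (e.g. `export CUDA_DEVICE_ORDER=PCI_BUS_ID`) so every process inherits the system-level ordering.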