Upskill your LLM knowledge with these tools.
Speed up your LLM inference and dissect the attention mechanism with step-by-step animations.
Decoding ML Notes
The LLM-Twin Course development has taken off! 🚀
Come aboard and learn how to design, build, and implement an end-to-end LLM replica by following along, step by step and hands-on, with the development of data pipelines, ingestion, LLM fine-tuning, serving, monitoring, and more.
The first 2 of the 11 lessons are out; make sure to check them out here:
Lesson 1: An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin
Lesson 2: The Importance of Data Pipelines in the Era of Generative AI
This week’s topics:
Fast inference on LLMs
Visualize the attention mechanism
A commonly misunderstood CUDA issue!
Fast inference on LLMs
For the last few years, LLMs have been a hot topic: new models, RAG, new papers, the rise of open-source models, and more.
The attention mechanism is easy to understand but “hungry” to compute, so multiple methods aim to close the performance gap in model serving.
Here are the top 4 LLM inference solutions:
𝘃𝗟𝗟𝗠
A fast and easy-to-use library for LLM inference and serving.
𝙆𝙚𝙮 𝙖𝙨𝙥𝙚𝙘𝙩𝙨 𝙖𝙧𝙚:
➝ open-source
➝ state-of-the-art serving throughput
➝ fast model execution with optimized CUDA kernels/graphs
➝ efficient memory management using PagedAttention
➝ support for AMD GPUs (ROCm)
➝ deployment support with NVIDIA Triton, KServe, Docker
🔗 𝘎𝘦𝘵 𝘚𝘵𝘢𝘳𝘵𝘦𝘥: shorturl.at/nAFPW
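To make the "fast and easy-to-use" claim concrete, here is a minimal offline-inference sketch with vLLM's Python API; the model name and sampling values are illustrative placeholders:

```python
# Minimal vLLM offline-inference sketch (model name and sampling values are illustrative).
from vllm import LLM, SamplingParams

# Load an open-source model; vLLM handles PagedAttention and continuous batching internally.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate completions for a batch of prompts in a single call.
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For production serving, vLLM also ships an OpenAI-compatible HTTP server built on top of the same engine.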
𝗧𝗲𝗻𝘀𝗼𝗿𝗥𝗧-𝗟𝗟𝗠
A library that accelerates and optimizes the inference performance of the latest LLMs.
𝙆𝙚𝙮 𝙖𝙨𝙥𝙚𝙘𝙩𝙨 𝙖𝙧𝙚:
➝ open-source
➝ built on a strong TensorRT foundation
➝ leverages custom-optimized CUDA kernels for transformers
➝ enhances customization
➝ supports various optimizations (quantization, tensor parallelism)
➝ takes advantage of the NVIDIA Toolkit (perf-analyzer, Triton)
🔗 𝘎𝘦𝘵 𝘚𝘵𝘢𝘳𝘵𝘦𝘥: shorturl.at/dluMX
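Newer TensorRT-LLM releases also expose a high-level Python LLM API that mirrors the vLLM example above; treat the exact import path and model name below as assumptions to verify against your installed version:

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (verify against your installed version).
from tensorrt_llm import LLM, SamplingParams

# The TensorRT engine build/optimization happens under the hood when the model is loaded.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model name

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for out in llm.generate(["What does TensorRT optimize?"], sampling):
    print(out.outputs[0].text)
```

Older releases instead follow a build-the-engine-then-serve workflow (e.g. via Triton), which is where the NVIDIA Toolkit pieces listed above come in.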
𝗢𝗹𝗹𝗮𝗺𝗮
A tool that allows you to run open-source language models locally.
𝗞𝗲𝘆 𝗮𝘀𝗽𝗲𝗰𝘁𝘀 𝗮𝗿𝗲:
➝ multi-modal model support
➝ optimizes setup and configuration details, including GPU usage
➝ bundles weights, configuration, and data into a single Modelfile package
🔗 𝘎𝘦𝘵 𝘚𝘵𝘢𝘳𝘵𝘦𝘥: shorturl.at/dGZ46
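Once Ollama is installed and a model has been pulled (e.g. with ollama pull llama2), you can talk to the local server over its REST API; a minimal sketch using Python's requests:

```python
# Minimal sketch: query a locally running Ollama server over its REST API.
# Assumes `ollama serve` is running and the model has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",             # any locally pulled model tag
        "prompt": "Why is the sky blue?",
        "stream": False,               # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```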
𝗖𝗵𝗮𝘁 𝘄𝗶𝘁𝗵 𝗥𝗧𝗫
A solution from NVIDIA that allows users to build their own personalized chatbot experience.
𝙆𝙚𝙮 𝙖𝙨𝙥𝙚𝙘𝙩𝙨 𝙖𝙧𝙚:
➝ emphasizes a no-code, ChatGPT-like interface
➝ lets you connect custom documents, videos, notes, and PDFs
➝ easy to set up RAG (Retrieval-Augmented Generation)
➝ support for the latest LLMs
➝ leverages TensorRT-LLM and RTX acceleration
➝ downloadable installer (35GB), out-of-the-box Mistral & LLaMA 7b versions
🔗 𝘎𝘦𝘵 𝘚𝘵𝘢𝘳𝘵𝘦𝘥: shorturl.at/ekuK6
Visualize the attention mechanism
𝗟𝗟𝗠s are complex; the key to understanding how they work is the 𝗮𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗺𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺.
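Here is a minimal single-head scaled dot-product attention sketch in PyTorch (toy shapes, random tensors); the attention-weight matrix it prints is exactly what the tools below visualize:

```python
# Single-head scaled dot-product attention on toy tensors.
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8                    # toy sequence length and head dimension
q = torch.randn(seq_len, d_k)          # queries
k = torch.randn(seq_len, d_k)          # keys
v = torch.randn(seq_len, d_k)          # values

scores = q @ k.T / d_k**0.5            # (seq_len, seq_len) similarity matrix
weights = F.softmax(scores, dim=-1)    # each row sums to 1: "where each token looks"
output = weights @ v                   # weighted mix of value vectors

print(weights)                         # this attention matrix is what the visualizers render
```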
Here are 𝟯 𝘁𝗼𝗼𝗹𝘀 to help you interactively visualize attention:
𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻𝗩𝗶𝘇: shorturl.at/DSY58
➝ configurable number of heads
➝ configurable number of layers
➝ includes ViT, BERT, GPT-2
➝ 2D visualization + 3D zoom-ins on selected layers
𝗣𝘆𝗧𝗼𝗿𝗰𝗵 𝗠𝗠: shorturl.at/lqJQY
➝ custom operations
➝ extensible in a graph-like fashion
➝ includes GPT-2-nano and the LoRA technique
➝ 3D visualization
𝗕𝗕𝘆𝗖𝗿𝗼𝗳𝘁: shorturl.at/ivCR1
➝ inspect step-by-step single-token prediction
➝ includes GPT-2-small, GPT-3, GPT-nano, GPT-2-XL
➝ straightforward to use
A commonly misunderstood CUDA issue!
The problem: 𝗻𝘃𝗶𝗱𝗶𝗮-𝘀𝗺𝗶 was showing a 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝘁 𝗚𝗣𝗨 𝗱𝗲𝘃𝗶𝗰𝗲 𝗼𝗿𝗱𝗲𝗿 than Docker or Python, which led to errors about disjoint memory regions.
𝗛𝗲𝗿𝗲'𝘀 𝘁𝗵𝗲 𝘁𝗿𝗶𝗰𝗸:
𝗦𝘆𝘀𝘁𝗲𝗺 𝗟𝗮𝘆𝗲𝗿
𝙣𝙫𝙞𝙙𝙞𝙖-𝙨𝙢𝙞 works at the system level and orders GPUs following the top-down order in which the physical video cards are inserted into the PCI Express slots on the motherboard.
𝗦𝗼𝗳𝘁𝘄𝗮𝗿𝗲 𝗟𝗮𝘆𝗲𝗿
At this layer, Python, Docker, or any other program sees the GPUs in the default "𝙁𝘼𝙎𝙏𝙀𝙎𝙏_𝙁𝙄𝙍𝙎𝙏" order, meaning the GPU with the highest compute capability (CC) ends up at index 0.
The solution is to make applications at the Software Layer respect the System Layer ordering by setting the environment variable:
𝘾𝙐𝘿𝘼_𝘿𝙀𝙑𝙄𝘾𝙀_𝙊𝙍𝘿𝙀𝙍 = "𝙋𝘾𝙄_𝘽𝙐𝙎_𝙄𝘿"
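A minimal sketch of how this looks in a Python/PyTorch process; the key detail is setting the variable before CUDA is initialized (safest: before importing torch):

```python
# Force CUDA to enumerate GPUs in PCI bus order so indices match nvidia-smi.
# These variables must be set before the first CUDA initialization.
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # optional: pin this process to specific GPUs

import torch  # imported after setting the env vars on purpose

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # ordering now matches nvidia-smi
```

In Docker, the same variable can be passed to the container with -e CUDA_DEVICE_ORDER=PCI_BUS_ID.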