4 key decoding strategies for LLMs that you must know
The only 6 prompt engineering techniques you need to know. One thing that I do that sets me apart from the crowd.
Hello everyone,
I hope you enjoyed what Alex R. & Alex V. have prepared for you in their previous articles.
I promised that the 3 of us would dig deeper into more exciting topics about production-ready LLM and CV models.
→ But this is just the beginning. Stay tuned for more production ML 🔥
This week’s topics:
4 key decoding strategies for LLMs that you must know
The only 6 prompt engineering techniques you need to know
One thing that I do that sets me apart from the crowd
Want to build your first 𝗟𝗟𝗠 𝗽𝗿𝗼𝗷𝗲𝗰𝘁 but don't know where to start?
If you want to learn how to build hands-on LLM systems in a structured way, using good LLMOps principles…
We want to announce that we just released 8 Medium lessons for the Hands-on LLMs course that will put you on the right track ↓
Within the 8 Medium lessons, you will go step-by-step through the theory, system design, and code to learn how to build a:
real-time streaming pipeline (deployed on AWS) that uses Bytewax as the stream engine to listen to financial news, clean & embed the documents, and load them into a vector DB
fine-tuning pipeline (deployed as serverless continuous training) that fine-tunes an LLM on financial data using QLoRA, monitors the experiments with an experiment tracker, and saves the best model to a model registry
inference pipeline built in LangChain (deployed as a serverless RESTful API) that loads the fine-tuned LLM from the model registry and answers financial questions using RAG (leveraging the vector DB populated with financial news)
We will also show you how to integrate various serverless tools, such as:
• Comet ML as your ML Platform;
• Qdrant as your vector DB;
• Beam as your infrastructure.

Who is this for?
The series targets MLEs, DEs, DSs, or SWEs who want to learn to engineer LLM systems using good LLMOps principles.
How will you learn?
The series contains 4 hands-on video lessons and the open-source code you can access on GitHub.
Curious? ↓
Check out the 8 Medium lessons of the Hands-on LLMs course and start building your own LLM system:
🔗 The Hands-on LLMs Medium Series
4 key decoding strategies for LLMs that you must know
You see, LLMs don't just spit out text.
They calculate "logits", which are mapped to probabilities for every possible token in their vocabulary.
The model uses the previous token IDs to predict the most likely next token (the auto-regressive nature of decoder models).
The real magic happens in the decoding strategy you pick ↓
- Greedy Search
- Beam Search
- Top-K Sampling
- Nucleus Sampling
.
𝗚𝗿𝗲𝗲𝗱𝘆 𝗦𝗲𝗮𝗿𝗰𝗵
It only holds onto the most likely token at each stage. It's fast and efficient, but it is short-sighted.
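To make this concrete, here is a minimal sketch of greedy decoding, assuming you use the Hugging Face transformers generate API (the GPT-2 checkpoint and the prompt are just illustrative choices):

```python
# Minimal greedy-decoding sketch with Hugging Face transformers.
# The model checkpoint and prompt are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The stock market today", return_tensors="pt")

# do_sample=False + num_beams=1: keep only the single most likely
# token at every step (greedy search).
output = model.generate(**inputs, max_new_tokens=30, do_sample=False, num_beams=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```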
𝗕𝗲𝗮𝗺 𝗦𝗲𝗮𝗿𝗰𝗵
This time, you are not looking only at the token with the highest probability; you are considering the N most likely tokens at each step.
This will create a tree-like structure, where each node will have N children.
The procedure repeats until you hit a maximum length or an end-of-sequence token.
Ultimately, you pick the leaf with the highest cumulative score and recursively pick its parent until you hit the root node.
For example, in the graph below, we have "𝘣𝘦𝘢𝘮𝘴 = 2" and "𝘭𝘦𝘯𝘨𝘵𝘩 = 3".
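Here is a minimal code counterpart to that example, again assuming the Hugging Face transformers generate API (model and prompt are illustrative), with num_beams=2 mirroring the 2 beams above:

```python
# Minimal beam-search sketch with Hugging Face transformers
# (model and prompt are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# num_beams=2 keeps the 2 highest-scoring partial sequences at every
# step and returns the best-scoring finished sequence.
output = model.generate(**inputs, max_new_tokens=30, num_beams=2, do_sample=False, early_stopping=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```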
𝗧𝗼𝗽-𝗞 𝗦𝗮𝗺𝗽𝗹𝗶𝗻𝗴
This technique adds a dash of randomness to the generation process.
Instead of always picking the most likely token, it selects a token at random from the top k most likely choices.
Thus, the tokens with the highest probability will appear more often, but other tokens will be generated occasionally to add some randomness ("creativity").
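A minimal sketch of top-k sampling, assuming the same Hugging Face generate API (the value k=50 is just an illustrative choice):

```python
# Minimal top-k sampling sketch (model, prompt, and k are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# do_sample=True turns on sampling; top_k=50 restricts each draw to
# the 50 most likely tokens.
output = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```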
𝗡𝘂𝗰𝗹𝗲𝘂𝘀 𝗦𝗮𝗺𝗽𝗹𝗶𝗻𝗴
In this case, you are not picking just the top k most probable tokens.
Instead, you pick a cutoff value p and form a "nucleus" of tokens: the smallest set of tokens whose cumulative probability exceeds p.
Thus, at every step, the number of tokens included in the "nucleus" varies, which introduces even more diversity and creativity into your output.
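A minimal sketch of nucleus sampling with the same Hugging Face generate API (p=0.9 is an illustrative cutoff):

```python
# Minimal nucleus (top-p) sampling sketch (model, prompt, and p are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# top_p=0.9 samples only from the smallest set of tokens whose cumulative
# probability exceeds 0.9; top_k=0 disables the top-k cutoff.
output = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9, top_k=0)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The temperature hyperparameter mentioned in the note below can be passed to the same generate call (e.g., temperature=0.7).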
.
𝗡𝗼𝘁𝗲: For 𝘵𝘰𝘱-𝘬 and 𝘯𝘶𝘤𝘭𝘦𝘶𝘴 𝘴𝘢𝘮𝘱𝘭𝘪𝘯𝘨, you can also use the "𝘵𝘦𝘮𝘱𝘦𝘳𝘢𝘵𝘶𝘳𝘦" hyperparameter to tweak the output probabilities. It is typically set between 0 and 1 (values above 1 are also possible and flatten the distribution further). A low temperature (e.g., 0.1) decreases the entropy (randomness), making the generation more stable.
To summarize...
There are 2 main decoding strategies for LLMs:
- greedy search
- beam search
To add more variability and creativity on top of these, you can use:
- top-k sampling
- nucleus sampling
The only 6 prompt engineering techniques you need to know
The whole field of prompt engineering can be reduced to these 6 techniques I use almost daily when using ChatGPT (or other LLMs).
Here they are ↓
#1. 𝐅𝐞𝐰 𝐬𝐡𝐨𝐭 𝐩𝐫𝐨𝐦𝐩𝐭𝐢𝐧𝐠
Add 2 or 3 high-quality demonstrations to your prompt, each consisting of an input and the desired output for the target task.
The LLM will better understand your intention and what kind of answers you expect based on concrete examples.
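For example, a few-shot prompt for tagging the sentiment of financial headlines could look like the sketch below (the task and the demonstrations are purely illustrative):

```python
# Minimal few-shot prompt sketch; the task (sentiment tagging of financial
# headlines) and the demonstrations are purely illustrative.
few_shot_prompt = """Classify the sentiment of the financial headline as positive, negative, or neutral.

Headline: "Company X beats earnings expectations, shares jump 8%."
Sentiment: positive

Headline: "Regulator opens an investigation into Company Y's accounting."
Sentiment: negative

Headline: "Company Z announces its quarterly results next week."
Sentiment:"""
```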
#2. 𝐒𝐞𝐥𝐟-𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲 𝐬𝐚𝐦𝐩𝐥𝐢𝐧𝐠
Sample multiple outputs with "temperature > 0" and select the best one out of these candidates.
How to pick the best candidate?
It will vary from task to task, but here are 2 primary scenarios ↓
1. Some tasks are easy to validate, such as programming questions. In this case, you can write unit tests to verify the correctness of the generated code.
2. For more complicated tasks, you can manually inspect them or use another LLM (or another specialized model) to rank them.
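Here is a minimal sketch of the idea; call_llm and validate are hypothetical helpers standing in for your own model call and your task-specific check (e.g., a unit test that runs the generated code):

```python
# Minimal self-consistency sampling sketch. `call_llm` and `validate` are
# hypothetical helpers: `call_llm` returns one completion from your model at
# the given temperature, and `validate` is a task-specific check.
from typing import Callable

def call_llm(prompt: str, temperature: float) -> str:
    ...  # hypothetical: wrap your LLM / API of choice here

def self_consistent_answer(
    prompt: str,
    validate: Callable[[str], bool],
    n_samples: int = 5,
) -> str:
    # 1. Sample multiple candidates with temperature > 0.
    candidates = [call_llm(prompt, temperature=0.7) for _ in range(n_samples)]
    # 2. Keep the first candidate that passes the task-specific check;
    #    fall back to the first sample if none do.
    for candidate in candidates:
        if validate(candidate):
            return candidate
    return candidates[0]
```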
#3. 𝐂𝐡𝐚𝐢𝐧-𝐨𝐟-𝐓𝐡𝐨𝐮𝐠𝐡𝐭 (𝐂𝐨𝐓)
You want to force the LLM to explain its thought process step by step before giving the final answer.
This will help the LLM reason through complex tasks better.
You want to use CoT for complicated reasoning tasks + large models (e.g., with more than 50B parameters). Simple tasks only benefit slightly from CoT prompting.
Here are a few methods to achieve CoT:
- provide a list of bullet points with all the steps you expect the LLM to take
- use "Few shot prompt" to teach the LLM to think in steps
... or my favorite: use sentences such as "Let's think step by step."
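For example, a minimal CoT-style prompt (the question is purely illustrative) could be:

```python
# Minimal chain-of-thought prompt sketch; the question is illustrative.
question = (
    "If a portfolio grows 10% in year 1 and loses 10% in year 2, "
    "is it back to its initial value?"
)

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, then give the final answer on the last line."
)
```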
#4. 𝐀𝐮𝐠𝐦𝐞𝐧𝐭𝐞𝐝 𝐏𝐫𝐨𝐦𝐩𝐭𝐬
The LLM's internal knowledge is limited to the data it was trained on, and it often misses specific details from older training datasets.
That is why it is beneficial to use the LLM as a reasoning engine that parses and extracts information from a reliable source of information given as context in the prompt.
The most common use case is Retrieval-Augmented Generation (RAG).
𝘞𝘩𝘺?
- you avoid constantly retraining the model on new data
- you reduce hallucinations
- you get access to references to the source
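Here is a minimal sketch of how such an augmented prompt could be assembled; retrieve is a hypothetical helper standing in for your vector DB query (e.g., against Qdrant populated with financial news):

```python
# Minimal augmented-prompt (RAG-style) sketch. `retrieve` is a hypothetical
# helper that returns the most relevant text chunks from your vector DB.
def retrieve(query: str, top_k: int = 3) -> list[str]:
    ...  # hypothetical: embed the query and search your vector DB here

def build_rag_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return (
        "Answer the question using only the context below and cite the "
        "passage you relied on. If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```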
#5. 𝐀 𝐬𝐢𝐧𝐠𝐥𝐞 𝐫𝐞𝐬𝐩𝐨𝐧𝐬𝐢𝐛𝐢𝐥𝐢𝐭𝐲 𝐩𝐞𝐫 𝐩𝐫𝐨𝐦𝐩𝐭
Quite self-explanatory. It is similar to the single responsibility principle (SRP) in SWE.
Having only one task per prompt is good practice to avoid confusing the LLM.
If you have more complex tasks, split them into granular ones and merge the results later in a different prompt.
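A minimal sketch of this split-then-merge pattern; call_llm is again a hypothetical helper wrapping your model of choice:

```python
# Minimal split-then-merge sketch; `call_llm` is a hypothetical helper that
# sends one prompt to your LLM and returns the completion.
def call_llm(prompt: str) -> str:
    ...  # hypothetical: wrap your LLM / API of choice here

def analyze_article(article: str) -> str:
    # Prompt 1: a single responsibility, extract the key facts.
    facts = call_llm(f"List the key facts from this article:\n\n{article}")
    # Prompt 2: a single responsibility, assess the overall sentiment.
    sentiment = call_llm(
        f"What is the overall sentiment of this article? Answer in one word:\n\n{article}"
    )
    # Prompt 3: merge the intermediate results into the final answer.
    return call_llm(
        "Write a short investor briefing based on these facts and this sentiment.\n\n"
        f"Facts:\n{facts}\n\nSentiment: {sentiment}"
    )
```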
#6. 𝐁𝐞 𝐚𝐬 𝐞𝐱𝐩𝐥𝐢𝐜𝐢𝐭 𝐚𝐬 𝐩𝐨𝐬𝐬𝐢𝐛𝐥𝐞
The LLM cannot read your mind. To maximize the probability of getting precisely what you want, you can imagine the LLM as a 7-year-old to whom you must explain everything step-by-step to be sure they understood.
𝘕𝘰𝘵𝘦: The level of detail in the prompt is inversely proportional to the size & complexity of the model.
The truth is that prompt engineering is quite intuitive, and we don't have to overthink it too much.
What would you add to this list?
One thing that I do that sets me apart from the crowd
Here is one thing that I do that sets me apart from the crowd:
"𝘐 𝘢𝘮 𝘰𝘬𝘢𝘺 𝘸𝘪𝘵𝘩 𝘣𝘦𝘪𝘯𝘨 𝘵𝘩𝘦 𝘥𝘶𝘮𝘱 𝘰𝘯𝘦 𝘵𝘩𝘢𝘵 𝘢𝘴𝘬𝘴 𝘮𝘢𝘯𝘺 𝘲𝘶𝘦𝘴𝘵𝘪𝘰𝘯𝘴."
𝐇𝐦𝐦... 𝐖𝐡𝐲?
The reality is that even the brightest minds cannot understand everything from the first shot.
It is not necessarily that you cannot understand the concepts.
There are other factors, such as:
- you are tired
- you haven't paid enough attention
- the concept wasn't explained at your level
- the presenter wasn't clear enough, etc.
Also, the truth is that many of us don't understand everything from the first shot when presented with a new concept.
But because of our ego, we are afraid to come out and ask something because we are worried that we will sound stupid.
The joke is on you.
Most people will be grateful you broke the ice and asked to explain the concept again.
𝐖𝐡𝐲?
It will help the team to learn the new concepts better.
It will start a discussion to dig deeper into the subject.
It will piss off or annoy the people you don't like.
It will help other people ask questions next time.
It will open up new perspectives on the problem.
To conclude...
Ignore your ego and what people think of you. Own your curiosity and ask questions when you feel like it.
It is ok not to know everything.
It is better to be stupid for 5 minutes than for your entire life.
Congrats on learning something new today!
Don’t hesitate to share your thoughts - we would love to hear them.
→ Remember, when ML looks encoded - we’ll help you decode it.
See you next Thursday at 9:00 am CET.
Have a fantastic weekend!