4 key decoding strategies for LLMs that you must know
The only 6 prompt engineering techniques you need to know. One thing that I do that sets me apart from the crowd.
Hello everyone,
I hope you enjoyed what Alex R. & Alex V. have prepared for you in their previous articles.
I promised that the 3 of us would dig deeper into more exciting topics about production-ready LLM and CV models.
But this is just the beginning. Stay tuned for more production ML.
This week's topics:
4 key decoding strategies for LLMs that you must know
The only 6 prompt engineering techniques you need to know
One thing that I do that sets me apart from the crowd
Want to build your first LLM project but don't know where to start?
If you want to learn in a structured way to build hands-on LLM systems using good LLMOps principles…
We want to announce that we just released 8 Medium lessons for the Hands-on LLMs course that will put you on the right track ↓
Within the 8 Medium lessons, you will go step-by-step through the theory, system design, and code to learn how to build a:
real-time streaming pipeline (deployed on AWS) that uses Bytewax as the stream engine to listen to financial news, cleans & embeds the documents, and loads them to a vector DB
fine-tuning pipeline (deployed as a serverless continuous training) that fine-tunes an LLM on financial data using QLoRA, monitors the experiments using an experiment tracker and saves the best model to a model registry
inference pipeline built in LangChain (deployed as a serverless RESTful API) that loads the fine-tuned LLM from the model registry and answers financial questions using RAG (leveraging the vector DB populated with financial news)
We will also show you how to integrate various serverless tools, such as:
• Comet ML as your ML Platform;
• Qdrant as your vector DB;
• Beam as your infrastructure.
Who is this for?
The series targets MLEs, DEs, DSs, and SWEs who want to learn to engineer LLM systems using good LLMOps principles.
How will you learn?
The series contains 4 hands-on video lessons and the open-source code you can access on GitHub.
Curious? ↓
Check out the 8 Medium lessons of the Hands-on LLMs course and start building your own LLMs system:
→ The Hands-on LLMs Medium Series
4 key decoding strategies for LLMs that you must know
You see, LLMs don't just spit out text.
They calculate "logits", which are mapped to probabilities for every possible token in their vocabulary.
The model uses the previous token IDs to predict the next most likely token (the auto-regressive nature of decoder models).
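Here is a tiny sketch of that mapping, using made-up logits over a toy vocabulary (not any model's real values), just to show how logits turn into a next-token distribution:

```python
import torch

# Toy vocabulary and made-up logits, for illustration only.
vocab = ["Paris", "London", "dog", "the"]
logits = torch.tensor([3.2, 1.1, -2.0, 0.5])

# Softmax maps the raw logits to a probability for every token in the vocabulary.
probs = torch.softmax(logits, dim=-1)
for token, p in zip(vocab, probs.tolist()):
    print(f"{token}: {p:.3f}")
```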
The real magic happens in the decoding strategy you pick ↓
- Greedy Search
- Beam Search
- Top-K Sampling
- Nucleus Sampling
.
Greedy Search
It only holds onto the most likely token at each stage. It's fast and efficient, but it is short-sighted.
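As a minimal sketch (assuming the Hugging Face transformers library and a small GPT-2 checkpoint, which are my choices, not the article's), greedy search is what generate() does when sampling and beam search are disabled:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# Greedy search: no sampling, a single beam -> always take the argmax token.
output = model.generate(**inputs, max_new_tokens=20, do_sample=False, num_beams=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```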
Beam Search
This time, you are not looking only at the token with the highest probability; you are considering the N most likely tokens.
This will create a tree-like structure, where each node will have N children.
The procedure repeats until you hit a maximum length or an end-of-sequence token.
Ultimately, you pick the leaf with the biggest score and recursively pick its parent until you hit the root node.
For example, in the graph below, we have "beams = 2" and "length = 3".
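A hedged sketch of the same idea with Hugging Face transformers, using the toy values from the example (beams = 2, length = 3); the model and prompt are arbitrary choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# Beam search: keep the 2 best partial sequences at every step and
# return the highest-scoring finished sequence.
output = model.generate(**inputs, max_new_tokens=3, num_beams=2, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```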
Top-K Sampling
This technique extends the Beam search strategy and adds a dash of randomness to the generation process.
Instead of just picking the most likely token, it selects a token randomly from the top k most likely choices.
Thus, the tokens with the highest probability will appear more often, but other tokens will be generated occasionally to add some randomness ("creativity").
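A minimal top-k sampling sketch under the same assumptions as above (gpt2 and k = 50 are arbitrary illustration choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# Top-K sampling: sample the next token, but only from the 50 most likely candidates.
output = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```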
Nucleus Sampling
In this case, you are not picking a fixed number of the most probable tokens. Instead, you choose a cutoff value p and form a "nucleus" of tokens: the smallest set of the most probable tokens whose cumulative probability exceeds p.
Thus, at every step, the number of tokens included in the "nucleus" varies. This introduces even more diversity and creativity into your output.
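And a nucleus (top-p) sampling sketch under the same assumptions, with p = 0.9 as an arbitrary cutoff:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The stock market today", return_tensors="pt")

# Nucleus sampling: sample only from the smallest set of tokens whose cumulative
# probability exceeds 0.9; top_k=0 disables the default top-k filter.
output = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9, top_k=0)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```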
.
Note: For top-k and nucleus sampling, you can also use the "temperature" hyperparameter to tweak the output probabilities. It typically ranges from 0 to 1. A low temperature (e.g., 0.1) decreases the entropy (randomness), making the generation more stable.
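To see what temperature does, here is a tiny illustration with made-up logits: dividing the logits by the temperature before the softmax makes the distribution more peaked (low temperature) or leaves it unchanged (temperature = 1).

```python
import torch

logits = torch.tensor([3.0, 1.0, 0.2])  # made-up scores for three tokens

for temp in (0.1, 0.7, 1.0):
    probs = torch.softmax(logits / temp, dim=-1)
    print(f"temperature={temp}: {[round(p, 3) for p in probs.tolist()]}")

# At temperature 0.1 almost all the mass sits on the top token (close to greedy);
# at temperature 1.0 the original probabilities are kept.
```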
To summarize...
There are 2 main decoding strategies for LLMs:
- greedy search
- beam search
To add more variability and creativity to beam search, you can use:
- top-k sampling
- nucleus sampling
The only 6 prompt engineering techniques you need to know
The whole field of prompt engineering can be reduced to these 6 techniques I use almost daily when using ChatGPT (or other LLMs).
Here they are ↓
#1. Few shot prompting
Add in your prompt 2 or 3 high-quality demonstrations, each consisting of both input and desired output, on the target task.
The LLM will better understand your intention and what kind of answers you expect based on concrete examples.
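For instance, here is a hedged sketch of a few-shot prompt for a made-up sentiment-classification task (the headlines and labels are invented for illustration):

```python
few_shot_prompt = """Classify the sentiment of the financial news headline as positive, negative, or neutral.

Headline: "Company X beats earnings expectations, shares jump 8%."
Sentiment: positive

Headline: "Regulator opens investigation into Company Y's accounting."
Sentiment: negative

Headline: "Company Z reports quarterly results in line with guidance."
Sentiment:"""

# Send `few_shot_prompt` to your LLM of choice; the two demonstrations show it
# both the expected output format and the kind of answer you want.
```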
#2. Self-consistency sampling
Sample multiple outputs with "temperature > 0" and select the best one out of these candidates.
How to pick the best candidate?
It will vary from task to task, but here are 2 primary scenarios ↓
1. Some tasks are easy to validate, such as programming questions. In this case, you can write unit tests to verify the correctness of the generated code (see the sketch after this list).
2. For more complicated tasks, you can manually inspect them or use another LLM (or another specialized model) to rank them.
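Here is a minimal sketch of scenario 1; call_llm and passes_unit_tests are hypothetical placeholders for your own LLM client and test harness:

```python
def call_llm(prompt: str, temperature: float) -> str:
    """Placeholder for your LLM client (OpenAI, LangChain, a local model, ...)."""
    raise NotImplementedError

def passes_unit_tests(generated_code: str) -> bool:
    """Placeholder: run your test suite against the generated code."""
    raise NotImplementedError

prompt = "Write a Python function that reverses a string."

# Sample several candidates with temperature > 0, then keep the first one
# that passes the tests (fall back to manual inspection if none do).
candidates = [call_llm(prompt, temperature=0.8) for _ in range(5)]
best = next((code for code in candidates if passes_unit_tests(code)), None)
```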
#3. Chain-of-Thought (CoT)
You want to force the LLM to explain its thought process, which eventually leads to the final answer, step by step.
This will help the LLM reason through complex tasks better.
You want to use CoT for complicated reasoning tasks + large models (e.g., with more than 50B parameters). Simple tasks only benefit slightly from CoT prompting.
Here are a few methods to achieve CoT:
- provide a list of bullet points with all the steps you expect the LLM to take
- use "Few shot prompt" to teach the LLM to think in steps
... or my favorite: use sentences such as "Let's think step by step."
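A tiny made-up example of that last method, where the trailing sentence nudges the model to show its reasoning before the final answer:

```python
cot_prompt = """A portfolio holds 120 shares bought at $50 and 80 shares bought at $65.
What is the average purchase price per share?

Let's think step by step."""
```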
#4. Augmented Prompts
The LLM's internal knowledge is limited to the data it was trained on. Also, it often forgets specific details from older training datasets.
That is why it is beneficial to use the LLM as a reasoning engine that parses and extracts information from a reliable source given as context in the prompt (sketched below).
The most common use case is Retrieval-Augmented Generation (RAG).
Why?
- avoid retraining the model on new data
- avoid hallucinations
- get access to references for the answer's source
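A hedged sketch of such an augmented prompt; the two retrieved passages are invented stand-ins for whatever your vector DB returns:

```python
retrieved_context = [
    "News 1: Company X raised its full-year guidance on strong cloud revenue.",
    "News 2: Analysts expect Company X's margins to stay flat next quarter.",
]
context = "\n".join(retrieved_context)

# The model is asked to answer only from the provided context, which reduces
# hallucinations and lets you point back to the source documents.
augmented_prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: What did Company X say about its full-year guidance?
Answer:"""
```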
#5. A single responsibility per prompt
Quite self-explanatory. It is similar to the DRY principle in SWE.
Having only one task per prompt is good practice to avoid confusing the LLM.
If you have more complex tasks, split them into granular ones and merge the results later in a different prompt.
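As a rough sketch, a "summarize and classify" request could be split into two single-responsibility prompts (call_llm is the same hypothetical helper as above):

```python
summary_prompt = "Summarize the following financial article in 3 bullet points:\n\n{article}"
sentiment_prompt = "Classify the sentiment of this summary as positive, negative, or neutral:\n\n{summary}"

# Each prompt does exactly one thing; the output of the first feeds the second.
# summary = call_llm(summary_prompt.format(article=article), temperature=0)
# sentiment = call_llm(sentiment_prompt.format(summary=summary), temperature=0)
```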
#6. Be as explicit as possible
The LLM cannot read your mind. To maximize the probability of getting precisely what you want, imagine the LLM as a 7-year-old to whom you must explain everything step-by-step to be sure they understand.
Note: The level of detail in the prompt is inversely proportional to the size & complexity of the model.
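A made-up before/after pair to illustrate the difference:

```python
vague_prompt = "Summarize this report."

explicit_prompt = (
    "Summarize the attached quarterly report in exactly 5 bullet points. "
    "Focus on revenue, costs, and risks, use plain language a non-expert can follow, "
    "and do not include any figure that is not stated in the report."
)
```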
The truth is that prompt engineering is quite intuitive, and we don't have to overthink it too much.
What would you add to this list?
One thing that I do that sets me apart from the crowd
Here is one thing that I do that sets me apart from the crowd:
"๐ ๐ข๐ฎ ๐ฐ๐ฌ๐ข๐บ ๐ธ๐ช๐ต๐ฉ ๐ฃ๐ฆ๐ช๐ฏ๐จ ๐ต๐ฉ๐ฆ ๐ฅ๐ถ๐ฎ๐ฑ ๐ฐ๐ฏ๐ฆ ๐ต๐ฉ๐ข๐ต ๐ข๐ด๐ฌ๐ด ๐ฎ๐ข๐ฏ๐บ ๐ฒ๐ถ๐ฆ๐ด๐ต๐ช๐ฐ๐ฏ๐ด."
See... Why?
The reality is that even the brightest minds cannot understand everything from the first shot.
It is not necessarily that you cannot understand the concepts.
There are other factors, such as:
- you are tired
- you haven't paid enough attention
- the concept wasn't explained at your level
- the presenter wasn't clear enough, etc.
Also, the truth is that many of us don't understand everything from the first shot when presented with a new concept.
But because of our ego, we are afraid to speak up and ask, worried that we will sound stupid.
The joke is on you.
Most people will be grateful you broke the ice and asked to explain the concept again.
Why?
It will help the team to learn the new concepts better.
It will start a discussion to dig deeper into the subject.
It will piss off or annoy the people you don't like.
It will help other people ask questions next time.
It will open up new perspectives on the problem.
To conclude...
Ignore your ego and what people think of you. Own your curiosity and ask questions when you feel like it.
It is ok not to know everything.
It is better to be stupid for 5 minutes than your entire life.
Congrats on learning something new today!
Don't hesitate to share your thoughts - we would love to hear them.
Remember, when ML looks encoded, we'll help you decode it.
See you next Thursday at 9:00 am CET.
Have a fantastic weekend!