DML: How do you generate a Q&A dataset in <30 minutes to fine-tune your LLMs?
Lesson 7 | The Hands-on LLMs Series
Hello there, I am Paul Iusztin 👋
Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥
Table of Contents:
Real-time feature pipeline video lesson
How do you generate a synthetic domain-specific Q&A dataset in <30 minutes to fine-tune your open-source LLM?
My personal list of filtered resources about LLMs & vector DBs
Previous Lessons:
Lesson 4: How to implement a streaming pipeline to populate a vector DB for real-time RAG?
Lesson 6: What do you need to fine-tune an open-source LLM to create your financial advisor?
↳ Check out the Hands-on LLMs course and support it with a ⭐.
#1. Real-time feature pipeline video lesson
I know we are currently talking about the training pipeline and Q&A dataset generation, but sometimes mixing topics helps you remember them and make new connections.
…or maybe that is only an excuse to share the video lesson about the feature pipeline that wasn't ready when I started this series.
It will teach you how to ingest financial news in real time from Alpaca, clean & embed the documents, and load them in a vector DB.
Here is an overview of the video ↓
1. Step-by-step instructions on how to set up the streaming pipeline code & a Qdrant vector DB serverless cluster
2. Why we used Bytewax to build the streaming pipeline
3. How we used Bytewax to ingest financial news in real time over a WebSocket, clean the documents, chunk them, embed them, and load them into the Qdrant vector DB
4. How we adapted the Bytewax streaming pipeline to also work in batch mode to populate the vector DB with historical data
5. How to run the code
6. How to deploy the code to AWS
Here it is ↓ Enjoy 🔥
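The clean → chunk → embed → load steps from the video can be sketched in plain Python. This is a minimal sketch, not the course's actual Bytewax code: `clean` and `chunk` are simplified illustrations, and the embedding/Qdrant calls in the comment are hypothetical placeholders.

```python
import re

def clean(text: str) -> str:
    """Strip HTML remnants and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags left over from scraping
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split cleaned text into overlapping character windows."""
    chunks, step = [], size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks

# Each chunk would then be embedded and upserted into the vector DB;
# the calls below are hypothetical placeholders, not real client code:
#   vectors = embedding_model.encode(chunks)
#   qdrant_client.upsert(collection_name="financial_news", points=...)

doc = "<p>Fed holds   rates steady at 5.5%.</p>" * 40
pieces = chunk(clean(doc))
print(f"{len(pieces)} chunks, max length {max(len(p) for p in pieces)}")
```

In the real pipeline, Bytewax wires these same per-document functions together as dataflow steps, which is what lets the identical code run in streaming and batch mode.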
#2. How do you generate a synthetic domain-specific Q&A dataset in <30 minutes to fine-tune your open-source LLM?
This method is also known as fine-tuning with distillation. Here are its 3 main steps ↓
For example, let's generate a Q&A fine-tuning dataset used to fine-tune a financial advisor LLM.
Step 1: Manually generate a few input examples
Generate a few input samples (~3) that have the following structure:
- user_context: describe the type of investor (e.g., "I am a 28-year-old marketing professional")
- question: describe the user's intention (e.g., "Is Bitcoin a good investment option?")
Step 2: Expand the input examples with the help of a teacher LLM
Use a powerful LLM as a teacher (e.g., GPT-4, Falcon 180B, etc.) to generate N similar input examples.
We generated 100 input examples in our use case, but you can generate more.
You will use the manually filled input examples to do few-shot prompting.
This will guide the LLM to give you domain-specific samples.
The prompt will look like this:
"""
...
Generate 100 more examples with the following pattern:
# USER CONTEXT 1
...
# QUESTION 1
...
# USER CONTEXT 2
...
"""
Step 3: Use the teacher LLM to generate outputs for all the input examples
Now, you will use the same powerful LLM as the teacher, but this time, it will answer all your N input examples.
But first, to introduce more variance, we will use RAG to enrich the input examples with news context.
Afterward, we will use the teacher LLM to answer all N input examples.
...and bam! You generated a domain-specific Q&A dataset with almost 0 manual work.
Now, you will use this data to train a smaller LLM (e.g., Falcon 7B) on a niched task, such as financial advising.
This technique is known as fine-tuning with distillation because you use a powerful LLM as the teacher (e.g., GPT-4, Falcon 180B) to generate the data, which is then used to fine-tune a smaller LLM (e.g., Falcon 7B) that acts as the student.
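Putting Step 3 together, the loop below sketches how each input example gets enriched with retrieved news and answered by the teacher. `retrieve_news` and `ask_teacher` are hypothetical stand-ins for your vector-DB query and teacher-LLM client, and the prompt layout is an assumption:

```python
def build_answer_prompt(example: dict, news_context: str) -> str:
    """Combine user context, retrieved news, and the question into one prompt."""
    return (
        f"USER CONTEXT: {example['user_context']}\n"
        f"NEWS CONTEXT: {news_context}\n"
        f"QUESTION: {example['question']}\n"
        "ANSWER:"
    )

def generate_qa_dataset(examples, retrieve_news, ask_teacher):
    """Return prompt/answer pairs ready for fine-tuning the student LLM."""
    dataset = []
    for ex in examples:
        news = retrieve_news(ex["question"])  # RAG step: query the vector DB
        prompt = build_answer_prompt(ex, news)
        answer = ask_teacher(prompt)          # teacher LLM (e.g., GPT-4)
        dataset.append({"prompt": prompt, "answer": answer})
    return dataset

# Toy stand-ins so the sketch runs end to end:
examples = [{"user_context": "28-year-old marketer", "question": "Is Bitcoin a good buy?"}]
dataset = generate_qa_dataset(
    examples,
    retrieve_news=lambda q: "Bitcoin rallied 5% this week.",
    ask_teacher=lambda p: "It depends on your risk tolerance...",
)
print(dataset[0]["prompt"])
```

The resulting prompt/answer pairs are exactly the supervised examples you feed to the student model during fine-tuning.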
⚠️ Note: To ensure that the generated data is of high quality, you can hire a domain expert to check & refine it.
↳ To learn more about this technique, check out Pau Labarta's article, "How to generate a Q&A dataset in less than 30 minutes".
#3. My personal list of filtered resources about LLMs & vector DBs
The internet is full of learning resources about LLMs & vector DBs. But most of it is trash.
After 6 months of researching LLMs & vector DBs, here is a list of filtered resources that I personally use ↓
Blogs:
- philschmid
- Chip Huyen
- eugeneyan
- LLM Learning Lab
- Lil'Log
- VectorHub by SuperLinked
- Qdrant Blog
Articles:
- Patterns for Building LLM-based Systems & Products
- RLHF: Reinforcement Learning from Human Feedback
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- Understanding Encoder And Decoder LLMs
- Building LLM applications for production
- Prompt Engineering
- Transformers
- Bidirectional Encoder Representations from Transformers (BERT)
- Multimodality and Large Multimodal Models (LMMs) by Chip Huyen
Videos:
- Word Embedding and Word2Vec, Clearly Explained!!!
- Let's build GPT: from scratch, in code, spelled out
- Transformer Neural Networks, ChatGPT's foundation, Clearly Explained!!!
- Large Language Models with Semantic Search
- Decoder-Only Transformers, ChatGPTs specific Transformer, Clearly Explained!!!
Code Repositories:
- OpenAI Cookbook
- generative-ai-for-beginners
Courses:
- LangChain for LLM Application Development
- Building Systems with the ChatGPT API
- ChatGPT Prompt Engineering for Developers
...and hopefully, my Hands-on LLMs course will soon appear among them.
Let me know what you think of this list and have fun learning 🔥
That's it for today 👾
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
…and see you next week for Lesson 8 of the Hands-on LLMs series 🔥
Paul
Whenever youโre ready, here is how I can help you:
The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: in-depth topics about designing and productionizing ML systems using MLOps.
Machine Learning & MLOps Hub: a place where all my work is aggregated in one place (courses, articles, webinars, podcasts, etc.).