Using this Python package, you can x10 your text preprocessing pipelines
End-to-end framework for production-ready LLMs. Top 6 ML platform features you must know and use in your ML system.
Decoding ML Notes
This week's topics:
Top 6 ML platform features you must know and use in your ML system
Using this Python package, you can x10 your text preprocessing pipelines
End-to-end framework for production-ready LLMs
Top 6 ML platform features you must know and use in your ML system
Here they are ↓
#1. Experiment Tracking
In your ML development phase, you generate lots of experiments.
Tracking and comparing the metrics between them is crucial in finding the optimal model.
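To make the idea concrete, here is a minimal sketch of what an experiment tracker does under the hood: it stores the metrics of each run and lets you compare them. The `Experiment` class, the `best_experiment` helper and the metric values are hypothetical placeholders for illustration, not the API of any particular ML platform.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One training run: its name, hyperparameters, and logged metrics."""

    name: str
    params: dict
    metrics: dict = field(default_factory=dict)

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


def best_experiment(experiments, metric, higher_is_better=True):
    """Return the run with the best value for `metric`, skipping runs that never logged it."""
    scored = [e for e in experiments if metric in e.metrics]
    if not scored:
        raise ValueError(f"no experiment logged metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda e: e.metrics[metric])


# Placeholder runs and metric values, for illustration only.
runs = [
    Experiment("baseline", {"lr": 1e-3}),
    Experiment("low-lr", {"lr": 1e-4}),
]
runs[0].log_metric("val_accuracy", 0.81)
runs[1].log_metric("val_accuracy", 0.86)

winner = best_experiment(runs, "val_accuracy")
print(winner.name, winner.params)
```

A real experiment tracker adds persistence, dashboards and collaboration on top, but the core loop stays the same: log, then compare.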
#2. Metadata Store
Its primary purpose is reproducibility.
To know how a model was generated, you need to know:
- the version of the code
- the version of the packages
- hyperparameters/config
- total compute
- version of the dataset
... and more
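As a rough sketch, most of that metadata can be collected with the Python standard library alone. The `collect_run_metadata` helper, its field names and the example config are assumptions made for illustration. Total compute is left out because measuring it depends on your infrastructure.

```python
import json
import platform
import subprocess
from importlib import metadata


def git_commit() -> str:
    """Return the current commit hash, or "unknown" when git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def package_versions(names):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def collect_run_metadata(config, dataset_version, packages):
    """Bundle everything needed to explain how a model was generated."""
    return {
        "code_version": git_commit(),
        "python_version": platform.python_version(),
        "package_versions": package_versions(packages),
        "config": config,
        "dataset_version": dataset_version,
    }


# Hypothetical config and dataset version, for illustration only.
run_metadata = collect_run_metadata(
    config={"lr": 1e-4, "epochs": 3},
    dataset_version="features:3.1.2",
    packages=["pip"],
)
print(json.dumps(run_metadata, indent=2))
```

In practice, you would send this dictionary to your ML platform's metadata store instead of printing it.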
#3. Visualizations
Most of the time, along with the metrics, you must log a set of visualizations for your experiment, such as:
- images
- videos
- prompts
- t-SNE graphs
- 3D point clouds
... and more
#4. Reports
You don't work in a vacuum.
You have to present your work to other colleagues or clients.
A report lets you take the metadata and visualizations from your experiment...
...and create, deliver and share a targeted presentation for your clients or peers.
#5. Artifacts
The most powerful feature out of them all.
An artifact is a versioned object that is an input or output for your task.
Everything can be an artifact, but the most common cases are:
- data
- model
- code
Wrapping your assets in an artifact ensures reproducibility.
For example, you wrap your features into an artifact (e.g., features:3.1.2), which you can consume in your ML development step.
The ML development step will generate config (e.g., config:1.2.4) and code (e.g., code:1.0.2) artifacts used in the continuous training pipeline.
Doing so lets you quickly answer questions such as "What did I use to generate the model?" and "Which version?"
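Here is a minimal sketch of why versioned artifacts make those questions easy to answer. The `Artifact` class and `lineage` helper are hypothetical, and the way the example artifacts are wired together is an assumption based on the versions mentioned above, not a real platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A versioned object that is an input or output of a task."""

    name: str
    version: str
    inputs: tuple = ()  # the artifacts this one was produced from

    @property
    def ref(self) -> str:
        return f"{self.name}:{self.version}"


def lineage(artifact):
    """List the refs of every upstream artifact, depth-first, without duplicates."""
    seen = []

    def walk(current):
        for parent in current.inputs:
            if parent.ref not in seen:
                seen.append(parent.ref)
                walk(parent)

    walk(artifact)
    return seen


# Assumed wiring: features feed the ML development step, which emits config and code.
features = Artifact("features", "3.1.2")
config = Artifact("config", "1.2.4", inputs=(features,))
code = Artifact("code", "1.0.2", inputs=(features,))
model = Artifact("model", "1.2.4", inputs=(features, config, code))

print(lineage(model))
```

Because every artifact records the exact versions it was built from, tracing a model back to its data, config and code becomes a simple graph walk.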
#6. Model Registry
The model registry is the ultimate way to make your model accessible to your production ecosystem.
For example, in your continuous training pipeline, after the model is trained, you load the weights as an artifact into the model registry (e.g., model:1.2.4).
You label this model as "staging" under a new version and prepare it for testing. If the tests pass, mark it as "production" under a new version and prepare it for deployment (e.g., model:2.1.5).
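The staging-to-production flow can be sketched with a small in-memory registry. Everything here is a simplified assumption for illustration (the `ModelRegistry` class, its method names and the `weights/model.bin` path are hypothetical). A real model registry persists this state and stores the weights for you.

```python
class ModelRegistry:
    """In-memory stand-in for a model registry that tracks versions and stages."""

    def __init__(self):
        # version -> {"weights_uri": ..., "stage": ...}, kept in registration order
        self._versions = {}

    def register(self, version, weights_uri, stage="staging"):
        if version in self._versions:
            raise ValueError(f"version {version} is already registered")
        self._versions[version] = {"weights_uri": weights_uri, "stage": stage}

    def promote(self, version, production_version, tests_passed):
        """Register the staged model under a new production version if the tests passed."""
        if not tests_passed:
            return None
        entry = self._versions[version]
        if entry["stage"] != "staging":
            raise ValueError(f"version {version} is not in staging")
        self.register(production_version, entry["weights_uri"], stage="production")
        return production_version

    def latest(self, stage):
        """Return the most recently registered version in `stage`, or None."""
        matches = [v for v, e in self._versions.items() if e["stage"] == stage]
        return matches[-1] if matches else None


registry = ModelRegistry()
registry.register("1.2.4", "weights/model.bin")  # hypothetical weights location
promoted = registry.promote("1.2.4", "2.1.5", tests_passed=True)
print(registry.latest("production"))
```

Your deployment step then only needs to ask the registry for the latest "production" model, without knowing anything about how it was trained.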
All of these features are used in a mature ML system. What is your favorite one?
Using this Python package, you can x10 your text preprocessing pipelines
Any text preprocessing pipeline has to clean, partition, extract, or chunk text data to feed it into your LLMs.
Unstructured offers a rich and clean API that allows you to quickly:
- partition your data into smaller segments from various data sources (e.g., HTML, CSV, PDFs, even images, etc.)
- clean the text of anomalies (e.g., non-ASCII characters) and irrelevant information (e.g., white spaces, bullets, etc.), and fill in missing values
- extract information from pieces of text (e.g., datetimes, addresses, IP addresses, etc.)
- chunk your text segments into pieces of text that can be inserted into your embedding model
- embed data (e.g., wrapper over OpenAIEmbeddingEncoder, HuggingFaceEmbeddingEncoders, etc.)
- stage your data to be fed into various tools (e.g., Label Studio, Labelbox, etc.)
All these steps are essential for:
- feeding your data into your LLMs
- embedding the data and ingesting it into a vector DB
- doing RAG
- labeling
- recommender systems
... basically for any LLM or multimodal application
Implementing all these steps from scratch will take a lot of time.
I know some Python packages already do this, but the functionality is scattered across multiple packages.
Unstructured packages everything together under a nice, clean API.
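To give a feel for what the clean, extract and chunk steps involve, here is a from-scratch sketch that uses only the Python standard library. These helper functions are hypothetical stand-ins written for illustration. They are not the Unstructured API, which ships far more complete versions of each step.

```python
import re

# Leading bullet markers ("-", "*", or the Unicode bullet) at the start of a line.
BULLET_RE = re.compile(r"^\s*[\-\*\u2022]\s+", flags=re.MULTILINE)
# Naive IPv4 pattern: it does not validate that each octet is <= 255.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def clean(text: str) -> str:
    """Drop leading bullets and non-ASCII characters, then collapse whitespace."""
    text = BULLET_RE.sub("", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()


def extract_ip_addresses(text: str) -> list:
    """Return every IPv4-looking token found in the text."""
    return IPV4_RE.findall(text)


def chunk(text: str, max_chars: int) -> list:
    """Greedily pack whole words into chunks of up to `max_chars` characters.

    A single word longer than the limit becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


raw = "\u2022 Server 10.0.0.1 restarted\n- Server  10.0.0.2   is \u201chealthy\u201d\n"
cleaned = clean(raw)
ips = extract_ip_addresses(cleaned)
pieces = chunk(cleaned, max_chars=30)
print(cleaned, ips, pieces, sep="\n")
```

Even this toy version needs care around edge cases, which is why a single well-tested package for all of these steps saves so much time.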
End-to-end framework for production-ready LLMs
Want to learn to build production LLMs in a structured way? For FREE? Then you should take our NEW course on how to implement an end-to-end framework for production-ready LLM systems ↓
Decoding ML and I are starting a new FREE course on learning how to architect and build a real-world LLM system by building an LLM Twin:
→ from start to finish
→ from data collection to deployment
→ production-ready
→ from NO MLOps to experiment trackers, model registries, prompt monitoring, and versioning
The course is called: LLM Twin: Building Your Production-Ready AI Replica
...and here is what you will learn to build
↓↓↓
4 Python microservices:
→ The data collection pipeline
- Crawl your digital data from various social media platforms.
- Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines.
- Send database changes to a queue using the CDC pattern.
☁ Deployed on AWS.
→ The feature pipeline
- Consume messages from a queue through a Bytewax streaming pipeline.
- Every message will be cleaned, chunked, embedded and loaded into a Qdrant vector DB in real-time.
☁ Deployed on AWS.
→ The training pipeline
- Create a custom dataset based on your digital data.
- Fine-tune an LLM using QLoRA.
- Use Comet ML's experiment tracker to monitor the experiments.
- Evaluate and save the best model to Comet's model registry.
☁ Deployed on Qwak.
→ The inference pipeline
- Load and quantize the fine-tuned LLM from Comet's model registry.
- Deploy it as a REST API
- Enhance the prompts using RAG
- Generate content using your LLM twin
- Monitor the LLM using Comet's prompt monitoring dashboard
☁ Deployed on Qwak.
Across the 4 microservices, you will learn to integrate 3 serverless tools:
- Comet as your ML Platform
- Qdrant as your vector DB
- Qwak as your ML infrastructure
To stay updated on the LLM Twin: Building Your Production-Ready AI Replica course...
Check it out on GitHub and support us with a ⭐️
↓↓↓
LLM Twin: Building Your Production-Ready AI Replica
Images
If not otherwise stated, all images are created by the author.