Using this Python package, you can x10 your text preprocessing pipelines

End-to-end framework for production-ready LLMs. Top 6 ML platform features you must know and use in your ML system.

May 11, 2024

Decoding ML Notes

This week’s topics:

Top 6 ML platform features you must know and use in your ML system.
Using this Python package, you can x10 your text preprocessing pipelines
End-to-end framework for production-ready LLMs

Top 6 ML platform features you must know and use in your ML system

Here they are ↓

#𝟭. 𝗘𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁 𝗧𝗿𝗮𝗰𝗸𝗶𝗻𝗴

In your ML development phase, you generate lots of experiments.

Tracking and comparing the metrics between them is crucial in finding the optimal model.
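
Here is a minimal, tool-agnostic sketch of what "tracking and comparing" boils down to. The `Experiment` class, the run names, and the metric values are all made up for illustration; a real experiment tracker stores and visualizes this for you.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One training run: its hyperparameters and the metrics it logged."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value


def best_run(runs: list, metric: str, higher_is_better: bool = True) -> Experiment:
    """Return the run with the best value for `metric`, skipping runs that never logged it."""
    candidates = [run for run in runs if metric in run.metrics]
    if not candidates:
        raise ValueError(f"no run logged {metric!r}")
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda run: run.metrics[metric])


# Illustrative values only.
runs = [
    Experiment("baseline", {"lr": 1e-3}),
    Experiment("low-lr", {"lr": 1e-4}),
]
runs[0].log_metric("val_f1", 0.81)
runs[1].log_metric("val_f1", 0.86)

winner = best_run(runs, "val_f1")
print(winner.name, winner.params)
```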

#𝟮. 𝗠𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗦𝘁𝗼𝗿𝗲

Its primary purpose is reproducibility.

To know how a model was generated, you need to know:
- the version of the code
- the version of the packages
- hyperparameters/config
- total compute
- version of the dataset
... and more
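
As a rough sketch, most of that list can be captured with the standard library alone. The function names below are hypothetical, the content hash is only a stand-in for a proper dataset version, and total compute is left out because it depends on your infrastructure.

```python
import hashlib
import json
import platform
import subprocess
from importlib import metadata


def git_commit() -> str:
    """Current commit hash, or 'unknown' when git or a repo is not available."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def dataset_fingerprint(data: bytes) -> str:
    """Short content hash, used here as a stand-in for a dataset version."""
    return hashlib.sha256(data).hexdigest()[:12]


def package_versions(names: list) -> dict:
    """Installed version of each package, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def collect_run_metadata(config: dict, dataset: bytes, packages: list) -> dict:
    """Bundle everything needed to explain how a model was produced."""
    return {
        "code_version": git_commit(),
        "python": platform.python_version(),
        "packages": package_versions(packages),
        "config": config,
        "dataset_version": dataset_fingerprint(dataset),
    }


run_metadata = collect_run_metadata(
    config={"lr": 1e-4, "epochs": 3},   # illustrative hyperparameters
    dataset=b"id,text\n1,hello\n",      # illustrative dataset bytes
    packages=["pip"],
)
print(json.dumps(run_metadata, indent=2))
```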

#𝟯. 𝗩𝗶𝘀𝘂𝗮𝗹𝗶𝘀𝗮𝘁𝗶𝗼𝗻𝘀

Most of the time, along with the metrics, you must log a set of visualizations for your experiment.

Such as:
- images
- videos
- prompts
- t-SNE graphs
- 3D point clouds
... and more

#𝟰. 𝗥𝗲𝗽𝗼𝗿𝘁𝘀

You don't work in a vacuum.

You have to present your work to other colleagues or clients.

A report lets you take the metadata and visualizations from your experiment...

...and create, deliver and share a targeted presentation for your clients or peers.

#𝟱. 𝗔𝗿𝘁𝗶𝗳𝗮𝗰𝘁𝘀

The most powerful feature out of them all.

An artifact is a versioned object that is an input or output for your task.

Everything can be an artifact, but the most common cases are:
- data
- model
- code

Wrapping your assets in an artifact ensures reproducibility.

For example, you wrap your features into an artifact (e.g., features:3.1.2), which you can consume in your ML development step.

The ML development step will generate config (e.g., config:1.2.4) and code (e.g., code:1.0.2) artifacts used in the continuous training pipeline.

Doing so lets you quickly answer questions such as "What did I use to generate the model?" and "Which version?"
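
To make the lineage idea concrete, here is a small sketch built on the example versions from the text. The `Artifact` class and the `lineage` helper are hypothetical, and so is the wiring between the artifacts (config and code derived from the features, the model built from config and code); your ML platform tracks these links for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A versioned object plus the artifacts it was produced from."""
    name: str
    version: str
    inputs: tuple = ()

    @property
    def ref(self) -> str:
        return f"{self.name}:{self.version}"


def lineage(artifact: Artifact) -> list:
    """List every upstream artifact reference, depth-first, without duplicates."""
    seen = []
    for parent in artifact.inputs:
        for ref in [parent.ref, *lineage(parent)]:
            if ref not in seen:
                seen.append(ref)
    return seen


# Assumed wiring, for illustration only.
features = Artifact("features", "3.1.2")
config = Artifact("config", "1.2.4", inputs=(features,))
code = Artifact("code", "1.0.2", inputs=(features,))
model = Artifact("model", "1.2.4", inputs=(config, code))

print(model.ref, "was built from", lineage(model))
```

Because every artifact carries its inputs, the answer to "what did I use?" is a walk up the graph instead of a guess.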

#𝟲. 𝗠𝗼𝗱𝗲𝗹 𝗥𝗲𝗴𝗶𝘀𝘁𝗿𝘆

The model registry is the ultimate way to make your model accessible to your production ecosystem.

For example, in your continuous training pipeline, after the model is trained, you load the weights as an artifact into the model registry (e.g., model:1.2.4).

You label this model as "staging" under a new version and prepare it for testing. If the tests pass, mark it as "production" under a new version and prepare it for deployment (e.g., model:2.1.5).
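
The staging-to-production flow can be sketched with a tiny in-memory registry. Everything here is hypothetical (the `ModelRegistry` class, the stage names beyond "staging" and "production", the weights URI); it only mirrors the steps described above and is not the API of any real model registry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelVersion:
    version: str
    weights_uri: str
    stage: str


@dataclass
class ModelRegistry:
    name: str
    versions: dict = field(default_factory=dict)

    def register(self, version: str, weights_uri: str, stage: str = "staging") -> ModelVersion:
        """Add a new, immutable version entry under the given stage."""
        if version in self.versions:
            raise ValueError(f"{self.name}:{version} already exists")
        entry = ModelVersion(version, weights_uri, stage)
        self.versions[version] = entry
        return entry

    def promote(self, staging_version: str, production_version: str) -> ModelVersion:
        """Re-register a tested staging model under a new version tagged 'production'."""
        staged = self.versions[staging_version]
        if staged.stage != "staging":
            raise ValueError(f"{self.name}:{staging_version} is not in staging")
        # Only one production model at a time: retire the previous one.
        for entry in self.versions.values():
            if entry.stage == "production":
                entry.stage = "archived"
        return self.register(production_version, staged.weights_uri, stage="production")

    def latest(self, stage: str) -> Optional[ModelVersion]:
        """Most recently registered version in `stage`, if any."""
        matches = [v for v in self.versions.values() if v.stage == stage]
        return matches[-1] if matches else None


registry = ModelRegistry("model")
registry.register("1.2.4", "s3://hypothetical-bucket/model-1.2.4.bin")

tests_passed = True  # stand-in for your real test suite
if tests_passed:
    registry.promote("1.2.4", "2.1.5")

print(registry.latest("production"))
```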

All of these features are used in a mature ML system. What is your favorite one?

Using this Python package, you can x10 your text preprocessing pipelines

Any text preprocessing pipeline has to clean, partition, extract, or chunk text data to feed it into your LLMs.

𝘂𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 offers a 𝗿𝗶𝗰𝗵 and 𝗰𝗹𝗲𝗮𝗻 𝗔𝗣𝗜 that allows you to quickly:

- 𝘱𝘢𝘳𝘵𝘪𝘵𝘪𝘰𝘯 your data into smaller segments from various data sources (e.g., HTML, CSV, PDFs, and even images)
- 𝘤𝘭𝘦𝘢𝘯 the text of anomalies (e.g., non-ASCII characters) and irrelevant information (e.g., extra whitespace, bullets), and fill in missing values
- 𝘦𝘹𝘵𝘳𝘢𝘤𝘵 information from pieces of text (e.g., datetimes, addresses, IP addresses)
- 𝘤𝘩𝘶𝘯𝘬 your text segments into pieces of text that can be inserted into your embedding model
- 𝘦𝘮𝘣𝘦𝘥 data (e.g., wrappers such as OpenAIEmbeddingEncoder and HuggingFaceEmbeddingEncoder)
- 𝘴𝘵𝘢𝘨𝘦 your data to be fed into various tools (e.g., Label Studio, Labelbox)
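
To show what the clean, extract, and chunk steps actually do to a piece of text, here is a plain-Python sketch. It deliberately does not use the package's API: `clean_text`, `extract_ip_addresses`, and `chunk_text` are hypothetical stand-ins, and the package's own functions cover far more edge cases than these few lines.

```python
import re


def clean_text(text: str) -> str:
    """Drop non-ASCII characters, leading '-' or '*' bullets, and repeated whitespace."""
    text = text.encode("ascii", errors="ignore").decode("ascii")
    lines = [re.sub(r"^\s*[-*]\s+", "", line) for line in text.splitlines()]
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


def extract_ip_addresses(text: str) -> list:
    """Find IPv4-looking tokens (no range validation; this is only a sketch)."""
    return re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)


def chunk_text(text: str, max_chars: int) -> list:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than split.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Illustrative input with a Unicode bullet, extra spaces, and a stray symbol.
raw = "\u25cf Server 10.0.0.7   restarted at 09:15.\n- Backup host is 192.168.1.20\u00ae."

cleaned = clean_text(raw)
chunks = chunk_text(cleaned, max_chars=40)

print(cleaned)
print(extract_ip_addresses(cleaned))
print(chunks)
```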

𝗔𝗹𝗹 𝘁𝗵𝗲𝘀𝗲 𝘀𝘁𝗲𝗽𝘀 𝗮𝗿𝗲 𝗲𝘀𝘀𝗲𝗻𝘁𝗶𝗮𝗹 𝗳𝗼𝗿:

- feeding your data into your LLMs
- embedding the data and ingesting it into a vector DB
- doing RAG
- labeling
- recommender systems

... basically for any LLM or multimodal application

.

Implementing all these steps from scratch will take a lot of time.

I know some Python packages already do this, but the functionality is scattered across multiple packages.

𝘂𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 packages everything together under a nice, clean API.

End-to-end framework for production-ready LLMs

Want to 𝗹𝗲𝗮𝗿𝗻 to 𝗯𝘂𝗶𝗹𝗱 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗟𝗟𝗠𝘀 in a 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝘄𝗮𝘆? For 𝗙𝗥𝗘𝗘? Then 𝘆𝗼𝘂 𝘀𝗵𝗼𝘂𝗹𝗱 𝘁𝗮𝗸𝗲 our 𝗡𝗘𝗪 𝗰𝗼𝘂𝗿𝘀𝗲 on how to 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁 an 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝗳𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸 for 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻-𝗿𝗲𝗮𝗱𝘆 𝗟𝗟𝗠 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 ↓

🧠 Decoding ML and I are 𝘀𝘁𝗮𝗿𝘁𝗶𝗻𝗴 a 𝗻𝗲𝘄 𝗙𝗥𝗘𝗘 𝗰𝗼𝘂𝗿𝘀𝗲 on 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴 how to 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁 and 𝗯𝘂𝗶𝗹𝗱 a 𝗿𝗲𝗮𝗹-𝘄𝗼𝗿𝗹𝗱 𝗟𝗟𝗠 𝘀𝘆𝘀𝘁𝗲𝗺 by 𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 an 𝗟𝗟𝗠 𝗧𝘄𝗶𝗻:

→ from start to finish
→ from data collection to deployment
→ production-ready
→ from NO MLOps to experiment trackers, model registries, prompt monitoring, and versioning


The course is called: 𝗟𝗟𝗠 𝗧𝘄𝗶𝗻: 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻-𝗥𝗲𝗮𝗱𝘆 𝗔𝗜 𝗥𝗲𝗽𝗹𝗶𝗰𝗮

...and here is what you will learn to build

↓↓↓

🐍 4 𝘗𝘺𝘵𝘩𝘰𝘯 𝘮𝘪𝘤𝘳𝘰𝘴𝘦𝘳𝘷𝘪𝘤𝘦𝘴:

→ 𝗧𝗵𝗲 𝗱𝗮𝘁𝗮 𝗰𝗼𝗹𝗹𝗲𝗰𝘁𝗶𝗼𝗻 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲

- Crawl your digital data from various social media platforms.
- Clean, normalize and load the data to a NoSQL DB through a series of ETL pipelines.
- Send database changes to a queue using the CDC pattern.

☁ Deployed on AWS.
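
If the CDC (change data capture) pattern is new to you, here is a toy sketch of the idea: every write to the database becomes an event on a queue. The `WatchedCollection` and `ChangeEvent` names are hypothetical. Real CDC setups usually read the database's own change log or change stream instead of hooking the write path; this example collapses both into one class for brevity.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    operation: str             # "insert", "update" or "delete"
    doc_id: str
    document: Optional[dict]   # None for deletes


class WatchedCollection:
    """Toy document store that publishes every write as a change event."""

    def __init__(self, changes: queue.Queue) -> None:
        self._docs = {}
        self._changes = changes

    def upsert(self, doc_id: str, document: dict) -> None:
        operation = "update" if doc_id in self._docs else "insert"
        self._docs[doc_id] = document
        # Publish a copy so later mutations don't alter the event.
        self._changes.put(ChangeEvent(operation, doc_id, dict(document)))

    def delete(self, doc_id: str) -> None:
        if self._docs.pop(doc_id, None) is not None:
            self._changes.put(ChangeEvent("delete", doc_id, None))


changes = queue.Queue()
posts = WatchedCollection(changes)
posts.upsert("post-1", {"text": "Hello"})
posts.upsert("post-1", {"text": "Hello, world"})
posts.delete("post-1")

# Drain the queue, as a downstream consumer would.
events = []
while not changes.empty():
    events.append(changes.get())

print([event.operation for event in events])
```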

→ 𝗧𝗵𝗲 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲

- Consume messages from a queue through a Bytewax streaming pipeline.
- Every message will be cleaned, chunked, embedded, and loaded into a Qdrant vector DB in real time.

☁ Deployed on AWS.
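
The flow of the feature pipeline can be sketched in a few lines of plain Python. To be clear about the assumptions: this uses neither the Bytewax nor the Qdrant API, the `embed` function is a toy hashed bag-of-words that only stands in for a real embedding model, and a dict plays the role of the vector DB.

```python
import hashlib
import queue

EMBEDDING_DIM = 8  # arbitrary size for the toy embedding


def clean(text: str) -> str:
    """Collapse all runs of whitespace into single spaces."""
    return " ".join(text.split())


def chunk(text: str, max_words: int = 5) -> list:
    """Split text into consecutive windows of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> list:
    """Toy deterministic embedding: each word increments one hashed bucket."""
    vector = [0.0] * EMBEDDING_DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % EMBEDDING_DIM] += 1.0
    return vector


def process_message(message: dict, vector_store: dict) -> int:
    """Clean, chunk, embed, and load one message; return the number of chunks stored."""
    chunks = chunk(clean(message["text"]))
    for index, piece in enumerate(chunks):
        key = f"{message['id']}-{index}"
        vector_store[key] = {"vector": embed(piece), "text": piece}
    return len(chunks)


messages = queue.Queue()
messages.put({
    "id": "post-1",
    "text": "  LLM twins   learn your\nwriting style from your own posts.  ",
})

vector_store = {}
while not messages.empty():
    process_message(messages.get(), vector_store)

print(sorted(vector_store))
```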

→ 𝗧𝗵𝗲 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲

- Create a custom dataset based on your digital data.
- Fine-tune an LLM using QLoRA.
- Use Comet ML's experiment tracker to monitor the experiments.
- Evaluate and save the best model to Comet's model registry.

☁ Deployed on Qwak.

→ 𝗧𝗵𝗲 𝗶𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲

- Load and quantize the fine-tuned LLM from Comet's model registry.
- Deploy it as a REST API.
- Enhance the prompts using RAG.
- Generate content using your LLM twin.
- Monitor the LLM using Comet's prompt monitoring dashboard.

☁ Deployed on Qwak.
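
"Enhance the prompts using RAG" means: retrieve the most relevant pieces of context, then put them into the prompt. Here is a minimal sketch under simplifying assumptions: word counts stand in for real embeddings, a Python list stands in for the vector DB, and the documents and prompt template are made up for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list, top_k: int = 2) -> list:
    """Return the `top_k` documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list) -> str:
    """Put the retrieved context in front of the user's question."""
    joined = "\n".join(f"- {item}" for item in context)
    return f"Use the context to answer.\n\nContext:\n{joined}\n\nQuestion: {query}"


documents = [
    "qlora fine-tunes a quantized model with low-rank adapters",
    "a vector db stores embeddings for similarity search",
    "cdc streams database changes to a queue",
]
query = "how does a vector db search embeddings"

context = retrieve(query, documents, top_k=1)
prompt = build_prompt(query, context)
print(prompt)
```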

.

𝘈𝘭𝘰𝘯𝘨 𝘸𝘪𝘵𝘩 𝘵𝘩𝘦 4 𝘮𝘪𝘤𝘳𝘰𝘴𝘦𝘳𝘷𝘪𝘤𝘦𝘴, 𝘺𝘰𝘶 𝘸𝘪𝘭𝘭 𝘭𝘦𝘢𝘳𝘯 𝘵𝘰 𝘪𝘯𝘵𝘦𝘨𝘳𝘢𝘵𝘦 3 𝘴𝘦𝘳𝘷𝘦𝘳𝘭𝘦𝘴𝘴 𝘵𝘰𝘰𝘭𝘴:

- Comet as your ML Platform
- Qdrant as your vector DB
- Qwak as your ML infrastructure

.

To stay updated on 𝗟𝗟𝗠 𝗧𝘄𝗶𝗻: 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻-𝗥𝗲𝗮𝗱𝘆 𝗔𝗜 𝗥𝗲𝗽𝗹𝗶𝗰𝗮 course...

𝘾𝙝𝙚𝙘𝙠 𝙞𝙩 𝙤𝙪𝙩 𝙤𝙣 𝙂𝙞𝙩𝙃𝙪𝙗 𝙖𝙣𝙙 𝙨𝙪𝙥𝙥𝙤𝙧𝙩 𝙪𝙨 𝙬𝙞𝙩𝙝 𝙖 ⭐️

↓↓↓

🔗 𝗟𝗟𝗠 𝗧𝘄𝗶𝗻: 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻-𝗥𝗲𝗮𝗱𝘆 𝗔𝗜 𝗥𝗲𝗽𝗹𝗶𝗰𝗮

Images

If not otherwise stated, all images are created by the author.
