Decoding ML #012: This Is My Favorite Software Design Pattern You Must Know

My Favorite Software Design Pattern That You Must Know as an MLE. Unify Batch and Streaming ML Pipelines.

Paul Iusztin

Aug 17, 2023

Hello there, I am Paul Iusztin 👋🏼

Within this newsletter, I will help you decode complex topics about ML & MLOps one week at a time 🔥

This week we will cover:

My Favorite Software Design Pattern That You Must Know as an MLE
Unify Batch and Streaming ML Pipelines

But first,

If you want to quickly learn how to 𝗱𝗲𝘀𝗶𝗴𝗻 & 𝗯𝘂𝗶𝗹𝗱 𝗮𝗻 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝗠𝗟 𝗯𝗮𝘁𝗰𝗵 𝗮𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲 using 𝗠𝗟𝗢𝗽𝘀 good practices,

I want to let you know that:

→ I presented an overview of "𝗧𝗵𝗲 𝗙𝘂𝗹𝗹 𝗦𝘁𝗮𝗰𝗸 𝟳-𝗦𝘁𝗲𝗽𝘀 𝗠𝗟𝗢𝗽𝘀 𝗙𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸" course.

During the webinar, I had the chance to explain how all the puzzle pieces (aka architecture components) of a batch architecture work together.

If you want to understand how to design:
- a batch architecture
- feature, training, and inference pipelines
- orchestration
- data validation & monitoring
- web app using FastAPI & Streamlit
- deploy & CI/CD pipeline
- adapt the batch architecture to an online system

Then this recording of the 1-hour webinar might be just for you ↓

#1. My Favorite Software Design Pattern That You Must Know as an MLE

This is my 𝗳𝗮𝘃𝗼𝗿𝗶𝘁𝗲 𝗱𝗲𝘀𝗶𝗴𝗻 𝗽𝗮𝘁𝘁𝗲𝗿𝗻 that you must know as an ML engineer.

Most ML engineers completely ignore software design patterns, but let me explain why you should know this one for your machine learning projects 👇

I am talking about Composite.

The Composite pattern is a structural design pattern that helps you compose objects in a tree-like structure.

Let me explain by starting with the problem.

𝗣𝗿𝗼𝗯𝗹𝗲𝗺

Let's say that you want to build an ML pipeline that performs object detection + tracking.

You can easily divide it into smaller pipelines, such as:
1. preprocessing
2. training | inference
3. postprocessing

In turn, each of these 3 pipelines is split into smaller components.

Let's say that, to speed up the ML pipeline, you want to run everything you can in parallel.

Thus, depending on the use case, it would be best to have a module to compose components sequentially or in parallel.

❌ If you don't think this through, your code can quickly transform into spaghetti.

Composite Intuition & Structure [Image by the Author].

𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻

✅ Now, the Composite design pattern kicks in.

-> 𝘛𝘩𝘪𝘴 𝘪𝘴 𝘩𝘰𝘸 𝘺𝘰𝘶 𝘤𝘢𝘯 𝘪𝘮𝘱𝘭𝘦𝘮𝘦𝘯𝘵 𝘵𝘩𝘦 𝘔𝘓 𝘱𝘪𝘱𝘦𝘭𝘪𝘯𝘦 𝘢𝘣𝘰𝘷𝘦 𝘶𝘴𝘪𝘯𝘨 𝘵𝘩𝘦 𝘊𝘰𝘮𝘱𝘰𝘴𝘪𝘵𝘦 𝘱𝘢𝘵𝘵𝘦𝘳𝘯:

1. Define a standard interface for all the transformations. Let's call it "Transformation."

2. Create an abstract class called "AtomicTransformation" that inherits from the "Transformation" interface and represents a single, atomic transformation.

3. Implement an abstract class called "CompositeTransformation" for running multiple transformations. This class also inherits from the "Transformation" interface, but it takes a list of "Transformation" objects as input.

4. Depending on how you want to run a group of transformations, you can inherit from the "CompositeTransformation" class and implement classes such as:
- "SequenceTransformations"
- "ParallelTransformations"
- "DistributedTransformations"

5. Now, when you want to implement a granular transformation (e.g., normalize the image), you inherit from the "AtomicTransformation" class.

6. When you want to glue multiple transformations together, you leverage the "CompositeTransformation" classes.

7. When you call a "CompositeTransformation", under the hood it calls each object in its list of "Transformation" objects, recursing until it hits an "AtomicTransformation" object, which does the actual transformation.
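
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a reference implementation. The class names come from the steps above, but everything else is an assumption made for illustration: the use of `__call__` as the shared method, the choice to run parallel children on the same input with a thread pool, and the toy atomic transformations (`Normalize`, `Flip`, `Mean`, `Max`) that operate on a plain list of numbers instead of a real image.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List


class Transformation(ABC):
    """Common interface shared by every node in the tree (step 1)."""

    @abstractmethod
    def __call__(self, data: Any) -> Any:
        ...


class AtomicTransformation(Transformation):
    """Leaf node: performs one concrete piece of work (step 2)."""


class CompositeTransformation(Transformation):
    """Inner node: holds child transformations and delegates to them (step 3)."""

    def __init__(self, transformations: List[Transformation]) -> None:
        self.transformations = transformations


class SequenceTransformations(CompositeTransformation):
    """Feeds the output of each child into the next one (step 4)."""

    def __call__(self, data: Any) -> Any:
        for transformation in self.transformations:
            data = transformation(data)
        return data


class ParallelTransformations(CompositeTransformation):
    """Runs every child on the same input and returns the results in order.

    Assumption: threads are used here only to keep the sketch short.
    """

    def __call__(self, data: Any) -> List[Any]:
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(t, data) for t in self.transformations]
            return [future.result() for future in futures]


# Hypothetical atomic transformations (step 5), working on a list of values.
class Normalize(AtomicTransformation):
    def __call__(self, data: List[float]) -> List[float]:
        return [value / 255.0 for value in data]


class Flip(AtomicTransformation):
    def __call__(self, data: List[float]) -> List[float]:
        return list(reversed(data))


class Mean(AtomicTransformation):
    def __call__(self, data: List[float]) -> float:
        return sum(data) / len(data)


class Max(AtomicTransformation):
    def __call__(self, data: List[float]) -> float:
        return max(data)


# Glue atomic and composite transformations together (step 6).
pipeline = SequenceTransformations([
    Normalize(),
    Flip(),
    ParallelTransformations([Mean(), Max()]),
])

# Calling the root walks the tree down to the atomic leaves (step 7).
result = pipeline([0.0, 127.5, 255.0])
print(result)  # [0.5, 1.0]
```

Because the caller only ever sees a "Transformation", swapping `ParallelTransformations` for another composite (or for a single atomic transformation) does not change the calling code.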

Note that because both the "AtomicTransformation" and "CompositeTransformation" inherit the "Transformation" interface, you can use them interchangeably, like LEGOs.

That is powerful.

That is why we all love scikit-learn and its "Pipeline" interface 🔥

If you want to know how to apply other software design patterns in MLE, here is another article I wrote that you might like: 🔗 10 Underrated Software Patterns Every ML Engineer Should Know

#2. Unify Batch and Streaming ML Pipelines

What happens if you want to introduce a real-time/streaming data source into your system?

You cry. Just kidding. It is a lot easier than it sounds.

Let's get some context.

Until now, you have used only a static data source to train your model & compute your features.

But you find out that your business wants to use real-time news feeds as features for your model.

𝗪𝗵𝗮𝘁 𝗱𝗼 𝘆𝗼𝘂 𝗱𝗼?

You have to implement 2 𝘮𝘢𝘪𝘯 𝘱𝘪𝘱𝘦𝘭𝘪𝘯𝘦𝘴 𝘧𝘰𝘳 𝘺𝘰𝘶𝘳 𝘯𝘦𝘸 𝘴𝘵𝘳𝘦𝘢𝘮𝘪𝘯𝘨 𝘪𝘯𝘱𝘶𝘵 𝘴𝘰𝘶𝘳𝘤𝘦:

#𝟭. One that will quickly transform the raw data into features and make them available in the feature store to be used by the production services.

#𝟮. One that will store the raw data in the static raw data source (e.g., a warehouse) so it can be used later for experimentation and research.

Before being ingested into your system, the real-time data might need an extra processing step that standardizes it and adapts its format to your interface.
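
As an illustration of that standardization step, here is a small sketch. It assumes two made-up news providers (`provider_a` and `provider_b`) that send the same information with different field names and timestamp formats. The `NewsEvent` type, the field names, and the `standardize` function are all hypothetical and only show the idea of adapting each raw format to one internal interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class NewsEvent:
    """Hypothetical internal format that the rest of the system expects."""

    source: str
    headline: str
    published_at: datetime


def standardize(raw: Dict[str, Any], source: str) -> NewsEvent:
    """Adapt a provider-specific payload to the internal NewsEvent format."""
    if source == "provider_a":
        # Assumed format: "title" plus a Unix timestamp in seconds under "ts".
        published_at = datetime.fromtimestamp(raw["ts"], tz=timezone.utc)
        return NewsEvent(source, raw["title"].strip(), published_at)
    if source == "provider_b":
        # Assumed format: "headline" plus an ISO 8601 string under "published".
        published_at = datetime.fromisoformat(raw["published"])
        return NewsEvent(source, raw["headline"].strip(), published_at)
    raise ValueError(f"Unknown source: {source}")


event_a = standardize({"title": " Rates unchanged ", "ts": 1692262800}, "provider_a")
event_b = standardize(
    {"headline": "Rates unchanged", "published": "2023-08-17T09:00:00+00:00"},
    "provider_b",
)
print(event_a.headline == event_b.headline)
```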

A standard strategy is to use:
#𝟭. Kafka as your streaming platform
#𝟮. Flink/Kafka Streams as your streaming processing units

For pipeline #2 (storing the raw data), most of the time you will have access to out-of-the-box data connectors that quickly load the real-time data into your static data storage (e.g., from Kafka to an S3 bucket or a BigQuery data warehouse).
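
To make the two pipelines concrete, here is a toy sketch of how the same stream of events feeds both of them. It deliberately uses no Kafka or Flink APIs: the event list, the `InMemoryFeatureStore` and `InMemoryWarehouse` classes, the `entity_id` field, and the features themselves are hypothetical in-memory stand-ins. In a real setup, each pipeline would be its own consumer of the streaming platform.

```python
from typing import Dict, Iterable, List


class InMemoryFeatureStore:
    """Stand-in for a real feature store: keeps the latest features per entity."""

    def __init__(self) -> None:
        self.features: Dict[str, Dict[str, float]] = {}

    def write(self, entity_id: str, features: Dict[str, float]) -> None:
        self.features[entity_id] = features


class InMemoryWarehouse:
    """Stand-in for the static raw data source: append-only list of raw rows."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def append(self, row: dict) -> None:
        self.rows.append(row)


def compute_features(event: dict) -> Dict[str, float]:
    """Hypothetical, cheap features computed from one news headline."""
    headline = event["headline"]
    return {
        "headline_length": float(len(headline)),
        "word_count": float(len(headline.split())),
    }


def run_feature_pipeline(stream: Iterable[dict], store: InMemoryFeatureStore) -> None:
    """Pipeline #1: raw event -> features -> feature store."""
    for event in stream:
        store.write(event["entity_id"], compute_features(event))


def run_archival_pipeline(stream: Iterable[dict], warehouse: InMemoryWarehouse) -> None:
    """Pipeline #2: raw event -> static raw data source, unchanged."""
    for event in stream:
        warehouse.append(event)


# Both pipelines read the same events, like two independent consumers.
events = [
    {"entity_id": "acme", "headline": "Acme beats estimates"},
    {"entity_id": "globex", "headline": "Globex recalls product line"},
]
feature_store = InMemoryFeatureStore()
warehouse = InMemoryWarehouse()
run_feature_pipeline(events, feature_store)
run_archival_pipeline(events, warehouse)

print(feature_store.features["acme"])
print(len(warehouse.rows))
```

Keeping the two pipelines separate means a slow or failing archival job does not delay the features that the production services depend on.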

The Merge of Batch and Streaming ML Pipelines [Image by the Author].

To conclude...

To add a streaming data source to your current infrastructure, you need the following:
- Kafka
- Flink/Kafka Streams
- a pipeline that moves your streaming data into your static data source
- a pipeline that quickly computes features and loads them into the feature store

Thus, it isn't hard. It is just a lot of infrastructure to set up.

That’s it for today 👾

See you next Thursday at 9:00 am CET.

Have a fantastic weekend!

Paul

Whenever you’re ready, here is how I can help you:

The Full Stack 7-Steps MLOps Framework: a 7-lesson FREE course that will walk you step-by-step through how to design, implement, train, deploy, and monitor an ML batch system using MLOps good practices. It contains the source code + 2.5 hours of reading & video materials on Medium.
Machine Learning & MLOps Blog: here, I approach in-depth topics about designing and productionizing ML systems using MLOps.
