The ultimate MLOps tool
6 steps to build your AWS infrastructure that will work for 90% of your projects. How to build a real-time news search engine
Decoding ML Notes
Based on your feedback from last week's poll, we will post exclusively on Saturdays starting now.
Enjoy today's article!
This week's topics:
The ultimate MLOps tool
6 steps to build your AWS infrastructure that will work for 90% of your projects
How to build a real-time news search engine
The ultimate MLOps tool
I tested this orchestrator tool for my ML pipelines and loved it! It is the ultimate MLOps tool to glue everything together for reproducibility and continuous training.
In the past months, I have tested most of the top orchestrator tools out there: Airflow, Prefect, Argo, Kubeflow, Metaflow...
You name it!
But one stood out to me.
I am talking about ZenML!
Why?
They realized they don't have to compete with tools such as Airflow or AWS in the orchestrator and MLOps race; they can join them instead!
Instead of being yet another orchestrator tool, they built an abstraction layer on top of the MLOps ecosystem:
- experiment trackers & model registries (e.g., Weights & Biases, Comet)
- orchestrators (e.g., Apache Airflow, Kubeflow)
- container registries for your Docker images
- model deployers (e.g., Hugging Face, BentoML, Seldon)
They wrote a clever wrapper that integrates the whole MLOps ecosystem!
Also, integrating it into your Python code is not intrusive.
As long as your code is modular (which it should be anyway), you only have to annotate your DAG:
- steps with the `@step` decorator
- the entry point with the `@pipeline` decorator
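To make the annotation pattern concrete, here is a dependency-free sketch that mimics ZenML's `@step` / `@pipeline` decorator style. The toy decorators below are illustrative stand-ins, not ZenML's actual implementation:

```python
# Toy stand-ins that mimic how lightly ZenML's annotations sit on modular code.
def step(fn):
    fn.is_step = True  # in ZenML, this wraps the function into a tracked step
    return fn

def pipeline(fn):
    fn.is_pipeline = True  # in ZenML, this registers the DAG entry point
    return fn

@step
def load_data() -> list[int]:
    return [1, 2, 3, 4]

@step
def train_model(data: list[int]) -> float:
    # pretend "training" is just averaging the data
    return sum(data) / len(data)

@pipeline
def training_pipeline() -> float:
    data = load_data()
    return train_model(data)

print(training_pipeline())  # -> 2.5
```

Note how the business logic stays plain Python; only the decorators change when you move to a real orchestrator.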
They also provide the concept of a "stack".
This allows you to configure multiple tools and infrastructure sets your pipeline can run on.
For example:
- a local stack: uses a local orchestrator, artifact store, and compute for quick testing (so you don't have to set up other dependencies)
- an AWS stack: uses the AWS SageMaker orchestrator, Comet, and Seldon
As I am still learning ZenML, this was just an intro post to share my excitement.
I plan to integrate it into Decoding ML's LLM twin open-source project and share the process with you!
Meanwhile, consider checking out their starter guide ↓
🔗 Starter guide: https://lnkd.in/dPzXHvjH
6 steps to build your AWS infrastructure that will work for 90% of your projects
6 steps to build your AWS infrastructure (using IaC) and a CI/CD pipeline that will work for 90% of your projects ↓
We will use the data collection pipeline from our free digital twin course as an example, but it can easily be extrapolated to most of your projects.
First, let's see what is in our toolbelt:
- Docker
- AWS ECR
- AWS Lambda
- MongoDB
- Pulumi
- GitHub Actions
Secondly, let's quickly understand what the data collection pipeline is doing.
It automates your digital data collection from LinkedIn, Medium, Substack, and GitHub. The normalized data will be loaded into MongoDB.
Now, let's understand how the AWS infrastructure and CI/CD pipeline work ↓
1. We wrap the application's entry point with a `handle(event, context: LambdaContext)` function. The AWS Lambda serverless computing service will default to the `handle()` function.
2. Build a Docker image of your application that inherits from the `public.ecr.aws/lambda/python:3.11` base image.
→ Now, you can quickly check your AWS Lambda function locally by making HTTP requests to your Docker container.
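A minimal sketch of such a `handle()` entry point. The event shape and the crawling stub below are hypothetical, not the course's actual code:

```python
from typing import Any

def handle(event: dict, context: Any) -> dict:
    """AWS Lambda entry point: receives the trigger event, runs the
    data collection for one link, and reports the outcome."""
    link = event.get("link")
    if not link:
        # Surface bad input as a 400-style response
        return {"statusCode": 400, "body": "Missing 'link' in event."}

    # Hypothetical stand-in for the real crawling + MongoDB load step.
    crawled = {"link": link, "status": "crawled"}
    return {"statusCode": 200, "body": f"Processed {crawled['link']}"}
```

Locally you can invoke it with a plain dict, e.g. `handle({"link": "https://medium.com/some-author"}, None)`, before wiring it to a real Lambda trigger.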
3. Use Pulumi IaC to create your AWS infrastructure programmatically:
- an ECR as your Docker registry
- an AWS Lambda service
- a MongoDB cluster
- the VPC for the whole infrastructure
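A Pulumi IaC sketch of the ECR + Lambda part, assuming `pulumi` and `pulumi-aws` are installed and a stack is configured. Resource names are illustrative, and the MongoDB cluster and VPC are omitted for brevity:

```python
# Pulumi IaC config sketch (illustrative names, not the course's actual ones).
import pulumi
import pulumi_aws as aws

# ECR repository that will hold the Lambda's Docker image
repo = aws.ecr.Repository("data-collection-ecr")

# IAM role that lets Lambda assume execution permissions
role = aws.iam.Role(
    "lambda-role",
    assume_role_policy="""{
      "Version": "2012-10-17",
      "Statement": [{
        "Action": "sts:AssumeRole",
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"}
      }]
    }""",
)

# Lambda function deployed from the container image pushed to ECR
lambda_fn = aws.lambda_.Function(
    "data-collection-lambda",
    package_type="Image",
    image_uri=repo.repository_url.apply(lambda url: f"{url}:latest"),
    role=role.arn,
    timeout=900,
)

pulumi.export("ecr_url", repo.repository_url)
```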
4. Now that we have our Docker image and infrastructure, we can build our CI/CD pipeline using GitHub Actions. The first step is to build the Docker image inside the CI and push it to ECR when a new PR is merged into the main branch.
5. On the CD part, we will take the fresh Docker image from ECR and deploy it to AWS Lambda.
6. Repeat the same logic with the Pulumi code → add a CD GitHub Action that updates the infrastructure whenever the IaC changes.
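Steps 4 and 5 can be sketched as a single GitHub Actions workflow. The secrets, region, and function name below are placeholders, not the course's actual values:

```yaml
name: ci-cd
on:
  push:
    branches: [main]   # runs after a PR is merged into main
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - name: Build and push image to ECR (CI)
        run: |
          aws ecr get-login-password | docker login --username AWS \
            --password-stdin ${{ secrets.ECR_REGISTRY }}
          docker build -t ${{ secrets.ECR_REGISTRY }}/crawler:latest .
          docker push ${{ secrets.ECR_REGISTRY }}/crawler:latest
      - name: Deploy fresh image to AWS Lambda (CD)
        run: |
          aws lambda update-function-code \
            --function-name data-collection-lambda \
            --image-uri ${{ secrets.ECR_REGISTRY }}/crawler:latest
```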
With this flow, you will do fine for 90% of your projects.
To summarize, the CI/CD pipeline will look like this:
feature PR -> merged to main -> build Docker image -> push to ECR -> deploy to AWS Lambda
Want to run the code yourself?
Consider checking out Lesson 2 from the FREE LLM Twin course:
🔗 The Importance of Data Pipelines in the Era of Generative AI
How to build a real-time news search engine
Decoding ML released an article & code on building a Real-Time News Search Engine using Kafka, Vector DBs and streaming engines.
Everything in Python!
The end goal?
Learn to build a production-ready semantic search engine for news that is synced in real-time with multiple news sources using:
- a streaming engine
- Kafka
- a vector DB.
The problem?
According to a research study by earthweb.com, between 2 and 3 million news articles are published daily, both online and offline.
How would you constantly sync these data sources with your vector DB to keep it up to date with the outside world?
The solution!
→ Here is where the streaming pipeline kicks in.
As soon as a new data point is available, it is:
- ingested
- processed
- loaded to a vector DB
...in real time by the streaming pipeline.
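A dependency-free sketch of that ingest → process → load loop. The article implements it as a Bytewax dataflow against a real vector DB; both are stubbed out here:

```python
def ingest(raw_events):
    """Yield news items one by one as they 'arrive' (Kafka stand-in)."""
    for event in raw_events:
        yield event

def process(event: dict) -> dict:
    """Clean the text and attach a (stubbed) embedding."""
    text = event["text"].strip().lower()
    return {"text": text, "embedding": [float(len(text))], "source": event["source"]}

def load(vector_db: list, document: dict) -> None:
    """Upsert the processed document into the (stubbed) vector DB."""
    vector_db.append(document)

vector_db: list = []
stream = [
    {"text": "  Breaking: markets rally  ", "source": "news-api"},
    {"text": "New LLM released", "source": "rss"},
]
for event in ingest(stream):       # 1. ingested
    document = process(event)      # 2. processed
    load(vector_db, document)      # 3. loaded to the vector DB

print(len(vector_db))  # -> 2
```

In the real pipeline the generator is a Kafka consumer, `process` runs an embedding model, and `load` upserts into the vector DB, but the shape of the loop is the same.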
Here is what you will learn from the article ↓
→ Set up your own Upstash Kafka & Vector DB clusters
→ Structure & validate your data points using Pydantic
→ Simulate multiple Kafka clients using ThreadPoolExecutor & KafkaProducer
→ Stream processing using Bytewax: learn to build a real-time RAG ingestion pipeline
→ Batch-upserting embeddings + metadata to the Upstash Vector DB
→ Build a Q&A UI using Streamlit
→ Unit testing (yes, we even added unit testing!)
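As a taste, the "simulate multiple Kafka clients" idea boils down to fanning producers out across a thread pool. A sketch with the producer stubbed so it runs without a broker (the article uses a real `KafkaProducer` against the Upstash cluster):

```python
from concurrent.futures import ThreadPoolExecutor

class FakeKafkaProducer:
    """Stand-in for a real Kafka producer so the sketch runs without a broker."""
    def __init__(self):
        self.sent = []

    def send(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))

def run_client(client_id: int) -> int:
    """Each simulated client pushes a small batch of messages to 'news'."""
    producer = FakeKafkaProducer()
    for i in range(5):
        producer.send("news", f"client={client_id} article={i}".encode())
    return len(producer.sent)

# Fan out 4 simulated clients, like multiple independent news sources
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(run_client, range(4)))

print(sum(counts))  # -> 20
```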
Curious to level up your Python, streaming & RAG game?
Then consider checking out the article & code. Everything is free.
🔗 [Article] How to build a real-time News Search Engine using Vector DBs
🔗 GitHub code
Images
If not otherwise stated, all images are created by the author.