Snowflake RAG: Idea to prod in 10 steps
H&M’s fashion AI recommendation engine. AI builders bootcamp.
Exciting news! Based on our aggregated experience from the LLM Engineer’s Handbook and LLM Twin course, we will start a new and improved open-source course on production LLMs & RAG systems.
But we need your opinion on one thing ↓
This week’s topics:
AI builders bootcamp
Build a production RAG app on top of your company’s documents using Snowflake
H&M’s fashion AI recommendation engine
AI builders bootcamp
Everyone is talking about the potential of AI.
But not enough people teach the essentials through practical examples...
Shaw Talebi has stepped in to save the day.
For context, Shaw is an ex-Toyota data scientist with 6+ years in AI and has inspired 50,000+ learners through his blog and YouTube channel.
In January, he will be running a 7-week cohort:
𝗔𝗜 𝗕𝘂𝗶𝗹𝗱𝗲𝗿𝘀 𝗕𝗼𝗼𝘁𝗰𝗮𝗺𝗽.
(It's already one of the most popular courses on Maven)
What makes this course stand out?
The emphasis on practical understanding.
Students are encouraged to learn through building projects.
... And all lectures are centered around specific use cases students can use to develop their projects.
All the example code and slides (so far) from cohort 1 are open-sourced - find them here.
We know Shaw as a patient, curious engineer with hands-on experience, qualities that make him a great teacher (as his YouTube channel with 40k+ subscribers shows).
Combining these aspects, the course will teach you how to:
Build custom AI projects from start to finish
Develop essential Python skills
Understand the AI landscape
Manage ML projects effectively
And the course is best for:
𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗽𝗿𝗼𝗳𝗲𝘀𝘀𝗶𝗼𝗻𝗮𝗹𝘀 who want to build the practical skills they need to advance their career
𝗕𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗹𝗲𝗮𝗱𝗲𝗿𝘀 𝗮𝗻𝗱 𝗽𝗿𝗼𝗱𝘂𝗰𝘁 𝗺𝗮𝗻𝗮𝗴𝗲𝗿𝘀 who want to gain the technical foundation required to lead AI initiatives and teams.
𝗘𝗻𝘁𝗿𝗲𝗽𝗿𝗲𝗻𝗲𝘂𝗿𝘀 who want to learn the essential AI and ML engineering skills needed to develop AI-native products and services.
Ready to level up your AI skills?
→ The first cohort starts on January 10th (next Friday)
→ Get 25% off with code: PAUL25
→ 100% company reimbursement eligible
Build a production RAG app on top of your company’s documents using Snowflake
A managed ML platform can reduce your dev time from months to days.
For example, using Snowflake, let's build a production RAG app on top of your company's documents.
🤕 𝘛𝘩𝘦 𝘱𝘳𝘰𝘣𝘭𝘦𝘮?
Your company's documentation is scattered and hard to work with. You try to build a RAG application on top of it, but the development time, costs and complexity are too high.
🤔 𝘛𝘩𝘦 𝘴𝘰𝘭𝘶𝘵𝘪𝘰𝘯?
Use a fully managed platform, such as Snowflake, which handles all the infrastructure pain points while offering the flexibility to process your documentation as you see fit.
So...
💻 𝘛𝘩𝘪𝘴 𝘪𝘴 𝘩𝘰𝘸 𝘺𝘰𝘶 𝘪𝘮𝘱𝘭𝘦𝘮𝘦𝘯𝘵 𝘢 𝘙𝘈𝘎 𝘢𝘱𝘱𝘭𝘪𝘤𝘢𝘵𝘪𝘰𝘯 𝘶𝘴𝘪𝘯𝘨 𝘚𝘯𝘰𝘸𝘧𝘭𝘢𝘬𝘦 ↓
Any RAG architecture contains 3 pipelines + a chatbot UI. Here is what it looks like implemented in Snowflake:
𝗧𝗵𝗲 𝗶𝗻𝗴𝗲𝘀𝘁𝗶𝗼𝗻 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲: A batch pipeline used to populate the vector DB.
1. Access your company documents stored in Snowflake (e.g., PDFs).
2. Extract the text from the PDFs, clean and chunk it using Snowpark.
3. Embed each chunk using an embedding model hosted on Snowflake Cortex AI.
4. Store the embeddings, along with their metadata, in Snowflake's vector DB.
5. Schedule the ingestion pipeline to check for new documents every 10 minutes and load them into the vector DB.
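To make steps 2-4 concrete, here is a minimal, self-contained sketch of the ingestion logic in plain Python. In the real setup you would use Snowpark for text extraction and a Cortex-hosted embedding model; here, a toy hash-based embedder and an in-memory list stand in for Cortex and the vector DB, and all names (`chunk_text`, `toy_embed`, `ingest`) are illustrative:

```python
import hashlib

def chunk_text(text, chunk_size=200, overlap=50):
    """Split a document into overlapping character chunks (step 2)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def toy_embed(chunk, dim=8):
    """Stand-in for a Cortex-hosted embedding model (step 3):
    a deterministic pseudo-embedding derived from a hash."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def ingest(documents, vector_db):
    """Steps 2-4: chunk, embed, and store each chunk with its metadata."""
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            vector_db.append({
                "doc_id": doc_id,
                "chunk_id": i,
                "text": chunk,
                "embedding": toy_embed(chunk),
            })
    return vector_db

docs = {"handbook.pdf": "Employees may work remotely up to three days per week. " * 10}
db = ingest(docs, [])
print(len(db))  # number of chunks stored in the toy vector DB
```

The chunk overlap keeps sentences that straddle a chunk boundary retrievable from either side, which is why most ingestion pipelines use it.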
𝗧𝗵𝗲 𝗿𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲: Queries the vector DB and retrieves data relevant to the user's input.
6. Take the user's input from the Streamlit chatbot UI.
7. Compute the query embedding using the same model hosted on Snowflake Cortex AI.
8. Query the vector DB using the query embedding and return top K similar chunks along with their metadata.
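Steps 7-8 boil down to a similarity search. A minimal sketch in plain Python, with hand-written cosine similarity and a tiny in-memory vector DB standing in for Snowflake's vector search (the 3-dimensional embeddings and the `retrieve` helper are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_embedding, vector_db, top_k=2):
    """Steps 7-8: score every stored chunk against the query
    embedding and return the top-K most similar ones."""
    scored = [(cosine(query_embedding, row["embedding"]), row) for row in vector_db]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:top_k]]

# Toy vector DB; in Snowflake these rows would live in a vector-typed table.
vector_db = [
    {"text": "Remote work policy", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Expense reporting", "embedding": [0.0, 0.9, 0.1]},
    {"text": "Home office stipend", "embedding": [0.8, 0.2, 0.1]},
]
results = retrieve([1.0, 0.0, 0.0], vector_db, top_k=2)
print([r["text"] for r in results])
```

Note that the query must be embedded with the same model as the stored chunks (step 7); mixing embedding models makes the similarity scores meaningless.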
𝗧𝗵𝗲 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲: Takes the user's input and the retrieved context, builds the prompt and passes it to an LLM.
9. Using the prompt template, the latest user input and the retrieved chunks as context, create the prompt.
10. Input the prompt to the LLM hosted on Cortex AI (e.g., llama3, mistral, arctic) and return the generated answer to the Streamlit UI.
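Step 9 is mostly string templating. A hedged sketch of prompt construction, where `PROMPT_TEMPLATE` and `build_prompt` are illustrative names and the resulting string is what you would hand to a Cortex-hosted LLM in step 10:

```python
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Step 9: merge the retrieved chunks and the user's input
    into a single prompt for the LLM (step 10)."""
    context = "\n---\n".join(chunk["text"] for chunk in retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

chunks = [{"text": "Employees may work remotely up to 3 days per week."}]
prompt = build_prompt("How many remote days are allowed?", chunks)
print(prompt)
```

Grounding the LLM with an explicit "use only the context below" instruction is what keeps the answers tied to your company's documents rather than the model's pretraining.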
→ All of this sits directly in Snowflake. Even the Streamlit UI.
As we leverage a fully managed platform, there are no more headaches around:
Storing and processing your data using Snowpark
Hosting and scaling your embedding models and LLMs on Snowflake Cortex AI
Deploying your RAG app as a Streamlit application
All of this can reduce the development time from months to days.
What is your experience building on fully managed platforms vs. building from scratch using multiple tools?
H&M’s fashion AI recommendation engine
One of the biggest challenges when building real-time recommenders?
Most people think it’s just about making accurate predictions...
But the real challenge is narrowing down from millions of potential item candidates to just a few personalized recommendations.
And it must happen in less than a second.
This is where the 4-stage recommender architecture comes into play...
It's a scalable framework used by companies like NVIDIA and YouTube to personalize recommendations in real-time.
I want to walk you through how we can apply this architecture to a real-world use case:
H&M’s fashion recommendation engine.
The problem:
At H&M, the goal is to recommend fashion items to millions of customers based on their browsing and shopping history.
For example, if a customer searches for T-shirts, the recommender should automatically prioritize personalized T-shirt suggestions.
But how can this be done in real-time?
4-Stage Recommender Architecture
Here’s how H&M uses the 4-stage architecture to make this happen:
Stage 1: Candidate Generation
When customers surf the H&M app, their customer ID and the current date are sent to the recommender system.
The Customer Query Model computes a customer embedding based on these inputs.
This embedding is then compared to a vector index of all H&M’s fashion items, which helps narrow millions of items to a coarse list of hundreds of relevant articles.
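The core of Stage 1 is scoring one customer embedding against an index of item embeddings. A toy sketch in plain Python, with a brute-force dot product standing in for the approximate-nearest-neighbor search a real vector index performs, and all item names and 2-dimensional embeddings invented for illustration:

```python
def top_candidates(customer_embedding, item_index, n=3):
    """Stage 1: score every item embedding against the customer
    embedding and keep a coarse list of the n best matches."""
    scored = sorted(
        item_index.items(),
        key=lambda kv: sum(a * b for a, b in zip(customer_embedding, kv[1])),
        reverse=True,
    )
    return [item_id for item_id, _ in scored[:n]]

# Toy vector index; a production system would use ANN search, not a full scan.
item_index = {
    "tshirt_basic": [0.9, 0.1],
    "tshirt_print": [0.8, 0.3],
    "winter_coat": [0.1, 0.9],
    "sneakers": [0.4, 0.5],
}
candidates = top_candidates([1.0, 0.0], item_index, n=2)
print(candidates)
```

A customer embedding pointing toward the "T-shirt" direction of the space surfaces T-shirt candidates first, which is exactly the personalized-search behavior described above.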
Stage 2: Filtering
Next, a Bloom filter removes items the customer has already seen or purchased.
This step trims the candidate list to a more focused set by eliminating redundant suggestions.
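A Bloom filter is just a bit array plus a few hash functions: it can say "definitely not seen" or "probably seen", never missing an item it was given, at the cost of rare false positives. A minimal, self-contained sketch (the class and parameters are illustrative, not H&M's actual implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus num_hashes hash functions.
    May produce false positives (item wrongly reported as seen),
    but never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes independent positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("tshirt_basic")  # customer already bought this one

candidates = ["tshirt_basic", "tshirt_print", "winter_coat"]
fresh = [c for c in candidates if c not in seen]
print(fresh)
```

The appeal for real-time filtering is memory: a fixed-size bit array can summarize a customer's entire purchase history, and membership checks cost only a few hash computations per candidate.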
Stage 3: Ranking
The remaining items are ranked based on their relevance to the customer.
At this stage, the Hopsworks feature store serves precomputed features in real-time from its online store, describing the item, the customer, and the relationship between them.
This enables a CatBoost model to score the list of hundreds of items more accurately in real-time.
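The ranking step can be sketched as: fetch features for each (customer, item) pair, score them, sort. In this toy version a plain dict stands in for the Hopsworks online store and a hand-weighted linear scorer stands in for the CatBoost model; all feature names, IDs, and weights are invented for illustration:

```python
def fetch_features(customer_id, item_id, feature_store):
    """Stand-in for a Hopsworks online-store lookup: returns the
    precomputed features for one (customer, item) pair."""
    return feature_store[(customer_id, item_id)]

def score(features, weights):
    """Stand-in for the CatBoost ranking model: a simple
    linear scorer over the fetched features."""
    return sum(weights[name] * value for name, value in features.items())

# Toy online feature store keyed by (customer, item).
feature_store = {
    ("cust_42", "tshirt_basic"): {"price_affinity": 0.9, "past_category_clicks": 0.7},
    ("cust_42", "tshirt_print"): {"price_affinity": 0.4, "past_category_clicks": 0.8},
}
weights = {"price_affinity": 0.6, "past_category_clicks": 0.4}

ranked = sorted(
    ["tshirt_basic", "tshirt_print"],
    key=lambda item: score(fetch_features("cust_42", item, feature_store), weights),
    reverse=True,
)
print(ranked)
```

The key design point is that the expensive feature engineering happens offline; at request time the ranker only does cheap lookups and a forward pass, which is how the sub-second budget is met.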
Stage 4: Ordering & Business Logic
Finally, the items are ordered based on their relevance scores and any additional business logic (e.g., promotional items or new collections).
We reduce the final list to a few dozen highly personalized recommendations.
The customer now sees fashion items they are most likely to click on and buy.
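Stage 4 reduces to adjusting the relevance scores with business rules before the final cut. A hedged sketch, where the boost values and item names are illustrative and `apply_business_logic` is not an actual H&M or Hopsworks API:

```python
def apply_business_logic(ranked_items, scores, boosts, top_n=2):
    """Stage 4: adjust relevance scores with business rules
    (e.g., boost promoted items) and cut to the final list."""
    adjusted = {item: scores[item] + boosts.get(item, 0.0) for item in ranked_items}
    ordered = sorted(ranked_items, key=lambda item: adjusted[item], reverse=True)
    return ordered[:top_n]

scores = {"tshirt_basic": 0.82, "tshirt_print": 0.56, "sneakers": 0.60}
boosts = {"tshirt_print": 0.3}  # e.g., part of a new collection being promoted

final = apply_business_logic(list(scores), scores, boosts, top_n=2)
print(final)
```

Keeping business logic as a separate last stage means merchandisers can change promotions without retraining or redeploying the ranking model.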
The entire architecture is powered by Hopsworks, an AI Lakehouse that provides:
A Feature Store that stores the features in an online store accessible for real-time inference.
A Model Registry to manage the query, ranking, and candidate encoder models.
A Serving Layer to deploy the recommender system in production.
🔗 Curious to dive deeper?
Check out how we built this step-by-step:
Images
If not otherwise stated, all images are created by the author.