Multi-modal embeddings in recsys: Beyond text and images
Fine-tune any LLM in four simple steps. Hidden bottlenecks in your code? Not anymore.
This week's topics:
Multi-modal embeddings in recsys: Beyond text and images
Hidden bottlenecks in your code? Not anymore
Fine-tune any LLM in four simple steps
Multi-modal embeddings in recsys: Beyond text and images
Want to know how companies like H&M recommend the perfect items?
The Two-Tower Network is behind it all - here's how it works:
It embeds users and items into a shared vector space to enable fast and efficient retrieval of recommendations.
To compute the embeddings for users and items, we need something different from the standard encoders we use when working with text or image data.
The model architecture consists of two parallel networks known as towers:
Query Tower (Customer Query Encoder): This tower encodes user features such as customer_id, age, and other attributes to understand user preferences.
Candidate Tower (Item Encoder): This tower encodes item-specific features like article_id, garment_group_name, and other product attributes.
Both towers work independently but generate embeddings that live in the same low-dimensional space.
This shared space allows the model to compare users and items efficiently and determine relevant matches.
Each tower processes its respective inputs by:
Feature Encoding and Fusion: User and item features (both categorical and numerical) are converted into dense embeddings. For example, customer_id and article_id are mapped to vectors using an EmbeddingLayer, while categorical features like garment_group_name use a one-hot encoding layer.
Neural Network Refinement: A feedforward neural network (FNN) with multiple dense layers refines these embeddings, reducing them to a low-dimensional space. This compact representation helps prevent overfitting and ensures more generalized recommendations.
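To make the architecture concrete, here is a minimal sketch of one tower in PyTorch. This is my own illustration, not the exact H&M implementation: the feature names, vocabulary sizes, and dimensions are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One tower: embed the ID, one-hot encode a categorical feature,
    fuse them, and refine with a small feedforward network."""

    def __init__(self, num_ids, num_categories, emb_dim=64, out_dim=16):
        super().__init__()
        self.id_embedding = nn.Embedding(num_ids, emb_dim)  # e.g. customer_id / article_id
        self.num_categories = num_categories                # e.g. garment_group_name
        self.ffn = nn.Sequential(                           # refine into a low-dimensional space
            nn.Linear(emb_dim + num_categories, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, ids, category):
        fused = torch.cat(
            [self.id_embedding(ids), F.one_hot(category, self.num_categories).float()],
            dim=-1,
        )
        # L2-normalize so a dot product between towers equals cosine similarity
        return F.normalize(self.ffn(fused), dim=-1)

# The towers are independent but share the same 16-dimensional output space
query_tower = Tower(num_ids=100_000, num_categories=20)     # users
candidate_tower = Tower(num_ids=50_000, num_categories=20)  # items

user_vec = query_tower(torch.tensor([42]), torch.tensor([3]))
item_vec = candidate_tower(torch.tensor([1337]), torch.tensor([7]))
score = (user_vec * item_vec).sum(-1)  # relevance of this item for this user
```

At training time, the two towers are optimized jointly so that the embeddings of users and the items they interacted with end up close together in the shared space.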
But why do low-dimensional embeddings matter?
Because without limiting the dimensions, the model might overfit by memorizing past purchases.
(meaning users would receive repetitive recommendations of items they already bought)
By balancing collaborative filtering and content-based filtering, the Two-Tower Model provides both personalized and diverse recommendations.
To find the right balance, experiment: add more features to both towers, measure the results, compare them, and repeat.
Why?
The more user and item features you include, the more the model leans toward content-based filtering.
(resulting in more diverse and dynamic recommendations)
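At serving time, the payoff of the shared embedding space is that recommendation reduces to a nearest-neighbor search. Here is a toy sketch; the shapes and k are illustrative, and a production system would use an approximate nearest-neighbor index rather than a full matrix multiply:

```python
import torch
import torch.nn.functional as F

# Offline: embed the full catalog with the candidate tower (random stand-ins here)
item_embeddings = F.normalize(torch.randn(50_000, 16), dim=-1)

# Online: embed the user once, then score every item with a single matrix multiply
user_embedding = F.normalize(torch.randn(16), dim=-1)
scores = item_embeddings @ user_embedding         # cosine similarity per item
top_scores, top_items = torch.topk(scores, k=10)  # the 10 best candidates
print(top_items.tolist())
```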
Learn more about implementing and training the Two-Tower Network in our "Training pipelines for TikTok-like recommenders" article →
Hidden bottlenecks in your code? Not anymore (Sponsored)
Every AI developer knows this struggle:
Optimizing AI systems and large-scale applications.
You're working on your first AI project, connecting all the components and algorithms that seem perfect in theory.
But when you hit run, your application drags and takes ages to process.
Why?
It's the bottlenecks.
Maybe it's I/O: reading images or text data sequentially instead of asynchronously, or poorly parallelized API calls to models like OpenAI's.
Perhaps you're not batching your inputs properly, or the code isn't optimized for parallel processing - and this is all while running locally.
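To illustrate how much sequential I/O costs, here is a toy comparison using Python's asyncio. The fetch function is a hypothetical stand-in for any I/O-bound call, such as a file read or a model API request:

```python
import asyncio
import time

async def fetch(i):
    """Hypothetical stand-in for one I/O-bound call (file read, HTTP request, model API)."""
    await asyncio.sleep(0.5)  # simulated network / disk latency
    return f"result-{i}"

async def main():
    # Sequential: each call waits for the previous one -> ~5 s for 10 calls
    start = time.perf_counter()
    results = [await fetch(i) for i in range(10)]
    print(f"sequential: {time.perf_counter() - start:.1f}s")

    # Concurrent: all 10 calls in flight at once -> ~0.5 s
    start = time.perf_counter()
    results = await asyncio.gather(*(fetch(i) for i in range(10)))
    print(f"concurrent: {time.perf_counter() - start:.1f}s")

asyncio.run(main())
```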
When you push your code to production (especially on Kubernetes clusters), these inefficiencies multiply and become harder to detect.
That's where profilers come in...
They help identify these bottlenecks and guide optimization.
But most traditional profiling tools have limitations:
Too slow.
Too basic.
Only applicable to local environments.
None of them solves the complexity of modern AI workflows at scale.
That's why I'm excited to share Perforator by Yandex.
It's an open-source, distributed profiler that's available for all developers on GitHub.

What makes it notable?
It profiles in real time.
It identifies which lines of code are costly and reveals performance blind spots.
It works seamlessly across local programs and Kubernetes clusters (as long as they run Linux).
It can profile CPU, GPU, and I/O.
The developers estimate it may save you up to 20% on infrastructure costs.
Iโve personally seen how these bottlenecks can slow down AI projects.
And it taught me an important lesson:
Having a tool to constantly monitor and optimize your system's performance is essential.
What makes this one even better is that it's open source and free for everyone.
This is a must-have for engineering teams and CTOs who want to maintain code quality and keep infrastructure costs under control.
Learn more about Perforator in their article:
Or consider directly testing out the tool:
Fine-tune any LLM in four simple steps
It's crazy how easy it is to fine-tune LLMs at scale with Unsloth AI, Hugging Face, and SageMaker. Here are 4 simple steps:
1. Load a base model of choice from the Hugging Face model registry.
2. Load an instruction dataset of choice from the Hugging Face data registry. Format it using your prompt template and create a train-test split.
3. Use the SFTTrainer from trl for SFT fine-tuning (or the DPOTrainer for preference alignment).
4. Wrap the fine-tuning script with a HuggingFace SageMaker class and deploy it to EC2 instances.
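To make the steps concrete, here is a minimal sketch. The model and dataset names are placeholders I picked for illustration, and the exact SFTTrainer/SFTConfig arguments vary across trl versions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Steps 1 + 2: base model and instruction dataset (both are example choices)
model_name = "meta-llama/Llama-3.1-8B"
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = dataset.train_test_split(test_size=0.05)  # train-test split

# Step 3: supervised fine-tuning with trl's SFTTrainer
trainer = SFTTrainer(
    model=model_name,  # SFTTrainer can load the model and tokenizer from the hub
    args=SFTConfig(output_dir="./sft-output", num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()

# Step 4 (sketch): wrap this script with the HuggingFace SageMaker estimator, e.g.
#   from sagemaker.huggingface import HuggingFace
#   HuggingFace(entry_point="train.py", instance_type="ml.g5.2xlarge",
#               instance_count=1, role="<your-sagemaker-role>",
#               transformers_version="4.36", pytorch_version="2.1",
#               py_version="py310").fit()
```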
Using this template, you can easily swap:
your base LLM
your dataset
your AWS infrastructure
All based on your fine-tuning scale and task needs. Since the engineering side is mostly solved by templates like this, today's real AI challenges are about:
gathering raw data
cleaning your data
having enough $$$ for compute.
You can find the complete code in our LLM Engineer's Handbook GitHub repository - ready to be adapted to your use case:
Whenever you're ready, there are 3 ways we can help you:
Perks: Exclusive discounts on our recommended learning resources
(live courses, self-paced courses, learning platforms and books).
The LLM Engineer's Handbook: Our bestseller book on mastering the art of engineering Large Language Model (LLM) systems from concept to production.
Free open-source courses: Master production AI with our end-to-end open-source courses, which reflect real-world AI projects, covering everything from system architecture to data collection and deployment.
Images
If not otherwise stated, all images are created by the author.