Training pipelines for TikTok-like…

Anca Ioana Muscalagiu

Dec 12, 2024

Training ML models for a H&M real-time personalized recommender

8 Comments

Cha_le

Dec 25

Hi, thank you for the blog post and the tutorial.

I have some questions about the ranking model. In the example, we use CatBoost with a binary label (1 or 0) for whether the customer buys the product.

Won't the output of this model just be a prediction of whether the customer is likely to buy the article?

The model will not output the recommendations as a ranking, right?

So in my mind, we can only use it to filter out products customers are not gonna buy. But we still don't really have an ordering result of which product to show first.

Maybe I am missing something. If anyone can help me verify this, that would be very helpful.

Thank you

Paul Iusztin

Dec 25 (edited)

Hello,

Great question!

We have the list of raw candidates after querying the vector DB.

The CatBoost model then predicts, for each candidate, the probability that the customer will buy it.

We use that probability as the score for sorting the raw list of candidates.

Then, we can take that sorted list and show all of it or just the top items.
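As a minimal sketch of the sorting step described above: the snippet scores each candidate with a classifier's `predict_proba` and sorts by the probability of the positive ("buy") class. `StubClassifier`, `rank_candidates`, the article IDs, and the single toy feature are all hypothetical and only there to keep the example self-contained; in practice a trained `CatBoostClassifier` would take the stub's place, and its features would come from the feature pipeline.

```python
class StubClassifier:
    """Hypothetical stand-in for a trained binary classifier."""

    def predict_proba(self, features):
        # One [P(no buy), P(buy)] pair per row. The toy feature is
        # already a probability, purely to keep the example small.
        return [[1.0 - row[0], row[0]] for row in features]


def rank_candidates(model, candidate_ids, features, top_k=None):
    """Sort candidates by predicted purchase probability, highest first."""
    probas = model.predict_proba(features)
    # Column 1 holds the probability of the positive ("buy") class.
    scored = [(cid, float(p[1])) for cid, p in zip(candidate_ids, probas)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Raw candidates as they might come back from the vector DB (made-up IDs).
candidates = ["article_a", "article_b", "article_c"]
features = [[0.2], [0.9], [0.5]]

ranked = rank_candidates(StubClassifier(), candidates, features, top_k=2)
print(ranked)
```

The classifier itself never outputs an ordering; the ordering comes from sorting its per-candidate scores.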

Hope that helps!

Cha_le

Dec 25

Ah, I see.

Thank you so much for the clarification.

I forgot that we can also use the probability to rank the results.

Thank you for the answer and this great learning material.

Cha_le

Dec 26

Hi,

I have some additional questions.

Under "Additional Evaluation Metrics" for the retrieval model, NDCG (Normalized Discounted Cumulative Gain) is mentioned as another metric that can be used.

It also places importance on the position of each recommendation output by the retrieval model.

In theory, we could use the retrieval model's output directly to recommend products and omit the ranking model, right?

Since the model already optimizes the rank it outputs, we can then filter out any unwanted items and return the result list right away.

I am guessing the reason we don't do this is that the ranking model would produce better results than relying on the retrieval model alone.

In that case, the choice depends on the use case and the quality we want.

Did I understand this right? Or maybe I am missing something important.

Paul Iusztin

Dec 26 (edited)

Yes. You got it right.

Great observation.

But the key idea is that during step 1, with semantic search, you retrieve ~hundreds of results, while in a real-world app you want to show the best results at the top.

If your semantic search is super precise, then you don't need the ranking step. But as you don't use so many features for your semantic search step, the chances of it being super precise are small.

But in the end, using metrics such as NDCG, you can measure your system's performance with and without the ranking step and decide what works best.
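To make that comparison concrete, here is a small sketch of NDCG@k computed for the same candidates in two orders: as returned by retrieval and after re-ranking. The relevance lists are made-up toy data (1 = the customer bought the item, 0 = did not), the function names are hypothetical, and the gain is linear, which for binary labels matches the exponential-gain variant.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering.

    Assumes `relevances` covers every candidate being compared, so the
    ideal ordering can be derived from the same list.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Same five candidates, two orderings (toy labels).
retrieval_order = [0, 0, 1, 0, 1]  # order returned by semantic search
reranked_order = [1, 1, 0, 0, 0]   # order after the ranking step

ndcg_retrieval = ndcg_at_k(retrieval_order, k=5)
ndcg_reranked = ndcg_at_k(reranked_order, k=5)
print(f"retrieval only: {ndcg_retrieval:.3f}, with ranking: {ndcg_reranked:.3f}")
```

On real data, you would average this over many customers and keep the ranking step only if it improves the metric enough to justify the extra cost.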

Cha_le

Dec 26

I see.

That makes sense.

Thank you so much for the information.
