---
layout: post
title: "Consolidated Recommendation Systems"
date: "2025-02-13"
categories: RecSys
---

This post is a quick summary of [Lessons Learnt From Consolidating ML Models in a Large Scale Recommendation System](https://netflixtechblog.medium.com/lessons-learnt-from-consolidating-ml-models-in-a-large-scale-recommendation-system-870c5ea5eb4a), along with a few questions that came up while reading it. I end the post with how we handle this problem at work.


## Summary

- Recommendation system: candidate generation + ranking.
- A typical ranking model pipeline (a minimal sketch follows this list):

  1. Label prep
  2. Feature prep
  3. Model training
  4. Model evaluation
  5. Model deployment (with inference contract)
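
To make the repetition concrete, here is a minimal Python sketch of one such per-use-case pipeline. Every function name here is hypothetical and the stage bodies are stubs, since the blog only names the stages, not an implementation.

```python
from typing import Any

USE_CASES = ["discover", "notifications", "related_items", "search"]

def prepare_labels(use_case: str) -> list[dict[str, Any]]:
    return []  # use-case-specific label generation

def prepare_features(labels: list[dict[str, Any]], use_case: str) -> list[dict[str, Any]]:
    return labels  # join features onto the labelled examples

def train_model(features: list[dict[str, Any]]) -> Any:
    return object()  # whatever ranker this use case happens to use

def evaluate_model(model: Any, use_case: str) -> dict[str, float]:
    return {}  # offline metrics for this use case

def deploy_model(model: Any, use_case: str) -> None:
    pass  # push behind this use case's inference contract

def run_ranking_pipeline(use_case: str) -> None:
    labels = prepare_labels(use_case)
    features = prepare_features(labels, use_case)
    model = train_model(features)
    evaluate_model(model, use_case)
    deploy_model(model, use_case)

# One copy of this wiring per use case is the maintenance burden that the
# consolidation described below removes.
for uc in USE_CASES:
    run_ranking_pipeline(uc)
```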

- Each recommendation use case (e.g., discover page, notifications, related items, category exploration, search) will have its own version of the above pipeline.
- As use cases grow, the team has to maintain multiple such pipelines, which is time-consuming and adds points of failure.

  <figure class="image">
    <img src="{{ site.url }}/assets/2025-02/consolidated_recsys_neflix_1.webp" alt="" style="text-align: center; margin: auto">
    <figcaption style="text-align: center">Figure 1: Figure from the Netflix blog linked at the start.</figcaption>
  </figure>

- Since the pipelines share the same components, we can consolidate them.
- Consolidated pipeline (sketched after the figure below):

  1. Label prep for each use case separately
  2. Stratified union of all the prepared labels
  3. Feature prep (separate categorical feature representing the use case)
  4. Model training
  5. Model evaluation
  6. Model deployment (with inference contract)

  <figure class="image">
    <img src="{{ site.url }}/assets/2025-02/consolidated_recsys_neflix_2.webp" alt="" style="text-align: center; margin: auto">
    <figcaption style="text-align: center">Figure 2: Figure from the Netflix blog linked at the start.</figcaption>
  </figure>
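
Continuing the hypothetical Python sketch from above, the consolidated wiring keeps label prep and serving per use case while everything in between is shared. Again, all names and stubs are my own assumptions, not the blog's code.

```python
from typing import Any

USE_CASES = ["discover", "notifications", "related_items", "search"]

def prepare_labels(use_case: str) -> list[dict[str, Any]]:
    return []  # stage 1: still use-case-specific

def stratified_union(labelled: dict[str, list]) -> list:
    # stage 2: mix the per-use-case label sets (proportions discussed below)
    return [row for rows in labelled.values() for row in rows]

def prepare_features(rows: list) -> list:
    return rows  # stage 3: shared; adds the task_type categorical feature

def train_model(rows: list) -> Any:
    return object()  # stage 4: one model for all use cases

def evaluate_model(model: Any) -> dict[str, float]:
    return {}  # stage 5: ideally sliced per use case (see below)

def deploy_model(model: Any, use_case: str) -> None:
    pass  # stage 6: same model, one serving environment per use case

labelled = {uc: prepare_labels(uc) for uc in USE_CASES}
model = train_model(prepare_features(stratified_union(labelled)))
evaluate_model(model)
for uc in USE_CASES:
    deploy_model(model, uc)
```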

- Label prep for each use case separately (a small illustration follows this list)

  1. Each use case has its own way of generating labels.
  2. Use-case context details are added as separate features.
     - Search context: search query, region
     - Similar-items context: source item
  3. When the use case is search, context features specific to the similar-items use case are filled with default values.
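
Here is a toy illustration of what such labelled rows might look like; the field names and default values are assumptions of mine, not taken from the blog.

```python
from typing import Any

# Hypothetical defaults for context fields that belong to other use cases.
CONTEXT_DEFAULTS: dict[str, Any] = {"search_query": "", "region": "UNKNOWN", "source_item_id": -1}

def with_defaults(row: dict[str, Any]) -> dict[str, Any]:
    """Fill in the context fields that this use case does not populate."""
    return {**CONTEXT_DEFAULTS, **row}

search_row = with_defaults({
    "task_type": "search",
    "item_id": 123,
    "label": 1,                      # e.g. the result was clicked
    "search_query": "space documentaries",
    "region": "IN",
})

related_items_row = with_defaults({
    "task_type": "related_items",
    "item_id": 456,
    "label": 0,                      # e.g. impressed but not played
    "source_item_id": 789,           # the item the user was looking at
})

# related_items_row["search_query"] == "", i.e. search-specific context is
# defaulted for rows coming from other use cases, and vice versa.
```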

- Union of all the prepared labels (a sampling sketch follows this list)

  1. Final labelled set: a% samples from use case-1 labels + b% samples from use case-2 labels + … + z% samples from use case-n labels
  2. The proportions [a, b, …, z] come from stratification.
  3. Q: How is this stratification done? Platform traffic across the different use cases?
  4. Q: What are the results when these proportions are business-driven, e.g., contribution to revenue?
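
A minimal sketch of the stratified union under my own assumptions: the proportions below are made up, since the blog does not say how they are chosen (traffic share is one plausible source).

```python
import random

# Hypothetical mixing proportions per use case; they must sum to 1.
PROPORTIONS = {
    "discover": 0.4,
    "search": 0.3,
    "related_items": 0.2,
    "notifications": 0.1,
}

def stratified_union(labelled: dict[str, list], total: int, seed: int = 42) -> list:
    """Sample each use case's labels according to PROPORTIONS and shuffle."""
    rng = random.Random(seed)
    mixed = []
    for use_case, rows in labelled.items():
        k = min(len(rows), int(total * PROPORTIONS[use_case]))
        mixed.extend(rng.sample(rows, k))
    rng.shuffle(mixed)
    return mixed
```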

- Feature prep (a small sketch follows this list)

  1. All use-case-specific features are added to the data.
  2. If a feature is only used for use case 1, it will contain a default value for all the other use cases.
  3. Add a new categorical feature `task_type` to inform the model about the target reco task.
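
A sketch of that feature prep on a couple of rows, assuming pandas; the feature names (`query_item_match`, `source_item_similarity`) are invented for illustration.

```python
import pandas as pd

rows = [
    # search row: carries a search-only feature
    {"task_type": "search", "item_id": 1, "label": 1, "query_item_match": 0.82},
    # related-items row: carries a related-items-only feature
    {"task_type": "related_items", "item_id": 2, "label": 0, "source_item_similarity": 0.67},
]

df = pd.DataFrame(rows)

# Use-case-specific features get a default value where they do not apply.
df["query_item_match"] = df["query_item_match"].fillna(0.0)
df["source_item_similarity"] = df["source_item_similarity"].fillna(0.0)

# task_type is an explicit categorical so the model knows which surface
# each example comes from.
df["task_type"] = df["task_type"].astype("category")
```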

- Model training happens as usual on the feature vectors and labels. The architecture remains the same; the optimisation remains the same.
- Model evaluation (a per-use-case evaluation sketch follows this list)

  1. Check the appropriate eval metrics for the model.
  2. Q: How do we judge whether the model performed well for all the use cases?
  3. Q: Will it require a separate evaluation set for each use case?
  4. Q: Can there be a 2nd-order Simpson's paradox here: the consolidated model performs well overall, but its performance on individual use cases is low? My hunch: no.
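
One way to answer the first two questions is to slice a single held-out set by use case. A sketch under my assumptions: a scikit-learn-style classifier and a pandas eval frame with the `task_type` and `label` columns from above.

```python
from sklearn.metrics import roc_auc_score

def evaluate_per_use_case(model, eval_df):
    """Report one metric per use case by slicing the eval set on task_type."""
    report = {}
    for task, slice_df in eval_df.groupby("task_type", observed=True):
        X = slice_df.drop(columns=["label"])
        y = slice_df["label"]
        report[task] = roc_auc_score(y, model.predict_proba(X)[:, 1])
    return report  # e.g. {"discover": ..., "search": ..., "related_items": ...}
```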

- Model deployment (with inference contract; a contract sketch follows this list)

  1. Deploy the same model in the respective environment made for each use case. That environment has all the use-case-specific knobs: batch size, throughput, latency, caching policy, parallelism, etc.
  2. A generic API contract supports the heterogeneous context (search query for search, source item for the related-items use case).
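
A sketch of what such a generic request contract could look like; the class and field names are assumptions, not the blog's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class RankingRequest:
    """One request shape shared by every use case."""
    task_type: str                      # "search", "related_items", "discover", ...
    user_id: str
    candidate_ids: list[int]
    # Optional heterogeneous context; only some surfaces populate each key,
    # e.g. {"search_query": "...", "region": "IN"} for search or
    # {"source_item_id": "789"} for related items. Missing keys fall back to
    # the same defaults used at training time.
    context: dict[str, str] = field(default_factory=dict)
```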

- Caveats

  1. The consolidated use cases should be related (e.g., ranking movies on the search and discover pages).
  2. One definition of related can be: they rank the same entities.

- Advantages

  1. Reduces maintenance costs (less code, fewer deployments).
  2. Quick model iterations across all the use cases.
     - Updates (new features, architecture, etc.) made for one use case can be applied to the other use cases.
     - If the consolidated tasks are related, then new features don't cause regressions in practice.
  3. Can be extended to any related use case from both an offline and an online point of view.
  4. Cross-learning: the model potentially gains (hidden) learning from the other tasks, e.g., the search data gives the model more to learn from for the related-items task.
     - Q: Is this actually happening? How can we verify it? One way (sketched below): train an independent model on the use-case-specific data and compare its performance with the consolidated model's performance on the same task.
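
A sketch of that ablation, reusing the per-use-case evaluation slices from earlier; `fit` and `score` stand in for whatever training and metric functions are in use, so they are assumptions here.

```python
def cross_learning_ablation(train_df, eval_df, task, fit, score):
    """Compare a single-task model against the consolidated model on one task."""
    solo = fit(train_df[train_df["task_type"] == task])   # this task's data only
    consolidated = fit(train_df)                           # stratified union of all tasks
    task_eval = eval_df[eval_df["task_type"] == task]
    return {
        "solo": score(solo, task_eval),
        "consolidated": score(consolidated, task_eval),
    }

# If "consolidated" beats "solo" on the task's own eval slice, the extra
# tasks are genuinely transferring signal rather than just adding data volume.
```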

- I was confused about what to call this learning paradigm. [Wikipedia](https://en.wikipedia.org/wiki/Multi-task_learning) says that it is multi-task learning.


## Practice at my work

- The models are not merged across different tasks, like relevance and search.
- Within the relevance ranking tasks (discover, similar items, category exploration), we have a common base ranker model.
- On top of that, we apply different heuristics to tailor it to each particular section (a small sketch follows below).
- Advantages:
  - There is only one main model for all related tasks.
  - Keeps the heuristics logic simple and, thus, easy to maintain.
- Challenges:
  - Heuristics are crude/manual/semi-automated → we may be leaving some gains on the table. There are bandit-based approaches to automating them, though.
  - It loses out on cross-learning opportunities.
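
A sketch of that setup under my own assumptions: a shared base ranker produces scores, and a thin per-section heuristic adjusts them. The section names and adjustments below are illustrative, not our actual rules.

```python
def rerank_for_section(section, candidates, base_scores, recently_seen=()):
    """Apply a section-specific heuristic on top of the shared base ranker's scores."""
    scored = dict(zip(candidates, base_scores))
    if section == "discover":
        # e.g. demote items the user has seen very recently
        for item_id in recently_seen:
            if item_id in scored:
                scored[item_id] *= 0.8
    elif section == "similar_items":
        # e.g. a small boost for items sharing the source item's category could go here
        pass
    return sorted(scored, key=scored.get, reverse=True)

# Example: the base ranker scored three candidates; the discover heuristic
# pushes the recently seen item 10 down one position -> [11, 10, 12].
print(rerank_for_section("discover", [10, 11, 12], [0.9, 0.85, 0.6], recently_seen=[10]))
```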