Recommendations at "Reasonable Scale": joining dataOps with deep learning recSys with Merlin and Metaflow (blog)
February 2023: aside from behavioral testing, the ML pipeline is now completed. A blog post on the NVIDIA Medium was just published!
This project is a collaboration with the Outerbounds, NVIDIA Merlin and Comet teams, in an effort to release as open source code a realistic data and ML pipeline for cutting edge recommender systems "that just works". Anyone can do great ML, not just Big Tech, if you know how to pick and choose your tools.
TL;DR: (after setup) a single ML person is able to train a cutting edge deep learning model (actually, several versions of it in parallel), test it and deploy it without any explicit infrastructure work, without talking to any DevOps person, without using anything that is not Python or SQL.
As a use case, we pick a popular RecSys challenge, user-item recommendations for the fashion industry: given the past purchases of a shopper, can we train a model to predict what they will buy next? In the current V1.0, we target a typical offline training, cached predictions setup: we prepare in advance the top-k recommendations for our users, and store them in a fast cache to be served when shoppers go online.
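As a rough illustration of the "offline training, cached predictions" idea, the sketch below precomputes top-k items per user from toy scores and serves them with a plain dictionary lookup. The function names, the score values and the in-memory dictionary are all assumptions made for illustration; the actual pipeline gets its scores from a Merlin model and uses DynamoDB as the cache.

```python
import heapq
from typing import Dict, List


def top_k_items(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k item IDs with the highest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


def build_cache(user_scores: Dict[str, Dict[str, float]], k: int) -> Dict[str, List[str]]:
    """Offline step: precompute the top-k recommendations for every user."""
    return {user: top_k_items(scores, k) for user, scores in user_scores.items()}


# Toy scores standing in for model output (hypothetical values): user -> {item -> score}
user_scores = {
    "user_1": {"item_a": 0.9, "item_b": 0.1, "item_c": 0.5},
    "user_2": {"item_a": 0.2, "item_b": 0.8, "item_c": 0.3},
}

cache = build_cache(user_scores, k=2)

# Online step: serving is a key lookup, with no model in the loop.
recs_for_user_1 = cache.get("user_1", [])
```

The point of the design is that all model inference happens before shoppers go online, so serving latency depends only on the cache.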
Our goal is to build a pipeline with all the necessary real-world ingredients:
- dataOps with Snowflake and dbt;
- training Merlin models (possibly on GPUs), in parallel, leveraging Metaflow;
- experiment and parameter tracking;
- advanced testing with Reclist (FORTHCOMING);
- serving cached predictions through FaaS and SaaS (AWS Lambda, DynamoDB, the serverless framework);
- error analysis and debugging with a Streamlit app.
At a quick glance, this is what we are building:
For an in-depth explanation of the philosophy behind the approach, please check the companion blog post or watch our NVIDIA Summit keynote.
If you like this project please add a star on Github here and check out / share / star the RecList package.
This project builds on our open roadmap for "MLOps at Reasonable Scale", automated documentation of pipelines, rounded evaluation for RecSys:
- NVIDIA Merlin meets the MLOps ecosystem;
- CIKM RecSys Evaluation Challenge;
- NVIDIA RecSys Summit keynote and slides;
- RecList (project website);
- You don't need a bigger boat (repo, paper, talk).
The code is a self-contained, end-to-end recommender project; however, since we leverage best-in-class tools, some preliminary (one time) setup is required. Please make sure the requirements are satisfied, depending on what you wish to run and on what you are already using - roughly in order of ascending complexity:
The basics: Metaflow, Snowflake and dbt
A Snowflake account is needed to host the data, and a working Metaflow setup is needed to run the flow on AWS GPUs if you wish to do so:
- Snowflake account: sign up for a free trial.
- AWS account: sign up for a free AWS account.
- Metaflow on AWS: follow the setup guide - in theory the pipeline should also work with a local setup (i.e. no additional work after installing the `requirements`), if you don't need cloud computing. However, we strongly recommend a fully AWS-compatible setup. The current flow has been tested with Metaflow out-of-the-box (no config, all local), Metaflow with AWS data store but all local computing, and Metaflow with AWS data store and AWS Batch with GPU computing.
- dbt core setup: on top of installing the package in `requirements.txt`, you need to properly configure your dbt_profile.
Please note that while the current implementation focuses on Metaflow on AWS (with Batch), the exact same code (with a change in decorator!) would work in any Kubernetes-based infrastructure (or Azure!). For the same reasons, Snowflake can be replaced with other warehouses leaving the main result unchanged: an end-to-end, scalable, production-ready pipeline for deep learning recommendations.
Adding experiment tracking
- Comet ML: sign up for free and get an API key. If you don't want experiment tracking, make sure to comment out the Comet specific parts in the `train_model` step.
Adding PaaS deployment
- AWS Lambda setup: if the env `SAVE_TO_CACHE` is set to `1`, the Metaflow pipeline will try to cache in DynamoDB recommendations for the users in the test set. Those recommendations can be served through an endpoint using AWS Lambda. If you wish to serve your recommendations, you need to run the serverless project in the `serverless` folder before running the flow: the project will create both a DynamoDB table and a working GET endpoint. To do so: first, install the serverless framework and connect it with your AWS; second, cd into the `serverless` folder, and run `AWS_PROFILE=tooso serverless deploy` (where `AWS_PROFILE` selects a specific AWS config with permission to run the framework, and can be omitted if you use your default). If all goes well, the CLI will create the relevant resources and print out the URL for your public rec API, e.g. `endpoint: GET - https://xafacoa313.execute-api.us-west-2.amazonaws.com/dev/itemRecs`: you can verify the endpoint is working by pasting the URL in the browser (the response will be empty, as you need to run the flow to populate DynamoDB). Make sure the region of deployment in the `serverless.yml` file is the same as the one in the Metaflow pipeline. Note that while we use the serverless framework for convenience, the same setup can be done manually, if preferred.
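To make the serving side more concrete, here is a minimal sketch of what a Lambda handler behind such a GET endpoint could look like. It is not the handler shipped in the `serverless` folder: the `userId` query parameter, the response shape and the in-memory `FAKE_TABLE` (standing in for the DynamoDB table) are all hypothetical. The only thing borrowed from AWS is the general shape of an API Gateway proxy event, where `queryStringParameters` may be missing or null.

```python
import json
from typing import Any, Dict, List

# In-memory stand-in for the DynamoDB table (hypothetical contents).
FAKE_TABLE: Dict[str, List[str]] = {"user_1": ["item_a", "item_c"]}


def lookup_recs(user_id: str) -> List[str]:
    """Return the cached recommendations for a user, or an empty list."""
    return FAKE_TABLE.get(user_id, [])


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Answer a GET request shaped like an API Gateway proxy event."""
    # queryStringParameters is None when the URL carries no query string.
    params = event.get("queryStringParameters") or {}
    user_id = params.get("userId", "")  # hypothetical parameter name
    body = {"userId": user_id, "recs": lookup_recs(user_id)}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


# Local smoke test with a hand-written event.
response = handler({"queryStringParameters": {"userId": "user_1"}})
```

An empty list for an unknown user mirrors the behavior described above, where the endpoint answers with an empty response until the flow has populated the cache.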
Adding a Streamlit app for error analysis
- We added a small Streamlit app (run with `EXPORT_TO_APP=1` to test this very experimental feature) to help visualize and filter predictions: how is the model doing on "short sleeves" items? If you plan on using the app, you also need to install the `requirements_app.txt` in the `app` folder.
A note on containers
At the time of writing, Merlin does not have an official ECR, so we pulled `nvcr.io/nvidia/merlin/merlin-tensorflow:22.11` and slightly changed the entry point to work with Metaflow / AWS Batch. The `docker` folder contains the relevant files - the current flow uses a public ECR repository (`public.ecr.aws/outerbounds/merlin-reasonable-scale:22.11-latest`) we prepared on our AWS when running training in BATCH; if you wish to use your own ECR or the repo above becomes unavailable for whatever reason, you can just change the relevant `image` parameter in the flow (note: you need to register for a free NVIDIA account first to be able to pull from nvcr).
We recommend using Python 3.9 for this project.
Set up a virtual environment with the project dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Note that if you never plan on running Merlin's training locally, but only through Metaflow + AWS Batch, you can avoid installing merlin and tensorflow libraries.
NOTE: if you plan on using the Streamlit app (above), make sure to also pip install the `requirements_app.txt` in the `app` folder.
Inside `src`, create a version of the `local.env` file named only `.env` (do not commit it!), and fill its values:
VARIABLE | TYPE (DEFAULT) | MEANING |
---|---|---|
SF_USER | string | Snowflake user name |
SF_PWD | string | Snowflake password |
SF_ACCOUNT | string | Snowflake account |
SF_DB | string | Snowflake database |
SF_ROLE | string | Snowflake role to run SQL |
SF_WAREHOUSE | string | Snowflake warehouse to run SQL |
EN_BATCH | 0-1 (0) | Enable cloud computing for Metaflow |
COMET_API_KEY | string | Comet ML api key |
EXPORT_TO_APP | 0-1 (0) | Enable exporting predictions for inspections through Streamlit |
SAVE_TO_CACHE | 0-1 (0) | Enable storing predictions to an external cache for serving. If 1, you need to deploy the AWS Lambda (see above) before running the flow |
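The sketch below shows one way the variables in the table could be read and validated from Python. It is an illustration only, not the loading code used in the flow: `read_flag`, `FlowSettings`, `load_settings` and `missing_variables` are hypothetical helpers, and the choice to treat only the Snowflake variables as required is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping

# Variables without a default in the table above; assumed to be required.
REQUIRED = ("SF_USER", "SF_PWD", "SF_ACCOUNT", "SF_DB", "SF_ROLE", "SF_WAREHOUSE")


def read_flag(env: Mapping[str, str], name: str, default: str = "0") -> bool:
    """Interpret a 0-1 variable as a boolean, using the default when unset."""
    return env.get(name, default).strip() == "1"


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """List the required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


@dataclass(frozen=True)
class FlowSettings:
    en_batch: bool
    export_to_app: bool
    save_to_cache: bool


def load_settings(env: Mapping[str, str] = os.environ) -> FlowSettings:
    """Collect the three 0-1 switches, all defaulting to 0 (disabled)."""
    return FlowSettings(
        en_batch=read_flag(env, "EN_BATCH"),
        export_to_app=read_flag(env, "EXPORT_TO_APP"),
        save_to_cache=read_flag(env, "SAVE_TO_CACHE"),
    )


# Example with an explicit mapping instead of the real environment.
settings = load_settings({"EN_BATCH": "1", "SAVE_TO_CACHE": "0"})
```

Passing the mapping explicitly keeps the helpers easy to check without touching the real environment; in practice the default `os.environ` would be used once the `.env` values are loaded.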
The original dataset is from the H&M data challenge.
- Download the files `articles.csv`, `customers.csv`, `transactions_train.csv` and put them in the `src/data` folder. The `data` folder contains `images_to_s3.csv`, which is a simple file simulating a mapping between IDs and s3 storage for product images (note that the mapping reflects our own bucket, but you should copy over the images to your own cloud storage and change the files accordingly). Images are not used in the RecSys pipeline directly, but they can support additional use cases, such as debugging (which is why we added the meta-data in our sql transformations).
- Run `upload_to_snowflake.py` as a one-off script: the program will dump the dataset to Snowflake, using a typical modern data stack pattern. This allows us to use dbt and Metaflow to run realistic ELT and ML code.
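Before running the upload script, it can help to confirm that the three dataset files are where the script expects them. The check below is a small convenience sketch and not part of the repo; `missing_files` is a hypothetical helper, and the `src/data` path in the usage comment assumes you run it from the project root.

```python
from pathlib import Path
from typing import List

# File names taken from the download step above.
REQUIRED_FILES = ("articles.csv", "customers.csv", "transactions_train.csv")


def missing_files(data_dir: Path) -> List[str]:
    """Return the required dataset files that are not present in data_dir."""
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


# Usage, assuming the current directory is the project root:
# problems = missing_files(Path("src/data"))
# if problems:
#     raise SystemExit(f"Missing dataset files: {problems}")
```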
Once you run the script, check your Snowflake for the new tables:
After the data is loaded, we use dbt as our transformation tool of choice. While you can run dbt code as part of a Metaflow pipeline, we keep the dbt part separate in this project to simplify the runtime component: it will be trivial (as shown here for example) to orchestrate the SQL code within Metaflow if you wish to do so. After the data is loaded in Snowflake:
- `cd` into the `dbt` folder;
- run `dbt run`.
Check your Snowflake for the new tables created by dbt:
In particular, the table `"EXPLORATION_DB"."HM_POST"."FILTERED_DATAFRAME"` represents a dataframe in which user, article and transaction data are all joined together - the Metaflow pipeline will read from this table, leveraging the pre-processing done at scale through dbt and Snowflake.
Once the above setup steps are completed, you can run the flow:
- cd into the `src` folder;
- run the flow with `AWS_PROFILE=DemoReno-363597772528 python my_merlin_flow.py run --max-workers 4 --with card`, where `AWS_PROFILE` is needed to select the AWS config that runs the flow and its related AWS infrastructure (you can omit it, if you're using the default). As per standard Metaflow setup, make sure to also set as envs `METAFLOW_PROFILE` and `AWS_DEFAULT_REGION` as needed (you can omit them, if you're using the default settings after the AWS Setup).
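If you prefer to launch the run from a Python script rather than typing the command, the same invocation can be assembled as below. This is only a sketch: `build_run_command` and `build_env` are hypothetical helpers, the profile name in the usage comment is a placeholder, and the `subprocess` call is left commented out so that nothing is executed by accident.

```python
import os
from typing import Dict, List, Mapping, Optional


def build_run_command(max_workers: int = 4) -> List[str]:
    """Assemble the flow invocation shown above as an argument list."""
    return [
        "python", "my_merlin_flow.py", "run",
        "--max-workers", str(max_workers),
        "--with", "card",
    ]


def build_env(
    base: Mapping[str, str],
    aws_profile: Optional[str] = None,
    metaflow_profile: Optional[str] = None,
    region: Optional[str] = None,
) -> Dict[str, str]:
    """Copy the base environment and add only the overrides that were given."""
    env = dict(base)
    if aws_profile:
        env["AWS_PROFILE"] = aws_profile
    if metaflow_profile:
        env["METAFLOW_PROFILE"] = metaflow_profile
    if region:
        env["AWS_DEFAULT_REGION"] = region
    return env


# To actually launch the flow from the src folder (not executed here):
# import subprocess
# subprocess.run(
#     build_run_command(),
#     env=build_env(os.environ, aws_profile="my-profile"),  # placeholder name
#     check=True,
# )
```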
At the end of the flow, you can inspect the default DAG Card with `python my_merlin_flow.py card view get_dataset`:
For an intro to DAG cards, please check our NeurIPS 2021 paper.
If you run the flow with the full setup, you will end up with:
- versioned datasets and model artifacts, accessible through the standard Metaflow client API;
- a dashboard for experiment tracking, including a quick panel to inspect predicted items for selected shoppers;
- an automated, versioned documentation for your pipeline, in the form of Metaflow cards;
- a live, scalable endpoint serving batched predictions using AWS Lambda and DynamoDB.
If you have set `EXPORT_TO_APP=1` (and completed the setup), you can also visualize predictions using a Streamlit app that:
- automatically uses the serialized data from the last successful Metaflow run;
- leverages CLIP capabilities to offer a quick, free-text way to navigate the prediction set based on the features of the ground truth item (e.g. "long sleeves shirt").
Cd into the `app` folder, and run `streamlit run pred_inspector.py` (make sure Metaflow envs have been set, as usual). You can filter for product type of the target item and use text-to-image search to sort items (try for example with "jeans" or "short sleeves").
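The text-to-image sorting can be pictured as ranking items by the similarity between a query embedding and each item's image embedding. The sketch below illustrates that ranking step with cosine similarity on toy two-dimensional vectors; it does not call CLIP, the vectors are made up, and `rank_items` is a hypothetical helper rather than the function used in `pred_inspector.py`.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_items(query_vec: Sequence[float], item_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Sort item IDs from most to least similar to the query vector."""
    return sorted(
        item_vecs,
        key=lambda item: cosine(query_vec, item_vecs[item]),
        reverse=True,
    )


# Toy embeddings (hypothetical): in the app these would come from CLIP.
item_vecs = {
    "jeans_1": [1.0, 0.0],
    "shirt_1": [0.0, 1.0],
    "jeans_2": [0.8, 0.2],
}
query_vec = [1.0, 0.1]  # stands in for the embedding of the text "jeans"

ranked = rank_items(query_vec, item_vecs)
```

With these toy vectors the two "jeans" items rank ahead of the shirt, which is the behavior you would expect from a free-text search for "jeans".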
- Improving error analysis and evaluation: improvements will come automatically from RecList;
- Making sure dependencies are easy to adjust depending on setup - e.g. dask_cudf vs pandas depending on your GPU set up;
- Supporting other recSys use cases, possibly coming with more complex deployment options (e.g. Triton on Sagemaker).
- This is a pretty complex project! Can we make it simpler? Yes (e.g. here) and no: while surely the code can be shortened and optimized a bit, in 700 lines we now have an entire recommender system in a production setting, including data preparation in a warehouse, scheduled DAG for batch predictions, a serverless endpoint to serve from the cache, parallel optimization of a modern deep learning architecture (plus, behavioral testing, and CLIP-based data exploration with Streamlit). It is indeed surprising that, when looking into it, the project is really not complex at all, considering you can basically run a production system with it: it is telling that, of all the 700 lines, only a few dozen are actually about the recommendation model, with the majority of the code making sure meta-data, tracking, evaluation etc. are done properly. We are working on other introductions to RecSys, and will add their links here when they are public, in case a simplified setup is preferred (e.g. parquet to duckDB instead of setting up the full Snowflake warehouse!).
- What if my datasets are not static to begin with, but depend on real interactions? We open-sourced a serverless pipeline that shows how data ingestion could work with the same philosophical principles.
- I want to add tool X, or replace Y with Z: how modular is this pipeline? Our aim is to present a pipeline simple enough to be quickly grasped, complex enough to sustain a real deep learning model and industry use case. That said, it is possible that what worked for us may not work as perfectly for you: e.g. you may wish to change experiment tracking (e.g., an abstraction for Neptune is here), or use a different data warehouse solution (e.g. BigQuery), or orchestrate the entire thing in a different way (check again here for a Prefect-based solution). We start by providing a flow that "just works", but our focus is mainly on the functional pieces, not just the tools: what are the essential computations we need to run a modern recsys pipeline? If you find other tools are better for you, please go ahead - and let us know, feedback is always useful!
Main Contributors:
- Jacopo, general design, Metaflow fan boy, prototype;
- the Outerbounds team, in particular Hamel for Metaflow guidance, Valay for AWS Batch support;
- the NVIDIA Merlin team, in particular Gabriel, Ronay, Ben, Even.
Special thanks:
- Dhruv Nair from Comet for double-checking our experiment tracking setup and suggesting improvements.
All the code in this repo is freely available under an MIT License, also included in the project.