
Commit

[FSTORE-1239] Rewrite Recommendation System tutorial using new Embeddings API (#237)

* Recommender example rewritten with the 3.7 API, which includes embeddings.
Maxxx-zh authored Feb 19, 2024
1 parent eff9074 commit 16d7e4a
Showing 9 changed files with 727 additions and 603 deletions.
11 changes: 9 additions & 2 deletions .gitignore
@@ -179,6 +179,13 @@ advanced_tutorials/citibike/data/__MACOSX/._202304-citibike-tripdata.csv
advanced_tutorials/citibike/data/__MACOSX/._202305-citibike-tripdata.csv
loan_approval/lending_model/roc_curve.png
advanced_tutorials/timeseries/price_model/model_prediction.png

advanced_tutorials/recommender-system/query_model/variables/variables.index
advanced_tutorials/recommender-system/query_model/variables/variables.data-00000-of-00001
advanced_tutorials/recommender-system/query_model/saved_model.pb
advanced_tutorials/recommender-system/query_model/fingerprint.pb
advanced_tutorials/recommender-system/candidate_model/variables/variables.index
advanced_tutorials/recommender-system/candidate_model/variables/variables.data-00000-of-00001
advanced_tutorials/recommender-system/candidate_model/fingerprint.pb
advanced_tutorials/recommender-system/candidate_model/saved_model.pb
integrations/neo4j/aml_model/*
integrations/neo4j/aml_model_transformer.py
integrations/neo4j/aml_model_transformer.py
58 changes: 46 additions & 12 deletions advanced_tutorials/recommender-system/1_feature_engineering.ipynb
@@ -10,17 +10,17 @@
"\n",
"**Your Python Jupyter notebook should be configured for >8GB of memory.**\n",
"\n",
"In this series of tutorials, we will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why we also train a ranking model that can afford to use more features than the retrieval model.\n",
"In this series of tutorials, you will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why you also train a ranking model that can afford to use more features than the retrieval model.\n",
"\n",
"### <span style=\"color:#ff5f27\">✍🏻 Data</span>\n",
"\n",
"We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"You will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"\n",
"<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data\n",
"\n",
"For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->\n",
"\n",
"The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:\n",
"The full dataset contains images of all products, but here you will simply use the tabular data. You have three data sources:\n",
"- `articles.csv`: info about fashion items.\n",
"- `customers.csv`: info about users.\n",
"- `transactions_train.csv`: info about transactions.\n"
@@ -75,7 +75,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">🗄️ Read Articles Data</span>"
"## <span style=\"color:#ff5f27\">🗄️ Read Articles Data</span>\n",
"\n",
"The **article_id** and **product_code** serve different purposes in the context of H&M's product database:\n",
"\n",
"- **Article ID**: This is a unique identifier assigned to each individual article within the database. It is typically used for internal tracking and management purposes. Each distinct item or variant of a product (e.g., different sizes or colors) would have its own unique article_id.\n",
"\n",
"- **Product Code**: This is also a unique identifier, but it is associated with a specific product or style rather than individual articles. It represents a broader category or type of product within H&M's inventory. Multiple articles may share the same product code if they belong to the same product line or style.\n",
"\n",
"While both are unique identifiers, the article_id is specific to individual items, whereas the product_code represents a broader category or style of product.\n",
"\n",
"Here is an example:\n",
"\n",
"**Product: Basic T-Shirt**\n",
"\n",
"- **Product Code:** TS001\n",
"\n",
"- **Article IDs:**\n",
" - Article ID: 1001 (Size: Small, Color: White)\n",
" - Article ID: 1002 (Size: Medium, Color: White)\n",
" - Article ID: 1003 (Size: Large, Color: White)\n",
" - Article ID: 1004 (Size: Small, Color: Black)\n",
" - Article ID: 1005 (Size: Medium, Color: Black)\n",
"\n",
"In this example, \"TS001\" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.\n",
"\n"
]
},
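> Editor's note: a hypothetical pandas illustration of the article_id / product_code relationship described in the new markdown cell above. The `size` and `color` columns are made up for this example; only the one-to-many mapping is the point.

```python
import pandas as pd

# Hypothetical illustration: five articles (size/colour variants)
# sharing a single product code, mirroring the "Basic T-Shirt" example.
variants = pd.DataFrame({
    "article_id":   [1001, 1002, 1003, 1004, 1005],
    "product_code": ["TS001"] * 5,
    "size":         ["Small", "Medium", "Large", "Small", "Medium"],
    "color":        ["White", "White", "White", "Black", "Black"],
})

# One product code maps to many article ids.
print(variants.groupby("product_code")["article_id"].nunique())  # TS001 -> 5
```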
{
@@ -176,7 +200,7 @@
"metadata": {},
"outputs": [],
"source": [
"trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:600000]\n",
"trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:1_000_000]\n",
"print(trans_df.shape)\n",
"trans_df.head(3)"
]
@@ -199,7 +223,7 @@
"source": [
"## <span style=\"color:#ff5f27\">👨🏻‍🏭 Transactions Feature Engineering</span>\n",
"\n",
"The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, we will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we'll map each month to the unit circle using sine and cosine."
"The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, you will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), you'll map each month to the unit circle using sine and cosine."
]
},
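> Editor's note: a short sketch of the cyclical month encoding described in the cell above, assuming a `t_dat` date column as in the H&M transactions data.

```python
import numpy as np
import pandas as pd

# Sketch of the cyclical encoding, assuming a 't_dat' date column.
trans_df = pd.DataFrame({"t_dat": pd.to_datetime(["2020-01-15", "2020-07-01", "2020-12-24"])})
month = trans_df["t_dat"].dt.month

# Map the month onto the unit circle so January and December end up close together.
trans_df["month_sin"] = np.sin(2 * np.pi * month / 12)
trans_df["month_cos"] = np.cos(2 * np.pi * month / 12)
```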
{
@@ -225,7 +249,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that we have a large dataset. For the sake of the tutorial, we will use a small subset of this dataset, which we generate by sampling 25'000 customers and using their transactions."
"You can see that you have a large dataset. For the sake of the tutorial, you will use a small subset of this dataset, which you generate by sampling 25'000 customers and using their transactions."
]
},
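> Editor's note: a hedged sketch of the customer subsampling step described above. It assumes the transactions frame has a `customer_id` column; the exact sampling logic in the notebook may differ.

```python
# Sample 25,000 customers and keep only their transactions.
SAMPLE_SIZE = 25_000
sampled_customers = (
    trans_df["customer_id"]
    .drop_duplicates()
    .sample(SAMPLE_SIZE, random_state=42)
)
trans_df = trans_df[trans_df["customer_id"].isin(sampled_customers)]
```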
{
@@ -386,14 +410,14 @@
"\n",
"A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.\n",
"\n",
"Before we can create a feature group we need to connect to our feature store."
"Before you can create a feature group you need to connect to your feature store."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
"To create a feature group you need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
]
},
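> Editor's note: for illustration, a minimal sketch of connecting to Hopsworks and declaring such a feature group. The feature group name, primary key, and description here are assumptions and may not match the notebook exactly.

```python
import hopsworks

# Connect to the project and its feature store.
project = hopsworks.login()
fs = project.get_feature_store()

# Declare a feature group with a name, primary key, and description.
trans_fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    description="Transactions data: customer, article, price, and purchase date",
    primary_key=["customer_id", "article_id"],
    online_enabled=True,
)
```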
{
@@ -416,9 +440,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"Here you have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"\n",
"At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent we populate it with its associated data using the `save` function."
"At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you populate it with its associated data using the `insert` method."
]
},
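> Editor's note: and a one-line sketch of persisting the data with `insert`, continuing the hypothetical `trans_fg` from the sketch above.

```python
# Populate the (hypothetical) feature group with its associated DataFrame.
# This defines the schema from the data and, with online_enabled=True,
# also makes the rows available for low-latency online reads.
trans_fg.insert(trans_df)
```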
{
@@ -565,7 +589,17 @@
" trans_fg, \n",
" articles_fg, \n",
" customers_fg,\n",
")"
")\n",
"ranking_df.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ranking_df.label.value_counts()"
]
},
{