
Commit

[FSTORE-1239] Rewrite Recommendation System tutorial using new Embeddings API (#237)

* Recommender example rewritten with the 3.7 API, which includes embeddings.
Maxxx-zh authored Feb 19, 2024
1 parent eff9074 commit 16d7e4a
Showing 9 changed files with 727 additions and 603 deletions.
11 changes: 9 additions & 2 deletions .gitignore
@@ -179,6 +179,13 @@ advanced_tutorials/citibike/data/__MACOSX/._202304-citibike-tripdata.csv
advanced_tutorials/citibike/data/__MACOSX/._202305-citibike-tripdata.csv
loan_approval/lending_model/roc_curve.png
advanced_tutorials/timeseries/price_model/model_prediction.png

advanced_tutorials/recommender-system/query_model/variables/variables.index
advanced_tutorials/recommender-system/query_model/variables/variables.data-00000-of-00001
advanced_tutorials/recommender-system/query_model/saved_model.pb
advanced_tutorials/recommender-system/query_model/fingerprint.pb
advanced_tutorials/recommender-system/candidate_model/variables/variables.index
advanced_tutorials/recommender-system/candidate_model/variables/variables.data-00000-of-00001
advanced_tutorials/recommender-system/candidate_model/fingerprint.pb
advanced_tutorials/recommender-system/candidate_model/saved_model.pb
integrations/neo4j/aml_model/*
integrations/neo4j/aml_model_transformer.py
integrations/neo4j/aml_model_transformer.py
58 changes: 46 additions & 12 deletions advanced_tutorials/recommender-system/1_feature_engineering.ipynb
@@ -10,17 +10,17 @@
"\n",
"**Your Python Jupyter notebook should be configured for >8GB of memory.**\n",
"\n",
"In this series of tutorials, we will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why we also train a ranking model that can afford to use more features than the retrieval model.\n",
"In this series of tutorials, you will build a recommender system for fashion items. It will consist of two models: a *retrieval model* and a *ranking model*. The idea is that the retrieval model should be able to quickly generate a small subset of candidate items from a large collection of items. This comes at the cost of granularity, which is why you also train a ranking model that can afford to use more features than the retrieval model.\n",
"\n",
"### <span style=\"color:#ff5f27\">✍🏻 Data</span>\n",
"\n",
"We will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"You will use data from the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations) Kaggle competition.\n",
"\n",
"<!-- https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data\n",
"\n",
"For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customer who did not make any purchase during that time are excluded from the scoring. -->\n",
"\n",
"The full dataset contains images of all products, but here we will simply use the tabular data. We have three data sources:\n",
"The full dataset contains images of all products, but here you will simply use the tabular data. You have three data sources:\n",
"- `articles.csv`: info about fashion items.\n",
"- `customers.csv`: info about users.\n",
"- `transactions_train.csv`: info about transactions.\n"
@@ -75,7 +75,31 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## <span style=\"color:#ff5f27\">🗄️ Read Articles Data</span>"
"## <span style=\"color:#ff5f27\">🗄️ Read Articles Data</span>\n",
"\n",
"The **article_id** and **product_code** serve different purposes in the context of H&M's product database:\n",
"\n",
"- **Article ID**: This is a unique identifier assigned to each individual article within the database. It is typically used for internal tracking and management purposes. Each distinct item or variant of a product (e.g., different sizes or colors) would have its own unique article_id.\n",
"\n",
"- **Product Code**: This is also a unique identifier, but it is associated with a specific product or style rather than individual articles. It represents a broader category or type of product within H&M's inventory. Multiple articles may share the same product code if they belong to the same product line or style.\n",
"\n",
"While both are unique identifiers, the article_id is specific to individual items, whereas the product_code represents a broader category or style of product.\n",
"\n",
"Here is an example:\n",
"\n",
"**Product: Basic T-Shirt**\n",
"\n",
"- **Product Code:** TS001\n",
"\n",
"- **Article IDs:**\n",
" - Article ID: 1001 (Size: Small, Color: White)\n",
" - Article ID: 1002 (Size: Medium, Color: White)\n",
" - Article ID: 1003 (Size: Large, Color: White)\n",
" - Article ID: 1004 (Size: Small, Color: Black)\n",
" - Article ID: 1005 (Size: Medium, Color: Black)\n",
"\n",
"In this example, \"TS001\" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.\n",
"\n"
]
},
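> Editor's note: a hypothetical pandas illustration of the article_id / product_code relationship described in the new markdown cell above. The `size` and `color` columns are made up for this example; only the one-to-many mapping is the point.

```python
import pandas as pd

# Hypothetical illustration: five articles (size/colour variants)
# sharing a single product code, mirroring the "Basic T-Shirt" example.
variants = pd.DataFrame({
    "article_id":   [1001, 1002, 1003, 1004, 1005],
    "product_code": ["TS001"] * 5,
    "size":         ["Small", "Medium", "Large", "Small", "Medium"],
    "color":        ["White", "White", "White", "Black", "Black"],
})

# One product code maps to many article ids.
print(variants.groupby("product_code")["article_id"].nunique())  # TS001 -> 5
```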
{
@@ -176,7 +200,7 @@
"metadata": {},
"outputs": [],
"source": [
"trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:600000]\n",
"trans_df = pd.read_parquet('https://repo.hops.works/dev/jdowling/transactions_train.parquet')[:1_000_000]\n",
"print(trans_df.shape)\n",
"trans_df.head(3)"
]
@@ -199,7 +223,7 @@
"source": [
"## <span style=\"color:#ff5f27\">👨🏻‍🏭 Transactions Feature Engineering</span>\n",
"\n",
"The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, we will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), we'll map each month to the unit circle using sine and cosine."
"The time of the year a purchase was made should be a strong predictor, as seasonality plays a big factor in fashion purchases. Here, you will use the month of the purchase as a feature. Since this is a cyclical feature (January is as close to December as it is to February), you'll map each month to the unit circle using sine and cosine."
]
},
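> Editor's note: a short sketch of the cyclical month encoding described in the cell above, assuming a `t_dat` date column as in the H&M transactions data.

```python
import numpy as np
import pandas as pd

# Sketch of the cyclical encoding, assuming a 't_dat' date column.
trans_df = pd.DataFrame({"t_dat": pd.to_datetime(["2020-01-15", "2020-07-01", "2020-12-24"])})
month = trans_df["t_dat"].dt.month

# Map the month onto the unit circle so January and December end up close together.
trans_df["month_sin"] = np.sin(2 * np.pi * month / 12)
trans_df["month_cos"] = np.cos(2 * np.pi * month / 12)
```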
{
@@ -225,7 +249,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that we have a large dataset. For the sake of the tutorial, we will use a small subset of this dataset, which we generate by sampling 25'000 customers and using their transactions."
"You can see that you have a large dataset. For the sake of the tutorial, you will use a small subset of this dataset, which you generate by sampling 25'000 customers and using their transactions."
]
},
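> Editor's note: a hedged sketch of the customer subsampling step described above. It assumes the transactions frame has a `customer_id` column; the exact sampling logic in the notebook may differ.

```python
# Sample 25,000 customers and keep only their transactions.
SAMPLE_SIZE = 25_000
sampled_customers = (
    trans_df["customer_id"]
    .drop_duplicates()
    .sample(SAMPLE_SIZE, random_state=42)
)
trans_df = trans_df[trans_df["customer_id"].isin(sampled_customers)]
```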
{
@@ -386,14 +410,14 @@
"\n",
"A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features.\n",
"\n",
"Before we can create a feature group we need to connect to our feature store."
"Before you can create a feature group you need to connect to your feature store."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To create a feature group we need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
"To create a feature group you need to give it a name and specify a primary key. It is also good to provide a description of the contents of the feature group."
]
},
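> Editor's note: for illustration, a minimal sketch of connecting to Hopsworks and declaring such a feature group. The feature group name, primary key, and description here are assumptions and may not match the notebook exactly.

```python
import hopsworks

# Connect to the project and its feature store.
project = hopsworks.login()
fs = project.get_feature_store()

# Declare a feature group with a name, primary key, and description.
trans_fg = fs.get_or_create_feature_group(
    name="transactions",
    version=1,
    description="Transactions data: customer, article, price, and purchase date",
    primary_key=["customer_id", "article_id"],
    online_enabled=True,
)
```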
{
@@ -416,9 +440,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"Here you have also set `online_enabled=True`, which enables low latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).\n",
"\n",
"At this point, we have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent we populate it with its associated data using the `save` function."
"At this point, you have only specified some metadata for the feature group. It does not store any data or even have a schema defined for the data. To make the feature group persistent you populate it with its associated data using the `insert` method."
]
},
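> Editor's note: and a one-line sketch of persisting the data with `insert`, continuing the hypothetical `trans_fg` from the sketch above.

```python
# Populate the (hypothetical) feature group with its associated DataFrame.
# This defines the schema from the data and, with online_enabled=True,
# also makes the rows available for low-latency online reads.
trans_fg.insert(trans_df)
```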
{
@@ -565,7 +589,17 @@
" trans_fg, \n",
" articles_fg, \n",
" customers_fg,\n",
")"
")\n",
"ranking_df.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ranking_df.label.value_counts()"
]
},
{