[FSTORE-978] Hopsworks recommender system tutorial (#193)

* recommender-system
logicalclocks · Sep 12, 2023 · e0812e8
1 parent a9ebc34
commit e0812e8

Showing 14 changed files with 5,625 additions and 0 deletions.
diff --git a/advanced_tutorials/recommender-system/1_feature_engineering.ipynb b/advanced_tutorials/recommender-system/1_feature_engineering.ipynb
diff --git a/advanced_tutorials/recommender-system/2a_create_retrieval_dataset.ipynb b/advanced_tutorials/recommender-system/2a_create_retrieval_dataset.ipynb
@@ -0,0 +1,232 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Connected. Call `.close()` to terminate connection gracefully.\n",
+      "\n",
+      "Logged in to project, explore it here https://hopsworks0.logicalclocks.com/p/119\n"
+     ]
+    }
+   ],
+   "source": [
+    "import hopsworks\n",
+    "\n",
+    "project = hopsworks.login()  # insert API Key from https://app.hopsworks.ai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create Retrieval Dataset\n",
+    "\n",
+    "In this notebook, we'll create a dataset for our retrieval model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Connected. Call `.close()` to terminate connection gracefully.\n"
+     ]
+    }
+   ],
+   "source": [
+    "fs = project.get_feature_store()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Feature Selection\n",
+    "\n",
+    "First, we'll load the feature groups we created in the previous tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trans_fg = fs.get_feature_group(\"transactions\", version=1)\n",
+    "customers_fg = fs.get_feature_group(\"customers\", version=1)\n",
+    "articles_fg = fs.get_feature_group(\"articles\", version=1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We'll need to join these three data sources to make the data compatible with our retrieval model. Recall that each row in the `transactions` feature group records which customer bought which item. We'll join this feature group with the `customers` and `articles` feature groups to add customer and item features to each row."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = trans_fg.select([\"customer_id\", \"article_id\", \"t_dat\", \"month_sin\", \"month_cos\"])\\\n",
+    "    .join(customers_fg.select([\"age\"]), on=\"customer_id\")\\\n",
+    "    .join(articles_fg.select([\"garment_group_name\", \"index_group_name\"]), on=\"article_id\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Feature View Creation\n",
+    "In Hopsworks, you write features to feature groups (where the features are stored) and you read features from feature views. A feature view is a logical view over features stored in one or more feature groups, and it typically contains the features used by a specific model. This way, features stored in different feature groups can be reused across many different models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Transformation functions available:\n",
+      "- min_max_scaler - version: 1\n",
+      "- standard_scaler - version: 1\n",
+      "- label_encoder - version: 1\n",
+      "- month_sin - version: 1\n",
+      "- robust_scaler - version: 1\n",
+      "- month_cos - version: 1\n"
+     ]
+    }
+   ],
+   "source": [
+    "# explore available transformation functions\n",
+    "\n",
+    "print(\"Transformation functions available:\")\n",
+    "for tr_fn in fs.get_transformation_functions():\n",
+    "    print(\"- \" + tr_fn.name + \" - version: \" + str(tr_fn.version))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Feature view created successfully, explore it at \n",
+      "https://hopsworks0.logicalclocks.com/p/119/fs/67/fv/retrieval/version/1\n"
+     ]
+    }
+   ],
+   "source": [
+    "month_to_sin = fs.get_transformation_function(name=\"month_sin\", version=1)\n",
+    "month_to_cos = fs.get_transformation_function(name=\"month_cos\", version=1)\n",
+    "\n",
+    "feature_view = fs.create_feature_view(\n",
+    "    name='retrieval',\n",
+    "    query=query,\n",
+    "    transformation_functions={\n",
+    "        \"month_sin\": month_to_sin,\n",
+    "        \"month_cos\": month_to_cos,\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To view and explore the data in the feature view, we can retrieve batch data with the `get_batch_data()` method."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Training Dataset Creation\n",
+    "\n",
+    "Finally, we can create our training dataset, split into train, validation, and test sets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Training dataset job started successfully, you can follow the progress at \n",
+      "https://hopsworks0.logicalclocks.com/p/119/jobs/named/retrieval_1_create_fv_td_10072023185611/executions\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "VersionWarning: Incremented version to `1`.\n"
+     ]
+    }
+   ],
+   "source": [
+    "feature_view = fs.get_feature_view(\"retrieval\", version=1)\n",
+    "\n",
+    "td_version, td_job = feature_view.create_train_validation_test_split(\n",
+    "    validation_size=0.1,\n",
+    "    test_size=0.1,\n",
+    "    description='Retrieval dataset splits',\n",
+    "    data_format='csv',\n",
+    "    write_options={'wait_for_job': True},\n",
+    "    coalesce=True\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Next Steps\n",
+    "\n",
+    "In the next notebook, we'll train a model on the dataset we created in this notebook."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
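
Note on the join in `2a_create_retrieval_dataset.ipynb`: the effect of chaining `.join(...)` on `customer_id` and then on `article_id` can be pictured with a small pure-Python sketch. This is only an illustration under stated assumptions, not how Hopsworks executes the query: the rows and their values are made up, the helper `join_on` is hypothetical, and it assumes inner-join semantics with one row per key on the right side. The join type Hopsworks actually applies depends on the `join_type` argument and library version.

```python
# Toy rows standing in for the three feature groups (all values are made up).
transactions = [
    {"customer_id": "c1", "article_id": "a1", "t_dat": "2020-09-01"},
    {"customer_id": "c2", "article_id": "a2", "t_dat": "2020-09-02"},
    {"customer_id": "c1", "article_id": "a2", "t_dat": "2020-09-03"},
]
customers = [
    {"customer_id": "c1", "age": 31},
    {"customer_id": "c2", "age": 45},
]
articles = [
    {"article_id": "a1", "garment_group_name": "Jersey Basic",
     "index_group_name": "Ladieswear"},
    {"article_id": "a2", "garment_group_name": "Knitwear",
     "index_group_name": "Menswear"},
]


def join_on(left_rows, right_rows, key):
    """Hypothetical helper: inner-join two lists of dicts on `key`.

    Assumes each key value appears at most once in `right_rows`.
    """
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            # Copy the right-hand features onto the transaction row.
            joined.append({**row, **match})
    return joined


# Each transaction row gains the customer's and the article's features.
retrieval_rows = join_on(
    join_on(transactions, customers, "customer_id"),
    articles,
    "article_id",
)

for row in retrieval_rows:
    print(row)
```

Each output row keeps the transaction columns and gains `age`, `garment_group_name`, and `index_group_name`, which is the shape the retrieval model's training rows take after the feature view query.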