The goal of this repository is to provide you with a kickstart for your personalization journey when building a recommendation engine based on Amazon Personalize.
It serves as a reference implementation from which you can learn both the concepts and the integration aspects of Amazon Personalize. You can also use it to build your own recommendation engine on top of best practices and production-ready components based on the AWS CDK. Those components include:
- Ready-to-use Amazon SageMaker notebook instances for your data exploration, analysis and feature engineering tasks.
As part of the AWS AI services, Amazon Personalize is a managed service that provides personalization and recommendations based on the same technology used at Amazon.com. With Amazon Personalize you get convenient APIs you can use to solve specific personalization and recommendation business problems.
Amazon Personalize is a fully managed service. It generates highly relevant recommendations using deep learning techniques and builds custom, private ML models using your own data.
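To give you a feeling for those APIs: once a solution is trained and deployed behind a campaign, fetching recommendations is a single call. The following is a minimal sketch using the AWS SDK for JavaScript v3; the region, campaign ARN and user id are placeholders and not values from this project:

import {
  PersonalizeRuntimeClient,
  GetRecommendationsCommand,
} from "@aws-sdk/client-personalize-runtime";

const client = new PersonalizeRuntimeClient({ region: "eu-west-1" });

// Placeholder campaign ARN and user id, replace with your own values.
const response = await client.send(
  new GetRecommendationsCommand({
    campaignArn: "arn:aws:personalize:eu-west-1:123456789012:campaign/my-campaign",
    userId: "42",
    numResults: 10,
  })
);

console.log(response.itemList); // recommended item ids with relevance scores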
In order to install the backend on your local machine, please ensure the following tools are available:
- AWS CLI v2
- node.js v14 (LTS)
- npm v6 (bundled with node.js)
First, install the required dependencies using npm:
npm install
This will install everything you need to deploy the backend and make changes to the code. It also includes all required dev dependencies like ESLint, commitlint and Jest.
Please also follow the "Getting started" instructions in the official AWS CDK documentation to configure your local workstation accordingly. Run the following command to verify the installation and print the version number of the AWS CDK:
cdk --version
If you do not want to install the CDK globally, you can also execute it via npx, as shown in the following examples.
Once you have installed the AWS CDK successfully, you can bootstrap your AWS account by running
npx cdk bootstrap
- npm run build   compile TypeScript to JavaScript
- npm run watch   watch for changes and compile
- npm run test    perform the Jest unit tests
- cdk deploy      deploy this stack to your default AWS account/region
- cdk diff        compare the deployed stack with the current state
- cdk synth       emits the synthesized CloudFormation template
In order to deploy your kickstart stack, you simply have to execute a cdk deploy command. The CDK application follows a multi-stack approach: you can choose whether you want to provision a sandbox stack in your developer accounts or a shared stack in your central environment accounts like dev, staging or production accounts.
Here is an example of how to provision a sandbox stack for a developer called bob:
cdk deploy personalize-bob --context sandbox=bob
If you want to deploy a shared development stack, you can run
cdk deploy personalize-dev
💡 Please adapt the stack setup in personalize.ts to your needs and environment by adding your individual stacks.
The SageMaker notebook component provides an automatically provisioned Amazon SageMaker notebook instance for your data analysis and data exploration tasks. Provisioning a SageMaker notebook is optional and not required in all stages and cases. In centrally provisioned dev, staging or production accounts a SageMaker notebook is usually not necessary, but it will help you in your developer sandbox accounts or stacks. Hence you can set stack properties to control in which cases a SageMaker notebook is provisioned and in which cases it is not:
// Example from personalize.ts: provision a developer sandbox stack with a notebook
new PersonalizeStack(app, `personalize-${sandbox}`, {
  provisionSagemakerNotebook: true, // set to false to skip the notebook instance
  notebookInstanceName: `personalize-${sandbox}`,
  notebookVolumeSizeInGb: 64,
});
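For reference, here is a sketch of how the sandbox context value from the cdk deploy commands above could be resolved in personalize.ts. It assumes CDK v2; the import path and the shared stack configuration are assumptions, not the exact setup of this repository:

import { App } from "aws-cdk-lib";
import { PersonalizeStack } from "./lib/personalize-stack"; // path is an assumption

const app = new App();

// "cdk deploy personalize-bob --context sandbox=bob" makes the value available here
const sandbox = app.node.tryGetContext("sandbox");

if (sandbox) {
  // developer sandbox stack including a SageMaker notebook
  new PersonalizeStack(app, `personalize-${sandbox}`, {
    provisionSagemakerNotebook: true,
    notebookInstanceName: `personalize-${sandbox}`,
    notebookVolumeSizeInGb: 64,
  });
} else {
  // shared stack, e.g. for dev, staging or production, without a notebook
  new PersonalizeStack(app, "personalize-dev", {
    provisionSagemakerNotebook: false,
  });
}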
We want to build a recommendation engine covering all features of Amazon Personalize. The dataset we will use is the publicly available MovieLens dataset.
GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://movielens.org). The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.
Source: https://grouplens.org/datasets/movielens/
The MovieLens dataset contains 25 million movie ratings and a rich set of movie metadata. We will use this data to provide an initial version of our recommendation engine based on historical data. Before you simply import your historical data, it is recommended to gather knowledge about both your data and your business domain. Every recommendation engine project is unique with regard to the data that has to be processed and the way the business works. In a very first step during a proof-of-concept phase, it is all about finding answers to:
- What data can we use?
- What data do we need?
- Is our data quality sufficient?
- How do we access the required data?
- How do we identify the users, interactions or items we want to recommend?
The data exploration notebook contains analyses inspired both by my personal experience and by great work from the open source community, such as the following:
- Analysis on the Movie Lens dataset using pandas
- Amazon Personalize immersion day
- Analysis of MovieLens dataset (Beginner'sAnalysis)
- Data Analysis using the MovieLens dataset with pandas
- Comprehensive Data Visualization with Matplotlib
Amazon Personalize in general is easy to use. It offers several APIs to get data into and out of Amazon Personalize. Personalize organizes your data into three containers, so-called datasets:
- Interactions
- Items
- Users
Each dataset is described by a schema that defines its structure in terms of fields, datatypes and additional metadata so that Amazon Personalize can interpret your data. Based on past experience with several use cases, it is likely that your raw data does not fit 100% to the schema you define for your recommendation engine. This results in the need to prepare your data before you are able to import it into Amazon Personalize.
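To illustrate the schema format, the following sketch creates a minimal interactions schema via the AWS SDK for JavaScript v3. USER_ID, ITEM_ID and TIMESTAMP are the mandatory fields for the interactions dataset; the schema name is a placeholder, and additional fields such as EVENT_TYPE or EVENT_VALUE are optional:

import { PersonalizeClient, CreateSchemaCommand } from "@aws-sdk/client-personalize";

// Minimal Avro schema for the interactions dataset.
const interactionsSchema = {
  type: "record",
  name: "Interactions",
  namespace: "com.amazonaws.personalize.schema",
  fields: [
    { name: "USER_ID", type: "string" },
    { name: "ITEM_ID", type: "string" },
    { name: "TIMESTAMP", type: "long" },
  ],
  version: "1.0",
};

const client = new PersonalizeClient({ region: "eu-west-1" });
await client.send(
  new CreateSchemaCommand({
    name: "movielens-interactions-schema", // placeholder name
    schema: JSON.stringify(interactionsSchema),
  })
);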
The complexity of the data preparation tasks and the amount of work needed to prepare your data differ from use case to use case. To prepare your data you have plenty of options across several AWS services:
- You can set up an AWS Glue based ETL pipeline.
- You can use AWS Glue DataBrew to normalize and transform your data.
- You can follow a streaming-based approach by using an Amazon Kinesis data stream with an AWS Lambda function as a consumer and import your data incrementally.
A lot of options are on the table. In this project we start with a simple but efficient approach to prepare the MovieLens raw data using Amazon Athena. Our goal is to convert the MovieLens raw data into a format compatible with the Amazon Personalize schemas. We can then use this data for an initial model training via a bulk import (see the sketch after the following list). Amazon Personalize defines three requirements for bulk importing your data:
- Your data must be in CSV format
- Your data must be readable via Amazon S3
- The data must match the schema you have defined for your datasets.
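As a sketch of what such a bulk import could look like, the following starts a dataset import job via the AWS SDK for JavaScript v3. The ARNs and the S3 path are placeholders; the IAM role must have read access to the bucket:

import {
  PersonalizeClient,
  CreateDatasetImportJobCommand,
} from "@aws-sdk/client-personalize";

const client = new PersonalizeClient({ region: "eu-west-1" });

// Placeholder ARNs and S3 location, replace with the values of your environment.
await client.send(
  new CreateDatasetImportJobCommand({
    jobName: "movielens-interactions-import",
    datasetArn: "arn:aws:personalize:eu-west-1:123456789012:dataset/movielens/INTERACTIONS",
    dataSource: { dataLocation: "s3://my-prepared-data-bucket/interactions/" },
    roleArn: "arn:aws:iam::123456789012:role/personalize-s3-access",
  })
);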
Given the simplicity of the MovieLens data, implementing this with Amazon Athena is very straightforward.
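For the interactions dataset, the mapping can essentially be expressed as a single SELECT statement. The following sketch registers it as an Athena named query with the CDK (assuming CDK v2); the database, table and column names are assumptions based on how a Glue crawler typically catalogs the MovieLens ratings file:

import { CfnNamedQuery } from "aws-cdk-lib/aws-athena";

// Inside your stack class: map the crawled ratings table to the interactions schema.
// Athena writes the query result as CSV to the workgroup output location.
new CfnNamedQuery(this, "InteractionsPreparationQuery", {
  database: "movielens",
  name: "interactions-preparation",
  queryString: `
    SELECT userid    AS USER_ID,
           movieid   AS ITEM_ID,
           timestamp AS TIMESTAMP
    FROM ratings
  `,
});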
Our MovieLens raw data is available in our raw data bucket in Amazon S3. An AWS Glue crawler is connected to this bucket in order to create an AWS Glue metadata catalog. The metadata catalog can then be queried using Amazon Athena. An Athena workgroup configures the results to be written to an Amazon S3 bucket for your prepared data.
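A sketch of this wiring with the CDK could look as follows (assuming CDK v2); the bucket names, database name and crawler role are placeholders:

import { CfnCrawler } from "aws-cdk-lib/aws-glue";
import { CfnWorkGroup } from "aws-cdk-lib/aws-athena";

// Inside your stack class: crawler that catalogs the ratings folder of the raw data bucket.
new CfnCrawler(this, "RatingsCrawler", {
  name: "movielens-ratings-crawler",
  role: "arn:aws:iam::123456789012:role/glue-crawler-role", // needs read access to the raw bucket
  databaseName: "movielens",
  targets: { s3Targets: [{ path: "s3://my-raw-data-bucket/ratings/" }] },
});

// Workgroup that writes Athena query results to the prepared data bucket.
new CfnWorkGroup(this, "PreparationWorkGroup", {
  name: "personalize-preparation",
  workGroupConfiguration: {
    resultConfiguration: {
      outputLocation: "s3://my-prepared-data-bucket/athena-results/",
    },
  },
});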
In order to automate the whole process of crawling and querying, all steps are baked into an AWS Step Functions state machine.
Each dataset - like movies or ratings - is persisted in a separate folder inside the raw data bucket. This makes crawling the data much easier as we get separate data catalogs for ratings and movies. Raw data can optionally be partitioned by date, for example.
According to our data exploration, the movies data maps to our future items dataset and the ratings will be the foundation for our interactions dataset. Hence we have two crawlers that are actively used in our data preparation workflow.
The solution also creates a crawler for users data, but due to the absence of raw user data it is not used.
The whole process of crawling and preparing the raw data is automated by an AWS Step Functions state machine. The state machine orchestrates the data preparation for all three datasets, while only the processing of interactions is strictly required. Preparing the items and users datasets is optional as they are not required by Amazon Personalize.
Crawling can be bypassed in each individual state machine execution. In order to execute the state machine, the following execution input parameters must be provided:
{
"interactionsPreparationQuery": "strin",
"itemsPreparationQuery": "string",
"usersPreparationQuery": "string",
"itemsCrawlerName": "string",
"interactionsCrawlerName": "string",
"usersCrawlerName": "string",
"bypassCrawler": "boolean"
}
Both the required IDs of the named Athena queries and the names of the AWS Glue crawlers can be copied from the output section of the CloudFormation stack.
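For example, an execution could be started via the AWS SDK for JavaScript v3 like this; the state machine ARN, query IDs and crawler names are placeholders to be replaced with the stack outputs:

import { SFNClient, StartExecutionCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({ region: "eu-west-1" });

// Placeholder values, copy the real ones from the CloudFormation stack outputs.
await sfn.send(
  new StartExecutionCommand({
    stateMachineArn: "arn:aws:states:eu-west-1:123456789012:stateMachine:data-preparation",
    input: JSON.stringify({
      interactionsPreparationQuery: "<named-query-id>",
      itemsPreparationQuery: "<named-query-id>",
      usersPreparationQuery: "<named-query-id>",
      itemsCrawlerName: "<crawler-name>",
      interactionsCrawlerName: "<crawler-name>",
      usersCrawlerName: "<crawler-name>",
      bypassCrawler: false,
    }),
  })
);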
Contributions, issues and feature requests are welcome!
Feel free to check the issues page for open issues, or create one for feature requests or general questions.
Also be sure to review our contributing guidelines and code of conduct.