A simple machine learning system capable of recommending songs given a user as a query using collaborative filtering and TensorFlow.
Unlike classic matrix factorization approaches, using a neural network allows user and item features to be included during training.
This example covers how distributed data preprocessing, training, and serving can be done on Google Cloud Platform(GCP).
Further reading:
- Neural Collaborative Filtering: A paper on using neural networks instead of matrix factorization to perform collaborative filtering.
- Deep Neural Networks for YouTube Recommendations: Youtube's approach to recommending millions of videos at low latencies by first generating candidates from multiple models and ranking the candidate pool.
For a fully managed service, check out Recommendations AI.
Create a new project on GCP and set up GCP credentials:
gcloud auth login
gcloud auth application-default login
Enable the following APIS:
Using the preprocessing/config.example.ini
template, create
preprocessing/config.ini
with the GCP project id fields filled in.
Additionally, you will need to create a GCS bucket. This code assumes a bucket
exists by the name of [project-id]-bucket
.
Set up your python environment:
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
The data preprocessing pipeline uses the ListenBrainz dataset hosted on Cloud Marketplace. Data is processed and written to Google Cloud Storage(GCS) as TFRecords.
These files are read using Cloud DataFlow. The steps involved are as follows:
- Read the data in using the BigQuery query found here. This query cleans the features and creates a label for each unique user-item pair that exists. This label is 1 if a user has listened to a song more than twice and 0 otherwise. Samples are also given weights based on how many interactions there were between the user and item.
- Using TensorFlow Transform, map each username and product id to an integer value and write the vocabularies to text files. Leave users and items under a set frequency threshold out of the vocabularies.
- Filter away user-item pairs where either element is outside of its corresponding vocabulary.
- Split the data into train, validation, and test sets.
- Write each dataset as TFRecords to GCS.
Command | Description |
---|---|
bin/run.preprocess.local.sh |
Process a sample of the data locally and write outputs to a local directory. |
bin/run.preprocess.cloud.sh |
Process the data on GCP using DataFlow and write outputs to a GCS bucket. |
bin/run.test.sh |
Run unit tests for the preprocessing pipeline. |
A Custom Estimator is trained using TensorFlow and Cloud AI Platform(CAIP). The training steps are as follows:
- Read TFRecords from GCS and create a
tf.data.Dataset
for each of them that yields data in batches. - Use the TensorFlow Transform output from preprocessing to transform usernames and product ids into int ids.
- Use
user_id
s anditem_id
s to train embeddings. - Add item features and create two input layers: one with user embedding vectors, and another with the concatenation of item embedding vectors and item features.
- Create a user neural net and item neural net from the input layers, ensuring that the final layers are the same size.
- Compute the cosine similarity between the final layers of the user and item nets. Take the absolute value to get a value between 0 and 1.
- Calculate error using log loss and train the model.
- Evaluate the model performance by sampling 1000 random items and calculating the average recall@k when each positive sample's item is ranked against these random items for the sample's user.
- Export a
SavedModel
for use in serving.
Training job scripts expect the following argument:
MODEL_INPUTS_DIR
: The directory containing the TFRecords from preprocessing.
Command | Description |
---|---|
bin/run.train.local.sh |
Train the model locally and save model checkpoints to a local model dir. |
bin/run.train.cloud.sh |
Train the model on CAIP and save model checkpoints to a GCS bucket. |
bin/run.train.tune.sh |
Train the model on CAIP as above, but using hyperparameter tuning. |
Note: SCALE_TIER
is set to STANDARD_1
to demonstrate distributed training.
However, it can be set to BASIC
to reduce costs. See scale
tiers.
Model training can be monitored on Tensorboard using the following command:
tensorboard --logdir <path to model dir>/<trial number>
Tensorboard's projector, in particular, is very useful for debugging
or analyzing embeddings. In the projector tab in Tensorboard, try setting the
label to name
.
Models can be hosted on CAIP, which can be used to make online and batch predictions via JSON requests.
- Upload the
SavedModel
from training to CAIP. - Using a file containing a list of usernames, create inputs to pass to the model hosted on CAIP for predictions.
- Make the predictions.
The cloud serving job and prediction job scripts expect the same argument:
MODEL_OUTPUTS_DIR
: The model directory containing each model trial.TRIAL
(optional): The trial number to use. The local serving job expects no arguments, and the local prediction job expects the model version number.
Command | Description |
---|---|
bin/run.serve.local.sh |
Upload a new version of the recommender model to CAIP using a locally trained model. |
bin/run.serve.cloud.sh |
Upload a new version of the recommender model to CAIP using a model trained on CAIP. |
bin/run.predict.local.sh |
Using serving/test.json , create a prediction job on CAIP after using the local serving script. |
bin/run.predict.cloud.sh |
Using serving/test.json , create a prediction job on CAIP after using the cloud sering script. |