The MovieLens sample demonstrates how to build personalized recommendation models that recommend movies to users based on movie ratings data from the MovieLens 20M dataset.
- Make sure you follow the Google Cloud ML setup here before trying the sample. More documentation about Cloud ML is available here.
- Make sure your Google Cloud project has sufficient quota.
Install dependencies by running pip install -r requirements.txt
The MovieLens dataset is available in several sizes; here we focus on the 20M dataset, which can be downloaded here.
When running locally, please use the MovieLens small dataset.
The dataset contains several files in CSV format. The following files are used in the sample:
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
userId, movieId, rating, timestamp
The lines within this file are ordered first by userId, then, within user, by movieId.
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
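As a minimal sketch of reading this format, the snippet below parses rating rows with Python's standard csv module and converts each timestamp to a UTC datetime. The Rating type and the parse_ratings helper are hypothetical names used for illustration and are not part of the sample; the data row is made up.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


def parse_ratings(lines: Iterable[str]) -> Iterator[Rating]:
    """Yield one Rating per data row, skipping the header row."""
    reader = csv.reader(lines)
    next(reader)  # header: userId,movieId,rating,timestamp
    for user_id, movie_id, rating, timestamp in reader:
        yield Rating(
            int(user_id),
            int(movie_id),
            float(rating),
            # Timestamps are seconds since 1970-01-01 00:00 UTC.
            datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
        )


# Illustrative row only, not taken from the dataset.
sample = io.StringIO("userId,movieId,rating,timestamp\n1,2,3.5,86400\n")
ratings = list(parse_ratings(sample))
print(ratings[0])
```

The same helper should work on an open file handle for ratings.csv, since csv.reader accepts any iterable of lines.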
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:
movieId,title,genres
Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
Genres are a pipe-separated list, and are selected from the following:
- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
- (no genres listed)
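The format above can be read with a few lines of Python. This is a sketch only: parse_movie and YEAR_SUFFIX are hypothetical names, the rows are invented for illustration, and the year pattern assumes the release year is the last parenthesized group in the title, which the note about inconsistencies suggests will not always hold.

```python
import csv
import io
import re

# Assumption: the release year appears as "(YYYY)" at the end of the title.
YEAR_SUFFIX = re.compile(r"\((\d{4})\)\s*$")


def parse_movie(row):
    """Turn a [movieId, title, genres] row into a dict."""
    movie_id, title, genres = row
    match = YEAR_SUFFIX.search(title)
    return {
        "movie_id": int(movie_id),
        "title": title,
        # Titles may be inconsistent, so the year is optional.
        "year": int(match.group(1)) if match else None,
        "genres": [] if genres == "(no genres listed)" else genres.split("|"),
    }


# Illustrative rows only, not taken from the dataset.
sample = io.StringIO(
    "movieId,title,genres\n"
    "1,Example Movie (1999),Action|Sci-Fi\n"
    '2,"Untitled, Example",(no genres listed)\n'
)
reader = csv.reader(sample)
next(reader)  # skip the header row
movies = [parse_movie(row) for row in reader]
print(movies)
```

Using csv.reader rather than splitting on commas matters here because titles can themselves contain commas, in which case the field is quoted.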
Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:
movieId, imdbId, tmdbId
movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1. imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/. tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
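To illustrate how the three identifiers map to URLs, here is a small sketch. The movie_links helper is a hypothetical name, and zero-padding the IMDb id to seven digits is an assumption inferred from the Toy Story example above (tt0114709).

```python
def movie_links(movie_id, imdb_id, tmdb_id):
    """Build the three external URLs for one row of links.csv."""
    return {
        "movielens": f"https://movielens.org/movies/{int(movie_id)}",
        # Assumption: IMDb title ids are zero-padded to 7 digits.
        "imdb": f"http://www.imdb.com/title/tt{int(imdb_id):07d}/",
        "tmdb": f"https://www.themoviedb.org/movie/{int(tmdb_id)}",
    }


# Identifiers for Toy Story, as given in the text above.
links = movie_links("1", "0114709", "862")
print(links)
```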
This sample consists of two parts: data pre-processing and model training steps.
The pre-processing step can be performed either locally or in the cloud, depending on the size of the input.
We will read the above files and convert them into TFRecords format for training.
- Ratings are split into training and evaluation sets based on user_id. The percentage of users that are included in the evaluation set is controlled by the flag --percent_eval.
- For users in the training set, we generate one example for each rating. Each example includes the movie ids and ratings of all movies rated by the same user except the candidate movie.
- Additional negative examples are generated by randomly picking movies not rated by the user and creating an example with a 0-star rating for the movie. The ratio of negatives to positives is controlled by the flag --negative_sample_ratio. A reasonable value of 1-3 means we want 1-3 times as many negative examples as positives.
- For users in the eval set, we follow the same process as in training, except that we do not generate negative examples. When eval_type is ranking, movies with ratings below --eval_score_threshold are removed from the evaluation dataset.
- For each example in the evaluation set, a set of ranking candidates can be generated.
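The steps above can be sketched in plain Python. This is not the code in preprocess.py: is_eval_user and build_examples are hypothetical helpers, the hash-based split is just one way to partition by user id, and giving negative examples the user's full rating history is an assumption the text does not confirm.

```python
import hashlib
import random


def is_eval_user(user_id, percent_eval, seed=0):
    """Deterministically assign a user to the eval set by hashing the id.

    Assumption: a hash-based split; the sample may partition differently.
    """
    digest = hashlib.md5(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent_eval


def build_examples(user_ratings, all_movie_ids, negative_sample_ratio=1,
                   negative_sample_label=0.0, add_negatives=True, rng=None):
    """Build examples for one user from a {movie_id: rating} dict."""
    rng = rng or random.Random(0)
    examples = []
    # One positive example per rating; the history excludes the candidate.
    for candidate, rating in user_ratings.items():
        history = {m: r for m, r in user_ratings.items() if m != candidate}
        examples.append(
            {"history": history, "candidate": candidate, "label": rating})
    if add_negatives:
        # Negatives are drawn from movies the user has not rated.
        unrated = sorted(set(all_movie_ids) - set(user_ratings))
        wanted = int(negative_sample_ratio * len(user_ratings))
        for candidate in rng.sample(unrated, min(wanted, len(unrated))):
            # Assumption: negatives carry the user's full rating history.
            examples.append({"history": dict(user_ratings),
                             "candidate": candidate,
                             "label": negative_sample_label})
    return examples


# Illustrative data only.
user_ratings = {10: 4.0, 20: 5.0}
all_movie_ids = [10, 20, 30, 40, 50]
train_examples = build_examples(user_ratings, all_movie_ids)
eval_examples = build_examples(user_ratings, all_movie_ids,
                               add_negatives=False)
print(len(train_examples), len(eval_examples))
```

Splitting by user rather than by rating keeps all of a user's ratings on one side of the split, so evaluation measures how well the model generalizes to users it has not trained on.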
Run the code as below:
LOCAL_TRAINING_INPUT_DIR=/path/to/ml-latest-small/input...
LOCAL_OUTPUT_DIR=/path/to/output
python preprocess.py --input_dir "$LOCAL_TRAINING_INPUT_DIR" \
--output_dir "$LOCAL_OUTPUT_DIR" \
--percent_eval 20 \
--negative_sample_ratio 1 \
--negative_sample_label 0.0 \
--eval_type ranking \
--eval_score_threshold 4.5 \
--num_ranking_candidate_movie_ids 1000 \
--partition_random_seed 0
First set project, bucket, and data path:
PROJECT=$(gcloud config list project --format "value(core.project)")
BUCKET="gs://${PROJECT}-ml"
GCS_PATH=${BUCKET}/${USER}/movielens
GCS_TRAINING_INPUT_DIR=gs://path/to/ml-20m/input...
We can now run the pre-processing code on cloud as below:
PREPROCESS_OUTPUT="${GCS_PATH}/movielens_$(date +%Y%m%d_%H%M%S)"
python preprocess.py --input_dir "${GCS_TRAINING_INPUT_DIR}" \
--output_dir "${PREPROCESS_OUTPUT}" \
--percent_eval 20 \
--project_id ${PROJECT} \
--negative_sample_ratio 1 \
--negative_sample_label 0.0 \
--eval_type ranking \
--eval_score_threshold 4.5 \
--num_ranking_candidate_movie_ids 1000 \
--partition_random_seed 0 \
--cloud
The model training step takes the pre-processed TFRecords and trains recommendation models such as a matrix factorization model and a deep neural network model.
The matrix factorization model associates each user u with a user-factor vector p_u and each item i with an item-factor vector q_i. The goal is to learn the p_u and q_i that minimize the reconstruction error (L2 loss) between the true and the predicted ratings (as computed from p_u^T * q_i).
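A minimal sketch of this objective for a single (user, item, rating) triple follows. Plain stochastic gradient descent is an assumption here, since the text does not say which optimizer the trainer uses; predict and sgd_step are hypothetical names, and the default hyperparameters simply echo the flag values used in the commands in this document.

```python
def predict(p_u, q_i):
    """Predicted rating: the dot product p_u^T * q_i."""
    return sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(p_u, q_i, rating, learning_rate=0.05, l2_weight_decay=0.01):
    """One gradient step on 0.5 * (r - p.q)^2 + 0.5 * lambda * (|p|^2 + |q|^2)."""
    error = rating - predict(p_u, q_i)
    new_p = [p + learning_rate * (error * q - l2_weight_decay * p)
             for p, q in zip(p_u, q_i)]
    new_q = [q + learning_rate * (error * p - l2_weight_decay * q)
             for p, q in zip(p_u, q_i)]
    return new_p, new_q


# Illustrative 2-dimensional factors and a single 4-star rating.
p_u, q_i = [0.1, 0.1], [0.1, 0.1]
initial_error = abs(4.0 - predict(p_u, q_i))
for _ in range(200):
    p_u, q_i = sgd_step(p_u, q_i, rating=4.0)
final_error = abs(4.0 - predict(p_u, q_i))
print(initial_error, final_error)
```

With weight decay the prediction settles slightly below the true rating, which is the usual trade-off between fitting the data and keeping the factors small.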
For matrix factorization, eval_type can be regression or ranking.
- In regression mode, we predict the rating of a target movie. RMSE (root mean square error) and MAE (mean absolute error) metrics are used to compare model performance.
- In ranking mode, we produce a ranked list of top K recommended movies and evaluate the models using recall@K, precision@K and MAP@K metrics.
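The metrics named above can be written out as follows. This is a sketch of common textbook definitions, not the trainer's implementation; in particular, average precision at K has several variants, and the one here (normalizing by min(number of relevant items, K)) is an assumption. MAP@K is then the mean of this value over all evaluated users.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted ratings."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def average_precision_at_k(ranked, relevant, k):
    """Average of precision@rank over the ranks where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)


# Illustrative values only.
ranked, relevant = [1, 2, 3, 4], {1, 3, 9}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(average_precision_at_k(ranked, relevant, 3))
print(rmse([4.0, 2.0], [3.0, 4.0]), mae([4.0, 2.0], [3.0, 4.0]))
```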
The deep model uses a neural network to learn low-dimensional representations of users and items (i.e., user and item embeddings). The model learns a linear embedding that maps user features, based on rated movie IDs and their genres, to 64-dimensional vectors, which are then passed through 2 hidden layers with ReLU activation functions. The output of the last hidden layer is fed into a softmax layer to predict the most likely item to recommend. During training, the network parameters are learned to minimize the cross-entropy loss between the predicted class and the held-out class (item ID). Note that for the DNN softmax model, eval_type must be set to ranking.
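The forward pass described above can be sketched in pure Python. This is a toy illustration rather than the trainer's TensorFlow model: it uses only rated movie IDs (genres are omitted), averages the movie embeddings as the linear embedding step, and uses tiny layer sizes instead of 64 dimensions. All function names and sizes are hypothetical.

```python
import math
import random


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def dense(v, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(x * w for x, w in zip(v, row)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(num_movies, embedding_dim, hidden_dim, rng):
    return {
        "embedding": random_matrix(num_movies, embedding_dim, rng),
        "hidden": [
            (random_matrix(hidden_dim, embedding_dim, rng), [0.0] * hidden_dim),
            (random_matrix(hidden_dim, hidden_dim, rng), [0.0] * hidden_dim),
        ],
        "output": (random_matrix(num_movies, hidden_dim, rng),
                   [0.0] * num_movies),
    }


def forward(rated_movie_ids, params):
    """Return a probability for each movie given the user's rated movies."""
    emb = params["embedding"]
    # Linear embedding of the user: the mean of the rated movies' embeddings.
    h = [sum(emb[m][d] for m in rated_movie_ids) / len(rated_movie_ids)
         for d in range(len(emb[0]))]
    for weights, bias in params["hidden"]:  # 2 hidden layers with ReLU
        h = relu(dense(h, weights, bias))
    return softmax(dense(h, *params["output"]))


def cross_entropy(probs, target):
    """Loss for one example whose held-out item is `target`."""
    return -math.log(probs[target])


params = make_params(num_movies=6, embedding_dim=4, hidden_dim=8,
                     rng=random.Random(0))
probs = forward([0, 2, 5], params)
loss = cross_entropy(probs, target=3)
print(probs, loss)
```

Training would adjust the parameters to lower this loss across all examples; ranking the movies by probability then gives the top-K recommendations.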
python trainer/task.py --raw_metadata_path "$LOCAL_OUTPUT_DIR/raw_metadata" \
--transform_savedmodel "$LOCAL_OUTPUT_DIR/transform_fn" \
--train_data_paths "$LOCAL_OUTPUT_DIR/features_train*" \
--eval_data_paths "$LOCAL_OUTPUT_DIR/features_eval*" \
--output_path "$LOCAL_OUTPUT_DIR/model/dnn_softmax" \
--train_steps 1000 \
--eval_steps 100 \
--model_type dnn_softmax \
--eval_type ranking \
--l2_weight_decay 0.001 \
--learning_rate 0.01
Example run to train a DNN softmax model:
JOB_ID="movielens_deep_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--config config.yaml \
-- \
--raw_metadata_path "${PREPROCESS_OUTPUT}/raw_metadata" \
--transform_savedmodel "${PREPROCESS_OUTPUT}/transform_fn" \
--eval_data_paths "${PREPROCESS_OUTPUT}/features_eval*.tfrecord.gz" \
--train_data_paths "${PREPROCESS_OUTPUT}/features_train*.tfrecord.gz" \
--output_path "${GCS_PATH}/model/${JOB_ID}" \
--model_type dnn_softmax \
--eval_type ranking \
--l2_weight_decay 0.01 \
--learning_rate 0.05 \
--train_steps 500000 \
--eval_steps 500 \
--top_k_infer 100
Example run to train a matrix factorization model, using hyperparameter tuning:
JOB_ID="movielens_factorization_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--config config_hypertune.yaml \
-- \
--raw_metadata_path "${PREPROCESS_OUTPUT}/raw_metadata" \
--transform_savedmodel "${PREPROCESS_OUTPUT}/transform_fn" \
--eval_data_paths "${PREPROCESS_OUTPUT}/features_eval*.tfrecord.gz" \
--train_data_paths "${PREPROCESS_OUTPUT}/features_train*.tfrecord.gz" \
--output_path "${GCS_PATH}/model/${JOB_ID}" \
--model_type matrix_factorization \
--eval_type ranking \
--l2_weight_decay 0.01 \
--learning_rate 0.05 \
--train_steps 500000 \
--eval_steps 500 \
--movie_embedding_dim 64 \
--top_k_infer 100
Once the model finishes training, we can deploy it to Cloud ML Engine for prediction.
First select a model from the export directory.
gsutil ls -l "${GCS_PATH}/model/${JOB_ID}/export/Servo/"
# Set the model source from the output above up to the timestamp, for example
# MODEL_SOURCE=gs://my-bucket/someuser/movielens/model/movielens_deep_20170704_134621/export/Servo/1499209413002
MODEL_SOURCE=gs://path/to/my/model...
# Deploy a model to CloudML Engine.
gcloud ml-engine models create "movielens" --regions us-central1
gcloud ml-engine versions create "v1" --model "movielens" --origin "${MODEL_SOURCE}"
Now we are ready to issue online prediction requests. Each instance in the input file yields top_k_infer related movie ids along with their ranking scores (by default top_k_infer=100).
Select a small JSON text file from the preprocessed data. Each entry in the file is a base64-encoded tf.Example record suitable for online prediction.
gsutil ls -lh "${PREPROCESS_OUTPUT}/features_predict-*txt"
# Copy the file locally with the following command
# gsutil cp ${PREPROCESS_OUTPUT}/features_predict-00476-of-00560.txt .
gsutil cp gs://path/to/my/file... .
Run online prediction.
LOCAL_PREDICTION_FILE=path/to/local/file
gcloud ml-engine predict --model "movielens" --version "v1" --json-instances "${LOCAL_PREDICTION_FILE}"
For batch prediction requests, we take the following steps.
# Select a small gzipped TFRecord file from the preprocessed data.
gsutil ls -lh "${PREPROCESS_OUTPUT}/features_predict-*tfrecord.gz"
# For example
# GCS_PREDICTION_FILE=${PREPROCESS_OUTPUT}/features_predict-00543-of-00560.tfrecord.gz
GCS_PREDICTION_FILE=gs://path/to/my/file...
JOB_ID="movielens_batch_prediction_$(date +%Y%m%d_%H%M%S)"
gcloud ml-engine jobs submit prediction "${JOB_ID}" \
--model "movielens" \
--input-paths "${GCS_PREDICTION_FILE}" \
--output-path "${GCS_PATH}/prediction/${JOB_ID}" \
--region us-central1 \
--data-format TF_RECORD_GZIP
Once the job is successfully submitted, you should be able to find it in the ML Engine section of the Google Cloud Platform console, where you can check its status. Eventually the result file(s) should appear in the GCS location given by the specified output path.