A data pipeline for real-time game recommendation
The Generator component generates new users and simulates user actions. The generated events are sent to Kafka.
First, random attributes are assigned to each user at creation, including:

adventurousness
: how likely the user is to try new things

wealthiness
: how much the user is willing to pay for a game

activeness
: how 'active' the user is

genre_affinity
: how much the user likes a specific genre of games (e.g. RPG, FPS, Sports, etc.)
Then, the Generator fakes actions based on these attributes. For example, an RPG lover (a user with high RPG affinity) is more likely to buy/play Skyrim V, while an FPS fan is more likely to pick CS:GO; a rich player finds it acceptable to pay $19.99 for a game, while a user with low wealthiness may be reluctant to do so.
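The attribute-driven simulation above can be sketched as follows. This is a minimal illustration, not the actual Generator code: the attribute ranges, the catalogue entries other than Skyrim V and CS:GO, and the function names are all assumptions.

```python
import random

GENRES = ["RPG", "FPS", "Sports"]

# Hypothetical catalogue; only Skyrim V and CS:GO appear in the document.
CATALOGUE = [("Skyrim V", "RPG"), ("CS:GO", "FPS"), ("FIFA 18", "Sports")]

def make_user(uid):
    """Create a user with random attributes, each drawn uniformly from [0, 1)."""
    return {
        "id": uid,
        "adventurousness": random.random(),
        "wealthiness": random.random(),
        "activeness": random.random(),
        "genre_affinity": {g: random.random() for g in GENRES},
    }

def pick_game(user, catalogue):
    """Weight each game by the user's affinity for its genre, so an RPG
    lover tends to pick Skyrim V and an FPS fan tends to pick CS:GO."""
    weights = [user["genre_affinity"][genre] for _, genre in catalogue]
    title, _genre = random.choices(catalogue, weights=weights, k=1)[0]
    return title

user = make_user(1)
event = {"user": user["id"], "action": "play", "game": pick_game(user, CATALOGUE)}
# In the real pipeline this event would be serialized and published to a
# Kafka topic (e.g. via a Kafka producer client); here we only build the dict.
print(event)
```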
The Streamer component aggregates and reduces the streaming data in small batches, throttling the write frequency to the database, reducing round-trip time, and increasing overall throughput.
Currently, it's a two-node m4.large cluster, which is able to handle up to 10,000 events/sec.
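The batching idea is that many raw events for the same (user, game) key collapse into a single database write per micro-batch. A minimal sketch of that reduce step, with hypothetical event fields:

```python
from collections import defaultdict

def aggregate_batch(events):
    """Reduce a micro-batch: sum playtime per (user, game) key so the
    database receives one row per key instead of one row per event."""
    totals = defaultdict(int)
    for e in events:
        totals[(e["user"], e["game"])] += e["playtime"]
    return totals

batch = [
    {"user": 1, "game": "CS:GO", "playtime": 30},
    {"user": 1, "game": "CS:GO", "playtime": 45},
    {"user": 2, "game": "Skyrim V", "playtime": 60},
]
# Three events collapse into two writes.
print(dict(aggregate_batch(batch)))
```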
The Trainer trains the Recommender model in batch, using Spark MLlib's ALS method for collaborative filtering. Both the explicit and implicit variants were tested, and the former performed better. Ratings are converted from playtime using the formula `rating = BASE_RATING + log10(playtime)`.
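The conversion compresses long playtimes logarithmically, so an order-of-magnitude increase in playtime adds one point to the rating. A small sketch; the value of `BASE_RATING` is not given in the document and is assumed here:

```python
import math

BASE_RATING = 3.0  # assumed value; not specified in the document

def playtime_to_rating(playtime):
    """Convert playtime into an explicit rating:
    rating = BASE_RATING + log10(playtime)."""
    return BASE_RATING + math.log10(playtime)

# Each 10x increase in playtime adds exactly 1 to the rating.
print(playtime_to_rating(10))   # -> 4.0
print(playtime_to_rating(100))  # -> 5.0
```

The resulting (user, game, rating) tuples are what the explicit ALS variant consumes.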
The Recommender component combines the live streaming events with the batch-trained model, recomputes the user factors, and generates live recommendations that reflect each user's most recent activity.
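With the item factors from the batch-trained ALS model held fixed, recomputing one user's factor vector against their latest ratings is a small regularized least-squares problem: minimize ||r - Y x||² + λ||x||², where Y stacks the factors of the recently rated items. A rank-2 sketch solving the normal equations directly; the factor values, ratings, and λ below are illustrative, not taken from the document:

```python
def recompute_user_factor(item_factors, ratings, lam=0.1):
    """Solve (Y^T Y + lam*I) x = Y^T r for a rank-2 model, where each row
    of item_factors is the (fixed) factor vector of a recently rated item."""
    a11 = sum(f[0] * f[0] for f in item_factors) + lam
    a12 = sum(f[0] * f[1] for f in item_factors)
    a22 = sum(f[1] * f[1] for f in item_factors) + lam
    b1 = sum(f[0] * r for f, r in zip(item_factors, ratings))
    b2 = sum(f[1] * r for f, r in zip(item_factors, ratings))
    det = a11 * a22 - a12 * a12  # 2x2 Cramer's rule
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

Y = [(0.9, 0.1), (0.2, 0.8)]  # item factors from the batch-trained model
r = [4.5, 2.0]                # ratings derived from the latest stream events
x_u = recompute_user_factor(Y, r)

# Predicted rating for an item is the dot product of user and item factors;
# the highly rated item 0 now scores above item 1.
score0 = x_u[0] * Y[0][0] + x_u[1] * Y[0][1]
score1 = x_u[0] * Y[1][0] + x_u[1] * Y[1][1]
print(score0, score1)
```

This is why recommendations can track a user's most recent activity without retraining the full model: only the one user's factor is re-solved, while the item factors stay fixed between batch runs.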