-
Notifications
You must be signed in to change notification settings - Fork 1
Recommendation Algorithm
Our recommendation algorithm comprises of three steps:
- Seeding: select favourite movies as seeds for recommendation
- Associating: find similar movies based on the seeds
- Fitting: select based the projected score based on user history and public ratings
During the seeding step, we select all the movies that are rated 8 points and above by the user. These movies are considered the user's favourite movies, and we use these movies as the starting point to generate a pool of similar movies for further selection.
To create a pool of similar movies, we need to first calculate the similarity between each movie. By assigning different weights to data fields such as actors, directors, genres and the length of runtime, we manage to calculate a numeric value that represents similarity, ranged between 0 and 1, thereby generating the similar movie pool based on certain threshold. In our case, we used 0.5 as an arbitrary start.
The public ratings are updated based on the following conditions.
- If the vote count is smaller than 1000
- If the last updated date is at least 2 days ago
The first condition is to accommodate new movies and unpopular movies. They do not have a large viewer base and thus any new ratings might cause an observable change in the overall public ratings. The second condition is more than a routinely check for other movies, and it will only occur during the first loop of the entire process.
Using the three public ratings and user ratings, we are able to create a multi-linear regression model. With this model, given the three public ratings of an unwatched movie of a user, we are able to predict the expected score for that particular user.
A sequential process comprises of the above three steps takes roughly 5 mins to 6 mins per user. Upon examination, we discovered that most of the time was spent on the calculation of similarity. Thus, we decided to optimise the process using the following solution.
In order to reduce the amount of time spent calculating the similarity, we create a sparse matrix in database. A separate script will run continuously to calculate the similarities between user viewing histories and the entire movie databases, starting from the newest.
After the implementation of the sparse matrix (caching calculation results), 3 seconds are used for each user at the current user base. However, we also discovered that, as the number of rows of the matrix increases, longer time is required to query the similarity value. After the number of rows increased from 1 million to 5 million, 2 extra seconds are observed, which could be critical if the number of users is large enough.
This is inevitable if both of the size of user base and movie database are continuously growing, but the current solution is sufficient at this stage of deployment.
To increase the efficiency of this algorithm in the future, we could adopt the following methods.
- partition the sparse matrix into smaller trunks, based on the index of the movies, so we could locate the similar of two movies easily
- run recommendation script concurrently based on the number of users, the average time of each loop should be kept under a certain threshold so the user would not have to wait for a lengthy amount of time
- rating updates could also be run as a separate script, but this required the API support from various websites since it requires a constant data inflow.
-
2.1 Team Details
2.2 Project Worklog
-
5.1 Usability
5.2 Efficiency
5.3 Robustness
5.4 Security