This repository contains the code for a content-based movie recommender system using cosine similarity. The project leverages metadata from movies to suggest similar movies based on their content. This type of recommender system is particularly useful for recommending items with similar attributes and providing personalized suggestions to users.
The data used in this project comes from the TMDB Movie Dataset available on Kaggle. This dataset consists of two files: `tmdb_5000_credits.csv` and `tmdb_5000_movies.csv`.
- Dataset from Kaggle: TMDB Movie Dataset
- Merged the two datasets on the movie title. Together, these files contain detailed information about roughly 5,000 movies, including their cast, crew, plot keywords, genres, and more.
- id
- title
- overview
- genres
- keywords
- cast
- crew
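The merge and column selection described above can be sketched with pandas. This is a minimal illustration, not the repository's actual code: the two small DataFrames are made-up stand-ins for the CSV files, and the `budget` and `movie_id` columns are only there to show that unused columns get dropped.

```python
import pandas as pd

# Tiny made-up stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv.
# With the real files you would call pd.read_csv(...) instead.
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["First plot.", "Second plot."],
    "genres": ["[]", "[]"],
    "keywords": ["[]", "[]"],
    "budget": [100, 200],  # example of a column that is not kept
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

# Join the two tables on the shared "title" column.
merged = movies.merge(credits, on="title")

# Keep only the columns listed above.
columns = ["id", "title", "overview", "genres", "keywords", "cast", "crew"]
selected = merged[columns]
print(selected.shape)
```

Merging on `title` assumes titles are unique enough to act as a key; duplicate titles would produce extra rows, which is one reason duplicates are removed during preprocessing.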
Preprocessing is a crucial step in any machine learning project. In this stage, I:
- Removed missing values and duplicates.
- Converted string data to lists.
- Removed spaces in names and concatenated columns to create a `tags` column.
- Converted `tags` to lowercase.
- Applied stemming to reduce words to their root form (e.g., "loved", "loving", and "love" become "love").
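A rough sketch of the string-to-list conversion and tag building, using only the standard library. The helper names (`extract_names`, `collapse_spaces`, `build_tags`) and the sample row are hypothetical, and including the overview text in the tags is an assumption. Stemming is left out because it needs an external stemmer, which the steps above do not name.

```python
import json

def extract_names(raw):
    """Parse a JSON string holding a list of objects and return their 'name' values."""
    return [item["name"] for item in json.loads(raw)]

def collapse_spaces(names):
    """Turn 'Jane Doe' into 'JaneDoe' so a multi-word name stays a single token."""
    return [name.replace(" ", "") for name in names]

def build_tags(overview, *name_lists):
    """Concatenate overview words and name lists into one lowercase string."""
    tokens = overview.split()
    for names in name_lists:
        tokens.extend(collapse_spaces(names))
    return " ".join(tokens).lower()

# Made-up row for illustration only.
genres = extract_names('[{"id": 1, "name": "Action"}, {"id": 2, "name": "Science Fiction"}]')
cast = extract_names('[{"name": "Jane Doe"}, {"name": "John Roe"}]')

tags = build_tags("A pilot explores a distant moon", genres, cast)
print(tags)
# a pilot explores a distant moon action sciencefiction janedoe johnroe
```

Collapsing spaces keeps a two-word name from being split into two unrelated tokens when the text is vectorized later.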
Extracted important features from the dataset to create a comprehensive `tags` column. This column combines various textual data such as cast, crew, genres, and keywords into a single field, which serves as the input for our model.
- Used `CountVectorizer` from scikit-learn to convert text data into vectors.
- Calculated cosine similarity between movie vectors.
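The two steps above might look like the following with scikit-learn. The three tag strings are invented for the example, and the `max_features` value is an assumption rather than a setting taken from this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up tag strings standing in for the real `tags` column.
tags = [
    "superhero spider newyork action",
    "superhero spider villain action",
    "romance paris drama",
]

# Bag-of-words counts; max_features caps the vocabulary size (assumed value).
vectorizer = CountVectorizer(max_features=5000)
vectors = vectorizer.fit_transform(tags)

# Pairwise cosine similarity: entry [i, j] compares movie i with movie j.
similarity = cosine_similarity(vectors)
print(similarity.shape)  # (3, 3)
```

Here the first two strings share three of their four words, so their similarity is 3 / (2 × 2) = 0.75, while the third string shares no words with the first and scores 0.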
A function is created to recommend movies based on the cosine similarity scores. Given a movie title, the function returns a list of similar movies, helping users discover new content based on their preferences.
recommend('Spider-Man')
Output:
- Spider-Man 3
- Spider-Man 2
- The Amazing Spider-Man 2
- Arachnophobia
- Kick-Ass
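One way such a recommendation function could be written is sketched below. Unlike the one-argument call shown above, this version takes the title list and similarity matrix as explicit parameters, which is an assumption made to keep the example self-contained; the `top_n` parameter and the toy data are likewise hypothetical.

```python
def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to `title`, excluding the title itself."""
    index = titles.index(title)  # raises ValueError if the title is unknown
    scores = similarity[index]
    # Rank all movie indices by descending similarity to the query movie.
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    return [titles[i] for i in ranked if i != index][:top_n]

# Toy data: a hand-written symmetric similarity matrix for four movies.
titles = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
]

print(recommend("A", titles, similarity, top_n=2))  # ['B', 'D']
```

The query movie is skipped explicitly because its similarity to itself is always the highest score in its row.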