Using Letterboxd personal data, TMDB API, and data science techniques to analyze movie watching data
As an avid movie watcher, I found the opportunity to combine my love of data science and film too good to pass up. Having been a Letterboxd user for many years, I used my personal watch history data to better understand my viewing patterns and learn more about myself in the process. For a more detailed report on this project and its methodology, please check out my LinkedIn article and its follow-up detailing the credits code implementation.
The first iteration of this project used Power BI for the end reporting, but in December 2023 I created a web UI version built on MongoDB and the streamlit library! The two Python scripts needed to add or update the data in a MongoDB collection have been added as optional scripts to this project. You can find the code repo on my GitHub, and the end site can be accessed at letterboxd.streamlit.app
- Cinephiles with a passion for coding
- Developers interested in multiple areas of development (API calling, sentiment analysis, regression-based modeling, etc.)
- Users of other logging communities (e.g. Goodreads rather than Letterboxd, IMDB, or another movie logging site) who also want to derive personal analytics from their platform usage
- Download personal movie data from Letterboxd: Settings -> Import & Export -> Export Your Data
- You can also access the direct link to download your data here
- Save the following files: watched.csv, ratings.csv, reviews.csv, and diary.csv
- Request an API key from TMDB
- Once you have the API key, use it with movies_api.py to retrieve additional movie data
- TMDB allows for 30-40 API requests every 10 seconds, so if you have thousands of movies logged, as I do, this could factor into the run time of movies_api.py
- If you've logged a limited series or prestige TV show on the app (like the Emmy-winning limited series Big Little Lies), it won't return any TMDB API hits, since the script is pointed at the movie side of the database. I removed those records since they aren't within the scope of the project anyway.
- Even though the dashboard was created in Power BI, I wrote the code in movies_eda.ipynb to re-create all the visualizations from the final dashboard. I included it as an ipynb rather than just a .py script so you can see the output of each code chunk, but a .py version would be suitable as well if you use a different IDE
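The API step above can be sketched roughly as follows. This is not the actual contents of movies_api.py, only a minimal illustration that assumes TMDB's v3 /search/movie endpoint queried by title and year, and that spaces out calls to stay under the request allowance mentioned above. The names build_search_url, fetch_json, and lookup_movies are hypothetical, as are the max_calls and per_seconds defaults. Titles with no movie hit, such as a limited series, are collected separately so they can be dropped.

```python
import json
import time
import urllib.parse
import urllib.request

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_url(api_key, name, year):
    """Build a TMDB movie-search URL for one Letterboxd title."""
    params = {"api_key": api_key, "query": name, "year": year}
    return TMDB_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Default fetcher: perform the HTTP GET and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def lookup_movies(films, api_key, fetch=fetch_json, max_calls=30,
                  per_seconds=10.0, sleep=time.sleep, clock=time.monotonic):
    """Look up (name, year) pairs, sending at most `max_calls` requests
    in any `per_seconds` window (assumed values, tune to your allowance).

    Returns (matches, misses): matches maps (name, year) to the first
    search result; misses lists titles with no movie hit.
    """
    matches, misses = {}, []
    sent = []  # timestamps of requests inside the current window
    for name, year in films:
        while True:
            now = clock()
            sent = [t for t in sent if now - t < per_seconds]
            if len(sent) < max_calls:
                break
            # Window is full: wait until the oldest request ages out.
            sleep(per_seconds - (now - sent[0]))
        sent.append(now)
        payload = fetch(build_search_url(api_key, name, year))
        results = payload.get("results", [])
        if results:
            matches[(name, year)] = results[0]
        else:
            misses.append((name, year))
    return matches, misses
```

Calling `lookup_movies([("Heat", 1995)], api_key="YOUR_KEY")` would perform a live request. The `fetch`, `sleep`, and `clock` parameters exist only so the throttling logic can be exercised without touching the network.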
- movies_api.py
- Optional: movies_hours.py
- movies_api_credits.py
- movies_api_credits_cleaning.py
- movies_cleaning.py
- movies_sentiment.py
- movies_modeling.py
- Optional: movies_eda.py or movies_eda.ipynb
- Optional: mongodb_create.py or mongodb_update.py
- Logged_Date -- Date I logged the film on Letterboxd
- Name -- Name of the film as it appears on Letterboxd's site
- Year -- Generally, the year of the US release date. Can vary depending on whether it was released internationally or at film festivals first
- Rating -- Records on a scale of 0 to 5 by increments of 0.5 the star rating I gave the film
- Review -- Boolean value that preserves whether or not I wrote a review for the film on Letterboxd
- id -- Unique identifying value in TMDB's database
- english_language -- Boolean value that records whether or not the movie's original language is English. Considered breaking this value out further but over 90% of them are surprisingly listed as English language in TMDB
- overview -- Provides brief synopsis of the film
- popularity -- Internally calculated score based on site interaction data. More information about this feature can be found here
- vote_average -- Average user rating of the film on a scale of 0 to 10
- vote_count -- Total number of users who rated the film
- vote_revenue -- Total amount of money grossed at the domestic and international box office
- runtime -- Total running length of the film excluding commercials, measured in minutes
- tagline -- Marketing verbiage which provides a punchy incentive for potential viewers to choose to watch the film
- watch_count -- Number of times you have seen the film using diary entries
- min_watched -- runtime * watch_count
- Logged_DOW -- Extracts day of the week from the Logged_Date values, recorded in numeric form (0 - Monday, 1 - Tuesday, 2 - Wednesday, 3 - Thursday, 4 - Friday, 5 - Saturday, 6 - Sunday)
- Logged_Month -- Extracts month value from the Logged_Date values
- Logged_Year -- Extracts year value from the Logged_Date values
- Logged_Week -- Calculates from 0 to 54 the week value from the Logged_Date values
- Daily_Movie_Count -- Calculates using the Logged_Date values how many movies I watched on a given date
- Weekly_Movie_Count -- Calculates using the Logged_Week and Logged_Year values how many movies I watched on a given week
- genres -- Several boolean columns exist that indicate whether or not the movie was classified into the following genres: (Action, Crime, War, Drama, Thriller, Mystery, Comedy, Romance, Sci_Fi, Animation, Documentary, Adventure, Music, Horror, Fantasy, History, Western, Rom_Com)
- female_roles -- Measures the number of female roles in the first 20 billed of a movie's acting credits
- female_driven -- Boolean value that records whether 9 or more of those 20 roles are female, therefore classifying the film as "female-driven"
- female_directed -- Boolean value that records whether or not the director of the film self-identifies as female
- negativity_percentage -- Measures what percentage of the string input has a negative association
- neutrality_percentage -- Measures what percentage of the string input has a neutral association
- positivity_percentage -- Measures what percentage of the string input has a positive association
- movie_sentiment -- The compound score is the aggregate sum of positive, negative & neutral percentages. The closer this value is to 1, the more positive the movie's overview is
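A few of the derived columns above can be illustrated with a short sketch. It is written in plain standard-library Python rather than taken from the project's scripts, and it rests on assumptions: that Logged_Date values are ISO strings (YYYY-MM-DD), that diary rows carry Name and Year keys, and that each cast entry follows the TMDB credits convention of an `order` field for billing position and `gender == 1` for female. The function names are hypothetical. Logged_Week is left out because the week-numbering convention is not stated here.

```python
from collections import Counter
from datetime import date


def date_features(logged_date):
    """Split an ISO date string into the Logged_* calendar columns."""
    d = date.fromisoformat(logged_date)
    return {
        "Logged_DOW": d.weekday(),  # 0 = Monday ... 6 = Sunday
        "Logged_Month": d.month,
        "Logged_Year": d.year,
    }


def watch_counts(diary_rows):
    """Count diary entries per film, keyed by (Name, Year)."""
    return Counter((row["Name"], row["Year"]) for row in diary_rows)


def minutes_watched(runtime, watch_count):
    """min_watched = runtime * watch_count."""
    return runtime * watch_count


def female_role_features(cast, top_n=20, threshold=9):
    """Count female roles among the first `top_n` billed cast members
    and flag the film as female-driven at `threshold` or more."""
    billed = sorted(cast, key=lambda member: member.get("order", 0))[:top_n]
    female_roles = sum(1 for member in billed if member.get("gender") == 1)
    return {
        "female_roles": female_roles,
        "female_driven": female_roles >= threshold,
    }


# Daily_Movie_Count is a plain tally of films per Logged_Date value.
logged_dates = ["2023-12-25", "2023-12-25", "2023-12-26"]
daily_movie_count = Counter(logged_dates)
```

The same logic maps directly onto pandas operations such as group-bys and date accessors if the data is already in a DataFrame.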
- Integrate additional movie attributes such as the film's director, leading actors, and thematic content (Completed Jan 2023 with "credits" expansion)
- Calculate total number of minutes and hours of movies watched using re-watch logs in the Diary dataset (Completed Dec 2024 with movies_hours code)
- Rather than just the film's language, integrating country of origin to better understand domestic vs. international viewing
- Left joining on the Diary dataset rather than the Watched one to conduct time series analysis/predict what genre or type of movies I'll watch next (Partially addressed in July 2024 with expansion to calculate minutes watched per film)
- TMDB API Details
- IMDB API for those who don't want to use TMDB
- Accessing the full IMDB raw data is not advised because it has around 100 million records
- Python API Tutorial
- Python Sentiment Analysis