movies

Using Letterboxd personal data, the TMDB API, and data science techniques to analyze movie-watching data

As an avid movie watcher, the opportunity to combine my loves of data science and film was too good to pass up. Having been a Letterboxd user for many years, I leveraged my personal watch history data to better understand my viewing patterns and learn more about myself in the process. For a more detailed report on this project and its methodology, please check out my LinkedIn article and its follow-up detailing the credits code implementation.

The first iteration of this project used Power BI for the end reporting, but in December 2023 I created a web UI version that uses MongoDB and the Streamlit library! The two Python scripts needed to add or update the data in a MongoDB collection have been added as optional scripts to this project. You can find the code repo on my GitHub, and the end site can be accessed at letterboxd.streamlit.app.

Who is this project for?

Cinephiles with a passion for coding
Developers interested in multiple areas of development (API calling, sentiment analysis, regression-based modeling, etc.)
Users of other logging communities (e.g. someone with a Goodreads account rather than Letterboxd, IMDB, or another movie logging site) who also want to derive personal analytics from their platform usage

Accessing The Data

Download your personal movie data from Letterboxd: Settings -> Import & Export -> Export Your Data
You can also access the direct link to download your data here
Save the following files: watched.csv, ratings.csv, reviews.csv, and diary.csv
Request an API key from TMDB
Once you have the API key, use it with movies_api.py to retrieve additional movie data
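
As a rough illustration of that last step, the sketch below builds a TMDB search request for one title from watched.csv. It assumes the TMDB v3 `search/movie` endpoint with the `api_key`, `query`, and `year` parameters. The helper names `build_search_url` and `fetch_first_match` are hypothetical and are not taken from movies_api.py, which may work differently.

```python
import json
import urllib.parse
import urllib.request

# TMDB v3 movie search endpoint (assumed to be the lookup used for each title).
TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_url(api_key, name, year=None):
    """Build a TMDB movie-search URL for one Letterboxd title (hypothetical helper)."""
    params = {"api_key": api_key, "query": name}
    if year is not None:
        params["year"] = str(year)
    return TMDB_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_first_match(api_key, name, year=None):
    """Return the first search result as a dict, or None when TMDB finds nothing."""
    with urllib.request.urlopen(build_search_url(api_key, name, year)) as resp:
        payload = json.load(resp)
    results = payload.get("results", [])
    return results[0] if results else None
```

Calling `fetch_first_match(api_key, name, year)` for each row of watched.csv would give the TMDB `id` needed for further lookups. A title with no movie match returns None, so it can be filtered out afterwards.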

Usage Insights

TMDB allows 30-40 API requests every 10 seconds, so if you have thousands of movies logged, as I do, this could noticeably increase the run time of movies_api.py
If you've logged a limited series or prestige TV on the app (like the Emmy Award-winning limited series Big Little Lies), those entries won't return any TMDB API hits because the script points at the movie side of the database. I removed those records since they aren't within the scope of the project anyway.
Even though the dashboard was created in Power BI, I wrote the code in movies_eda.ipynb to re-create all the visualizations from the final dashboard. I included it as an ipynb rather than just a .py script so you can see the output of each code chunk, but a .py version would be suitable as well if you use a different IDE.
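
To stay under a request limit like the one described above, calls can be spaced out with a small rolling-window throttle. This is a minimal sketch, not the approach used in movies_api.py. The `Throttle` class and its defaults (30 calls per 10 seconds, the lower end of the stated range) are assumptions made for illustration.

```python
import time


class Throttle:
    """Allow at most max_calls within any rolling window of `period` seconds."""

    def __init__(self, max_calls=30, period=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.stamps = []      # times of the calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have left the rolling window.
        self.stamps = [t for t in self.stamps if now - t < self.period]
        if len(self.stamps) >= self.max_calls:
            # Sleep until the oldest recorded call expires.
            self.sleep(self.period - (now - self.stamps[0]))
            now = self.clock()
            self.stamps = [t for t in self.stamps if now - t < self.period]
        self.stamps.append(now)
```

Create one `Throttle()` and call its `wait()` method immediately before each API request. At 30 requests per 10 seconds, 1,000 logged films take roughly five and a half minutes of wall-clock time for a single pass.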

Script Execution Order

movies_api.py
Optional movies_hours.py
movies_api_credits.py
movies_api_credits_cleaning.py
movies_cleaning.py
movies_sentiment.py
movies_modeling.py
Optional movies_eda.py or movies_eda.ipynb
Optional mongodb_create.py or mongodb_update.py

Data Dictionary

Logged_Date -- Date I logged the film on Letterboxd
Name -- Name of the film as it appears on Letterboxd's site
Year -- Generally, the year of the US release date. Can vary depending on whether it was released internationally or at film festivals first
Rating -- Star rating I gave the film, on a scale of 0 to 5 in increments of 0.5
Review -- Boolean value that records whether or not I wrote a review for the film on Letterboxd
id -- Unique identifying value in TMDB's database
english_language -- Boolean value that records whether or not the movie's original language is English. I considered breaking this value out further, but over 90% of the films are, surprisingly, listed as English language in TMDB
overview -- Provides brief synopsis of the film
popularity -- Internally calculated score based on site interaction data. More information about this feature can be found here
vote_average -- Average user rating of the film on a scale of 0 to 10
vote_count -- Total number of users who rated the film
vote_revenue -- Total amount of money grossed at the domestic and international box office
runtime -- Total running length of the film excluding commercials, measured in minutes
tagline -- Marketing verbiage which provides a punchy incentive for potential viewers to choose to watch the film
watch_count -- Number of times I have seen the film, counted from diary entries
min_watched -- runtime * watch_count
Logged_DOW -- Extracts day of the week from the Logged_Date values, recorded in numeric form (0 - Monday, 1 - Tuesday, 2 - Wednesday, 3 - Thursday, 4 - Friday, 5 - Saturday, 6 - Sunday)
Logged_Month -- Extracts month value from the Logged_Date values
Logged_Year -- Extracts year value from the Logged_Date values
Logged_Week -- Week number, from 0 to 54, calculated from the Logged_Date values
Daily_Movie_Count -- Calculates using the Logged_Date values how many movies I watched on a given date
Weekly_Movie_Count -- Calculates using the Logged_Week and Logged_Year values how many movies I watched on a given week
genres -- Several boolean columns exist that indicate whether or not the movie was classified into the following genres: (Action, Crime, War, Drama, Thriller, Mystery, Comedy, Romance, Sci_Fi, Animation, Documentary, Adventure, Music, Horror, Fantasy, History, Western, Rom_Com)
female_roles -- Measures the number of female roles in the first 20 billed of a movie's acting credits
female_driven -- Boolean value that records whether 9 or more of those 20 roles are female, therefore classifying the film as "female-driven"
female_directed -- Boolean value that records whether or not the director of the film self-identifies as female
negativity_percentage -- Measures what percentage of the string input has a negative association
neutrality_percentage -- Measures what percentage of the string input has a neutral association
positivity_percentage -- Measures what percentage of the string input has a positive association
movie_sentiment -- Compound score that aggregates the positive, negative, and neutral percentages into a single value. The closer this value is to 1, the more positive the movie's overview is
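
Several of the derived fields above follow directly from their definitions. The sketch below shows how a few of them could be computed with the standard library alone. It assumes Logged_Date arrives as an ISO `YYYY-MM-DD` string, and the function names are hypothetical rather than taken from movies_cleaning.py.

```python
from collections import Counter
from datetime import date


def date_features(logged_date):
    """Split an ISO date string into the Logged_* calendar fields."""
    d = date.fromisoformat(logged_date)
    return {
        "Logged_DOW": d.weekday(),  # 0 = Monday ... 6 = Sunday
        "Logged_Month": d.month,
        "Logged_Year": d.year,
    }


def daily_movie_count(logged_dates):
    """Map each date to the number of films logged on it (Daily_Movie_Count)."""
    return Counter(logged_dates)


def min_watched(runtime, watch_count):
    """Total minutes spent on a film: runtime * watch_count."""
    return runtime * watch_count


def female_driven(female_roles, threshold=9):
    """True when at least `threshold` of the first 20 billed roles are female."""
    return female_roles >= threshold
```

For example, `date_features("2023-12-25")` reports a `Logged_DOW` of 0, since that date fell on a Monday.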

Future Project Expansions

~~Integrate additional movie attributes such as the film's director, leading actors, and thematic content~~ Completed Jan 2023 with "credits" expansion
~~Calculate total number of minutes and hours of movies watched using re-watch logs in the Diary dataset~~ Completed Dec 2024 with movies_hours code
Rather than just the film's language, integrating country of origin to better understand domestic vs. international viewing
Left joining on the Diary dataset rather than the Watched one to conduct time series analysis and predict what genre or type of movies I'll watch next (partially addressed in July 2024 with the expansion to calculate minutes watched per film)

Helpful Data Resources

TMDB API Details
IMDB API for those who don't want to use TMDB
Accessing the full IMDB raw data is not advised because the complete dataset has around 100 million records
Python API Tutorial
Python Sentiment Analysis
