A data model that predicts the IMDb rating of a movie from features such as genre, director, and actors, using regression techniques.

noturlee/IMDb-DataAnalysis

Movie Rating Prediction Model

Table of Contents

  1. Objective
  2. Data Preprocessing
  3. Model Building and Evaluation
  4. Visualizations
  5. Outputs
  6. Conclusion

Objective

The objective of this project is to build a model that predicts the rating of a movie based on features such as genre, director, and actors. By analyzing historical movie data, we aim to develop a regression model that accurately estimates the rating given to a movie by users or critics. This project involves data analysis, preprocessing, feature engineering, and machine learning modeling techniques to gain insights into the factors that influence movie ratings and build a reliable prediction model.

Data Preprocessing

Loading the Dataset

The dataset is loaded with an explicitly specified encoding so that characters outside plain ASCII are read correctly.
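
As a minimal sketch of this step, the snippet below reads CSV data with an explicit encoding. The sample bytes, the column names, and the choice of `latin-1` are assumptions for illustration; the project's actual file and encoding are not stated here.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file; the column names
# and the 'latin-1' encoding are assumptions for illustration.
raw_bytes = "Name,Director,Rating\nAmélie,Jean-Pierre Jeunet,8.3\n".encode("latin-1")

# Passing the encoding explicitly avoids decode errors or garbled characters
# when the file is not UTF-8.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df)
```

With a real file, the `io.BytesIO(...)` argument would be replaced by the path to the CSV.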

Handling Missing Values

  • The 'Rating' column's missing values are filled with the mean rating.
  • Missing values in other columns are imputed with the mean for numeric columns and the most frequent value for categorical columns.
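
The two imputation rules above could be sketched with pandas as follows. The toy frame and its column names other than 'Rating' are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'Votes' and 'Genre' are assumed column names.
df = pd.DataFrame({
    "Rating": [7.0, np.nan, 5.0],
    "Votes": [100.0, 300.0, np.nan],
    "Genre": ["Drama", None, "Drama"],
})

# Fill missing ratings with the mean rating.
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Mean for the remaining numeric columns, most frequent value for the rest.
for col in df.columns.drop("Rating"):
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```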

Feature Engineering

A new feature 'Total Actors' is created to capture the number of actors listed in each movie.
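
One way such a feature might be derived is to count the non-missing actor entries per row. The actor column names below ('Actor 1' to 'Actor 3') are assumptions; the real dataset may name or structure them differently.

```python
import pandas as pd

# Hypothetical actor columns; the real dataset's column names may differ.
actor_cols = ["Actor 1", "Actor 2", "Actor 3"]
df = pd.DataFrame({
    "Actor 1": ["A", "B", None],
    "Actor 2": ["C", None, None],
    "Actor 3": ["D", None, None],
})

# Count the non-missing actor entries in each row.
df["Total Actors"] = df[actor_cols].notna().sum(axis=1)
print(df)
```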

Handling Non-numeric Values

Empty strings in non-numeric columns are converted to NaN so that they are treated as missing values.
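
A possible pandas sketch of this conversion is shown below. The column names are assumed, and treating whitespace-only strings as empty is an extra assumption beyond what the text states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with empty and whitespace-only entries.
df = pd.DataFrame({"Genre": ["Drama", "", "  "], "Director": ["X", "Y", ""]})

# Select the non-numeric columns and turn empty or whitespace-only
# strings into NaN so later imputation can see them as missing.
text_cols = df.select_dtypes(exclude="number").columns
df[text_cols] = df[text_cols].replace(r"^\s*$", np.nan, regex=True)
print(df)
```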

Model Building and Evaluation

Splitting the Data

The dataset is split into training and testing sets.
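
A minimal sketch of the split using scikit-learn's `train_test_split`. The 80/20 ratio and the `random_state` value are assumptions, since the actual settings are not given here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target standing in for the movie data.
X = np.arange(20).reshape(10, 2)
y = np.arange(10, dtype=float)

# test_size=0.2 and random_state=42 are assumed values for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```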

Preprocessing Pipeline

  • Numeric features are scaled, and missing values are imputed with the mean.
  • Categorical features are one-hot encoded, and missing values are imputed with the most frequent value.
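
The pipeline described above might look like the following scikit-learn sketch. The feature names, the use of `StandardScaler`, the impute-then-scale order, and `handle_unknown="ignore"` are assumptions rather than confirmed details of the project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names for illustration.
numeric_features = ["Votes", "Total Actors"]
categorical_features = ["Genre", "Director"]

# Numeric branch: mean imputation, then scaling.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X = pd.DataFrame({
    "Votes": [100.0, np.nan, 300.0, 200.0],
    "Total Actors": [3, 2, 1, 2],
    "Genre": ["Drama", "Comedy", "Drama", np.nan],
    "Director": ["X", "Y", "X", "Y"],
})
# Keep the categorical columns as plain Python objects with NaN for missing.
X[categorical_features] = X[categorical_features].astype(object)

X_t = preprocessor.fit_transform(X)
print(X_t.shape)
```

Wrapping the preprocessor and a regressor in a single `Pipeline` keeps imputation and scaling statistics from leaking out of the training folds during cross-validation.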

Model Selection

Four regression models are selected for evaluation:

  • Linear Regression
  • Ridge Regression
  • Lasso Regression
  • Random Forest Regressor

Cross-Validation

Each model is evaluated using 5-fold cross-validation to assess its performance on the training data.
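
The cross-validation loop over the four models could be sketched as below. The synthetic data from `make_regression` and all hyperparameters (`alpha`, `n_estimators`, `random_state`) are assumptions for illustration, not the project's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic training data standing in for the preprocessed movie features.
X_train, y_train = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Hyperparameters here are assumed values.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validated R-squared for each model.
cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```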

Model Fitting and Testing

Models are fitted on the full training set and evaluated on the test set using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2 Score).
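
A sketch of the fit-and-test step for one model, again on synthetic data. The data, split ratio, and choice of `LinearRegression` are assumptions; the same pattern would apply to each of the four models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed movie features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the full training set, then predict on the held-out test set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```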

Visualizations

The following visualizations are generated to provide insights into the model's performance and data characteristics:

Distribution of Movie Ratings

A histogram showing the distribution of movie ratings in the dataset.

Correlation Matrix

A heatmap showing the correlation between numeric features.
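
The matrix behind such a heatmap can be computed with pandas, as in this sketch. The toy values and the 'Votes' column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the numeric columns enter the correlation.
df = pd.DataFrame({
    "Rating": [6.5, 7.2, 5.8, 8.1],
    "Votes": [120, 340, 80, 900],
    "Genre": ["Drama", "Comedy", "Drama", "Action"],
})

# Pearson correlation between numeric columns; 'Genre' is skipped.
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

The resulting frame can then be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)` to draw the heatmap.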

Comparison of R-squared Scores

A bar plot comparing the cross-validated R-squared scores of different models.

Actual vs. Predicted Ratings

A scatter plot showing the relationship between actual and predicted ratings for the test set.

Residual Plot

A scatter plot showing the residuals (errors) of the predicted ratings.
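
The actual-vs-predicted and residual plots could be produced with matplotlib roughly as follows. The rating values are made up, and defining the residual as actual minus predicted is the usual convention, assumed here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up test-set ratings and predictions for illustration.
y_test = np.array([6.5, 7.2, 5.8, 8.1, 4.9])
y_pred = np.array([6.1, 6.8, 6.0, 7.0, 5.9])
residuals = y_test - y_pred  # actual minus predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs. predicted, with a dashed perfect-prediction reference line.
ax1.scatter(y_test, y_pred)
lims = [y_test.min(), y_test.max()]
ax1.plot(lims, lims, linestyle="--")
ax1.set_xlabel("Actual rating")
ax1.set_ylabel("Predicted rating")

# Residuals against predictions; points should scatter around zero.
ax2.scatter(y_pred, residuals)
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Predicted rating")
ax2.set_ylabel("Residual")

fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view the figure.
```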

Outputs

The key outputs from the model evaluation are:

  • Cross-validated R-squared: R-squared computed with 5-fold cross-validation on the training data. R-squared measures the proportion of variance in the ratings that is predictable from the features; a higher value indicates better model performance.
  • Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual ratings on the test set. Lower MSE indicates better model performance.
  • Root Mean Squared Error (RMSE): The square root of MSE, providing a measure of prediction error in the same units as the ratings.
  • R-squared (R2 Score): The same R-squared measure computed on the held-out test set, indicating how well the model's predictions approximate the actual ratings. A higher value indicates a better fit.
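
For reference, the standard definitions of these metrics are given below. The symbols are notation introduced here, not taken from the project: $y_i$ is the actual rating of movie $i$, $\hat{y}_i$ the predicted rating, $\bar{y}$ the mean of the actual ratings, and $n$ the number of movies being scored.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$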

Conclusion

The models built in this project aim to predict movie ratings based on features like genre, director, and actors. The evaluation metrics show that their predictive power is weak: R-squared values close to zero mean the models explain little more of the variance in ratings than always predicting the mean rating would. This suggests that the features as used here capture only a small part of what drives ratings, and that other factors not captured in this dataset play significant roles. Further feature engineering, data enrichment, and model tuning could improve the accuracy of these predictions.
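
To make "R-squared values close to zero" concrete, the sketch below computes R-squared by hand for a baseline that always predicts the mean rating, and for a set of closer predictions. The ratings are made up and the helper `r2` is a hypothetical function written for this example.

```python
import numpy as np

# Made-up actual ratings for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])


def r2(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Baseline: always predict the mean rating.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_r2 = r2(y_true, baseline_pred)

# Predictions that track the actual ratings more closely.
better_pred = np.array([2.0, 5.0, 8.0, 9.0])
better_r2 = r2(y_true, better_pred)

print(baseline_r2, better_r2)
```

The mean-only baseline scores exactly zero, so a model with R-squared near zero is doing about as well as ignoring the features entirely.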

The project demonstrates the end-to-end process of data analysis, preprocessing, feature engineering, and machine learning modeling for predicting movie ratings, and provides a foundation for further exploration and improvement.
