Investigate-a-Dataset

Project Details

Introduction

For this project, I conducted my own data analysis and created a file to share documenting the findings. I started by taking a look at the dataset and brainstorming what questions I could answer using it. Then I used pandas and NumPy to answer the questions I was most interested in, and created a report sharing the answers.

Step One - Choose a Data Set

I chose the TMDb dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata

Step Two - Get Organized

Create a single folder containing:

  • The report communicating your findings
  • Any Python code you wrote as part of your analysis
  • The data set used
  • A Jupyter notebook

Step Three - Analyze the Data

Brainstorm some questions that could be answered about the data set, then start answering those questions. Find some questions in the data set options to get started.

Try to suggest questions that promote looking at relationships between multiple variables. Aim to analyze at least one dependent variable and three independent variables in the investigation. Use NumPy and pandas where they are appropriate!
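As a rough sketch of what relating one dependent variable to three independent variables might look like, the snippet below computes pairwise correlations with pandas. The tiny table is made-up sample data, and the column names `revenue_adj`, `budget_adj`, `vote_average`, and `runtime` are assumptions used for illustration; check them against the actual CSV before reusing this.

```python
import pandas as pd

# Made-up sample rows for illustration only; column names are assumptions.
movies = pd.DataFrame({
    "revenue_adj": [10.0, 20.0, 30.0, 40.0],
    "budget_adj": [1.0, 2.0, 3.0, 4.0],
    "vote_average": [6.0, 5.0, 7.0, 6.5],
    "runtime": [120, 90, 100, 110],
})

dependent = "revenue_adj"
independent = ["budget_adj", "vote_average", "runtime"]

# Pearson correlation of each independent variable with the dependent one.
correlations = movies[independent].corrwith(movies[dependent])
print(correlations)
```

A correlation only hints at a linear association, so any relationship found this way is a starting point for plots and further questions, not a conclusion.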

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

  • Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters.
  • There are some odd characters in the ‘cast’ column. Don’t worry about cleaning them; leave them as is.
  • The final two columns, ending with “_adj”, show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
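One possible way to work with the pipe-separated columns is to split each string into a list and give every value its own row. The sketch below uses a small made-up frame; the `title` column and the sample values are hypothetical, while `genres` is the column named above.

```python
import pandas as pd

# Hypothetical sample rows; the real data set has many more columns.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split the pipe-separated string into a list, then expand to one row per genre.
genre_rows = (
    movies.assign(genre=movies["genres"].str.split("|"))
          .explode("genre")
          .reset_index(drop=True)
)
print(genre_rows[["title", "genre"]])
```

After this step a movie with two genres appears twice, so counts and averages computed per genre will count that movie once for each genre it belongs to.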

Example questions

  • Which genres are most popular from year to year?
  • What kinds of properties are associated with movies that have high revenues?
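The first question could be approached roughly as follows: expand the genres to one row each, average a popularity measure per year and genre, then keep the highest-scoring genre in each year. This is only a sketch on made-up data; the column names `release_year` and `popularity` are assumptions, and "most popular" is interpreted here as the highest mean popularity score, which is one of several reasonable definitions.

```python
import pandas as pd

# Made-up sample rows; `release_year` and `popularity` are assumed column names.
movies = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
    "genres": ["Action|Adventure", "Action", "Drama",
               "Drama|Thriller", "Action", "Drama"],
    "popularity": [6.0, 4.0, 2.0, 4.0, 3.0, 6.0],
})

# One row per (movie, genre) pair.
by_genre = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Mean popularity for every year/genre combination.
mean_pop = by_genre.groupby(["release_year", "genre"], as_index=False)["popularity"].mean()

# Keep the row with the highest mean popularity within each year.
top = mean_pop.loc[mean_pop.groupby("release_year")["popularity"].idxmax()]
top_by_year = {int(year): genre for year, genre in zip(top["release_year"], top["genre"])}
print(top_by_year)
```

Counting the number of releases per genre instead of averaging popularity would answer a slightly different version of the same question.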