The Jupyter notebook is an interactive coding environment especially well-suited for data analysis. It provides a browser-based interface that is much more user-friendly than the Python interactive interpreter.
The ability to blend narrative with code and visualizations has made Jupyter notebooks an increasingly popular tool among data journalists. With Jupyter, it's a cinch to create readable, reproducible data analyses that can be shared with fellow reporters, editors or the public.
Jupyter typically runs on a local web server, although a number of platforms and services allow you to view notebooks online and even work with Jupyter without having to install anything.
A growing number of newsrooms use Jupyter notebooks to collaborate internally, as well as share their work with the public. The Los Angeles Times Data Desk uses Jupyter to tell data-driven stories such as this analysis of California homes within fire zones.
Better yet, LAT journalists share their analyses with the world, as does BuzzFeed News.
Jupyter is often used for all steps in a data project, from data acquisition through analysis and visualization.
A Jupyter-only workflow can be especially helpful on projects with relatively small data sets. By performing all data work in Jupyter, you can minimize context switching and technical overhead.
However, complex or time-consuming data acquisition processes are not always a great fit for Jupyter. For example, a traditional Python script or data pipeline is arguably more appropriate for a complex web scraper that runs hourly and feeds an "evergreen" database. Of course, the data produced by such a script is always accessible to a Jupyter notebook.
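For example, if a scheduled scraper writes its results to a SQLite database, a notebook can pick the data up with a couple of lines of pandas. This is a minimal sketch; the database file and table names are hypothetical:

```python
import sqlite3

import pandas as pd

# Connect to the database maintained by the scraper
# ("scraped.db" and the "listings" table are hypothetical names).
conn = sqlite3.connect("scraped.db")
df = pd.read_sql_query("SELECT * FROM listings", conn)
conn.close()

df.head()
```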
To be clear, heavy web scrapes can be run inside a Jupyter notebook, and a notebook can even be executed from the command line, much like a standard Python script.
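One common way to do the latter is Jupyter's bundled nbconvert tool; the third-party papermill library offers a Python API for the same idea. The notebook names below are hypothetical:

```python
# From a shell, nbconvert can execute a notebook top to bottom
# and save the result:
#
#   jupyter nbconvert --to notebook --execute analysis.ipynb
#
# The papermill library does the same from Python
# ("analysis.ipynb" is a hypothetical notebook name).
import papermill as pm

pm.execute_notebook("analysis.ipynb", "analysis-output.ipynb")
```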
For this course, however, we'll decouple "expensive" data acquisition steps from Jupyter in order to keep notebooks lightweight and focused on data wrangling and analysis.
Here's a basic guide to getting started with Markdown, a simple language for generating formatted text.
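For instance, a handful of Markdown constructs cover most of what you'll need in a notebook's text cells:

```markdown
# A top-level heading

Plain text with *italics*, **bold** and `inline code`.

- A bulleted list item
- Another item

[A link](https://example.com)
```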
The pandas library is the workhorse of data wrangling and analysis in Python. It provides a wide range of functionality for common data tasks such as filtering, joining, and aggregating data.
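As a quick taste, here's a minimal sketch of those three tasks on a pair of made-up DataFrames (all column names and values are hypothetical):

```python
import pandas as pd

# Two small, made-up tables: one row per city, one row per state.
cities = pd.DataFrame({
    "city": ["Los Angeles", "Fresno", "Houston"],
    "state": ["CA", "CA", "TX"],
    "population": [3_900_000, 540_000, 2_300_000],
})
states = pd.DataFrame({
    "state": ["CA", "TX"],
    "region": ["West", "South"],
})

# Filtering: keep only the larger cities.
big = cities[cities["population"] > 1_000_000]

# Joining: attach each city's region via its state.
merged = big.merge(states, on="state")

# Aggregating: total population by region.
totals = merged.groupby("region")["population"].sum()
print(totals)
```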
pandas is a big library, and its official documentation can be a bit intimidating for beginners.
Below are a few resources that can help you level up on pandas:
- First Python Notebook
- Kaggle pandas tutorial
- 10 Minutes to pandas
- Pandas cheat sheets (print these! They're awesome!)
The First Python Notebook tutorial, created by data journalist Ben Welsh of The Los Angeles Times, is an excellent, gentle introduction to Jupyter and pandas. In addition to introducing key features of Jupyter, it walks through basic pandas skills using California campaign finance data.
It's worth taking the time to work through the entire tutorial.
Below are links to key sections in case you've moved on to your own project and need a quick refresher on a particular skill; a short sketch combining several of these skills follows the list.
- Reading data
- DataFrame helpers (head, info, etc.)
- Columns
- Filtering
- Merging
- Summing and descriptive stats (min, max, median, describe, etc.)
- Sorting
- Groupby
- Charts
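To jog your memory on a few of those skills at once, here's a minimal sketch that reads a hypothetical CSV, inspects it, sorts, groups and charts it (the file and column names are made up):

```python
import pandas as pd

# Reading data ("contributions.csv" and its columns are hypothetical).
df = pd.read_csv("contributions.csv")

# Quick inspection helpers.
df.head()
df.info()

# Descriptive stats for a numeric column.
df["amount"].describe()

# Sorting: largest contributions first.
df.sort_values("amount", ascending=False).head()

# Groupby: total contributions per committee.
totals = df.groupby("committee")["amount"].sum().sort_values(ascending=False)

# Charts: pandas' built-in plotting (requires matplotlib; in a
# notebook, the chart renders inline below the cell).
totals.plot(kind="bar")
```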