The Jupyter notebook is an interactive coding environment especially well-suited for data analysis. It provides a browser-based interface that is much more user-friendly than the Python interactive interpreter.
The ability to blend narrative with code and visualizations has made Jupyter notebooks an increasingly popular tool among data journalists. With Jupyter, it's a cinch to create readable, reproducible data analyses that can be shared with fellow reporters, editors or the public.
Jupyter typically runs on a local web server, although a number of platforms and services allow you to view notebooks online and even work with Jupyter without having to install anything.
A growing number of newsrooms use Jupyter notebooks to collaborate internally, as well as share their work with the public. The Los Angeles Times Data Desk uses Jupyter to tell data-driven stories such as this analysis of California homes within fire zones.
Better yet, LAT journalists share their analyses with the world, as does BuzzFeed News.
Jupyter is often used for all steps in a data project, from data acquisition through analysis and visualization.
A Jupyter-only workflow can be especially helpful on projects with relatively small data sets. By performing all data work in Jupyter, you can minimize context switching and technical overhead.
However, complex or time-consuming data acquisition processes are not always a great fit for Jupyter. For example, a traditional Python script or data pipeline is arguably more appropriate for a complex web scraper that runs hourly and feeds an "evergreen" database. Of course, the data produced by such a script is always accessible to a Jupyter notebook.
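For example, if a scheduled scraper writes its results to a SQLite database, a notebook can pick the data up with a couple of lines of pandas. This is a minimal sketch; the database file and table names are hypothetical:

```python
import sqlite3

import pandas as pd

# Connect to the database maintained by the scraper
# ("scraped.db" and the "listings" table are hypothetical names).
conn = sqlite3.connect("scraped.db")
df = pd.read_sql_query("SELECT * FROM listings", conn)
conn.close()

df.head()
```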
To be clear, heavy web scrapes can be run inside a Jupyter notebook, and a notebook can even be executed from the command line, much like a standard Python script.
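One common way to do the latter is Jupyter's bundled nbconvert tool; the third-party papermill library offers a Python API for the same idea. The notebook names below are hypothetical:

```python
# From a shell, nbconvert can execute a notebook top to bottom
# and save the result:
#
#   jupyter nbconvert --to notebook --execute analysis.ipynb
#
# The papermill library does the same from Python
# ("analysis.ipynb" is a hypothetical notebook name).
import papermill as pm

pm.execute_notebook("analysis.ipynb", "analysis-output.ipynb")
```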
For this course, however, we'll decouple "expensive" data acquisition steps from Jupyter in order to keep notebooks lightweight and focused on data wrangling and analysis.
Here's a basic guide to getting started with Markdown, a simple language for generating formatted text.
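For instance, a handful of Markdown constructs cover most of what you'll need in a notebook's text cells:

```markdown
# A top-level heading

Plain text with *italics*, **bold** and `inline code`.

- A bulleted list item
- Another item

[A link](https://example.com)
```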
The pandas library is the workhorse of data wrangling and analysis in Python. It provides a wide range of functionality for common data tasks such as filtering, joining, and aggregating data.
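As a quick taste, here's a minimal sketch of those three tasks on a pair of made-up DataFrames (all column names and values are hypothetical):

```python
import pandas as pd

# Two small, made-up tables: one row per city, one row per state.
cities = pd.DataFrame({
    "city": ["Los Angeles", "Fresno", "Houston"],
    "state": ["CA", "CA", "TX"],
    "population": [3_900_000, 540_000, 2_300_000],
})
states = pd.DataFrame({
    "state": ["CA", "TX"],
    "region": ["West", "South"],
})

# Filtering: keep only the larger cities.
big = cities[cities["population"] > 1_000_000]

# Joining: attach each city's region via its state.
merged = big.merge(states, on="state")

# Aggregating: total population by region.
totals = merged.groupby("region")["population"].sum()
print(totals)
```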
pandas is a big library, and its official documentation can be a bit intimidating for beginners.
Below are a few resources that can help you level up on pandas:
- First Python Notebook
- Kaggle pandas tutorial
- 10 Minutes to pandas
- Pandas cheat sheets (print these! They're awesome!)
The First Python Notebook tutorial, created by data journalist Ben Welsh of The Los Angeles Times, is an excellent, gentle introduction to Jupyter and pandas. In addition to introducing key features of Jupyter, it walks through basic pandas skills using California campaign finance data.
It's worth taking the time to work through the entire tutorial.
Below are links to key sections in case you've moved on to your own project and need a quick refresher on a particular skill; a short sketch combining several of these skills follows the list.
- Reading data
- DataFrame helpers (head, info, etc.)
- Columns
- Filtering
- Merging
- Summing and descriptive stats (min, max, median, describe, etc.)
- Sorting
- Groupby
- Charts
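To jog your memory on a few of those skills at once, here's a minimal sketch that reads a hypothetical CSV, inspects it, sorts, groups and charts it (the file and column names are made up):

```python
import pandas as pd

# Reading data ("contributions.csv" and its columns are hypothetical).
df = pd.read_csv("contributions.csv")

# Quick inspection helpers.
df.head()
df.info()

# Descriptive stats for a numeric column.
df["amount"].describe()

# Sorting: largest contributions first.
df.sort_values("amount", ascending=False).head()

# Groupby: total contributions per committee.
totals = df.groupby("committee")["amount"].sum().sort_values(ascending=False)

# Charts: pandas' built-in plotting (requires matplotlib; in a
# notebook, the chart renders inline below the cell).
totals.plot(kind="bar")
```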