School of Data (http://schoolofdata.org)
Welcome! In this session we will show you how you can use Python (2.7) to clean data! You will get information from a series of spreadsheets that have more or less the same structure and create a single clean CSV file. Pretty rad, huh?
This tutorial was inspired by a script I created for J++ São Paulo to clean some of Brazil's government data, and by an excellent OpenRefine workshop that Sarah Cohen gave during the Computer Assisted Reporting Conference 2016 (CAR Conference 2016) in Denver, USA.
I will assume:
- You know what Python, CSV and XLS files are;
- You know what regular expressions are;
- You know what modules/libraries are and how to import them;
- You have some basic knowledge of how computer algorithms work (assigning values to variables, creating if/else tests, control flows - while/for - etc.);
- You are not afraid of learning a bunch of cool tricks :)
Even if you don't know all of the above, you should be able to follow!
Our toolkit:
- Python 2.7 (should work on 3.5 as well, though I haven't tested. Let me know how it goes!);
- The built-in Regular Expression (re) module, to find specific text;
- The built-in csv module, to create and manipulate CSV files;
- The listdir() function from the built-in os module, to get the list of file names in a folder;
- The xlrd library, to manipulate XLS files. You can install it by running pip install xlrd on your terminal.
Let's go ahead and do our imports. Create a Python file and include the following at the top:
import re # Regular Expression library
import csv # The CSV library
from os import listdir # The listdir() function from the OS library
from xlrd import open_workbook # The open_workbook function from the xlrd library
Good. We should now be able to start cleaning, but a bit of context first.
I'll show you how to solve an annoying problem that might pop up at any time during your data journalism career: the data you want is spread across multiple files that share the same structure. This could be the case in a number of situations:
- A government agency publishes a report every month on their website with updated information;
- A company publishes, every quarter, reports about their financial status;
- A civil society organisation releases an .xls report every week about what's going on in the area they care about;
- Any situation where there are a number of files that preserve the same structure and you want to consolidate all the data in a single CSV file for further analysis.
We will explore a specific case and learn the techniques to work around the issue. Keep in mind the specific case here is not important. You can extrapolate the approach presented here to apply anywhere else where the same conditions apply. Let's get started!
The original data is here: https://www.health.ny.gov/health_care/managed_care/reports/enrollment/monthly/
The XLS files in the data/ folder are Medicaid long-term managed care reports from New York State in the United States. This data can be used as a way to determine which company would make a good subject based on its growth and size.
The published data is in the XLS format. While these reports are useful for human inspection, they can't really be processed by a computer as they are. We need to find a way to put all the information in every single file together and clean it. Sarah Cohen shows how to do it in OpenRefine. You will learn how to do it with Python.
Computers are really good at doing the same thing over and over. What we want is to find common patterns, have our script look for them, and repeat the procedure until it's finished. We also have to make sure the files share the same patterns -- the same number of columns, the same categories, the same keywords or empty spaces in similar places, etc.
You may have noticed in the source files that the series goes back further than 2013. To make things easier, I excluded files from before 2013 from this exercise. Why? Because the New York State government changed the structure of the files slightly, with a different number of columns and different types of data. Including them would add an extra layer of complexity to this tutorial. I may include how to extract data from them in the future.
Since the files from 2013 onwards share the same basic structure, we will figure out how to get the information we want from one of them and have our script do the same thing for all the other files.
Think about which columns your final dataset will have. What are you looking for? In our case, it makes sense to think about having the following columns:
- YEAR
- MONTH
- COUNTY
- PLAN NAME
- ADC (TANF ADC & MA-ADC)
- SNA (SNA HR & MA-HR)
- SSI (SSI & MA-SSI)
- NYSoH
- TOTAL ENROLLED
The other columns are aggregated numbers that we don't really need. We can get the same numbers by adding the values in the columns above.
Are there patterns we can rely on? Yes! Several! In every file:
- ...the information we need is always in the first sheet of the XLS file;
- ...dates (month & year) are always in the same cell (A4);
- ...the names of counties and plans are always in the same columns (A & B);
- ...the list of plan names always starts with the string TOTALS:;
- ...the county name is always immediately one cell to the left of TOTALS:;
- ...the list of plan names always ends in an empty space (after that we have data for New York City, which we won't capture).
Notice that in some files the names of the columns are different. There are two cases:
- Files with the NYSoH column;
- Files without the NYSoH column.
No problem. We will take care of each one of those cases when populating our CSV file with the correct data.
Before starting your cleaning procedure, take your time evaluating the files. Write down the most evident patterns. Watch out for differences between the files. They might be similar, but have slight differences, such as different column names or aggregated data.
Alright. Let's get to it.
When you open the XLS files, one of the first things you'll notice is that cell A4 holds the information about the month and year of the current file. That is always the case. We will store this information in two separate columns in our final CSV file: MONTH and YEAR. To do that, let's create a function that will give us a dictionary with key/value pairs for the MONTH and for the YEAR. We will use this dictionary later.
# Gets the date from a specific cell in the documents
# and uses a regular expression to separate the month from the year.
# Returns a dictionary with YEAR and MONTH
def getDate(worksheet):
    '''
    worksheet: an xlrd sheet object
    returns: a dictionary with YEAR and MONTH
    '''
    date = worksheet.cell_value(3, 0)
    match = re.match(r'NYS[ ,]+(\w+)[ ,](\d+)', date).groups()
    date = {'YEAR': match[1],
            'MONTH': match[0]}
    return date
Let's see how this works.
This function takes an xlrd sheet object (essentially, the sheet we will extract the information from) that we will create later. It uses the cell_value method from xlrd to get the contents of a specific cell. The first parameter (3) is the row and the second (0) is the column. Since we want the contents of cell A4, we use the pair 3 and 0 (the count starts at zero!). We store the contents of cell A4 in the variable date.
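If you want to poke at this before building the full loop, you can open a single file and check the cell yourself (the file name below is hypothetical; use any of the reports you downloaded):
worksheet = open_workbook('data/some_report.xls').sheet_by_index(0)  # hypothetical file name
print worksheet.cell_value(3, 0)  # contents of cell A4, e.g. u'NYS, January 2014'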
After that we do a regex (Regular Expression) search on our date variable to extract the month and the year. The government sometimes uses a comma to separate NYS from the month, sometimes it doesn't, which is frustrating. The same happens between the month and the year. But no matter! Regex to the rescue.
- NYS[ ,]+ matches a string that starts with NYS, followed by one or more spaces or commas;
- (\w+)[ ,] does our first capture (that's why the parentheses are there): any word after the above, followed by a space or a comma;
- (\d+) does our second capture: any number of digits after the above.
With our two values captured, we create our dictionary and return it.
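To convince yourself the pattern handles both formats, try it on a couple of made-up strings that follow the two variations described above:
# Made-up samples: with and without the comma after NYS
print re.match(r'NYS[ ,]+(\w+)[ ,](\d+)', 'NYS, January 2014').groups()
# prints ('January', '2014')
print re.match(r'NYS[ ,]+(\w+)[ ,](\d+)', 'NYS February,2015').groups()
# prints ('February', '2015')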
We will create our CSV file row by row. So let's create an empty dictionary to store the information about the current row we will be scraping.
row = {}
Let's also assign the path to the folder where the XLS files are.
basedir = 'data/'
And let's use the listdir() function to get the list of filenames in that folder.
files = listdir(basedir)
We will go over each and every XLS file. We need to tell the script which file it will be working on at any given time. Let's make sure our list has only files with the XLS extension.
sheets = [filename for filename in files if filename.endswith("xls")]
Whoa! Wait a minute. I just did what is called a list comprehension: get the filenames in the files list, but only those which end with 'xls'. Pretty cool, right?
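If the one-liner looks cryptic, it does exactly the same as this plain loop:
sheets = []
for filename in files:
    if filename.endswith("xls"):
        sheets.append(filename)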
The first line of a CSV file is usually the names of the columns, you know, the header. Since we will be writing row by row to our CSV file using a series of repetitions, we need a device to know whether we have written the header or not. Let's create a flag that will help us with that.
header_is_written = False
We will set header_is_written to True after we write the header.
Ok. Now we're going to scrape the information from all the files in the sheets list we just created. To do that we will start a simple for loop. We will also open the workbook using open_workbook from xlrd. Since the information we need is in the first sheet of the file, let's use the sheet_by_index() method to open that sheet:
for filename in sheets:
    worksheet = open_workbook(basedir + filename).sheet_by_index(0)
Now let's get the date from this file using the function we created in the beginning:
    date = getDate(worksheet)
    row['YEAR'] = date['YEAR']
    row['MONTH'] = date['MONTH']
Notice we're still inside the loop, so the code needs to be indented. This will be the case moving forward.
Column B is the most important. It has the names of the companies, and all the names of the counties are right next to it. Let's get the values in that column and store them in a list. After that we will iterate over them to get what we need.
    col1_values = worksheet.col_values(1)  # column A == 0, column B == 1...
We will iterate over a range of indexes instead of the values themselves because we need to keep track of the coordinates of the cells. With the coordinates we can get the actual contents, so we should be fine.
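As an aside, Python's built-in enumerate() gives you the index and the value at the same time. A minimal sketch of the same idea, in case you prefer that style:
# enumerate() yields (index, value) pairs, so we keep the row coordinate too
for i, value in enumerate(col1_values):
    if value == u'TOTALS:':
        print i, worksheet.cell_value(i, 0)  # row index and the county name next to it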
We're all set to get the data we need. Remember we need to get the names of the counties, the names of the companies and the values associated with them. Our reference is the value TOTALS: in column B. If we find it, we will be one cell away from the county name. Let's iterate over the values of column B. If we find TOTALS:, let's save the value immediately to the left in our row dictionary, assigning it to the key COUNTY. We will be using the cell_value method from xlrd. I'm also using the strip() method. It will trim trailing and leading spaces from the cell value, just in case.
    for i in range(len(col1_values)):
        if worksheet.cell_value(i, 1) == u'TOTALS:':
            row['COUNTY'] = worksheet.cell_value(i, 0).strip()
            county = True
            counter = 1
Now we need to get the plan names that are associated with that county. Our robot is still looking for the TOTALS: value, and when it finds it again, we will have reached the next county. So let's create a flag called county: True if the robot hasn't found the next TOTALS: yet, False if it hits the next county. Also, since the plan names come right after TOTALS:, we create a counter that will increment the row number by one to get the values down the list until we find another TOTALS:.
            while county is True:
                row['PLAN NAME'] = worksheet.cell_value(i + counter, 1).strip()
                values = worksheet.row_slice(i + counter, start_colx=2, end_colx=7)
So, while the county flag is True, we will store the PLAN NAME in the dictionary, adding the counter to the position of the row. We're also using the nifty row_slice method from xlrd to generate a list of all the values associated with that PLAN NAME, from columns C (2) to G (6 - the end_colx parameter goes up to the number before the one you set).
Since we have a list of values in the same order as the columns, we will assign each one of them to the appropriate key in the row dictionary. The list won't have the actual values, but xlrd Cell objects containing the values. To get the values we need to use the getter .value. We're also converting them to integers with the int() function.
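In other words (the numbers here are made up for illustration):
cell = values[0]         # an xlrd Cell object, e.g. number:1234.0
print cell.value         # 1234.0 -- the raw value, a float
print int(cell.value)    # 1234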
Remember that some files have the NYSoH column and others don't. Depending on the case, the values in the list will change places. A simple if/else statement will take care of that. The SSI column becomes the 4th number in the list whenever NYSoH is not present. When NYSoH is not present, let's give it an empty value.
We have to give NYSoH an empty value because we will be writing rows to the CSV file. All the rows must have the same number of values, otherwise things will break.
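To be concrete about "things will break": csv.DictWriter, which we'll use below, raises a ValueError if a row dictionary contains keys that are not in the header, so keeping the keys consistent matters. A tiny illustration (f stands for any open file object):
w = csv.DictWriter(f, ['ADC', 'NYSoH'])
w.writerow({'ADC': 10, 'NYSoH': ''})           # fine
w.writerow({'ADC': 10, 'NYSoH': 5, 'XYZ': 1})  # ValueError: dict contains fields not in fieldnames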
We're also incrementing counter by 1 so that whenever it goes back to the beginning of the loop, it will look for the row after the one we just stored.
                if worksheet.cell_value(4, 5) == u'NYSoH':
                    row['ADC'] = int(values[0].value)
                    row['SNA'] = int(values[1].value)
                    row['SSI'] = int(values[2].value)
                    row['NYSoH'] = int(values[3].value)
                    row['TOTAL'] = int(values[4].value)
                else:
                    row['ADC'] = int(values[0].value)
                    row['SNA'] = int(values[1].value)
                    row['SSI'] = int(values[3].value)
                    row['NYSoH'] = ""
                    row['TOTAL'] = int(values[4].value)
                counter += 1
Everything is good. We have the info we need to write the row to the CSV file. Before we do that, let's set the county flag to False if the script finds TOTALS: once it increments the counter. Also, if the PLAN NAME in that row was assigned an empty string, it means we reached the end of the county data. The empty space is what separates the county data from the data for New York City. If that's the case, we will break out of this loop and go to the next file without writing the row with the empty space.
                if worksheet.cell_value(i + counter, 1) == u'TOTALS:':
                    county = False
                # If we reach a point in this column where the value is blank, our search in this file is over!
                elif row['PLAN NAME'] == '':
                    break
Alright. Time to write the row dictionary to the CSV file. Easy. We will use the csv module we imported earlier. Remember the flag we created for the header? The handy thing about having stored all the row information in a dictionary is that we can use the writeheader() method to get the names of the keys and write the header. Once we do that, we update the header_is_written flag to True and never touch it again, since we need to do this only once, at the very first iteration :)
                # Saving the row in the CSV file...
                with open('table.csv', 'ab') as f:
                    w = csv.DictWriter(f, row.keys())
                    if header_is_written is False:
                        w.writeheader()
                        header_is_written = True
                    w.writerow(row)
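Two caveats worth knowing. First, because we open table.csv in append mode ('ab'), running the script twice will append all the rows again; delete the file between runs. Second, if you try this on Python 3 (which, as I said, I haven't tested), the csv module expects the file in text mode, so the open line would become:
with open('table.csv', 'a', newline='') as f: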
Yeah. That's it.
There's no formula that will fit all the files out there. Every situation will bring new challenges, and you will have to figure out what the patterns are and the best way to approach them. There are many ways to clean the same file. We could have used a column other than B as our reference, for example. Use the approach that makes the most sense to you.
Let's take a look at the whole code, shall we? :)
import re
import csv
from xlrd import open_workbook
from os import listdir
# Gets the date from a specific cell in the documents
# and uses a regular expression to separate the month from the year.
# Returns a dictionary with YEAR and MONTH
def getDate(worksheet):
    '''
    worksheet: an xlrd sheet object
    returns: a dictionary with YEAR and MONTH
    '''
    date = worksheet.cell_value(3, 0)
    match = re.match(r'NYS[ ,]+(\w+)[ ,](\d+)', date).groups()
    date = {'YEAR': match[1],
            'MONTH': match[0]}
    return date
# Empty dictionary where we will store each row before writing it to the CSV
row = {}
# This is where the xls files are
basedir = 'data/'
# This is an os function that returns a list of filenames in a folder
files = listdir(basedir)
# List comprehension: keep only the XLS files found in the folder
sheets = [filename for filename in files if filename.endswith("xls")]
header_is_written = False
# Iterating over the files in the folder
for filename in sheets:
    # Opens the first sheet of the xls file
    worksheet = open_workbook(basedir + filename).sheet_by_index(0)
    # Get the date from it
    date = getDate(worksheet)
    row['YEAR'] = date['YEAR']
    row['MONTH'] = date['MONTH']
    # We're going to iterate over all the values in column B (index 1)
    col1_values = worksheet.col_values(1)
    for i in range(len(col1_values)):
        if worksheet.cell_value(i, 1) == u'TOTALS:':
            row['COUNTY'] = worksheet.cell_value(i, 0).strip()
            county = True
            counter = 1
            # While we're in that county, let's save all the data it has for different plan names
            while county is True:
                row['PLAN NAME'] = worksheet.cell_value(i + counter, 1).strip()
                values = worksheet.row_slice(i + counter, start_colx=2, end_colx=7)
                if worksheet.cell_value(4, 5) == u'NYSoH':
                    row['ADC'] = int(values[0].value)
                    row['SNA'] = int(values[1].value)
                    row['SSI'] = int(values[2].value)
                    row['NYSoH'] = int(values[3].value)
                    row['TOTAL'] = int(values[4].value)
                else:
                    row['ADC'] = int(values[0].value)
                    row['SNA'] = int(values[1].value)
                    row['SSI'] = int(values[3].value)
                    row['NYSoH'] = ""
                    row['TOTAL'] = int(values[4].value)
                counter += 1
                # If the robot finds another cell with the value "TOTALS:", it means we reached another county
                # Time to set the county flag to False and start again
                if worksheet.cell_value(i + counter, 1) == u'TOTALS:':
                    county = False
                # If we reach a point in this column where the value is blank, our search in this file is over!
                elif row['PLAN NAME'] == '':
                    break
                # Saving the row in the CSV file...
                with open('table.csv', 'ab') as f:
                    w = csv.DictWriter(f, row.keys())
                    if header_is_written is False:
                        w.writeheader()
                        header_is_written = True
                    w.writerow(row)