<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Acquiring, cleaning, and formatting data</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
</head>
<body class="markdown github">
<header class="navbar-inverse navbar-fixed-top">
<div class="container">
<nav role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="index.html" class="navbar-brand">J298 Data Journalism</a>
</div> <!-- /.navbar-header -->
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Class notes<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="week1.html">What is data?</a></li>
<li><a href="week2.html">Types of stories</a></li>
<li><a href="week3.html">Working with spreadsheets</a></li>
<li><a href="week4.html">Acquiring, cleaning, and formatting data</a></li>
<li><a href="week5.html">R, RStudio, and the tidyverse</a></li>
<li><a href="week6.html">Data journalism in the tidyverse</a></li>
<li><a href="week7.html">Don't let the data lie to you</a></li>
<li><a href="week8.html">Databases and SQL</a></li>
<li><a href="week9.html">Finding stories using maps</a></li>
<li><a href="week10.html">Maps meet databases</a></li>
<li><a href="week11.html">More PostGIS</a></li>
<li><a href="week12.html">R practice</a></li>
<li><a href="week13.html">PostGIS practice</a></li>
<li><a href="week14.html">More fun with R</a></li>
</ul>
</li>
<li><a href="software.html">Software</a></li>
<li><a href="datasets.html">Data</a></li>
<li><a href="questions.html">If you get stuck</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Email instructors<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="mailto:[email protected]">Peter Aldhous</a></li>
<li><a href="mailto:[email protected]">Amanda Hickman</a></li>
</ul>
</li>
</ul>
</div><!-- /.navbar-collapse -->
</nav>
</div> <!-- /.navbar-header -->
</header>
<div class="container all">
<h1 id="acquiring,-cleaning,-and-formatting-data"><a name="acquiring,-cleaning,-and-formatting-data" href="#acquiring,-cleaning,-and-formatting-data"></a>Acquiring, cleaning, and formatting data</h1><p>Not so many years ago, data was hard to obtain. Often journalists would have to painstakingly compile their own datasets from paper records. The Internet has changed the game. Today, many government and other public databases can be queried online, and the results of those searches downloaded. Other datasets can be downloaded in their entirety.</p><p>The main problem today is usually not finding relevant data, but working out whether it can be trusted, spotting and correcting errors and inconsistencies, and getting it in the right format for analysis and visualization.</p><p>In this class, we will cover some tips and tricks for finding the data you need online, and getting it onto your computer. We will consider how to recognize and clean “dirty” data, review some common data formats, and learn how to convert from one to another.</p><h3 id="the-data-we-will-use"><a name="the-data-we-will-use" href="#the-data-we-will-use"></a>The data we will use</h3><p>Download the data for this class from <a href="data/week4.zip">here</a>, unzip the folder and place it on your desktop. It contains the following files:</p><ul>
<li><p><code>techexports.xls</code> <a href="http://data.worldbank.org/indicator/TX.VAL.TECH.CD">High-technology exports</a> from 1990 to 2015, in current US dollars, from the UN Comtrade database, supplied via the World Bank. High-technology exports include products in aerospace, computers, pharmaceuticals, scientific instruments, and electrical machinery.</p>
</li><li><p><code>ucb_stanford_2014.csv</code> Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from <a href="https://www.usaspending.gov/Pages/Default.aspx">USASpending.gov</a>.</p>
</li><li><p><code>alerts-actions_2017.xls</code> Records of <a href="http://www.mbc.ca.gov/Publications/Disciplinary_Actions/">disciplinary alerts issued and actions taken</a> by the Medical Board of California in 2017.</p>
</li></ul><h3 id="understand-common-data-formats"><a name="understand-common-data-formats" href="#understand-common-data-formats"></a>Understand common data formats</h3><p><strong><a href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></strong> is just one variant of a “delimited” text file, in which the fields or columns in a table of data are separated by commas (hence comma-separated values, or CSV) or another character such as a tab. You may also encounter “fixed width” text files, in which the fields are lined up, with each located a specified number of characters across each row. Spreadsheets such as Libre Office Calc or Microsoft Excel can import fixed-width files, which you can then export in a delimited format for use in other software.</p><p>As we’ve already discussed, text files are great for transferring data from one software application to another during analysis and visualization, but other formats that are easier for machines to read are typically used when transferring data between computers online.</p><p><strong><a href="http://json.org/">JSON</a></strong>, or JavaScript Object Notation, treats data as a series of “objects,” which begin and end with curly brackets. Each object in turn contains a series of name-value pairs. There is a colon between the name and value in each pair, and the pairs are separated by commas.</p><p>Here, for example, are the first few rows of a simple dataset relating to a controversial theory claiming that the extent to which a country has developed a democratic political system is driven largely by the historical prevalence of infectious disease:</p><pre class="json"><code class="json">[{"country":"Bahrain","income_group":"High income: non-OECD","democ_score":45.6,"infect_rate":23},
{"country":"Bahamas, The","income_group":"High income: non-OECD","democ_score":48.4,"infect_rate":24},
{"country":"Qatar","income_group":"High income: non-OECD","democ_score":50.4,"infect_rate":24},
{"country":"Latvia","income_group":"High income: non-OECD","democ_score":52.8,"infect_rate":25},
{"country":"Barbados","income_group":"High income: non-OECD","democ_score":46,"infect_rate":26}]
</code></pre><p><strong><a href="http://www.w3.org/XML/">XML</a></strong>, or Extensible Markup Language, is another format often used to move data around online. For example, the RSS feeds through which you can subscribe to content from blogs and websites using a reader such as <a href="https://feedly.com/">Feedly</a> are formatted in XML.</p><p>In XML, data is structured by enclosing values within “tags,” similar to those used to code different elements on a web page in HTML. Here is that same data in XML format:</p><pre class="xml"><code class="xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;rows&gt;
&lt;row country="Bahrain" income_group="High income: non-OECD" democ_score="45.6" infect_rate="23"&gt;&lt;/row&gt;
&lt;row country="Bahamas, The" income_group="High income: non-OECD" democ_score="48.4" infect_rate="24"&gt;&lt;/row&gt;
&lt;row country="Qatar" income_group="High income: non-OECD" democ_score="50.4" infect_rate="24"&gt;&lt;/row&gt;
&lt;row country="Latvia" income_group="High income: non-OECD" democ_score="52.8" infect_rate="25"&gt;&lt;/row&gt;
&lt;row country="Barbados" income_group="High income: non-OECD" democ_score="46" infect_rate="26"&gt;&lt;/row&gt;
&lt;/rows&gt;
</code></pre>
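<p>To get a feel for how software reads these formats, here is a minimal sketch using Python’s standard library; it assumes the JSON and XML examples above have been saved as <code>democracy.json</code> and <code>democracy.xml</code>, which are hypothetical filenames:</p><pre class="python"><code class="python"># A minimal sketch using only Python's standard library.
import json
import xml.etree.ElementTree as ET

# The JSON file parses to a list of dictionaries, one per row
with open("democracy.json") as f:
    rows = json.load(f)
for row in rows:
    print(row["country"], row["democ_score"], row["infect_rate"])

# The XML file parses to the same records, stored as attributes of each row element
tree = ET.parse("democracy.xml")
for row in tree.getroot():
    print(row.get("country"), row.get("democ_score"), row.get("infect_rate"))
</code></pre>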
<h3 id="data-portals"><a name="data-portals" href="#data-portals"></a>Data portals</h3><p>Life is much easier if you can find everything you need in one place. The main effort to centralize access to data by the U.S. federal government is <a href="http://www.data.gov/">Data.gov</a>. You can search for data from the home page, or follow the <a href="http://catalog.data.gov/dataset">Data</a> and Topics links from the top menu.</p><p>Be warned, however, that Data.gov is a work in progress, and does not contain all of the U.S. government’s data. Some of the most useful datasets are still only available on the websites of individual federal agencies. <a href="http://fedstats.sites.usa.gov/">FedStats</a> has <a href="http://fedstats.sites.usa.gov/agencies/">links to agencies</a> with data collections.</p><p>It is worth familiarizing yourself with the main government agencies that have responsibility for the beats you are interested in, and with the datasets they maintain.</p><p>Other data portals at various levels of government are emerging. The City and County of San Francisco, for example, was at the forefront of the Open Data movement, establishing <a href="https://data.sfgov.org/">DataSF</a> in 2009.</p><p>If you need to make comparisons between nations, the <a href="http://www.worldbank.org/"><strong>World Bank</strong></a> probably has what you need. Its <a href="http://data.worldbank.org/indicator/?tab=all">World Development Indicators catalog</a> contains data for more than 7,000 different measures, compiled by the bank and other UN agencies.</p><p>You can navigate the site using the search box or the topics links to the right. When you click on a particular indicator, you are sent to a page that gives options to download the dataset from a link near the top right of the page. The data in some cases goes back as far as 1960, and is listed both by individual country and summarized by regions and income groups.</p><p>Other useful sources of data for international comparisons are <a href="http://www.gapminder.org/data/"><strong>Gapminder</strong></a> and the <strong><a href="http://unstats.un.org/unsd/default.htm">UN Statistical Division</a></strong>. For health data in particular, try the <strong><a href="http://www.oecd.org/els/health-systems/health-data.htm">Organisation for Economic Co-operation and Development</a></strong> and the <strong><a href="http://www.who.int/gho/en/">World Health Organization</a></strong>.</p><h3 id="search-for-data-on-the-web"><a name="search-for-data-on-the-web" href="#search-for-data-on-the-web"></a>Search for data on the web</h3><p>Often, your starting point in searching for data and documents will be Google. It can be worth focusing your queries using Google’s <a href="http://www.google.com/advanced_search">advanced search</a>:</p><p><img src="img/class4_1.jpg" alt=""></p><p>(Source: <a href="http://www.google.com/advanced_search">Google</a>)</p><p>The options to search by <code>site or domain</code> and <code>file type</code> can be especially useful when looking for data. For example, the <a href="http://www.usgs.gov/">U.S. Geological Survey</a> is the best source of data on earthquakes and seismic risk, so when searching for this information, specifying the domain <code>usgs.gov</code> would be a good idea. You can make the domains as narrow or broad as you like: <code>.edu</code>, for instance, would search the sites of all US academic institutions using that top-level domain; <code>berkeley.edu</code> would search the web pages of this university only.</p><p>The file type search offers a drop-down menu, with the options including Excel spreadsheets, and Google Earth <code>KML</code> and <code>KMZ</code> files. These are common data formats, but you are not limited to those on the menu. In a regular Google search, type a space after your search terms followed by <code>filetype:xxx</code>, where <code>xxx</code> is the suffix for the file type in question. For example, <code>dbf</code> will look for database tables in this format. Combining file type and domain searches can be a good way to find data an agency has posted online — some of which may not otherwise be readily accessible.</p>
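<p>For example, typed straight into a regular Google search box, illustrative queries like these combine the domain and file type approaches (the exact results will vary):</p><pre class="text"><code>site:usgs.gov filetype:xls earthquake
site:cdc.gov filetype:csv mortality
</code></pre>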
<p>Searching for PDFs and <code>PPT</code> presentations can be a good way to find documents that may not otherwise be easy to locate.</p><p>One common data format doesn’t show up in file type searches. Geographical data is often made available as “shapefiles.” Because they consist of multiple files that are usually stored in compressed folders, shapefiles can’t readily be searched using a file type suffix, but they can usually be found by adding the terms “shapefile” or “GIS data” to a regular Google search.</p><h3 id="search-online-databases"><a name="search-online-databases" href="#search-online-databases"></a>Search online databases</h3><p>Many important public databases can be searched online, and some offer options to download the results of your queries. Most of these databases provide a simple search box, but it’s always worth looking for the advanced search page, which will offer more options to customize your search. Here, for example, is the <a href="http://clinicaltrials.gov/ct2/search/advanced">advanced search</a> page for <a href="http://clinicaltrials.gov/">ClinicalTrials.gov</a>:</p><p><img src="img/class4_2.jpg" alt=""></p><p>(Source: <a href="http://clinicaltrials.gov/ct2/search/advanced">ClinicalTrials.gov</a>)</p><p>When you start working with a new online database, take some time to familiarize yourself with how its searches work: Read the Help or FAQs, and then run test searches to see what results you obtain. <a href="http://clinicaltrials.gov/ct2/help/how-find/index">Here</a>, for example, is the “How To” section of ClinicalTrials.gov.</p><p>Many online databases can be searched using Boolean logic, with the operators <code>AND</code>, <code>OR</code> and <code>NOT</code> linking search terms together. So find out how a particular database uses Boolean logic — and the default settings that it will use if you list search terms without any Boolean operators.</p><p>Putting search terms in quote marks often searches for a specific phrase.
For example, searching for “heart attack” on ClinicalTrials.gov will only return results in which those two words appear together; leaving out the quote marks will include any trial in which both words appear.</p><p>Also find out whether the database allows “wildcards,” symbols such as <code>*</code> or <code>%</code> that can be dropped into your search to obtain results with variations on a word or number.</p><h3 id="look-for-download-options----and-know-when-you-are-hitting-the-wall"><a name="look-for-download-options----and-know-when-you-are-hitting-the-wall" href="#look-for-download-options----and-know-when-you-are-hitting-the-wall"></a>Look for download options — and know when you are hitting the wall</h3><p>Having run a search on an online database, you will usually want to download the results, so look for the download links or buttons.</p><p>A common problem with online databases, however, is that they may impose limits on the number of results that are returned on each search. And even when a search returns everything, there may be a limit on how many of those results can be downloaded to your own computer.</p><p>If broad searches on a database keep returning the same number of results, that is a sign that you are probably running up against a search limit, and any download will not contain the complete set of data that you are interested in. However, you may be able to work out ways of searching to obtain all of the data in chunks.</p><h3 id="download-the-entire-database"><a name="download-the-entire-database" href="#download-the-entire-database"></a>Download the entire database</h3><p>Downloading an entire database, where this is allowed, frees you from the often-limited options given on an online advanced search form: You can then upload the data into your own database software, and query it in any way that you want. We will learn how to do this next week.</p><p>So always look for ways to grab all of the data. One trick is to run a search on just the database’s wildcard character, or with the query boxes left blank. If you do the latter at ClinicalTrials.gov, for instance, your search will return all of the trials in the database, which can then be downloaded using the options at the bottom of the results page.</p><p>Other databases have an online search form, but also provide a separate link from which the data can be downloaded in its entirety, usually as a text file or series of text files. One example is the Bioresearch Monitoring Information System, which lists doctors and other researchers involved in testing experimental drugs being considered for marketing approval by the Food and Drug Administration. It can be searched online <a href="http://www.accessdata.fda.gov/scripts/cder/BMIS/index.cfm?fuseaction=Search.ShowAdvancedSearchForm">here</a>, but can also be downloaded in full from <a href="http://www.fda.gov/Drugs/InformationOnDrugs/ucm135162.htm">here</a>.</p><p>Note that large text files are again often stored in compressed folders, so they may be invisible to a Google search by file type.</p><h3 id="automate-downloads-of-multiple-data-files"><a name="automate-downloads-of-multiple-data-files" href="#automate-downloads-of-multiple-data-files"></a>Automate downloads of multiple data files</h3><p>Often data or documents don’t reside in a single searchable database, but instead exist online as a series of separate files. In such cases, clicking on each link is tedious and time-consuming.
But you can automate the process using the <a href="https://chrome.google.com/webstore/detail/batch-link-downloader/aiahkbnnpafepcgnhhecilboebmmolnn?hl=en-US">Batch Link Downloader</a> Chrome extension.</p><p>To illustrate, go to <a href="http://www.gapminder.org/data/">Gapminder’s data catalog</a>, and select <code>All</code> indicators. The webpage now includes links to more than 500 downloadable spreadsheets.</p><p>Create a folder called <code>gapminder</code> on your desktop to hold the downloaded files.</p><p>Now temporarily change your download destination in Chrome so that the files are saved into this folder.</p><p>Select <code>Chrome>Preferences</code> from its top menu, then click the hamburger menu at top left:</p><p><img src="img/class4_3.jpg" alt=""></p><p>Open up the <code>Advanced</code> menu and select <code>Downloads</code>, then change the location to your <code>gapminder</code> folder.</p><p>The Batch Link Downloader downloads the target of any link on a web page. If it’s a link to an image it will download an image; if it’s a link to another webpage it will download that webpage; if it’s a link to a data file such as a CSV file, it will download the file.</p><p>Now, back on the Gapminder data page, hover over some of the spreadsheet download icons, and see that the links all end with <code>&amp;Output=xlsx</code>. This is because they are links to export Excel spreadsheets from the Google Sheets in which the data is stored.</p><p>Now click the Batch Link Downloader icon:</p><p><img src="img/class4_4.jpg" alt=""></p><p>Enter <code>xlsx</code> into the text box, click the <code>Add by filename pattern</code> button and all of the spreadsheets should be selected for download:</p><p><img src="img/class4_5.jpg" alt=""></p><p>(If you were downloading CSV files, you would enter <code>.csv</code> into the text box, and so on.)</p><p>Click <code>Start download</code> and all the files should download into the target folder.</p><p>Once finished, make sure to switch your download location back, using Chrome Preferences as before!</p>
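<p>If you prefer to script this step rather than use a browser extension, the same idea (collect every link whose URL matches a pattern, then fetch each one) takes a few lines of Python. Here is a rough sketch, assuming the third-party <code>requests</code> and <code>beautifulsoup4</code> packages are installed; note that pages which assemble their link lists with JavaScript, as Gapminder’s catalog does, may not expose the links to a simple HTTP fetch, in which case you would need a browser-automation tool instead:</p><pre class="python"><code class="python"># A rough scripted equivalent of the Batch Link Downloader.
# PAGE_URL, PATTERN and OUT_DIR are placeholders to adapt.
import os

import requests
from bs4 import BeautifulSoup

PAGE_URL = "http://www.gapminder.org/data/"
PATTERN = "xlsx"   # same pattern you would type into the extension
OUT_DIR = "gapminder"

os.makedirs(OUT_DIR, exist_ok=True)
soup = BeautifulSoup(requests.get(PAGE_URL).text, "html.parser")

# Collect the target of every link whose URL contains the pattern
links = [a["href"] for a in soup.find_all("a", href=True) if PATTERN in a["href"]]

for i, url in enumerate(links):
    response = requests.get(url)
    with open(os.path.join(OUT_DIR, f"file_{i}.xlsx"), "wb") as f:
        f.write(response.content)
</code></pre>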
<h3 id="extract-data-from-tables-on-the-web"><a name="extract-data-from-tables-on-the-web" href="#extract-data-from-tables-on-the-web"></a>Extract data from tables on the web</h3><p>On other occasions, data may exist in tables on the web. Copying and pasting data from web tables can be tricky, but the <a href="https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop?hl=en">Table Capture</a> Chrome extension simplifies the process.</p><p>To illustrate what Table Capture does, go to <a href="https://www.nih.gov/about-nih/what-we-do/nih-almanac/appropriations-section-1">this table</a> showing the budget history for components of the National Institutes of Health.</p><p>Click on the Table Capture icon:</p><p><img src="img/class4_6.jpg" alt=""></p><p>Now you can click the <code>Copy to clipboard</code> link or <code>To Google Doc</code> to paste into a blank file (Excel spreadsheet or text file) or Google Sheet respectively. (If pasting into a text file, the delimiters between the columns will be tabs.)</p><p><img src="img/class4_7.jpg" alt=""></p><p>This table is in two parts, so click on the <code>Section 2</code> link, and repeat the process to grab all the data.</p><h3 id="manipulate-urls-to-expose-the-data-you-need"><a name="manipulate-urls-to-expose-the-data-you-need" href="#manipulate-urls-to-expose-the-data-you-need"></a>Manipulate urls to expose the data you need</h3><p>As you search for data using web query forms, make a habit of looking at what happens to the url. Often it will contain patterns detailing the search you have run, and it will be possible to alter the data provided by manipulating the url. This can be quicker than filling in search forms. In some cases it may even reveal more data than the default search options will allow.</p><p>To illustrate how this works, go to the <a href="http://www.isrctn.com/">ISRCTN clinical trial registry</a>, and then navigate to the <a href="http://www.isrctn.com/editAdvancedSearch">advanced search page</a>. Enter <code>cancer</code> under <code>Condition</code> and note that more than 2,000 trials are returned.</p><p>The url should now have changed to the following:</p><pre class="javascript"><code class="javascript">http://www.isrctn.com/search?q=&amp;filters=condition%3Acancer&amp;searchType=advanced-search</code></pre>
<p>Notice that the default is to display 10 results per page, and then change the view to see the maximum of 100 trials allowed by the dropdown menu. The url should now read:</p><pre class="javascript"><code class="javascript">http://www.isrctn.com/search?pageSize=100&amp;sort=&amp;page=1&amp;q=&amp;filters=condition%3Acancer&amp;searchType=advanced-search</code></pre>
<p>Now change that url to:</p><pre class="javascript"><code class="javascript">http://www.isrctn.com/search?pageSize=2500&amp;sort=&amp;page=1&amp;q=&amp;filters=condition%3Acancer&amp;searchType=advanced-search</code></pre>
<p>Having done so, all of the registered clinical trials involving cancer should now be displayed on a single page. We could now use Batch Link Downloader to download all of the individual web pages describing each of these trials, or we could use this url as the starting point to scrape data from each of those pages.</p><h3 id="use-application-programming-interfaces-(apis)"><a name="use-application-programming-interfaces-(apis)" href="#use-application-programming-interfaces-(apis)"></a>Use application programming interfaces (APIs)</h3><p>Websites like the ISRCTN clinical trial registry are not expressly designed to be searched by manipulating their urls, but some organizations make their data available through APIs that can be queried by constructing a url in a similar way. This allows websites and apps to call in specific chunks of data as required, and work with it “on the fly.”</p><p>To see how this works, go to the U.S. Geological Survey’s <a href="http://earthquake.usgs.gov/earthquakes/search/">Search Earthquake Archives</a> page, where we will search for all earthquakes with a <a href="http://www.geo.mtu.edu/UPSeis/magnitude.html">magnitude</a> of 5 or greater that occurred within 6,000 kilometers of the geographic center of the contiguous United States, which <a href="http://tools.wmflabs.org/geohack/geohack.php?pagename=Geographic_center_of_the_contiguous_United_States&params=39.828175_N_98.579500_W_region:US_type:landmark">this site</a> tells us lies at a latitude of <code>39.828175</code> degrees and a longitude of <code>-98.5795</code> degrees. We will initially ask for the data in a format called <a href="http://geojson.org/">GeoJSON</a> (a variant of JSON, a common format used to transmit data over the web). Enter <code>1960-01-01T00:00:00</code> in the <code>Start</code> box under <code>Date &amp; Time</code> so that we obtain all recorded earthquakes from the beginning of 1960 onward. The search form should look like this:</p><p><img src="img/class4_8.jpg" alt=""></p><p>(Source: <a href="http://earthquake.usgs.gov/earthquakes/search/">U.S. Geological Survey</a>)</p><p>You should receive a quantity of data at the following url:</p><pre class="javascript"><code class="javascript">http://earthquake.usgs.gov/fdsnws/event/1/query?starttime=1960-01-01T00:00:00&amp;latitude=39.828175&amp;longitude=-98.5795&amp;maxradiuskm=6000&amp;minmagnitude=5&amp;format=geojson&amp;orderby=time</code></pre><p>See what happens if you append <code>-asc</code> to the end of that url: This should sort the earthquakes from oldest to newest, rather than the default of newest to oldest. <a href="http://earthquake.usgs.gov/fdsnws/event/1/">Here</a> is the full documentation for querying the earthquake API by manipulating these urls.</p><p>Now remove the <code>-asc</code> and replace <code>geojson</code> in the url with <code>csv</code>. The data should now download in CSV format.</p>
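<p>Once you know the query parameters, you can also run the same search from code, which is handy if you need to repeat it regularly. Here is a minimal sketch using the third-party Python <code>requests</code> package; the parameters mirror the url above, and the <code>time</code>, <code>mag</code> and <code>place</code> column names are those returned by the API’s CSV format at the time of writing:</p><pre class="python"><code class="python"># A minimal sketch: query the USGS earthquake API and parse the CSV response.
import csv
import io

import requests

API = "https://earthquake.usgs.gov/fdsnws/event/1/query"
params = {
    "starttime": "1960-01-01T00:00:00",
    "latitude": 39.828175,
    "longitude": -98.5795,
    "maxradiuskm": 6000,
    "minmagnitude": 5,
    "format": "csv",        # or "geojson"
    "orderby": "time-asc",  # oldest first
}

response = requests.get(API, params=params)
response.raise_for_status()

rows = list(csv.DictReader(io.StringIO(response.text)))
print(len(rows), "earthquakes")
for row in rows[:5]:
    print(row["time"], row["mag"], row["place"])
</code></pre>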
<h3 id="pdfs:-the-bane-of-data-journalism"><a name="pdfs:-the-bane-of-data-journalism" href="#pdfs:-the-bane-of-data-journalism"></a>PDFs: The bane of data journalism</h3><p>Some organizations persist in making data available as PDFs, rather than text files, spreadsheets or databases. This makes the data hard to extract. While you should always ask for data in a more friendly format — ideally a CSV or other simple text file — as a data journalist you are at some point likely to find yourself needing to pull data out of a PDF.</p><p>For digital PDFs, <strong><a href="http://tabula.technology/">Tabula</a></strong> is a useful data extraction tool — however, it will not work with PDFs created by scanning the original document, which have to be interpreted using Optical Character Recognition (OCR) software.</p><p>Also useful is the online service <strong><a href="http://www.cometdocs.com/">Cometdocs</a></strong>. While it is a commercial tool, members of Investigative Reporters and Editors can <a href="http://ire.org/blog/ire-news/2013/05/22/ire-announces-partnership-cometdocs/">obtain a free account</a>. Cometdocs can read scanned PDFs; however, its accuracy will vary depending on how well the OCR works on the document in question.</p><p><strong><a href="https://acrobat.adobe.com/us/en/products/acrobat-pro.html">Adobe Acrobat Pro</a></strong> will also perform OCR on scanned PDFs.</p><p><a href="https://github.com/jsfenfen/parsing-prickly-pdfs">Here</a> is a useful set of resources detailing more advanced tools for dealing with hard-to-parse PDFs.</p>
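<p>If you find yourself extracting the same tables repeatedly, Tabula can also be scripted: <code>tabula-py</code> is a Python wrapper around it. A minimal sketch, assuming that package is installed (like Tabula itself, it requires Java) and that a digital, non-scanned PDF has been saved as <code>report.pdf</code>, a hypothetical filename:</p><pre class="python"><code class="python"># A minimal sketch using tabula-py, a Python wrapper around Tabula.
import tabula

# read_pdf returns a list of pandas DataFrames, one per table it detects
tables = tabula.read_pdf("report.pdf", pages="all")
print(len(tables), "tables found")
tables[0].to_csv("report_table1.csv", index=False)
</code></pre>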
<h3 id="can-i-trust-this-data?"><a name="can-i-trust-this-data?" href="#can-i-trust-this-data?"></a>Can I trust this data?</h3><p>Having identified a possible source of data for your project, you need to ask: Is it reliable, accurate and useful? If you rush into analysis without considering this question, your hard work may be undermined by the maxim: “Garbage In, Garbage Out.”</p><p>The best rule of thumb in determining the reliability of a dataset is to find out whether it has been used for analysis before, and if so, by whom. If a dataset was put together for an academic study, or is actively curated so it can be made available for experts to analyze, you can be reasonably confident that it is as complete and accurate as it can be — the U.S. Geological Survey’s earthquake data is a good example.</p><p>While in general you might be more trusting of data downloaded from a <code>.gov</code> or <code>.edu</code> domain than of something found elsewhere on the web, don’t simply assume that it is reliable and accurate. Before using any dataset, do some background research to find out how it was put together, and whether it has been rigorously checked for errors. If possible, try to speak to the people responsible for managing the database, and to any academics or other analysts who have used the data. They will be your best guide to a dataset’s strengths and weaknesses.</p><p>Even for well-curated data, make a point of speaking with experts who compile it or use it, and ask them about the data’s quirks and limitations. From talking with experts on hurricanes, for example, I know not to place too much trust in data on North Atlantic storms prior to about 1990, before satellite monitoring was well developed — even though the <a href="http://www.aoml.noaa.gov/hrd/hurdat/Data_Storm.html">data available from the National Oceanic and Atmospheric Administration</a> goes back to 1851.</p><p>Always ask probing questions of a dataset before putting your trust in it. Is this data complete? Is it up-to-date? If it comes from a survey, was it based on a representative sample of people who are relevant to your project? Remember that the first dataset you find online may not be the most relevant or reliable.</p><h3 id="recognize-dirty-data"><a name="recognize-dirty-data" href="#recognize-dirty-data"></a>Recognize dirty data</h3><p>In an ideal world, every dataset we find would have been lovingly curated, allowing us to start analyzing and visualizing without worrying about its accuracy.</p><p>In practice, however, often the best available data has some flaws, which may need to be corrected as far as is possible. So before starting to work with a new dataset, look for common errors.</p><p>Look for glitches in the alignment of columns, which may cause data to appear in the wrong field.</p><p>For people’s names, look for variations in spelling, format, initials and accents, which may cause the same person to appear in multiple guises. Similar glitches may affect addresses, and any other information entered as text.</p><p>Some fields offer obvious checks: if you see a zip code that contains anything other than five or nine digits, for instance, you know it must be wrong.</p><p>Dates can also be entered incorrectly, so it’s worth scanning for those that fall outside the timeframe that should be covered by the data.</p><p>Also scan numbers for any obvious outliers. These values are worth checking out. Are they correct, or did someone misplace a decimal point or enter a number in the wrong units?</p><p>Another common problem is white space before and after some entries, which may need to be stripped out.</p><p>At all stages of your work, pay attention to zeros. Is each one actually supposed to represent zero, or should the cell in fact be empty? Take particular care exporting data from one software tool and importing it to another, and check how empty cells, or “nulls,” have been handled.</p>
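<p>Many of these checks can be scripted, so you can rerun them every time you receive a new version of a dataset. Here is a minimal sketch using the <code>pandas</code> package; the filename and column names (<code>data.csv</code>, <code>zip</code>, <code>date</code>, <code>city</code>, <code>amount</code>) are hypothetical placeholders, and the date range should be whatever period your data claims to cover:</p><pre class="python"><code class="python"># A minimal sketch of automated sanity checks with pandas.
import pandas as pd

df = pd.read_csv("data.csv", dtype={"zip": str})  # keep zip codes as text

# Zip codes should contain exactly five or nine digits
bad_zips = df[~df["zip"].str.match(r"^\d{5}(\d{4})?$", na=False)]
print(len(bad_zips), "suspect zip codes")

# Dates should fall inside the period the data is supposed to cover
dates = pd.to_datetime(df["date"], errors="coerce")
print(df[(dates &lt; "1990-01-01") | (dates &gt; "2018-12-31")])

# Strip leading/trailing white space that splits one value into many
df["city"] = df["city"].str.strip()

# Summary statistics flag obvious numeric outliers worth checking by hand
print(df["amount"].describe())
</code></pre>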
<h3 id="clean-and-process-data-with-open-refine"><a name="clean-and-process-data-with-open-refine" href="#clean-and-process-data-with-open-refine"></a>Clean and process data with Open Refine</h3><p>Checking and cleaning “dirty” data can be the most labor-intensive part of many data journalism projects, but <strong><a href="http://openrefine.org/">Open Refine</a></strong> can streamline the task — and also create a reproducible script to quickly repeat the process on data that must be cleaned and processed in the same way.</p><p>When you launch Open Refine, it opens in your web browser. However, any data you load into the program will remain on your computer — it does not get posted online.</p><p>The opening screen should look like this:</p><p><img src="img/class4_9.jpg" alt=""></p><h4 id="reshape-data-from-wide-to-long-format"><a name="reshape-data-from-wide-to-long-format" href="#reshape-data-from-wide-to-long-format"></a>Reshape data from wide to long format</h4><p>Click the <code>Browse</code> button and navigate to the file <code>techexports.xls</code>. Click <code>Next>></code>, and check that the data looks correct:</p><p><img src="img/class4_10.jpg" alt=""></p><p>Open Refine should recognize that the data is in an Excel spreadsheet file, but if not you can use the panel at bottom to specify the correct file type and format for the data. If your file has some initial header lines that need to be ignored, Open Refine can deal with that, too.</p><p>Numbers and dates should appear in green, plain text in black. Also change the <code>Project name</code> to remove the <code>xls</code>.</p><p>When you are satisfied that the data has been read correctly, click the <code>Create Project >></code> button at top right. The screen should now look like this:</p><p><img src="img/class4_11.jpg" alt=""></p><p>As you can see, the data is in wide format, with values for technology exports by country organized in columns, one for each year. To convert this to long format, click on the small downward-pointing triangle for the first of these year columns, and select <code>Transpose>Transpose cells across columns into rows</code>.</p><p>Fill in the dialog box as below, making sure that <code>From Column</code> and <code>To Column</code> are highlighted correctly, that the <code>Key column</code> and <code>Value column</code> have been given appropriate names, and that <code>Fill down in other columns</code> is checked. (Failing to check this box will mean that each country name appears only once in the reshaped data, rather than being copied down to appear next to the corresponding data for each year.)</p><p><img src="img/class4_12.jpg" alt=""></p><p>Click <code>Transpose</code> and then the <code>50</code> rows link, to see the first 50 rows of the reshaped data:</p><p><img src="img/class4_13.jpg" alt=""></p><p>Click the <code>Export</code> button at top right and you will see options to export the data in a variety of file types, including <code>Comma-separated value</code> and <code>Excel</code> spreadsheet.</p>
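<p>For comparison, the same wide-to-long reshape can also be scripted. Here is a sketch using the <code>pandas</code> package in Python; the <code>Country Name</code> label column is an assumption, so check the actual column names in the file:</p><pre class="python"><code class="python"># A sketch of the same reshape in pandas; column names are assumptions.
import pandas as pd

wide = pd.read_excel("techexports.xls")

# melt gathers every non-id column into key/value pairs, one row per country-year
long = wide.melt(id_vars=["Country Name"], var_name="year",
                 value_name="high_tech_exports")
long.to_csv("techexports_long.csv", index=False)
</code></pre>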
<h4 id="clean-dirty-data"><a name="clean-dirty-data" href="#clean-dirty-data"></a>Clean dirty data</h4><p>Click the Open Refine logo at top left to return to the opening screen. Create a new project from the file <code>ucb_stanford_2014.csv</code>.</p><p>When importing a CSV or other text file, check that the correct column separator has been recognized, and also make sure to check <code>Parse cell text into numbers, dates ...</code>, otherwise every column will be imported as plain text:</p><p><img src="img/class4_14.jpg" alt=""></p><p>Once imported, the project should look like this:</p><p><img src="img/class4_15.jpg" alt=""></p><p>Again, each field/column has a button with a downward-pointing triangle. Click on these buttons and you get the option to create “facets” for the column, which provide a powerful way to edit and clean data.</p><p>Click on the button for the field <code>Recipient City</code>, and select <code>Facet>Text facet</code>. A summary of the various entries now appears in the panel to the left:</p><p><img src="img/class4_16.jpg" alt=""></p><p>The numbers next to each entry show how many records there are for each value.</p><p>We can edit entries individually: Select <code>Veterans Bureau Hospi</code>, which is clearly not a city, click on the <code>Edit</code> link, and change it to <code>Unknown</code>. (If cleaning this data for a real project, we would need to check with an external source to get the actual city for this entry.)</p><p>Another problem is that we have a mixture of cases, with some entries in Title or Proper Case, some in UPPERCASE. We can fix this back in the field itself. Click its button again and select <code>Edit cells>common transforms>To titlecase</code>.</p><p>Now notice that we apparently have duplicate entries for <code>Berkeley</code>, <code>Palo Alto</code> and <code>Stanford</code>. This is the result of trailing white space after the city names for some entries. Select <code>Edit cells>common transforms>Trim leading and trailing whitespace</code> and notice how the problem resolves:</p><p><img src="img/class4_17.jpg" alt=""></p><p>Having cleaned this field, close the facet by clicking the cross at top left.</p><p>Now create a text facet for the field <code>Recipient</code>:</p><p><img src="img/class4_18.jpg" alt=""></p><p>What a mess! The only possibilities are Stanford or Berkeley, yet there are multiple variants of each, many including <code>Board of Trustees</code> for Stanford and <code>Regents of</code> for UC Berkeley.</p><p>First, manually edit <code>Interuniversity Center for Japanese Language</code> to <code>Stanford</code>, which is where this center is based.</p><p>We could continue editing manually, but to illustrate Open Refine’s editing functions click on the <code>Cluster</code> button. Here you can experiment with different clustering algorithms to edit entries that may be variants of the same thing. Select <code>key collision</code> and <code>metaphone3</code>, then start checking the clusters and renaming them as <code>Berkeley</code> or <code>Stanford</code> as appropriate:</p><p><img src="img/class4_19.jpg" alt=""></p><p>Click <code>Merge Selected &amp; Close</code> and the facet can then be quickly edited manually:</p><p><img src="img/class4_20.jpg" alt=""></p><p>Often we may need to convert fields to text, numbers or dates. For example, click on the button for <code>Award Date</code> and select <code>Edit cells>common transforms>To date</code> and see that it changes from a string of text to a date in standard format.</p><p>Notice the field <code>Award amount</code>, which is a value in dollars. Negative values are given in brackets. Because of these symbols, the field is being recognized as a string of text, rather than a number.
So to fix this problem, we have to remove those symbols.</p><p>Select <code>Edit column>Add column based on this column...</code> and fill in the dialog box as follows:</p><p><img src="img/class4_21.jpg" alt=""></p><p>This is the expression:</p><pre class="javascript"><code class="javascript">value.replace('$','').replace(',','').replace('(','-').replace(')','')</code></pre>
<p>Here <code>value</code> refers to the value in the original column, and <code>replace</code> is a function that replaces characters in the value. We can run several <code>replace</code> operations by “chaining” them together.</p><p>In this expression we are replacing the <code>$</code> symbols, the commas separating thousands, and the closing brackets with nothing; we are replacing the opening brackets with a hyphen to designate negative numbers.</p><p>Click <code>OK</code> and the new column will be created. Note that it is still being treated as text, but that can be corrected by selecting <code>Edit cells>common transforms>To number</code>.</p><p>This is just one example of the many data transformation functions that can be accessed using Open Refine’s expression language, called GREL. Learning these functions can make Open Refine into a very powerful data processing tool. Study the “Further reading” links for more.</p><p>Open Refine’s facets can also be used to inspect columns containing numbers. Select <code>Facet>Numeric facet</code> for the new field. This will create a histogram showing the distribution of numbers in the field:</p><p><img src="img/class4_22.jpg" alt=""></p><p>We can then use the slider controls to filter the data, which is good for examining possible outliers at the top or bottom of the range. Notice that here a small number of grants have negative values, while there is one grant with a value of more than $3 billion from the National Science Foundation. This might need to be checked out to ensure that it is not an error.</p><p>While most of the data processing we have explored could also be done in a spreadsheet, the big advantage of Open Refine is that we can extract a “pipeline” for processing data to use when we obtain data in the same format in future.</p><p>Select <code>Undo / Redo</code> at top left. Notice that clicking on one of the steps detailed at left will transform the data back to that stage in our processing. This means you don’t need to worry about making mistakes, as it’s always possible to revert to an earlier state, before the error, and pick up from there.</p><p>Return to the final step, then click the <code>Extract</code> button. At the dialog box, check only those operations that you will want to perform in future (typically generic transformations on fields/columns, and not corrections of errors in individual entries). Here I have unchecked all of the corrections in the text facets, and selected just those operations that I know I will want to repeat if I obtain data from this source again:</p><p><img src="img/class4_23.jpg" alt=""></p><p>This will generate JSON in the right-hand panel that can be copied into a blank text file and saved.</p><p>To process similar data in future, click the <code>Apply</code> button on the <code>Undo / Redo</code> tab, paste in the text from this file, and click <code>Perform Operations</code>. The data will then be processed automatically.</p><p>When you are finished cleaning and processing your data, click the <code>Export</code> button at top right to export as a CSV file or in other formats.</p><p>Open Refine is a very powerful tool that will reward efforts to explore its wide range of functions for manipulating data, and to learn its expression language.
See the Further Reading for more.</p><h3 id="convert-from-one-data-format-to-another"><a name="convert-from-one-data-format-to-another" href="#convert-from-one-data-format-to-another"></a>Convert from one data format to another</h3><p><strong><a href="https://shancarter.github.io/mr-data-converter/">Mr Data Converter</a></strong> is a web app made by Shan Carter, formerly of the graphics team at <em>The New York Times</em>, now with Google, that makes it easy to convert data from a spreadsheet or delimited text file to JSON or XML.</p><p>Copy the data from a <code>CSV</code> or tab-delimited text file and paste it into the top box, select the output you want, and it will appear at the bottom. You will generally want to select the <code>Properties</code> variants of JSON or XML.</p><p>You can then copy and paste this output into a text editor, and save the file with the appropriate suffix (<code>xml</code>, <code>json</code>).</p><p><img src="img/class4_24.jpg" alt=""></p><p>(Source: <a href="https://shancarter.github.io/mr-data-converter/">Mr Data Converter</a>)</p><p>To convert data from JSON or XML into text files, use Open Refine. First create a new project and import your JSON or XML file. Then use the <code>Export</code> button and select <code>Tab-separated value</code> or <code>Comma-separated value</code> to export as a text file.</p>
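<p>You can also make this kind of conversion with a few lines of code. Here is a minimal sketch using Python’s standard library, with <code>data.csv</code> as a placeholder filename; note that every value ends up as a quoted string in the JSON unless you convert the types yourself:</p><pre class="python"><code class="python"># A minimal sketch: convert a CSV file to JSON with the standard library.
import csv
import json

with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))  # one dictionary per row

with open("data.json", "w") as f:
    json.dump(rows, f, indent=2)
</code></pre>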
<h3 id="excercise/assignment"><a name="excercise/assignment" href="#excercise/assignment"></a>Exercise/assignment</h3><ul>
<li>Using Open Refine, process the Medical Board of California 2017 disciplinary data in the file <code>alerts-actions_2017.xls</code> in the following way:<ul>
<li>Separate city and state into two columns.</li><li>Process the doctors’ names into four columns, one for each of: last name; first name; other names; suffix (Jr., II etc., if present).</li><li>Make sure no text columns have leading or trailing white space.</li><li>Remove the space from the license numbers so that they have the format <code>G34377</code>, <code>A92917</code>, and so on.</li><li>Remove the multiple header rows scattered through the data.</li><li>Rename all columns so that there are no spaces in the column names.</li><li>Fix any other problems you may find in the data.</li><li>Export the processed data as a CSV file.</li><li>Extract the JSON that would allow your data processing pipeline to be repeated on another of the downloads from <a href="http://www.mbc.ca.gov/Publications/Disciplinary_Actions/">this page</a>. Save the extracted JSON in a text file with the extension <code>.json</code>.</li><li>File your CSV and JSON files via bCourses.</li></ul>
</li></ul><p><strong>Deadline: Weds Feb 14 at 8pm</strong></p><h3 id="further-reading/resources"><a name="further-reading/resources" href="#further-reading/resources"></a>Further reading/resources</h3><p>Paul Bradshaw. <em><a href="https://leanpub.com/scrapingforjournalists">Scraping For Journalists</a></em></p><p><a href="https://github.com/OpenRefine/OpenRefine/wiki">Open Refine Wiki</a></p><p><a href="https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users">Open Refine Documentation</a></p><p><a href="https://github.com/OpenRefine/OpenRefine/wiki/Recipes">Open Refine Recipes</a></p>
</div> <!-- /.container all -->
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
</body>
</html>