Note: This exercise set is part of the Stanford Computational Journalism Lab. I've also written a blog post that gives a little more elaboration about the libraries used and a few of the exercises.
This repository contains 101 Web data-collection tasks in Python 3 that I assigned to my Computational Journalism class in Spring 2015 to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online.
The hard part of many of these tasks is researching and finding the actual data source. The scripts need only concern themselves with fetching the data and printing the answer in the least painful way possible. Since the Computational Journalism class wasn't intended to be an actual programming class, adherence to idioms and best coding practices was not emphasized...(especially since I'm new to Python myself!)
Some examples of the tasks:
- The California city whose city manager has the highest total wage per capita in 2012 (expanded version)
- In the most recently transcribed Supreme Court argument, the number of times laughter broke out (expanded version)
- Number of days until Texas's next scheduled execution
- The U.S. congressmember with the most Twitter followers
- The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days
The table below links to the available scripts. If there's not a link, it means I haven't committed the code. For some of them, I had to rethink a less verbose solution (or the target changed, as the Internet sometimes does), and now this repo has taken a backseat to many other data projects on my list. ¯\_(ツ)_/¯
Note: A lot of the code is not best practice. The tasks are a little repetitive so I got bored and ignored PEP8 and/or tried new libraries/conventions for fun.
Note: The "related URL" links to either the official source of the data, or at least a page with some background information. The second column of this table refers to line count of the script, not the answer to the prompt.
The repo currently contains scripts for 100 of 101 tasks:
Title | Line count |
---|---|
1. Number of datasets currently listed on data.gov [related URL] [script] |
7 lines |
2. The name of the most recently added dataset on data.gov [related URL] [script] |
7 lines |
3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days [related URL] [script] |
4 lines |
4. The number of librarian-related job positions that the federal government is currently hiring for [related URL] [script] |
6 lines |
5. The name of the company cited in the most recent consumer complaint involving student loans [related URL] [script] |
27 lines |
6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees [related URL] [script] |
38 lines |
7. The number of listed federal executive agency internet domains [related URL] [script] |
8 lines |
8. The number of times when a New York heart surgeon's rate of patient deaths for all cardiac surgical procedures was "significantly higher" than the statewide rate, according to New York state's analysis. [related URL] [script] |
7 lines |
9. The number of roll call votes that were rejected by a margin of less than 5 votes, in the first session of the U.S. Senate in the 114th Congress [related URL] [script] |
26 lines |
10. The title of the highest paid California city government position in 2010 [related URL] [script] |
35 lines |
11. How much did the state of California collect in property taxes, according to the U.S. Census 2013 Annual Survey of State Government Tax Collections? [related URL] [script] |
23 lines |
12. In 2010, the year-over-year change in enplanements at America's busiest airport [related URL] [script] |
51 lines |
13. The number of armored carrier bank robberies recorded by the FBI in 2014 [related URL] [script] |
15 lines |
14. The number of workplace fatalities reported to the federal and state OSHA in the latest fiscal year [related URL] [script] |
14 lines |
15. Total number of wildlife strike incidents reported at San Francisco International Airport [related URL] [script] |
48 lines |
16. The non-profit organization with the highest total revenue, according to the latest listing in ProPublica's Nonprofit Explorer [related URL] [script] |
11 lines |
17. In the "Justice News" RSS feed maintained by the Justice Department, the number of items published on a Friday [related URL] [script] |
11 lines |
18. The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data [related URL] [script] |
9 lines |
19. The total number of preliminary reports on aircraft safety incidents/accidents in the last 10 business days [related URL] [script] |
12 lines |
20. The number of OSHA enforcement inspections involving Wal-Mart in California since 2014 [related URL] [script] |
25 lines |
21. The current humidity level at Great Smoky Mountains National Park [related URL] [script] |
6 lines |
22. The names of the committees that Sen. Barbara Boxer currently serves on [related URL] [script] |
7 lines |
23. The name of the California school with the highest number of girls enrolled in kindergarten, according to the CA Dept. of Education's latest enrollment data file. [related URL] [script] |
21 lines |
24. Percentage of NYPD stop-and-frisk reports in which the suspect was white in 2014 [related URL] [script] |
24 lines |
25. Average frontal crash star rating for 2015 Honda Accords [related URL] [script] |
14 lines |
26. The dropout rate for all of Santa Clara County high schools, according to the latest cohort data in CALPADS [related URL] [script] |
48 lines |
27. The number of Class I Drug Recalls issued by the U.S. Food and Drug Administration since 2012 [related URL] [script] |
14 lines |
28. Total number of clinical trials as recorded by the National Institutes of Health [related URL] [script] |
7 lines |
29. Number of days until Texas's next scheduled execution [related URL] [script] |
24 lines |
30. The total number of inmates executed by Florida since 1976 [related URL] [script] |
10 lines |
31. The number of proposed U.S. federal regulations in which comments are due within the next 3 days [related URL] [script] |
29 lines |
32. Number of Titles that have changed in the United States Code since its last release point [related URL] [script] |
6 lines |
33. The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient [related URL] [script] |
14 lines |
34. In the latest FDA Weekly Enforcement Report, the number of Class I and Class II recalls involving food [related URL] [script] |
10 lines |
35. Most viewed data set on New York state's open data portal as of this month [related URL] [script] |
9 lines |
36. Total number of visitors to the White House in 2012 [related URL] [script] |
27 lines |
37. The last time the CIA's Leadership page has been updated [related URL] [script] |
6 lines |
38. The domain of the most visited U.S. government website right now [related URL] [script] |
5 lines |
39. Number of medical device recalls issued by the U.S. Food and Drug Administration in 2013 [related URL] [script] |
6 lines |
40. Number of FOIA requests made to the Chicago Public Library [related URL] [script] |
6 lines |
41. The number of currently open medical trials involving alcohol-related disorders [related URL] [script] |
5 lines |
42. The name of the Supreme Court justice who delivered the opinion in the most recently announced decision [related URL] [script] |
31 lines |
43. The number of citations that resulted from FDA inspections in fiscal year 2012 [related URL] [script] |
10 lines |
44. Number of people visiting a U.S. government website right now [related URL] [script] |
6 lines |
45. The number of security alerts issued by US-CERT in the current year [related URL] [script] |
6 lines |
46. The number of Pinterest accounts maintained by U.S. State Department embassies and missions [related URL] [script] |
13 lines |
47. The number of international travel alerts from the U.S. State Department currently in effect [related URL] [script] |
7 lines |
48. The difference in total White House staffmember salaries in 2014 versus 2010 [related URL] [script] |
19 lines |
49. Number of sponsored bills by Rep. Nancy Pelosi that were vetoed by the President [related URL] [script] |
11 lines |
50. In the most recently transcribed Supreme Court argument, the number of times laughter broke out [related URL] [script] |
22 lines |
51. The title of the most recent decision handed down by the U.S. Supreme Court [related URL] [script] |
6 lines |
52. The average wage of optometrists according to the BLS's most recent National Occupational Employment and Wage Estimates report [related URL] [script] |
8 lines |
53. The total number of on-campus hate crimes as reported to the U.S. Office of Postsecondary Education, in the most recent collection year [related URL] [script] |
45 lines |
54. The number of people on FBI's Most Wanted List for white collar crimes [related URL] [script] |
6 lines |
55. The number of Government Accountability Office reports and testimonies on the topic of veterans [related URL] [script] |
10 lines |
56. Number of times Rep. Darrell Issa's remarks have made it onto the Congressional Record [related URL] [script] |
9 lines |
57. The top 3 auto manufacturers, ranked by total number of recalls via NHTSA safety-related defect and compliance campaigns since 1967. [related URL] [script] |
24 lines |
58. The number of published research papers from the NSA [related URL] [script] |
6 lines |
59. The number of university-related datasets currently listed at data.gov [related URL] [script] |
7 lines |
60. Number of chapters in Title 20 (Education) of the United States Code [related URL] [script] |
15 lines |
61. The number of miles traveled by the current U.S. Secretary of State [related URL] [script] |
6 lines |
62. For all of 2013, the number of potential signals of serious risks or new safety information that resulted from the FDA's FAERS [related URL] [script] |
14 lines |
63. In the current dataset behind Medicare's Nursing Home Compare website, the total amount of fines received by penalized nursing homes [related URL] [script] |
35 lines |
64. From March 1 to 7, 2015, the number of times in which designated FDA policy makers met with persons outside the U.S. federal executive branch [related URL] [script] |
5 lines |
65. The number of failed votes in the roll calls 1 through 99, in the U.S. House of the 114th Congress [related URL] [script] |
12 lines |
66. The highest minimum wage as mandated by state law. [related URL] [script] |
28 lines |
67. For the most recently posted TSA.gov customer satisfaction survey, post the percentage of respondents who rated their "overall experience today" as "Excellent" [related URL] |
|
68. Number of FDA-approved prescription drugs with GlaxoSmithKline as the applicant holder [related URL] [script] |
11 lines |
69. The average number of comments on the last 50 posts on NASA's official Instagram account [related URL] [script] |
40 lines |
70. The highest salary possible for a White House staffmember in 2014 [related URL] [script] |
10 lines |
71. The percent increase in number of babies named Archer nationwide in 2010 compared to 2000, according to the Social Security Administration [related URL] [script] |
32 lines |
72. The number of magnitude 4.5+ earthquakes detected worldwide by the USGS [related URL] [script] |
8 lines |
73. The total amount of contributions made by lobbyists to Congress according to the latest downloadable quarterly report [related URL] [script] |
34 lines |
74. The description of the bill most recently signed into law by the governor of Georgia [related URL] [script] |
12 lines |
75. Total number of officer-involved shooting incidents listed by the Philadelphia Police Department [related URL] [script] |
9 lines |
76. The total number of publications produced by the U.S. Government Accountability Office [related URL] [script] |
9 lines |
77. Number of Dallas officer-involved fatal shooting incidents in 2014 [related URL] [script] |
7 lines |
78. Number of Cupertino, CA restaurants that have been shut down due to health violations in the last six months. [related URL] [script] |
6 lines |
79. The change in total airline revenues from baggage fees, from 2013 to 2014 [related URL] [script] |
19 lines |
80. The total number of babies named Odin born in Colorado according to the Social Security Administration [related URL] [script] |
20 lines |
81. The latest release date for T-100 Domestic Market (U.S. Carriers) statistics report [related URL] [script] |
13 lines |
82. In the most recent FDA Adverse Events Reports quarterly extract, the number of patient reactions mentioning "Death" [related URL] [script] |
47 lines |
83. The sum of White House staffmember salaries in 2014 [related URL] [script] |
12 lines |
84. The total number of notices published on the most recent date to the Federal Register [related URL] [script] |
6 lines |
85. The number of iPhone units sold in the latest quarter, according to Apple Inc's most recent 10-Q report [related URL] [script] |
49 lines |
86. Number of computer vulnerabilities in which IBM was the vendor in the latest Cyber Security Bulletin [related URL] [script] |
10 lines |
87. Number of airports with existing construction related activity [related URL] [script] |
6 lines |
88. The number of posts on TSA's Instagram account [related URL] [script] |
24 lines |
89. In fiscal year 2013, the short description of the most frequently cited type of FDA's inspectional observations related to food products. [related URL] [script] |
32 lines |
90. The currently serving U.S. congressmember with the most Twitter followers [related URL] [script] |
76 lines |
91. Number of stop-and-frisk reports from the NYPD in 2014 [related URL] [script] |
22 lines |
92. In 2012-Q4, the total amount paid by Rep. Aaron Schock to Lobair LLC, according to Congressional spending records, as compiled by the Sunlight Foundation [related URL] [script] |
14 lines |
93. Number of Github repositories maintained by the GSA's 18F organization, as listed on Github.com [related URL] [script] |
5 lines |
94. The New York City high school with the highest average math score in the latest SAT results [related URL] [script] |
96 lines |
95. Since 2002, the most commonly occurring winning number in New York's Lottery Mega Millions [related URL] [script] |
9 lines |
96. The number of scheduled arguments according to the most recent U.S. Supreme Court argument calendar [related URL] [script] |
11 lines |
97. The New York school with the highest rate of religious exemptions to vaccinations [related URL] [script] |
10 lines |
98. The latest estimated population percent change for Detroit, MI, according to the latest Census QuickFacts summary. [related URL] [script] |
8 lines |
99. According to the Medill National Security Zone, the number of chambered guns confiscated at airports by the TSA [related URL] [script] |
11 lines |
100. The California city whose city manager earns the most total wage per population of its city in 2012 [related URL] [script] |
23 lines |
101. The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data [related URL] [script] |
8 lines |
Each task is meant to be a self-contained script: you run it, and it prints the answer I'm looking for. The scripts in this repo should "just work"...if you have all the dependencies installed that I had while writing them, and the web URLs they target haven't changed...so, basically, these may not work at all.
To copy the scripts quickly via the command line (by default, a ./search-script-scrape directory will be created):
$ git clone https://github.com/compjour/search-script-scrape.git
To run a script:
$ cd search-script-scrape
$ python3 scripts/1.py
I leave it to you and Google to figure out how to run Python 3 on your own system. FWIW, I was using the Python 3.4.3 provided by the Anaconda 2.2.0 installer for OS X. The most common third-party libraries used are Requests for downloading the files and lxml for HTML parsing.
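If you haven't seen those two libraries used together, here is a minimal sketch of the fetch-then-parse pattern. It is not copied from any script in this repo: the function names, the XPath expression, and the sample HTML are all illustrative assumptions, and it needs Requests and lxml installed.

```python
import requests
from lxml import html


def fetch_html(url):
    """Download a page and return its text; raises on an HTTP error status."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text


def extract_link_texts(page_text):
    """Return the stripped text of every <a> element in the page."""
    doc = html.fromstring(page_text)
    return [a.text_content().strip() for a in doc.xpath('//a')]


# Hypothetical markup standing in for a downloaded page
SAMPLE = (
    '<html><body><ul>'
    '<li><a href="/a">First</a></li>'
    '<li><a href="/b">Second</a></li>'
    '</ul></body></html>'
)

links = extract_link_texts(SAMPLE)
print(len(links), links[0])
```

In a real script, `extract_link_texts(fetch_html(url))` would take the place of the sample string, with `url` being whatever page the task points to.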
To reiterate: each of these scripts is meant to print out a single answer, and so they don't actually show the full potential of how programming can automate data collection. As you get better at programming and recognizing its patterns, you'll find out how easy it is to abstract what seemed like a narrow task into something much bigger.
For example, Script #50 prints out the number of times laughter broke out in the most recently transcribed Supreme Court argument. Change two lines and that script will print out the laugh count in every transcribed Supreme Court argument: (demo here)
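That kind of change usually amounts to wrapping the single-answer logic in a function and then calling it in a loop. A rough sketch of the idea, which is not the actual code of Script #50: it assumes laughter is marked in the transcript text as "(Laughter.)", and it uses made-up snippets in place of downloaded transcripts.

```python
import re

# Assumption: laughter shows up in the transcript text as "(Laughter.)"
LAUGHTER_PATTERN = re.compile(r'\(Laughter\.?\)', re.IGNORECASE)


def count_laughter(transcript_text):
    """Count the laughter markers in one transcript's text."""
    return len(LAUGHTER_PATTERN.findall(transcript_text))


# Made-up snippets standing in for downloaded transcripts
transcripts = {
    'case-a': 'JUSTICE X: Is that so? (Laughter.) MR. Y: It is. (Laughter.)',
    'case-b': 'MR. Z: This is no laughing matter.',
    'case-c': 'JUSTICE W: I withdraw the question. (Laughter.)',
}

# The narrow task: one transcript, one answer
latest_count = count_laughter(transcripts['case-a'])

# The bigger task: same function, applied to every transcript
all_counts = {name: count_laughter(text) for name, text in transcripts.items()}
for name, n in all_counts.items():
    print(name, n)
```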
The same kind of small code restructuring can be done to many of the tasks here. And you can also modify the parameters; why limit yourself to finding the highest paid "City Manager" in California when you can extend the search to every kind of California employee, across every year of salary data? (demo here)
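One way to picture the "modify the parameters" idea is to turn the hard-coded job title and year into optional function arguments. This is only a sketch: the field names (`position`, `year`, `total_wages`) and the records are hypothetical placeholders, not the real salary data's columns or values.

```python
def highest_paid(records, position=None, year=None):
    """Return the record with the largest total wage among those matching
    the given filters. A filter left as None matches everything."""
    matches = [
        r for r in records
        if (position is None or r['position'] == position)
        and (year is None or r['year'] == year)
    ]
    return max(matches, key=lambda r: r['total_wages']) if matches else None


# Made-up records for illustration only
records = [
    {'city': 'Alpha', 'position': 'City Manager', 'year': 2012, 'total_wages': 250000},
    {'city': 'Beta', 'position': 'City Manager', 'year': 2012, 'total_wages': 310000},
    {'city': 'Beta', 'position': 'Police Chief', 'year': 2013, 'total_wages': 330000},
]

# The narrow question: one title, one year
narrow = highest_paid(records, position='City Manager', year=2012)

# The broad question: any title, any year
broad = highest_paid(records)

print(narrow['city'], broad['position'])
```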
And of course, in real-world data projects, you aren't typically interested in just printing the answer to your Terminal. You generally want to send the results to a spreadsheet and eventually to a web application (or other kind of publication). That's just a few more lines of programming, too...So while this repo contains a bunch of toy scripts, see if you can think of ways to turn them into bigger data explorations.
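As a sketch of what those few more lines might look like, here is one way to write results as CSV (which any spreadsheet program can open) using Python's built-in `csv` module. The column names and the values are placeholders I made up for illustration, not real answers to any task.

```python
import csv
import io

# Placeholder rows; the numbers are not real answers
results = [
    {'task': 50, 'answer': 3},
    {'task': 72, 'answer': 41},
]


def write_results(rows, fileobj):
    """Write the rows as CSV, with a header line, to an open file-like object."""
    writer = csv.DictWriter(fileobj, fieldnames=['task', 'answer'])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained
buffer = io.StringIO()
write_results(results, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, pass `open('results.csv', 'w', newline='')` (the filename is arbitrary) instead of the in-memory buffer.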
The original requirement was that students finish all 100 scripts by the end of the quarter. That didn't quite work out so I reduced the requirement to 50. It was a bad idea to make this an "oh, just turn it in at the end of the year" assignment, as most people have the tendency to wait for finals week to do such work.
Most of the tasks are pretty straightforward, in terms of the Python programming. The majority of the time is spent figuring out exactly what the hell I'm referring to, so next time I do this, I'll probably provide the URL of the target page rather than having people attempt to divine the Google Path I used to get to the data.
- Class instructions for Computational Journalism: Search-Script-Scrape
- List of tasks as a Google Doc