
Web Scraping for the News

Overview

Web scraping is the act of automating the acquisition of data or files (images, videos, documents, etc.) from the web. The data may live on one or more pages of a website, or perhaps many different websites. At root, web scraping involves writing code to mimic the actions a human might take to visit a site in a web browser and manually extract information.

Value of scraping

Why is web scraping a valuable tool for the news?

Often, journalistically valuable information is locked up on a website that lacks easier methods for data acquisition. Not all government agencies, for example, offer downloadable CSVs or APIs. Nor do they always respond to public records requests in a timely or helpful manner.

Web scraping allows journalists to acquire information in the face of technical or bureaucratic hurdles.

Scraping is also useful in scenarios where a website offers the most up-to-date or most comprehensive information available. In such cases, web scraping can help journalists tell a more accurate and timely story.

Here are a few examples where web scraping helped produce news:

Scraping mechanics and challenges

On a technical level, web scraping typically involves extracting information from the HTML of a website and/or from files the site links to, which may themselves be further web pages. The process can be automated by understanding the anatomy of a site -- how pages are structured, URL patterns, etc. -- and then writing a script that retrieves (or visits) web pages and extracts the target information.
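As a rough illustration, here is a minimal sketch of that pattern: looping over a hypothetical paginated URL (the reports?page=N scheme is an assumption for illustration, not a real endpoint) and pulling out each page's headlines.

import requests
import bs4

# Hypothetical site with a predictable, paginated URL pattern
base_url = "http://www.example.com/reports"

for page in range(1, 4):
    # Retrieve one page at a time, following the site's pagination scheme
    response = requests.get(base_url, params={"page": page})
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # Extract the target information -- here, assumed to live in h2 tags
    for heading in soup.find_all("h2"):
        print(heading.text.strip())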

Scraping can be more or less difficult depending on the nature of the site. A simple site with no dynamic content and predictable URL patterns could be a quick job, compared to one that uses web forms, randomized URLs, cookies or sessions, dynamically generated content, password-based logins, etc. Sites often use a combination of these strategies, so it's important to spend time learning how a site works and choose an appropriate scraping strategy.
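For sites that rely on cookies or sessions, a requests.Session object can carry state across requests. Below is a minimal sketch, assuming a hypothetical login form at /login with "username" and "password" fields; real sites vary widely in how forms and sessions are structured.

import requests

session = requests.Session()  # persists cookies across requests

# Hypothetical login step -- the URL and form fields are assumptions
session.post("http://www.example.com/login",
             data={"username": "reporter", "password": "secret"})

# Subsequent requests reuse the session's cookies
response = session.get("http://www.example.com/records")
print(response.status_code)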

The option of last resort

Web scraping is a brittle activity. Sites move, URL and page structures evolve, interactivity gets added or removed.

Shiny new web scrapers inevitably break in the days, months and years after they were written.

Further, websites do not always reflect the most recent or most accurate information.

For these reasons, scraping should be treated as an option of last resort. When a government website does not offer easy methods for obtaining data, journalists typically reach out to the agency and possibly file public records requests to obtain structured data or digital files. They seek to exhaust easier options before turning to their scraping toolkit.

Ethical scraping

Ethical scraping involves a number of best practices. To mention a few (a code sketch follows the list):

  • Respecting a site's terms of use
  • Identifying yourself clearly
  • Taking care not to overwhelm a site with large volumes of requests
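Here is one way those practices might translate into code -- a sketch, assuming a hypothetical list of pages to fetch and a made-up contact address in the User-Agent string.

import time
import requests

# Identify yourself clearly so site operators can contact you
headers = {"User-Agent": "NewsroomScraper/1.0 (reporter@example.com)"}

# Hypothetical pages to scrape
urls = [
    "http://www.example.com/page1",
    "http://www.example.com/page2",
]

for url in urls:
    response = requests.get(url, headers=headers)
    # ... extract and store the data ...
    time.sleep(2)  # pause between requests to avoid overwhelming the site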

Here are a few articles that lay out ethical concerns in more detail:

Keep in mind that opinions vary about what is or is not "ethical" -- or legal -- when it comes to scraping. It's an issue that has been tested in the courts and will continue to be fought over.

Be mindful of your legal responsibilities and potential liability when scraping the web.

Python example

The requests and BeautifulSoup libraries are workhorses of basic web scraping in Python.

pip install requests beautifulsoup4

Here is a super simple scraping example that extracts the text of the h1 HTML tag on http://www.example.com:

import requests
import bs4

# Fetch the page's HTML
url = "http://www.example.com"
html = requests.get(url).text
# Parse the HTML and extract the first h1 tag
soup = bs4.BeautifulSoup(html, "html.parser")
h1 = soup.find('h1')
print(h1.text)

More resources