- Overview
- Value of scraping
- Scraping mechanics and challenges
- The option of last resort
- Ethical scraping
- Python example
- More resources
Web scraping is the act of automating the acquisition of data or files (images, videos, documents, etc.) from the web. The data may live on one or more pages of a website, or perhaps many different websites. At root, web scraping involves writing code to mimic the actions a human might take to visit a site in a web browser and manually extract information.
Why is web scraping a valuable tool for the news?
Often, journalistically valuable information is locked up on a website that lacks easier methods for data acquisition. Not all government agencies, for example, offer downloadable CSVs or APIs. Nor do they always respond to public records requests in a timely or helpful manner.
Web scraping allows journalists to acquire information in the face of technical or bureaucratic hurdles.
Scraping is also useful in scenarios where a website offers the most up-to-date or widest scope of information. In such cases, web scraping can help journalists tell a more accurate and timely story.
Here are a few examples where web scraping helped produce news:
- Accidental shootings involving kids often go unpunished, by The Associated Press, relied on data scraped from the Gun Violence Archive.
- Amazon Says It Puts Customers First. But Its Pricing Algorithm Doesn't, by ProPublica. Here's the behind-the-scenes look at how they scraped and analyzed Amazon data.
- Dollars for Docs, a searchable news app by ProPublica. Here's a write-up on the scraping aspect of the work.
On a technical level, web scraping typically involves extracting information from the HTML of a website and/or files linked to by the website, which may in turn be downstream web pages. This process can be automated by understanding the anatomy of a site -- how pages are structured, URL patterns, etc. -- and then creating a script that retrieves (or visits) web pages and extracts the target information.
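As a rough illustration of the two steps described above (working out a URL pattern, then pulling target information out of the HTML), here is a minimal sketch that uses only the Python standard library. The `?page=N` pattern, the `agency.example` domain and the sample HTML are all hypothetical stand-ins; a real site will need its own pattern and its own extraction logic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that have no href attribute
            self.links.append(urljoin(self.base_url, href))


def page_urls(base_url, last_page):
    """Build listing URLs for a hypothetical ?page=N pattern."""
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]


def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical listing page, standing in for a downloaded response body.
SAMPLE_HTML = """
<ul>
  <li><a href="/reports/2020.pdf">2020 report</a></li>
  <li><a href="/reports/2021.pdf">2021 report</a></li>
  <li><a name="no-href">not a link</a></li>
</ul>
"""

if __name__ == "__main__":
    print(page_urls("https://agency.example/reports", 3))
    print(extract_links(SAMPLE_HTML, "https://agency.example/reports"))
```

In a real scraper, each URL from `page_urls` would be downloaded and its HTML passed to `extract_links`; the sample string simply keeps the sketch runnable without network access.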
Scraping can be more or less difficult depending on the nature of the site. A simple site with no dynamic content and predictable URL patterns could be a quick job, compared to one that uses web forms, randomized URLs, cookies or sessions, dynamically generated content, password-based logins, etc. Sites often use a combination of these strategies, so it's important to spend time learning how a site works and choose an appropriate scraping strategy.
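To make two of those complications concrete, the sketch below shows one possible way to handle cookies and a web form using only the Python standard library. The search URL and the form field names are invented for illustration. The request is built but never sent, so the example runs offline. The requests library offers a `Session` object that serves a similar purpose.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that remembers cookies between requests, plus its jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def build_form_request(url, fields):
    """Build (but do not send) a POST request that submits form fields."""
    body = urlencode(fields).encode("utf-8")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    opener, jar = make_session()
    # Hypothetical search form; real field names come from inspecting the site.
    request = build_form_request(
        "https://agency.example/search",
        {"agency": "health", "year": "2020"},
    )
    # opener.open(request, timeout=30) would submit the form. Any cookies the
    # site sets would be stored in `jar` and sent back on later requests.
    print(request.get_method(), request.full_url, request.data)
```

Sites that build their content with JavaScript or require logins usually need more than this, which is one reason to study a site before choosing a strategy.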
Web scraping is a brittle activity. Sites move, URL and page structures evolve, and interactivity gets added or removed.
Shiny new web scrapers inevitably break in the days, months and years after they were written.
Further, websites often do not reflect the most recent or most accurate information.
For these reasons, scraping should be treated as an option of last resort. When a government website does not offer easy methods for obtaining data, journalists typically reach out to the agency and possibly file public records requests to obtain structured data or digital files. They seek to exhaust easier options before turning to their scraping toolkit.
Scraping ethically implies a number of best practices. To mention a few:
- Respecting a site's terms of use
- Identifying yourself clearly
- Taking care not to overwhelm a site with large volumes of requests
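Some of these practices can be partly expressed in code. The sketch below is one possible approach, not a complete ethics checklist. It identifies the scraper through a User-Agent header, pauses between requests, and consults a site's robots.txt rules. The robots.txt check is an addition not mentioned in the list above and is no substitute for reading a site's terms of use. The contact address, the delay and the `agency.example` URLs are placeholders.

```python
import time
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Placeholder identity; use a real name and contact address in practice.
USER_AGENT = "example-newsroom-scraper (contact: reporter@example.com)"


def identified_request(url, user_agent=USER_AGENT):
    """Build a request that tells the site who is asking."""
    return Request(url, headers={"User-Agent": user_agent})


def allowed_urls(robots_txt, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_fetch(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # spread requests out to limit server load
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    candidates = [
        "https://agency.example/reports",
        "https://agency.example/private/drafts",
    ]
    print(allowed_urls(robots, candidates))
```

Here `fetch` is any function that downloads a URL, for example one that opens `identified_request(url)`. Passing it in as a parameter keeps the throttling logic separate from the download code and easy to test without touching a live site.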
Here are a few articles that lay out ethical concerns in more detail:
Keep in mind that opinions vary about what is or is not "ethical" -- or legal -- when it comes to scraping. It's an issue that has been tested in the courts and will continue to be fought over.
Be mindful of your legal responsibilities and potential liability when scraping the web.
The requests and BeautifulSoup libraries are workhorses of basic web scraping in Python.
pip install requests bs4
Here is a simple scraping example that extracts the text of the h1 HTML tag on http://example.com:
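import bs4
import requests

url = "http://www.example.com"
# Fetch the page and fail loudly on HTTP errors rather than parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
# Parse the HTML and print the text of the first h1 tag
soup = bs4.BeautifulSoup(response.text, 'html.parser')
h1 = soup.find('h1')
print(h1.text)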
If you're curious why we include 'html.parser' as an option when we call BeautifulSoup, check out bs4's docs on installing a parser and differences between parsers.
- Web scraping exercises - A few sites to challenge your scraping skills.
- Web scraping resources - Tutorials, key concepts, code libraries for scraping, etc.