Web scraping can be more or less difficult depending on the nature of a website. A simple site with no dynamic content and predictable URL patterns could be a quick job, compared to one that uses web forms, randomized URLs, sessions/cookies, dynamically generated content, password-based logins, etc.
Below are some key strategies, concepts and tools that will help with web scraping.
Before you can scrape a site, you have to understand how it works. Here are some questions to ask when assessing a site:
- Is the target information located on a single web page?
- Is there a landing page with a list of items that link to child pages with additional details?
- Does the site use pagination to present a long list of data, files, downstream pages, etc.?
- Do target pages/files have predictable URLs?
- Do you have to fill out a search form before seeing target results?
- Does the site require a user to log in?
- Is the site using sessions to manage client connections?
- Is the target data in the source HTML, or is it dynamically generated by Javascript after the page has loaded in the browser?
To answer these questions, you must go beyond simply clicking around a website. You must use your web browser to view its source code and use Developer Tools to look under the hood.
Here are some additional resources to level up on core skills for dissecting websites:
- HTML - A markup language that tells your browser how to structure a page.
- HTTP - The verbs of the Web. This is how clients and servers communicate. If nothing else, you should understand how to use GET and POST requests (the latter are commonly used with web forms).
- CSS - A style language that tells a browser how to display page elements. CSS selectors are invaluable for extracting data from elements in a page.
- Inspecting the Web with Chrome's Dev Tools - A gentle intro to core web technologies through the lens of Chrome Dev Tools.
- Javascript and the Document Object Model (DOM) - You don't need to be a master of Javascript to scrape, but it's critical to understand how Javascript can manipulate the DOM to dynamically generate/alter content after a page has been loaded.
Below are some high-level strategies for a few common scraping scenarios. Keep in mind that you may run into sites that require you to combine these strategies -- e.g. basic scraping techniques with more advanced stateful web scraping.
Let's define a basic scrape as one where the target information is located in the source HTML itself, and any child pages are easily accessible (perhaps because they use predictable URLs that can be harvested from a landing or index page). The site doesn't use forms or require logins, does not dynamically generate content, and does not use sessions/cookies. In such cases, you can likely get away with simply using the requests library to grab HTML pages and the BeautifulSoup library to parse and extract data from each page's HTML.
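Here is a minimal sketch of that basic workflow. The URL and CSS selector are hypothetical placeholders; inspect your target page to find the right ones.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-report"  # hypothetical target page
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Extract the text from each table row (adjust the selector to match the page's HTML)
for row in soup.select("table tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    print(cells)
```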
If a site contains a list of records that in turn lead to child pages with more detail, you can often scrape the so-called "index" page to harvest links to the child pages. Or perhaps there's a piece of metadata in the records on the index page (e.g. a company ID) that will let you dynamically generate the links to child pages.
The FDIC Failed Banks site and Data.gov are good examples:
https://www.fdic.gov/bank/individual/failed/wafedbank.html
https://catalog.data.gov/dataset/national-student-loan-data-system
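Building on that index-page pattern, below is a hedged sketch of harvesting child-page links from a landing page. The URL and CSS selector are assumptions for illustration; inspect the real index page to adapt them.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

index_url = "https://example.com/bank-list.html"  # hypothetical index/landing page
soup = BeautifulSoup(requests.get(index_url).text, "html.parser")

# Collect absolute URLs for every link in the listing table
child_links = [
    urljoin(index_url, a["href"])
    for a in soup.select("table a[href]")  # adjust the selector to the site's markup
]

# Visit each child page and parse its details
for link in child_links:
    detail_soup = BeautifulSoup(requests.get(link).text, "html.parser")
    # ...extract the fields you need from detail_soup here...
```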
Some sites use so-called query strings, which are extra search parameters added to a URL as one or more key=value pairs (following a question mark). Here's an example:
https://www.whitehouse.gov/?s=coronavirus
The requests library can help you construct such URLs.
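For instance, requests will build the query string for you if you pass the parameters as a dictionary. This sketch reproduces the whitehouse.gov search URL above; the parameter name s comes from that example.

```python
import requests

# requests encodes the params dict into ?s=coronavirus for you
response = requests.get("https://www.whitehouse.gov/", params={"s": "coronavirus"})
print(response.url)  # https://www.whitehouse.gov/?s=coronavirus
```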
Often, you will have to fill out a search form to locate target data. Such forms can be handled in a few ways, depending on the nature of the site. If the form generates a predictable URL (perhaps using URL parameters), you can dig into the form options in the HTML and figure out how to dynamically construct the URL. You can test this by manually filling out and submitting the form and examining the URL of the resulting page.
Many web forms use POST requests, where the form information is sent as part of the body of the web request (as opposed to embedded in the URL). In such cases, you can use a tool such as requests.post or Selenium to fill out and submit the form.
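Here's a minimal sketch of submitting such a form with requests.post. The form URL and field names are assumptions; find the real ones by inspecting the form's action attribute and the name attributes of its inputs.

```python
import requests

form_url = "https://example.com/search"            # the form's "action" URL (assumption)
payload = {"last_name": "Smith", "state": "CA"}    # field names from the form's inputs (assumption)

# Send the form data in the body of a POST request
response = requests.post(form_url, data=payload)
response.raise_for_status()

# The response contains the results page HTML, ready for parsing
print(response.text[:500])
```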
Sites that require logins can often be handled by simply passing in your login credentials as part of a web form (see Web Forms above). The requests library provides several ways to authenticate, or you can use a stateful web browser such as Selenium to fill out a login form. Login-based sites will also often use sessions/cookies to manage your interactions. See Stateful Web Scraping below for more details on how to handle this.
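As one example, if the site uses HTTP Basic Authentication, requests can pass credentials directly. The URL and credentials below are placeholders; for form-based logins, combine the POST technique above with a session (see Stateful Web Scraping below).

```python
import requests

# Basic Auth: requests sends the credentials along with the request
response = requests.get(
    "https://example.com/protected-report",  # hypothetical login-protected page
    auth=("my_username", "my_password"),
)
response.raise_for_status()
```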
Many sites use Javascript to dynamically add or transform page content after you've loaded the page. This means that what you see in the source HTML using View Source will not match what you see in the browser (or the Elements tab of Chrome Developer Tools).
Scraping such a page requires using a library such as Selenium, which uses the "web driver" technology behind browsers such as Firefox to automate browser interactions.
Selenium gives you access to the Document Object Model (DOM) -- the content as seen by a real web browser. The DOM is the internal representation of a page that reflects both the static HTML and elements/styles dynamically added or manipulated by Javascript.
Selenium allows you to automate interactions with the DOM -- the same as a human using a browser -- to generate the content that you're targeting, such as scrolling down a page or stepping through a paginated list of results.
Further, the Python Selenium library provides convenient methods for accessing DOM elements, for instance using CSS selectors. This is similar in concept (and often in syntax) to how BeautifulSoup helps you parse and extract data from HTML.
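Below is a hedged sketch of that workflow using Selenium 4. It assumes Firefox and its driver are installed; the URL and CSS selector are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear

driver.get("https://example.com/dynamic-results")  # hypothetical JS-driven page

# Pull elements from the rendered DOM with a CSS selector,
# much like BeautifulSoup's select()
for item in driver.find_elements(By.CSS_SELECTOR, "div.result"):
    print(item.text)

driver.quit()
```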
Some websites will use sessions to uniquely identify a visitor and maintain a record of the visitor's interaction -- or state -- with the site. Web servers often manage sessions by sending a unique key in a cookie to your browser. This key is passed back to the server with each new request a visitor makes (e.g. when submitting a search form). Scraping a session-based site requires you to manage the session in your code. The requests library has support for managing sessions.
Alternatively, you can use a library such as Selenium to mimic a browser and get session management for "free" (although note that session management in Selenium can also get tricky depending on the nature of the site).
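Here is a minimal sketch of stateful scraping with requests.Session. The URLs and form field are assumptions; the key point is that the Session object stores whatever cookies the server sets (including the session key) and sends them back automatically on later requests.

```python
import requests

session = requests.Session()

# The first request establishes the session; any cookie the server sets is stored
session.get("https://example.com/search")

# Later requests automatically send the cookie back, preserving state on the server
response = session.post("https://example.com/search", data={"q": "failed banks"})

print(session.cookies.get_dict())  # inspect the cookies the session is carrying
```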