This documentation is for version 1. Version 2 has changed a lot and I'm afraid I haven't been able to update the documentation with the new features and changes, so you are best off looking at the code. It should hopefully still mostly work as described here. This is mainly a project for my own personal use. If you would like better documentation or encounter any errors, please file an issue and I'll do my best to help you out.
Extract listings data from paginated web pages.
It uses Cheerio to access the DOM.
If you are using Chrome you can get an accurate CSS selector for a given element quite easily; see this Stack Overflow answer.
For debugging, set the DEBUG=paginated-listings-scraper environment variable.
npm i paginated-listings-scraper
import { scrapeListing } from 'paginated-listings-scraper';
const options = {
  dataSelector: {
    text: '.text-block',
    title: 'h3',
  },
  filter: '.row.blank',
  maximumDepth: 3,
  nextPageSelector: 'a.next-page',
  parentSelector: '.row',
  terminate: (element, $) => element.find($('.bad-apple')).length,
  url: 'http://paginatedlistings.com',
};
const data = await scrapeListing(options);
// returns a promise
// data = [{ title: 'Old McDonald', text: 'Had a farm' }, ...]
The URL of the page you wish to scrape. Ideally this should be a paginated page consisting of elements in a list format. It uses request-promise-native to fetch the page. See request.
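As a rough sketch of the fetch step (my approximation, not the library's internals), assuming request-promise-native is called with default options:

// A rough sketch of how the page might be fetched, assuming
// request-promise-native with default options; the library's internals may differ.
import rp from 'request-promise-native';

const html = await rp('http://paginatedlistings.com'); // resolves with the page body as a string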
The CSS selector of the elements you wish to iterate over. Each element matching this selector will be mapped using dataSelector to extract the specified data. See cheerio selectors, cheerio find and cheerio map.
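To illustrate how parentSelector, filter and dataSelector fit together, here is a conceptual sketch using Cheerio directly (not the library's actual code; the markup and the use of .not() for filtering are my assumptions):

// A conceptual sketch (not the library's internals) of the iteration step,
// using the selectors from the example options above.
import * as cheerio from 'cheerio';

const html =
  '<div class="row"><h3>Old McDonald</h3><div class="text-block">Had a farm</div></div>' +
  '<div class="row blank"></div>';
const $ = cheerio.load(html);

const data = $('.row')        // parentSelector: every listing row
  .not('.row.blank')          // filter: drop unwanted rows (shown with .not(); exact semantics may differ)
  .map((i, el) => {
    const element = $(el);
    return {
      title: element.find('h3').text(),         // dataSelector.title
      text: element.find('.text-block').text(), // dataSelector.text
    };
  })
  .get();                     // plain array: [{ title: 'Old McDonald', text: 'Had a farm' }]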
Used to extract data from the elements returned by parentSelector. It can be either a function or an object of keys in the form { name: cssSelector }, where cssSelector can be a string or a function.
If an object is used it will iterate over each of its keys and extract the text contained within the element returned by the CSS selector. Each item is returned as an object in the form { name: data }.
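For example, a sketch of the object form mixing a string selector with a function value; the '.price' selector is hypothetical, and it is my assumption that a per-key function receives the same (element, $) arguments as the function form described below:

// A sketch of dataSelector's object form. The string value is extracted with
// .text(); the function value for `price` assumes per-key functions receive
// the same (element, $) arguments as the function form.
const options = {
  // ...other options as in the example above
  dataSelector: {
    title: 'h3',                                                     // string selector
    price: (element, $) => element.find($('.price')).text().trim(),  // hypothetical selector '.price'
  },
};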
If a function is used it will receive the element currently being acted on as a Cheerio element, as well as the Cheerio function created from the DOM, as arguments, allowing you to select whatever data you need.
// dataSelector as a function
dataSelector(element, $) {
  return element.find($('#sweet.sweet.data')).text();
}
See cheerio selectors and cheerio find.
The returned value will be added to an array which will eventually be returned by the scraper.
Gets the URL of the next page to be scraped. Can be either a CSS selector or a function. If a selector is used, it gets the href property of the matched element. If the href is not a valid URL, then it is assumed to be a path and is concatenated with the origin of the URL that was initially passed in as the url option.
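As a rough sketch of that resolution logic (my approximation, not the library's exact code; the helper name resolveNextUrl is made up for illustration):

// A rough approximation (not the library's exact code) of how the selector
// form resolves the next page URL, using the WHATWG URL class.
function resolveNextUrl(href, originalUrl) {
  try {
    return new URL(href).href;                              // href was already a valid URL
  } catch {
    return new URL(href, new URL(originalUrl).origin).href; // treat href as a path on the original origin
  }
}

// resolveNextUrl('/page/2', 'http://paginatedlistings.com/page/1')
// => 'http://paginatedlistings.com/page/2'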
If you need something more custom than this, use a function. The function receives an object containing the loaded Cheerio DOM, the original URL, and the current depth, allowing you to select whatever you want from the page.
nextPageSelector({ $, url, depth }) {
  const { origin } = new URL(url); // origin of the page currently being scraped
  return `${origin}${$('a.hard-to-get').attr('data-hidden-href')}`;
}
This function should return a URL, which will be used to request the next page to be scraped. See cheerio selectors and cheerio find.
The page number at which the scraper will stop. If set to 0, no pages will be scraped. Must be a number.
A function that is run to determine whether or not to stop scraping. It is run on each element returned by parentSelector. It receives the element currently being acted on as a Cheerio element, as well as the Cheerio function created from the DOM, as arguments.
terminate(element, $) {
  return !!element.attr('data-important-confidential-stuff');
}
Must return something truthy or falsy. See cheerio selectors.
Can be either a CSS selector or a function. It is used to filter out unwanted elements before the initial iteration takes place. See cheerio filter for an explanation and example usage.
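Only the selector form appears in the example above; if the function form is simply handed to Cheerio's .filter() (an assumption on my part), it would look roughly like this:

// A sketch of the function form, assuming it is passed straight to Cheerio's
// .filter(), which calls its callback with (index, element).
const options = {
  // ...other options as in the example above
  filter: (i, el) => !('data-sponsored' in el.attribs), // hypothetical attribute: drop sponsored rows
};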
States whether or not it should return the data it has collected so far when it encounters an error while scraping a page. This means no error will be propagated, so be careful.