Generating tests for your scrapers

Richard Smith-Unna edited this page Jun 23, 2014 · 2 revisions

journal-scrapers comes with a Ruby script to generate regression tests for scraper definitions.

To generate tests, you'll need:

  1. a scraper definition in ScraperJSON format
  2. a list of 5-10 URLs for which the scraper should work

The test generator script is in scripts/make_tests.rb.

Prerequisites

To run the test generator you'll need Ruby installed, along with RubyGems and the trollop gem.

To install Ruby and rubygems use RVM:

\curl -sSL https://get.rvm.io | bash -s stable --ruby=2.1.2

To install trollop use rubygems:

gem install trollop

You also need to have the quickscrape scraper installed. See the quickscrape documentation for instructions.

Generating tests

Place all your URLs (you need between 5 and 10) in a file, one per line, e.g.:

http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001874
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001882
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004441
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004433
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0098781
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0099348
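
A quick sanity check on the URL file can catch problems before the generator runs. The following is a minimal Ruby sketch and is not part of make_tests.rb: the helper names `valid_http_url?` and `parse_url_list` are hypothetical, and it assumes the only requirements are the ones stated above (between 5 and 10 URLs, one per line).

```ruby
require 'uri'

# Hypothetical helper: true if the string parses as an http(s) URL.
def valid_http_url?(url)
  URI.parse(url).is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
rescue URI::InvalidURIError
  false
end

# Hypothetical helper: split the file contents into URLs, one per line,
# and enforce the 5-10 URL requirement.
def parse_url_list(text)
  urls = text.lines.map(&:strip).reject(&:empty?)
  unless (5..10).cover?(urls.length)
    raise ArgumentError, "expected 5-10 URLs, found #{urls.length}"
  end
  invalid = urls.reject { |url| valid_http_url?(url) }
  raise ArgumentError, "not valid http(s) URLs: #{invalid.join(', ')}" unless invalid.empty?
  urls
end
```

For example, `parse_url_list(File.read('tests/somejournal_test_urls.txt'))` would return the URLs as an array, or raise an ArgumentError if the file does not meet the requirements.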

Before running the test-generator script, you can view its help:

$ scripts/make_tests.rb --help
make_tests.rb - ScraperJSON test generator script
by Richard Smith-Unna for ContentMine.org

This script generates a test file from a ScraperJSON scraper definition
and a list of URLs the scraper applies to.

The test files record what the scraper extracts from each URL so that tests can detect when the scrapers break.

Example use:
make_tests.rb --definition scraper.json --urls urls.txt

Options:
  --definition, -d <s>:   Path to ScraperJSON definition
        --urls, -u <s>:   File containing a list of 5-10 URLs to test
            --help, -h:   Show this message
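
To illustrate how the two options in the help text map onto values a script can use, here is a rough sketch. The real make_tests.rb uses the trollop gem; this version swaps in Ruby's standard-library OptionParser so it runs without extra gems, and the function name `parse_options` is hypothetical.

```ruby
require 'optparse'

# Sketch only: mirrors the --definition and --urls options shown in the help
# text, using OptionParser instead of trollop.
def parse_options(argv)
  opts = {}
  parser = OptionParser.new do |o|
    o.banner = 'Usage: make_tests.rb --definition scraper.json --urls urls.txt'
    o.on('-d', '--definition PATH', 'Path to ScraperJSON definition') do |value|
      opts[:definition] = value
    end
    o.on('-u', '--urls FILE', 'File containing a list of 5-10 URLs to test') do |value|
      opts[:urls] = value
    end
  end
  parser.parse(argv) # leaves argv itself unmodified

  # Assumption: both options are required for the generator to do anything useful.
  missing = %i[definition urls].reject { |key| opts.key?(key) }
  raise ArgumentError, "missing options: #{missing.join(', ')}" unless missing.empty?
  opts
end
```

Calling `parse_options(ARGV)` at the top of a script would return a hash with `:definition` and `:urls` keys, or raise if either option is absent.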

Now just run the test-generator script, passing it the path to your scraper definition and the file containing the URLs:

scripts/make_tests.rb --definition scrapers/somejournal.json --urls tests/somejournal_test_urls.txt

The test generator will now run quickscrape on each test URL using your scraper definition and store the results as a test file in the test directory. For the example above, the new file will be test/somejournal.json.
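
The idea behind the generated file can be sketched in a few lines of Ruby. This is an illustration under stated assumptions, not the actual make_tests.rb logic: it assumes the test file name is the definition's file name placed in test/ (as in the example above), and it uses a hypothetical layout with one entry per URL. All function names are invented for the sketch, and the block passed to `build_test_record` stands in for running quickscrape on a single URL.

```ruby
require 'json'

# Assumption: the test file keeps the definition's file name and lives in test/
# (scrapers/somejournal.json -> test/somejournal.json).
def test_file_path(definition_path, test_dir = 'test')
  File.join(test_dir, File.basename(definition_path))
end

# Hypothetical layout: one entry per URL, holding whatever the scraper extracted.
# The block stands in for running quickscrape on one URL.
def build_test_record(urls)
  urls.each_with_object({}) { |url, record| record[url] = yield(url) }
end

# Serialize the record as readable JSON, ready to write to the test file.
def record_to_json(record)
  JSON.pretty_generate(record)
end

# Regression check: URLs whose freshly scraped results differ from the record.
def changed_urls(recorded, fresh)
  recorded.keys.reject { |url| recorded[url] == fresh[url] }
end
```

A later test run could rebuild the record from fresh scrapes and pass both to `changed_urls`; any URL it returns points to a scraper that may have broken.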

Once the test file has been generated, you're ready to make a pull request with your contributed scraper.