description |
---|
Memorious is the web crawling framework that is part of the Aleph toolkit. It can be used to periodically retrieve structured and unstructured data from the web and load it into Aleph. |
Memorious (named after Funes the Memorious) is a light-weight distributed web scraping toolkit. It can:
- Maintain an overview of a fleet of crawlers
- Scrape and store both structured and unstructured data from the web
- Load the scraped data into Aleph in a variety of ways
- Schedule crawler execution at regular intervals
- Store execution information and error messages
- Distribute scraping tasks across multiple machines
- Make crawlers modular and simple tasks re-usable
- Get out of your way as much as possible
Memorious has a neat user interface to monitor the status of your crawler fleet at a glance. The interface also lets you start, stop and inspect crawlers with ease.
Memorious crawlers consist of a YAML configuration file and, optionally, some Python functions that define crawler operations. Memorious ships with built-in operations for frequent tasks such as making HTTP requests or writing data into a database, as well as utilities for loading the scraped data into Aleph for further processing.
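As a rough illustration of the optional Python side, the sketch below shows what a custom operation could look like. It assumes the convention that an operation is a function receiving a `context` object and a `data` dictionary and forwarding results with `context.emit`. The function name `tag_title`, the `title` and `tag` keys, and the `FakeContext` stand-in are all hypothetical and exist only so the example runs without Memorious installed.

```python
class FakeContext:
    """Hypothetical stand-in for the context object a crawler stage receives."""

    def __init__(self, params=None):
        self.params = params or {}
        self.emitted = []  # (rule, data) pairs, recorded instead of queued

    def emit(self, rule="pass", data=None):
        # A real framework would hand the data to the next stage here.
        self.emitted.append((rule, dict(data or {})))


def tag_title(context, data):
    """Hypothetical operation: normalise a title and attach a configured tag."""
    title = (data.get("title") or "").strip()
    if not title:
        # Nothing useful to pass on, so the item is dropped.
        return
    data["title"] = title
    data["tag"] = context.params.get("tag", "untagged")
    context.emit(rule="pass", data=data)


context = FakeContext(params={"tag": "news"})
tag_title(context, {"title": "  Annual report  ", "url": "https://occrp.org"})
tag_title(context, {"title": "   "})  # dropped: empty title
print(context.emitted)
```

Keeping each operation small and free of crawler-specific state is what lets the same function be reused across several crawler configurations. How a stage's `method` refers to such a function is described in the Memorious documentation.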
A really simple crawler configuration in Memorious might look like this:

    # Scraper for the OCCRP web site.
    name: occrp_web_site
    description: 'Organized Crime and Corruption Reporting Project'
    # Uncomment to run this scraper automatically:
    # schedule: weekly
    pipeline:
      init:
        # This first stage will get the ball rolling with a seed URL.
        method: seed
        params:
          urls:
            - https://occrp.org
        handle:
          pass: fetch
      fetch:
        # Download the seed page
        method: fetch
        params:
          # These rules specify which pages should be scraped or included:
          rules:
            and:
              - domain: occrp.org
        handle:
          pass: parse
      parse:
        # Parse the scraped pages to find if they contain additional links.
        method: parse
        params:
          # Additional rules to determine if a scraped page should be stored or not.
          # In this example, we're only keeping PDFs, word files, etc.
          store:
            or:
              - mime_group: archives
              - mime_group: documents
        handle:
          store: store
          # this makes it a recursive web crawler:
          fetch: fetch
      store:
        # Load the crawled documents into an Aleph collection
        method: aleph_emit
        params:
          collection: occrp_web_site

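To see how the `handle` blocks wire the stages together, the following sketch models the pipeline above as a plain dictionary and walks it. It is not how Memorious executes crawlers. The `PIPELINE` dictionary and the `reachable_stages` and `is_recursive` helpers are hypothetical names used only to make the routing visible, including the loop from `parse` back to `fetch` that makes the crawler recursive.

```python
# Stage name -> {rule name: next stage}, copied from the handle blocks above.
PIPELINE = {
    "init": {"pass": "fetch"},
    "fetch": {"pass": "parse"},
    "parse": {"store": "store", "fetch": "fetch"},
    "store": {},
}


def reachable_stages(pipeline, start):
    """Return the stages reachable from `start`, in breadth-first order."""
    seen = [start]
    queue = [start]
    while queue:
        stage = queue.pop(0)
        for target in pipeline[stage].values():
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


def is_recursive(pipeline, stage):
    """Return True if following handlers from `stage` can lead back to it."""
    for target in pipeline[stage].values():
        if stage in reachable_stages(pipeline, target):
            return True
    return False


print(reachable_stages(PIPELINE, "init"))
print(is_recursive(PIPELINE, "fetch"))
print(is_recursive(PIPELINE, "store"))
```

Removing the `fetch: fetch` handler from the `parse` stage would break the loop in this model, which corresponds to a crawler that only processes the seed page.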
To learn more about Memorious, you can:
- Visit the documentation available at https://memorious.readthedocs.io
- Explore and contribute to the source code
- Adapt the example project, which includes some test crawlers and a Docker configuration.