The crawler toolkit is part of the Offshore Journalism initiative. It is a proof of concept of the preservation meta tag. The project is divided into two parts. The first is the crawler, a django application designed to crawl feeds (RSS, Atom or Twitter accounts) and preserve, if needed, articles tagged with preservation meta tags. The second part (the test site) is dedicated to testing the preservation tags. It implements a simple version of the preservation meta tags and is based on Jekyll.
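As a rough illustration of the "crawl feeds" step, the sketch below extracts article links from an RSS 2.0 document using only the Python standard library. It is not the crawler's actual code: the `article_links` function and the `SAMPLE_RSS` feed are hypothetical names and data made up for this example, and a real crawler would also need to handle Atom feeds and Twitter accounts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>https://example.org/a</link></item>
    <item><title>Second</title><link>https://example.org/b</link></item>
  </channel>
</rss>"""


def article_links(rss_text):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:  # skip items without a link
            links.append(link.strip())
    return links


print(article_links(SAMPLE_RSS))
```

Each returned URL is a candidate article that could then be fetched and checked for preservation meta tags.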
To install all dependencies (see Dependencies) you must have the following programs installed on your computer:
- python (>= 3.5)
- ruby (>= 2.4)
- homebrew is recommended if you're on Mac OS X
- rvm is also recommended
git clone https://github.com/jplusplus/CrawlerToolkit.git
cd CrawlerToolkit
# On Mac OS X (with homebrew)
brew install redis
# On Ubuntu (16.04+)
sudo apt-get install redis-server
# On RedHat/Fedora distributions
sudo dnf install redis
./manage.sh install
This application relies on environment variables to run.
Name | Purpose |
---|---|
DJANGO_SETTINGS_MODULE | Settings file to use for the django app (e.g. settings_dev or settings_heroku) |
AWS_ACCESS_KEY_ID | Amazon Web Services access key ID, required on heroku to serve & upload static files. |
AWS_SECRET_ACCESS_KEY | As above, required for static files serving & uploading. |
AWS_STORAGE_BUCKET_NAME | The name of the S3 storage bucket |
TWITTER_ACCESS_TOKEN | Token to access Twitter's API |
TWITTER_ACCESS_SECRET | Access secret for Twitter's API |
TWITTER_CONSUMER_KEY | Twitter consumer key for Twitter's API |
TWITTER_CONSUMER_SECRET | Twitter consumer secret for Twitter's API |
To configure the local application we use a .env file. To create it, copy the .env.template file:
cp .env.template .env
Then edit .env to fill in the proper variables.
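Before starting the application it can help to check that the .env file is complete. The following is a minimal sketch, not part of the project: it assumes .env uses plain `NAME=value` lines (the template is not shown here), and the `REQUIRED` list is an assumed selection of the variables from the table above. `parse_env` and `missing_variables` are hypothetical helper names.

```python
# Assumed set of variables needed locally; adjust to your setup.
REQUIRED = [
    "DJANGO_SETTINGS_MODULE",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
]

# Hypothetical .env content, for illustration only.
SAMPLE = """
# local settings
DJANGO_SETTINGS_MODULE=settings_dev
TWITTER_ACCESS_TOKEN="abc"
TWITTER_ACCESS_SECRET=
"""


def parse_env(text):
    """Parse NAME=value lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(values, required=REQUIRED):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


print(missing_variables(parse_env(SAMPLE)))
```

To check a real file, pass `open(".env").read()` to `parse_env` instead of `SAMPLE`.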
All configuration variables can be edited from the heroku dashboard or with the following commands:
# To set a variable
./manage.sh set <VARIABLE NAME> <value>
# To get a variable's value
./manage.sh get <VARIABLE NAME>
This project has been configured to be managed with simple commands (see How to use). Before using them, certain services need to be configured.
You will need to install the surge npm package to deploy the test-site.
$ sudo npm install -g surge
$ surge login
To use the heroku manage.sh
commands you must have the heroku-cli package installed on your OS. Once this package
is installed you must log in:
$ heroku login
Then add the proper heroku
git remote with the following command
# replace <app> with your heroku's application name
$ heroku git:remote -a <app>
This project uses Twitter's API to retrieve tweets from Twitter feeds. You'll therefore need to create a Twitter app and generate a set of access tokens (in the Keys and Access Tokens tab). Then copy the keys, secrets and tokens into the appropriate environment variables.
# 1. Start the redis server
./manage.sh start_redis <optional port, default: 3000>
# 2. Run the crawler
./manage.sh start_crawler <optional port, default: 4000>
# 3. Run the test site
./manage.sh start_test_site <optional port, default: 5000>
If you need to perform operations on the application you have access to all django and jekyll commands through the following commands:
./manage.sh django --help
./manage.sh jekyll --help
Currently, the test site is built with Jekyll and the minimal-mistakes theme.
To make a new post work properly you'll need to create it in the tests-site/_posts folder (as usual with Jekyll) but with the single layout instead of the post layout that you'd expect.
Also, the purpose of this site is to test the preservation meta tags (see the specs).
To add one or more preservation meta tags, add a preservation field to the post header as follows:
---
layout: single
title: "The article title"
categories: this is a test
preservation:
  - type: notfound_only
    value: true
  - type: release_date
    value: 2018-01-01
  - type: priority
    value: true
---
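Once a YAML parser has loaded the post header, the preservation field is a list of type/value entries. The sketch below shows one way such a list could be checked and flattened into a dictionary. It is an assumption-laden illustration, not the test site's or the crawler's code: `summarize_preservation` is a hypothetical helper, and `KNOWN_TYPES` only covers the three types shown in the example above (the specs may define others).

```python
from datetime import date, datetime

# Only the types that appear in the example header (assumed, not exhaustive).
KNOWN_TYPES = {"notfound_only", "release_date", "priority"}

# The preservation list from the example, as a YAML parser might load it.
EXAMPLE = [
    {"type": "notfound_only", "value": True},
    {"type": "release_date", "value": "2018-01-01"},
    {"type": "priority", "value": True},
]


def summarize_preservation(entries):
    """Turn a list of {type, value} entries into a dict keyed by type."""
    summary = {}
    for entry in entries:
        kind = entry["type"]
        if kind not in KNOWN_TYPES:
            raise ValueError("unknown preservation type: %s" % kind)
        value = entry["value"]
        # Some YAML parsers already return a date; otherwise parse the text.
        if kind == "release_date" and not isinstance(value, date):
            value = datetime.strptime(str(value), "%Y-%m-%d").date()
        summary[kind] = value
    return summary


print(summarize_preservation(EXAMPLE))
```

A malformed release date or an unrecognized type raises `ValueError`, which makes typos in a post header easier to spot before deploying the test site.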
The crawler itself is configured to be deployed on heroku with the following command:
# This helper function calls the following git command:
# git subtree push --prefix crawl/ heroku master
./manage.sh deploy
By default the test-site is configured to be deployed on surge.sh.
$ ./manage.sh deploy_test_site