AmCAT-Scraping

A separate repository for scraping to AmCAT.

To install:

export INSTALLDIR=$HOME #edit this line if you want to install in a different directory
export AMCAT_HOST=http://amcat.vu.nl
export AMCAT_USER=xxx
export AMCAT_PASSWORD=xxx

git clone https://github.com/amcat/amcat-scraping.git $INSTALLDIR/amcatscraping
git clone https://github.com/vanatteveldt/amcatclient.git $INSTALLDIR/amcatclient

# Install dependencies
sudo pip install -r $INSTALLDIR/amcatscraping/requirements.txt

# To run scripts in amcatscraping, PYTHONPATH needs to be set to the directory it's in:
echo >> ~/.bashrc
echo 'export PYTHONPATH=$PYTHONPATH':$INSTALLDIR >> ~/.bashrc

# To run scrapers at their scheduled time, we use a script that should run every minute. Add it to cron:
(crontab -l ; echo "* * * * * python "$INSTALLDIR/amcatscraping/maintenance/timed_actions.py $AMCAT_HOST $AMCAT_USER $AMCAT_PASSWORD)| crontab -

Different types of scrapers

There are two types of scrapers: periodic and daterange. A periodic scraper has no date options; it takes every article that is currently available on the website. This is useful for media that don't come with an archive, such as RSS feeds. A daterange scraper takes a minimum and a maximum datetime as arguments and is supposed to scrape only the articles that fall within that range. This is useful for media that do keep an archive, because articles from any given date can then be scraped at any time.
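The difference between the two types can be sketched in a few lines of Python. This is only an illustration, not the scraper API of this repository: the `ARTICLES` list and the functions `scrape_periodic` and `scrape_daterange` are hypothetical names, and treating the range as inclusive on both ends is an assumption.

```python
from datetime import datetime

# Hypothetical article records; a real scraper would fetch these from a website.
ARTICLES = [
    {"headline": "A", "date": datetime(2014, 1, 1, 9, 0)},
    {"headline": "B", "date": datetime(2014, 1, 2, 12, 30)},
    {"headline": "C", "date": datetime(2014, 1, 5, 18, 0)},
]


def scrape_periodic(available):
    """Periodic: take everything that is currently available, no date options."""
    return list(available)


def scrape_daterange(available, min_datetime, max_datetime):
    """Daterange: keep only articles dated within the range (assumed inclusive)."""
    return [a for a in available if min_datetime <= a["date"] <= max_datetime]


if __name__ == "__main__":
    print(len(scrape_periodic(ARTICLES)))
    in_range = scrape_daterange(ARTICLES, datetime(2014, 1, 2), datetime(2014, 1, 3))
    print([a["headline"] for a in in_range])
```

A periodic scraper has to run often enough that no article disappears from the site between runs, while a daterange scraper can be re-run later for any range the archive still covers.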

Registering a scraper

$ python maintenance/db.py add
usage: db.py add [-h] [--cron CRON] [--username USERNAME]
                 [--password PASSWORD] [--label LABEL]
                 active articleset project {periodic,daterange} classpath

Example for a daterange scraper:

$ python maintenance/db.py add t 100 10 daterange amcatscraping.scrapers.newspapers.volkskrant.VolksKrantScraper --username user1 --password pass1

Daterange scrapers do not need a cron argument, because they are all updated together at 2 AM every night.

Example for a periodic scraper:

$ python maintenance/db.py add t 100 10 periodic amcatscraping.scrapers.tv.teletekst.TeletekstScraper --cron "30 * * * *"

Because a cron entry was added during installation, these scrapers will run automatically at their specified times. If you're not familiar with cron, see this article for an introduction.
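To make the `--cron` value concrete: `"30 * * * *"` means minute 30 of every hour, on every day. The sketch below shows one way a script that runs every minute could decide whether a scraper is due. It is not the code in `maintenance/timed_actions.py`; the functions `field_matches` and `cron_matches` are hypothetical, and the matcher is deliberately simplified to `*` and plain integers (no ranges, lists, or steps).

```python
from datetime import datetime


def field_matches(field, value):
    """Match one cron field; this sketch supports only '*' and plain integers."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Return True if the five-field cron expression matches the given datetime."""
    minute, hour, day, month, weekday = expression.split()
    # cron counts weekdays from Sunday = 0; isoweekday() gives Sunday = 7.
    cron_weekday = moment.isoweekday() % 7
    # Simplification: real cron ORs day-of-month and day-of-week when both
    # are restricted; this sketch ANDs all fields.
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        and field_matches(weekday, cron_weekday)
    )


if __name__ == "__main__":
    print(cron_matches("30 * * * *", datetime(2014, 1, 1, 14, 30)))
    print(cron_matches("30 * * * *", datetime(2014, 1, 1, 14, 31)))
```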

Running a scraper manually

Different scrapers need different arguments, as you'll see in the 'Creating (coding) a scraper' section. The easiest way to find out which arguments a scraper needs is to run it without any, so that it prints its usage:

$ python scrapers/tv/teletekst.py
usage: teletekst.py [-h] [--print_errors]
                    project articleset api_host api_user api_password
teletekst.py: error: too few arguments

The scraper needs a place to put the articles (a project and an articleset) and credentials for the AmCAT API:

$ python scrapers/tv/teletekst.py 1 1 http://amcat.vu.nl secret secret
	Scraping articles...
	..........x.x.x.x..
	Found 15 articles. postprocessing...
		Filling in defaults...
		Checking properties...
	Saving.

Articleset 1 on amcat.vu.nl now contains 15 articles from Teletekst.
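The usage message shown above has the shape that Python's standard `argparse` module produces. As a rough sketch, a parser with the same arguments could be declared as follows. The function name `build_parser` is hypothetical, parsing `project` and `articleset` as integers is an assumption, and the repository's actual argument handling may differ.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the teletekst.py usage message."""
    parser = argparse.ArgumentParser(prog="teletekst.py")
    # Destination of the scraped articles (integer IDs are an assumption).
    parser.add_argument("project", type=int)
    parser.add_argument("articleset", type=int)
    # Where and how to authenticate against AmCAT.
    parser.add_argument("api_host")
    parser.add_argument("api_user")
    parser.add_argument("api_password")
    # Optional flag; off unless given on the command line.
    parser.add_argument("--print_errors", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["1", "1", "http://amcat.vu.nl", "secret", "secret"]
    )
    print(args.project, args.articleset, args.api_host, args.print_errors)
```

Declaring the five values as positional arguments is what makes the script fail with a usage message when it is started without them.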

Creating (coding) a scraper

TBA
