This project provides a command line interface to extract postings from https://derstandard.at (only derstandard.at; other online newspapers are not supported), to provide basic statistics, and to apply a sentiment analysis. The coding style for this project is quick'n'dirty. I highly recommend creating backups of the resulting SQLite database files once you are done crawling.
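A backup can be as simple as copying the database file. The following sketch (the file name `backup_db.py` is just a suggestion; `postings.db` is the file this tool produces) writes a timestamped copy next to the original:

```python
# backup_db.py - copy the crawled database to a timestamped backup file
import shutil
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_name = f"postings_{timestamp}.db"
shutil.copy2("postings.db", backup_name)  # copy2 preserves file metadata
print(f"Backup written to {backup_name}")
```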
- Working python 3.7+ installation (https://www.python.org/downloads/)
- Google Chrome installed (https://www.google.com/chrome/)
- Download the chromedriver file matching your installed Chrome version (https://sites.google.com/a/chromium.org/chromedriver/downloads)
If you are familiar with git and the command line, run:
git clone https://github.com/raphiniert/PostingStentimentAnalysis.git
cd PostingStentimentAnalysis
mkdir bin
mkdir log
cp path/to/chromedriver bin/chromedriver
In case you aren't familiar with the command line, download this project as a .zip file, extract it, and create two folders, one named `bin` and one named `log`. Copy the previously downloaded `chromedriver` file into the `bin` folder.
Unfortunately, you have to get familiar with the command line anyway to use this tool.
Open a terminal, navigate to the project folder, and execute the following steps:
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Verify that everything worked by running:
python crawl.py -h
If you see the following output, you're good to go.
usage: crawl.py [-h] [--continue-article CONTINUE_ARTICLE] [--retries RETRIES]
[--verbose] [--no-headless]
optional arguments:
-h, --help show this help message and exit
--continue-article CONTINUE_ARTICLE
continue crawling with article
--retries RETRIES max retries per article
--verbose increase output verbosity
--no-headless don't run chrome headless
Specify the URLs to look for postings in the `url_list` within the `crawl.py` file:
# urls to crawl
url_list = [
"https://www.derstandard.at/story/2000112608982/fpoe-praesentiert-historikerbericht",
"https://www.derstandard.at/story/2000114104569/fpoe-historikerberichtexperten-bewerten-blaues-papier",
"Place your url here", # add a comment if you like
"And another one here",
]
Make sure the virtual environment is activated before you run the following code. You should see (venv) somewhere in your terminal's current line. Activate it by entering the project folder and running:
. venv/bin/activate
When you are done, you can simply close the terminal or deactivate the virtual environment by running:
deactivate
After specifying the URLs, run the Python script:
python crawl.py
This command creates a logfile, which is stored inside the `log` folder.
The resulting SQLite (see https://www.sqlite.org) database is stored in the `postings.db` file.
To access the raw data, I suggest using your database tool of choice (most common tools support SQLite databases).
If no such tool comes to mind, you could try DBeaver (https://dbeaver.io).
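If you prefer to stay in Python, the database can also be inspected with the built-in sqlite3 module. A minimal sketch that lists every table in `postings.db` together with its row count (the table names are read from the database itself, so nothing about the schema is assumed):

```python
# List all tables in postings.db and their row counts
import sqlite3

conn = sqlite3.connect("postings.db")
cursor = conn.cursor()

# sqlite_master holds the schema; keep only ordinary tables
cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
for (table_name,) in cursor.fetchall():
    count = cursor.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    print(f"{table_name}: {count} rows")

conn.close()
```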
Show help
python crawl.py --help
Specify the number of retries per article before the article gets skipped. Sometimes errors occur, and there is a variety of reasons for that. The tool automatically retries 10 times to continue crawling, but you can modify that value by adding the following argument:
python crawl.py --retries 50
In case something went wrong, you can continue crawling a specific article from the last successfully crawled posting. To do so, run:
python crawl.py --continue-article 1
Increase output verbosity to show detailed log messages (you might want to try this if an article fails repeatedly and you want to see where exactly the error occurs).
python crawl.py --verbose
By default, Chrome runs in headless mode, which means you won't see the web browser running. If you wish to see the browser, you can do so by adding the following argument (this slows down the process):
python crawl.py --no-headless
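For reference, this is roughly how a headless Chrome session is started with Selenium; the actual setup lives inside crawl.py and may differ in detail (the snippet assumes the Selenium 3 style of passing the chromedriver path as the first argument):

```python
# Rough sketch of a headless Chrome session with Selenium (Selenium 3 style API)
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # drop this line to watch the browser, as --no-headless does

driver = webdriver.Chrome("bin/chromedriver", options=options)
driver.get("https://www.derstandard.at")
print(driver.title)
driver.quit()
```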
Continue crawling the second article with increased verbosity and 50 retries:
python crawl.py --retries 50 --continue-article 2 --verbose
You can run any SQL query you can think of on the data. The database structure is simple. There are four tables: (1) Articles, (2) Postings, (3) Users, and (4) PostingRatings. A posting belongs to an article and is assigned to a user (except for postings by deleted users). PostingRatings handles the many-to-many relationship of users rating postings.
Query users and the number of postings they wrote for article 1, ordered by posting count in descending order:
select users.user_name, count(p.posting_id) cp
from users
inner join postings p on users.user_id = p.user_id
where p.article_id = 1
group by users.user_name
order by cp desc;
TODO: print the most common and useful stats, such as total postings, ratings and users per article, users posting in both articles, etc.
python statistics.py
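Until statistics.py covers these, some of the numbers can be computed directly from the database. A minimal sketch (the table and column names `postings`, `article_id`, `posting_id` and `user_id` are taken from the example query above):

```python
# Postings and distinct posters per article, read straight from postings.db
import sqlite3

conn = sqlite3.connect("postings.db")
rows = conn.execute(
    """
    SELECT article_id,
           COUNT(posting_id)       AS posting_count,
           COUNT(DISTINCT user_id) AS user_count
    FROM postings
    GROUP BY article_id
    ORDER BY article_id
    """
).fetchall()

for article_id, posting_count, user_count in rows:
    print(f"article {article_id}: {posting_count} postings by {user_count} users")

conn.close()
```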
Download the pretrained spaCy statistical model for the German language:
python -m spacy download de_core_news_lg
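Once the model is downloaded, you can quickly check that it loads and produces lemmas and part-of-speech tags (a minimal sketch, independent of sentiment.py; the example sentence is arbitrary):

```python
# Quick check that the German spaCy model loads and tags tokens as expected
import spacy

nlp = spacy.load("de_core_news_lg")
doc = nlp("Die FPÖ präsentiert den Historikerbericht.")

for token in doc:
    print(token.text, token.lemma_, token.pos_)
```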
Download SentiWS (Sentiment Wortschatz), published by the University of Leipzig (https://wortschatz.uni-leipzig.de/en/download). Unzip the files and copy the *.txt files into a folder named `sentiws`:
mkdir sentiws
cp path/to/unzipped/folder/*.txt sentiws/
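How sentiment.py consumes these files is defined in the script itself, but to give a rough idea: SentiWS assigns each word a polarity weight between -1 and 1, and a simple lookup table can be built from the *.txt files. A sketch under the assumption that each line follows the SentiWS v2.0 format `word|POS<TAB>weight<TAB>inflections`:

```python
# Build a {word: polarity weight} lookup from the SentiWS *.txt files
# Assumes the SentiWS v2.0 line format: "Wort|POS<TAB>weight<TAB>inflection,inflection,..."
from pathlib import Path

weights = {}
for path in Path("sentiws").glob("*.txt"):
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        base_form = parts[0].split("|")[0]   # strip the part-of-speech tag
        weight = float(parts[1])
        weights[base_form] = weight
        if len(parts) > 2:                   # inflected forms share the base form's weight
            for inflection in parts[2].split(","):
                weights[inflection] = weight

print(len(weights), "words with a sentiment weight")
print("gut:", weights.get("gut"))            # look up a single word
```

With the model and the SentiWS files in place, run the sentiment analysis: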
python sentiment.py
If the spaCy model cannot be found when running the scripts, (re)download it:
python -m spacy download de_core_news_lg
Make sure your virtual environment is activated. You can activate it by running:
. venv/bin/activate
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH.
Make sure the `chromedriver` file is located in the `bin` folder within the project folder.