Scripts to scrape large German news websites, and the resulting data set of one million German news articles from 01.01.2020 to 31.12.2022. To get the code, simply run

```sh
git clone https://github.com/kssrr/german-media-scrape
```

If you are unfamiliar with git, you can copy-paste and run the setup.R script, which will also install the dependencies for you.
Direct download (compressed .tar.gz)
We assembled a demo data set that includes all articles published between January 1st, 2020 and December 31st, 2022 by the outlets taz, Zeit, Süddeutsche, Spiegel & Welt. It contains a little over one million German-language news articles of varying length (uncompressed ~3.5 GB). Article titles are missing for some sites due to an earlier problem with the scrapes; we plan to add them in later versions. The data is hosted here.
The data set offers broad coverage of several impactful events that lend themselves to analysis, such as the 2021 German federal election, the COVID-19 pandemic, the 2022 Soccer World Cup, and of course the Russian invasion of Ukraine in early 2022.
In principle, the scripts can also scrape data going back as far as the newspapers' archives allow; simply adjust the dates (years) specified near the top of each script.
An elaborate example (topic modelling) is shown here, but you can also do a lot of interesting, more basic exploratory analysis with this kind of data, for example examining reporting on political parties:
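As a minimal sketch of this kind of analysis, counting party mentions could look as follows. The repository's scrapers themselves are written in R; this is a pure-Python illustration on a toy stand-in corpus, and the field names (`date`, `body`) are assumptions about the data layout, not guaranteed column names:

```python
from collections import Counter

# Toy stand-in for the scraped articles; the real data set is a large CSV.
# The keys "date" and "body" are illustrative assumptions.
articles = [
    {"date": "2021-09-01", "body": "Die SPD liegt vor der CDU, die Grünen holen auf."},
    {"date": "2021-09-15", "body": "CDU und CSU streiten, die FDP gewinnt Zustimmung."},
    {"date": "2022-02-25", "body": "Die SPD reagiert auf den russischen Angriff."},
]

parties = ["SPD", "CDU", "CSU", "FDP", "Grünen", "AfD", "Linke"]

# Naive substring counting; real analyses would tokenize and
# handle inflected party names ("Grüne"/"Grünen" etc.).
mentions = Counter()
for article in articles:
    for party in parties:
        mentions[party] += article["body"].count(party)

print(mentions.most_common(3))
```

Scaled up to the full corpus (e.g. grouped by outlet or by month), this already yields the kind of party-coverage comparison shown in the plot.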
You could also look at the salience of particular topics:
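Topic salience over time reduces to counting, per time bucket, how many articles mention a keyword. A stdlib-only sketch (again on toy data, with the same assumed `date`/`body` fields):

```python
from collections import defaultdict

# Toy stand-in articles; field names are illustrative assumptions.
articles = [
    {"date": "2020-03-10", "body": "Erster Lockdown wegen Corona."},
    {"date": "2020-03-22", "body": "Corona-Zahlen steigen weiter."},
    {"date": "2022-02-24", "body": "Russland greift die Ukraine an."},
]

keyword = "corona"

# Count articles mentioning the keyword per month ("YYYY-MM").
per_month = defaultdict(int)
for article in articles:
    month = article["date"][:7]
    if keyword in article["body"].lower():
        per_month[month] += 1

print(dict(per_month))  # {'2020-03': 2}
```

Plotting these monthly counts (or their share of all articles, to control for publication volume) gives a salience curve like the one shown.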
Or investigate pairwise correlation clusters of keywords (click to enlarge; see here for the methodology):
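The linked methodology is not reproduced here; one common choice for pairwise keyword correlation is the phi coefficient over document co-occurrence (as in, e.g., R's widyr package). A hedged stdlib sketch with toy documents:

```python
from itertools import combinations
from math import sqrt

# Toy "documents"; in practice these would be tokenized article texts.
docs = [
    "impfung corona pandemie",
    "corona impfung lockdown",
    "ukraine krieg sanktionen",
    "krieg ukraine energie",
]
keywords = ["corona", "impfung", "ukraine", "krieg"]

def phi(a, b, docs):
    """Phi coefficient of keywords a and b from a 2x2 contingency
    table of per-document occurrence (substring match for brevity)."""
    n11 = sum(a in d and b in d for d in docs)
    n10 = sum(a in d and b not in d for d in docs)
    n01 = sum(a not in d and b in d for d in docs)
    n00 = sum(a not in d and b not in d for d in docs)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

for a, b in combinations(keywords, 2):
    print(a, b, round(phi(a, b, docs), 2))
```

Keywords that always co-occur score +1, keywords that never co-occur score -1; clustering the resulting correlation matrix yields groupings like those in the plot.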
Special thanks to the University of Münster for providing us with additional computational resources for this project.