This repository contains German text instances from several sources (Facebook comments, tweets, etc.; see the references below), which were manually re-annotated as hate speech (hs), offensive/problematic language (p), or non-hate (n). All files are tab-separated CSV files. The corpus is currently under construction and is subject to change.
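As a minimal sketch of how the files might be read, the following Python snippet loads one tab-separated file and extracts text/label pairs. The file name ("hasoc.csv") and the column layout (text first, label last) are assumptions for illustration and may differ from the actual files in this repository.

```python
import csv

# Labels used in this repository (see above).
LABELS = {"hs": "hate speech", "p": "offensive/problematic", "n": "non-hate"}

def read_annotations(path):
    """Read one tab-separated annotation file into a list of (text, label) pairs.

    Assumes the text is in the first column and the label in the last column;
    adjust the indices if the actual files are laid out differently.
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue
            text, label = row[0], row[-1].strip()
            if label in LABELS:
                rows.append((text, label))
    return rows

if __name__ == "__main__":
    data = read_annotations("hasoc.csv")  # hypothetical file name
    print(f"{len(data)} instances loaded")
```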
The HASOC dataset includes 818 German tweets as well as German Facebook comments classified as either hate or non-hate. The comments were gathered with a keyword-based approach in [1]. Since the original corpus predominantly contains non-hate, we sampled from it to obtain a (preliminary) equal ratio of hate and non-hate in this dataset.
The Hatr dataset contains 432 text instances that were extracted from hatr.org, a website that collects German hate posts from various German blogs.
This dataset contains 469 text instances from Ross et al. [2], a corpus of offensive tweets about the refugee crisis. The tweets were gathered with a keyword-based approach; in this case, all keywords were hashtags.
This dataset contains 2,871 text instances from the tweet corpus described in [3], created as part of the GermEval 2018 Shared Task on the Identification of Offensive Language. For this project, we only included the tweets that were classified as 'OFFENSE'.
The POLLY corpus originally contains about 125,000 politically charged German tweets from around the time of the 2017 German federal election [4]. Here, we only re-annotated a sample of around 4,500 tweets that had previously been annotated as 'with hate', in order to maintain a balanced dataset overall.
This dataset contains posts from popular, openly available Facebook pages that are known to attract xenophobic content [5]. Currently, the dataset published here contains 3,500 comments on posts from the two pages "Pegida" and "I'm a patriot, not a nazi".
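Since several of the subcorpora above were sampled to keep the label distribution balanced, a quick way to check the overall balance is to tabulate the labels across all files. The sketch below assumes one tab-separated file per subcorpus in the repository root with the label (hs / p / n) in the last column; adjust the glob pattern and column index to the actual layout.

```python
import csv
from collections import Counter
from pathlib import Path

def label_distribution(data_dir="."):
    """Count hs / p / n labels across all tab-separated files in data_dir."""
    counts = Counter()
    for path in sorted(Path(data_dir).glob("*.csv")):
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                label = row[-1].strip()
                if label in {"hs", "p", "n"}:
                    counts[label] += 1
    return counts

if __name__ == "__main__":
    print(label_distribution())
```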
The German hate speech project at the University of Würzburg, including the creation of the hate speech dataset presented here, was made possible with funding from the Mapara Stiftung.
Many thanks also go to Lukas Weimer, who previously supervised the project.
[1] Mandl, Thomas, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. "Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages." FIRE '19: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019. 14–17.
[2] Ross, Björn, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. "Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis." Proceedings of NLP4CMC III: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication, 2016. 6–9.
[3] Wiegand, Michael, Melanie Siegel, and Josef Ruppenhofer. "Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language." Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), 2018. 1–10.
[4] De Smedt, Tom, and Sylvia Jaki. "The Polly corpus: Online political debate in Germany." Proceedings of the 6th Conference on Computer-Mediated Communication (CMC) and Social Media Corpora (CMC-corpora 2018), 2018.
[5] Bretschneider, Uwe, and Ralf Peters. "Detecting Offensive Statements towards Foreigners in Social Media." Proceedings of the 50th Hawaii International Conference on System Sciences, 2017.