Hateful speech Hinglish social media Paper

This repository contains the dataset used in the paper "Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media". The authors collected and labelled it following the approach described in the paper. The data was gathered from several social media platforms, including Instagram, YouTube, Twitter, and Reddit, and was cleaned manually.

The dataset labels are as follows:

| Class    | English | Hinglish | Hindi |
|----------|---------|----------|-------|
| Non Hate | 0       | 2        | 4     |
| Hate     | 1       | 3        | 5     |
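The label scheme above packs two pieces of information into one integer: the language and whether the text is hateful. As a minimal sketch (the function and type names here are illustrative and not part of the dataset or the paper), the two can be separated with integer division and a parity check:

```python
from typing import NamedTuple

# Language order follows the label table: 0/1 English, 2/3 Hinglish, 4/5 Hindi.
LANGUAGES = ("English", "Hinglish", "Hindi")


class DecodedLabel(NamedTuple):
    """Hypothetical helper type: a label split into its two components."""
    language: str
    is_hate: bool


def decode_label(label: int) -> DecodedLabel:
    """Split a dataset label (0-5) into language and hate/non-hate flag."""
    if not 0 <= label <= 5:
        raise ValueError(f"label must be between 0 and 5, got {label}")
    # Each language owns a pair of consecutive labels; the odd one is Hate.
    return DecodedLabel(LANGUAGES[label // 2], label % 2 == 1)
```

For example, `decode_label(3)` gives Hinglish with the hate flag set, matching the table.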

Excel sometimes fails to display the Unicode characters in the dataset. If that happens, we suggest using other tools that support bidirectional Unicode characters, since the data contains instances of Hindi. Recommended tools include Jupyter Notebook, Notepad, etc.
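If the data needs to be read programmatically, one option is to open it with an explicit encoding. The sketch below assumes the dataset is a UTF-8 encoded CSV file, which this README does not state; the function names and any file paths are hypothetical. It reads the rows with Python's standard `csv` module and can also write a copy with a UTF-8 byte order mark, which Excel generally uses to recognise the encoding.

```python
import csv
from pathlib import Path
from typing import List


def read_rows(path: Path, encoding: str = "utf-8-sig") -> List[List[str]]:
    """Read all CSV rows as lists of strings.

    "utf-8-sig" decodes UTF-8 and drops a leading byte order mark if present,
    so it works for files saved with or without one.
    """
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


def write_excel_friendly_copy(src: Path, dst: Path) -> None:
    """Write a copy of the CSV that starts with a UTF-8 byte order mark."""
    rows = read_rows(src)
    with open(dst, "w", newline="", encoding="utf-8-sig") as handle:
        csv.writer(handle).writerows(rows)
```

If the file turns out to use a different encoding, pass it through the `encoding` argument of `read_rows`.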

Please acknowledge the authors if you use any part of the dataset in your research or experiments, and keep your usage fair and trustworthy. The shared dataset has been processed to anonymize all usernames from the various platforms, and we hope that this anonymity is maintained in any use of it. The dataset is meant strictly for research purposes.

The paper can be accessed online at Springer

Cite :-

@InProceedings{10.1007/978-981-16-3067-5_8,
author="Srivastava, Ananya
and Hasan, Mohammed
and Yagnik, Bhargav
and Walambe, Rahee
and Kotecha, Ketan",
editor="Choudhary, Ankur
and Agrawal, Arun Prakash
and Logeswaran, Rajasvaran
and Unhelkar, Bhuvan",
title="Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media",
booktitle="Applications of Artificial Intelligence and Machine Learning",
year="2021",
publisher="Springer Singapore",
address="Singapore",
pages="83--95",
abstract="Social networking platforms provide a conduit to disseminate our ideas, views, and thoughts and proliferate information. This has led to the amalgamation of English with natively spoken languages. Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world. Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages. Thus, the worldwide hate speech detection rate of around 44{\%} drops even more considering the content in Indian colloquial languages and slangs. In this paper, we propose a methodology for efficient detection of unstructured code-mix Hinglish language. Fine-tuning-based approaches for Hindi-English code-mixed language are employed by utilizing contextual-based embeddings such as embeddings for language models (ELMo), FLAIR, and transformer-based bidirectional encoder representations from transformers (BERT). Our proposed approach is compared against the pre-existing methods and results are compared for various datasets. Our model outperforms the other methods and frameworks.",
isbn="978-981-16-3067-5"
}