This repository contains the processes used to create TuPy, a Portuguese hate speech dataset. TuPy is an annotated corpus designed to support the development of hate speech detection models using machine learning (ML) and natural language processing (NLP) techniques. It combines datasets annotated by Fortuna et al. (2019), Leite et al. (2021), and Vargas et al. (2020) with 10,000 previously unpublished annotated documents collected in 2023.
This repository is organized as follows:
root.
├── datasets
├── figures
├── models
├── notebooks
├── src
├── LICENSE
└── README.md
To set up the environment, run the following command:
bash INIT.sh
Alternatively, install Miniconda 3 and then run the following commands in order:
conda create -n tupi-env python=3.10
conda activate tupi-env
pip install poetry
poetry install
poetry run python -m nltk.downloader stopwords
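After the steps above, a quick sanity check can confirm that the environment is usable. The snippet below is only a minimal sketch and is not part of this repository: the function name `check_environment` is hypothetical, and the minimum Python version is assumed from the `python=3.10` argument in the commands above. It checks the interpreter version and whether the NLTK stopwords corpus can be found.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 10)):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper for illustration only, not part of the repository.
    """
    problems = []

    # The conda command above pins Python 3.10, so that is assumed as the minimum.
    if sys.version_info < min_version:
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted} or newer is expected")

    # nltk is expected to be installed by `poetry install`.
    if importlib.util.find_spec("nltk") is None:
        problems.append("nltk is not installed; run `poetry install`")
        return problems

    import nltk

    # nltk.data.find raises LookupError when the resource is missing.
    try:
        nltk.data.find("corpora/stopwords")
    except LookupError:
        problems.append(
            "NLTK stopwords corpus is missing; "
            "run `poetry run python -m nltk.downloader stopwords`"
        )
    return problems


if __name__ == "__main__":
    issues = check_environment()
    if issues:
        for issue in issues:
            print(f"[problem] {issue}")
    else:
        print("Environment looks ready.")
```

Saved under any file name, it can be run inside the activated environment with `poetry run python`. Each reported problem names the installation step that most likely needs to be repeated.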
The TuPy project grew out of Felipe Oliveira's thesis and the work of several collaborators. It is funded by the Federal University of Rio de Janeiro (UFRJ) and the Alberto Luiz Coimbra Institute for Postgraduate Studies and Research in Engineering (COPPE).