Aggregated the methodologies for extracting data to train speech classifiers.

Silly-Machine/TuPy-Data-Engineering

TuPy Data Engineering

This repository contains the processes used to create the Portuguese hate speech dataset (TuPy), an annotated corpus designed to support the development of hate speech detection models using machine learning (ML) and natural language processing (NLP) techniques. TuPy combines the datasets annotated by Fortuna et al. (2019), Leite et al. (2021), and Vargas et al. (2020) with 10,000 previously unpublished annotated documents collected in 2023.
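As a rough illustration of what combining several annotated corpora can involve, the sketch below merges labeled documents from multiple sources and drops repeated texts. It is not the repository's actual pipeline: the `Document` fields, the integer label, the `combine_corpora` helper, and the normalization rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    text: str
    label: int   # hypothetical label encoding, not the TuPy schema
    source: str  # name of the corpus the document came from


def combine_corpora(corpora: dict[str, list[tuple[str, int]]]) -> list[Document]:
    """Merge (text, label) corpora, keeping the first copy of each text."""
    seen: set[str] = set()
    combined: list[Document] = []
    for source, rows in corpora.items():
        for text, label in rows:
            # Assumed duplicate rule: same text after lowercasing and
            # collapsing whitespace.
            key = " ".join(text.lower().split())
            if key in seen:
                continue
            seen.add(key)
            combined.append(Document(text=text, label=label, source=source))
    return combined


if __name__ == "__main__":
    # Placeholder rows, not real data from any of the cited corpora.
    merged = combine_corpora({
        "corpus_a": [("first example", 0), ("second example", 1)],
        "corpus_b": [("First  example", 1), ("third example", 0)],
    })
    for doc in merged:
        print(doc.source, doc.label, doc.text)
```

Keeping a `source` field on every record makes it possible to trace each document back to its original corpus after the merge.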

This repository is organized as follows:

root.
    ├── datasets
    ├── figures
    ├── models
    ├── notebooks
    ├── src
    ├── LICENSE
    └── README.md

Quick start

Run the following command:

bash INIT.sh

Alternatively, install Miniconda 3 and then run the following commands in order:

conda create -n tupi-env python=3.10
conda activate tupi-env
pip install poetry
poetry install
poetry run python -m nltk.downloader stopwords
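After running the commands above, a quick sanity check can confirm that the interpreter and packages are in place. The script below is a minimal sketch, not part of the repository: the `check_environment` helper and its default package list are assumptions, and it uses only the standard library, so it runs even if the installation failed.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 10), packages=("nltk",)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_version:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted}+ required, found {found}")
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(issue)
    if not issues:
        print("environment looks OK")
```

Run it inside the activated environment, for example with `poetry run python` followed by the path where you saved the script.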

Acknowledgments

The TuPy project is the result of Felipe Oliveira's thesis work and the contributions of several collaborators. It is funded by the Federal University of Rio de Janeiro (UFRJ) and the Alberto Luiz Coimbra Institute for Postgraduate Studies and Research in Engineering (COPPE).
