Top-5% solution to https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
Detect toxic comments while minimizing unintended model bias.
- Little text preprocessing is needed for NN solutions.
- Bucket sampler saves a lot of time (2-3x faster training); see the code for details and the sketch below
  - split the data into buckets, each bucket holding several batches of samples
  - sort samples by sequence length within each bucket
  - pad each batch only to the max sequence length in that batch
- A custom loss or per-sample weighting is required to mitigate model bias (weighted-BCE sketch below)
- Soft labels carry more information than hard 0/1 labels and can be trained with BCE loss
- Pseudo-labeling is helpful for the LSTM-based NNs
- Knowledge distillation can compress an ensemble into a single model with comparable results (sketch below)
- Word embedding + LSTM-based networks. The embedding layer is also finetuned, with a smaller learning rate, and the whole network is trained with a one-cycle cosine-annealed learning rate schedule (ref; sketch below).
- BERT finetuning with a slanted triangular learning rate schedule (ref; sketch below).
- GPT-2 finetuning with the same slanted triangular learning rate schedule.
The code for the BERT and GPT-2 models is based on huggingface's code, and the notebooks are run on Kaggle kernels.
PyTorch 1.2.0
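
The snippets below are minimal sketches of the main pieces, not the exact competition code; class names and hyperparameters are illustrative. First, a bucket batch sampler along the lines described above: samples are shuffled into buckets of several batches, sorted by length inside each bucket, and padded only to the longest sequence in each batch.

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence

class BucketBatchSampler:
    """Yields lists of indices for DataLoader(batch_sampler=...).

    Each bucket holds `batches_per_bucket` batches; samples are sorted by
    length inside a bucket so every batch contains similarly long sequences."""
    def __init__(self, lengths, batch_size, batches_per_bucket=100):
        self.lengths = lengths
        self.batch_size = batch_size
        self.bucket_size = batch_size * batches_per_bucket

    def __iter__(self):
        indices = list(range(len(self.lengths)))
        random.shuffle(indices)  # reshuffle every epoch
        batches = []
        for start in range(0, len(indices), self.bucket_size):
            bucket = sorted(indices[start:start + self.bucket_size],
                            key=lambda i: self.lengths[i])  # sort by length within the bucket
            batches += [bucket[i:i + self.batch_size]
                        for i in range(0, len(bucket), self.batch_size)]
        random.shuffle(batches)  # shuffle batch order, but keep similar lengths together
        return iter(batches)

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

def collate_fn(batch):
    """Pad only to the longest sequence in this batch, not to a global max length."""
    token_ids, targets = zip(*batch)
    padded = pad_sequence([torch.as_tensor(t) for t in token_ids],
                          batch_first=True, padding_value=0)
    return padded, torch.as_tensor(targets, dtype=torch.float)
```

Used roughly as `DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 512), collate_fn=collate_fn)`; the speedup comes from not padding short comments to the global maximum length.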
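
On the loss side: the competition target is the fraction of annotators who marked a comment toxic, so it can be used directly as a soft label with BCE, and per-sample weights can up-weight identity-related examples. The weighting scheme below is a commonly used variant and only illustrative; the exact scheme used in this solution is in the code.

```python
import torch
import torch.nn.functional as F

def weighted_soft_bce(logits, soft_targets, weights):
    """BCE against soft (non-thresholded) targets, with per-sample weights."""
    loss = F.binary_cross_entropy_with_logits(logits, soft_targets, reduction="none")
    return (loss * weights).sum() / weights.sum()

def make_sample_weights(df, identity_cols):
    """Illustrative weighting on the train.csv DataFrame: up-weight identity
    mentions and the background-positive / subgroup-negative examples that
    drive the bias metric."""
    identity = (df[identity_cols].fillna(0) >= 0.5).any(axis=1).astype(float)
    toxic = (df["target"] >= 0.5).astype(float)
    weights = 1.0 + identity + toxic * (1 - identity) + (1 - toxic) * identity
    return torch.tensor(weights.values, dtype=torch.float)
```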
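
Knowledge distillation here means training a single student network on the ensemble's predictions. A common formulation, which this sketch assumes, blends the ensemble's averaged soft predictions with the original soft labels; the mixing weight `alpha` is a hypothetical value, not the one used in the solution.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, soft_labels, teacher_probs, alpha=0.5):
    """BCE against a blend of the original soft labels and the ensemble's
    averaged sigmoid outputs; alpha=0.5 is an arbitrary illustrative value."""
    target = alpha * teacher_probs + (1.0 - alpha) * soft_labels
    return F.binary_cross_entropy_with_logits(student_logits, target)
```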
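
For the LSTM models, the embedding matrix gets its own parameter group with a smaller learning rate, and the whole network follows a one-cycle cosine-annealed schedule. PyTorch 1.2 predates `OneCycleLR`, so this sketch builds the schedule with `LambdaLR`; the model definition and all hyperparameters are placeholders, not the solution's actual values.

```python
import math
import torch
import torch.nn as nn

class ToxicLSTM(nn.Module):
    """Minimal stand-in for the embedding + LSTM classifier (hypothetical sizes)."""
    def __init__(self, vocab_size=100_000, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        out, _ = self.lstm(self.embedding(tokens))
        return self.head(out.max(dim=1)[0]).squeeze(-1)

def one_cycle_cosine(step, total_steps, warmup_frac=0.1):
    """One-cycle multiplier: linear warmup, then cosine annealing towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = ToxicLSTM()
optimizer = torch.optim.Adam([
    {"params": model.embedding.parameters(), "lr": 1e-5},  # embeddings finetuned with a smaller LR
    {"params": list(model.lstm.parameters()) + list(model.head.parameters()), "lr": 1e-3},
])
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: one_cycle_cosine(step, total_steps=10_000))
# each training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```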
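
The slanted triangular schedule used for BERT and GPT-2 finetuning is a short linear warmup followed by a linear decay to zero. Huggingface's code of that era ships an equivalent warmup-then-linear-decay schedule; the sketch below reproduces the shape with `LambdaLR` so it is explicit. The optimizer, learning rate, and step counts are placeholders.

```python
import torch

def slanted_triangular(step, total_steps, warmup_frac=0.06):
    """Linear warmup to the peak learning rate, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Placeholder optimizer; in practice this wraps the BERT / GPT-2 parameters.
optimizer = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=2e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: slanted_triangular(step, total_steps=20_000))
# call scheduler.step() after every optimizer.step()
```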