Top-5% solution to https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
Detect toxic comments while minimizing unintended model bias.
- Little text preprocessing is needed for NN solutions.
- Bucket sampler saves a lot of time (2-3x faster training); see the code for details and the sketch below
  - split the data into buckets, each bucket holding several batches of samples
  - sort samples by sequence length within each bucket
  - pad each batch only to the max sequence length in that batch
- A custom loss or per-sample weighting is required to mitigate model bias (weighted-BCE sketch below)
- Soft labels carry more information than hard 0/1 labels and can be trained with BCE loss
- Pseudo-labeling is helpful for the LSTM-based NNs
- Knowledge distillation can compress an ensemble into a single model with comparable results (sketch below)
- Word embedding + LSTM-based networks. The embedding layer is also finetuned, with a smaller learning rate, and the whole network is trained with a one-cycle cosine-annealed learning rate schedule (ref; sketch below).
- BERT finetuning with a slanted triangular learning rate schedule (ref; sketch below).
- GPT-2 finetuning with the same slanted triangular learning rate schedule.
The code for the BERT and GPT-2 models is based on huggingface's code, and the notebooks are run on Kaggle kernels.
PyTorch 1.2.0
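
The snippets below are minimal sketches of the main pieces, not the exact competition code; class names and hyperparameters are illustrative. First, a bucket batch sampler along the lines described above: samples are shuffled into buckets of several batches, sorted by length inside each bucket, and padded only to the longest sequence in each batch.

```python
import random
import torch
from torch.nn.utils.rnn import pad_sequence

class BucketBatchSampler:
    """Yields lists of indices for DataLoader(batch_sampler=...).

    Each bucket holds `batches_per_bucket` batches; samples are sorted by
    length inside a bucket so every batch contains similarly long sequences."""
    def __init__(self, lengths, batch_size, batches_per_bucket=100):
        self.lengths = lengths
        self.batch_size = batch_size
        self.bucket_size = batch_size * batches_per_bucket

    def __iter__(self):
        indices = list(range(len(self.lengths)))
        random.shuffle(indices)  # reshuffle every epoch
        batches = []
        for start in range(0, len(indices), self.bucket_size):
            bucket = sorted(indices[start:start + self.bucket_size],
                            key=lambda i: self.lengths[i])  # sort by length within the bucket
            batches += [bucket[i:i + self.batch_size]
                        for i in range(0, len(bucket), self.batch_size)]
        random.shuffle(batches)  # shuffle batch order, but keep similar lengths together
        return iter(batches)

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

def collate_fn(batch):
    """Pad only to the longest sequence in this batch, not to a global max length."""
    token_ids, targets = zip(*batch)
    padded = pad_sequence([torch.as_tensor(t) for t in token_ids],
                          batch_first=True, padding_value=0)
    return padded, torch.as_tensor(targets, dtype=torch.float)
```

Used roughly as `DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 512), collate_fn=collate_fn)`; the speedup comes from not padding short comments to the global maximum length.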
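
On the loss side: the competition target is the fraction of annotators who marked a comment toxic, so it can be used directly as a soft label with BCE, and per-sample weights can up-weight identity-related examples. The weighting scheme below is a commonly used variant and only illustrative; the exact scheme used in this solution is in the code.

```python
import torch
import torch.nn.functional as F

def weighted_soft_bce(logits, soft_targets, weights):
    """BCE against soft (non-thresholded) targets, with per-sample weights."""
    loss = F.binary_cross_entropy_with_logits(logits, soft_targets, reduction="none")
    return (loss * weights).sum() / weights.sum()

def make_sample_weights(df, identity_cols):
    """Illustrative weighting on the train.csv DataFrame: up-weight identity
    mentions and the background-positive / subgroup-negative examples that
    drive the bias metric."""
    identity = (df[identity_cols].fillna(0) >= 0.5).any(axis=1).astype(float)
    toxic = (df["target"] >= 0.5).astype(float)
    weights = 1.0 + identity + toxic * (1 - identity) + (1 - toxic) * identity
    return torch.tensor(weights.values, dtype=torch.float)
```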
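
Knowledge distillation here means training a single student network on the ensemble's predictions. A common formulation, which this sketch assumes, blends the ensemble's averaged soft predictions with the original soft labels; the mixing weight `alpha` is a hypothetical value, not the one used in the solution.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, soft_labels, teacher_probs, alpha=0.5):
    """BCE against a blend of the original soft labels and the ensemble's
    averaged sigmoid outputs; alpha=0.5 is an arbitrary illustrative value."""
    target = alpha * teacher_probs + (1.0 - alpha) * soft_labels
    return F.binary_cross_entropy_with_logits(student_logits, target)
```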
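
For the LSTM models, the embedding matrix gets its own parameter group with a smaller learning rate, and the whole network follows a one-cycle cosine-annealed schedule. PyTorch 1.2 predates `OneCycleLR`, so this sketch builds the schedule with `LambdaLR`; the model definition and all hyperparameters are placeholders, not the solution's actual values.

```python
import math
import torch
import torch.nn as nn

class ToxicLSTM(nn.Module):
    """Minimal stand-in for the embedding + LSTM classifier (hypothetical sizes)."""
    def __init__(self, vocab_size=100_000, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        out, _ = self.lstm(self.embedding(tokens))
        return self.head(out.max(dim=1)[0]).squeeze(-1)

def one_cycle_cosine(step, total_steps, warmup_frac=0.1):
    """One-cycle multiplier: linear warmup, then cosine annealing towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = ToxicLSTM()
optimizer = torch.optim.Adam([
    {"params": model.embedding.parameters(), "lr": 1e-5},  # embeddings finetuned with a smaller LR
    {"params": list(model.lstm.parameters()) + list(model.head.parameters()), "lr": 1e-3},
])
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: one_cycle_cosine(step, total_steps=10_000))
# each training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```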
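
The slanted triangular schedule used for BERT and GPT-2 finetuning is a short linear warmup followed by a linear decay to zero. Huggingface's code of that era ships an equivalent warmup-then-linear-decay schedule; the sketch below reproduces the shape with `LambdaLR` so it is explicit. The optimizer, learning rate, and step counts are placeholders.

```python
import torch

def slanted_triangular(step, total_steps, warmup_frac=0.06):
    """Linear warmup to the peak learning rate, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Placeholder optimizer; in practice this wraps the BERT / GPT-2 parameters.
optimizer = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=2e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: slanted_triangular(step, total_steps=20_000))
# call scheduler.step() after every optimizer.step()
```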