Pretraining BERT (DistilBERT) with PyTorch and Hugging Face

Step 1. Collect data to train Tokenizer and BERT

Refer to data/collate_data.ipynb and utils/train_tokenizer.py.
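A minimal sketch of the tokenizer-training step, assuming a WordPiece tokenizer built with the Hugging Face `tokenizers` library; the corpus path and vocabulary size are placeholders, and utils/train_tokenizer.py may differ in detail.

```python
# Sketch only: train a WordPiece tokenizer on plain-text files.
# The corpus path, vocab size, and output directory are assumptions.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["data/corpus.txt"],        # hypothetical corpus path
    vocab_size=30522,                 # standard BERT vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tokenizer")     # writes vocab.txt into ./tokenizer
```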

Step 2. Make the Pre-train data

Refer to data/collate_data.ipynb.

  • Masked Language Modeling

    • 15% of tokens are selected for prediction (80% : replaced with [MASK], 10% : replaced with a random token, 10% : kept as the original); see the masking sketch after this list
  • Next Sentence Prediction

    • 50% of sentence pairs are positive (0 : not the next sentence, 1 : the next sentence)
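The masking rule above can be sketched as follows; this is an illustration of the 15% / 80-10-10 scheme, not the repository's exact implementation, and special tokens such as [CLS] and [SEP] are ignored for brevity.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Apply the 15% / 80-10-10 masking rule; returns (input_ids, labels)."""
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)      # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= mlm_prob:   # ~85% of tokens are left untouched
            continue
        labels[i] = tok                   # the model must predict the original token
        roll = random.random()
        if roll < 0.8:                    # 80% of selected tokens -> [MASK]
            input_ids[i] = mask_id
        elif roll < 0.9:                  # 10% -> a random vocabulary token
            input_ids[i] = random.randrange(vocab_size)
        # remaining 10% -> keep the original token
    return input_ids, labels
```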

Step 3. Training

python run_pretraining.py --c config.json --cont --checkpoint results/1000-step
  • --config_path : config file (default : './config.json')

  • --continuous : boolean for continuous training

  • --checkpoint : path of checkpoint for continuous training
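As a rough illustration, a single pre-training step with both MLM and NSP heads might look like the sketch below, using Hugging Face's BertForPreTraining; the config values, dummy batch, and checkpoint path are assumptions, not the repository's actual run_pretraining.py.

```python
# Sketch of one BERT pre-training step (MLM + NSP). All values are placeholders.
import torch
from transformers import BertConfig, BertForPreTraining

config = BertConfig(vocab_size=30522)          # hypothetical config.json contents
model = BertForPreTraining(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# one step on a dummy batch (batch size 2, sequence length 8)
batch = {
    "input_ids": torch.randint(0, config.vocab_size, (2, 8)),
    "attention_mask": torch.ones(2, 8, dtype=torch.long),
    "labels": torch.randint(0, config.vocab_size, (2, 8)),  # dummy MLM targets
    "next_sentence_label": torch.tensor([0, 1]),            # dummy NSP labels
}
loss = model(**batch).loss      # combined MLM + NSP loss
loss.backward()
optimizer.step()

# Resuming (--continuous / --checkpoint) might load weights like:
# model.load_state_dict(torch.load("results/1000-step/pytorch_model.bin"))
```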

Step 4. Evaluate

Refer to evaluate_script.py.
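One simple way to sanity-check a trained checkpoint is a fill-mask query, as sketched below; the model and tokenizer paths are hypothetical, and evaluate_script.py may perform a different evaluation.

```python
# Sketch only: load a checkpoint and inspect its masked-token predictions.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="results/1000-step",    # hypothetical checkpoint directory
    tokenizer="tokenizer",        # hypothetical tokenizer directory
)
print(fill_mask("The capital of France is [MASK]."))
```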