BERT in PyTorch

A clean implementation of Bidirectional Encoder Representations from Transformers proposed by Devlin et al. in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Requirements

  • Python 3
  • PyTorch >= 1.1
  • tqdm
  • NumPy

Training your own BERT

To train your own BERT, tokenize your dataset into an array of arrays of integers, where each inner array represents one sentence and consecutive arrays correspond to consecutive sentences in the corpus. You may find the BERTTokenizer (copied from here) useful. You also need a txt file listing the vocabulary, including the [SEP], [MASK], [PAD] and [CLS] tokens, with each vocabulary item on its own line. Examples are provided in the data directory. You can then initialize the BERT dataset and the BERT trainer and run the train method as shown in examples.py; a sketch of this workflow follows below.
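
The end-to-end flow (tokenize, build the dataset, train) might look roughly like the following. The module paths, class names (BERTDataset, BERTTrainer), and method signatures here are assumptions made for illustration; examples.py in this repository is the authoritative reference.

```python
# Sketch only: module paths, class names, and signatures below are illustrative
# assumptions; see examples.py for the real usage.

from tokenization import BERTTokenizer        # assumed module name
from bert import BERTDataset, BERTTrainer     # assumed module and class names

# 1. Tokenize the corpus into an array of arrays of integers: one inner array
#    per sentence, kept in the same order as the sentences appear in the corpus.
tokenizer = BERTTokenizer("data/vocab.txt")   # vocab file lists [PAD], [CLS], [SEP], [MASK], one token per line
corpus = ["the first sentence .", "the second sentence ."]
tokenized = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s)) for s in corpus]

# 2. Wrap the tokenized corpus in the BERT dataset, build the trainer, and train.
dataset = BERTDataset(tokenized)              # assumed constructor
trainer = BERTTrainer(dataset)                # assumed constructor
trainer.train(epochs=10)                      # assumed signature; see examples.py
```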

Loading the Huggingface weights

The pre-trained TensorFlow weights have been ported to PyTorch by huggingface. If you want to use these pre-trained weights, you can load them with the function provided in HuggingfaceUtils.py; an example usage is given in examples.py and sketched below.
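
A rough sketch of how loading the ported weights might look. The function name and argument shown here are assumptions; HuggingfaceUtils.py and examples.py show the actual API.

```python
# Sketch only: the function name and its argument are illustrative assumptions;
# the real helper and its signature are in HuggingfaceUtils.py and examples.py.

from HuggingfaceUtils import load_huggingface_weights  # assumed function name

model = load_huggingface_weights("path/to/huggingface-bert-checkpoint")  # assumed signature
model.eval()  # put the model in inference mode before using the pre-trained weights
```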
