TopicBERT-PyTorch

PyTorch implementation of Chaudhary et al. 2020's TopicBERT

Getting Started:

Install conda if you have not already done so. Then run

conda env create -f environment.yml

This will create a Python environment that strictly adheres to the versioning indicated in the project proposal. It is intended to closely mirror Google Colab.

Then train the model via main.py. There are many options that can be set; run python main.py -h to see them all.

One particularly helpful option is -s PATH or --save PATH, which saves the given options as a JSON file that can easily be used again with --load PATH.

Sample config.json (the // comments below are illustrative only and must be removed from an actual JSON file, since JSON does not support comments):

{
    "dataset": "reuters8",
    "label_path": ".../labels.txt",
    "train_dataset_path": ".../training.tsv",
    "val_dataset_path": ".../validation.tsv",
    "test_dataset_path": ".../test.tsv",
    "num_workers": 8,
    "batch_size": 16,
    "warmup_steps": 10,
    "lr": 2e-05,
    "alpha": 0.9,
    "num_epochs": 2,
    "clip": 1.0,
    "seed": 42,
    "device": "cuda",
    "val_freq": 0.0,
    "test_freq": 0.0,
    "disable_tensorboard": false,
    "tensorboard_dir": "runs/topicbert-512",
    // directory where checkpoints should be
    "resume": ".../checkpoints/", 
    // whether to look for a checkpoint in above or just save a new one there
    "save_checkpoint_only": true, 
    "verbose": true,
    "silent": false,
    "load": null,
    "save": "config.json"
}
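
For example, the typical save-then-reuse workflow looks like this (the --save and --load flags are documented above; the other flag names are assumed to mirror the JSON keys and should be confirmed with python main.py -h):

python main.py --dataset reuters8 --batch_size 16 --num_epochs 2 --save config.json
python main.py --load config.json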

Alternatively, open experiment.ipynb in Google Colab.


Roadmap (DONE)

  • Have working BERT on some dataset (SST-2)
    • Completed on 4/8/21, Liam
  • Reuters8 Dataset & DataLoader set up
    • Dataset & DataLoader done on 4/9/21, Liam
  • BERT doing standalone prediction on Reuters8
    • Done — achieves 99.5% train / 98.0% val accuracy when run on Google Colab, 4/10/21, Liam
  • Set up NVDM topic model on some dataset
  • NVDM working on Reuters8
    • Done — training error behaves as expected; needs further analysis, 4/18/21, Liam
  • Create joint model (TopicBERT)
    • Coding complete, 4/19/21, Liam
  • Achieve near baselines with TopicBERT
    • We achieve a 0.96 F1 score on Reuters8 with TopicBERT-512, marginally outperforming the original paper. See the Differences section for potential factors.
    • Done, 4/19/21, Liam
  • Move from Jupyter to Python modules
    • All "modules" converted, 4/25/21, Liam.
    • training package and main.py complete, 4/26/21, Liam.
  • Measure performance baselines
    • All baselines finalized, 5/3/21, Liam.

Happy to report that the model has the expected performance characteristics (runtime & accuracy)!

Non-modification Extensions Pursued:

  • Pre-train VAE.
    • Implemented HR-VAE as a model compatible with TopicBERT. The TopicBERT main script can now pre-train an HR-VAE model on a dataset. 5/8/21, Liam.

More Extension Ideas:

  • Test new datasets in topic classification
  • Test datasets in a different domain (e.g. NLI, GLUE)

Differences

This section maintains a (non-definitive) list of differences between the original implementation and this repository's code.

  • F_MIN set to 10 on the Reuters8 dataset yields a vocab size of K = 4832 rather than the K = 4813 reported in the original paper, despite following the same text-cleaning guidelines. We assume this will not significantly affect results.
  • F_MIN set to 100 on the IMDB dataset yields a vocab size of K = 7358 rather than the K = 6823 reported in the original paper, despite following the same text-cleaning guidelines. We assume this will not significantly affect results.
  • We use a 1k validation set for IMDB (24k train), whereas the original authors used a 5k validation set.
  • The original authors use bert-base-cased. As all data is lowercased across datasets in the original experiments, we change this to bert-base-uncased.
  • Labels are one-hot encoded. We use torch.max(...)[1] to extract prediction & label indices. These indices can be converted back and forth to label strings via dataset.label_mapping[index] and dataset.label_mapping[label_str] (see the label-handling sketch after this list).
  • The original paper uses tanh activation for the NVDM's multilayer perceptron, while the authors' TensorFlow implementation uses sigmoid. We use GELU, as the NVDM paper (Miao et al. 2016) does as well.
  • TopicBERT as described in the paper has a projection layer consisting of a single matrix $\mathbf{P} \in \mathbb{R}^{\hat{H} \times H_B}$. We add GELU activation after $\mathbf{P}$. The original authors' TensorFlow implementation uses a tf.keras.layers.Dense layer, which adds a bias vector and GELU activation after $\mathbf{P}$ (see the projection-layer sketch after this list).
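
As a minimal sketch of the label handling described above (the tensors and sizes here are illustrative, not the exact objects used in this repository):

import torch

num_classes = 8   # Reuters8 has 8 classes
batch_size = 4

# Hypothetical model logits and one-hot labels, both of shape (batch_size, num_classes)
logits = torch.randn(batch_size, num_classes)
one_hot_labels = torch.nn.functional.one_hot(
    torch.randint(num_classes, (batch_size,)), num_classes
).float()

# torch.max(..., 1) returns (values, indices); [1] gives the argmax class indices
pred_indices = torch.max(logits, 1)[1]
label_indices = torch.max(one_hot_labels, 1)[1]

accuracy = (pred_indices == label_indices).float().mean()

# dataset.label_mapping converts in both directions, e.g.
#   dataset.label_mapping[label_indices[0].item()]  ->  label string
#   dataset.label_mapping["acq"]                    ->  integer index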
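
The projection-layer difference can likewise be sketched in PyTorch (the sizes below are illustrative placeholders, not the exact $\hat{H}$ and $H_B$ used in the experiments):

import torch
import torch.nn as nn

H_B = 768     # BERT hidden size
H_hat = 868   # size of the combined representation fed to the projection (illustrative)

# Paper: a single projection matrix P in R^{H_hat x H_B}.
# Authors' TF code: tf.keras.layers.Dense, i.e. P plus a bias vector, with GELU.
# This repo follows the TF code: nn.Linear (P plus bias) followed by GELU.
projection = nn.Sequential(
    nn.Linear(H_hat, H_B),
    nn.GELU(),
)

combined = torch.randn(16, H_hat)   # batch of combined document representations
projected = projection(combined)    # shape: (16, H_B)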
