# PyTorch implementation of Chaudhary et al. 2020's TopicBERT
Install conda if you have not already done so, then run:

```sh
conda env create -f environment.yml
```

This will create a Python environment that strictly adheres to the versioning indicated in the project proposal. It is intended to closely mirror Google Colab.
Then train the model via `main.py`. There are many options that can be set; run `python main.py -h` to see them all. One particularly helpful option is `-s PATH` or `--save PATH`, which saves the given options as a JSON file that can easily be used again with `--load PATH`.
Sample `config.json` (the `//` comments below are annotations only; remove them in an actual JSON file):

```jsonc
{
    "dataset": "reuters8",
    "label_path": ".../labels.txt",
    "train_dataset_path": ".../training.tsv",
    "val_dataset_path": ".../validation.tsv",
    "test_dataset_path": ".../test.tsv",
    "num_workers": 8,
    "batch_size": 16,
    "warmup_steps": 10,
    "lr": 2e-05,
    "alpha": 0.9,
    "num_epochs": 2,
    "clip": 1.0,
    "seed": 42,
    "device": "cuda",
    "val_freq": 0.0,
    "test_freq": 0.0,
    "disable_tensorboard": false,
    "tensorboard_dir": "runs/topicbert-512",
    // directory where checkpoints should be stored
    "resume": ".../checkpoints/",
    // whether to look for a checkpoint in the above directory or just save a new one there
    "save_checkpoint_only": true,
    "verbose": true,
    "silent": false,
    "load": null,
    "save": "config.json"
}
```
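As a quick sketch of the save/load round trip (`--save` and `--load` are documented above; the other long-form flags are assumed to mirror the JSON keys):

```sh
# Save the options used for a run, then reproduce the run later from the file.
# (Flag names other than --save/--load are assumed from the config keys above.)
python main.py --dataset reuters8 --batch_size 16 --num_epochs 2 --save config.json
python main.py --load config.json
```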
Alternatively, open `experiment.ipynb` in Google Colab.

Milestones:
- Have working BERT on some dataset (SST-2)
  - Completed 4/8/21, Liam
- Reuters8 Dataset & DataLoader set up
  - Dataset & DataLoader done 4/9/21, Liam
- BERT doing standalone prediction on Reuters8
  - Done: achieves 99.5% train, 98.0% val accuracy (run on Google Colab), 4/10/21, Liam
- Set up NVDM topic model on some dataset
- NVDM working on Reuters8
  - Done: error behaves as expected during training but needs further analysis, 4/18/21, Liam
- Create joint model (TopicBERT)
  - Coding complete, 4/19/21, Liam
- Achieve near baselines with TopicBERT
  - We achieve a 0.96 F1 score on Reuters8 with TopicBERT-512, marginally outperforming the original paper. See the differences section for potential factors.
  - Done, 4/19/21, Liam
- Move from Jupyter to Python modules
  - All "modules" converted, 4/25/21, Liam. `training` package and `main.py` complete, 4/26/21, Liam.
- Measure performance baselines
  - All baselines finalized, 5/3/21, Liam. Happy to report that the model has the expected performance (runtime & accuracy) characteristics!
Non-modification Extensions Pursued:

- Pre-train VAE.
  - Implemented HR-VAE as a compatible model with TopicBERT. The TopicBERT main script can now pre-train an HR-VAE model on a dataset. 5/8/21, Liam.
More Extension Ideas:
- Test new datasets in topic classification
- Test datasets in a different domain (e.g. NLI, GLUE)
Differences:

This section maintains a (non-definitive) list of differences between the original implementation and this repository's code.
- `F_MIN` set to `10` on the Reuters8 dataset yields a vocab size of `K = 4832` rather than the `K = 4813` reported in the original paper, despite following the same text-cleaning guidelines. We assume this will not significantly affect results.
- `F_MIN` set to `100` on the IMDB dataset yields a vocab size of `K = 7358` rather than the `K = 6823` reported in the original paper, despite following the same text-cleaning guidelines. We assume this will not significantly affect results.
- We use a size-1k validation set for IMDB (24k train), whereas the original authors used a 5k validation set.
- The original authors use `bert-base-cased`. As all data is lowercased across datasets in the original experiments, we change this to `bert-base-uncased`.
- Labels are one-hot encoded. We use `torch.max(...)[1]` to extract prediction & label indices; these indices can be converted back and forth with label strings via `dataset.label_mapping[index]` and `dataset.label_mapping[label_str]` (see the first sketch after this list).
- The NVDM in the original paper uses `tanh` activation for its multilayer perceptron, while the authors' TensorFlow implementation uses `sigmoid`. We use `GELU`, as the NVDM paper (Miao et al. 2016) does as well.
- TopicBERT as described in the paper has a projection layer consisting of a single matrix $\mathbf{P} \in \mathbb{R}^{\hat{H} \times H_B}$. We add a `GELU` activation after $\mathbf{P}$. The original authors' TensorFlow implementation uses a `tf.keras.layers.Dense` layer, which adds a bias vector and `GELU` activation after $\mathbf{P}$ (see the second sketch after this list).
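A minimal sketch of the index extraction described above, with hypothetical tensors (only `torch.max(...)[1]` reflects the repo's actual usage):

```python
import torch

# Hypothetical logits and one-hot labels for a batch of 2 examples, 3 classes.
logits = torch.tensor([[0.1, 2.3, -1.0],
                       [1.5, 0.2, 0.4]])
labels_onehot = torch.tensor([[0.0, 1.0, 0.0],
                              [1.0, 0.0, 0.0]])

# torch.max(x, 1) returns (values, indices) along dim 1; [1] keeps the indices,
# which can then be mapped to label strings via dataset.label_mapping.
pred_idx = torch.max(logits, 1)[1]          # tensor([1, 0])
label_idx = torch.max(labels_onehot, 1)[1]  # tensor([1, 0])
```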
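And a minimal sketch of the projection-layer difference, assuming illustrative dimensions (`H_hat` and `H_B` stand in for $\hat{H}$ and $H_B$; the true sizes depend on the topic and BERT hidden dimensions):

```python
import torch.nn as nn

H_hat, H_B = 968, 768  # illustrative sizes only

# Our reading of the paper's bare matrix P, plus the GELU we add
# (a bias-free Linear is the closest PyTorch analogue of a lone matrix):
proj_ours = nn.Sequential(nn.Linear(H_hat, H_B, bias=False), nn.GELU())

# The authors' tf.keras.layers.Dense equivalent, which also adds a bias vector:
proj_tf_style = nn.Sequential(nn.Linear(H_hat, H_B, bias=True), nn.GELU())
```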