lgessler/microbert

⚠️ NOTE: If you want to train a MicroBERT for your language, please see lgessler/microbert2.

Introduction

MicroBERT is a BERT variant intended for training monolingual models for low-resource languages. It reduces model size and uses multitask learning on part-of-speech tagging and dependency parsing in addition to the usual masked language modeling objective.
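
To make the multitask setup concrete, here is a schematic sketch of how a shared encoder can feed several task heads whose losses are summed; all names, shapes, and heads below are illustrative assumptions, not the repository's actual code.

# Schematic multitask sketch: a shared encoder feeds per-task heads, and their
# losses are summed so every task updates the shared parameters.
# Illustrative only; names and shapes are assumptions, not MicroBERT's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab, n_tags = 128, 1000, 17

encoder = nn.Embedding(vocab, hidden)   # stand-in for a small BERT encoder
mlm_head = nn.Linear(hidden, vocab)     # masked language modeling head
xpos_head = nn.Linear(hidden, n_tags)   # XPOS tagging head
# (a dependency parsing head would be added in the same way)

tokens = torch.randint(0, vocab, (2, 8))   # fake batch: 2 sentences x 8 tokens
tags = torch.randint(0, n_tags, (2, 8))    # fake XPOS tags
states = encoder(tokens)                   # shared representations

mlm_loss = F.cross_entropy(mlm_head(states).view(-1, vocab), tokens.view(-1))
xpos_loss = F.cross_entropy(xpos_head(states).view(-1, n_tags), tags.view(-1))
(mlm_loss + xpos_loss).backward()          # gradients flow into the shared encoder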

For more information, please see our paper. If you'd like to cite our work, please use the following citation:

@inproceedings{gessler-zeldes-2022-microbert,
    title = "{M}icro{BERT}: Effective Training of Low-resource Monolingual {BERT}s through Parameter Reduction and Multitask Learning",
    author = "Gessler, Luke  and
      Zeldes, Amir",
    booktitle = "Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.mrl-1.9",
    pages = "86--99",
}

Pretrained Models

The following pretrained models are available. Note that each model's suffix indicates the tasks that were used to pretrain it: masked language modeling (m), XPOS tagging (x), or dependency parsing (p).
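
As a minimal sketch, a pretrained model can be loaded with the Hugging Face transformers library. The model identifier below is an assumption used for illustration; substitute the name of the model you actually want.

# Minimal loading sketch; the model name is a hypothetical placeholder.
from transformers import AutoModel, AutoTokenizer

model_name = "lgessler/microbert-coptic-m"  # assumed identifier; replace as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("an example sentence", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)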

Usage

Setup

  1. Ensure submodules are initialized:
git submodule update --init --recursive
  2. Create a new environment:
conda create --name embur python=3.9
conda activate embur
  3. Install PyTorch and related packages:
conda install pytorch torchvision cudatoolkit -c pytorch
  4. Install the remaining dependencies:
pip install -r requirements.txt
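
A quick sanity check (not part of the repository's own instructions) can confirm that PyTorch imported correctly and report whether a CUDA device is visible:

# Optional post-setup check: prints the installed PyTorch version and CUDA status.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())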

Experiments

This repo is exposed as a CLI with the following commands:

├── data                  # Data prep commands
│   ├── prepare-mlm
│   └── prepare-ner
├── word2vec              # Static embedding condition
│   ├── train
│   ├── evaluate-ner
│   └── evaluate-parser
├── mbert                 # Pretrained MBERT
│   ├── evaluate-ner
│   └── evaluate-parser
├── mbert-va              # Pretrained MBERT with Chau et al. (2020)'s VA method
│   ├── evaluate-ner
│   ├── evaluate-parser
│   └── train
├── bert                  # Monolingual BERT--main experimental condition
│   ├── evaluate-ner
│   ├── evaluate-parser
│   └── train
├── evaluate-ner-all      # Convenience to perform evals on all NER conditions
├── evaluate-parser-all   # Convenience to perform evals on all parser conditions
└── stats                 # Supporting commands for statistical summaries
    └── format-metrics

To see more information, add --help at the end of any partial subcommand, e.g. python main.py --help, python main.py bert --help, python main.py word2vec train --help.

Adding a language

To add a new language, follow these conventions:

  1. Put all data under data/$NAME/, with "raw" data going in some kind of subdirectory. (If it is a UD corpus, the standard UD name would be good, e.g. data/coptic/UD_Coptic-Scriptorium)
  2. Ensure that it will be handled properly by the module embur.commands.data. If appropriate, put a script at embur/scripts/$NAME_data_prep.py that takes the dataset's native format and writes it out into data/$NAME/converted (see the sketch after this list).
  3. Update embur.language_configs with the language's information.
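
The following is a hypothetical sketch of what embur/scripts/$NAME_data_prep.py might look like for a UD corpus. The expected output format is determined by embur.commands.data, so treat this only as the general shape of such a script, not a drop-in implementation.

# Hypothetical data prep sketch; the paths and the copy-only "conversion" are
# assumptions for illustration, not the repository's required format.
import os
import shutil

RAW_DIR = "data/mylang/UD_MyLang-Treebank"   # assumed raw data location
OUT_DIR = "data/mylang/converted"            # conventional output location

os.makedirs(OUT_DIR, exist_ok=True)
for filename in os.listdir(RAW_DIR):
    if filename.endswith(".conllu"):
        # A real script would parse the native format and rewrite it here;
        # for an already-standard UD corpus this may be little more than a copy.
        shutil.copy(os.path.join(RAW_DIR, filename), os.path.join(OUT_DIR, filename))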

If you'd like to add a language's Wikipedia dump, see wiki-thresher.

Please don't hesitate to email me ([email protected]) if you have any questions.
