v0.2.0 - Python 3.8, multilabel classification, backtranslation
Major items in this release:
Python 3.8 support
We now run CI on both Python 3.7 and Python 3.8, making both versions officially supported. The only major change for the upgrade is that we now require the installed version of ray to be at least 0.8.4.
Multilabel classification
Most models now transparently support true multilabel classification, where the output layer of the model reports a predicted probability for each label rather than a single predicted class. Simply pass a List[List[str]] in place of a List[str] whenever you're setting labels, where each inner List[str] is a list of labels that apply to the document. The model will infer the set of all labels from your data and generate a predicted probability for each label on all new data. Also added a benchmark dataset for the multilabel classification case: the CMU Movie Summary dataset. The interactive apps should also work; note that labels are delimited in CSV/TSV files using nested commas by default, but this can be changed using a command line argument.
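To make the label format concrete, here is a minimal, library-independent sketch of the difference between single-label and multilabel inputs, and of how a label set can be inferred from the data. The helper names (infer_label_set, to_indicator_rows) are hypothetical and only illustrate the idea; they are not part of this library's API, and the library's internal encoding may differ.

```python
from typing import List

# Single-label format: exactly one label per document.
single_label_y: List[str] = ["comedy", "drama", "comedy"]

# Multilabel format: a list of labels per document (possibly empty).
multilabel_y: List[List[str]] = [
    ["comedy", "romance"],
    ["drama"],
    [],
]


def infer_label_set(y: List[List[str]]) -> List[str]:
    """Hypothetical helper: collect every distinct label, sorted."""
    return sorted({label for labels in y for label in labels})


def to_indicator_rows(y: List[List[str]], label_set: List[str]) -> List[List[int]]:
    """Hypothetical helper: one 0/1 flag per label for each document."""
    return [[int(label in labels) for label in label_set] for labels in y]


labels = infer_label_set(multilabel_y)
rows = to_indicator_rows(multilabel_y, labels)
```

A model trained on targets of this shape can then report one probability per entry in labels for each new document, rather than a single predicted class.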
Backtranslation
Implemented a new data augmentation approach based on transformers' implementation of the Marian Machine Translation model. Pass a list of target languages, and the model will translate each document from English into each target language and back, generating a list of texts that are similar, but not identical, to the originals.
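The round trip described above can be sketched roughly as follows. This is an illustration under stated assumptions, not this library's implementation: the function names backtranslate and marian_translate are hypothetical, the Helsinki-NLP/opus-mt-{src}-{tgt} checkpoint naming is assumed to cover the language pairs you pick, and the tokenizer call style assumes a recent transformers release (older releases used a different batching method).

```python
from typing import Callable, List

# (texts, source_language, target_language) -> translated texts
Translator = Callable[[List[str], str, str], List[str]]


def marian_translate(texts: List[str], src: str, tgt: str) -> List[str]:
    """Translate with a pretrained Marian checkpoint (assumed to exist for src-tgt)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    # Reloading per call keeps the sketch short; cache these in real use.
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def backtranslate(
    texts: List[str],
    target_languages: List[str],
    translate: Translator = marian_translate,
    source_language: str = "en",
) -> List[str]:
    """Translate each text to every target language and back again."""
    augmented: List[str] = []
    for lang in target_languages:
        forward = translate(texts, source_language, lang)
        augmented.extend(translate(forward, lang, source_language))
    return augmented
```

For example, backtranslate(["The movie was great."], ["fr", "de"]) would return two paraphrase candidates, one per target language. Passing the translator in as a parameter keeps the round-trip logic testable without downloading any model weights.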
Miscellaneous improvements
- Fix a potential error installing newer versions of sentencepiece (>=0.1.90).
- Fix an error installing an older version of gensim (<3.8.2).
- Fix errors running the interactive apps with tiny sample sizes (although you probably weren't trying to run them with 1 document... right?).
- Fix some encoding errors reading data in the Transformer and SpaCyModel models.
- Rework charting in benchmark output to prevent timeout errors during benchmarks.
- Upgraded the version of transformers in the Transformer model to 2.8.0, allowing for use of the ELECTRA model.