WellcomeML utils

Note: this repository is deprecated and is no longer maintained.

This package contains utility functions for common tasks at the Wellcome Trust, in particular for processing, embedding and classifying text data. It includes:

An intuitive sklearn-like API wrapping text vectorizers, such as Doc2vec, Bert, Scibert
Common API for off-the-shelf classifiers to allow quick iteration (e.g. Frequency Vectorizer, Bert, Scibert, basic CNN, BiLSTM, SemanticSimilarity)
Utils to download and convert academic text datasets for benchmarking
Utils to download data from the EPMC API
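To make "sklearn-like" concrete, here is a minimal sketch of the fit/transform convention that such an API follows. The class `ToyFrequencyVectorizer` is hypothetical and written only for illustration; it is not a WellcomeML class and needs no extras installed.

```python
from collections import Counter


class ToyFrequencyVectorizer:
    """Hypothetical stand-in that mimics the sklearn fit/transform convention."""

    def fit(self, texts):
        # Learn a sorted vocabulary from lowercased, whitespace-separated tokens.
        vocab = sorted({token for text in texts for token in text.lower().split()})
        self.vocabulary_ = {token: index for index, token in enumerate(vocab)}
        return self

    def transform(self, texts):
        # Map each text to a list of token counts, one slot per vocabulary entry.
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            rows.append([counts.get(token, 0) for token in self.vocabulary_])
        return rows

    def fit_transform(self, texts):
        # Convenience method: fit on the texts, then transform the same texts.
        return self.fit(texts).transform(texts)


docs = ["malaria vaccine trial", "vaccine funding"]
X = ToyFrequencyVectorizer().fit_transform(docs)
print(X)
```

A vectorizer that follows this convention can be swapped for another without changing the surrounding code, which is what allows quick iteration between models.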

For more information read the official docs.

1. Quickstart

Installing from PyPI

pip install wellcomeml

This installs the "vanilla" package, which has limited functionality (for example io and dataset download).

If space is not a problem, you can install the full package (around 2.2GB):

pip install wellcomeml[all]

The full package is relatively big, so there are also fine-grained installations if you only need specific modules. The available extras are core, transformers, tensorflow, torch and spacy. You can install one or more of them, e.g.:

pip install "wellcomeml[tensorflow,core]"

To check that your installation allows you to use a specific module, try (for example):

python -c "import wellcomeml.ml.bert_vectorizer"

If you don't have the correct dependencies installed for a module, an error will appear and point you to the right dependencies.
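The same check can be run for several modules at once from Python. This is a minimal sketch: the helper `can_import` is hypothetical, and it assumes that a missing dependency surfaces as an `ImportError` (any other exception type would propagate).

```python
import importlib


def can_import(module_name):
    """Return (True, None) if the module imports, else (False, error message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as error:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught too.
        return False, str(error)
    return True, None


for name in ["wellcomeml.ml.bert_vectorizer", "wellcomeml.ml.cnn"]:
    ok, message = can_import(name)
    print(name, "OK" if ok else f"unavailable: {message}")
```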

1.1 Installing wellcomeml[all] on Windows

Torch has a different installation on Windows, so it is not installed automatically with wellcomeml[all] and needs to be installed first. The commands below are for machines without the CUDA parallel computing platform; for machines with CUDA, see https://pytorch.org/ for the correct installation.

pip install torch==1.5.1+cpu torchvision==0.6.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install wellcomeml[all]

2. Development

2.1 Build local virtualenv

make

2.2 Contributing to the docs

Make changes to the .rst files in /docs (please do not change the ones starting with wellcomeml, as those are generated automatically).

Navigate to the repository root and run:

make update-docs

Verify that _build/html/index.html has generated correctly and submit a PR.

2.3 Release a new version (and upload to aws s3/pypi/github)

First create a GitHub token with artifact write access, if you don't already have one, and export it as an environment variable:

export GITHUB_TOKEN=...

The checklist for a new release is:

Change wellcomeml/__version__.py
Add changelog
make dist
Verify new package was generated correctly on the pip registry and GitHub releases

2.4 (Optional) Installing from other locations

pip3 install <relative path to this folder>

2.5 Transformers

On OSX, if you get a message complaining about the rust compiler, install and initialise it with:

brew install rustup
rustup-init

3. Example usage of some modules

Examples can be found in the subfolder examples.

4. Troubleshooting

If you experience a problem installing or using WellcomeML, please open an issue. It may help to set the logging level to DEBUG with export LOGGING_LEVEL=DEBUG, which often exposes more information useful for resolving the issue.
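If you want your own scripts to honour the same variable, the sketch below shows one way to do it. How WellcomeML itself reads LOGGING_LEVEL is not shown here; the helper `level_from_env` and the INFO fallback are assumptions made for illustration.

```python
import logging
import os


def level_from_env(environ=os.environ, default="INFO"):
    """Translate LOGGING_LEVEL (e.g. "DEBUG") into a numeric logging level."""
    name = environ.get("LOGGING_LEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string such as "Level FOO" for unknown names,
    # so fall back to INFO unless a numeric level came back.
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=level_from_env())
logging.getLogger(__name__).debug("Only shown when LOGGING_LEVEL=DEBUG")
```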

5. Extras

Module	Description	Extras needed
wellcomeml.ml.attention	Classes that implement keras layers for attention/self-attention	tensorflow
wellcomeml.ml.bert_classifier	Classifier to facilitate fine-tuning bert/scibert	tensorflow
wellcomeml.ml.bert_semantic_equivalence	Classifier to learn semantic equivalence between pairs of documents	tensorflow
wellcomeml.ml.bert_vectorizer	Text vectorizer based on bert/scibert	torch
wellcomeml.ml.bilstm	BiLSTM Text classifier	tensorflow
wellcomeml.ml.clustering	Text clustering pipeline	NA
wellcomeml.ml.cnn	CNN Text Classifier	tensorflow
wellcomeml.ml.doc2vec_vectorizer	Text vectorizer based on doc2vec	NA
wellcomeml.ml.frequency_vectorizer	Text vectorizer based on TF-IDF	NA
wellcomeml.ml.keras_utils	Utils for computing metrics during training	tensorflow
wellcomeml.ml.keras_vectorizer	Text vectorizer based on Keras	tensorflow
wellcomeml.ml.sent2vec_vectorizer	Text vectorizer based on Sent2Vec	(Requires sent2vec, a non-pypi package)
wellcomeml.ml.similarity_entity_linking	A class to find most similar documents to a sentence in a corpus	tensorflow
wellcomeml.ml.spacy_classifier	A text classifier based on spacy	spacy, torch
wellcomeml.ml.spacy_entity_linking	Similar to similarity_entity_linking, but uses spacy	spacy
wellcomeml.ml.spacy_knowledge_base	Creates a knowledge base of entities, based on spacy	spacy
wellcomeml.ml.spacy_ner	Named entity recognition classifier based on spacy	spacy
wellcomeml.ml.transformers_tokenizer	Bespoke tokenizer based on transformers	transformers
wellcomeml.ml.vectorizer	Abstract class for vectorizers	NA
wellcomeml.ml.voting_classifier	Meta-classifier based on majority voting	NA
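The table can be turned into a small lookup that builds a single install command for the modules you plan to use. This is a minimal sketch: `EXTRAS_BY_MODULE` copies only a few rows from the table above, and `pip_command` is a hypothetical helper, not part of WellcomeML.

```python
# Subset of the table above; rows marked "NA" map to an empty tuple.
EXTRAS_BY_MODULE = {
    "wellcomeml.ml.bert_classifier": ("tensorflow",),
    "wellcomeml.ml.bert_vectorizer": ("torch",),
    "wellcomeml.ml.cnn": ("tensorflow",),
    "wellcomeml.ml.frequency_vectorizer": (),
    "wellcomeml.ml.spacy_classifier": ("spacy", "torch"),
    "wellcomeml.ml.spacy_ner": ("spacy",),
}


def pip_command(modules):
    """Build one pip command covering the extras needed by the given modules."""
    extras = sorted({extra for module in modules for extra in EXTRAS_BY_MODULE[module]})
    if not extras:
        return "pip install wellcomeml"
    # Quote the requirement so the shell does not interpret the brackets.
    return 'pip install "wellcomeml[{}]"'.format(",".join(extras))


print(pip_command(["wellcomeml.ml.bert_vectorizer", "wellcomeml.ml.spacy_ner"]))
```

Deduplicating and sorting the extras keeps the command stable regardless of the order in which modules are listed.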
