vtext

NLP in Rust with Python bindings

This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications.

Features

Tokenization: Regexp tokenizer, Unicode segmentation + language specific rules
Stemming: Snowball (in Python 15-20x faster than NLTK)
Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
Levenshtein edit distance; Sørensen-Dice, Jaro, Jaro Winkler string similarities
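
To make the string metrics listed above concrete, here is a minimal pure-Python sketch of two of them: Levenshtein edit distance and Sørensen-Dice similarity. This is not vtext's implementation or API. The function names `levenshtein` and `dice_similarity` are hypothetical, and the Dice variant below assumes sets of character bigrams, which may differ from the library's exact definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def dice_similarity(a: str, b: str) -> float:
    """Sørensen-Dice coefficient over sets of character bigrams (an assumption)."""
    bigrams_a = {a[i : i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i : i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        # Strings too short to form bigrams: fall back to exact equality.
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(dice_similarity("night", "nacht"))
```

A compiled implementation follows the same recurrence; the speed-up comes from avoiding per-character Python overhead, not from a different algorithm.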

Usage

Usage in Python

vtext requires Python 3.6+ and can be installed with,

pip install vtext

Below is a simple tokenization example,

>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
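
For comparison, a plain regexp tokenizer can be sketched with the standard library alone. The pattern below is scikit-learn's default `token_pattern`; whether vtext's regexp tokenizer uses the same default is an assumption here, and `regexp_tokenize` is a hypothetical helper, not part of vtext.

```python
import re

# Words of two or more word characters, as in scikit-learn's default token_pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def regexp_tokenize(text: str) -> list:
    """Return all non-overlapping matches of TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(regexp_tokenize("Flights can't depart after 2:00 pm."))
```

On the sentence from the example above this yields ['Flights', 'can', 'depart', 'after', '00', 'pm']: the contraction, the time and the final period are lost or split, which is the kind of case that language-specific rules are meant to handle.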

For more details see the project documentation: vtext.io/doc/latest/index.html

Usage in Rust

Add the following to Cargo.toml,

[dependencies]
vtext = "0.2.0"

For more details see the Rust documentation: docs.rs/vtext

Benchmarks

Tokenization

The following benchmarks show the tokenization accuracy (F1 score) on UD treebanks,

lang	dataset	regexp	spacy 2.1	vtext
en	EWT	0.812	0.972	0.966
en	GUM	0.881	0.989	0.996
de	GSD	0.896	0.944	0.964
fr	Sequoia	0.844	0.968	0.971

and the English tokenization speed,

	regexp	spacy 2.1	vtext
Speed (10⁶ tokens/s)	3.1	0.14	2.1
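
As a rough illustration of how a tokenization F1 score can be computed, the sketch below compares predicted and gold tokens by their character spans. The exact evaluation procedure behind the table above is not described in this README, so this is one plausible definition, not the benchmark's method; `token_spans` and `tokenization_f1` are hypothetical names.

```python
def token_spans(text, tokens):
    """Map each token to its (start, end) character span in text, left to right."""
    spans = set()
    position = 0
    for token in tokens:
        start = text.index(token, position)  # raises ValueError if not found
        end = start + len(token)
        spans.add((start, end))
        position = end
    return spans


def tokenization_f1(text, predicted, gold):
    """F1 score where a predicted token is correct only if its span matches exactly."""
    predicted_spans = token_spans(text, predicted)
    gold_spans = token_spans(text, gold)
    true_positives = len(predicted_spans & gold_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    text = "Flights can't depart after 2:00 pm."
    gold = ["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
    predicted = ["Flights", "can", "depart", "after", "00", "pm"]
    print(round(tokenization_f1(text, predicted, gold), 3))
```

In this example four of the six predicted spans match four of the eight gold spans, giving precision 4/6, recall 4/8 and F1 = 4/7.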

Text vectorization

Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset, run on Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,

Speed (MB/s)	scikit-learn 0.20.1	vtext (n_jobs=1)	vtext (n_jobs=4)
CountVectorizer.fit	14	104	225
CountVectorizer.transform	14	82	303
CountVectorizer.fit_transform	14	70	NA
HashingVectorizer.transform	19	89	309

Note however that these two estimators in vtext currently support only a fraction of scikit-learn's functionality. See benchmarks/README.md for more details.
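
To show what these estimators compute, here is a minimal pure-Python sketch of hashing-based vectorization that builds the three arrays of a CSR sparse matrix. It is not vtext's or scikit-learn's implementation: `hashing_vectorize` is a hypothetical name, `zlib.crc32` stands in for the MurmurHash3 function that scikit-learn uses, and the tokenizer is a simple regexp.

```python
import re
import zlib
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def hashing_vectorize(documents, n_features=2**20):
    """Return (indptr, indices, data) of a CSR document-term count matrix."""
    indptr, indices, data = [0], [], []
    for document in documents:
        counts = Counter()
        for token in TOKEN_PATTERN.findall(document.lower()):
            # Map the token to a column without storing a vocabulary.
            column = zlib.crc32(token.encode("utf-8")) % n_features
            counts[column] += 1
        for column in sorted(counts):
            indices.append(column)
            data.append(counts[column])
        # Each row ends where the next one begins.
        indptr.append(len(indices))
    return indptr, indices, data


if __name__ == "__main__":
    indptr, indices, data = hashing_vectorize(
        ["the cat sat", "the cat and the dog"], n_features=1024
    )
    print(indptr, indices, data)
```

If SciPy is available, the three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_documents, n_features))`. Because no vocabulary is shared between documents, each document can be processed independently, which is what makes this kind of transform easy to parallelize.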

License

vtext is released under the Apache License, Version 2.0.

