Tokenization

Tokenization splits text into sentences, words, or both. UralicNLP comes with a tokenizer for this purpose. It can handle abbreviations in all languages that are supported by the Universal Dependencies project.

Full tokenization

To tokenize a text, all you need to do is to run the following:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.tokenize(text)
>> [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

This returns a list of sentences, each of which is a list of words.
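As a minimal sketch of working with this nested structure, the snippet below counts the tokens in each sentence and flattens the result into a single list. It uses the literal output from the example above rather than calling UralicNLP, so it runs without the library installed; the variable names are illustrative only.

```python
# Literal output copied from the example above; in practice this would be
# the return value of tokenizer.tokenize(text).
tokenized = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

# Number of tokens in each sentence (punctuation marks count as tokens).
tokens_per_sentence = [len(sentence) for sentence in tokenized]

# Flatten the nested lists into one list of tokens, dropping sentence boundaries.
flat_tokens = [token for sentence in tokenized for token in sentence]

print(tokens_per_sentence)
print(flat_tokens)
```

For this example, the flattened list has the same contents as the word-level output shown under "Word tokenization".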

Sentence tokenization

It is also possible to tokenize text on a sentence level:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.sentences(text)
>> ['My dog ran.', 'Then a cat showed up!']

This returns a list of sentences.

Word tokenization

One can also get a list of words without sentence boundaries:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.words(text)
>> ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

This returns a list of words.
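Note that punctuation marks are returned as tokens of their own. If only the words are needed, they can be filtered out afterwards. The sketch below is one possible approach, not part of UralicNLP: `is_word` is a hypothetical helper, and the input is the literal output from the example above.

```python
# Literal output copied from the example above; in practice this would be
# the return value of tokenizer.words(text).
words = ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

def is_word(token):
    """Hypothetical helper: True if the token has at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Keep only tokens that are not pure punctuation.
content_words = [token for token in words if is_word(token)]

print(content_words)
```

Because `str.isalnum` is Unicode-aware, the same filter should also work for text in non-Latin scripts.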

Tokenize Arabic

UralicNLP has a special method that tokenizes and lemmatizes Arabic text. The input and output formats are the same as for the full tokenizer, except that the returned tokens are lemmas.

from uralicNLP import tokenizer
tokenizer.tokenize_arabic("ومن الناس من يقول آمنا بالله وباليوم الآخر وما هم بمؤمنين")
>> [['وَ', 'مَنّ', 'الناس', 'مَنّ', 'قال', 'آمنا', 'بِ', 'الله', 'وَ', 'بـ', 'يوم', 'ال', 'آخر', 'وَ', 'ما', 'هم', 'بِ', 'مؤمن']]
# Web browsers may show this list in an inverted order; the first element is وَ

The method relies on the Arabic FST, which needs to be downloaded first using:

python3 -m uralicNLP.download -l ara

UralicNLP is an open-source Python library by Mika Hämäläinen
