Tokenization

Mika Hämäläinen edited this page Dec 30, 2021 · 5 revisions

Tokenization is a core NLP task that can be performed at the word or sentence level. UralicNLP provides functionality for tokenizing text, and it handles abbreviations in all languages supported by the Universal Dependencies project.
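To see why abbreviation handling matters, here is a minimal sketch (plain Python, not UralicNLP's actual implementation) of what goes wrong when a splitter treats every period as a sentence boundary:

```python
import re

# A naive splitter that breaks after every ".", "!" or "?" followed by
# whitespace. It wrongly splits on the abbreviation "Dr." — the kind of
# error an abbreviation-aware tokenizer avoids.
def naive_sentences(text):
    return re.split(r"(?<=[.!?])\s+", text)

sents = naive_sentences("Dr. Smith arrived. The meeting began.")
# Yields three "sentences" instead of two:
# ['Dr.', 'Smith arrived.', 'The meeting began.']
```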

Full tokenization

To tokenize a text, run the following:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.tokenize(text)
>> [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

This returns a list of sentences, each of which is a list of word tokens.
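The nested structure is convenient for per-sentence processing. A small sketch, using the example output shown above as a literal (so it runs without UralicNLP installed):

```python
# Each inner list returned by tokenizer.tokenize() is one tokenized
# sentence; iterate over them to work sentence by sentence.
sentences = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

for i, sent in enumerate(sentences, start=1):
    print(f"Sentence {i}: {len(sent)} tokens -> {' '.join(sent)}")
```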

Sentence tokenization

It is also possible to tokenize text at the sentence level:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.sentences(text)
>> ['My dog ran.', 'Then a cat showed up!']

This returns a list of sentences.

Word tokenization

One can also get a list of words without sentence boundaries:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.words(text)
>> ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

This returns a list of words.
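The three functions are consistent with one another: the output of `tokenizer.words()` is the flattened form of the output of `tokenizer.tokenize()`. This can be checked with the literal results copied from the examples above (no UralicNLP install needed):

```python
# Outputs copied verbatim from the tokenize() and words() examples above.
tokenized = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]
words = ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

# Flattening the sentence-level result reproduces the word-level result.
flattened = [token for sentence in tokenized for token in sentence]
assert flattened == words
```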
