Tokenization

Mika Hämäläinen edited this page Dec 30, 2021 · 5 revisions

Tokenization is a core NLP task that can be performed at the word or sentence level. UralicNLP provides functionality for tokenizing text, and it handles abbreviations in all languages supported by the Universal Dependencies project.
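To see why abbreviation handling matters, here is a minimal sketch (plain Python, not UralicNLP's actual implementation) of what goes wrong when a splitter treats every period as a sentence boundary:

```python
import re

# A naive splitter that breaks after every ".", "!" or "?" followed by
# whitespace. It wrongly splits on the abbreviation "Dr." — the kind of
# error an abbreviation-aware tokenizer avoids.
def naive_sentences(text):
    return re.split(r"(?<=[.!?])\s+", text)

sents = naive_sentences("Dr. Smith arrived. The meeting began.")
# Yields three "sentences" instead of two:
# ['Dr.', 'Smith arrived.', 'The meeting began.']
```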

Full tokenization

To tokenize a text, run the following:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.tokenize(text)
>> [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

This returns a list of sentences, each of which is a list of word tokens.
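The nested structure is convenient for per-sentence processing. A small sketch, using the example output shown above as a literal (so it runs without UralicNLP installed):

```python
# Each inner list returned by tokenizer.tokenize() is one tokenized
# sentence; iterate over them to work sentence by sentence.
sentences = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

for i, sent in enumerate(sentences, start=1):
    print(f"Sentence {i}: {len(sent)} tokens -> {' '.join(sent)}")
```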

Sentence tokenization

It is also possible to tokenize text at the sentence level:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.sentences(text)
>> ['My dog ran.', 'Then a cat showed up!']

This returns a list of sentences.

Word tokenization

One can also get a list of words without sentence boundaries:

from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.words(text)
>> ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

This returns a list of words.
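The three functions are consistent with one another: the output of `tokenizer.words()` is the flattened form of the output of `tokenizer.tokenize()`. This can be checked with the literal results copied from the examples above (no UralicNLP install needed):

```python
# Outputs copied verbatim from the tokenize() and words() examples above.
tokenized = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]
words = ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

# Flattening the sentence-level result reproduces the word-level result.
flattened = [token for sentence in tokenized for token in sentence]
assert flattened == words
```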
