Tokenization
Mika Hämäläinen edited this page Dec 30, 2021
Tokenization is an important NLP task that can be performed at the word or sentence level. UralicNLP includes functionality for tokenizing text, and it can handle abbreviations in all languages supported by the Universal Dependencies project.
To tokenize a text, run the following:
```python
from uralicNLP import tokenizer

text = "My dog ran. Then a cat showed up!"
tokenizer.tokenize(text)
>> [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]
```
This returns a list of sentences, each of which is a list of word tokens.
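Because the result is a plain nested list, it can be post-processed with ordinary Python. The sketch below uses the sample output shown above (so it runs without UralicNLP installed) to flatten the sentences into a single token list and count tokens per sentence:

```python
# Sample output of tokenizer.tokenize from the example above:
# a list of sentences, each a list of token strings.
tokenized = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]

# Flatten the nested list into one token list.
all_tokens = [token for sentence in tokenized for token in sentence]
print(all_tokens)
# → ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

# Count tokens per sentence.
lengths = [len(sentence) for sentence in tokenized]
print(lengths)  # → [4, 6]
```

The nested structure makes it easy to keep sentence boundaries while still working with individual tokens.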
It is also possible to tokenize text on a sentence level:
```python
from uralicNLP import tokenizer

text = "My dog ran. Then a cat showed up!"
tokenizer.sentences(text)
>> ['My dog ran.', 'Then a cat showed up!']
```
This returns a list of sentences.
It is also possible to get a list of words without sentence boundaries:
```python
from uralicNLP import tokenizer

text = "My dog ran. Then a cat showed up!"
tokenizer.words(text)
>> ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']
```
This returns a list of words.
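A flat word list like this is convenient for simple corpus statistics. As a sketch (again using the example output above, so UralicNLP itself is not required), one can build a token frequency count while skipping punctuation tokens:

```python
from collections import Counter

# Sample output of tokenizer.words from the example above.
words = ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']

# Lowercase alphabetic tokens only; punctuation tokens like '.' and '!' are skipped.
counts = Counter(w.lower() for w in words if w.isalpha())
print(counts['dog'])         # → 1
print(sum(counts.values()))  # → 8
```

Note that punctuation marks are returned as separate tokens, so filtering with `str.isalpha` is enough to exclude them here.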
UralicNLP is an open-source Python library by Mika Hämäläinen