Tokenization
Tokenization splits text into sentences or words. UralicNLP includes a tokenizer for both levels, and it can handle abbreviations in all languages that are supported by the Universal Dependencies project.
To tokenize a text, all you need to do is to run the following:
from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.tokenize(text)
>> [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]
This returns a list of sentences, each of which is a list of words.
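Because the result is nested, per-sentence statistics can be computed with a plain loop. The following is a minimal sketch that hard-codes the output shown above instead of calling the library, so it runs without UralicNLP installed. The helper name `token_counts` is hypothetical and not part of UralicNLP.

```python
# Output of tokenizer.tokenize(text) as shown above, hard-coded here
# so the sketch does not need UralicNLP to run.
tokenized = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]


def token_counts(sentences):
    """Return the number of tokens in each sentence (hypothetical helper)."""
    return [len(sentence) for sentence in sentences]


counts = token_counts(tokenized)
print(counts)
```

Note that punctuation marks are tokens of their own, so they are included in each count.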
It is also possible to tokenize text on a sentence level:
from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.sentences(text)
>> ['My dog ran.', 'Then a cat showed up!']
This returns a list of sentences.
One can also get a list of words without sentence boundaries:
from uralicNLP import tokenizer
text = "My dog ran. Then a cat showed up!"
tokenizer.words(text)
>> ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']
This returns a list of words.
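For the example text, the flat word list is simply the nested sentence lists joined end to end. The sketch below checks this on the two outputs shown in this section, both hard-coded so that it runs without UralicNLP. The helper `flatten` is hypothetical, and whether the two methods agree for every input is an assumption rather than something documented here.

```python
from itertools import chain

# Documented output of tokenizer.tokenize(text), hard-coded.
nested = [['My', 'dog', 'ran', '.'], ['Then', 'a', 'cat', 'showed', 'up', '!']]
# Documented output of tokenizer.words(text), hard-coded.
flat_expected = ['My', 'dog', 'ran', '.', 'Then', 'a', 'cat', 'showed', 'up', '!']


def flatten(sentences):
    """Concatenate per-sentence token lists into one list (hypothetical helper)."""
    return list(chain.from_iterable(sentences))


flat = flatten(nested)
print(flat == flat_expected)
```

If only the words are needed, calling `tokenizer.words` directly avoids this extra step.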
UralicNLP has a special method that tokenizes and lemmatizes Arabic text. It takes the same input as the full tokenizer and returns the same nested structure, with lemmas in place of the original word forms.
from uralicNLP import tokenizer
tokenizer.tokenize_arabic("ومن الناس من يقول آمنا بالله وباليوم الآخر وما هم بمؤمنين")
>> [['وَ', 'مَنّ', 'الناس', 'مَنّ', 'قال', 'آمنا', 'بِ', 'الله', 'وَ', 'بـ', 'يوم', 'ال', 'آخر', 'وَ', 'ما', 'هم', 'بِ', 'مؤمن']] # Web browsers may show this list in inverted order; the first element is وَ
The method relies on the Arabic FST, which needs to be downloaded using:
python3 -m uralicNLP.download -l ara
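Since right-to-left rendering can make the order of the Arabic output hard to read, printing each lemma next to its index makes the order explicit. This is a small sketch that hard-codes the output shown above, so it runs without UralicNLP or the Arabic model. The variable names are illustrative only.

```python
# Documented output of tokenizer.tokenize_arabic(...), hard-coded.
lemmas = [['وَ', 'مَنّ', 'الناس', 'مَنّ', 'قال', 'آمنا', 'بِ', 'الله', 'وَ',
           'بـ', 'يوم', 'ال', 'آخر', 'وَ', 'ما', 'هم', 'بِ', 'مؤمن']]

# Pair every lemma of the first (and only) sentence with its position.
indexed = list(enumerate(lemmas[0]))

# One lemma per line, so display direction cannot reorder the list.
for position, lemma in indexed:
    print(position, lemma)
```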
UralicNLP is an open-source Python library by Mika Hämäläinen