SemFi, SemUr (legacy stuff)

Mika Hämäläinen edited this page Nov 27, 2024 · 1 revision

What are SemFi and SemUr

SemFi is a database of Finnish words and the syntactic relations between them. For each pair of related words, it stores the strength of the relation. SemUr is a collection of automatically translated versions of SemFi for other Uralic languages.

Downloading the models

On the command line:

python -m uralicNLP.download --languages fin --semfi

Use the following script to download the semantic databases in Python:

from uralicNLP import semfi
semfi.download("fin")

Use semfi.supported_languages() to list the supported languages.

Queries

Look a word up

You can look up the information SemFi stores about a word by giving its lemma and part of speech (POS).

semfi.get_word("kissa","N", "fin")
>> {'word': u'kissa', 'compund': 0, 'pos': u'N', 'frequency': 23214, 'relative_frequency': 0.000172062683057, 'id': u'kissa_N'}

You can also list all homonyms of a lemma without specifying the POS.

semfi.get_words("kuusi", "fin")
>> [{'word': u'kuusi', 'compund': 0, 'pos': u'N', 'frequency': 3823, 'relative_frequency': 2.83361608221e-05, 'id': u'kuusi_N'}, {'word': u'kuusi', 'compund': 0, 'pos': u'Num', 'frequency': 19897, 'relative_frequency': 0.000147477005461, 'id': u'kuusi_Num'}]
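
Because get_words returns a plain list of dictionaries, the homonyms can be ranked with ordinary Python. The following is a minimal sketch that picks the most frequent reading. Note that most_frequent_reading is a hypothetical helper and not part of uralicNLP, and the sample list is copied from the output above so that the snippet runs without the database.

```python
# Sample data copied from the semfi.get_words("kuusi", "fin") output above.
# In real use this list would come from semfi.get_words(...).
homonyms = [
    {'word': 'kuusi', 'compund': 0, 'pos': 'N', 'frequency': 3823,
     'relative_frequency': 2.83361608221e-05, 'id': 'kuusi_N'},
    {'word': 'kuusi', 'compund': 0, 'pos': 'Num', 'frequency': 19897,
     'relative_frequency': 0.000147477005461, 'id': 'kuusi_Num'},
]


def most_frequent_reading(entries):
    """Hypothetical helper: return the entry with the highest frequency.

    Returns None when the list is empty (for example, an unknown lemma).
    """
    return max(entries, key=lambda entry: entry['frequency'], default=None)


best = most_frequent_reading(homonyms)
print(best['id'])  # kuusi_Num
```

Here the numeral reading of "kuusi" (six) wins over the noun reading (spruce) because its frequency is higher.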

Find related words

word = semfi.get_word("näätä","N", "fin")
semfi.get_all_relations(word, "fin", sort=True) #lists all related words
>> [{'zscore': 6.84208734905, 'frequency': 9, 'relation': u'ROOT', 'word2': {'word': u'olla', 'compund': 0, 'pos': u'V', 'frequency': 5301968, 'relative_frequency': 0.0392983044525, 'id': u'olla_V'}, 'relative_frequency': 0.1125, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}]

semfi.get_by_relation(word, "dobj", "fin", sort=True) #lists words with a given syntactic relation
>> [{'zscore': 0, 'frequency': 1, 'relation': u'dobj', 'word2': {'word': u'tai', 'compund': 0, 'pos': u'C', 'frequency': 783, 'relative_frequency': 5.80361337268e-06, 'id': u'tai_C'}, 'relative_frequency': 1, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]

word2 = semfi.get_word("syödä","V", "fin")
semfi.get_by_word(word, word2, "fin")
>> [{'zscore': 1.48741029327, 'frequency': 3, 'relation': u'ROOT', 'word2': {'word': u'syödä', 'compund': 0, 'pos': u'V', 'frequency': 128242, 'relative_frequency': 0.000950532549347, 'id': u'syödä_V'}, 'relative_frequency': 0.0375, 'word1': {'word': u'näätä', 'compund': 0, 'pos': u'N', 'frequency': 276, 'relative_frequency': 2.0457181237e-06, 'id': u'näätä_N'}}, ...]

SemFi provides several methods for finding related words: get_all_relations lists every relation of a word, get_by_relation lists only the words connected by a given syntactic relation, and get_by_word lists the relations between two given words. Pass sort=True to sort the results by frequency.
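
Each relation in the results is a dictionary with a zscore, a frequency, a relation label and the two words involved, so the results can be filtered after the query. The sketch below keeps only relations above a z-score threshold and orders them from strongest to weakest. It rests on several assumptions: strong_relations is a hypothetical helper that is not part of uralicNLP, the threshold of 1.0 is arbitrary, and the sample entries are abbreviated copies of the outputs above (word1 and some word2 fields are left out for brevity).

```python
# Abbreviated sample entries taken from the query outputs above.
# In real use this list would come from semfi.get_all_relations(...).
relations = [
    {'zscore': 0, 'frequency': 1, 'relation': 'dobj',
     'word2': {'word': 'tai', 'pos': 'C', 'id': 'tai_C'}},
    {'zscore': 1.48741029327, 'frequency': 3, 'relation': 'ROOT',
     'word2': {'word': 'syödä', 'pos': 'V', 'id': 'syödä_V'}},
    {'zscore': 6.84208734905, 'frequency': 9, 'relation': 'ROOT',
     'word2': {'word': 'olla', 'pos': 'V', 'id': 'olla_V'}},
]


def strong_relations(entries, min_zscore=1.0):
    """Hypothetical helper: keep entries with zscore >= min_zscore.

    The result is sorted by zscore, strongest relation first.
    """
    kept = [entry for entry in entries if entry['zscore'] >= min_zscore]
    return sorted(kept, key=lambda entry: entry['zscore'], reverse=True)


strong_ids = [entry['word2']['id'] for entry in strong_relations(relations)]
print(strong_ids)  # ['olla_V', 'syödä_V']
```

Filtering on zscore rather than raw frequency is only one possible choice; the same pattern works with the frequency or relative_frequency keys.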

Cite

If you use SemFi or SemUr, cite the following publication:

Hämäläinen, Mika. (2018). Extracting a Semantic Database with Syntactic Relations for Finnish to Boost Resources for Endangered Uralic Languages. In The Proceedings of Logic and Engineering of Natural Language Semantics 15 (LENLS15).
