An expansion of MorphoBr data through modeling of four word-formation processes by suffixation

heliolbs/MorphoBrExpansion

MorphoBrExpansion

Author: Hélio L. B. Silva

License: GNU General Public License Version 3 (https://www.gnu.org/licenses/gpl-3.0.txt)

How to cite this work: SILVA, H. L. B. Expansão do MorphoBr através da modelagem computacional de processos de formação de palavras em português. 2019. Dissertação (Mestrado em Linguística) - Programa de Pós-Graduação em Linguística, Universidade Federal do Ceará, Fortaleza, 2019.

MorphoBr data is available at https://github.com/LFG-PTBR/MorphoBr

The suffixes used are -izar, -idade, -vel and -mente. From MorphoBr we extracted the non-hyphenated lemmas and used them as input to the four word-formation processes. The four resulting base files are adjectives.lemas, adverbs.lemas, nouns.lemas and verbs.lemas.

The file v1.lemas was created by extracting the first-conjugation verb lemmas from verbs.lemas in order to provide base forms for the word-formation process of -vel suffixation.
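As a rough illustration of this extraction step, the sketch below selects first-conjugation lemmas by their infinitive ending -ar. The function name and the sample list are hypothetical, and this is not necessarily how v1.lemas was actually produced.

```python
def first_conjugation(lemmas):
    """Return the lemmas whose infinitive ends in -ar (first conjugation)."""
    return [lemma for lemma in lemmas if lemma.endswith("ar")]


# Hypothetical sample standing in for the contents of the verb lemma file.
sample = ["amar", "comer", "partir", "utilizar", "pôr"]
print(first_conjugation(sample))
```

A plain ending test is enough here because Portuguese infinitives mark their conjugation class in the final vowel plus -r.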

The following files were created by attaching the suffixes -idade, -izar and -mente to adjective base forms: adjIDADE.lemas, adjIZAR.lemas and adjMENTE.lemas, respectively.

The file adjICO.lemas was created by extracting all adjectives ending in -ico so that their diacritics could be removed separately. The following files were then created by attaching the suffixes -idade, -izar and -mente to them: adjICIDADE.lemas, adjICIZAR.lemas and adjICAMENTE.lemas, respectively.

The file adjICIDADE-Duplicadas.lemas was created by mistake during the removal of diacritics. It contains duplicate entries because MorphoBr lists both the European Portuguese and the Brazilian Portuguese spellings of some adjectives ("tónico", "tônico"), and both spellings reduce to the same form once the diacritics are removed. The file adjICIDADE.lemas was created after removing the duplicate entries.
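The source of the duplicates can be reproduced with a minimal sketch. It assumes that only the acute and circumflex accents are stripped, and the function names are hypothetical; the actual procedure used for these files may have differed.

```python
import unicodedata

# Combining acute (U+0301) and circumflex (U+0302) accents only; the tilde and
# the cedilla are deliberately kept. This choice is an assumption of the sketch.
ACCENTS = {"\u0301", "\u0302"}


def strip_accents(word):
    """Remove acute and circumflex accents from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in ACCENTS)
    return unicodedata.normalize("NFC", kept)


def unique_bases(adjectives):
    """Strip accents and drop repeated results, keeping first-seen order."""
    seen = set()
    result = []
    for adjective in adjectives:
        base = strip_accents(adjective)
        if base not in seen:
            seen.add(base)
            result.append(base)
    return result


print(unique_bases(["tónico", "tônico", "público"]))
```

Both "tónico" and "tônico" become "tonico", so without the deduplication step the same base would be fed to the suffixation process twice.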

The files subsÇÃO.lemas and subsMENTO.lemas were created by extracting from nouns.lemas the nouns ending in -ção and the nouns ending in -mento, respectively.
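A simple way to picture this split is an ending test over the noun list, sketched below. The function name and the sample nouns are hypothetical, and a purely orthographic test like this one selects every noun with the given ending, regardless of whether that ending is a true derivational suffix.

```python
def split_by_suffix(nouns, suffixes=("ção", "mento")):
    """Group nouns by the first listed ending they match; others are skipped."""
    groups = {suffix: [] for suffix in suffixes}
    for noun in nouns:
        for suffix in suffixes:
            if noun.endswith(suffix):
                groups[suffix].append(noun)
                break
    return groups


# Hypothetical sample standing in for the contents of the noun lemma file.
sample = ["criação", "casamento", "mesa", "momento"]
print(split_by_suffix(sample))
```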

Finally, the following files contain the words generated by our process: novosadjetivos.lemas, novosadverbios.lemas, novossubstantivos.lemas and novosverbos.lemas.

Each of the Transducer folders contains one of each of the following types of files: build-fst.xfst, nPoS.lemas and nPoS.lexc.

Adjectival and Verbal Transducer folders also contain their regras.xfst files.

To generate the new forms from a "build-fst.xfst" file on Linux, open a terminal, navigate to the directory that contains the file and start the xfst compiler. Then type the following command and press Enter:

source build-fst.xfst

The next step is to print all the forms from the transducer we have just created. To do this, type the following command and press Enter:

print words > newwords.dict

The ".dict" files contain pairs of inflected forms and PoS-tagged forms, following MorphoBr's structure.
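For downstream processing, such pairs can be split into their parts. The sketch below assumes that each line holds the inflected form and the tagged form separated by a tab, and that the tagged form is a lemma followed by tags joined with "+". This layout, the sample entry and the function name are assumptions made for illustration, so check them against the actual ".dict" output before relying on them.

```python
def parse_entry(line):
    """Split one entry into (form, lemma, tags).

    Assumes the layout "form<TAB>lemma+TAG+TAG...", which is an assumption
    of this sketch rather than a documented guarantee.
    """
    form, analysis = line.rstrip("\n").split("\t", 1)
    lemma, *tags = analysis.split("+")
    return form, lemma, tags


# Hypothetical entry used only to exercise the parser.
print(parse_entry("utilizáveis\tutilizável+A+F+PL"))
```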
