MoDeST: a Morphological Decomposition & Segmentation Trove

The point of MoDeST is two-fold:

Provide a general object-oriented Python interface to access morphological decompositions and segmentations;
Host morphological datasets generated by smaller research groups that would otherwise have a hard time being found.

Morphological decomposition is the task of recognising which building blocks a word was originally constructed from. These building blocks are its morphemes. As an example, the Dutch derivation isometrisch ("isometric") can be decomposed into the morphemes iso, meter and isch.

Morphological segmentation is the task of isolating the substrings of a word that correspond to its morphemes. These substrings are called morphs. In the above example, the segmentation would be iso/metr/ic.

Languages and Datasets

The supported languages are simply under modest.languages, so the list will not be reproduced here. The list of datasets roughly coincides with the downloaders under modest.datasets. Currently, the package supports:

CELEX
MorphyNet
MorphoChallenge2010
CompoundPiece

Installation

Run

pip install "modest[github] @ git+https://github.com/bauwenst/MoDeST"

Example

Yes, it really is this easy, with full type checking and autocompletion by your IDE:

from modest.languages.english import English_Celex, English_MorphoChallenge2010, English_MorphyNet_Inflections

for item in English_Celex().generate():
    print(item.word, "should be segmented as", item.segment(), "which derives from", item.decompose())

Repo layout

Currently, the repo looks as follows:

data/              ---> Datasets hosted specifically by MoDeST on GitHub. Will NOT be downloaded when you install the package.
src/modest/        ---> All source code for the Python package that will be installed in your interpreter.
    languages/     ---> Per-language definitions of the classes users will interact with.
    datasets/      ---> Support code for pulling in and reading remote data.
    formats/       ---> Support code for turning tag formats into objects. (Tag formats are independent of how the tags are stored.)
    interfaces/    ---> Declarations of the interfaces users will interact with.

Currently, every language has its own file under languages/. The assumption is that the datasets pertaining to one language are sufficiently encapsulated that this will not clutter the imports from such a file. There are two arguments in favour of going from languages/{language}.py to instead languages/{language}/{dataset}.py:

Autocompletion for the last . of the import suggests exactly the list of available datasets for that language;
You do not have to have all the packages installed required to download/build all the datasets for a language if you only need one. (However, realistically, since MoDeST is for final datasets rather than making datasets, the code for pulling them should not be that complicated.)

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
doc		doc
src/modest		src/modest
tst		tst
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MoDeST: a Morphological Decomposition & Segmentation Trove

Languages and Datasets

Installation

Example

Repo layout

About

Releases

Packages

Languages

License

bauwenst/MoDeST

Folders and files

Latest commit

History

Repository files navigation

MoDeST: a Morphological Decomposition & Segmentation Trove

Languages and Datasets

Installation

Example

Repo layout

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages