
# Dictionary Tools

This repository contains tools for compiling and deploying dictionaries for LanguageTool.

## Maintainer

The owner, maintainer, and main developer of this repository is @p-goulart. The shell and Perl components, where present, are better explained by @jaumeortola, though.

## Setup

### Python dependencies

This is set up as a Poetry project, so you must have Poetry installed and ready to go.

Make sure you are using a virtual environment, then run:

```shell
poetry install --with test,dev
```

### System dependencies

In addition to the Python dependencies, you will also need to have Hunspell binaries installed on your system.

The most important one is `unmunch`. Check if it's installed:

```shell
which unmunch
# should return a path to a bin directory, like
# /opt/homebrew/bin/unmunch
```

If it's not installed, you may need to compile Hunspell from source. Clone the Hunspell repo and then, from inside it, these steps should work on Ubuntu:

```shell
# install a bunch of dependencies needed for compilation
sudo apt-get install autoconf automake autopoint libtool
autoreconf -vfi
./configure
make
sudo make install
sudo ldconfig
```

### LT dependencies

The scripts here also depend on the languagetool Java codebase (for word tokenisation).

Make sure you have LT cloned locally, and export the following environment variable in your shell configuration:

```shell
export LT_HOME=/path/to/languagetool
```

If the variable is not set, the code in this project falls back to a default of `../languagetool` (one directory up from wherever this repo is cloned).
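A minimal sketch of that fallback, assuming the scripts read the variable via `os.environ` (the helper name here is hypothetical; the actual implementation may differ):

```python
import os
from pathlib import Path

def resolve_lt_home() -> Path:
    """Return the LanguageTool checkout path: LT_HOME if set,
    otherwise ../languagetool relative to the working directory."""
    return Path(os.environ.get("LT_HOME", "../languagetool"))

print(resolve_lt_home())
```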

## Usage

This repository is meant to be used as a submodule of language-specific repositories, such as the Portuguese one.

⚠️ Note that the name of this repository is in kebab-case, but Python modules must be imported in snake_case. Therefore, when adding this repo as a submodule, make sure to set its path to `dict_tools` (with an underscore). If you don't, you will not be able to import it as a module.
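The underscore matters because a hyphen is not valid in a Python identifier, and module names must be valid identifiers to be importable. A quick check illustrates this:

```python
# A kebab-case name cannot be a Python module name,
# while the snake_case variant can:
print("dict-tools".isidentifier())  # False: '-' is not allowed
print("dict_tools".isidentifier())  # True
```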

### build_tagger_dicts.py

This is the script that compiles source files into a binary dictionary to be used by the LT POS tagger, Word Tokeniser, and Synthesiser.

You can check the usage parameters by invoking it with `--help`:

```shell
poetry run python scripts/build_tagger_dicts.py --help
```

### build_spelling_dicts.py

This is the script that takes the Hunspell and helper files as input and outputs binary files to be used by the Morfologik speller.

You can check the usage parameters by invoking it with `--help`:

```shell
poetry run python scripts/build_spelling_dicts.py --help
```