MorphDisamb

A Hungarian morphological disambiguator using recurrent and convolutional neural networks.

To analyse unknown words, HFST and the path to a transducer are required.

emMorph can be used as the Hungarian transducer; provide the path to the compiled transducer.

Usage

Training

Training requires a corpus with the following format:

  • empty lines separate the sentences
  • other lines consist of tab-separated columns:
    • the first column holds the word
    • the last column holds the disambiguated analysis
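
The format above can be read with a few lines of Python. This is a minimal sketch, not the project's own loader: the name `read_corpus` is hypothetical, and the code assumes only what the list states (empty lines end a sentence, the first column is the word, the last column is the analysis).

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_corpus(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated lines into sentences of (word, analysis) pairs.

    Hypothetical helper for illustration; not part of MorphDisamb.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        # First column: the word; last column: the disambiguated analysis.
        current.append((columns[0], columns[-1]))
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences
```

The function accepts any iterable of lines, so an open file object can be passed directly. For a line with a single column, the word and the analysis are the same string.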

To train a new (convolutional) model:

python main.py -t -C [--batch 64] [--epoch 128] [--directory corpus_directory] [--file corpus_file] [--transducer transducer_path]

To continue the training of a saved (recurrent) model:

python main.py -t -R -l 2017-11-13-14-19 [--batch 64] [--epoch 128] [--directory corpus_directory] [--file corpus_file] [--transducer transducer_path]

If only the corpus directory is provided, each file in it is treated as a corpus file.
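
The directory rule can be sketched as follows. This is an illustration under assumptions, not the code in `main.py`: the function `corpus_files` is hypothetical, the file name is assumed to be relative to the directory, and the sorted order is a choice made here for reproducibility.

```python
from pathlib import Path
from typing import List, Optional


def corpus_files(directory: str, file: Optional[str] = None) -> List[Path]:
    """Return the corpus files selected by a directory and an optional file name.

    Hypothetical helper for illustration; not part of MorphDisamb.
    """
    base = Path(directory)
    if file is not None:
        # A specific file was named: use only that one (assumed relative to the directory).
        return [base / file]
    # Only the directory was given: every regular file in it is a corpus file.
    return sorted(p for p in base.iterdir() if p.is_file())
```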

Evaluation

Evaluation requires a corpus with the same format as training.

To train a model and evaluate it right after training:

python main.py -t -e -C [--batch 64] [--epoch 128] [--directory corpus_directory] [--file corpus_file] [-l 2017-11-16-15-54] [--transducer transducer_path]

To evaluate a saved model:

python main.py -e -R -l 2017-11-13-14-19 [--directory corpus_directory] [--file corpus_file] [--transducer transducer_path]

The evaluation:

  • writes the neural network loss and accuracy to standard output
  • writes the disambiguation results into a file with the following properties:
    • file name format: disambiguated-.txt
    • sentences are separated by empty lines
    • for each word, the original word, the expected analysis and the predicted analysis are written on separate lines, with the analyses indented
    • the end of the file shows the count and ratio of correctly disambiguated words and sentences
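
The word and sentence statistics at the end of the file could be computed along these lines. This is a sketch only: the function `accuracy` is hypothetical, and it assumes that a sentence counts as correct only when every word in it received the expected analysis, which the description above does not state explicitly.

```python
from typing import Sequence, Tuple


def accuracy(expected: Sequence[Sequence[str]],
             predicted: Sequence[Sequence[str]]) -> Tuple[int, float, int, float]:
    """Return (correct words, word ratio, correct sentences, sentence ratio).

    Hypothetical helper for illustration; not part of MorphDisamb.
    """
    word_total = word_correct = sentence_correct = 0
    for gold, got in zip(expected, predicted):
        # Count the positions where the predicted analysis equals the expected one.
        matches = sum(g == p for g, p in zip(gold, got))
        word_correct += matches
        word_total += len(gold)
        # Assumption: a sentence is correct only if all of its words are correct.
        if len(gold) == len(got) and matches == len(gold):
            sentence_correct += 1
    word_ratio = word_correct / word_total if word_total else 0.0
    sentence_ratio = sentence_correct / len(expected) if expected else 0.0
    return word_correct, word_ratio, sentence_correct, sentence_ratio
```

Each argument is a list of sentences, and each sentence is a list of analysis strings in word order.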

Disambiguation

The source for disambiguation can be the standard input or a file.

The file can have the same format as the one required for training and evaluation. Multiple columns are not necessary: the file can contain only the words.

For user input (standard input), quntoken is required for tokenization. The input must be ordinary running text, not split into one word per line.

Disambiguation with file input:

python main.py -d -R -l 2017-11-13-14-19 --directory input_dir --file input_file [--transducer transducer_path]

Disambiguation from standard input:

python main.py -d -R -l 2017-11-13-14-19 [--transducer transducer_path]

OR

cat input_file_path | python main.py -d -R -l 2017-11-13-14-19 [--transducer transducer_path]

BibTeX

@thesis{NagyN2017,
	author = {Nagy, Nikolett},
	title = {Hungarian morphological disambiguation using recurrent and convolutional neural networks},
	institution = {Budapest University of Technology and Economics},
	year = {2017}
}