API Reference

Jetic Gu edited this page Jul 12, 2017 · 27 revisions

Introduction

This page lists the API references for the latest version of the master branch.

API reference(master) V0.5a

API references of all versions of master branch are also available below:

API references for the latest versions of our other branches are (partially) available below. Please note that because this project is still in development, these references should be considered drafts. Implementations may change over time, and it may take time for the API references to be updated accordingly.

  • V0.5a toutanova: The models here use the same API version as the master branch.

  • V0.5a improvement: The models here use the same API version as the master branch. This branch aims to create a better interface for the HMM, to make writing extensions more convenient.

Explanation of version numbers

Each individual module of the HMM aligner has its own API version, which relates only to the module's interfaces and not to its internal code. It is normal for modules that work together to have different individual API versions. For supported versions or module version dependencies, please refer to the individual API reference Wiki pages.

API versions that will not be documented, because they are early working stages of a different branch (versions ending with a) that is itself documented as development goes on, use the format N.Nd, where each N is a number.

API versions of purely experimental branches, or of branches kept for archiving purposes, that are not likely to be merged directly into the master branch use the format N.Nc, where each N is a number.

Also, starting from v0.5a, saving and loading trained models is supported, so different models might have their own individual "Exporting version". Models can only load saved files of supported versions. To make it easier to differentiate, this version is stored as model.version and uses the format N.Nb, where each N is a number.
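The version formats described above can be told apart mechanically by their trailing letter. The following is a minimal sketch, not part of the aligner: the function name parse_version, the regular expression, and the wording of the suffix descriptions are assumptions made for illustration, based only on the text of this section.

```python
import re

# Suffix meanings as read from this page (wording is ours, not the project's).
SUFFIX_MEANING = {
    "a": "API version documented on the Wiki (e.g. V0.5a)",
    "b": "model exporting version, stored as model.version",
    "c": "experimental or archival branch",
    "d": "undocumented early working stage of a branch",
}

# ASSUMPTION: versions look like "0.5a" with an optional leading "v" or "V".
_VERSION_RE = re.compile(r"^[vV]?(\d+)\.(\d+)([abcd])$")


def parse_version(text):
    """Split a version string such as 'V0.5a' into (major, minor, suffix)."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError("not an N.N<letter> version: %r" % (text,))
    major, minor, suffix = match.groups()
    return int(major), int(minor), suffix


if __name__ == "__main__":
    major, minor, suffix = parse_version("V0.5a")
    print(major, minor, suffix, "->", SUFFIX_MEANING[suffix])
```

A string without one of the four suffix letters, such as "1.2", raises ValueError in this sketch.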

Current Version (V0.5a)

Changes (compared to 0.4a)

  • support for models v0.4a

  • added options to load and save trained models

  • support for the new Dataset Data format; removed the old bitext and tritext formats

Options

Run

> python aligner.py -h

to see all options.

Config file

A sample config file is provided in src\sample_config_file.ini.

The purpose of a config file is to provide information regarding specific testing and training data, instead of having to type all the options on the console.

The config file is divided into 3 sections: General, TrainData, and TestData.

[General]
DataDirectory = ~/Data/
TargetLanguageSuffix = cn
SourceLanguageSuffix = en

[TrainData]
TextFilePrefix = train
TagFilePrefix = train.tags
AlignmentFileSuffix = wa

[TestData]
TextFilePrefix = test
TagFilePrefix = test.tags
Reference = FULLPATHTOFILE.WA

The aligner searches the DataDirectory for files that match the prefixes and suffixes given above. Please note that Reference currently has to be a full path.
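To make the prefix and suffix matching more concrete, here is a minimal sketch that reads a config file like the sample above with Python's standard configparser module and lists the file paths it would imply. It is not the aligner's own code: the function expected_files is hypothetical, and the naming rule (prefix, a dot, then the suffix, for example train.en or train.wa) is an assumption, since this page does not spell out how names are composed.

```python
import configparser
import os

SAMPLE_CONFIG = """
[General]
DataDirectory = ~/Data/
TargetLanguageSuffix = cn
SourceLanguageSuffix = en

[TrainData]
TextFilePrefix = train
TagFilePrefix = train.tags
AlignmentFileSuffix = wa

[TestData]
TextFilePrefix = test
TagFilePrefix = test.tags
Reference = FULLPATHTOFILE.WA
"""


def expected_files(config_text):
    """Return the file paths implied by a config, per section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)

    general = parser["General"]
    data_dir = os.path.expanduser(general["DataDirectory"])
    source = general["SourceLanguageSuffix"]
    target = general["TargetLanguageSuffix"]

    def pair(prefix):
        # ASSUMPTION: file name = prefix + "." + language suffix.
        return (os.path.join(data_dir, prefix + "." + source),
                os.path.join(data_dir, prefix + "." + target))

    files = {}
    for section in ("TrainData", "TestData"):
        data = parser[section]
        entry = {"text": pair(data["TextFilePrefix"]),
                 "tags": pair(data["TagFilePrefix"])}
        if "AlignmentFileSuffix" in data:
            # ASSUMPTION: alignment file = text prefix + "." + suffix.
            entry["alignment"] = os.path.join(
                data_dir,
                data["TextFilePrefix"] + "." + data["AlignmentFileSuffix"])
        if "Reference" in data:
            # Reference is used as given: it must already be a full path.
            entry["reference"] = data["Reference"]
        files[section] = entry
    return files


if __name__ == "__main__":
    for section, entry in expected_files(SAMPLE_CONFIG).items():
        print(section, entry)
```

Under these assumptions, the TrainData section resolves to a source and target text file, a source and target tag file, and one alignment file, all inside the expanded DataDirectory.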

Dataset formats

The descriptions of file formats supported by this version are here.

Saved model files

Saved model files use the .pkl and .pklz formats. The latter is a compressed version of the former: it is smaller in size but usually takes longer to save and load.
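The trade-off between the two formats can be illustrated with a small sketch. This is not the aligner's implementation: the helper names save_model and load_model are hypothetical, and treating .pklz as a gzip-compressed pickle is an assumption made for the example (this page only says that .pklz is the compressed version of .pkl).

```python
import gzip
import pickle


def _opener(path):
    # ASSUMPTION: ".pklz" means a gzip-compressed pickle stream.
    return gzip.open if path.endswith(".pklz") else open


def save_model(obj, path):
    """Pickle obj to path, compressing when the name ends in .pklz."""
    with _opener(path)(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object written by save_model."""
    with _opener(path)(path, "rb") as handle:
        return pickle.load(handle)
```

Compression costs extra CPU time on every save and load, which matches the behaviour described above. As with any pickle file, only load files from sources you trust.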

Please note that when loading a saved file, the model checks the file's modelName (and version if applicable; see the API reference for Alignment Models for more detail) to prevent accidentally loading a file for a different model (or an unsupported version of the current model).
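The check described above might look roughly like the following sketch. It assumes, purely for illustration, that a loaded file is a dictionary with "modelName" and "version" keys; the function check_saved_model and the example names and version strings are hypothetical. See the API reference for Alignment Models for the actual behaviour.

```python
def check_saved_model(saved, model_name, supported_versions=None):
    """Reject a loaded file that belongs to another model or version.

    saved: dict loaded from a .pkl/.pklz file (ASSUMED layout).
    model_name: name of the model doing the loading.
    supported_versions: versions the model accepts, or None to skip
        the version check ("if applicable").
    """
    saved_name = saved.get("modelName")
    if saved_name != model_name:
        raise ValueError("file is for model %r, not %r"
                         % (saved_name, model_name))
    if supported_versions is not None:
        saved_version = saved.get("version")
        if saved_version not in supported_versions:
            raise ValueError("unsupported saved version %r"
                             % (saved_version,))
    return saved


if __name__ == "__main__":
    # Hypothetical model name and exporting versions.
    saved = {"modelName": "ExampleModel", "version": "0.5b"}
    check_saved_model(saved, "ExampleModel", {"0.4b", "0.5b"})
    print("file accepted")
```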

Individual modules