API reference: Alignment Data Format V0.1a

Jetic Gu edited this page Jun 8, 2017 · 2 revisions

Introduction

The data structures used to store alignments are not defined as dedicated classes; they are plain lists, tuples and dicts, used as is.

Data in these formats is passed to the functions and methods of the Aligner.

The description here is of module version 0.1a.

Alignment data format for entire sentence

SentenceAlignment

SentenceAlignment is the only data format used in the Aligner to store the alignment of one sentence.

It is a list in which each element is a tuple of two integers. The following is an example:

sentenceAlignment = [ (f_1, e_1),
                      (f_2, e_2),
                        ... ...
                      (f_n, e_n) ]

Each tuple above represents a pair of aligned words: the first integer is the position of the word in the source sentence and the second is the position of the word in the target sentence.

For example, for the following alignment:

The sentenceAlignment would be (order doesn't matter):

sentenceAlignment = [ (1, 4),
                      (2, 3),
                      (3, 1),
                      (4, 2) ]

Please note that positions in this data format are 1-based: to match the data format of gold alignments, the index of the first word of a sentence is 1 instead of 0.
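The 1-based convention is easy to get wrong when indexing Python lists, so here is a minimal sketch of looking up the words behind each pair. The function name `aligned_word_pairs` and the placeholder tokens are hypothetical and not part of the Aligner.

```python
def aligned_word_pairs(source_words, target_words, sentence_alignment):
    """Map 1-based (f, e) position pairs to the words they refer to."""
    pairs = []
    for f, e in sentence_alignment:
        # The format is 1-based, Python lists are 0-based, hence the "- 1".
        pairs.append((source_words[f - 1], target_words[e - 1]))
    return pairs


# Placeholder tokens (assumption): real sentences would come from the corpus.
source_words = ["f1", "f2", "f3", "f4"]
target_words = ["e1", "e2", "e3", "e4"]
sentence_alignment = [(1, 4), (2, 3), (3, 1), (4, 2)]

pairs = aligned_word_pairs(source_words, target_words, sentence_alignment)
for source_word, target_word in pairs:
    print(source_word, "<->", target_word)
```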

In the event that there are multiple words aligned to the same source word, for example:

There would be two entries for the source word "gefällt", which is a verb meaning "attracts/makes one like":

sentenceAlignment = [ (1, 5),
                      (2, 2),
                      (2, 4),
                      (3, 1),
                      (4, 3) ]
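Because a source position may appear in several tuples, code that needs all target positions for one source word has to collect them. The following is one possible sketch; `targets_by_source` is a hypothetical helper, not a function provided by the Aligner.

```python
from collections import defaultdict


def targets_by_source(sentence_alignment):
    """Group target positions by the source position they are aligned to."""
    grouped = defaultdict(list)
    for f, e in sentence_alignment:
        grouped[f].append(e)
    # Sort each group because the order of tuples in a sentenceAlignment
    # carries no meaning.
    return {f: sorted(targets) for f, targets in grouped.items()}


sentence_alignment = [(1, 5), (2, 2), (2, 4), (3, 1), (4, 3)]
grouped = targets_by_source(sentence_alignment)
print(grouped[2])  # target positions aligned to source word 2
```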

Alignment data formats for entire dataset

Alignment

Alignment is the data format used to store the alignments produced by the Aligner. It is the most common alignment data format of the Aligner.

An alignment is an ordered list in which each element is of SentenceAlignment format. The following is an example:

alignment = [ sentenceAlignment1,
              sentenceAlignment2,
                   ... ...      ,
              sentenceAlignmentN ]

Unlike in SentenceAlignment, the order of the list matters here.
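Since these formats are plain lists and tuples rather than classes, nothing enforces their shape. A small structural check such as the sketch below may help when debugging; `is_sentence_alignment` and `is_alignment` are hypothetical helpers written for illustration, not part of the module.

```python
def is_sentence_alignment(obj):
    """Return True if obj is a list of 2-tuples of positive integers."""
    return isinstance(obj, list) and all(
        isinstance(pair, tuple)
        and len(pair) == 2
        # Positions are 1-based, so 0 and negative values are rejected.
        and all(isinstance(pos, int) and pos >= 1 for pos in pair)
        for pair in obj
    )


def is_alignment(obj):
    """Return True if obj is a list whose elements are all sentenceAlignments."""
    return isinstance(obj, list) and all(is_sentence_alignment(s) for s in obj)


alignment = [
    [(1, 4), (2, 3), (3, 1), (4, 2)],
    [(1, 5), (2, 2), (2, 4), (3, 1), (4, 3)],
]
print(is_alignment(alignment))
```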

GoldAlignment

GoldAlignment is the data format used to store the reference alignments. It is different from Alignment for the following reasons:

  1. Our current Aligner doesn't produce certain alignments and probable alignments separately, whereas a gold alignment usually contains both.

  2. A goldAlignment is not a tuple of two alignments, because that would be inconvenient; storing it per sentence makes it a bit easier to compare a goldAlignment with an alignment, as is done in the Evaluators.

A goldAlignment is a list of dicts:

goldAlignment = [ goldAlignmentEntry1,
                  goldAlignmentEntry2,
                       ... ...      ,
                  goldAlignmentEntryN ]

with each dict in the following format:

goldAlignmentEntry = { "certain": sentenceAlignment1,
                       "probable": sentenceAlignment2 }

in which sentenceAlignment1 and sentenceAlignment2 are of SentenceAlignment format.
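To illustrate why the per-sentence layout is convenient for comparison, here is a minimal sketch that computes precision, recall and alignment error rate (AER) using the commonly used definitions. It is not the Evaluators' actual code: the function name `evaluate_alignment` is hypothetical, and the sketch assumes that certain links also count as probable ones.

```python
def evaluate_alignment(alignment, gold_alignment):
    """Return (precision, recall, aer) for an alignment against a goldAlignment."""
    size_a = size_s = a_and_s = a_and_p = 0
    # Both lists are ordered by sentence, so they can be walked in parallel.
    for predicted, gold in zip(alignment, gold_alignment):
        a = set(predicted)
        s = set(gold["certain"])
        # Assumption: every certain link is also treated as a probable link.
        p = s | set(gold["probable"])
        size_a += len(a)
        size_s += len(s)
        a_and_s += len(a & s)
        a_and_p += len(a & p)
    precision = a_and_p / size_a if size_a else 0.0
    recall = a_and_s / size_s if size_s else 0.0
    total = size_a + size_s
    aer = 1.0 - (a_and_s + a_and_p) / total if total else 0.0
    return precision, recall, aer


alignment = [[(1, 1), (2, 2), (3, 4)]]
gold_alignment = [{"certain": [(1, 1), (2, 2)], "probable": [(3, 3)]}]

precision, recall, aer = evaluate_alignment(alignment, gold_alignment)
print(precision, recall, aer)
```

Converting each sentenceAlignment to a set is safe here because the order of tuples within a sentence does not matter.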