API reference: Dataset Data Format V0.2a

Jetic Gu edited this page Jun 25, 2017 · 2 revisions

Introduction

The data structures used to store datasets for:

  1. training, and

  2. testing

are not defined as a dedicated class; they are plain nested lists and tuples, used as is.

Data in this format is passed to the functions and methods of the Aligner.

The description here is of module version 0.2a.

Changes in 0.2a

  • Old formats, including Bitext and Tritext, are replaced by Dataset.

Dataset data format

Dataset

Dataset is the data format for training and testing datasets with the following components:

  1. source language information;
  2. target language information; and
  3. optionally, alignment information.

A Dataset is in fact a list whose elements are sentences, each of which is itself a list.

Each sentence has up to three parts, stored in a list:

  1. Source language information, typically the original text plus additional information such as POS tags. This is stored as a list in which each element is a tuple holding one word and its supplemental information.

  2. Target language information, typically the original text plus additional information such as POS tags. This is stored as a list in which each element is a tuple holding one word and its supplemental information.

  3. Optionally, a SentenceAlignment giving the alignment of the sentence, as specified in Alignment Data Format V0.2a.
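The nesting described above can be checked with a small helper. This is only a minimal sketch: `is_valid_dataset` is a hypothetical name, not part of the Aligner, and it deliberately does not inspect the optional third element, since the layout of SentenceAlignment is defined in a separate document.

```python
def is_valid_dataset(dataset):
    """Return True if `dataset` has the nested list/tuple shape described above.

    Hypothetical helper for illustration; not part of the Aligner API.
    """
    if not isinstance(dataset, list):
        return False
    for sentence in dataset:
        # A sentence holds source, target and, optionally, an alignment.
        if not isinstance(sentence, list) or len(sentence) not in (2, 3):
            return False
        # Only the source and target sides are checked here.
        for side in sentence[:2]:
            if not isinstance(side, list):
                return False
            for word in side:
                # Every word must be a non-empty tuple: (form, extra info...).
                if not isinstance(word, tuple) or len(word) == 0:
                    return False
    return True


example = [[[("Leise", "JJ")], [("softly", "JJ")]]]
print(is_valid_dataset(example))  # prints True
```

A check like this can catch a common slip early: in Python, `("Leise")` is a plain string, while `("Leise",)` is a one-element tuple.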

For example, the following dataset:

Source(DE):
    Leise flehen meine Lieder
    Durch die Nacht zu Dir

Target(EN):
    Becken softly my songs
    Through the night to you

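will be stored as:

# Note the trailing comma: ("Leise",) is a one-element tuple,
# whereas ("Leise") is just a string.
dataset = [ [[("Leise",), ("flehen",), ("meine",), ("Lieder",)],
             [("Becken",), ("softly",), ("my",), ("songs",)]],

            [[("Durch",), ("die",), ("Nacht",), ("zu",), ("Dir",)],
             [("Through",), ("the",), ("night",), ("to",), ("you",)]] ]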
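To show how such a structure might be consumed, the sketch below walks the dataset and rebuilds the plain text of each sentence pair. It assumes that the word form is the first field of every word tuple, as in the example; `surface_text` is a hypothetical helper name.

```python
dataset = [
    [[("Leise",), ("flehen",), ("meine",), ("Lieder",)],
     [("Becken",), ("softly",), ("my",), ("songs",)]],
    [[("Durch",), ("die",), ("Nacht",), ("zu",), ("Dir",)],
     [("Through",), ("the",), ("night",), ("to",), ("you",)]],
]


def surface_text(words):
    """Join the first field (assumed to be the word form) of each word tuple."""
    return " ".join(word[0] for word in words)


pairs = []
for sentence in dataset:
    # Index 0 is the source side, index 1 the target side;
    # an optional alignment, if present, would sit at index 2.
    source, target = sentence[0], sentence[1]
    pairs.append((surface_text(source), surface_text(target)))

for source_text, target_text in pairs:
    print(source_text, "|||", target_text)
```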

For a Dataset that carries additional information such as POS tags, an example is given below:

Source(DE):
    Leise[JJ] flehen[VBP] meine[PRP$] Lieder[NN]
    Durch[IN] die[DT] Nacht[NN] zu[TO] Dir[PRP]

Target(EN):
    Becken[VBP] softly[JJ] my[PRP$] songs[NN]
    Through[IN] the[DT] night[NN] to[TO] you[PRP]

It will be stored as:

dataset = [ [[("Leise", "JJ"), ("flehen", "VBP"), ("meine", "PRP$"), ("Lieder", "NN")],
             [("Becken", "VBP"), ("softly", "JJ"), ("my", "PRP$"), ("songs", "NN")]],

            [[("Durch", "IN"), ("die", "DT"), ("Nacht", "NN"), ("zu", "TO"), ("Dir", "PRP")],
             [("Through", "IN"), ("the", "DT"), ("night", "NN"), ("to", "TO"), ("you", "PRP")]] ]
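When the word tuples carry extra fields, each field can be pulled out by position. The following is a sketch under the assumption, taken from the example above, that index 0 holds the word form and index 1 the POS tag; `column` is a hypothetical helper, not an Aligner function.

```python
source = [("Leise", "JJ"), ("flehen", "VBP"), ("meine", "PRP$"), ("Lieder", "NN")]


def column(words, index):
    """Collect the field at position `index` from every word tuple."""
    return [word[index] for word in words]


forms = column(source, 0)  # word forms, assumed to be at index 0
tags = column(source, 1)   # POS tags, assumed to be at index 1

print(forms)
print(tags)
```

Accessing fields by position keeps the data as plain tuples, which matches the format's choice not to define a class.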

How to load your dataset?

Please refer to FileIO v0.3a for more details.