API reference: Dataset Data Format V0.2a

Introduction

The data structures used to store datasets for training and testing are not defined as a class. They are plain built-in lists and tuples, used as is.

Data in this format is passed to the functions and methods of the Aligner.

The description here is of module version 0.2a.

Changes in 0.2a

Old formats, including Bitext and Tritest, are replaced by Dataset.

Dataset data format

Dataset

Dataset is the data format for training and testing datasets with the following components:

source language information; and
target language information; and
optionally, alignment information.

A Dataset is a list whose elements are sentences; each sentence is itself a list.

Each sentence has two or three parts, stored in a list:

Source language information: typically the original text plus additional information such as POS tags. It is stored as a list with one tuple per word, each tuple holding the word and its supplemental information.
Target language information: typically the original text plus additional information such as POS tags. It is stored as a list with one tuple per word, each tuple holding the word and its supplemental information.
Optionally, a SentenceAlignment giving the alignment of the sentence, as specified in Alignment Data Format V0.2a.
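The nesting described above can be checked mechanically. The following is a minimal sketch and not part of the module: looks_like_sentence and looks_like_dataset are hypothetical helper names, and the optional alignment part is deliberately not inspected because its format is defined in Alignment Data Format V0.2a.

```python
def looks_like_sentence(sentence):
    """Check the list/tuple nesting of one sentence (hypothetical helper).

    A sentence is a list of two or three parts. The first two parts are
    lists of non-empty tuples. The third part, if present, is not inspected.
    """
    if not isinstance(sentence, list) or len(sentence) not in (2, 3):
        return False
    for side in sentence[:2]:
        if not isinstance(side, list):
            return False
        if not all(isinstance(token, tuple) and len(token) >= 1 for token in side):
            return False
    return True


def looks_like_dataset(dataset):
    """A dataset is a list of sentences (hypothetical helper)."""
    return isinstance(dataset, list) and all(
        looks_like_sentence(sentence) for sentence in dataset
    )


well_formed = [[[("Nacht", "NN")], [("night", "NN")]]]
# Words stored as bare strings instead of tuples do not match the format.
malformed = [[["Nacht"], ["night"]]]

well_formed_ok = looks_like_dataset(well_formed)
malformed_ok = looks_like_dataset(malformed)
```

A check of this kind is only a convenience for catching malformed input early; the Aligner itself is not assumed to perform it.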

For example, the following dataset:

Source(DE):
    Leise flehen meine Lieder
    Durch die Nacht zu Dir

Target(EN):
    Becken softly my songs
    Through the night to you

is stored as:

dataset = [ [[("Leise",), ("flehen",), ("meine",), ("Lieder",)],
             [("Becken",), ("softly",), ("my",), ("songs",)]],

            [[("Durch",), ("die",), ("Nacht",), ("zu",), ("Dir",)],
             [("Through",), ("the",), ("night",), ("to",), ("you",)]] ]
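To illustrate how such a structure could be produced from plain parallel text, here is a minimal sketch. It is not the module's loader (see FileIO for that): make_dataset is a hypothetical helper, and splitting on whitespace is an assumption made only for this example. Note the trailing comma in (word,), which is what makes a one-element tuple in Python; ("Leise") without the comma is just a string.

```python
def make_dataset(source_lines, target_lines):
    """Build a Dataset-shaped list from parallel lines (hypothetical helper).

    Each word becomes a one-element tuple; no alignment part is added.
    Tokenization by whitespace is an assumption of this sketch.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    dataset = []
    for source_line, target_line in zip(source_lines, target_lines):
        source = [(word,) for word in source_line.split()]
        target = [(word,) for word in target_line.split()]
        dataset.append([source, target])
    return dataset


dataset = make_dataset(
    ["Leise flehen meine Lieder", "Durch die Nacht zu Dir"],
    ["Becken softly my songs", "Through the night to you"],
)
```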

For a Dataset with additional information such as POS tags, consider the following example:

Source(DE):
    Leise[JJ] flehen[VBP] meine[PRP$] Lieder[NN]
    Durch[IN] die[DT] Nacht[NN] zu[TO] Dir[PRP]

Target(EN):
    Becken[VBP] softly[JJ] my[PRP$] songs[NN]
    Through[IN] the[DT] night[NN] to[TO] you[PRP]

It will be stored as:

dataset = [ [[("Leise", "JJ"), ("flehen", "VBP"), ("meine", "PRP$"), ("Lieder", "NN")],
             [("Becken", "VBP"), ("softly", "JJ"), ("my", "PRP$"), ("songs", "NN")]],

            [[("Durch", "IN"), ("die", "DT"), ("Nacht", "NN"), ("zu", "TO"), ("Dir", "PRP")],
             [("Through", "IN"), ("the", "DT"), ("night", "NN"), ("to", "TO"), ("you", "PRP")]] ]
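Reading values back out of this structure is a matter of indexing. The sketch below is illustrative only: split_sentence and column are hypothetical helper names, not functions of the module, and the position of the POS tag (index 1 in each tuple) is taken from the example above rather than from any fixed rule.

```python
def split_sentence(sentence):
    """Return (source, target, alignment); alignment is None when absent."""
    source, target = sentence[0], sentence[1]
    alignment = sentence[2] if len(sentence) > 2 else None
    return source, target, alignment


def column(tokens, index):
    """Collect one field (0 = word, 1 = POS tag in this example) from each tuple."""
    return [token[index] for token in tokens]


dataset = [
    [[("Leise", "JJ"), ("flehen", "VBP"), ("meine", "PRP$"), ("Lieder", "NN")],
     [("Becken", "VBP"), ("softly", "JJ"), ("my", "PRP$"), ("songs", "NN")]],
]

source, target, alignment = split_sentence(dataset[0])
source_words = column(source, 0)
source_tags = column(source, 1)
target_words = column(target, 0)
```

Because the alignment part is optional, code that consumes a Dataset should not assume that every sentence has three parts.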

How to load your dataset?

Please refer to FileIO v0.3a for more detail.
