API reference: Dataset Data Format V0.1a

Introduction

The data structures used to store datasets for training and testing are not defined as classes; they are plain nested lists, used as is.

Data in these formats is passed to the functions and methods of the Aligner.

The description here is of module version 0.1a.

Dataset data format for two components

Bitext

Bitext is the data format for training and testing datasets with the following components:

source language texts; and
target language texts.

Bitext is a list whose elements are pairs of lists of strings (str). Each pair holds one tokenised source sentence followed by the corresponding target sentence.

For example, the following dataset:

Source(DE):
    Leise flehen meine Lieder
    Durch die Nacht zu Dir

Target(EN):
    Becken softly my songs
    Through the night to you

Will be stored as:

bitext = [ [["Leise", "flehen", "meine", "Lieder"], ["Becken", "softly", "my", "songs"]],
           [["Durch", "die", "Nacht", "zu", "Dir"], ["Through", "the", "night", "to", "you"]] ]
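As a minimal sketch of how such a structure could be assembled from raw text, assuming one sentence per line and simple whitespace tokenisation: the helper name `make_bitext` is hypothetical and is not part of the module.

```python
def make_bitext(source_lines, target_lines):
    """Pair up parallel sentences and split each one into tokens.

    Hypothetical helper for illustration only; not part of the Aligner module.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of sentences")
    # str.split() with no argument splits on any run of whitespace.
    return [[src.split(), tgt.split()]
            for src, tgt in zip(source_lines, target_lines)]


source = ["Leise flehen meine Lieder", "Durch die Nacht zu Dir"]
target = ["Becken softly my songs", "Through the night to you"]
bitext = make_bitext(source, target)
```

Real corpora usually need a proper tokeniser (punctuation, casing); whitespace splitting is used here only to keep the sketch short.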

Dataset data format for three components

Tritext

Tritext is the data format for datasets with three components, unlike the two of Bitext.

Tritext is a list whose elements are trios of lists of strings (str).

For example, the following dataset:

DE:
    Leise flehen meine Lieder
    Durch die Nacht zu Dir

EN:
    Becken softly my songs
    Through the night to you

IT:
    Sommessi nella notte
    I miei canti ti supplicano

Will be stored as:

tritext = [ [["Leise", "flehen", "meine", "Lieder"], ["Becken", "softly", "my", "songs"], ["Sommessi", "nella", "notte"]],
            [["Durch", "die", "Nacht", "zu", "Dir"], ["Through", "the", "night", "to", "you"], ["I", "miei", "canti", "ti", "supplicano"]] ]
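Because these formats are plain lists rather than classes, nothing enforces their shape. A minimal sketch of a shape check is given below; the function name `is_valid_dataset` is hypothetical and not part of the module, and accepting tuples as well as lists for each entry is an assumption.

```python
def is_valid_dataset(dataset, components):
    """Return True if `dataset` is a list whose entries each hold
    `components` token lists (2 for Bitext, 3 for Tritext).

    Hypothetical helper for illustration only; not part of the Aligner module.
    """
    if not isinstance(dataset, list):
        return False
    for entry in dataset:
        # Assumption: an entry may be a list or a tuple of sentences.
        if not isinstance(entry, (list, tuple)) or len(entry) != components:
            return False
        for sentence in entry:
            if not isinstance(sentence, list):
                return False
            if not all(isinstance(token, str) for token in sentence):
                return False
    return True


tritext = [
    [["Leise", "flehen", "meine", "Lieder"],
     ["Becken", "softly", "my", "songs"],
     ["Sommessi", "nella", "notte"]],
    [["Durch", "die", "Nacht", "zu", "Dir"],
     ["Through", "the", "night", "to", "you"],
     ["I", "miei", "canti", "ti", "supplicano"]],
]

ok_as_tritext = is_valid_dataset(tritext, 3)
ok_as_bitext = is_valid_dataset(tritext, 2)
```

Here `ok_as_tritext` is `True` and `ok_as_bitext` is `False`, since every entry holds three sentences rather than two.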
