API reference: Dataset Data Format V0.1a

Jetic Gu edited this page Jun 8, 2017 · 1 revision

Introduction

The data structures used to store datasets for

  1. training, and

  2. testing

are not defined as dedicated classes; they are plain lists, used as is.

Data in these formats is passed to the functions and methods of the Aligner.

The description here is of module version 0.1a.

Dataset data format for two components

Bitext

Bitext is the data format for training and testing datasets with the following components:

  1. source language texts; and
  2. target language texts.

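A Bitext is a list whose elements are pairs of lists of strings (str): in each pair, the first list holds the tokens of a source sentence and the second list holds the tokens of the corresponding target sentence.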

For example, the following dataset:

Source(DE):
    Leise flehen meine Lieder
    Durch die Nacht zu Dir

Target(EN):
    Becken softly my songs
    Through the night to you

Will be stored as:

bitext = [ [["Leise", "flehen", "meine", "Lieder"], ["Becken", "softly", "my", "songs"]],
           [["Durch", "die", "Nacht", "zu", "Dir"], ["Through", "the", "night", "to", "you"]] ]
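As a minimal sketch of how such a structure could be built from raw parallel sentences: the helper name `make_bitext` is hypothetical and not part of the module, and splitting on whitespace is an assumption about tokenization.

```python
def make_bitext(source_lines, target_lines):
    """Pair up parallel sentences and split each one into tokens.

    Hypothetical helper for illustration only; tokenization by
    whitespace is an assumption, not something the module prescribes.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of sentences")
    # Each element is a pair: [source tokens, target tokens]
    return [[src.split(), tgt.split()]
            for src, tgt in zip(source_lines, target_lines)]


source = ["Leise flehen meine Lieder", "Durch die Nacht zu Dir"]
target = ["Becken softly my songs", "Through the night to you"]
bitext = make_bitext(source, target)
```

Each element of the result is a two-item list, so `bitext[i][0]` is the i-th source sentence as a list of str and `bitext[i][1]` is its target counterpart.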

Dataset data format for three components

Tritext

Tritext is the data format for datasets with three components, as opposed to the two components of Bitext.

A Tritext is a list whose elements are trios of lists of strings (str), one list of tokens per component.

For example, the following dataset:

DE:
    Leise flehen meine Lieder
    Durch die Nacht zu Dir

EN:
    Becken softly my songs
    Through the night to you

IT:
    Sommessi nella notte
    I miei canti ti supplicano

Will be stored as:

tritext = [ [["Leise", "flehen", "meine", "Lieder"], ["Becken", "softly", "my", "songs"], ["Sommessi", "nella", "notte"]],
            [["Durch", "die", "Nacht", "zu", "Dir"], ["Through", "the", "night", "to", "you"], ["I", "miei", "canti", "ti", "supplicano"]] ]
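Because these formats are plain lists rather than classes, nothing enforces their shape. The following is a rough sketch of a structural check; the function `check_dataset` and its `width` parameter are hypothetical and not part of the module.

```python
def check_dataset(dataset, width):
    """Return True if dataset is a list whose items each hold
    exactly `width` lists of str (2 for Bitext, 3 for Tritext).

    Hypothetical helper for illustration; not part of the module.
    """
    if not isinstance(dataset, list):
        return False
    for item in dataset:
        # Every item must group exactly `width` sentences
        if not isinstance(item, (list, tuple)) or len(item) != width:
            return False
        for sentence in item:
            # Every sentence must be a list of str tokens
            if not isinstance(sentence, list):
                return False
            if not all(isinstance(token, str) for token in sentence):
                return False
    return True


tritext = [
    [["Leise", "flehen", "meine", "Lieder"],
     ["Becken", "softly", "my", "songs"],
     ["Sommessi", "nella", "notte"]],
    [["Durch", "die", "Nacht", "zu", "Dir"],
     ["Through", "the", "night", "to", "you"],
     ["I", "miei", "canti", "ti", "supplicano"]],
]
is_tritext = check_dataset(tritext, 3)
is_bitext = check_dataset(tritext, 2)
```

Here `is_tritext` is `True` and `is_bitext` is `False`, since every item groups three sentences rather than two.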