API reference: Dataset Data Format V0.1a
Jetic Gu edited this page Jun 8, 2017 · 1 revision
The data structures used to store training and testing datasets are not defined as dedicated classes; the data is used as is, in plain lists.
Data in these formats is passed to the functions and methods of the Aligner.
The description here is of module version 0.1a.
Bitext is the data format for training and testing datasets with the following components:
- source language texts; and
- target language texts.
Bitext is in fact a list whose elements are pairs of lists of strings (str).
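The structure described above can be checked with a few lines of Python. This is only a sketch: the helper name is_bitext is hypothetical and is not part of the Aligner module, and accepting a tuple as well as a list for each pair is an assumption made here for convenience.

```python
def is_bitext(data):
    """Return True if data is a list of [source_tokens, target_tokens] pairs.

    Hypothetical helper for illustration; not part of the Aligner module.
    """
    if not isinstance(data, list):
        return False
    for pair in data:
        # Each element must hold exactly two sentences (source and target).
        if not isinstance(pair, (list, tuple)) or len(pair) != 2:
            return False
        for sentence in pair:
            # Each sentence must be a list of string tokens.
            if not isinstance(sentence, list):
                return False
            if not all(isinstance(token, str) for token in sentence):
                return False
    return True
```

A check like this may be useful before handing a dataset to the Aligner, since the format is not enforced by a class.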
For example, the following dataset:
Source(DE):
Leise flehen meine Lieder
Durch die Nacht zu Dir
Target(EN):
Becken softly my songs
Through the night to you
will be stored as:
bitext = [
    [["Leise", "flehen", "meine", "Lieder"], ["Becken", "softly", "my", "songs"]],
    [["Durch", "die", "Nacht", "zu", "Dir"], ["Through", "the", "night", "to", "you"]],
]
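A structure of this shape could be built from two parallel lists of raw sentences as sketched below. The function name build_bitext is hypothetical, and splitting on whitespace is an assumed tokenisation; the page does not say how the Aligner's own loaders tokenise text.

```python
def build_bitext(source_lines, target_lines):
    """Pair up parallel sentences and split each one into tokens.

    Hypothetical helper; whitespace tokenisation is an assumption.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    # Each element is a pair: [source_tokens, target_tokens].
    return [[src.split(), tgt.split()]
            for src, tgt in zip(source_lines, target_lines)]


source = ["Leise flehen meine Lieder", "Durch die Nacht zu Dir"]
target = ["Becken softly my songs", "Through the night to you"]
bitext = build_bitext(source, target)
```

Here bitext[0][0] holds the tokens of the first source sentence and bitext[0][1] the tokens of the first target sentence.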
Tritext is the data format for datasets with three components, instead of the two in Bitext.
Tritext is in fact a list whose elements are trios of lists of strings (str).
For example, the following dataset:
DE:
Leise flehen meine Lieder
Durch die Nacht zu Dir
EN:
Becken softly my songs
Through the night to you
IT:
Sommessi nella notte
I miei canti ti supplicano
will be stored as:
tritext = [
    [["Leise", "flehen", "meine", "Lieder"], ["Becken", "softly", "my", "songs"], ["Sommessi", "nella", "notte"]],
    [["Durch", "die", "Nacht", "zu", "Dir"], ["Through", "the", "night", "to", "you"], ["I", "miei", "canti", "ti", "supplicano"]],
]
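Because a Tritext entry is simply a trio of token lists, any two of its components can be read off as a Bitext-shaped list. The sketch below illustrates this; build_tritext and project_to_bitext are hypothetical names, not functions of the Aligner module, and whitespace tokenisation is again an assumption.

```python
def build_tritext(first_lines, second_lines, third_lines):
    """Group three parallel lists of sentences into trios of token lists.

    Hypothetical helper; whitespace tokenisation is an assumption.
    """
    if not (len(first_lines) == len(second_lines) == len(third_lines)):
        raise ValueError("all three components must have the same number of lines")
    return [[a.split(), b.split(), c.split()]
            for a, b, c in zip(first_lines, second_lines, third_lines)]


def project_to_bitext(tritext, i, j):
    """Keep components i and j of every trio, giving a Bitext-shaped list."""
    return [[trio[i], trio[j]] for trio in tritext]


de = ["Leise flehen meine Lieder", "Durch die Nacht zu Dir"]
en = ["Becken softly my songs", "Through the night to you"]
it = ["Sommessi nella notte", "I miei canti ti supplicano"]

tritext = build_tritext(de, en, it)
de_it = project_to_bitext(tritext, 0, 2)  # German paired with Italian
```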