
API reference: FileIO V0.1a

Jetic Gu edited this page Jun 8, 2017 · 1 revision

Introduction

FileIO is located in src/fileIO.py. It contains functions used to handle file operations in the Aligner.

This page describes module version 0.1a.

Functions

def exportToFile(result, fileName)

Parameters:

def loadBitext(file1, file2, linesToLoad=sys.maxint)

Parameters:

  • file1: str, the first file to read
  • file2: str, the second file to read
  • linesToLoad: int, the maximum number of lines to read

Return:

def loadTritext(file1, file2, file3, linesToLoad=sys.maxint)

Parameters:

  • file1: str, the first file to read
  • file2: str, the second file to read
  • file3: str, the third file to read
  • linesToLoad: int, the maximum number of lines to read

Return:

def loadAlignment(fileName, linesToLoad=sys.maxint)

Parameters:

  • fileName: str, the alignment file to read
  • linesToLoad: int, the maximum number of lines to read

Return:

File formats

Source/target language text files

UTF-8 text files. Each line contains one sentence. Sentences are already segmented, with words separated by spaces. Each file contains one language.
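
As an illustration of this layout, here is a minimal sketch of reading two line-aligned text files into pairs of word lists. The function name read_parallel_sentences and its max_lines parameter are hypothetical and are not part of fileIO.py; the sketch only assumes the file format described above, not the behaviour or return value of loadBitext.

```python
import io
from itertools import islice


def read_parallel_sentences(path1, path2, max_lines=None):
    """Hypothetical helper: read two line-aligned UTF-8 files.

    Returns a list of (words1, words2) pairs, one pair per line.
    """
    pairs = []
    with io.open(path1, encoding="utf-8") as f1, \
            io.open(path2, encoding="utf-8") as f2:
        # zip stops at the end of the shorter file;
        # islice caps the number of line pairs (None means no cap).
        for line1, line2 in islice(zip(f1, f2), max_lines):
            # Words are separated by spaces, so a plain split is enough.
            pairs.append((line1.split(), line2.split()))
    return pairs
```

Passing max_lines=None reads every line pair, which plays a similar role to the large default used for linesToLoad.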

Gold Alignment files (.wa)

UTF-8 text files. Each line contains the word alignments for one sentence pair, separated by spaces. Each alignment is represented in one of the following formats:

  1. "NN-MM", where NN and MM are integers, means that there is a certain alignment between the NNth word of the source sentence and the MMth word of the target sentence. In addition, MM may take the form "M1,M2,M3,...", which means that there is a certain alignment between the NNth word of the source sentence and each of the words M1, M2, M3, ... of the target sentence.

  2. "NN?MM", where NN and MM are integers, means that there is a probable alignment between the NNth word of the source sentence and the MMth word of the target sentence. In addition, MM may take the form "M1,M2,M3,...", which means that there is a probable alignment between the NNth word of the source sentence and each of the words M1, M2, M3, ... of the target sentence.

  3. "NN-MM-TT", where NN and MM are integers and TT is a str representing the tag of the words, means that there is a certain alignment between the NNth word of the source sentence and the MMth word of the target sentence, both of which are of type TT.
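
The three formats above can be sketched as a small parser for one line of a .wa file. This is an illustrative sketch, not the implementation of loadAlignment: the name parse_alignment_line and the (source, target, kind, tag) tuple layout are hypothetical. It also assumes that a tag may follow a comma-separated target list or a "?" entry, which the format description does not state, and it keeps the indices exactly as written in the file.

```python
import re

# source index, "-" (certain) or "?" (probable),
# one or more comma-separated target indices, optional "-TAG" suffix
_ENTRY = re.compile(r"^(\d+)([-?])(\d+(?:,\d+)*)(?:-(.+))?$")


def parse_alignment_line(line):
    """Hypothetical helper: parse one line of a gold alignment file.

    Returns a list of (source, target, kind, tag) tuples, where kind is
    "certain" or "probable" and tag is None when no tag is given.
    """
    links = []
    for entry in line.split():
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError("unrecognised alignment entry: %r" % entry)
        source, separator, targets, tag = match.groups()
        kind = "certain" if separator == "-" else "probable"
        # "M1,M2,M3" expands to one link per target word.
        for target in targets.split(","):
            links.append((int(source), int(target), kind, tag))
    return links
```

For example, parse_alignment_line("1-1 2?3,4 5-6-NN") yields one certain link, two probable links from source word 2, and one certain link tagged "NN".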
