
Text Corpus Format

Maarten Janssen edited this page Aug 21, 2020 · 2 revisions

The Text Corpus Format (.xml, .tcf) is a file format used by WebLicht. It is an XML-based format for token-based stand-off annotation.

Import

tcf2teitok.pl

Options

Command line options of the tool:

  • debug: run in debugging mode
  • output: name of the output file; if empty, output is written to STDOUT
  • morerev: more revision statements
  • file: name of the input file
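
As a rough illustration of how these options might be combined, the sketch below assembles a command line in Python. It assumes the script accepts options in the common `--name` / `--name=value` form; that syntax, the helper name `build_command`, and the file names are assumptions made for this example, not something this page confirms.

```python
def build_command(input_path, output_path=None, debug=False, morerev=False,
                  script="tcf2teitok.pl"):
    """Build an argument list for calling tcf2teitok.pl.

    Assumption: options are passed as --name or --name=value.
    """
    cmd = ["perl", script, f"--file={input_path}"]
    if output_path:
        # Without an output name, the script writes to STDOUT.
        cmd.append(f"--output={output_path}")
    if debug:
        cmd.append("--debug")
    if morerev:
        cmd.append("--morerev")
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_command("corpus.tcf", output_path="corpus.xml")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to actually run the conversion, provided Perl and the script are available on the machine.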

Specifications

The script converts the various blocks in the TCF file into their TEI(TOK) counterparts:

  • tokens: each token is converted to a tok
  • posTag: each tag is converted to @pos on the corresponding token
  • lemmas: each lemma is converted to @lemma on the corresponding token
  • orthography: each correction is converted to @reg on the corresponding token
  • dependencies: for each dependency, @gov_IDs is converted to @head and @class to @deprel on the corresponding token
  • parses: each parse is converted to an eTree
  • sentences: each sentence is converted to an s with the corresponding tokens inside
  • namedEntities: each entity is converted to a name with the corresponding tokens inside
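
The mapping above can be illustrated with a small Python sketch. This is not the Perl script itself: it covers only tokens, lemmas, POS tags, and sentences, and it runs on a simplified, made-up input. The sample element layout, the `tokenIDs` and `ID` attribute names, the `id` attribute on `tok`, and the function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up input (assumption: real TCF files add a wrapper element
# and XML namespaces, which local() below strips before comparing names).
SAMPLE_TCF = """<TextCorpus>
  <tokens>
    <token ID="t1">Dogs</token>
    <token ID="t2">bark</token>
  </tokens>
  <lemmas>
    <lemma tokenIDs="t1">dog</lemma>
    <lemma tokenIDs="t2">bark</lemma>
  </lemmas>
  <POStags>
    <tag tokenIDs="t1">NNS</tag>
    <tag tokenIDs="t2">VBP</tag>
  </POStags>
  <sentences>
    <sentence tokenIDs="t1 t2"/>
  </sentences>
</TextCorpus>"""


def local(tag):
    """Return an element name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def tcf_to_tei(tcf_xml):
    """Convert tokens, lemmas, POS tags and sentences to a TEI-like <text>."""
    root = ET.fromstring(tcf_xml)

    # tokens: every token becomes a <tok>, keyed by its ID.
    toks = {}
    for el in root.iter():
        if local(el.tag) == "token":
            tok = ET.Element("tok", {"id": el.get("ID")})
            tok.text = el.text
            toks[el.get("ID")] = tok

    # lemmas / POS tags: single-token annotations become attributes.
    attr_for = {"lemma": "lemma", "tag": "pos"}
    for el in root.iter():
        attr = attr_for.get(local(el.tag))
        ids = (el.get("tokenIDs") or "").split()
        if attr and len(ids) == 1 and ids[0] in toks:
            toks[ids[0]].set(attr, el.text or "")

    # sentences: each sentence becomes an <s> wrapping its tokens.
    text = ET.Element("text")
    placed = set()
    for el in root.iter():
        if local(el.tag) == "sentence":
            s = ET.SubElement(text, "s")
            for tid in (el.get("tokenIDs") or "").split():
                if tid in toks:
                    s.append(toks[tid])
                    placed.add(tid)

    # Tokens outside any sentence are appended at the end (a simplification).
    for tid, tok in toks.items():
        if tid not in placed:
            text.append(tok)
    return text


print(ET.tostring(tcf_to_tei(SAMPLE_TCF), encoding="unicode"))
```

Annotations that span more than one token are skipped in the attribute step, which mirrors the single-token restriction noted under Known issues.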

Known issues

Not all parts of the format are implemented. Except for sentences and named entities, annotations can only refer to single tokens.

Export

No export is available yet.
