
CoNLL-U

Maarten Janssen edited this page Aug 17, 2020 · 1 revision

CoNLL-U (.conllu) is a file format for dependency trees in vertical (one token per line) format, used by Universal Dependencies.

Import

conllu2teitok.pl

Options

Command line options of the tool:

  • debug: debugging mode
  • output: name of the output file; if empty, output goes to STDOUT
  • morerev: More revision statements
  • file: filename of the input

Specifications

The CoNLL-U import converts every token line into a token whose attributes are named after the corresponding UD columns, and converts sentences and paragraphs into XML regions. Unless the @misc column contains "SpaceAfter=No", a space is inserted after the token.
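The spacing rule can be illustrated with a minimal Python sketch. It is not the code of conllu2teitok.pl; the function names `parse_misc` and `rebuild_text` are hypothetical, and the sketch assumes plain integer token IDs only (multiword-token ranges such as 1-2 and empty nodes such as 1.1 are skipped).

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC field ("Key=Val|Key2=Val2" or "_") into a dict."""
    result = {}
    if not misc or misc == "_":
        return result
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def rebuild_text(conllu_lines):
    """Concatenate token forms, adding a space unless SpaceAfter=No is set."""
    parts = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, misc = cols[0], cols[1], cols[9]
        # Assumption of this sketch: ignore ranges (1-2) and empty nodes (1.1).
        if not tok_id.isdigit():
            continue
        parts.append(form)
        if parse_misc(misc).get("SpaceAfter") != "No":
            parts.append(" ")
    return "".join(parts)
```

Because a space follows every token that lacks SpaceAfter=No, the rebuilt string ends in a trailing space when the last token has no such flag.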

Tokens are numbered sequentially following the TEITOK-style numbering, and each ordinal-based head value is replaced by a reference to the new ID of the head token.
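The renumbering can be sketched in Python as follows. This is an illustration, not the script's code: the function name `renumber_sentences`, the "w-N" ID shape, and the use of None for a root token (head 0) are all assumptions.

```python
def renumber_sentences(sentences, prefix="w-"):
    """Give every token a document-wide sequential ID and remap its head.

    `sentences` is a list of sentences; each sentence is a list of
    (ordinal, head_ordinal) integer pairs as found in the CoNLL-U
    ID and HEAD columns. The "w-" prefix is an assumed ID format.
    """
    counter = 0
    tokens = []
    for sentence in sentences:
        # First pass: map each sentence-local ordinal to a new global ID.
        id_map = {}
        for ordinal, _head in sentence:
            counter += 1
            id_map[ordinal] = f"{prefix}{counter}"
        # Second pass: replace each head ordinal by the head's new ID.
        # Head 0 (the root) has no entry in id_map and becomes None.
        for ordinal, head in sentence:
            tokens.append({"id": id_map[ordinal], "head": id_map.get(head)})
    return tokens
```

Two passes per sentence are needed because a head may follow its dependent, so its new ID is not known yet when the dependent is first read.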

Known issues

Export

teitok2conllu.pl

  • writeback: write back to original file or put in new file
  • file: input file name
  • output: output file name
  • pos: the attribute to be used as XPOS (by default either @xpos or @pos)
  • tagmapping: a CSV file mapping XPOS onto UPOS + Feat (see the UD conversion tables)

Specifications

The script converts tokens into verticalized format. If the input file is not tokenized, the script aborts. If the input file contains sentences, the script follows them; otherwise it starts a new sentence after every token matching [.!?].
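The fallback sentence splitting can be sketched in Python as below. The function name `split_sentences` is hypothetical, and the sketch assumes the whole token must be a single ".", "!" or "?"; the actual matching in teitok2conllu.pl may differ.

```python
import re

# Assumption: a token ends a sentence only if it is exactly one of . ! ?
SENTENCE_END = re.compile(r"^[.!?]$")


def split_sentences(tokens):
    """Group a flat list of token forms into sentences."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        if SENTENCE_END.match(token):
            # Close the sentence right after the punctuation token.
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without final punctuation form a last sentence.
        sentences.append(current)
    return sentences
```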

Known issues

The script currently does not take paragraphs into account.
