CoNLL-U
CoNLL-U (.conllu) is a file format for dependency trees in vertical format (one token per line), used for Universal Dependencies.
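As a rough illustration of the vertical format, the sketch below splits one token line into its ten tab-separated columns. The column names follow the CoNLL-U format; the function name `parse_token_line` and the sample line are made up for this example and are not part of the scripts described here.

```python
# Column names of a CoNLL-U token line (10 tab-separated fields).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))

# Hypothetical sample line, for illustration only.
example = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_token_line(example)
print(token["form"], token["upos"], token["head"])
```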
conllu2teitok.pl
Command line options of the tool:
- debug: debugging mode
- output: name of the output file; if empty, output is written to STDOUT
- morerev: add more revision statements
- file: filename of the input
The CoNLL-U import converts every token line into a token, with attributes named after the corresponding UD columns, and converts sentences and paragraphs into XML regions. A space is inserted after each token unless its @misc column contains "SpaceAfter=No".
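The spacing rule can be sketched as follows. This is a minimal Python illustration of the rule as described, not the code of conllu2teitok.pl; the helper names `has_space_after` and `join_tokens` are hypothetical, and trimming the trailing space at the end is a choice made only for this example.

```python
def has_space_after(misc):
    """Return False only when the MISC column contains SpaceAfter=No."""
    if misc == "_":  # "_" marks an empty MISC column
        return True
    return "SpaceAfter=No" not in misc.split("|")

def join_tokens(tokens):
    """Rebuild running text from (form, misc) pairs."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if has_space_after(misc):
            parts.append(" ")
    # Trailing space dropped for readability (an assumption of this sketch).
    return "".join(parts).rstrip()

tokens = [("Hello", "SpaceAfter=No"), (",", "_"),
          ("world", "SpaceAfter=No"), ("!", "_")]
print(join_tokens(tokens))
```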
Tokens are numbered sequentially in the TEITOK style, and each ordinal head reference is replaced by a reference to the new ID of the head token.
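The renumbering step might look roughly like the sketch below. It assumes IDs of the form `w-1`, `w-2`, ... that run across the whole document, and it ignores multiword token ranges and empty nodes; both the ID prefix and the function name `renumber` are assumptions for illustration, not taken from the script.

```python
def renumber(sentences, prefix="w-"):
    """Assign document-wide sequential IDs and rewrite heads to point at them.

    `sentences` is a list of sentences; each sentence is a list of
    (ordinal_id, head) string pairs, as in the ID and HEAD columns.
    """
    counter = 0
    result = []
    for sentence in sentences:
        # First pass: map each sentence-local ordinal to a new global ID.
        new_id = {}
        for ordinal, _head in sentence:
            counter += 1
            new_id[ordinal] = f"{prefix}{counter}"
        # Second pass: replace each head ordinal by the new ID of the head.
        for ordinal, head in sentence:
            head_ref = new_id.get(head)  # None for the root (head "0")
            result.append((new_id[ordinal], head_ref))
    return result

doc = [[("1", "2"), ("2", "0")], [("1", "0"), ("2", "1")]]
for tok_id, head_ref in renumber(doc):
    print(tok_id, head_ref)
```

Resolving heads per sentence matters because CoNLL-U ordinals restart at 1 in every sentence, so the same ordinal refers to different tokens in different sentences.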
teitok2conllu.pl
- writeback: write back to original file or put in new file
- file: input file name
- output: output file name
- pos: the attribute to be used as XPOS (by default either @xpos or @pos)
- tagmapping: a CSV file mapping XPOS onto UPOS + features (see the UD conversion tables)
The script converts tokens into verticalized format. When the input file is not tokenized, the script aborts. The script follows the sentences if they are present in the input file; otherwise it starts a new sentence after every token matching [.!?].
The script currently does not take paragraphs into account.
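The fallback sentence splitting described above can be sketched like this. It is a Python illustration rather than the Perl code of teitok2conllu.pl, and it assumes that a token must consist of exactly one of the characters `.`, `!` or `?` to end a sentence; the name `split_sentences` is hypothetical.

```python
import re

# Assumption: a sentence ends at a token that is exactly ".", "!" or "?".
SENTENCE_END = re.compile(r"[.!?]")

def split_sentences(forms):
    """Group token forms into sentences, closing one after each [.!?] token."""
    sentences, current = [], []
    for form in forms:
        current.append(form)
        if SENTENCE_END.fullmatch(form):
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences

forms = ["It", "works", ".", "Does", "it", "?", "Yes"]
for sentence in split_sentences(forms):
    print(" ".join(sentence))
```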