Text Corpus Format
Maarten Janssen edited this page Aug 21, 2020 · 2 revisions
The Text Corpus Format (TCF; file extensions .xml, .tcf) is the file format used by WebLicht. It is an XML-based format for token-based stand-off annotation.
tcf2teitok.pl
Command line options of the tool:
- debug: debugging mode
- output: name of the output file; if empty, the result is written to STDOUT
- morerev: More revision statements
- file: filename of the input
The script converts the various blocks in the TCF file into their TEI(TOK) counterpart:
- tokens:token => converted to tok
- posTag:tag => converted to @pos on the corresponding token
- lemmas:lemma => converted to @lemma on the corresponding token
- orthography:correction => converted to @reg on the corresponding token
- dependencies:dependency => @gov_IDs converted to @head, and @class to @deprel, on the corresponding token
- parses:parse => converted to eTree
- sentences:sentences => converted to s with the corresponding tokens inside
- namedEntities:entity => converted to name with the corresponding tokens inside
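The token-level part of this mapping can be illustrated with a minimal Python sketch. It is not the code of tcf2teitok.pl, only an approximation of the idea under stated assumptions: the inline sample is a simplified, namespace-free fragment, the `ID` and `tokenIDs` attributes and the `id` attribute on `tok` are assumed, and `tcf_to_toks`, `local_name`, and `ATTRIBUTE_FOR` are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TCF fragment for illustration only; real TCF
# files use XML namespaces and carry more layers and metadata.
SAMPLE_TCF = """
<TextCorpus>
  <tokens>
    <token ID="t1">Dogs</token>
    <token ID="t2">bark</token>
  </tokens>
  <POStags>
    <tag tokenIDs="t1">NNS</tag>
    <tag tokenIDs="t2">VBP</tag>
  </POStags>
  <lemmas>
    <lemma tokenIDs="t1">dog</lemma>
    <lemma tokenIDs="t2">bark</lemma>
  </lemmas>
</TextCorpus>
"""

# Stand-off element name -> attribute set on the matching <tok>
# (follows the tag/lemma/correction rows of the list above).
ATTRIBUTE_FOR = {"tag": "pos", "lemma": "lemma", "correction": "reg"}


def local_name(element):
    """Return the element name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def tcf_to_toks(tcf_xml):
    """Build one <tok> per TCF token and attach single-token annotations."""
    root = ET.fromstring(tcf_xml)

    # First pass: create a <tok> element for every token, keyed by its ID.
    toks = {}
    for el in root.iter():
        if local_name(el) == "token":
            tok = ET.Element("tok", {"id": el.get("ID")})
            tok.text = el.text
            toks[el.get("ID")] = tok

    # Second pass: copy each stand-off annotation onto the token it points to.
    for el in root.iter():
        attr = ATTRIBUTE_FOR.get(local_name(el))
        if attr is None:
            continue
        ids = (el.get("tokenIDs") or "").split()
        # Only annotations that refer to exactly one token are handled.
        if len(ids) == 1 and ids[0] in toks:
            toks[ids[0]].set(attr, el.text or "")

    return list(toks.values())


if __name__ == "__main__":
    for tok in tcf_to_toks(SAMPLE_TCF):
        print(ET.tostring(tok, encoding="unicode"))
```

Matching on the local element name rather than on the enclosing layer keeps the sketch independent of namespaces and of the exact layer names used in a given TCF file.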
Not all parts are implemented, and for all but sentences and entities, annotations can only relate to single tokens.
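The multi-token case can be sketched in the same hedged way: a sentence annotation lists several token IDs, and the result is an s element that contains the corresponding tokens. The example below is an illustration, not the script's actual logic. It assumes a child element named `sentence` with a space-separated `tokenIDs` attribute and a namespace-free sample; `wrap_sentences` and `local_name` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TCF fragment; element and attribute names are
# assumptions made for this illustration.
SAMPLE_TCF_SENTENCES = """
<TextCorpus>
  <tokens>
    <token ID="t1">Dogs</token>
    <token ID="t2">bark</token>
    <token ID="t3">Cats</token>
    <token ID="t4">purr</token>
  </tokens>
  <sentences>
    <sentence tokenIDs="t1 t2"/>
    <sentence tokenIDs="t3 t4"/>
  </sentences>
</TextCorpus>
"""


def local_name(element):
    """Return the element name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def wrap_sentences(tcf_xml):
    """Return a <text> element with one <s> per sentence, tokens inside."""
    root = ET.fromstring(tcf_xml)

    # Look-up table from token ID to token text.
    token_text = {
        el.get("ID"): el.text
        for el in root.iter()
        if local_name(el) == "token"
    }

    text = ET.Element("text")
    for el in root.iter():
        if local_name(el) != "sentence":
            continue
        s = ET.SubElement(text, "s")
        # A sentence spans several tokens, so each listed ID becomes
        # a <tok> child of the <s> element, in the listed order.
        for token_id in (el.get("tokenIDs") or "").split():
            tok = ET.SubElement(s, "tok", {"id": token_id})
            tok.text = token_text.get(token_id, "")
    return text


if __name__ == "__main__":
    print(ET.tostring(wrap_sentences(SAMPLE_TCF_SENTENCES), encoding="unicode"))
```

Named entities would follow the same pattern, with a name element in place of s.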
No export from TEITOK back to TCF has been provided yet.