Text Corpus Format
Maarten Janssen edited this page Aug 21, 2020 · 2 revisions
The Text Corpus Format (TCF; .xml, .tcf) is a file format used by WebLicht. It is an XML-based format for stand-off annotation: each annotation layer is stored in its own block and refers back to the tokens by ID.
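To make the stand-off idea concrete, the following minimal Python sketch resolves one annotation layer against the token layer. The sample is a simplified, namespace-free fragment modelled on TCF; the attribute names `ID` and `tokenIDs` and the helper `resolve_layer` are assumptions made for illustration, not part of tcf2teitok.pl.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment modelled on TCF (an assumption for illustration).
TCF_SAMPLE = """
<TextCorpus lang="en">
  <tokens>
    <token ID="t1">Dogs</token>
    <token ID="t2">bark</token>
  </tokens>
  <lemmas>
    <lemma ID="l1" tokenIDs="t1">dog</lemma>
    <lemma ID="l2" tokenIDs="t2">bark</lemma>
  </lemmas>
</TextCorpus>
"""


def resolve_layer(root, layer_path):
    """Map each token ID to the text of the annotation that points at it."""
    resolved = {}
    for annotation in root.findall(layer_path):
        # An annotation may point at several tokens, separated by whitespace.
        for token_id in annotation.get("tokenIDs", "").split():
            resolved[token_id] = annotation.text
    return resolved


root = ET.fromstring(TCF_SAMPLE)
words = {tok.get("ID"): tok.text for tok in root.findall("tokens/token")}
lemmas = resolve_layer(root, "lemmas/lemma")

# Print each token next to the lemma that refers to it.
for token_id, word in words.items():
    print(token_id, word, lemmas.get(token_id))
```

The tokens themselves carry no annotation; everything else is attached by reference, which is why a converter has to join the layers on the token ID.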
tcf2teitok.pl
Command line options of the tool:
- debug: debugging mode
- output: name of the output file; if empty, output is written to STDOUT
- morerev: More revision statements
- file: filename of the input
The script converts the various blocks in the TCF file into their TEI(TOK) counterparts.
Not all parts are implemented; what is implemented at the moment is:
- tokens:token => converted to tok
- posTag:tag => converted to @pos on the corresponding token
- lemmas:lemma => converted to @lemma on the corresponding token
- orthography:correction => converted to @reg on the corresponding token
- dependencies:dependency => @gov_IDs converted to @head, and @class to @deprel, on the corresponding token
- parses:parse => converted to eTree
- sentences:sentences => converted to s with the corresponding tokens inside
- namedEntities:entity => converted to name with the corresponding tokens inside
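A few of the mappings listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl implementation of tcf2teitok.pl: the sample input is a simplified, namespace-free fragment, the element paths (`posTag/tag`, `sentences/sentence`), the `ID` and `tokenIDs` attributes, the `id` attribute on `tok`, the wrapping `text` element, and the function name `convert` are all hypothetical choices for the example.

```python
import xml.etree.ElementTree as ET

# Simplified input using the block names from the list above (layout is assumed).
TCF_SAMPLE = """
<TextCorpus lang="en">
  <tokens>
    <token ID="t1">Dogs</token>
    <token ID="t2">bark</token>
  </tokens>
  <posTag>
    <tag tokenIDs="t1">NNS</tag>
    <tag tokenIDs="t2">VBP</tag>
  </posTag>
  <lemmas>
    <lemma tokenIDs="t1">dog</lemma>
    <lemma tokenIDs="t2">bark</lemma>
  </lemmas>
  <sentences>
    <sentence tokenIDs="t1 t2"/>
  </sentences>
</TextCorpus>
"""

# Token-level layers and the tok attribute each one is copied to.
TOKEN_LAYERS = (("posTag/tag", "pos"), ("lemmas/lemma", "lemma"))


def convert(tcf_xml):
    """Build a small tree of s and tok elements from a TCF-like string."""
    root = ET.fromstring(tcf_xml)

    # tokens:token => one tok element per token, keyed by its ID.
    toks = {}
    for token in root.findall("tokens/token"):
        tok = ET.Element("tok", {"id": token.get("ID")})
        tok.text = token.text
        toks[token.get("ID")] = tok

    # Stand-off layers => attributes on the corresponding tok.
    for path, attr in TOKEN_LAYERS:
        for annotation in root.findall(path):
            for token_id in annotation.get("tokenIDs", "").split():
                if token_id in toks:
                    toks[token_id].set(attr, annotation.text or "")

    # Sentences => s elements with the corresponding tokens inside.
    # Tokens that belong to no sentence are dropped in this sketch.
    text = ET.Element("text")
    for sentence in root.findall("sentences/sentence"):
        s = ET.SubElement(text, "s")
        for token_id in sentence.get("tokenIDs", "").split():
            if token_id in toks:
                s.append(toks[token_id])
    return text


result = convert(TCF_SAMPLE)
print(ET.tostring(result, encoding="unicode"))
```

The other token-level mappings (@reg, @head, @deprel) would follow the same pattern of looking up the token by ID and setting an attribute on it.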
No export to TCF has been provided yet.