forked from longliveenduro/heideltime
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreadme.txt
395 lines (339 loc) · 21.6 KB
/
readme.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
#######################
# UIMA HEIDELTIME KIT #
#######################
Author: Jannik Strötgen
Date: October 4, 2016
Version: 2.2
eMail: [email protected]
###################################
# 1. Papers describing HeidelTime #
###################################
HeidelTime was used in the TempEval-2 challenge as described in:
Jannik Strötgen and Michael Gertz.
HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions.
In: SemEval-2010: Proceedings of the 5th International Workshop on Semantic Evaluation.
Pages 321-324, Uppsala, Sweden, July 15-16, 2010. ACL.
http://www.aclweb.org/anthology/S/S10
In "Language Resources and Evaluation", we have published a paper on "Multilingual and
Cross-domain Temporal Tagging". Here, we detail the features and architecture of
HeidelTime:
Jannik Strötgen and Michael Gertz.
Multilingual and Cross-domain Temporal Tagging.
In: Language Resources and Evaluation, 47(2), pages 269-298, 2013, Springer.
http://link.springer.com/article/10.1007%2Fs10579-012-9179-y
Please cite one of these papers if you use HeidelTime. Or, alternatively:
If you use HeidelTime for processing colloquial text (such as SMS or tweets) or scientific
publications (e.g., biomedical studies), you may want to cite the following paper instead:
Jannik Strötgen and Michael Gertz:
Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards.
In: LREC 2012: Proceedings of the 8th International Conference on Language Resources
and Evaluation. Pages 3746-3753, Istanbul, Turkey, May 21-27, 2012. ELRA.
http://www.lrec-conf.org/proceedings/lrec2012/pdf/425_Paper.pdf
Starting with version 2.0, HeidelTime contains also automatically created resources. If you
use them, please cite the following paper:
Jannik Strötgen and Michael Gertz:
A Baseline Temporal Tagger for All Languages.
In EMNLP: Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing. Pages 541-547, Lisbon, Portugal, September 17-21, 2015. ACL.
Starting with version 2.2, HeidelTime can be used for temponym tagging.
If you use HeidelTime as a temponym tagger, please cite the following paper:
Erdal Kuzey, Jannik Strötgen, Vinay Setty, and Gerhard Weikum:
Temponym Tagging: Temporal Scopes for Textual Phrases.
In TempWeb: Proceedings of the 6th Temporal Web Analytics Workshop. Pages 841-842, Montreal,
Canada, April 12, 2016. ACM.
###################
# 2. Introduction #
###################
HeidelTime is a multilingual temporal tagger that extracts temporal expressions from documents
and normalizes them according to the TIMEX3 annotation standard, which is part of the mark-up
language TimeML. HeidelTime uses different normalization strategies depending on the domain of
the documents that are to be processed (news, narratives, scientific or colloquial texts). It
is a rule-based system and due to its architectural feature that the source code and the
resources (patterns, normalization information, and rules) are strictly separated, one can
simply develop resources for additional languages using HeidelTime's well-defined rule syntax.
HeidelTime was the best system for the extraction and normalization of English temporal
expressions from documents in the TempEval-2 challenge in 2010. Furthermore, it is evaluated on
several additional corpora, as described in our papers, e.g., in "Multilingual Cross-domain
Temporal Tagging" (http://www.springerlink.com/content/64767752451075k8/). In TempEval-3,
HeidelTime achieved the best results for the combination of extraction and normalization for
English and Spanish. In the EVENTI competition of EVALITA 2014, HeidelTime (version 1.8) achieved
the best results for Italian temporal tagging.
HeidelTime with resources for several languages is one component of our UIMA HeidelTime kit.
- German
- English (and Englischcoll (for colloquial texts), Englishsci (for scientific texts))
- Dutch (kindly provided by Matje van de Camp, Tilburg University,
http://www.tilburguniversity.edu/webwijs/show/?uid=m.m.v.d.camp)
- Arabic
- Vietnamese
- Spanish
- Italian
- French (kindly provided by Véronique Moriceau, LIMSI - CNRS, http://vero.moriceau.free.fr/)
- Chinese
- Russian (a preliminary version was kindly shared by Elena Klyachko)
- Croatian (kindly provided by Luka Skukan)
- Estonian
- Portuguese (kindly provided by Zunsik Lim)
Starting with version 2.0, HeidelTime contains automatically created resources for 200+
languages. It can thus be used as a baseline temporal tagger for all these languages or
as a starting point for developing more sophisticated temporal tagging capabilities for
languages which have not been addressed so far.
Additionally, whilst expanding the set of domains that HeidelTime can recognize temporal
expressions in, English resources for colloquial as well as scientific style documents were
developed. Colloquial documents are for example SMS or Twitter messages where language is
abbreviated to fit a length constraint and non-standard language is common. Scientific documents
include for example biomedical studies, in which temporal information is often given relative
to a document-specific "local" time frame.
In addition to HeidelTime with resources for the respective languages, the UIMA HeidelTime kit
contains collection readers, cas consumers, and analysis engines to process temporal annotated
corpora and reproduce HeidelTime's evaluation results on these corpora. The HeidelTime kit
contains:
* ACE Tern Reader: This Collection Reader reads corpora of the ACE Tern style and annotates
the document creation time. Using the corpora preparation script (for details, see below),
several corpora can be processed, e.g.: ACE Tern 2004 training, WikiWars, and WikiWars_DE.
* TempEval-2 Reader: This Collection Reader reads the TempEval-2 input data of the training
and the evaluation sets and annotates the document creation time as well as token and
sentence information.
* TempEval-3 Reader: This Collection Reader reads the TempEval-3 input TimeML data of the
training and the evaluation sets and annotates the document creation time as well as some
meta information.
* Eventi 2014 Reader: This Collection Reader reads the Eventi 2014 input data of the
training and the evaluation sets and annotates the tokenization information.
* AllLanguagesTokenizer: This Analysis Engine produces Token and Sentence annotations. It is a
simple yet generic tool and should be used for languages which are not supported by any other
preprocessing tool, i.e., for most of the languages for which HeidelTime resources have been
automatically created.
* TreeTaggerWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations
required by HeidelTime by using the multilingual TreeTagger tool.
* HunPosTaggerWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations
required by HeidelTime by using the HunPosTagger tool.
* StanfordPOSTaggerWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech
annotations required by HeidelTime by using the multilingual Stanford POS Tagger.
* JVnTextProWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations
required by HeidelTime by using the JVnTextPro tool for documents in Vietnamese.
* Annotation Translator: This Analysis Engine translates Sentence, Token, and Part-of-Speech
annotations of one type system into HeidelTime's type system.
* HeidelTime: Possible parameter values are:
- languages: english, englishcoll, englishsci, german, spanish, italian, vietnamese, arabic,
dutch, chinese, french, russian, croatian, estonian, portuguese
as well as any "auto-LANGUAGE" contained in the resource folder
- types: news, narratives, colloquial (for use with englishcoll), scientific (englishsci)
- locale: the locale to use for date calculation. Leave it empty to use en_GB.
- Debugging: to output verbose debugging information to stderr.
- Date / Time / Duration / Set / Temponym: if respective temporal expressions shall be
extracted
* IntervalTagger: This Analysis Engine in conjunction with HeidelTime recognizes
temporal intervals in documents.
* ACE Tern Writer: This CAS Consumer creates output data as needed to run the official ACE
Tern evaluation scripts.
* TempEval-2 Writer: This CAS Consumer creates two files needed to evaluate the tasks of
extracting and evaluating temporal expressions of the TempEval-2 challenge using the
official evaluation scripts.
* TempEval-3 Writer: This CAS Consumer writes annotated TimeML files that were read in by
the TempEval-3 Reader and processed by HeidelTime in the format required by the TempEval-3
evaluation scripts.
* Eventi 2014 Writer: This CAS Consumer writes annotated Eventi 2014 files that were read in by
the Eventi 2014 Reader and processed by HeidelTime in the format required by the Eventi 2014
evaluation script.
######################
# 3. Getting started #
######################
Most of the descriptions are tested and written for Linux, e.g., how to set environment
variables. If you do not use Linux, make sure that you carry out analogous steps, e.g.,
set the environment variables.
1. UIMA (if you already use UIMA, you can skip this step):
To be able to use HeidelTime, you have to install UIMA:
* Download UIMA:
- either from http://uima.apache.org/downloads.cgi or
- wget http://archive.apache.org/dist/uima/uimaj-2.6.0/uimaj-2.6.0-bin.tar.gz
* Extract UIMA:
- tar xvfz uimaj-2.6.0-bin.tar.gz
* Set environment variable (you can set variables globally, e.g., in your $HOME/.bashrc)
- set UIMA_HOME to the path of your "apache-uima" folder
* export UIMA_HOME="$(pwd)/apache-uima"
- make sure that JAVA_HOME is set correctly
- add the "$UIMA_HOME/bin" to your PATH
* export PATH=$PATH:$UIMA_HOME/bin
* Adjust the UIMA's example paths:
- $UIMA_HOME/bin/adjustExamplePaths.sh
* For further information about UIMA, see http://uima.apache.org/
2. Download and install the UIMA HeidelTime kit
* download the latest heideltime-kit from
https://github.com/HeidelTime/heideltime/releases
* unzip or untar the heideltime-kit into a path called HEIDELTIME_HOME from hereon out.
* set the environment variable HEIDELTIME_HOME (you can set these variables globally,
e.g., in your $HOME/.bashrc):
- export HEIDELTIME_HOME='/path/to/heideltime/'
3. HeidelTime requires sentence, token, and part-of speech annotations. We have developed
our own wrapper for the popular TreeTagger tool that will support any language for which
there are parameter files available.
However, you may use other tools (if so, you either have to adjust the analysis engine
"Annotation Translator", which is part of the UIMA HeidelTime kit or you may adjust the
imports in HeidelTime.java itself. Furthermore, if a differing tag set is used, all rules
containing part-of-speech information have to be adapted).
To process English, German, Dutch, Spanish, Italian, French, Chinese or Russian documents,
the TreeTaggerWrapper can be used for pre-processing:
* Download the TreeTagger and its tagging scripts, installation scripts, as well as
English, German, and Dutch (and all required) parameter files into one directory from:
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
- mkdir treetagger
- cd treetagger
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.1.tar.gz
- or alternatively: wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old.tar.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/install-tagger.sh
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/german-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/english-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/portuguese-par-linux-3.2-utf8.bin.gz
- wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/estonian-par-linux-3.2-utf8.bin.gz
Attention: If you do not use Linux, please download all TreeTagger files directly from
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
* (OPTIONAL) For Chinese documents, please get the Tokenizer and TreeTagger parameter file
from Serge Sharoff's page http://corpus.leeds.ac.uk/tools/zh/:
- wget http://corpus.leeds.ac.uk/tools/zh/tt-lcmc.tgz
- wget https://drive.google.com/uc?id=0B1ZoOwaeRsbva2F3NThLd3ptRWM -O zh-tokenise.tgz
Extract the Tokenizer into a new directory and TreeTagger parameter files like this:
- mkdir chinese-tokenizer
- tar -xzvf tt-lcmc.tgz
- tar -xzvf zh-tokenise.tgz -C chinese-tokenizer
* (OPTIONAL) For Russian documents, please grab a copy of the Russian parameter file from
Serge Sharoff's page at http://corpus.leeds.ac.uk/mocky/ and extract it into TreeTagger's
lib/-folder:
- cd /path/to/treetagger/
- mkdir lib && cd lib
- wget http://corpus.leeds.ac.uk/mocky/russian.par.gz
- gunzip russian.par.gz
* Install the TreeTagger
- sh install-tagger.sh
* Set environment variables (you can set variables permanently, e.g., in your $HOME/.bashrc)
and then source the environment.
- export TREETAGGER_HOME='path to TreeTagger'
- source $HEIDELTIME_HOME/metadata/setenv
For further information on the TreeTagger, take a look at its documentation and our wiki
page for it: https://github.com/HeidelTime/heideltime/wiki/TreeTaggerWrapper
To process Vietnamese documents, we have developed the JVnTextProWrapper Analysis
Engine. It makes use of the JVnTextPro tool. To use it, follow these steps:
* Download JVnTextPro and unpack it:
- wget http://sourceforge.net/projects/jvntextpro/files/latest/download -O JVnTextPro.zip
- unzip JVnTextPro.zip
* Set the relevant environment variable, then source the environment to construct
the CLASSPATH.
- export JVNTEXTPRO_HOME='<path to JVnTextPro>/bin'
- source $HEIDELTIME_HOME/metadata/setenv
Further information about JVnTextPro can be found on our Wiki page for the Engine:
https://github.com/HeidelTime/heideltime/wiki/JVnTextProWrapper
To process Arabic documents, we have developed the Stanford POS Tagger Wrapper
Analysis Engine. It utilizes the Stanford POS Tagger. Follow these instructions
to set it up:
* Download the Stanford POS Tagger *Full Package* from this URL and unzip it:
http://nlp.stanford.edu/software/tagger.shtml
- wget http://nlp.stanford.edu/software/stanford-postagger-full-2014-01-04.zip
- unzip stanford-postagger-full-2014-01-04.zip
* Set the relevant environment variable, then source the environment to construct
the CLASSPATH.
- export STANFORDTAGGER='path to stanford-postagger-<version>.jar'
- source $HEIDELTIME_HOME/metadata/setenv
For more information on the Stanford POS Tagger Wrapper, see our Wiki page:
https://github.com/HeidelTime/heideltime/wiki/StanfordPOSTaggerWrapper
To process Croatian documents, Luka Skukan has developed a Wrapper for the
HunPosTagger. You will need to get a copy of the HunPos tagger as well
as a Croatian model file for it:
From https://code.google.com/p/hunpos/downloads/list, download hunpos-1.0:
- wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
From http://nlp.ffzg.hr/resources/models/tagging/, download the model file:
- wget http://nlp.ffzg.hr/data/tagging/model.hunpos.mte5.defnpout.tar.gz
Extract both of the files into the same directory
- tar -xzvf hunpos-1.0-linux.tgz
- tar -xzvf model.hunpos.mte5.defnpout.tar.gz -C hunpos-1.0-linux/
You will need to enter the full path of the hunpos-1.0-linux directory in the
HunPosTaggerWrapper.
To process any of the language with automatically created resources, you can use
the AllLanguagesTokenizer, which is part of the heideltime kit. It is a simple
(whitespace-based) yet generic tool and creaetes sentence and token annotation.
For sample UIMA workflows for many of the supported languages, please take a look
at our evaluation results reproduction Wiki page:
https://github.com/HeidelTime/heideltime/wiki/Reproducing-Evaluation-Results
and select a workflow description for a corpus of the language of your choice.
#########################
# 4. Testing HeidelTime #
#########################
1. source the environment and copy the resources into the CLASSPATH
* source $HEIDELTIME_HOME/metadata/setenv
* cd $HEIDELTIME_HOME/resources && sh printResourceInformation.sh
2. run cpeGui.sh and create a workflow
* cpeGui.sh
* create a workflow with the following components:
Collection reader:
- UIMA's file system collection reader:
$UIMA_HOME/examples/descriptors/collection_reader/FileSystemCollectionReader.xml
set "Input directory" to $HEIDELTIME_HOME/doc/
Analysis Engines
- TreeTaggerWrapper located at
HEIDELTIME_HOME/desc/annotator/TreeTaggerWrapper.xml
set "Language" to "english"
set "Annotate_tokens" to "true"
set "Annotate_partofspeech" to "true"
set "Annotate_sentences" to "true"
set "Improvegermansentences" to "false"
- HeidelTime located at
HEIDELTIME_HOME/desc/annotator/HeidelTime.xml
set "Date" to "true"
set "Time" to "true"
set "Duration" to "true"
set "Set" to "true"
set "Temponym" to "false"
set "Language" to "english"
set "Type" to "narratives"
CAS Consumer
- UIMA's XMI Writer CAS Consumer located at
$UIMA_HOME/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml
set "Output Directory" to OUTPUT
* (save the workflow)
* run the workflow
###########################################################
# 5. Analyze the results using the UIMA annotation viewer #
###########################################################
To analyze the annotations produced by HeidelTime you may use UIMA's annotation viewer:
* annotationViewer.sh
set "Input Directory" to "OUTPUT"
set TypeSystem or AE Descriptor File" to "$HEIDELTIME_HOME/desc/type/HeidelTime_TypeSystem.xml"
* focus the analysis on Section 6 of the "readme.txt" file.
####################################################################
# 6. What kind of temporal expressions can be found and normalized #
####################################################################
HeidelTime distinguishes between four types of documents: news style, narrative
style, colloquial style and scientific style documents. In addition, temponyms such as "John
F. Kennedy's death" can be identified and normalized if "temponym" is set to true. This file
here is a narrative document. This version of HeidelTime was released in 2016. To be more
precise, and using a relative expression, it was released on Ocotber 4. HeidelTime was the best
performing system of task A of the TempEval-2 challenge in 2010 and the system was presented at the
TempEval Workshop at the ACL conference in Uppsala, Sweden on July 15, 2010 or July 16.
In the meantime, it is October 2016 and HeidelTime is made publicly available and identifies
these temporal expressions: January 22, 2001 or twice a week.
##########################################
# 7. Additional HeidelTime documentation #
##########################################
HeidelTime's GitHub Project contains a lot of valuable information on how to use
HeidelTime or its components, as well as additional resources, an always up-to-date
code repository and issue tracker in case you spot a bug.
Visit the project at
https://github.com/HeidelTime/heideltime
#######################################################################
# 8. Reproducing HeidelTime's evaluation results on different corpora #
#######################################################################
To reproduce HeidelTime's evaluation results reported in in our papers, e.g., in
"Multilingual Cross-domain Temporal Tagging", follow the instructions on:
https://github.com/HeidelTime/heideltime/wiki/Reproducing-Evaluation-Results
##############
# 9. License #
##############
Copyright (c) 2012-2016, Database Research Group, Institute of Computer Science, Heidelberg University.
All rights reserved. This program and the accompanying materials
are made available under the terms of the GNU General Public License.
author: Jannik Strötgen
email: [email protected]
HeidelTime is a multilingual, cross-domain temporal tagger.
For details, see http://dbs.ifi.uni-heidelberg.de/heideltime