Files
doc
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
####################### # UIMA HEIDELTIME KIT # ####################### Author: Jannik Strötgen Date: October 4, 2016 Version: 2.2 eMail: [email protected] ################################### # 1. Papers describing HeidelTime # ################################### HeidelTime was used in the TempEval-2 challenge as described in: Jannik Strötgen and Michael Gertz. HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. In: SemEval-2010: Proceedings of the 5th International Workshop on Semantic Evaluation. Pages 321-324, Uppsala, Sweden, July 15-16, 2010. ACL. http://www.aclweb.org/anthology/S/S10 In "Language Resources and Evaluation", we have published a paper on "Multilingual and Cross-domain Temporal Tagging". Here, we detail the features and architecture of HeidelTime: Jannik Strötgen and Michael Gertz. Multilingual and Cross-domain Temporal Tagging. In: Language Resources and Evaluation, 47(2), pages 269-298, 2013, Springer. http://link.springer.com/article/10.1007%2Fs10579-012-9179-y Please cite one of these papers if you use HeidelTime. Or, alternatively: If you use HeidelTime for processing colloquial text (such as SMS or tweets) or scientific publications (e.g., biomedical studies), you may want to cite the following paper instead: Jannik Strötgen and Michael Gertz: Temporal Tagging on Different Domains: Challenges, Strategies, and Gold Standards. In: LREC 2012: Proceedings of the 8th International Conference on Language Resources and Evaluation. Pages 3746-3753, Istanbul, Turkey, May 21-27, 2012. ELRA. http://www.lrec-conf.org/proceedings/lrec2012/pdf/425_Paper.pdf Starting with version 2.0, HeidelTime contains also automatically created resources. If you use them, please cite the following paper: Jannik Strötgen and Michael Gertz: A Baseline Temporal Tagger for All Languages. In EMNLP: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Pages 541-547, Lisbon, Portugal, September 17-21, 2015. ACL. Starting with version 2.2, HeidelTime can be used for temponym tagging. If you use HeidelTime as a temponym tagger, please cite the following paper: Erdal Kuzey, Jannik Strötgen, Vinay Setty, and Gerhard Weikum: Temponym Tagging: Temporal Scopes for Textual Phrases. In TempWeb: Proceedings of the 6th Temporal Web Analytics Workshop. Pages 841-842, Montreal, Canada, April 12, 2016. ACM. ################### # 2. Introduction # ################### HeidelTime is a multilingual temporal tagger that extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard, which is part of the mark-up language TimeML. HeidelTime uses different normalization strategies depending on the domain of the documents that are to be processed (news, narratives, scientific or colloquial texts). It is a rule-based system and due to its architectural feature that the source code and the resources (patterns, normalization information, and rules) are strictly separated, one can simply develop resources for additional languages using HeidelTime's well-defined rule syntax. HeidelTime was the best system for the extraction and normalization of English temporal expressions from documents in the TempEval-2 challenge in 2010. Furthermore, it is evaluated on several additional corpora, as described in our papers, e.g., in "Multilingual Cross-domain Temporal Tagging" (http://www.springerlink.com/content/64767752451075k8/). In TempEval-3, HeidelTime achieved the best results for the combination of extraction and normalization for English and Spanish. In the EVENTI competition of EVALITA 2014, HeidelTime (version 1.8) achieved the best results for Italian temporal tagging. HeidelTime with resources for several languages is one component of our UIMA HeidelTime kit. - German - English (and Englischcoll (for colloquial texts), Englishsci (for scientific texts)) - Dutch (kindly provided by Matje van de Camp, Tilburg University, http://www.tilburguniversity.edu/webwijs/show/?uid=m.m.v.d.camp) - Arabic - Vietnamese - Spanish - Italian - French (kindly provided by Véronique Moriceau, LIMSI - CNRS, http://vero.moriceau.free.fr/) - Chinese - Russian (a preliminary version was kindly shared by Elena Klyachko) - Croatian (kindly provided by Luka Skukan) - Estonian - Portuguese (kindly provided by Zunsik Lim) Starting with version 2.0, HeidelTime contains automatically created resources for 200+ languages. It can thus be used as a baseline temporal tagger for all these languages or as a starting point for developing more sophisticated temporal tagging capabilities for languages which have not been addressed so far. Additionally, whilst expanding the set of domains that HeidelTime can recognize temporal expressions in, English resources for colloquial as well as scientific style documents were developed. Colloquial documents are for example SMS or Twitter messages where language is abbreviated to fit a length constraint and non-standard language is common. Scientific documents include for example biomedical studies, in which temporal information is often given relative to a document-specific "local" time frame. In addition to HeidelTime with resources for the respective languages, the UIMA HeidelTime kit contains collection readers, cas consumers, and analysis engines to process temporal annotated corpora and reproduce HeidelTime's evaluation results on these corpora. The HeidelTime kit contains: * ACE Tern Reader: This Collection Reader reads corpora of the ACE Tern style and annotates the document creation time. Using the corpora preparation script (for details, see below), several corpora can be processed, e.g.: ACE Tern 2004 training, WikiWars, and WikiWars_DE. * TempEval-2 Reader: This Collection Reader reads the TempEval-2 input data of the training and the evaluation sets and annotates the document creation time as well as token and sentence information. * TempEval-3 Reader: This Collection Reader reads the TempEval-3 input TimeML data of the training and the evaluation sets and annotates the document creation time as well as some meta information. * Eventi 2014 Reader: This Collection Reader reads the Eventi 2014 input data of the training and the evaluation sets and annotates the tokenization information. * AllLanguagesTokenizer: This Analysis Engine produces Token and Sentence annotations. It is a simple yet generic tool and should be used for languages which are not supported by any other preprocessing tool, i.e., for most of the languages for which HeidelTime resources have been automatically created. * TreeTaggerWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations required by HeidelTime by using the multilingual TreeTagger tool. * HunPosTaggerWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations required by HeidelTime by using the HunPosTagger tool. * StanfordPOSTaggerWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations required by HeidelTime by using the multilingual Stanford POS Tagger. * JVnTextProWrapper: This Analysis Engine produces Token, Sentence and Part-of-Speech annotations required by HeidelTime by using the JVnTextPro tool for documents in Vietnamese. * Annotation Translator: This Analysis Engine translates Sentence, Token, and Part-of-Speech annotations of one type system into HeidelTime's type system. * HeidelTime: Possible parameter values are: - languages: english, englishcoll, englishsci, german, spanish, italian, vietnamese, arabic, dutch, chinese, french, russian, croatian, estonian, portuguese as well as any "auto-LANGUAGE" contained in the resource folder - types: news, narratives, colloquial (for use with englishcoll), scientific (englishsci) - locale: the locale to use for date calculation. Leave it empty to use en_GB. - Debugging: to output verbose debugging information to stderr. - Date / Time / Duration / Set / Temponym: if respective temporal expressions shall be extracted * IntervalTagger: This Analysis Engine in conjunction with HeidelTime recognizes temporal intervals in documents. * ACE Tern Writer: This CAS Consumer creates output data as needed to run the official ACE Tern evaluation scripts. * TempEval-2 Writer: This CAS Consumer creates two files needed to evaluate the tasks of extracting and evaluating temporal expressions of the TempEval-2 challenge using the official evaluation scripts. * TempEval-3 Writer: This CAS Consumer writes annotated TimeML files that were read in by the TempEval-3 Reader and processed by HeidelTime in the format required by the TempEval-3 evaluation scripts. * Eventi 2014 Writer: This CAS Consumer writes annotated Eventi 2014 files that were read in by the Eventi 2014 Reader and processed by HeidelTime in the format required by the Eventi 2014 evaluation script. ###################### # 3. Getting started # ###################### Most of the descriptions are tested and written for Linux, e.g., how to set environment variables. If you do not use Linux, make sure that you carry out analogous steps, e.g., set the environment variables. 1. UIMA (if you already use UIMA, you can skip this step): To be able to use HeidelTime, you have to install UIMA: * Download UIMA: - either from http://uima.apache.org/downloads.cgi or - wget http://archive.apache.org/dist/uima/uimaj-2.6.0/uimaj-2.6.0-bin.tar.gz * Extract UIMA: - tar xvfz uimaj-2.6.0-bin.tar.gz * Set environment variable (you can set variables globally, e.g., in your $HOME/.bashrc) - set UIMA_HOME to the path of your "apache-uima" folder * export UIMA_HOME="$(pwd)/apache-uima" - make sure that JAVA_HOME is set correctly - add the "$UIMA_HOME/bin" to your PATH * export PATH=$PATH:$UIMA_HOME/bin * Adjust the UIMA's example paths: - $UIMA_HOME/bin/adjustExamplePaths.sh * For further information about UIMA, see http://uima.apache.org/ 2. Download and install the UIMA HeidelTime kit * download the latest heideltime-kit from https://github.com/HeidelTime/heideltime/releases * unzip or untar the heideltime-kit into a path called HEIDELTIME_HOME from hereon out. * set the environment variable HEIDELTIME_HOME (you can set these variables globally, e.g., in your $HOME/.bashrc): - export HEIDELTIME_HOME='/path/to/heideltime/' 3. HeidelTime requires sentence, token, and part-of speech annotations. We have developed our own wrapper for the popular TreeTagger tool that will support any language for which there are parameter files available. However, you may use other tools (if so, you either have to adjust the analysis engine "Annotation Translator", which is part of the UIMA HeidelTime kit or you may adjust the imports in HeidelTime.java itself. Furthermore, if a differing tag set is used, all rules containing part-of-speech information have to be adapted). To process English, German, Dutch, Spanish, Italian, French, Chinese or Russian documents, the TreeTaggerWrapper can be used for pre-processing: * Download the TreeTagger and its tagging scripts, installation scripts, as well as English, German, and Dutch (and all required) parameter files into one directory from: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ - mkdir treetagger - cd treetagger - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.1.tar.gz - or alternatively: wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old.tar.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/install-tagger.sh - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/german-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/english-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/dutch-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/italian-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/spanish-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/portuguese-par-linux-3.2-utf8.bin.gz - wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/estonian-par-linux-3.2-utf8.bin.gz Attention: If you do not use Linux, please download all TreeTagger files directly from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ * (OPTIONAL) For Chinese documents, please get the Tokenizer and TreeTagger parameter file from Serge Sharoff's page http://corpus.leeds.ac.uk/tools/zh/: - wget http://corpus.leeds.ac.uk/tools/zh/tt-lcmc.tgz - wget https://drive.google.com/uc?id=0B1ZoOwaeRsbva2F3NThLd3ptRWM -O zh-tokenise.tgz Extract the Tokenizer into a new directory and TreeTagger parameter files like this: - mkdir chinese-tokenizer - tar -xzvf tt-lcmc.tgz - tar -xzvf zh-tokenise.tgz -C chinese-tokenizer * (OPTIONAL) For Russian documents, please grab a copy of the Russian parameter file from Serge Sharoff's page at http://corpus.leeds.ac.uk/mocky/ and extract it into TreeTagger's lib/-folder: - cd /path/to/treetagger/ - mkdir lib && cd lib - wget http://corpus.leeds.ac.uk/mocky/russian.par.gz - gunzip russian.par.gz * Install the TreeTagger - sh install-tagger.sh * Set environment variables (you can set variables permanently, e.g., in your $HOME/.bashrc) and then source the environment. - export TREETAGGER_HOME='path to TreeTagger' - source $HEIDELTIME_HOME/metadata/setenv For further information on the TreeTagger, take a look at its documentation and our wiki page for it: https://github.com/HeidelTime/heideltime/wiki/TreeTaggerWrapper To process Vietnamese documents, we have developed the JVnTextProWrapper Analysis Engine. It makes use of the JVnTextPro tool. To use it, follow these steps: * Download JVnTextPro and unpack it: - wget http://sourceforge.net/projects/jvntextpro/files/latest/download -O JVnTextPro.zip - unzip JVnTextPro.zip * Set the relevant environment variable, then source the environment to construct the CLASSPATH. - export JVNTEXTPRO_HOME='<path to JVnTextPro>/bin' - source $HEIDELTIME_HOME/metadata/setenv Further information about JVnTextPro can be found on our Wiki page for the Engine: https://github.com/HeidelTime/heideltime/wiki/JVnTextProWrapper To process Arabic documents, we have developed the Stanford POS Tagger Wrapper Analysis Engine. It utilizes the Stanford POS Tagger. Follow these instructions to set it up: * Download the Stanford POS Tagger *Full Package* from this URL and unzip it: http://nlp.stanford.edu/software/tagger.shtml - wget http://nlp.stanford.edu/software/stanford-postagger-full-2014-01-04.zip - unzip stanford-postagger-full-2014-01-04.zip * Set the relevant environment variable, then source the environment to construct the CLASSPATH. - export STANFORDTAGGER='path to stanford-postagger-<version>.jar' - source $HEIDELTIME_HOME/metadata/setenv For more information on the Stanford POS Tagger Wrapper, see our Wiki page: https://github.com/HeidelTime/heideltime/wiki/StanfordPOSTaggerWrapper To process Croatian documents, Luka Skukan has developed a Wrapper for the HunPosTagger. You will need to get a copy of the HunPos tagger as well as a Croatian model file for it: From https://code.google.com/p/hunpos/downloads/list, download hunpos-1.0: - wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz From http://nlp.ffzg.hr/resources/models/tagging/, download the model file: - wget http://nlp.ffzg.hr/data/tagging/model.hunpos.mte5.defnpout.tar.gz Extract both of the files into the same directory - tar -xzvf hunpos-1.0-linux.tgz - tar -xzvf model.hunpos.mte5.defnpout.tar.gz -C hunpos-1.0-linux/ You will need to enter the full path of the hunpos-1.0-linux directory in the HunPosTaggerWrapper. To process any of the language with automatically created resources, you can use the AllLanguagesTokenizer, which is part of the heideltime kit. It is a simple (whitespace-based) yet generic tool and creaetes sentence and token annotation. For sample UIMA workflows for many of the supported languages, please take a look at our evaluation results reproduction Wiki page: https://github.com/HeidelTime/heideltime/wiki/Reproducing-Evaluation-Results and select a workflow description for a corpus of the language of your choice. ######################### # 4. Testing HeidelTime # ######################### 1. source the environment and copy the resources into the CLASSPATH * source $HEIDELTIME_HOME/metadata/setenv * cd $HEIDELTIME_HOME/resources && sh printResourceInformation.sh 2. run cpeGui.sh and create a workflow * cpeGui.sh * create a workflow with the following components: Collection reader: - UIMA's file system collection reader: $UIMA_HOME/examples/descriptors/collection_reader/FileSystemCollectionReader.xml set "Input directory" to $HEIDELTIME_HOME/doc/ Analysis Engines - TreeTaggerWrapper located at HEIDELTIME_HOME/desc/annotator/TreeTaggerWrapper.xml set "Language" to "english" set "Annotate_tokens" to "true" set "Annotate_partofspeech" to "true" set "Annotate_sentences" to "true" set "Improvegermansentences" to "false" - HeidelTime located at HEIDELTIME_HOME/desc/annotator/HeidelTime.xml set "Date" to "true" set "Time" to "true" set "Duration" to "true" set "Set" to "true" set "Temponym" to "false" set "Language" to "english" set "Type" to "narratives" CAS Consumer - UIMA's XMI Writer CAS Consumer located at $UIMA_HOME/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml set "Output Directory" to OUTPUT * (save the workflow) * run the workflow ########################################################### # 5. Analyze the results using the UIMA annotation viewer # ########################################################### To analyze the annotations produced by HeidelTime you may use UIMA's annotation viewer: * annotationViewer.sh set "Input Directory" to "OUTPUT" set TypeSystem or AE Descriptor File" to "$HEIDELTIME_HOME/desc/type/HeidelTime_TypeSystem.xml" * focus the analysis on Section 6 of the "readme.txt" file. #################################################################### # 6. What kind of temporal expressions can be found and normalized # #################################################################### HeidelTime distinguishes between four types of documents: news style, narrative style, colloquial style and scientific style documents. In addition, temponyms such as "John F. Kennedy's death" can be identified and normalized if "temponym" is set to true. This file here is a narrative document. This version of HeidelTime was released in 2016. To be more precise, and using a relative expression, it was released on Ocotber 4. HeidelTime was the best performing system of task A of the TempEval-2 challenge in 2010 and the system was presented at the TempEval Workshop at the ACL conference in Uppsala, Sweden on July 15, 2010 or July 16. In the meantime, it is October 2016 and HeidelTime is made publicly available and identifies these temporal expressions: January 22, 2001 or twice a week. ########################################## # 7. Additional HeidelTime documentation # ########################################## HeidelTime's GitHub Project contains a lot of valuable information on how to use HeidelTime or its components, as well as additional resources, an always up-to-date code repository and issue tracker in case you spot a bug. Visit the project at https://github.com/HeidelTime/heideltime ####################################################################### # 8. Reproducing HeidelTime's evaluation results on different corpora # ####################################################################### To reproduce HeidelTime's evaluation results reported in in our papers, e.g., in "Multilingual Cross-domain Temporal Tagging", follow the instructions on: https://github.com/HeidelTime/heideltime/wiki/Reproducing-Evaluation-Results ############## # 9. License # ############## Copyright (c) 2012-2016, Database Research Group, Institute of Computer Science, Heidelberg University. All rights reserved. This program and the accompanying materials are made available under the terms of the GNU General Public License. author: Jannik Strötgen email: [email protected] HeidelTime is a multilingual, cross-domain temporal tagger. For details, see http://dbs.ifi.uni-heidelberg.de/heideltime