NewsReader is a natural language processing pipeline. Among other things, it tags parts of speech, recognizes named entities, and annotates entities with predicates.
There are a number of implementations of the NewsReader pipeline:
- POAS: pipeline-on-a-stick.
- cltl/nlpp: contains a script that constructs the pipeline (EN+NL) from components.
- vmc-from-scratch: creates a VM with the Dutch version of NewsReader.
- newsreader-docker: a Docker image for setting up a NewsReader server.
At the moment, none of these implementations successfully builds the whole pipeline for Dutch (see issues tracker). We have therefore decided to build the pipeline from individual modules.
We have imported all modules from NewsReader under the heading "Dutch modules":
- tokenization: splits text into tokens (words and punctuation symbols) (wiki).
- part-of-speech-tagging: tags words with grammatical categories such as 'noun' and 'verb' (wiki).
- named-entity-recognition: recognizes words as named entities such as 'Holland' (wiki).
- named-entity-disambiguation: selects the most likely entity when a name can refer to more than one (wiki).
- word-sense-disambiguation: selects the most likely meaning of individual words (wiki).
- time-expression-recognition: recognizes temporal expressions, such as "last week" (wiki, Heideltime).
- ontological-tagger: tags words with predicates, recognizes equivalent semantic frames and identifies events.
- semantic-role-labeling: assigns roles to agents, such as 'murderer' and 'murdered' (wiki, additional-roles).
- event-coreference: determines whether two recognized events refer to the same event (wiki).
- opinion-miner: detects whether a statement contains an opinion.
These modules depend on the following software packages:
- KafNafParserPy: a parser for KAF/NAF files in Python.
- vua-resources: a package with utility functions of the Computational Lexicology & Terminology Lab.
- Alpino: a dependency parser for Dutch text.
- dbpedia-spotlight: tool for annotating mentions of DBpedia resources (more info).
- libsvm: library of support vector machines.
- svmlight: library of support vector machines.
- timbl: Tilburg Memory-Based Learner, containing classifiers for symbolic feature spaces.
The goal is to construct a lightweight, portable pipeline, which we achieve through a Docker image. This image is available from Docker Hub and can be obtained by pulling:
docker pull evidence/newsreaderdutch
If you would like to make changes and build the image yourself, run:
docker image build -t newsreaderdutch NewsReaderDutch/
from within the root of the repository.
The Docker container can be run directly on your text files by calling:
docker run -v /workspace/:/work/ newsreaderdutch /work/file.txt
where /workspace/ is your local directory containing the files to be processed and file.txt is the document that you would like to have annotated. The output has the same filename, but with a *.naf extension. Currently, the pipeline also writes the output of each module separately.
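To annotate several documents, the single-file command above can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the function names `docker_cmd` and `annotate_all`, the `IMAGE` and `DRY_RUN` variables, and the choice of one container run per file are all assumptions made for illustration. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: annotate every .txt file in a local directory, one container run per file.
# Assumptions (not from the README): the helper names, the IMAGE and DRY_RUN
# variables, and running one container per document.

IMAGE="${IMAGE:-newsreaderdutch}"   # set to evidence/newsreaderdutch for the pulled image
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print the commands, 0 = execute them

# Print the docker command for one input file, given the directory and the file's basename.
docker_cmd() {
    local workspace="$1"
    local name="$2"
    printf 'docker run -v %s:/work/ %s /work/%s\n' "$workspace" "$IMAGE" "$name"
}

# Process all .txt files found directly inside the given directory.
annotate_all() {
    local workspace="$1"
    local path
    for path in "$workspace"/*.txt; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$path" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            docker_cmd "$workspace" "$(basename "$path")"
        else
            docker run -v "$workspace":/work/ "$IMAGE" "/work/$(basename "$path")"
        fi
    done
}
```

After sourcing the script, `annotate_all /workspace/` lists the commands for each text file in that directory; setting `DRY_RUN=0` beforehand runs them instead. The mount target `/work/` and the argument form are taken from the command above; everything else is illustrative.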
Questions, comments and bugs can be submitted to the issues tracker.