Pipeline for Processing German Literary Texts. Work in Progress.

cophi-wue/LLpro

LLpro – A Literary Language Processing Pipeline for German Narrative Texts

An NLP pipeline for German literary texts, implemented in Python and spaCy (v3.5.2). Work in progress.

This pipeline implements several custom pipeline components using the spaCy API. Currently the components perform

See also the section about the Output Format for a description of the tabular output format.

Usage

usage: bin/llpro_cli.py [-h] [-v] [--no-normalize-tokens] [--tokenized]
                        [--sentencized] [--paragraph-pattern PAT]
                        [--section-pattern PAT] [--stdout | --writefiles DIR]
                        --infiles FILE [FILE ...]

NLP Pipeline for literary texts written in German.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  --no-normalize-tokens
                        Do not normalize tokens.
  --tokenized           Skip tokenization, and assume that tokens are
                        separated by whitespace.
  --sentencized         Skip sentence splitting, and assume that sentences are
                        separated by newline characters.
  --paragraph-pattern PAT
                        Optional paragraph separator pattern. Paragraph
                        separators are removed, and sentences always terminate
                        on paragraph boundaries. Performed before
                        tokenization/sentence splitting.
  --section-pattern PAT
                        Optional sectioning paragraph pattern. Paragraphs
                        fully matching the pattern are removed. Performed
                        before tokenization/sentence splitting.
  --stdout              Write all processed tokens to stdout.
  --writefiles DIR      For each input file, write processed tokens to a
                        separate file in DIR.
  --infiles FILE [FILE ...]
                        Input files, or directories.
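
To make the effect of --paragraph-pattern and --section-pattern concrete, the following Python sketch imitates the preprocessing described in the help text using the standard re module. It is an illustration only, not LLpro's actual implementation: the function name split_paragraphs, the example patterns, and the assumption that PAT is interpreted as a regular expression are all hypothetical.

```python
import re


def split_paragraphs(text, paragraph_pattern, section_pattern=None):
    """Imitate the described preprocessing (hypothetical helper, not LLpro code)."""
    # Paragraph separators are removed: re.split drops the matched separators
    # (assuming the pattern contains no capturing groups).
    paragraphs = [p for p in re.split(paragraph_pattern, text) if p.strip()]
    if section_pattern is not None:
        # Paragraphs that match the section pattern in full are removed.
        paragraphs = [p for p in paragraphs if not re.fullmatch(section_pattern, p)]
    return paragraphs


raw = "Erstes Kapitel\n\nEs war einmal ein König.\n\nEr hatte drei Töchter."
# Assumed patterns: blank lines separate paragraphs, chapter headings end in "Kapitel".
result = split_paragraphs(raw, r"\n{2,}", r".*Kapitel")
print(result)
```

In this sketch the heading paragraph is dropped because it matches the section pattern completely, while the two remaining paragraphs would each be passed on to tokenization and sentence splitting separately.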

Note: You can set the resources directory (containing ParZu etc.) with the environment variable LLPRO_RESOURCES_ROOT, and the temporary working directory with the environment variable LLPRO_TEMPDIR.
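
As a minimal sketch of how the command-line flags and the two environment variables fit together, the following Python snippet assembles an argument list and an environment for a local run. The helper build_llpro_call and the example paths /opt/llpro/resources and /tmp/llpro are assumptions made for illustration; only the flags and variable names are taken from the usage text above.

```python
import os
import sys


def build_llpro_call(infiles, outdir, resources_root=None, tempdir=None, verbose=True):
    """Return (argv, env) for a local run of bin/llpro_cli.py (hypothetical helper)."""
    argv = [sys.executable, "bin/llpro_cli.py"]
    if verbose:
        argv.append("-v")
    # --infiles is a required argument and accepts files or directories.
    argv += ["--writefiles", str(outdir), "--infiles", *map(str, infiles)]
    env = dict(os.environ)  # copy, so the current process environment stays untouched
    if resources_root is not None:
        env["LLPRO_RESOURCES_ROOT"] = str(resources_root)
    if tempdir is not None:
        env["LLPRO_TEMPDIR"] = str(tempdir)
    return argv, env


# Placeholder paths; adjust them to your own setup.
argv, env = build_llpro_call(["files/in"], "files/out",
                             resources_root="/opt/llpro/resources",
                             tempdir="/tmp/llpro")
print(" ".join(argv[1:]))
# To actually run the pipeline (requires a working LLpro checkout):
# import subprocess; subprocess.run(argv, env=env, check=True)
```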

Installation

The LLpro pipeline can be run either locally or as a Docker container. Running the pipeline using Docker is strongly recommended.

WINDOWS USERS: For building the Docker image, clone using

git clone https://github.com/aehrm/LLpro --config core.autocrlf=input

to preserve line endings.

Building and running the Docker image

With the provided Dockerfile, all dependencies and prerequisites are downloaded automatically during the build.

cd LLpro
docker build --tag cophiwue/llpro .
# or, if you want experimental features enabled
# docker build --build-arg LLPRO_EXPERIMENTAL=1 --tag cophiwue/llpro-experimental .

After building, the Docker image can be run like this:

mkdir -p files/in files/out
chmod a+w files/out  # make directory writeable from the Docker container
# copy files into ./files/in to be processed
# to restrict the container to a single GPU, replace "--gpus all" with, e.g., --gpus "device=0"
docker run \
    --rm \
    -e OMP_NUM_THREADS=4 \
    --gpus all \
    --interactive \
    --tty \
    -a stdout \
    -a stderr \
    -v "$(pwd)/files:/files" \
    cophiwue/llpro -v --writefiles /files/out --infiles /files/in
# processed files are located in ./files/out
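
For scripted batch runs, the same steps can be sketched in Python. The helpers prepare_dirs and docker_run_argv are hypothetical names introduced here; the Docker flags and the image name are copied from the command above, and nothing else about the container is assumed.

```python
from pathlib import Path


def prepare_dirs(files_dir):
    """Create files/in and files/out and make the output directory writable for all."""
    files_dir = Path(files_dir)
    (files_dir / "in").mkdir(parents=True, exist_ok=True)
    out = files_dir / "out"
    out.mkdir(parents=True, exist_ok=True)
    out.chmod(out.stat().st_mode | 0o222)  # same effect as chmod a+w
    return files_dir


def docker_run_argv(files_dir, image="cophiwue/llpro", gpus="all", threads=4):
    """Assemble the docker run call shown above as an argument list."""
    files_dir = Path(files_dir).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "-e", f"OMP_NUM_THREADS={threads}",
        "--gpus", gpus,  # e.g. "device=0" to select a single GPU
        "--interactive", "--tty",
        "-a", "stdout", "-a", "stderr",
        "-v", f"{files_dir}:/files",
        image, "-v", "--writefiles", "/files/out", "--infiles", "/files/in",
    ]
```

The resulting list can be passed to subprocess.run. When the script is not attached to a terminal, the --tty flag will likely have to be dropped.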

Installing locally

Verify that the following dependencies are installed:

  • Python (tested on version 3.7)
  • For RNNTagger:
    • CUDA (tested on version 11.4)
  • For ParZu:
    • SWI-Prolog >= 5.6
    • SFST >= 1.4

Execute poetry install and ./prepare.sh. The script downloads all remaining prerequisites. Example usage:

poetry install
./prepare.sh
# NOTICE: use the prepared poetry venv!
poetry run python ./bin/llpro_cli.py -v --writefiles files/out --infiles files/in

# if desired, run tests
poetry run pytest -vv

Developer Guide

See the separate Developer Guide for the implemented spaCy components and how to access the attributes they assign.

See also the separate document on the tabular Output Format, which describes the format and lists the tagsets used.

See the folder ./contrib for scripts to reproduce the fine-tuning of the custom models.

Citing

If you use the LLpro software for academic research, please consider citing the accompanying publication:

Ehrmanntraut, Anton, Leonard Konle, and Fotis Jannidis. 2023. “LLpro: A Literary Language Processing Pipeline for German Narrative Text.” In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023). Ingolstadt, Germany: KONVENS 2023 Organizers. To be published.

License

In accordance with the license terms of ParZu+Zmorge (GPL v2) and of SoMeWeTa (GPL v3), the LLpro pipeline is licensed under the terms of GPL v3. See LICENSE.

NOTICE: The code of the ParZu parser located in resources/ParZu has been modified to be compatible with LLpro. See git log -p df1e91a.. -- resources/ParZu for a summary of these changes.

NOTICE: Some subsystems and resources used by the LLpro pipeline have additional license terms:

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

References

Akbik, Alan, Duncan Blythe, and Roland Vollgraf. 2018. “Contextual String Embeddings for Sequence Labeling.” In COLING 2018, 27th International Conference on Computational Linguistics, 1638–49.

Brunner, Annelen, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2021. “To BERT or Not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of Four Types of Speech, Thought and Writing Representation.” In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2624:11. CEUR Workshop Proceedings. Zurich, Switzerland. http://ceur-ws.org/Vol-2624/paper5.pdf.

Krug, Markus, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, and Fotis Jannidis. 2017. “Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus].” https://resolver.sub.uni-goettingen.de/purl?gro-2/108301.

Kurfalı, Murathan, and Mats Wirén. 2021. “Breaking the Narrative: Scene Segmentation Through Sequential Sentence Classification.” In Proceedings of the Shared Task on Scene Segmentation, edited by Albin Zehe, Leonard Konle, Lea Dümpelmann, Evelyn Gius, Svenja Guhr, Andreas Hotho, Fotis Jannidis, et al., 3001:49–53. CEUR Workshop Proceedings. Düsseldorf, Germany. http://ceur-ws.org/Vol-3001/#paper6.

Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 665–70. Miyazaki, Japan: European Language Resources Association ELRA. http://www.lrec-conf.org/proceedings/lrec2018/pdf/49.pdf.

Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 57–62. Berlin, Germany: Association for Computational Linguistics (ACL). http://aclweb.org/anthology/W16-2607.

Schmid, Helmut. 2019. “Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts.” In DATeCH, Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 133–37. Brussels, Belgium: Association for Computing Machinery. https://www.cis.uni-muenchen.de/~schmid/papers/Datech2019.pdf.

Schröder, Fynn, Hans Ole Hatzel, and Chris Biemann. 2021. “Neural End-to-End Coreference Resolution for German in Different Domains.” In Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), 170–81. Düsseldorf, Germany: KONVENS 2021 Organizers. https://aclanthology.org/2021.konvens-1.15.

Schweter, Stefan, and Alan Akbik. 2021. “FLERT: Document-Level Features for Named Entity Recognition.” arXiv:2011.06993 [Cs], May. http://arxiv.org/abs/2011.06993.

Sennrich, Rico, and Beat Kunz. 2014. “Zmorge: A German Morphological Lexicon Extracted from Wiktionary.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1063–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf.

Sennrich, Rico, Gerold Schneider, Martin Volk, and Martin Warin. 2009. “A New Hybrid Dependency Parser for German.” In Proceedings of the GSCL Conference, edited by C. Chiarcos, Richard Eckart de Castilho, and Manfred Stede. Potsdam, Germany. https://doi.org/10.5167/UZH-25506.

Sennrich, Rico, Martin Volk, and Gerold Schneider. 2013. “Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-Tagging, and Morphological Analysis.” In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, 601–9. Hissar, Bulgaria: INCOMA Ltd. Shoumen, BULGARIA. https://www.aclweb.org/anthology/R13-1079.

Vauth, Michael, Hans Ole Hatzel, Evelyn Gius, and Chris Biemann. 2021. “Automated Event Annotation in Literary Texts.” In Proceedings of the Conference on Computational Humanities Research 2021, edited by Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, and Joris van Zundert, 2989:333–45. CEUR Workshop Proceedings. Amsterdam, the Netherlands. https://ceur-ws.org/Vol-2989/#short_paper18.