Nederlab: Automatic Linguistic Enrichment of Historical Dutch

Metadata

  • Status: Completed
  • Type: Specific
  • Work Package: WP3/WP6
  • Research Coordinators: Katrien DePuydt (INT), Hennie Brugman (KNAW HuC)
  • Participating Institutes: INT, KNAW HuC, Radboud
  • Developers: Erik Tjong Kim Sang (KNAW HuC), Maarten van Gompel (RU), Jesse de Does (INT), Erwin Komen (RU)
  • Interest Groups: Text, Annotation

Description

The aim of the Nederlab project is to offer a research portal containing a large collection of digitised Dutch documents from a range of time periods.

Though the project itself is independent of CLARIAH, it uses a variety of CLARIAH tools.

What is the research about?

This use case addresses the automatic linguistic enrichment of historical Dutch, with an initial focus on 17th-century Dutch.

What problem is hindering the research?

The main issue was that most NLP tools were trained on contemporary Dutch rather than historical Dutch. A second major issue was converting the original TEI documents into a form on which automatic enrichment could be performed. A third issue was the inaccuracy of metadata.

What is needed to do the research?

Data

The resulting corpora are delivered in the FoLiA format. Various improvements were made to FoLiA to accommodate the Nederlab results, such as the ability to associate metadata with sub-parts of a document. The input corpora are largely in TEI format and come from various sources, the KB's DBNL being a notable one.
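The idea of associating metadata with sub-parts of a document can be illustrated with a small sketch. This is not FoLiA's actual schema or API: the `Document` and `Part` classes, the `lookup` method, and the sample values are all hypothetical and only show a sub-part overriding a document-level value.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Part:
    """A sub-part of a document (hypothetical structure, not FoLiA's schema)."""
    part_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """A document with its own metadata plus optional per-part metadata."""
    metadata: Dict[str, str]
    parts: Dict[str, Part] = field(default_factory=dict)

    def lookup(self, part_id: str, key: str) -> Optional[str]:
        """Return the part's own value for key, else the document-level value."""
        part = self.parts.get(part_id)
        if part is not None and key in part.metadata:
            return part.metadata[key]
        return self.metadata.get(key)


# Illustrative values only: a later edition that contains an older text.
doc = Document(
    metadata={"title": "Collected works", "year": "1720"},
    parts={
        "part.1": Part("part.1", {"year": "1655"}),  # overrides the year
        "part.2": Part("part.2"),                    # inherits everything
    },
)

print(doc.lookup("part.1", "year"))
print(doc.lookup("part.2", "year"))
```

In this sketch a part without its own value falls back to the document-level metadata, so only the parts that differ need to be annotated.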

Historical training data tagged with part-of-speech tags and lemmas was an important resource in this research. The Brieven als Buit corpus and the corpus Rhenen-Mulder were used as a basis here.

The inaccuracy of metadata (time period information) proved to be a problem, especially for sub-editions, so a metadata curation effort was undertaken by Katrien DePuydt. Accurate time period information was essential for the enrichment pipeline to decide which enrichment strategy to apply.
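To make the dependency on time period information concrete, here is a minimal sketch of how a pipeline might pick a strategy from a year. It is not the Nederlab implementation: the `choose_strategy` function, the strategy names, and in particular the year boundaries in `RULES` are placeholders chosen for illustration.

```python
from typing import List, Optional, Tuple

# (first_year, last_year, strategy). The boundaries and strategy names are
# illustrative assumptions, not the values used by the actual pipeline.
RULES: List[Tuple[int, int, str]] = [
    (0, 1599, "historical-model"),
    (1600, 1699, "modernise-then-contemporary"),
    (1700, 9999, "contemporary-model"),
]


def choose_strategy(year: Optional[int],
                    rules: List[Tuple[int, int, str]] = RULES) -> str:
    """Return the strategy whose year range contains the given year."""
    if year is None:
        # Without a date there is no basis for choosing a strategy.
        raise ValueError("no time period information; cannot choose a strategy")
    for first, last, strategy in rules:
        if first <= year <= last:
            return strategy
    raise ValueError(f"no rule covers year {year}")


print(choose_strategy(1655))
```

A missing or wrong year either stops the selection or silently routes a text to the wrong strategy, which is why curating the metadata mattered.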

Tools

A wide variety of tools was used, and a workflow had to be implemented to invoke them. Considerable work was needed to make the initial TEI input suitable for further linguistic enrichment; a tei2folia converter was developed to this end, continuing where earlier efforts like INT's OpenConvert stopped. Frog was used as the main engine for linguistic enrichment in the form of part-of-speech tagging, lemmatisation and named entity recognition. As Frog only handles contemporary Dutch, new models were trained for two different historical time periods.

Initial work by Erik Tjong Kim Sang focussed on a step that translates 17th-century Dutch to contemporary Dutch, so that the contemporary translation could be enriched further with tools (such as Frog) that only know contemporary Dutch. This strategy was applied to texts from this time period.
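The general idea of a word-by-word translation step can be sketched as a lexicon lookup. This is a simplified illustration and not the method implemented in FoLiA-wordtranslate: the `modernise` function and the three-entry `LEXICON` are hypothetical, and this sketch applies no rules or context.

```python
from typing import Dict, List, Tuple

# Tiny illustrative lexicon mapping historical spellings to modern ones.
LEXICON: Dict[str, str] = {"ick": "ik", "ende": "en", "seer": "zeer"}


def modernise(tokens: List[str], lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each original token with a modern form.

    Tokens not found in the lexicon are paired with themselves.
    """
    pairs = []
    for token in tokens:
        modern = lexicon.get(token.lower(), token)
        # Carry an initial capital over to the replacement form.
        if token[:1].isupper() and modern != token:
            modern = modern[:1].upper() + modern[1:]
        pairs.append((token, modern))
    return pairs


pairs = modernise(["Ick", "ben", "seer", "moe"], LEXICON)
print(pairs)
# [('Ick', 'Ik'), ('ben', 'ben'), ('seer', 'zeer'), ('moe', 'moe')]
```

Because the output stays aligned one-to-one with the input, annotations computed on the modern forms could in principle be attached back to the original tokens.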

What software and services are involved?

  • ucto for tokenisation.
  • Frog for PoS-tagging, lemmatisation and Named Entity Recognition for Dutch, Middle Dutch, and Early New Dutch (vroegnieuwnederlands)
  • FoLiA-utils for:
    • FoLiA-wordtranslate - Implements Erik Tjong Kim Sang's word-by-word modernisation method. This is a reimplementation of his initial prototype, with some added improvements.
  • Colibri Utils for:
    • colibri-lang - Language Identification (including models for Middle Dutch and Early New Dutch)
  • FoLiA Tools for:
    • foliavalidator - Validation
    • foliaupgrade - Upgrades to FoLiA v2
    • tei2folia - Conversion from a subset of TEI to FoLiA.
    • foliamerge - Merges annotations between two FoLiA documents.
  • wikiente for Named Entity Recognition and Linking using DBpedia Spotlight. This eventually replaced earlier work called foliaentity by Erwin Komen.

References

Related use cases:

Links:

Publications: