Dutch (Municipal) Text Simplification Corpora

This repository contains data and links to resources related to the task of simplifying Dutch (municipal) texts.

Why?

According to recent research, 16% of people aged 16 to 65 in Amsterdam have low literacy skills. This hinders participation in society, for example in voting, paying taxes, reissuing documents, or applying for social benefits. As part of our Amsterdam for All project, we have therefore set out to research the use of AI for measuring and improving the readability of municipal communication. We also hope to inspire others to work on text simplification and related technology.

Further information about the problem of measuring and improving the readability of municipal texts, and about how we use the datasets in this repository, can be found on the Amsterdam Intelligence website and on Openresearch.

Automatic Text Simplification

Dataset by City of Amsterdam

The City of Amsterdam has published a dataset for the task of sentence-level simplification of Dutch municipal texts. The dataset consists of 1311 sentence pairs automatically aligned from 50 documents manually simplified by communications experts.

The complex-simple-sentences folder contains the dataset, as well as further details about its creation.
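As a rough illustration of how such sentence pairs could be read and inspected, here is a minimal Python sketch. It assumes the pairs are stored as a CSV file with one column for the original sentence and one for the simplified sentence. The column names `complex` and `simple`, the helper names and the sample rows are all hypothetical; check the folder for the actual file format before reusing this.

```python
import csv
import io
from typing import NamedTuple


class SentencePair(NamedTuple):
    complex_text: str
    simple_text: str


def read_pairs(handle, complex_col="complex", simple_col="simple"):
    """Read sentence pairs from an open CSV handle, skipping incomplete rows.

    The column names are assumptions, not the documented layout of the dataset.
    """
    pairs = []
    for row in csv.DictReader(handle):
        source = (row.get(complex_col) or "").strip()
        target = (row.get(simple_col) or "").strip()
        if source and target:
            pairs.append(SentencePair(source, target))
    return pairs


def compression_ratio(pair):
    """Word count of the simplified sentence divided by that of the original."""
    return len(pair.simple_text.split()) / len(pair.complex_text.split())


# Made-up rows standing in for the real file; the second row has no
# original sentence and is therefore skipped.
SAMPLE = io.StringIO(
    "complex,simple\n"
    "U dient het formulier in te vullen.,Vul het formulier in.\n"
    ",Lege bron wordt overgeslagen.\n"
)

pairs = read_pairs(SAMPLE)
for pair in pairs:
    print(f"{compression_ratio(pair):.2f}  {pair.complex_text} -> {pair.simple_text}")
```

To read from disk, pass an open file instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.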

Dataset and demo by KU Leuven

In 2023, Charlotte Van de Velde and other researchers from KU Leuven shared the following text simplification resources as part of her MSc AI thesis:

  • a dataset of 1267 sentences automatically simplified by using gpt-3.5-turbo
  • base, small and large versions of a model fine-tuned on the above-mentioned dataset (using the corresponding UL2 models)
  • a demo of the base version

Lexical Simplification

Contextualized Lexical Simplification Dataset

The contextualized-lexical-simplification folder contains information about an evaluation dataset for Dutch Contextualized Lexical Simplification, as well as a reference to the corresponding paper, which proposes LSBertje, a new Dutch model for the task.

Difficult Words

The City of Amsterdam maintains a list of complex words together with simpler alternatives.
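A word list of this kind can support a simple, context-insensitive substitution baseline. The Python sketch below shows one possible approach. The entries in `REPLACEMENTS` are illustrative examples and are not taken from the City of Amsterdam's list, and the function names are hypothetical. Because the lookup ignores context, it can pick an unsuitable alternative for ambiguous words.

```python
import re

# Illustrative entries only; NOT taken from the City of Amsterdam's list.
# Keys must be lowercase, because matches are lowercased before lookup.
REPLACEMENTS = {
    "derhalve": "daarom",
    "indien": "als",
    "met betrekking tot": "over",
}


def build_pattern(replacements):
    """Compile one regex that matches any listed word or phrase as a whole unit."""
    # Longest entries first, so multi-word phrases win over shorter entries.
    keys = sorted(replacements, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def simplify(text, replacements=REPLACEMENTS):
    """Replace every listed complex word or phrase with its simpler alternative."""
    pattern = build_pattern(replacements)

    def swap(match):
        found = match.group(0)
        alternative = replacements[found.lower()]
        # Keep an initial capital, e.g. at the start of a sentence.
        if found[0].isupper():
            alternative = alternative[0].upper() + alternative[1:]
        return alternative

    return pattern.sub(swap, text)


result = simplify("Indien u vragen heeft met betrekking tot uw aanvraag, bel ons.")
print(result)  # Als u vragen heeft over uw aanvraag, bel ons.
```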

Municipal Abbreviations

A (partial) list of abbreviations from the municipal domain can be found on GitHub.

Inclusive Language

Due to the overlap in underlying technology, we propose taking inclusive language into account when detecting words or phrases that require substitution or providing suitable alternatives. The City of Amsterdam maintains a list of inclusive words as well as a full list of resources related to inclusive language.

Language Level Analysis and Classification

NT2Lex resources

According to their own website, "NT2Lex is a lexical database for Dutch as a foreign language (NT2) that includes frequency distributions of words observed in texts graded along the six-level scale of the Common European Framework of Reference for Languages. It is a receptive graded lexicon, with word frequencies observed in textbook reading activities and simplified readers targeting learners of Dutch."

There are also online tools for the analysis of the complexity of words in a text, as well as frequency distributions of words along the CEFR scale.

CEFR Annotations by EDIA

As part of a European Language Grid project, EDIA has developed a dataset of 1200 texts (from diverse data sources, including municipal texts such as the ones that can be found on the website of the City of Amsterdam). The texts are labelled with a CEFR readability level and are available under a CC-BY-NC license (for academic purposes).

The dataset can be requested on the company's website.

Contributing

Feel free to help out! Open an issue, submit a PR or contact us.

Acknowledgements

This repository was created by Amsterdam Intelligence for the City of Amsterdam.

The resources linked in this repository were compiled by colleagues at the City of Amsterdam, researchers from the University of Amsterdam and the Vrije Universiteit Amsterdam, and external partners and organizations.

We owe a special thank you to the Communications Department of the City of Amsterdam for providing us with valuable resources, and to Eliza Hobo, Daniel Vlantis and Ayoub Abdelouarit for their dedication.

License

This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).