Text Simplification Datasets

A collection of text simplification datasets with a focus on sentence/paragraph/document-level simplification. All contributions are welcome!

For browsing/sorting the datasets, you can use the interactive table.

Datasets

Notes on the table columns:

  • Kind refers to the way simplification instances were obtained. For parallel, this is usually through manual simplification according to specific guidelines. For comparable, this is by automatically mining pairs of complex/simple sentences with similar meaning from a large text corpus.
  • Level can be lexical (lex), sentence (sent), paragraph (para) or document (doc).
  • Refs refers to the number of references per instance (i.e., gold simplifications).
| Dataset | Lang | Domain | Kind | Level | Instances | Refs. | Link |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PWKP (Zhu et al., 2010) | EN | Wikipedia | Comparable | Sent | 108,016 paired sentences extracted from 65,133 articles. | 1 | Link |
| C&K-1 (Coster and Kauchak, 2011) | EN | Wikipedia | Comparable | Sent | 137,000 paired sentences from 10,588 articles. | 1 | Link |
| C&K-2 (Kauchak, 2013) | EN | Wikipedia | Comparable | Sent | 167,000 paired sentences. | 1 | Link |
| LexMTurk (Horn et al., 2014) | EN | Wikipedia | Parallel | Lex | 500 | multiple | Link |
| EW-SEW (Hwang et al., 2015) | EN | Wikipedia | Comparable | Sent | 150,000 full and 130,000 partial matches | 1 | Link (Archived, pre-processed version available here) |
| sscorpus (Kajiwara and Komachi, 2016) | EN | Wikipedia | Comparable | Sent | 492,993 aligned sentences from 126K article pairs. | 1 | Link |
| TurkCorpus (Xu et al., 2016) | EN | Wikipedia | Parallel | Sent | 2359 sentences (2000 dev, 359 test) | 8 | Link |
| NNSEval (Paetzold and Specia, 2016) | EN | Wikipedia | Comparable | Lex | 239 | multiple | Link |
| BenchLS (Paetzold and Specia, 2016) | EN | Wikipedia | Comparable | Lex | 929 | multiple | Link |
| WikiLarge (Zhang and Lapata, 2017) | EN | Wikipedia | Comparable | Sent | 296,402 sentence pairs (WikiLarge) | 1 | Link |
| WikiSmall (Zhang and Lapata, 2017) | EN | Wikipedia | Comparable | Sent | 89,042 sentence pairs | 1 | Link |
| WikiSplit (Botha et al., 2018) | EN | Wikipedia | Parallel | Sent | 1 million sentences | 1 | Link |
| Hsplit (Sulem et al., 2018) | EN | Wikipedia | Parallel | Sent | 359 sentences (test set of TurkCorpus) | 4 | Link |
| ASSET (Alva-Manchego et al., 2020) | EN | Wikipedia | Parallel | Sent | 2359 sentences (2000 dev, 359 test) | 10 | Link |
| Wiki-AUTO (Jiang et al., 2020) | EN | Wikipedia | Comparable | Sent | 488,332 train sentences from 138,095 article pairs (2019/09 dump). | 1 | Link (Part of GEM) |
| Wikipedia (with context) (Sun et al., 2020) | EN | Wikipedia | Comparable | Sent | 116,020 sentences with context (includes preceding and following sentence) | 1 | Link |
| D-Wikipedia (Sun et al., 2021) | EN | Wikipedia | Comparable | Doc | 143,546 article pairs | 1 | Link |
| Klexikon (Aumiller and Gertz, 2022) | DE | Wikipedia | Comparable | Doc | 2898 article pairs | 1 | Link |
| SWiPE (Laban et al., 2023) | EN | Wikipedia | Comparable | Doc | 145,161 article revision pairs, annotations of fine-grained edit operations on ~5000 articles | 1 | Link |
| Dsim (Klerke and Søgaard, 2012) | DA | News | Parallel | Doc | 3,701 articles with 48,186 aligned sentences | 1 | n/a |
| Newsela (Xu et al., 2015) | EN | News | Parallel | Doc | 1130 articles (original); 1911 articles (v2016-01-29); at 5 levels | 1 | Link |
| Newsela-ES (Xu et al., 2015) | ES | News | Parallel | Doc | 243 articles (v2016-01-29) at 5 levels | 1 | Link |
| OneStopEnglish (Vajjala and Lucic, 2018) | EN | News | Parallel | Doc | 189 articles at three levels. Automatic sentence alignment: 1.6K ELE-INT, 2.1K ELE-ADV, 3.1K INT-ADV. | 1 | Link |
| Newsela-AUTO (Jiang et al., 2020) | EN | News | Parallel | Sent | 666,645 sentence pairs from 1932 articles at 5 levels | 1 | Link |
| 20 minutes (Rios et al., 2021) | DE | News | Parallel | Doc | 18,305 articles with simplified summaries. | 1 | Link |
| SNIML (Hauser et al., 2022) | DE, EN, FI, FR, IT, SV | News | Simplified only | Doc | 13,447 documents | n/a | Link |
| DEplain (Stodden et al., 2023) | DE | News | Parallel | Doc | 500 document pairs in News domain (13k aligned sentences), 150 document pairs in Web domain (2k aligned sentences) | 1 | Link |
| SimpleGerman (Klaper et al., 2013) | DE | Web | Comparable | Sent | 7000 sentences from 256 articles. 78% of sentences have an alignment | 1 | n/a (Available on request) |
| SimPA (Scarton et al., 2018) | EN | Web | Parallel | Sent | 1100 sentences with 3 lexical and 1 syntactic simplification each | 3, 1 | Link |
| SimpleGerman V2.0 (Battisti et al., 2020) | DE | Web | Comparable | Doc | 5461 simple, unaligned documents and 378 aligned (complex-simple) documents (6217 docs in total). The document-aligned portion has 17,121 complex sentences and 21,072 simple sentences. No statistics on the sentence alignments are reported. | 1 | n/a (Scraping code) |
| Simple German V3.0 (Toborek et al., 2022) | DE | Web | Comparable | Doc | 708 documents | 1 | n/a (Scraping code) |
| PPDB (Ganitkevitch et al., 2013) | EN | Mixed | Comparable | Sent | 221 million sentences | 1 | Link |
| Simple-PPDB (Pavlick and Callison-Burch, 2016) | EN | Mixed | Comparable | Sent | 4.5 million sentences | 1 | Link |
| WebSplit (Narayan et al., 2017) | EN | Mixed | Comparable | Sent | 1 million sentences | 1 | Link |
| EASIER (Alarcon et al., 2021) | ES | Mixed | Parallel | Lex | 5153 | 1-3 | Link |
| RuAdapt (Dmitrieva and Tiedemann, 2021) | RU | Books | Parallel | Doc | 457 documents | | Link |
| CEFR (Uchida et al., 2018) | EN | Education | Comparable | Lex | 414 | 2.4 (avg) | Link |
| SIMPLEX-PB-3.0 (Hartmann and Aluisio, 2021) | PT (BR) | Education | Parallel | Lex | 1582 | 7.3 (avg) | Link |
| PSAT (Taylor et al., 2022) | EN | Education | Parallel | Doc | 112 documents, with a total of 1883 aligned sentences | 1 | Link |
| Vikidia (Lee and Vajjala, 2022) | EN / FR | Education | Parallel | Doc | 6165 (for each language) | 1 | Link |
| CEFR-SP (Arase et al., 2022) | EN | Education | CEFR-level | Sent | 17000 sentences from Newsela-Auto (upon request), Wiki-Auto, and SCoRE dataset | 1 | Link |
| CLEAR (Grabar and Cardon, 2019) | FR | Medical | Comparable | Doc | 16190 documents | 1 | Link |
| myTomorrows-Wiki (van den Bercken et al., 2019) | EN | Medical | Comparable | Sent | 5415 (manually aligned); 3797 (automatically aligned) | 1 | Link |
| MSD-Manuals (Cao et al., 2020) | EN | Medical | Comparable | Sent | 2551 linked paragraphs (professionals <-> laymen) with average of 10.4 and 11.3 sentences each. From a random sample of 1000 paragraphs, medical experts extracted 930 aligned sentences with equivalent meaning. | 1 | Link |
| PharmMT (Li et al., 2020) | EN | Medical | Parallel | Sent | 380,000K aligned sentences. | 1 | n/a |
| AutoMeTS (Van et al., 2020) | EN | Medical | Comparable | Sent | 3300 aligned sentences | 1 | Link |
| Cochrane (Devaraj et al., 2021) | EN | Medical | Comparable | Para | 4459 paragraph pairs (<1024 tokens) | 1 | Link |
| CLARA-MeD (Campillos-Llanos et al., 2022) | ES | Medical | Comparable | Doc | 24298 comparable documents and 3800 parallel sentences | | Link |
| BioLaySumm (Goldsack et al., 2022) | EN | Medical | Parallel | Doc | 32353 document-plain abstract pairs | 1 | Link |
| CELLS (Guo et al., 2022) | EN | Medical | Comparable | Para | 63000 | 1 | Link |
| PLABA (Attal et al., 2023) | EN | Medical | Parallel | Doc | 750 documents with 7643 sentence pairs | 1 | Link |
| MultiCochrane (Joseph et al., 2023) | EN, ES, FR, FA | Medical | Comparable | Sent | Cross-lingual pairs; 5K pairs (clean, semi-automatically aligned), 100K pairs (noisy) | 1 | Link |
| CLARA-MeD-simp-sent (Campillos-Llanos et al., 2024) | ES | Medical | Parallel | Sent | 1200 manually-simplified sentences | 1 | Link |
| SimpMedLexSp (Campillos-Llanos et al., 2024) | ES | Medical | Parallel | Lex | >14000 pairs of medical terms and the corresponding simplified synonym/definition. | 1 | Link |
| JASMINE (Horiguchi et al., 2024) | JA | Medical | Parallel | Sent | 1425 sentences (0/425/1000 train/val/test) | 1 | Link (available on request) |
| MedLane (Luo et al., 2022) | EN | Clinical | Parallel | Sent | 12,801/1,015/1,016 train/valid/test sentences (avg. 20/24 tokens in source/target) | 1 | Link |
| MTSamples (Moramarco et al., 2022) | EN | Clinical | Parallel | Sent | 1250 sentence pairs. | 1 | Link |
| SimplePatho (Trienes et al., 2022) | DE | Clinical | Parallel | Doc | 851 documents | 1 | n/a |
| FestAbility (Chamovitz and Abend, 2022) | EN | Talks | Parallel | Sent | 321 sentence pairs | 1 | Link |
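
To make the column conventions from the notes concrete, here is a minimal sketch of how a few rows could be modelled and filtered in Python. The `DatasetEntry` class and the `select` and `multi_reference` helpers are hypothetical names introduced for illustration; they are not part of this repository. Only a handful of rows are copied from the table, and the Instances and Link columns are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One table row (hypothetical structure, for illustration only)."""
    name: str
    lang: str    # e.g. "EN", "DE"
    domain: str  # e.g. "Wikipedia", "News", "Medical"
    kind: str    # e.g. "Parallel" or "Comparable"
    level: str   # "Lex", "Sent", "Para" or "Doc"
    refs: str    # kept as text, since values include "multiple" and "n/a"


# A few rows copied from the table.
ENTRIES = [
    DatasetEntry("TurkCorpus", "EN", "Wikipedia", "Parallel", "Sent", "8"),
    DatasetEntry("ASSET", "EN", "Wikipedia", "Parallel", "Sent", "10"),
    DatasetEntry("Klexikon", "DE", "Wikipedia", "Comparable", "Doc", "1"),
    DatasetEntry("DEplain", "DE", "News", "Parallel", "Doc", "1"),
    DatasetEntry("PLABA", "EN", "Medical", "Parallel", "Doc", "1"),
]


def select(entries, lang=None, kind=None, level=None):
    """Return the entries that match every filter that is not None."""
    def matches(entry):
        return ((lang is None or entry.lang == lang)
                and (kind is None or entry.kind == kind)
                and (level is None or entry.level == level))
    return [entry for entry in entries if matches(entry)]


def multi_reference(entries):
    """Return entries whose Refs value is a plain number greater than one."""
    return [e for e in entries if e.refs.isdigit() and int(e.refs) > 1]


if __name__ == "__main__":
    for entry in select(ENTRIES, lang="DE", level="Doc"):
        print(entry.name)
```

Keeping `refs` as a string is a deliberate choice in this sketch: the Refs. column mixes numbers with values such as "multiple", "1-3" and "n/a", so a numeric type would lose information.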

Contributing

New entries can be added to data.yml. Afterwards, run python render.py and submit a PR with the changes.
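
As a rough illustration of what rendering an entry into a table row involves, the sketch below turns a dictionary into a Markdown row. It is not the actual `render.py`, and the dictionary keys are an assumption that simply mirrors the table headers; check `data.yml` for the real field names before adding an entry. The sample entry reuses the FestAbility row, with "Link" standing in for the URL.

```python
# Column order taken from the table header in this README.
COLUMNS = ["Dataset", "Lang", "Domain", "Kind", "Level", "Instances", "Refs.", "Link"]

# Hypothetical entry: the keys are assumed, not the real data.yml schema.
EXAMPLE_ENTRY = {
    "Dataset": "FestAbility (Chamovitz and Abend, 2022)",
    "Lang": "EN",
    "Domain": "Talks",
    "Kind": "Parallel",
    "Level": "Sent",
    "Instances": "321 sentence pairs",
    "Refs.": 1,
    "Link": "Link",
}


def render_row(entry):
    """Render one entry as a Markdown table row; missing fields become empty cells."""
    # Escape pipes so that cell text cannot break the table layout.
    cells = [str(entry.get(col, "")).replace("|", "\\|") for col in COLUMNS]
    return "| " + " | ".join(cells) + " |"


def render_table(entries):
    """Render the header, the divider and one row per entry."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join([" --- "] * len(COLUMNS)) + "|"
    return "\n".join([header, divider] + [render_row(e) for e in entries])


if __name__ == "__main__":
    print(render_table([EXAMPLE_ENTRY]))
```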

Acknowledgements

This list has greatly benefited from the surveys of Alva-Manchego et al. (2020) and Štajner (2021), as well as notes by Laura Vásquez-Rodríguez. Thanks! Thanks also to @tollefj for adding the interactive table.