Text Simplification Datasets

A collection of text simplification datasets with a focus on sentence-, paragraph-, and document-level simplification. All contributions are welcome!

For browsing/sorting the datasets, you can use the interactive table.

Datasets

Notes on the table columns:

Kind refers to the way simplification instances were obtained. For parallel, this is usually through manual simplification according to specific guidelines. For comparable, this is by automatically mining pairs of complex/simple sentences with similar meaning from a large text corpus.
Level can be lexical (lex), sentence (sent), paragraph (para) or document (doc).
Refs refers to the number of references per instance (i.e., gold simplifications).

Dataset	Lang	Domain	Kind	Level	Instances	Refs.	Link
PWKP (Zhu et al., 2010)	EN	Wikipedia	Comparable	Sent	108,016 paired sentences extracted from 65,133 articles.	1	Link
C&K-1 (Coster and Kauchak, 2011)	EN	Wikipedia	Comparable	Sent	137,000 paired sentences from 10,588 articles.	1	Link
C&K-2 (Kauchak, 2013)	EN	Wikipedia	Comparable	Sent	167,000 paired sentences.	1	Link
LexMTurk (Horn et al., 2014)	EN	Wikipedia	Parallel	Lex	500	multiple	Link
EW-SEW (Hwang et al., 2015)	EN	Wikipedia	Comparable	Sent	150,000 full and 130,000 partial matches	1	Link (Archived, pre-processed version available here)
sscorpus (Kajiwara and Komachi, 2016)	EN	Wikipedia	Comparable	Sent	492,993 aligned sentences from 126K article pairs.	1	Link
TurkCorpus (Xu et al., 2016)	EN	Wikipedia	Parallel	Sent	2359 sentences (2000 dev, 359 test)	8	Link
NNSEval (Paetzold and Specia, 2016)	EN	Wikipedia	Comparable	Lex	239	multiple	Link
BenchLS (Paetzold and Specia, 2016)	EN	Wikipedia	Comparable	Lex	929	multiple	Link
WikiLarge (Zhang and Lapata, 2017)	EN	Wikipedia	Comparable	Sent	296,402 sentence pairs	1	Link
WikiSmall (Zhang and Lapata, 2017)	EN	Wikipedia	Comparable	Sent	89,042 sentence pairs	1	Link
WikiSplit (Botha et al., 2018)	EN	Wikipedia	Parallel	Sent	1 million sentences	1	Link
Hsplit (Sulem et al., 2018)	EN	Wikipedia	Parallel	Sent	359 sentences (test set of TurkCorpus)	4	Link
ASSET (Alva-Manchego et al., 2020)	EN	Wikipedia	Parallel	Sent	2359 sentences (2000 dev, 359 test)	10	Link
Wiki-AUTO (Jiang et al., 2020)	EN	Wikipedia	Comparable	Sent	488,332 train sentences from 138,095 article pairs (2019/09 dump).	1	Link (Part of GEM)
Wikipedia (with context) (Sun et al., 2020)	EN	Wikipedia	Comparable	Sent	116,020 sentences with context (includes preceding and following sentence)	1	Link
D-Wikipedia (Sun et al., 2021)	EN	Wikipedia	Comparable	Doc	143,546 article pairs	1	Link
Klexikon (Aumiller and Gertz, 2022)	DE	Wikipedia	Comparable	Doc	2898 article pairs	1	Link
SWiPE (Laban et al., 2023)	EN	Wikipedia	Comparable	Doc	145,161 article revision pairs; annotations of fine-grained edit operations on ~5000 articles	1	Link
Dsim (Klerke and Søgaard, 2012)	DA	News	Parallel	Doc	3,701 articles with 48,186 aligned sentences	1	n/a
Newsela (Xu et al., 2015)	EN	News	Parallel	Doc	1130 articles (original); 1911 articles (v2016-01-29); at 5 levels	1	Link
Newsela-ES (Xu et al., 2015)	ES	News	Parallel	Doc	243 articles (v2016-01-29) at 5 levels	1	Link
OneStopEnglish (Vajjala and Lucic, 2018)	EN	News	Parallel	Doc	189 articles at three levels. Automatic sentence alignment: 1.6K ELE-INT, 2.1K ELE-ADV, 3.1K INT-ADV.	1	Link
Newsela-AUTO (Jiang et al., 2020)	EN	News	Parallel	Sent	666,645 sentence pairs from 1932 articles at 5 levels	1	Link
20 minutes (Rios et al., 2021)	DE	News	Parallel	Doc	18,305 articles with simplified summaries.	1	Link
SNIML (Hauser et al., 2022)	DE, EN, FI, FR, IT, SV	News	Simplified only	Doc	13,447 documents	n/a	Link
DEplain (Stodden et al., 2023)	DE	News	Parallel	Doc	500 document pairs in News domain (13k aligned sentences), 150 document pairs in Web domain (2k aligned sentences)	1	Link
SimpleGerman (Klaper et al., 2013)	DE	Web	Comparable	Sent	7000 sentences from 256 articles. 78% of sentences have an alignment	1	n/a (Available on request)
SimPA (Scarton et al., 2018)	EN	Web	Parallel	Sent	1100 sentences with 3 lexical and 1 syntactic simplification each	3, 1	Link
SimpleGerman V2.0 (Battisti et al., 2020)	DE	Web	Comparable	Doc	5461 simple, unaligned documents and 378 aligned (complex-simple) documents (6217 docs in total). The document-aligned portion has 17,121 complex sentences and 21,072 simple sentences. No statistics on the sentence-alignments are reported.	1	n/a (Scraping code)
Simple German V3.0 (Toborek et al., 2022)	DE	Web	Comparable	Doc	708 documents	1	n/a (Scraping code)
PPDB (Ganitkevitch et al., 2013)	EN	Mixed	Comparable	Sent	221 million sentences	1	Link
Simple-PPDB (Pavlick and Callison-Burch, 2016)	EN	Mixed	Comparable	Sent	4.5 million sentences	1	Link
WebSplit (Narayan et al., 2017)	EN	Mixed	Comparable	Sent	1 million sentences	1	Link
EASIER (Alarcon et al., 2021)	ES	Mixed	Parallel	Lex	5153	1-3	Link
RuAdapt (Dmitrieva and Tiedemann, 2021)	RU	Books	Parallel	Doc	457 documents		Link
CEFR (Uchida et al., 2018)	EN	Education	Comparable	Lex	414	2.4(avg)	Link
SIMPLEX-PB-3.0 (Hartmann and Aluisio, 2021)	PT (BR)	Education	Parallel	Lex	1582	7.3(avg)	Link
PSAT (Taylor et al., 2022)	EN	Education	Parallel	Doc	112 documents, with total of 1883 aligned sentences	1	Link
Vikidia (Lee and Vajjala, 2022)	EN / FR	Education	Parallel	Doc	6165 (for each language)	1	Link
CEFR-SP (Arase et al., 2022)	EN	Education	CEFR-level	Sent	17000 sentences from Newsela-Auto (upon request), Wiki-Auto, and SCoRE dataset	1	Link
CLEAR (Grabar and Cardon, 2019)	FR	Medical	Comparable	Doc	16190 documents	1	Link
myTomorrows-Wiki (van den Bercken et al., 2019)	EN	Medical	Comparable	Sent	5415 (manually aligned); 3797 (automatically aligned)	1	Link
MSD-Manuals (Cao et al., 2020)	EN	Medical	Comparable	Sent	2551 linked paragraphs (professionals <-> laymen) with average of 10.4 and 11.3 sentences each. From a random sample of 1000 paragraphs, medical experts extracted 930 aligned sentences with equivalent meaning.	1	Link
PharmMT (Li et al., 2020)	EN	Medical	Parallel	Sent	380,000K aligned sentences.	1	n/a
AutoMeTS (Van et al., 2020)	EN	Medical	Comparable	Sent	3300 aligned sentences	1	Link
Cochrane (Devaraj et al., 2021)	EN	Medical	Comparable	Para	4459 paragraph pairs (<1024 tokens)	1	Link
CLARA-MeD (Campillos-Llanos et al., 2022)	ES	Medical	Comparable	Doc	24298 comparable documents and 3800 parallel sentences		Link
BioLaySumm (Goldsack et al., 2022)	EN	Medical	Parallel	Doc	32353 document-plain abstract pairs	1	Link
CELLS (Guo et al., 2022)	EN	Medical	Comparable	Para	63000	1	Link
PLABA (Attal et al., 2023)	EN	Medical	Parallel	Doc	750 documents with 7643 sentence pairs	1	Link
MultiCochrane (Joseph et al., 2023)	EN, ES, FR, FA	Medical	Comparable	Sent	Cross-lingual pairs; 5K pairs (clean, semi-automatically aligned), 100K pairs (noisy)	1	Link
CLARA-MeD-simp-sent (Campillos-Llanos et al., 2024)	ES	Medical	Parallel	Sent	1200 manually-simplified sentences	1	Link
SimpMedLexSp (Campillos-Llanos et al., 2024)	ES	Medical	Parallel	Lex	>14000 pairs of medical terms and the corresponding simplified synonym/definition.	1	Link
JASMINE (Horiguchi et al., 2024)	JA	Medical	Parallel	Sent	1425 sentences (0/425/1000 train/val/test)	1	Link (available on request)
MedLane (Luo et al., 2022)	EN	Clinical	Parallel	Sent	12,801/1,015/1,016 train/valid/test sentences (avg. 20/24 tokens in source/target)	1	Link
MTSamples (Moramarco et al., 2022)	EN	Clinical	Parallel	Sent	1250 sentence pairs.	1	Link
SimplePatho (Trienes et al., 2022)	DE	Clinical	Parallel	Doc	851 documents	1	n/a
FestAbility (Chamovitz and Abend, 2022)	EN	Talks	Parallel	Sent	321 sentence pairs	1	Link
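
If the interactive table is not at hand, the rows above can also be filtered locally. The following is a minimal sketch, assuming the table has been saved as tab-separated text with the header row shown above; the `TABLE` sample and the helper names `load_rows` and `filter_rows` are illustrative and not part of this repository.

```python
import csv
import io

# Three rows copied from the table above (tab-separated, header row first).
# In practice this text would be read from a file; the sample is illustrative.
TABLE = (
    "Dataset\tLang\tDomain\tKind\tLevel\tInstances\tRefs.\tLink\n"
    "TurkCorpus (Xu et al., 2016)\tEN\tWikipedia\tParallel\tSent\t"
    "2359 sentences (2000 dev, 359 test)\t8\tLink\n"
    "Klexikon (Aumiller and Gertz, 2022)\tDE\tWikipedia\tComparable\tDoc\t"
    "2898 article pairs\t1\tLink\n"
    "SimplePatho (Trienes et al., 2022)\tDE\tClinical\tParallel\tDoc\t"
    "851 documents\t1\tn/a\n"
)


def load_rows(text):
    """Parse the tab-separated table into a list of dicts keyed by column name."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def filter_rows(rows, **criteria):
    """Keep rows whose columns equal every given value, ignoring case."""
    return [
        row
        for row in rows
        if all(
            row[col].strip().lower() == val.lower()
            for col, val in criteria.items()
        )
    ]


rows = load_rows(TABLE)

# Example: German document-level datasets, sorted by dataset name.
german_docs = filter_rows(rows, Lang="DE", Level="Doc")
for row in sorted(german_docs, key=lambda r: r["Dataset"].lower()):
    print(row["Dataset"], "|", row["Domain"], "|", row["Kind"])
```

Matching is exact on the whole cell, so multi-language entries such as "EN / FR" would need a substring check instead.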

Contributing

New entries can be added to data.yml. Afterwards, run python render.py and submit a PR with the changes.

Acknowledgements

This list has greatly benefitted from the surveys of Alva-Manchego et al. (2020) and Štajner (2021), as well as notes by Laura Vásquez-Rodríguez. Thanks! Thanks also to @tollefj for adding the interactive table.
