| Text source | Information |
|---|---|
| "Alice in Wonderland" | Alice in Wonderland (Ch.1) |
| "Romeo and Juliet" | Romeo and Juliet |
| "Bhagavad Gita" | Bhagavad Gita |
| "Memento screenplay" | Memento screenplay |
| "100K tweets" | 100,000 tweets from the Sentiment140 dataset (training data) |
| "20K tweets" | 20,000 tweets from Gender Classifier Data |
| "MASC tweets" | MASC tweets (cleaned of HTML markup) |
| "MASC spoken" | MASC spoken transcripts (phone and face-to-face: 25,783 words) |
| "COCA blogs" | Corpus of Contemporary American English blog samples |
| "Google website" | Google homepage (accessed October 20, 2020) |
| "Software languages" | "Tower of Hanoi" solutions (programming languages A-Z, from Rosetta Code) |
| "Monkey text" | Ian Douglas's English-generated monkey0-7.txt corpus |
| "Coder text" | Ian Douglas's software-generated coder0-7.txt corpus |
| "iweb cleaned corpus" | First 150,000 lines of Shai Coleman's iweb-corpus-samples-cleaned.txt |

Reference for Monkey and Coder texts: Douglas, Ian. (2021, March 28). Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4642460