This project explores word usage frequencies by US presidents over time across 118,561 official documents and transcripts. It answers questions such as:
- What does word usage by US presidents look like over time?
- Who said it first? Last? Most?
- Where was it said?
- When was it said?
- Can I preview the documents it was said in?
The primary data source was The American Presidency Project at https://www.presidency.ucsb.edu/. I wrote a web scraper in R to capture all 118,561 official documents and associated metadata including:
- Date
- Location
- Categories
- President
- Citation
- Document URI
- Word count
The corpus totaled 117,374,146 words and was tokenized with quanteda to produce a SQLite database of 7,069,561 n-grams. The project gathered n-grams of sizes 1 through 5, meaning sequences of one to five consecutive words.
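To make the 1-to-5 n-gram idea concrete, here is a minimal sketch in Python. It is not the project's code, which used quanteda in R. The `tokenize` and `ngrams` helpers and the simple regex tokenizer are assumptions made for illustration, and quanteda's own tokenization rules will differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, or apostrophes.

    Hypothetical stand-in for a real tokenizer such as quanteda's.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n_min=1, n_max=5):
    """Yield every contiguous run of n_min..n_max tokens as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# Count n-grams in one illustrative sentence.
sample = "Ask not what your country can do for you"
counts = Counter(ngrams(tokenize(sample)))
```

In a pipeline like the one described, counts of this kind would be accumulated per document and stored alongside the document metadata so that frequencies can be grouped by president or date.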
The corpus data was further optimized for full-text search by leveraging SQLite’s FTS4 extension. With FTS4, it was also possible to extract snippets from the corpus data (pictured below).
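The following is a small sketch of how a full-text index and snippet extraction can work with SQLite's full-text search extension (FTS4), written in Python with the standard `sqlite3` module rather than the project's R code. The table name `docs`, its columns, and the sample rows are invented for illustration and are not taken from the project's database. It also assumes the local SQLite build has FTS4 enabled.

```python
import sqlite3

# In-memory database; a real deployment would open a file instead.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one FTS4 virtual table with two indexed text columns.
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(president, body)")

# Invented placeholder rows, not real presidential documents.
conn.executemany(
    "INSERT INTO docs (president, body) VALUES (?, ?)",
    [
        ("President A", "We hold that liberty and union are one and inseparable in this republic"),
        ("President B", "The budget for the coming fiscal year is balanced"),
    ],
)

# snippet() returns a fragment of the matching text: the matched term is
# wrapped in '[' and ']', '...' marks truncated text, 1 selects the body
# column (columns are zero-indexed), and 8 is the approximate fragment
# length in tokens.
rows = conn.execute(
    "SELECT president, snippet(docs, '[', ']', '...', 1, 8) "
    "FROM docs WHERE body MATCH ?",
    ("liberty",),
).fetchall()

for president, fragment in rows:
    print(president, fragment)
```

The `MATCH` operator uses the full-text index instead of scanning every row, which is what makes searching a corpus of this size practical.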
Live version available at: https://bryanfinlayson.shinyapps.io/presidential_ngram_search/