---
title: 'DIY: Basic text processing'
bibliography: references.bib
output:
  html_document:
    fig_caption: yes
    highlight: tango
    theme: spacelab
---
*From Chapter 13*
> *DIYs are designed to be hands-on. To follow along, download the DIYs repository from Github ([https://github.com/DataScienceForPublicPolicy/diys](https://github.com/DataScienceForPublicPolicy/diys)). The R Markdown file for this example is labeled `diy-ch13-text.Rmd`.*
Text processing, TF-IDF, and document similarity are foundational concepts for natural language applications. In this DIY, we illustrate how to put these concepts to use. Using a set of news articles, we walk through constructing a DTM, then use TF-IDF to find potential stop words. Lastly, we relate two or more documents to one another using cosine similarity. For a large-scale document database, these steps can help identify relationships between documents and facilitate qualitative analyses.
The news articles used for this DIY focus on the US federal budget deficit as reported in October 2019 [@nytimesdeficit; @deficit; @marketwatchdeficit; @thehilldeficit; @cnbcdeficit] and were scraped from various news websites.^[In the wild, these news articles would have been embedded in HTML pages on websites, requiring some attention to the structure of each page.]
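As a hedged illustration only, scraping a single article with the `rvest` package might look like the sketch below; the URL and the `p` selector are hypothetical and would need to be adapted to each site's HTML structure.
```{r, eval = FALSE}
#Illustrative sketch only: the URL and CSS selector are hypothetical
pacman::p_load(rvest)
page <- read_html("https://example.com/deficit-article")
article_text <- page %>%
  html_elements("p") %>%     #paragraph tags; real sites may need a narrower selector
  html_text2() %>%
  paste(collapse = " ")
```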
To start, we load a combination of packages, namely the `tidytext` package to work with text, `dplyr` for general data manipulation, and `stringr` for manipulating string values.
```{r, textpackages, message = FALSE, error = FALSE, warning = FALSE}
pacman::p_load(tidytext, dplyr, stringr)
```
__*Processing*__. The five articles are stored in a CSV file named `deficit-articles.csv`. As an initial step, the text should be scrubbed of all numeric values and punctuation using simple regex statements. While we recognize that numbers can carry useful information, we are more concerned with standardizing words so their frequency can approximate importance.
```{r, loadtext, message = FALSE, error = FALSE, warning = FALSE}
#Load file
deficit <- read.csv("data/deficit-articles.csv",
                    stringsAsFactors = FALSE)
#Remove punctuation and numeric values
deficit$text <- str_remove_all(deficit$text, "[[:digit:][:punct:]]")
```
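To see what the regex pattern does, consider a toy sentence (our own illustrative example, not drawn from the articles): digits and punctuation are stripped while letters and spaces are left untouched.
```{r, eval = FALSE}
#Toy example of the cleaning step: digits and punctuation are removed, spaces are kept
str_remove_all("The deficit hit $984 billion in 2019!",
               "[[:digit:][:punct:]]")
```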
In one block of piped commands (`%>%`), we unravel the text into tidy word counts for each article.
*Tokenization*. The code first tokenizes the `text` column into unigrams (single-word tokens) using the `unnest_tokens` function (`tidytext`). A new column `word` is added to the data frame, expanding the number of rows from $n=5$ to $n=2989$. Note that each token is automatically converted to lowercase.
*Stop words*. With the tokens exposed, stop words are removed. We retrieve a data frame of standard English stop words using the `get_stopwords` function (`tidytext`), then apply an `anti_join` (`dplyr`) that retains only rows that do not match terms in the stop word list. This step significantly reduces the size of the data set to $n=1792$.
*Stemming*. Using the `wordStem` function (`SnowballC` package), all words are screened and stemmed where appropriate. The result is assigned to a new variable `word.stem` using the `mutate` function (`dplyr`). As a sanity check, only stemmed tokens with more than one character are retained.
*Tabulation*. Lastly, the remaining unigrams are summarized as *term frequencies* (TF) -- a count of how often each `word.stem` token appears in each article. The resulting data set contains $n = 1005$ records, meaning that some terms appear more than once and contain relatively more of an article's meaning than other terms.
```{r, message = FALSE, error = FALSE, warning = FALSE}
#Unigram in one go
unigram <- deficit %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(language = "en")) %>%
  mutate(word.stem = SnowballC::wordStem(word)) %>%
  filter(nchar(word.stem) > 1) %>%
  count(outlet, word.stem, sort = TRUE)
```
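To build intuition for the stemming step, `wordStem` collapses inflected forms onto a common stem. The sample words below are our own illustration.
```{r, eval = FALSE}
#Illustrative stemming of a few sample words
SnowballC::wordStem(c("deficits", "spending", "projected"))
#Returns "deficit" "spend" "project"
```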
*Document-Term Matrix*. The tabulations can be further processed into a DTM using the `cast_dtm` function (`tidytext`) -- useful for applications like topic modeling. We should also highlight that some NLP use cases call for a Document-Feature Matrix (DFM), which allows metadata about each document, among other variables, to be included in the matrix. DFMs are quite similar to DTMs in structure but are stored as separate object classes to facilitate other applications.
```{r}
#Cast into DTM
deficit_dtm <- unigram %>%
  cast_dtm(document = outlet,
           term = word.stem,
           value = n)

#Cast into DFM
deficit_dfm <- unigram %>%
  cast_dfm(document = outlet,
           term = word.stem,
           value = n)
```
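As a quick sanity check, both objects should have one row per outlet and one column per stemmed term; the exact term count depends on the corpus.
```{r, eval = FALSE}
#Quick checks: rows are documents (outlets), columns are stemmed terms
dim(deficit_dtm)
dim(deficit_dfm)
```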
__*Distinguishing between documents*__. Term frequencies imply that higher frequency terms carry more importance. In our example, all five articles focus on the 2019 federal budget deficit, and the words "trillion", "dollar", "budget", and "deficit" all have high TF values. These high-frequency terms are particularly useful for distinguishing articles about deficits from articles about other topics such as education or defense. In a narrowly defined corpus focused exclusively on deficits, however, these terms function much like stop words. TF-IDF can be applied to identify such overly common words.
To use TF-IDF in `R`, we apply the `bind_tf_idf` function to our word counts, which calculates and joins the TF, IDF, and combined TF-IDF metrics for each token.
```{r, message = FALSE, error = FALSE, warning = FALSE}
unigram <- unigram %>%
  bind_tf_idf(term = word.stem,
              document = outlet,
              n = n)
```
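For intuition, TF is a term's share of tokens within a document, IDF is the natural log of the number of documents divided by the number of documents containing the term, and TF-IDF is their product. The sketch below recomputes these quantities by hand from the `unigram` counts; the `*_manual` column names are our own.
```{r, eval = FALSE}
#Recompute TF-IDF by hand to confirm the definitions
n_docs <- n_distinct(unigram$outlet)        #number of articles in the corpus

check <- unigram %>%
  group_by(outlet) %>%
  mutate(tf_manual = n / sum(n)) %>%        #term share within each article
  group_by(word.stem) %>%
  mutate(idf_manual = log(n_docs / n_distinct(outlet)),   #rarity across articles
         tf_idf_manual = tf_manual * idf_manual) %>%
  ungroup()

#Difference should be (near) zero if the definitions match bind_tf_idf
max(abs(check$tf_idf - check$tf_idf_manual))
```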
The effect of TF-IDF on word importance can be quite dramatic and is visualized in a parallel coordinate plot in Figure \@ref(fig:tfidfcomparison). The plot illustrates how a single token's relative value changes between simple term frequencies ($n$) on one vertical axis and TF-IDF on the other. About three-quarters of unigrams increase in importance once we control for how common terms are across the corpus of articles (blue coordinate pairs). In contrast, one-quarter of the terms fall in rank, indicating that they hold less distinguishing information. These include obvious deficit-related terms such as *deficit*, *spend*, *year*, *trump*, *budget*, and *tax*, among others.
TF-IDF can also be used to flag stop word-like terms that would otherwise be hard to identify. In the deficit articles, approximately 10% of terms ($n=100$) have TF-IDF values equal to zero because they appear in every article. This is not to say these terms are unimportant, but removing them in this context could be beneficial.
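If we did choose to treat these zero-TF-IDF terms as corpus-specific stop words, a simple filter (a minimal sketch) drops them:
```{r, eval = FALSE}
#Drop terms that appear in every article (TF-IDF of zero)
unigram_trimmed <- unigram %>%
  filter(tf_idf > 0)
```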
```{r, tfidfcomparison, echo = FALSE, fig.cap = "Comparison of word frequency and TF-IDF distributions.", fig.height = 3}
pacman::p_load(GGally)
#Flag whether each term's percentile rank rises, falls, or stays the same under TF-IDF
up_down <- rep("same", nrow(unigram))
up_down[percent_rank(unigram$n) < percent_rank(unigram$tf_idf)] <- "up"
up_down[percent_rank(unigram$n) > percent_rank(unigram$tf_idf)] <- "down"
unigram$Direction <- up_down
unigram$n_tile <- percent_rank(unigram$n)
unigram$tfidf_tile <- percent_rank(unigram$tf_idf)
#Re-order rows by direction of change
unigram <- rbind(unigram[up_down == "down", ],
                 unigram[up_down == "same", ],
                 unigram[up_down == "up", ])
#Parallel coordinate plot comparing term frequency (n) and TF-IDF
ggparcoord(unigram,
           columns = c(3, 6), groupColumn = 7,
           showPoints = TRUE,
           scale = "uniminmax",
           alphaLines = 0.5) +
  ylab("Scaled value") +
  scale_color_manual(values = c("pink", "slategrey", "darkblue")) +
  theme_bw() +
  theme(plot.title = element_text(size = 10),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10))
```
__*Finding similar documents*__. To associate the articles with one another, we can construct cosine similarities from the DTM. There is, however, an outstanding consideration: should the similarities be based on term frequencies or on TF-IDF? Using the `cosine` function (`coop` package), we calculate the cosine similarities for each case.
```{r, eval = FALSE}
#Load COOP
pacman::p_load(coop)
#Similarity based on term frequencies
#Note: outlets are cast as columns so cosine() compares articles column-wise
grams_n <- unigram %>%
  cast_dtm(word.stem, outlet, n)
cosine(grams_n)
#Similarity based on TF-IDF
grams_tfidf <- unigram %>%
  cast_dtm(word.stem, outlet, tf_idf)
cosine(grams_tfidf)
```
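For intuition, the cosine similarity between two articles is the dot product of their term-count vectors divided by the product of their norms, $\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$. The sketch below checks this by hand for the first two columns of the term-frequency matrix; the column positions are illustrative.
```{r, eval = FALSE}
#Manual cosine similarity between the first two outlets (columns)
mat <- as.matrix(grams_n)                  #terms in rows, outlets in columns
a <- mat[, 1]
b <- mat[, 2]
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
#Should match the [1, 2] entry of cosine(as.matrix(grams_n))
```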
As is apparent in Table \@ref(tab:cosinesim), our choice of input metric emphasizes different qualities of the articles. The raw term frequencies help identify articles that are related in their overarching topics, fixating on the common words that describe the federal deficit. TF-IDF, in contrast, treats these common words as stop words, leaving terms that reflect each author's style and attitude rather than the big picture. Both are valid approaches; the most appropriate metric depends on the analytical objective.
Cosine similarity is simply a relational metric that ranks documents relative to a given document. By calculating the cosine similarity between all documents in a corpus, one can recommend other articles with similar content. For example, the scores in Table \@ref(tab:cosinesim) can be interpreted as a list of *similar articles conditional on the MarketWatch article*.
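To turn this into a simple recommender, we can sort one article's column of the similarity matrix. The sketch below uses the TF-IDF-based matrix and assumes the fourth column corresponds to the MarketWatch article, as in the table code that follows.
```{r, eval = FALSE}
#Rank articles by similarity to the fourth outlet (MarketWatch)
m <- as.matrix(grams_tfidf)                        #terms in rows, outlets in columns
sims <- cosine(m)
dimnames(sims) <- list(colnames(m), colnames(m))   #label rows/columns by outlet
sort(sims[, 4], decreasing = TRUE)
```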
```{r, message = FALSE, warning = FALSE, error = FALSE, echo = FALSE}
#Load COOP
pacman::p_load(coop)
#Similarity based on term frequencies
grams_n <- unigram %>%
  cast_dtm(word.stem, outlet, n)
g1 <- cosine(grams_n)
#Similarity based on TF-IDF
grams_tfidf <- unigram %>%
  cast_dtm(word.stem, outlet, tf_idf)
g2 <- cosine(grams_tfidf)
#Assemble comparison table: similarities against the MarketWatch article (column 4)
output <- data.frame(g1[, 4], g2[, 4])
output <- output[-4, ]
output <- data.frame(Outlet = c("NYTimes", "The Hill", "AP", "CNBC"),
                     output)
colnames(output) <- c("Outlet", "n", "TF-IDF")
#Output table
pander::pander(output, split.cell = 80, split.table = Inf,
               caption = "(\\#tab:cosinesim) Comparison of cosine similarity using TF-IDF and term frequencies. All values are compared against the deficit article written by MarketWatch.",
               justify = "left", row.names = FALSE)
```