
title: "Text Analysis" author: "Halina Do-Linh" date: "2023-02-02" output: html_document

knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE,
                      message = FALSE)

Who works with survey data? Cool! I'm curious whether you've used these techniques before; if not, hopefully they'll be useful to you.

The inherent benefit of quantitative text analysis is that it is highly scalable. With the right computational techniques, massive quantities of text can be mined and analyzed many, many orders of magnitude faster than it would take a human to do the same task.

The downside is that human language is inherently nuanced, and computers (as you may have noticed) think very differently than we do. For an analysis to capture this nuance, the tools and techniques for text analysis need to be set up with care, especially as the analysis becomes more complex.

There are a number of different types of text analysis. In this lesson we will show some simple examples of two of them: word frequency and sentiment analysis, plus a couple of word clouds.

Much of the information covered in this chapter is based on Text Mining with R: A Tidy Approach by Julia Silge and David Robinson. This is a great book if you want to go deeper into text analysis.

In general, text mining is the process by which unstructured text is transformed into a structured format to prepare it for analysis.

This can range from the simple techniques we use in this lesson, to much more complicated processes such as using OCR (optical character recognition) to scan and extract text from pdfs, or web scraping.

The big takeaway I want to impart to all of you today is: once text is in a structured format, analysis can be performed on it.

And we'll be practicing some techniques today to get unstructured text into a structured format.

Ok well let's start with our examples.

Set up

library(dplyr)
library(tibble)
library(readr)
library(tidytext)
library(wordcloud)
library(reshape2)
library(tidyr) # unnest function 

# ADD THESE PACKAGES LATER #
library(pdftools) # read in pdf
library(stringr) # manipulate strings
survey_raw <- read_csv("https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A71cb8d0d-70d5-4752-abcd-e3bcf7f14783", show_col_types = FALSE)

events <- read_csv("https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A0a1dd2d8-e8db-4089-a176-1b557d6e2786", show_col_types = FALSE)

Go ahead and clean the survey_raw data, and then join it with the events data using "StartDate" as the key. Once you're done, please put a yellow sticky up. And if you finish early, check in with your neighbor.

survey_clean <-  survey_raw %>% 
  select(-notes) %>% 
  mutate(Q1 = ifelse(Q1 == "1", "below expectations", Q1)) %>% 
  mutate(Q2 = tolower(Q2))
survey_joined <- left_join(survey_clean, events, by = "StartDate")

Unnest Tokens Q3

Are you familiar with the term token? Can anyone share what a token is?

Yes! Exactly. So a token is a meaningful unit of text. Depending on the analysis, that could be a word, two words, or a phrase. For our examples, our definition of a token will be one word.

So we are going to be working in the “tidy text format.” This format says that the column with all of our text data can contain only one token per row.
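Just to make that concrete before we touch the survey data, here's a tiny made-up example (the sentence and the data frame below are invented purely for illustration):

# a tiny made-up data frame, just to illustrate the tidy text format
tiny <- data.frame(id = 1, text = "I really enjoyed the data visualization lesson")

tiny %>% 
  unnest_tokens(output = word, input = text) # from `tidytext`

Each word from the sentence gets its own row, and the id column is carried along. Now let's do the same thing with our real data.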

First, let's create a data frame with responses to question 3, with one token per row.

We use the unnest_tokens function from tidytext, after selecting the columns of interest. Let's look at the help page of unnest_tokens.

q3 <- survey_joined %>% 
  select(StartDate, location, Q3) %>%
  unnest_tokens(output = word, input = Q3) # from `tidytext` 

You’ll see that we now have a very long data frame with only one word in each row of the text column.

Some of the words aren’t so interesting though. The words that are likely not useful for analysis are called “stop words”. There is a list of stop words contained within the tidytext package.

We can view the stop_words dataset by typing View(stop_words) in the console, or we can preview it using head().

head(stop_words)

We can then use the anti_join function to return only the words that are not in the stop word list.

# remove stop words using stop_words df from `tidytext` 
q3 <- anti_join(q3, stop_words, by = "word")

Ok, now that we've removed stop words, we can see which words occur most frequently in the responses to question 3.

Most common words used in Q3

q3_top <- q3 %>% 
  group_by(word) %>% 
  summarize(count = n()) %>% 
  arrange(-count)

Ok now that we've done that for Q3, go ahead and do the same for Q4 or question 4. So the steps would be:

  • unnest tokens for question 4
  • remove the stop words
  • find the most frequently occurring words in question 4 using group_by() and summarize()

Unnest tokens Q4

q4 <- survey_joined %>%
  select(StartDate, location, Q4) %>%
  unnest_tokens(output = word, input = Q4) # from `tidytext` 

Remove stop words Q4

# remove stop words using stop_words df from `tidytext` 
q4 <- anti_join(q4, stop_words, by = "word")

Most common words used in Q4

q4_top <- q4 %>% 
  group_by(word) %>% 
  summarize(count = n()) %>% 
  arrange(-count)

So we see that the word "data" comes up most frequently in both question 3 and question 4, which is unsurprising, but maybe this word isn't very helpful in our overall analysis.

When situations like this occur, we can customize stop words and add stop words to the stop words dataset.

So taking a look at the stop words dataset again we see there are two columns, word and lexicon. For us to add to this, we need to create our own data frame using the same columns.

Your instinct might be to join these datasets, but since we have exactly the same column names and we just want to add rows to the stop words dataset, the function we're going to use is rbind(), which binds the rows of one data frame directly underneath another.

Add custom stop words

custom_words <- data.frame(word = "data", lexicon = "custom")

stop_words_full <- rbind(stop_words, custom_words)
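As a quick check (this step isn't strictly required for the rest of the lesson, and the object name q3_top_custom is just for illustration), we can re-run the anti-join with our custom word to drop "data" from the Q3 counts:

# remove the custom stop word ("data") from the already-filtered Q3 tokens
q3_top_custom <- q3 %>% 
  anti_join(custom_words, by = "word") %>% 
  group_by(word) %>% 
  summarize(count = n()) %>% 
  arrange(-count)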

Ok, so we found the most common words in survey question responses that came to us in a data frame structure. What if we wanted to find the most commonly occurring words in a pdf file? This is unstructured text!

Before we get the pdf file that we'll be working with, let's go back to our load libraries chunk. We need to add two libraries: pdftools and stringr

Unstructured text

So remember when I taught you all about atomic vectors? Recall that atomic vectors are made up of elements of the same data type.

A list, on the other hand, is a type of vector and one of the most flexible data structures in R, because it can contain elements of multiple data types.
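A quick illustrative sketch of the difference:

# an atomic vector: every element is the same type
v <- c(1, 2, 3)
class(v)

# a list can hold elements of different types side by side
l <- list(1, "two", TRUE)
class(l)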

Read in pdf using pdf_text

# pdg = population dynamics greenland 
txt <- pdf_text("data/Translation_of_FG_8_Ilulissat_170410_0077.pdf")

class(txt)

So we see the pdf was read in as a character vector, with one element per page, and if we look in our global environment, it appears in the Values section. So let's convert it from a character vector to a data frame; the tibble package has a handy function for this called enframe().

Convert pdf character vector to data frame

txt_clean <- txt %>%
  enframe() %>% # from `tibble`
  rename(page = name) # new col name = old col name

Unnest and remove stop words from pdf text

pdf_summary <- txt_clean %>%  
  unnest_tokens(output = word, input = value) %>% 
  # remove stop words using stop_words df from `tidytext`
  anti_join(stop_words, by = "word") %>% 
  count(word) %>% 
  arrange(-n) %>% 
  slice_head(n = 10)

Summarize text in pdf

Take a look at pdf_summary. Do we think these words are meaningful? If we take a look at the original document (again, it's so important to know what your data is!), we can see that the headers and footers repeat on every page, so a word like "focus" may not be helpful as we're trying to find the most commonly occurring words in the responses.

It might also be helpful to separate the questions and answers.

So let's start with removing the headers.

Remove headers from pdfs

First let's see what our data looks like to guide our cleaning. When we view the data set using View() we can only see the first 50 or so characters.

To see more of the values, or more of each pdf page, we can use the function substr(), which extracts (or replaces) substrings of a character vector.

Who remembers what the $ does? Also called the subset operator? Right, it allows us to access the different columns or objects in a data frame.

Indexing syntax uses square brackets [] and allows us to explore specific values in a column of a data frame. Remember what Matt said yesterday: all data frames are made up of vectors.
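As a tiny refresher (using the built-in mtcars dataset rather than our survey data):

# $ pulls one column out of a data frame as a vector
head(mtcars$mpg)

# [ ] then indexes into that vector; this returns the first value
mtcars$mpg[1]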

# preview first 500 characters on the first pdf page 
substr(txt_clean$value[1], 1, 500)

To start our clean up process, let's split our value column wherever a double newline (\n\n) occurs.

Split up a string using str_split()

str_split() splits up a string and returns a list. Remember, a list in R is a type of data structure that can contain many different data types inside it.

So if you look back at our txt_clean data frame and look at the value column, essentially each value is one large string. And we need to break up that string. That's what str_split() is doing. Make sense?

If not, no worries. We're gonna do a mini example

  • str_split() mini example

So in this example, our long string is "1,2,3,4,5" and we're storing it in the object x.

In str_split() we can specify the separator we want to split on, in this case it would be commas

The result will be a list with a single character vector in it:

x <- "1,2,3,4,5"
x

x_split <- str_split(x, ",")
x_split

But we won't be working with a string data type like this. We'll be working in a data frame, and we'll be using str_split() within mutate() so let's apply our mini example to a data frame.

First let's turn our x object into a data frame using data.frame(), where a column called "x" gets the value of our object x. Then we can apply str_split() within our mutate() function, because we want to manipulate values within a column.

x_df <- data.frame(x = x) %>% 
  mutate(x = str_split(x, ","))

x_df

Now we can unnest the values in the x column of our data frame using unnest() from tidyr.

x_df_flat <- x_df %>% 
  unnest(cols = x)

x_df_flat

So what we did was take a "long" character string and split it based on a specified separator (for us, commas), which turned it into a list containing a vector with 5 elements in it. The last thing we did was unnest that vector.

And now we're going to do these same steps with our pdf text, where our separator will be the double newline \n\n.

txt_clean <- txt_clean %>% 
  mutate(value = str_split(value, "\n\n")) %>% 
  unnest(cols = value)

We're going to explicitly call the DT package, which creates interactive tables, to take a quick look at our newly cleaned and unnested data.

DT::datatable(txt_clean, rownames = F)

The questions and answers in the pdf are now more visible; the other lines are either blank or header / footer lines from the document. Since we only want to look at the most commonly occurring words in the questions and answers, we're going to use substr() to extract part of the question and answer strings to create ids that we can filter on.

txt_clean <- txt_clean %>% 
  # create a new col called "id" and we're going to fill it with the first four chrs from value
  mutate(id = substr(value, 1, 4))
unique(txt_clean$id)

Using unique() we've found that some of the question and answer ids have newline characters before them. We can use the function str_replace() to replace that part of the string.

txt_clean <- txt_clean %>% 
  mutate(id = str_replace(id, "\n", ""))

Let's run unique again

unique(txt_clean$id)

Now that we've cleaned up the ids, we can extract just the first two characters.

txt_clean <- txt_clean %>% 
  mutate(id = substr(id, 1, 2))

Finally we can filter for just the ids that start with a Q or an A. To do this we include grepl() and a regular expression in the filter. We won't go into detail about regular expressions, but the quick gist is that regular expressions are used to match a pattern in a string or text. There is a chapter on regular expressions in the appendix section of the book for you to reference.

Let's do another mini example.

# create a vector of chr / string data types
x <- c("Q3", "F1", "AAA", "FA")

# grepl() searches for matches based on a pattern you provide; it tries to match that pattern against each element of a character vector
# the regular expression "^[QA]" says: find strings that start with either a Q or an A
grepl(pattern = "^[QA]", x)

We can apply this same regular expression to our pdf data set

txt_clean <- txt_clean %>% 
  mutate(id = substr(value, 1, 4)) %>% 
  mutate(id = str_replace(id, "\n", "")) %>% 
  mutate(id = substr(id, 1, 2)) %>% 
  filter(grepl("^[QA]", id))

Our final cleaning step is to remove the question or answer ids at the beginning of the values in the value column.

txt_clean <- txt_clean %>% 
  mutate(id = substr(value, 1, 4)) %>% 
  mutate(id = str_replace(id, "\n", "")) %>% 
  mutate(id = substr(id, 1, 2)) %>% 
  filter(grepl("^[QA]", id)) %>% 
  mutate(value = str_replace_all(value, "[QA][0-9]\\:", ""))

Now let's find the most commonly occurring words.

pdf_summary <- txt_clean %>%  
  # unnest tokens 
  unnest_tokens(output = word, input = value) %>%
  anti_join(stop_words, by = "word") %>% 
  count(word) %>% 
  arrange(-n) %>% 
  slice_head(n = 10)

We did it! These are the top 10 most frequently occurring words in the entire pdf. This may or may not be meaningful and may require more wrangling, but this is the first iteration.
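If you want a quick visual of these counts, a simple bar chart is one option. This is just an optional sketch; it assumes ggplot2 is installed, since it isn't loaded in our setup chunk.

library(ggplot2)

# horizontal bar chart of the top 10 words, ordered by count
ggplot(pdf_summary, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "count")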

Sentiment Analysis

What's a sentiment?

In sentiment analysis, tokens (in this case our single words) are evaluated against a dictionary of words where a sentiment is assigned to the word.

There are many different sentiment lexicons: some with single words, some with more than one word, and some aimed at particular disciplines. When embarking on a sentiment analysis project, choosing your lexicon is a decision that should be made with care.

For now we're going to do a very simple sentiment analysis on Q3 and Q4 answers using the bing lexicon.

First we're going to load our sentiments.

# load the bing sentiment lexicon from `tidytext`
bing <- get_sentiments(lexicon = "bing")
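We can take a quick peek at the lexicon; each word is paired with a positive or negative sentiment:

head(bing)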

Next we do an inner join to return the words from question 3 that are contained within the lexicon.

q3_sent <- inner_join(q3, bing, by = "word")

Analysis-wise, there are lots of different ways to go after you've assigned your sentiments.

You can:

  • calculate an overall sentiment index for that question (see the sketch after this list),
  • plot sentiment against some other variable or,
  • make a fun word cloud!
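Here's a minimal sketch of one way to compute an overall sentiment index for Q3; the formula (share of positive words minus share of negative words) and the object name q3_index are illustrative choices, not part of the original lesson.

# proportion of positive words minus proportion of negative words in Q3
q3_index <- q3_sent %>% 
  summarize(positive = sum(sentiment == "positive"),
            negative = sum(sentiment == "negative"),
            index = (positive - negative) / n())

And here's a comparison word cloud for Q3: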
q3_sent %>% 
  # count frequency of words and their sentiment
  count(word, sentiment, sort = TRUE) %>%
  # start word cloud - negative and positive frequency for each word
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100, title.size = 2) 
We can do the same thing for Q4, joining with the lexicon and building the word cloud in one pipeline:

q4 %>%
  # join with bing lexicon
  inner_join(bing, by = "word") %>% 
  # count frequencies 
  count(word, sentiment, sort = TRUE) %>%
  # word cloud 
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100, title.size = 2)