We’ll use this opinion piece from the New York Times (last updated more than a year ago, sadly).
Several data points in this piece:
- The article is a (long) list,
- Each list item is a lie,
- Each lie is made up of a date, a statement, an explanation of why the statement is untrue, and a supporting link
Scraping is as much about creating structure out of the data as it is about downloading it from a webpage.
Conceptually, we want to achieve something like this:
date | statement | counter statement | supporting link
Jan 21 | I didn’t want to go into Iraq | He was for an invasion | https://something
We’ll use the rvest package:
library(rvest)
page <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
page
{xml_document}
<html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script>\nif (docume ...
[2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n display: block !important;\n height: 50px; ...
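As a quick sanity check (a minimal sketch; the exact title text depends on the live page), we can pull out the contents of the <title> tag:
page %>%
  html_node("title") %>%
  html_text()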
The DOM is a hierarchy (a tree) of JavaScript node objects:
<body>
<article>
<h1>This is a simple title</h1>
<h1 id="special">This is another title...
<em>with a twist!</em>
</h1>
</article>
</body>
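To make the tree idea concrete, here is a minimal sketch that parses the snippet above from a string (read_html() accepts inline HTML) and walks down the tree:
doc <- read_html('
  <body>
    <article>
      <h1>This is a simple title</h1>
      <h1 id="special">This is another title... <em>with a twist!</em></h1>
    </article>
  </body>')

doc %>% html_nodes("h1")    # both <h1> nodes
doc %>%
  html_node("#special") %>% # target the node by its id...
  html_node("em") %>%       # ...then descend to its <em> child
  html_text()               # "with a twist!"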
HTML is composed of tags: <tag>
This is what each lie in the big <ol> (for Ordered List) looks like:
<span class="short-desc">
<strong> DATE </strong> LIE
<span class="short-truth">
<a href="URL"> EXPLANATION </a>
</span>
</span>
See this <span class="short-desc">?
We can target the class to select all tags/nodes that carry this class:
lie_containers <- page %>% html_nodes(".short-desc")
lie_containers %>% head()
{xml_nodeset (6)}
[1] <span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq...
[2] <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time ma...
[3] <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and ...
We end up with a list of 180 items.
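You can check this yourself (assuming the page has not changed since this was written):
length(lie_containers)
[1] 180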
Dates are stored here:
<strong> DATE </strong>
As before, we can use html_nodes():
lie_containers %>%
  html_nodes("strong")
{xml_nodeset (180)}
[1] <strong>Jan. 21 </strong>
[2] <strong>Jan. 21 </strong>
[3] <strong>Jan. 23 </strong>
Obviously we don’t want to keep the tags.
We also want to trim the rogue spaces at the beginning and end of each string:
lie_containers %>%
  html_nodes("strong") %>%
  html_text(trim = TRUE)
[1] "Jan. 21" "Jan. 21" "Jan. 23" "Jan. 25" "Jan. 25"
Getting the lie itself is a bit more difficult. Here’s our structure again:
<span class="short-desc">
<strong> DATE </strong> LIE
<span class="short-truth">
<a href="URL"> EXPLANATION </a>
</span>
</span>
LIE is not within any tag…
Fortunately, thanks to some magic (and because the page is well structured), we can use a trick:
xml_contents(lie_containers[1])
{xml_nodeset (3)}
[1] <strong>Jan. 21 </strong>
[2] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
[3] <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump- ...
Running xml_contents() on a list element gives us yet another list; it contains three elements, the second one being our lie. Thus:
xml_contents(lie_containers[1])[2]
{xml_nodeset (1)}
[1] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
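As a side note, we are not limited to one element at a time: a possible sketch to pull all the lies in one go is to apply the same trick over the whole nodeset with sapply():
lies <- sapply(lie_containers, function(node) {
  xml_contents(node)[2] %>%   # the second child is the bare text of the lie
    html_text(trim = TRUE)
})
(We will instead do this inside a for loop below, which handles all four fields at once.)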
The explanation is easy to get because it is contained in the span carrying the .short-truth class:
lie_containers %>%
  html_nodes(".short-truth") %>%
  html_text(trim = TRUE)
[1] "(He was for an invasion before he was against it.)"
[2] "(Trump was on the cover 11 times and Nixon appeared 55 times.)"
[3] "(There's no evidence of illegal voting.)"
The supporting link is contained in an <a> tag: it is a hyperlink.
Here’s what a link looks like in HTML:
<a href="https://website.com">text of the link</a>
We extract just the link (the href attribute) like so:
lie_containers %>%
  html_nodes("a") %>%
  html_attr("href")
[1] "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"
[2] "http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/"
[3] "https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html"
Our strings aren’t very clean. We can use the stringr package to make them nicer to work with: str_c() concatenates strings, and str_sub() slices them (here, dropping the first and last characters).
library(stringr)
str_c("21 Feb", ", 2017")
[1] "21 Feb, 2017"
str_sub("'quote marks'", 2, -2)
[1] "quote marks"
A programming pattern you will see regularly is the following:
- We create an empty variable: our final dataset, empty for now
- We have a list of things we need to do some work on
- We iterate over each item in this list, building a temporary store
- At each iteration, we append the temporary store to our main dataset
When we’re finished running over each element in our main list, we end up with a complete main dataset!
The tool for the job is the for loop:
words <- c('one', 'two', 'basile')
for (word in words) {
  print(word)
}
[1] "one"
[1] "two"
[1] "basile"
Conceptually, we’ll do this:
# our main dataset
records <- vector("list")

for (lie in lie_containers) {
  # store the date
  # store the lie
  # store the explanation
  # store the url

  # our temporary data store with all the right columns
  df <- data_frame(date, lie, explanation, url)

  # send our temporary data store back up
  # to populate our main dataset
  records <- rbind(records, df)
}
records <- vector("list")

for (i in seq_along(lie_containers)) {
  # the date lives in the <strong> tag
  date <- lie_containers[i] %>%
    html_nodes("strong") %>%
    html_text(trim = TRUE)
  # the lie is the bare text: the second child of the container
  lie <- xml_contents(lie_containers[i])[2] %>%
    html_text(trim = TRUE)
  # the explanation lives in the .short-truth span
  explanation <- lie_containers[i] %>%
    html_nodes(".short-truth") %>%
    html_text(trim = TRUE)
  # the supporting link is the href attribute of the <a> tag
  url <- lie_containers[i] %>%
    html_nodes("a") %>%
    html_attr("href")

  df <- data_frame(date = str_c(date, ", 2017"),
                   lie = str_sub(lie, 2, -2),                 # drop the curly quotes
                   explanation = str_sub(explanation, 2, -2), # drop the parentheses
                   url = url)

  records <- rbind(records, df)
}
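A side note on this design: rbind() copies the whole dataset at every iteration, which is fine for 180 rows but slow for large scrapes. An alternative sketch, producing the same records, builds one small data frame per container with lapply() and binds them once at the end:
rows <- lapply(lie_containers, function(node) {
  data_frame(
    date        = node %>% html_nodes("strong") %>% html_text(trim = TRUE),
    lie         = xml_contents(node)[2] %>% html_text(trim = TRUE),
    explanation = node %>% html_nodes(".short-truth") %>% html_text(trim = TRUE),
    url         = node %>% html_nodes("a") %>% html_attr("href")
  )
})
records <- bind_rows(rows) %>%
  mutate(date        = str_c(date, ", 2017"),
         lie         = str_sub(lie, 2, -2),
         explanation = str_sub(explanation, 2, -2))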
records %>% head()
# A tibble: 6 x 4
date lie explanation url
<chr> <chr> <chr> <chr>
1 Jan. 21… I wasn't a fan of Iraq. I didn… He was for an invasion before… https://www.buzzfeed.com/andr…
2 Jan. 21… A reporter for Time magazine —… Trump was on the cover 11 tim… http://nation.time.com/2013/1…
3 Jan. 23… Between 3 million and 5 millio… There's no evidence of illega… https://www.nytimes.com/2017/…
4 Jan. 25… Now, the audience was the bigg… Official aerial photos show O… https://www.nytimes.com/2017/…
5 Jan. 25… Take a look at the Pew reports… The report never mentioned vo… https://www.nytimes.com/2017/…
6 Jan. 25… You had millions of people tha… The real number is less than … https://www.nytimes.com/2017/…
How are the lies distributed over the months of 2017? The date column is still a string, so we first parse it into a proper date (date_clean) with the lubridate package, then count per month:
library(lubridate)

records %>%
  mutate(date_clean = mdy(date),
         month = month(date_clean)) %>%
  group_by(month) %>%
  summarise(count = n())
# A tibble: 11 x 2
month count
<dbl> <int>
1 1 12
2 2 29
3 3 15
4 4 23
5 5 14
6 6 18
7 7 15
8 8 10
9 9 13
10 10 28
11 11 3
Finally, which sources are cited most often in the supporting links? The urltools package can extract the domain from each URL:
install.packages('urltools')
library(urltools)
library(ggplot2)

suffix_extract(domain(records$url)) %>%
  select(domain) %>%
  group_by(domain) %>%
  mutate(count = n()) %>%
  arrange(desc(count)) %>%
  distinct() %>%
  head(10) %>%
  ggplot(aes(x = domain, y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))