Skip to content

Latest commit

 

History

History
337 lines (251 loc) · 9.85 KB

week7-web-scraping.org

File metadata and controls

337 lines (251 loc) · 9.85 KB

scraping Trump lies

Additional links and sources

Where to find our database of lies

We’ll use this opinion piece from the New York Times (last updated more than a year ago, sadly).

figures/screenshot.png

Several data points in this piece:

  • The article is a (long) list,
  • Each list item is a lie,
  • Each lie is made of a date, of a statement, of an explanation of why this statement is untrue, and a supporting link

Scraping in R

Scraping is downloading data from a webpage, as much as it is creating structure out of this data.

Conceptually we will want to achive something like this:

datestatementcounter statementsupporting link
Jan 21I didn’t want to go into IraqHe was for an invastionhttps://something

Setting up tooling in R

We’ll use the rvest package:

library(rvest)
page <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
page

{xml_document}
<html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script>\nif (docume ...
[2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n  display: block !important;\n  height: 50px; ...

A word on the web

The DOM

The DOM is a hierarchy/ tree of Javascript node objects

Structure of a webpage

<body>
  <article>
    <h1>This is a simple title</h1>

    <h1 id="special">This is another title... 
      <em>with a twist!</em>
    </h1>
  </article>
</body>

HTML is composed of tags: <tag>

Making sense of our lies data

This is what each lie in the big <ol> (for Ordered List) looks like:

<span class="short-desc">
    <strong> DATE </strong> LIE
    <span class="short-truth">
        <a href="URL"> EXPLANATION </a>
    </span>
</span>

Collecting all the lie “containers”

See this <span class="short-desc">? We can target the class to select all tags/nodes that carry this class:

lie_containers <- page %>% html_nodes(".short-desc")
lie_containers %>% head()

{xml_nodeset (6)}
[1] <span class="short-desc"><strong>Jan. 21 </strong>I wasn't a fan of Iraq...
[2] <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time ma...
[3] <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and ...

We end up with a list of 180 items.

Extracting the date

Dates are stored here:

<strong> DATE </strong>

As previously we can use html_nodes:

lie_containers %>%
  html_nodes("strong")

{xml_nodeset (180)}
[1] <strong>Jan. 21 </strong>
[2] <strong>Jan. 21 </strong>
[3] <strong>Jan. 23 </strong>

Getting the text out

Obviously we don’t want to keep the tags in there.

Also we want to trim these rogue spaces at the beginning and the end of the string:

lie_containers %>%
  html_nodes("strong") %>%
  html_text(trim= TRUE)

[1] "Jan. 21"  "Jan. 21"  "Jan. 23"  "Jan. 25"  "Jan. 25"

Extracting the lie

Getting the lie itself is a bit more difficult. Here’s our structure again:

<span class="short-desc">
    <strong> DATE </strong> LIE
    <span class="short-truth">
        <a href="URL"> EXPLANATION </a>
    </span>
</span>

LIE is not within any tag…

XML contents

Fortunately, thanks to some magic (and because the page is well structured), we can use a trick:

xml_contents(lie_containers[1])

{xml_nodeset (3)}
[1] <strong>Jan. 21 </strong>
[2] “I wasn't a fan of Iraq. I didn't want to go into Iraq.” 
[3] <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump- ...

Running xml_contents() on a list element gives us yet another list; it contains three elements, the second one being our lie. Thus:

xml_contents(lie_containers[1])[2]

{xml_nodeset (1)}
[1] “I wasn't a fan of Iraq. I didn't want to go into Iraq.

Extracting the explanation

The explanation is easy because it is contained in the .short-truth tag:

lie_containers %>%
  html_nodes(".short-truth") %>%
  html_text(trim= TRUE)
[1] "(He was for an invasion before he was against it.)"                                                                                 
[2] "(Trump was on the cover 11 times and Nixon appeared 55 times.)"                                                                     
[3] "(There's no evidence of illegal voting.)"   

Extracting the supporting link

The supporting link is contained in a <a> tag: it is a link.

Here’s how a link looks like in HTML:

<a href="https://website.com">text of the link</a>

We extract the link only like so:

lie_containers %>%
  html_nodes("a") %>%
  html_attr("href")
[1] "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"                                                                                                                                                                    
[2] "http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/"                                                                                                                                                                                                  
[3] "https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html"   

String operations

Our strings aren’t very clean. We can use the stringr package to make them a bit nicer to work with.

str_c("21 Feb", ", 2017")
[1] "21 Feb, 2017"

str_sub("'quote marks'", 2, -2)
[1] "quote marks"

Iteration and scraping

A programming pattern you will see regularly is the following:

  • We create an empty thing/variable: our final dataset, empty for now
  • We have a list of things we need to do some work on
  • We iterate over each item in this list, build a temporary store
  • Then for each iteration we append the temporary store to our main dataset

When we’re finished running over each element in our main list, we end up with a complete main dataset!

For loop

The tool for the job is the for loop:

vector <- c('one', 'two', 'basile')
for (variable in vector) {
  print(variable)
}

[1] "one"
[1] "two"
[1] "basile"

Appending to an empty list

Conceptually, we’ll do this:

# our main dataset
records <- vector("list")

for (lie in lie_containers) {
  # store the date
  # store the lie
  # store the explanation
  # store the url
  
  # our temporary date store with all the right columns
  df <- data_frame(date, lie, explanation, url)
  
  # send our temporary data store back up
  # to populate our main dataset
  records <- rbind(records, df)
}

Putting it all together

records <- vector("list")

for (i in seq_along(lie_containers)) {
  date <- lie_containers[i] %>% 
            html_nodes("strong") %>% 
            html_text(trim = TRUE)
  lie <- xml_contents(lie_containers[i])[2] %>% 
          html_text(trim = TRUE)
  explanation <- lie_containers[i] %>%
                  html_nodes(".short-truth") %>%
                  html_text(trim = TRUE)
  url <- lie_containers[i] %>% 
          html_nodes("a") %>%
          html_attr("href")
  df <- data_frame(date = str_c(date, ", 2017"), 
                   lie = str_sub(lie, 2, -2), 
                   explanation = str_sub(explanation, 2, -2), 
                   url = url)
  records <- rbind(records, df)
}

> df %>% head()
# A tibble: 6 x 4
  date     lie                             explanation                    url                           
  <chr>    <chr>                           <chr>                          <chr>                         
1 Jan. 21I wasn't a fan of Iraq. I didn… He was for an invasion before… https://www.buzzfeed.com/andr…
2 Jan. 21… A reporter for Time magazine —… Trump was on the cover 11 tim… http://nation.time.com/2013/1…
3 Jan. 23… Between 3 million and 5 millio… There's no evidence of illegahttps://www.nytimes.com/2017/4 Jan. 25Now, the audience was the biggOfficial aerial photos show Ohttps://www.nytimes.com/2017/5 Jan. 25Take a look at the Pew reportsThe report never mentioned vohttps://www.nytimes.com/2017/6 Jan. 25You had millions of people thaThe real number is less thanhttps://www.nytimes.com/2017/

Analysing our dataset

When did he lie?

records %>%
  group_by(date_clean) %>% 
  mutate(month = month(date_clean)) %>%
  ungroup() %>% group_by(month) %>%
  summarise(count = n())

# A tibble: 11 x 2
   month count
   <dbl> <int>
 1     1    12
 2     2    29
 3     3    15
 4     4    23
 5     5    14
 6     6    18
 7     7    15
 8     8    10
 9     9    13
10    10    28
11    11     3

Which source is quoted most?

install.packages('urltools')
library(urltools)

suffix_extract(domain(records$url)) %>%
  select(domain) %>%
  group_by(domain) %>%
  mutate(count = n()) %>% 
  arrange(desc(count)) %>% distinct() %>% head(10) %>%
  ggplot(aes(x = domain, y = count)) + geom_bar(stat= "identity") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

figures/trump01.png