We’ll use this opinion piece from the New York Times (last updated more than a year ago, sadly).
Several data points in this piece:
- The article is a (long) list,
- Each list item is a lie,
- Each lie is made up of a date, a statement, an explanation of why the statement is untrue, and a supporting link
Scraping is as much about creating structure out of the data as it is about downloading it from a webpage.
Conceptually, we want to achieve something like this:
date | statement | counter statement | supporting link
Jan 21 | I didn’t want to go into Iraq | He was for an invasion | https://something
We’ll use the rvest package:
library(rvest)
page <- read_html("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html")
page
{xml_document}
<html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope="" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script>\nif (docume ...
[2] <body>\n<style>\n.lt-ie10 .messenger.suggestions {\n display: block !important;\n height: 50px; ...
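As a quick sanity check (a minimal sketch; the exact title text depends on the live page), we can pull out the contents of the <title> tag:
page %>%
  html_node("title") %>%
  html_text()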
The DOM is a hierarchy (a tree) of JavaScript node objects:
<body>
<article>
<h1>This is a simple title</h1>
<h1 id="special">This is another title...
<em>with a twist!</em>
</h1>
</article>
</body>
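To make the tree idea concrete, here is a minimal sketch that parses the snippet above from a string (read_html() accepts inline HTML) and walks down the tree:
doc <- read_html('
  <body>
    <article>
      <h1>This is a simple title</h1>
      <h1 id="special">This is another title... <em>with a twist!</em></h1>
    </article>
  </body>')

doc %>% html_nodes("h1")    # both <h1> nodes
doc %>%
  html_node("#special") %>% # target the node by its id...
  html_node("em") %>%       # ...then descend to its <em> child
  html_text()               # "with a twist!"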
HTML is composed of tags: <tag>
This is what each lie in the big <ol> (for Ordered List) looks like:
<span class="short-desc">
<strong> DATE </strong> LIE
<span class="short-truth">
<a href="URL"> EXPLANATION </a>
</span>
</span>
See this <span class="short-desc">?
We can target the class to select all tags/nodes that carry this class:
lie_containers <- page %>% html_nodes(".short-desc")
lie_containers %>% head()
{xml_nodeset (6)}
[1] <span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq...
[2] <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time ma...
[3] <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and ...
We end up with a list of 180 items.
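You can check this yourself (assuming the page has not changed since this was written):
length(lie_containers)
[1] 180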
Dates are stored here:
<strong> DATE </strong>
As before, we can use html_nodes():
lie_containers %>%
  html_nodes("strong")
{xml_nodeset (180)}
[1] <strong>Jan. 21 </strong>
[2] <strong>Jan. 21 </strong>
[3] <strong>Jan. 23 </strong>
Obviously we don’t want to keep the tags.
We also want to trim the rogue spaces at the beginning and end of each string:
lie_containers %>%
  html_nodes("strong") %>%
  html_text(trim = TRUE)
[1] "Jan. 21" "Jan. 21" "Jan. 23" "Jan. 25" "Jan. 25"
Getting the lie itself is a bit more difficult. Here’s our structure again:
<span class="short-desc">
<strong> DATE </strong> LIE
<span class="short-truth">
<a href="URL"> EXPLANATION </a>
</span>
</span>
LIE is not within any tag…
Fortunately, thanks to some magic (and because the page is well structured), we can use a trick:
xml_contents(lie_containers[1])
{xml_nodeset (3)}
[1] <strong>Jan. 21 </strong>
[2] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
[3] <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump- ...
Running xml_contents() on a list element gives us yet another list; it contains three elements, the second one being our lie. Thus:
xml_contents(lie_containers[1])[2]
{xml_nodeset (1)}
[1] “I wasn't a fan of Iraq. I didn't want to go into Iraq.”
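As a side note, we are not limited to one element at a time: a possible sketch to pull all the lies in one go is to apply the same trick over the whole nodeset with sapply():
lies <- sapply(lie_containers, function(node) {
  xml_contents(node)[2] %>%   # the second child is the bare text of the lie
    html_text(trim = TRUE)
})
(We will instead do this inside a for loop below, which handles all four fields at once.)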
The explanation is easy to get because it is contained in the span carrying the .short-truth class:
lie_containers %>%
  html_nodes(".short-truth") %>%
  html_text(trim = TRUE)
[1] "(He was for an invasion before he was against it.)"
[2] "(Trump was on the cover 11 times and Nixon appeared 55 times.)"
[3] "(There's no evidence of illegal voting.)"
The supporting link is contained in an <a> tag: it is a hyperlink.
Here’s what a link looks like in HTML:
<a href="https://website.com">text of the link</a>
We extract just the link (the href attribute) like so:
lie_containers %>%
  html_nodes("a") %>%
  html_attr("href")
[1] "https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the"
[2] "http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/"
[3] "https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html"
Our strings aren’t very clean. We can use the stringr package to make them nicer to work with: str_c() concatenates strings, and str_sub() slices them (here, dropping the first and last characters).
library(stringr)
str_c("21 Feb", ", 2017")
[1] "21 Feb, 2017"
str_sub("'quote marks'", 2, -2)
[1] "quote marks"
A programming pattern you will see regularly is the following:
- We create an empty variable: our final dataset, empty for now
- We have a list of things we need to do some work on
- We iterate over each item in this list, building a temporary store
- At each iteration, we append the temporary store to our main dataset
When we’re finished running over each element in our main list, we end up with a complete main dataset!
The tool for the job is the for loop:
words <- c('one', 'two', 'basile')
for (word in words) {
  print(word)
}
[1] "one"
[1] "two"
[1] "basile"
Conceptually, we’ll do this:
# our main dataset
records <- vector("list")

for (lie in lie_containers) {
  # store the date
  # store the lie
  # store the explanation
  # store the url

  # our temporary data store with all the right columns
  df <- data_frame(date, lie, explanation, url)

  # send our temporary data store back up
  # to populate our main dataset
  records <- rbind(records, df)
}
records <- vector("list")

for (i in seq_along(lie_containers)) {
  # the date lives in the <strong> tag
  date <- lie_containers[i] %>%
    html_nodes("strong") %>%
    html_text(trim = TRUE)
  # the lie is the bare text: the second child of the container
  lie <- xml_contents(lie_containers[i])[2] %>%
    html_text(trim = TRUE)
  # the explanation lives in the .short-truth span
  explanation <- lie_containers[i] %>%
    html_nodes(".short-truth") %>%
    html_text(trim = TRUE)
  # the supporting link is the href attribute of the <a> tag
  url <- lie_containers[i] %>%
    html_nodes("a") %>%
    html_attr("href")

  df <- data_frame(date = str_c(date, ", 2017"),
                   lie = str_sub(lie, 2, -2),                 # drop the curly quotes
                   explanation = str_sub(explanation, 2, -2), # drop the parentheses
                   url = url)

  records <- rbind(records, df)
}
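A side note on this design: rbind() copies the whole dataset at every iteration, which is fine for 180 rows but slow for large scrapes. An alternative sketch, producing the same records, builds one small data frame per container with lapply() and binds them once at the end:
rows <- lapply(lie_containers, function(node) {
  data_frame(
    date        = node %>% html_nodes("strong") %>% html_text(trim = TRUE),
    lie         = xml_contents(node)[2] %>% html_text(trim = TRUE),
    explanation = node %>% html_nodes(".short-truth") %>% html_text(trim = TRUE),
    url         = node %>% html_nodes("a") %>% html_attr("href")
  )
})
records <- bind_rows(rows) %>%
  mutate(date        = str_c(date, ", 2017"),
         lie         = str_sub(lie, 2, -2),
         explanation = str_sub(explanation, 2, -2))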
records %>% head()
# A tibble: 6 x 4
date lie explanation url
<chr> <chr> <chr> <chr>
1 Jan. 21… I wasn't a fan of Iraq. I didn… He was for an invasion before… https://www.buzzfeed.com/andr…
2 Jan. 21… A reporter for Time magazine —… Trump was on the cover 11 tim… http://nation.time.com/2013/1…
3 Jan. 23… Between 3 million and 5 millio… There's no evidence of illega… https://www.nytimes.com/2017/…
4 Jan. 25… Now, the audience was the bigg… Official aerial photos show O… https://www.nytimes.com/2017/…
5 Jan. 25… Take a look at the Pew reports… The report never mentioned vo… https://www.nytimes.com/2017/…
6 Jan. 25… You had millions of people tha… The real number is less than … https://www.nytimes.com/2017/…
How are the lies distributed over the months of 2017? The date column is still a string, so we first parse it into a proper date (date_clean) with the lubridate package, then count per month:
library(lubridate)

records %>%
  mutate(date_clean = mdy(date),
         month = month(date_clean)) %>%
  group_by(month) %>%
  summarise(count = n())
# A tibble: 11 x 2
month count
<dbl> <int>
1 1 12
2 2 29
3 3 15
4 4 23
5 5 14
6 6 18
7 7 15
8 8 10
9 9 13
10 10 28
11 11 3
Finally, which sources are cited most often in the supporting links? The urltools package can extract the domain from each URL:
install.packages('urltools')
library(urltools)
library(ggplot2)

suffix_extract(domain(records$url)) %>%
  select(domain) %>%
  group_by(domain) %>%
  mutate(count = n()) %>%
  arrange(desc(count)) %>%
  distinct() %>%
  head(10) %>%
  ggplot(aes(x = domain, y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))