---
engine: knitr
---
# APIs, scraping, and parsing {#sec-gather-data}
::: {.callout-note}
Chapman and Hall/CRC published this book in July 2023. You can purchase that [here](https://www.routledge.com/Telling-Stories-with-Data-With-Applications-in-R/Alexander/p/book/9781032134772). This online version has some updates to what was printed.
:::
**Prerequisites**
- Read *Turning History into Data: Data Collection, Measurement, and Inference in HPE*, [@cirone]
- This paper discusses some of the challenges of creating datasets.
- Read *Two Regimes of Prison Data Collection*, [@Johnson2021Two]
- This paper compares data about prisons from the United States government with data from incarcerated people and the community.
- Read *Atlas of AI*, [@crawford]
- Focus on Chapter 3 "Data", which discusses the importance of understanding the sources of data.
**Key concepts and skills**
- Sometimes data are available but they are not necessarily put together for the purposes of being a dataset. We must gather the data.
- It can be cumbersome and annoying to clean and prepare the datasets that come from these unstructured sources but the resulting structured, tidy data are often especially exciting and useful.
- We can gather data from a variety of sources. This includes APIs, both directly, which may involve semi-structured data, and indirectly through `R` packages. We can also gather data through reasonable and ethical web scraping. Finally, we may wish to gather data from PDFs.
**Software and packages**
- Base `R` [@citeR]
- `babynames` [@citebabynames]
- `gh` [@gh]
- `here` [@here]
- `httr` [@citehttr]
- `janitor` [@janitor]
- `jsonlite` [@jsonlite]
- `lubridate` [@GrolemundWickham2011]
- `pdftools` [@pdftools]
- `purrr` [@citepurrr]
- `rvest` [@citervest]
- `spotifyr` [@spotifyr] (this package is not on CRAN so install it with: `install.packages("devtools")` then `devtools::install_github("charlie86/spotifyr")`)
- `tesseract` [@citetesseract]
- `tidyverse` [@tidyverse]
- `tinytable` [@tinytable]
- `usethis` [@usethis]
- `xml2` [@xml2]
```{r}
#| message: false
#| warning: false
library(babynames)
library(gh)
library(here)
library(httr)
library(janitor)
library(jsonlite)
library(lubridate)
library(pdftools)
library(purrr)
library(rvest)
library(spotifyr)
library(tesseract)
library(tidyverse)
library(tinytable)
library(usethis)
library(xml2)
```
## Introduction
In this chapter we consider data that we must gather ourselves. This means that although the observations exist, we must parse, pull, clean, and prepare them to get the dataset that we will consider. In contrast to farmed data, discussed in @sec-farm-data, often these observations are not being made available for the purpose of analysis. This means that we need to be especially concerned with documentation, inclusion and exclusion decisions, missing data, and ethical behavior.
As an example of such a dataset, consider @Cummins2022 who create a dataset using individual-level probate records from England between 1892 and 1992. They find that about one-third of the inheritance of "elites" is concealed.\index{England!inheritance} Similarly, @Taflaga2019 construct a systematic dataset of job responsibilities based on Australian Ministerial telephone directories. They find substantial differences by gender. Neither wills nor telephone directories were created for the purpose of being included in a dataset. But with a respectful approach they enable insights that we could not get by other means. We term this "data gathering"---the data exist but we need to get them.
Decisions need to be made at the start of a project about the values we want the project to have. For instance, @huggingfaceethics value transparency, reproducibility, fairness, being self-critical, and giving credit. How might that affect the project?\index{ethics!values} Valuing "giving credit" might mean being especially zealous about attribution and licensing. In the case of gathered data we should give special thought to this as the original, unedited data may not be ours.
The results of a data science workflow cannot be better than their underlying data [@bailey2008design]. Even the most-sophisticated statistical analysis will struggle to adjust for poorly-gathered data. This means when working in a team, data gathering should be overseen and at least partially conducted by senior members of the team. And when working by yourself, try to give special consideration and care to this stage.
In this chapter we go through a variety of approaches for gathering data. We begin with the use of APIs and semi-structured data, such as JSON and XML. Using an API is typically a situation in which the data provider has specified the conditions under which they are comfortable providing access. An API allows us to write code to gather data. This is valuable because it can be efficient and scales well. Developing comfort with gathering data through APIs enables access to exciting datasets. For instance, @facebookapitrump use the Facebook Political Ad API to gather 218,100 of the Trump 2020 campaign ads to better understand the campaign.
We then turn to web scraping, which we may want to use when there are data available on a website. As these data have typically not been put together for the purposes of being a dataset, it is especially important to have deliberate and definite values for the project. Scraping is a critical part of data gathering because there are many data sources where the priorities of the data provider mean they have not implemented an API. For instance, considerable use of web scraping was critical for creating COVID-19 dashboards in the early days of the pandemic [@scrapecoviddata].
Finally, we consider gathering data from PDFs. This enables the construction of interesting datasets, especially those contained in government reports and old books. Indeed, while freedom of information legislation exists in many countries and requires governments to make data available, all too often the result is a spreadsheet shared as a PDF, even when the data were a CSV to begin with.
Gathering data can require more of us than using farmed data, but it allows us to explore datasets and answer questions that we could not otherwise. Some of the most exciting work in the world uses gathered data, but it is especially important that we approach it with respect.
## APIs
In everyday language, and for our purposes, an Application Programming Interface (API) is a situation in which someone has set up specific files on their computer such that we can follow their instructions to get them.\index{API} For instance, when we use a gif on Slack, one way it could work in the background is that Slack asks Giphy's server for the appropriate gif, Giphy's server gives that gif to Slack, and then Slack inserts it into the chat. The way in which Slack and Giphy interact is determined by Giphy's API. More strictly, an API is an application that runs on a server that we access using the HTTP protocol.
Here we focus on using APIs for gathering data. In that context an API is a website that is set up for another computer to be able to access it, rather than a person. For instance, we could go to [Google Maps](https://www.google.com/maps).\index{Google!Maps} And we could then scroll and click and drag to center the map on, say, Canberra, Australia.\index{Australia!Canberra} Or we could paste [this link](https://www.google.com/maps/@-35.2812958,149.1248113,16z) into the browser. By pasting that link, rather than navigating, we have mimicked how we will use an API: provide a URL and be given something back. In this case the result should be a map like @fig-focuson2020.\index{maps!API}
![Example of an API response from Google Maps, as of 12 February 2023](figures/07-googlemaps_canberra.png){#fig-focuson2020 width=75% fig-align="center"}
The advantage of using an API is that the data provider usually specifies the data that they are willing to provide, and the terms under which they will provide it. These terms may include aspects such as rate limits (i.e. how often we can ask for data), and what we can do with the data; for instance, we might not be allowed to use it for commercial purposes, or to republish it. As the API is being provided specifically for us to use, it is less likely to be subject to unexpected changes or legal issues. Because of this it is clear that when an API is available, we should try to use it rather than web scraping.\index{API}
We will now go through a few case studies of using APIs. In the first we deal directly with an API using `httr`. And then we access data from Spotify using `spotifyr`.
### arXiv, NASA, and Dataverse
After installing and loading `httr` we use `GET()` to obtain data from an API directly. This will try to get some specific data and the main argument is "url". This is similar to the Google Maps example in @fig-focuson2020 where the specific information that we were interested in was a map.
#### arXiv
In this case study we will use an [API provided by arXiv](https://arxiv.org/help/api/).\index{API!arXiv} arXiv is an online repository for academic papers before they go through peer review. These papers are typically referred to as "pre-prints".\index{arXiv} We use `GET()` to ask arXiv to obtain some information about a pre-print by providing a URL.
```{r}
#| message: false
#| warning: false
#| eval: false
arxiv <- GET("http://export.arxiv.org/api/query?id_list=2310.01402")
status_code(arxiv)
```
We can use `status_code()` to check our response. For instance, 200 means a success, while 400 means we received an error from the server. Assuming we received something back from the server, we can use `content()` to display it. In this case we have received XML formatted data. XML is a markup language where entries are identified by tags, which can be nested within other tags.\index{XML} After installing and loading `xml2` we can read XML using `read_xml()`. XML is semi-structured, and it can be useful to start by having a look at it using `html_structure()`.
::: {.content-visible when-format="pdf"}
```{r}
#| eval: false
#| echo: true
content(arxiv) |>
read_xml() |>
html_structure()
```
:::
::: {.content-visible unless-format="pdf"}
```{r}
#| eval: false
content(arxiv) |>
read_xml() |>
html_structure()
```
:::
We might like to create a dataset based on extracting various aspects of this XML tree.\index{XML} For instance, we might look at "entry", which is the eighth item, and in particular obtain the "title" and the "URL", which are the fourth and ninth items, respectively, within "entry".
::: {.content-visible when-format="pdf"}
```{r}
#| eval: false
#| echo: true
data_from_arxiv <-
tibble(
title = content(arxiv) |>
read_xml() |>
xml_child(search = 8) |>
xml_child(search = 4) |>
xml_text(),
link = content(arxiv) |>
read_xml() |>
xml_child(search = 8) |>
xml_child(search = 9) |>
xml_attr("href")
)
```
:::
::: {.content-visible unless-format="pdf"}
```{r}
#| eval: false
data_from_arxiv <-
tibble(
title = content(arxiv) |>
read_xml() |>
xml_child(search = 8) |>
xml_child(search = 4) |>
xml_text(),
link = content(arxiv) |>
read_xml() |>
xml_child(search = 8) |>
xml_child(search = 9) |>
xml_attr("href")
)
data_from_arxiv
```
:::
#### NASA Astronomy Picture of the Day
To consider another example, each day, NASA provides the Astronomy Picture of the Day (APOD) through its [APOD API](https://api.nasa.gov).\index{API!APOD}\index{astronomy} We can use `GET()` to obtain the URL for the photo on particular dates and then display it.
```{r}
NASA_APOD_20190719 <-
GET("https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY&date=2019-07-19")
```
Examining the returned data using `content()`, we can see that we are provided with various fields, such as date, title, explanation, and a URL.
::: {.content-visible when-format="pdf"}
For reasons of space the output is withheld here, but it can be seen on the free, online version of this [book](https://tellingstorieswithdata.com).
```{r}
#| eval: false
#| echo: true
# APOD July 19, 2019
content(NASA_APOD_20190719)$date
content(NASA_APOD_20190719)$title
content(NASA_APOD_20190719)$explanation
content(NASA_APOD_20190719)$url
```
:::
::: {.content-visible unless-format="pdf"}
```{r}
# APOD July 19, 2019
content(NASA_APOD_20190719)$date
content(NASA_APOD_20190719)$title
content(NASA_APOD_20190719)$explanation
content(NASA_APOD_20190719)$url
```
We can provide that URL to `include_graphics()` from `knitr` to display it (@fig-nasaone).
::: {#fig-nasaone}
![Tranquility Base Panorama (Image Credit: Neil Armstrong, Apollo 11, NASA)](figures/apollo11TranquilitybasePan.jpg){#fig-nasamoon}
Images obtained from the NASA APOD API
:::
:::
#### Dataverse
Finally, another common API response in semi-structured form is JSON.\index{JSON} JSON is a human-readable way to store data that can be parsed by machines. In contrast to, say, a CSV, where we are used to rows and columns, JSON uses key-value pairs.
```json
{
"firstName": "Rohan",
"lastName": "Alexander",
"age": 36,
"favFoods": {
"first": "Pizza",
"second": "Bagels",
"third": null
}
}
```
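We can parse JSON with `fromJSON()` from `jsonlite`. As a minimal sketch, the fragment above could be stored as a string and parsed directly, after which the key-value pairs are available as a named list:

```{r}
#| eval: false
# The JSON fragment from above, stored as a string
example_json <- '
{
  "firstName": "Rohan",
  "lastName": "Alexander",
  "age": 36,
  "favFoods": {
    "first": "Pizza",
    "second": "Bagels",
    "third": null
  }
}'

parsed_example <- fromJSON(example_json)

# Values can be accessed by their keys
parsed_example$firstName
parsed_example$favFoods$second
```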
To consider a more realistic example, we use a "Dataverse", which is a web application that makes it easier to share datasets. We can use an API to query a demonstration dataverse. For instance, we might be interested in datasets related to politics.\index{API!Dataverse}
::: {.content-visible when-format="pdf"}
```{r}
#| message: false
#| warning: false
#| eval: false
#| echo: true
politics_datasets <-
fromJSON("https://demo.dataverse.org/api/search?q=politics")
```
:::
::: {.content-visible unless-format="pdf"}
```{r}
#| message: false
#| warning: false
#| eval: true
#| echo: true
politics_datasets <-
fromJSON("https://demo.dataverse.org/api/search?q=politics")
politics_datasets
```
:::
We could look at the dataset using `View(politics_datasets)`, which would allow us to expand the tree based on what we are interested in. We can even get the code that we need to focus on different aspects by hovering on the item and then clicking the icon with the green arrow (@fig-jsonfirst).
![Example of hovering over a JSON element, "items", where the icon with a green arrow can be clicked on to get the code that would focus on that element](figures/jsonlite.png){#fig-jsonfirst width=90% fig-align="center"}
This tells us how to obtain the dataset of interest.
```{r}
#| eval: false
#| echo: true
as_tibble(politics_datasets[["data"]][["items"]])
```
<!-- ### Gathering data from Twitter -->
<!-- Twitter is a rich source of text and other data and an extraordinary amount of academic research uses it [@twitterseemsokay]. The Twitter API is the way in which Twitter asks that we gather these data. `rtweet` is built around this API and allows us to interact with it in ways that are similar to using any other `R` package. (Another useful `R` package is `academictwitteR` [@academictwitteR], which only works with the full access academic-track API, but enables full archive searches.) Initially, we can use the Twitter API with just a regular Twitter account. -->
<!-- Begin by installing and loading `rtweet` and `tidyverse`. We then need to authorize `rtweet` and we start that process by calling any function from the package, for instance `get_favorites()` which would normally return a tibble of a user's favorites. When it is executed before authorization, this will open a browser, and we then log into a regular Twitter account (@fig-rtweetlogin). -->
<!-- ```{r} -->
<!-- #| warning: false -->
<!-- #| message: false -->
<!-- #| label: loadpackages -->
<!-- library(rtweet) -->
<!-- library(tidyverse) -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| eval: false -->
<!-- #| label: initialise_rtweet -->
<!-- get_favorites(user = "RohanAlexander") -->
<!-- ``` -->
<!-- ![rtweet authorisation page](figures/rtweet.png){#fig-rtweetlogin width=90% fig-align="center"} -->
<!-- Once the application is authorized, we can use `get_favorites()` to actually get the favorites of a user and save them. -->
<!-- ```{r} -->
<!-- #| label: get_rohan_favs -->
<!-- #| eval: false -->
<!-- rohans_favorites <- get_favorites("RohanAlexander") -->
<!-- saveRDS(rohans_favorites, "rohans_favorites.rds") -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| label: get_rohan_favsactual -->
<!-- #| include: false -->
<!-- #| eval: false -->
<!-- # INTERNAL -->
<!-- rohans_favorites <- get_favorites("RohanAlexander") -->
<!-- saveRDS(rohans_favorites, "inputs/data/rohans_favs.rds") -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| include: false -->
<!-- rohans_favorites <- readRDS(here::here("inputs/data/rohans_favs.rds")) -->
<!-- ``` -->
<!-- We could then look at some recent favorites, keeping in mind that they may be different depending on when they are being accessed. -->
<!-- ```{r} -->
<!-- #| label: look_at_rohans_favs -->
<!-- rohans_favorites |> -->
<!-- arrange(desc(created_at)) |> -->
<!-- slice(1:10) |> -->
<!-- select(screen_name, text) -->
<!-- ``` -->
<!-- We can use `search_tweets()` to search for tweets about a particular topic. For instance, we could look at tweets using a hashtag commonly associated with R: "#rstats". -->
<!-- ```{r} -->
<!-- #| label: get_rstats -->
<!-- #| eval: false -->
<!-- rstats_tweets <- search_tweets( -->
<!-- q = "#rstats", -->
<!-- include_rts = FALSE -->
<!-- ) -->
<!-- saveRDS(rstats_tweets, "rstats_tweets.rds") -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| label: get_rstatsactual -->
<!-- #| include: false -->
<!-- #| eval: false -->
<!-- # INTERNAL -->
<!-- rstats_tweets <- search_tweets( -->
<!-- q = "#rstats", -->
<!-- include_rts = FALSE -->
<!-- ) -->
<!-- saveRDS(rstats_tweets, "inputs/data/rstats_tweets.rds") -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| include: false -->
<!-- rstats_tweets <- readRDS(here::here("inputs/data/rstats_tweets.rds")) -->
<!-- ``` -->
<!-- ```{r} -->
<!-- #| label: look_at_rstats -->
<!-- rstats_tweets |> -->
<!-- select(screen_name, text) |> -->
<!-- head() -->
<!-- ``` -->
<!-- Other useful functions that can be used include `get_friends()` to get all the accounts that a user follows, and `get_timelines()` to get a user's recent tweets. Registering as a developer enables access to more Twitter API functionality. -->
<!-- When using APIs, even when they are wrapped in an `R` package, in this case `rtweet`, it is important to read the terms under which access is provided. The Twitter API documentation is surprisingly readable, and the [developer policy](https://developer.twitter.com/en/developer-terms/policy) is especially clear. To see how easy it is to violate the terms under which an API provider makes data available, consider that we saved the tweets that we downloaded. If we were to push these to GitHub, then it is possible that we may have accidentally stored sensitive information if there happened to be some in the tweets. Twitter is also explicit about asking those that use their API to be especially careful about sensitive information and not matching Twitter users with their off-Twitter identity. Again, the [documentation around these restricted uses](https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases) is clear and readable. -->
### Spotify
Sometimes there is an `R` package built around an API that allows us to interact with it in ways that are similar to what we have seen before. For instance, `spotifyr` is a wrapper around the Spotify API.\index{API!Spotify}\index{Spotify!API} When using APIs, even when they are wrapped in an `R` package, in this case `spotifyr`, it is important to read the terms under which access is provided.
To access the Spotify API, we need a [Spotify Developer Account](https://developer.spotify.com/dashboard/). This is free but will require logging in with a Spotify account and then accepting the Developer Terms (@fig-spotifyaccept).\index{Spotify}
![Spotify Developer Account Terms agreement page](figures/spotify.png){#fig-spotifyaccept width=90% fig-align="center"}
Continuing with the registration process, in our case, we "do not know" what we are building and so Spotify requires us to use a non-commercial agreement which is fine. To use the Spotify API we need a "Client ID" and a "Client Secret". These are things that we want to keep to ourselves because otherwise anyone with the details could use our developer account as though they were us. One way to keep these details secret with minimum hassle is to keep them in our "System Environment".\index{API!key storage} In this way, when we push to GitHub they should not be included. To do this we will load and use `usethis` to modify our System Environment. In particular, there is a file called ".Renviron" which we will open and then add our "Client ID" and "Client Secret".
```{r}
#| eval: false
edit_r_environ()
```
When we run `edit_r_environ()`, a ".Renviron" file will open and we can add our "Spotify Client ID" and "Client Secret". Use the same names, because `spotifyr` will look in our environment for keys with those specific names. Being careful to use single quotes is important here even though we normally use double quotes in this book.
```{r}
#| eval: false
SPOTIFY_CLIENT_ID = 'PUT_YOUR_CLIENT_ID_HERE'
SPOTIFY_CLIENT_SECRET = 'PUT_YOUR_SECRET_HERE'
```
Save the ".Renviron" file, and then restart R: "Session" $\rightarrow$ "Restart R". We can now use our "Spotify Client ID" and "Client Secret" as needed. And functions that require those details as arguments will work without them being explicitly specified again.
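As a quick check that the keys were picked up, without printing their values, we can confirm that the environment variables are no longer empty. The variable names are the ones `spotifyr` looks for; this is only a sanity check, not a required step.

```{r}
#| eval: false
# These should both be TRUE if ".Renviron" was saved and R restarted
Sys.getenv("SPOTIFY_CLIENT_ID") != ""
Sys.getenv("SPOTIFY_CLIENT_SECRET") != ""
```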
To try this out we install and load `spotifyr`. We will get and save some information about Radiohead, the English rock band, using `get_artist_audio_features()`. One of the required arguments is `authorization`, but as that is set, by default, to look at the ".Renviron" file, we do not need to specify it here.
```{r}
#| eval: false
radiohead <- get_artist_audio_features("radiohead")
saveRDS(radiohead, "radiohead.rds")
```
```{r}
#| include: false
#| eval: false
# INTERNAL
radiohead <- get_artist_audio_features("radiohead")
saveRDS(radiohead, "inputs/data/radiohead.rds")
```
```{r}
#| eval: true
#| include: false
radiohead <- readRDS("inputs/data/radiohead.rds")
```
```{r}
#| eval: false
#| include: true
radiohead <- readRDS("radiohead.rds")
```
There is a variety of information available based on songs. We might be interested to see whether their songs are getting longer over time (@fig-readioovertime). Following the guidance in @sec-static-communication this is a nice opportunity to additionally use a boxplot to communicate summary statistics by album at the same time.
```{r}
#| fig-cap: "Length of each Radiohead song, over time, as gathered from Spotify"
#| label: fig-readioovertime
#| message: false
#| warning: false
radiohead <- as_tibble(radiohead)
radiohead |>
mutate(album_release_date = ymd(album_release_date)) |>
ggplot(aes(
x = album_release_date,
y = duration_ms,
group = album_release_date
)) +
geom_boxplot() +
geom_jitter(alpha = 0.5, width = 0.3, height = 0) +
theme_minimal() +
labs(
x = "Album release date",
y = "Duration of song (ms)"
)
```
One interesting variable provided by Spotify about each song is "valence".\index{Spotify!valence} The Spotify [documentation](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features) describes this as a measure between zero and one that signals "the musical positiveness" of the track with higher values being more positive. We might be interested to compare valence over time between a few artists, for instance, Radiohead, the American rock band The National, and the American singer Taylor Swift.
First, we need to gather the data.
```{r}
#| eval: false
taylor_swift <- get_artist_audio_features("taylor swift")
the_national <- get_artist_audio_features("the national")
saveRDS(taylor_swift, "taylor_swift.rds")
saveRDS(the_national, "the_national.rds")
```
```{r}
#| include: false
#| eval: false
# INTERNAL
taylor_swift <- get_artist_audio_features("taylor swift")
the_national <- get_artist_audio_features("the national")
saveRDS(taylor_swift, "inputs/data/taylor_swift.rds")
saveRDS(the_national, "inputs/data/the_national.rds")
```
```{r}
#| eval: true
#| include: false
# INTERNAL
taylor_swift <- readRDS("inputs/data/taylor_swift.rds")
the_national <- readRDS("inputs/data/the_national.rds")
```
Then we can bring them together and make the graph (@fig-swiftyvsnationalvsradiohead). This appears to show that while Taylor Swift and Radiohead have largely maintained their level of valence over time, The National has decreased theirs.
```{r}
#| fig-cap: "Comparing valence, over time, for Radiohead, Taylor Swift, and The National"
#| label: fig-swiftyvsnationalvsradiohead
#| message: false
#| warning: false
#| fig-height: 6
rbind(taylor_swift, the_national, radiohead) |>
select(artist_name, album_release_date, valence) |>
mutate(album_release_date = ymd(album_release_date)) |>
ggplot(aes( x = album_release_date, y = valence, color = artist_name)) +
geom_point(alpha = 0.3) +
geom_smooth() +
theme_minimal() +
facet_wrap(facets = vars(artist_name), dir = "v") +
labs(
x = "Album release date",
y = "Valence",
color = "Artist"
) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
```
How amazing that we live in a world where all that information is available with very little effort or cost! And having gathered the data, there is much that could be done. For instance, @kaylinpavlik uses an expanded dataset to classify musical genres and @theeconomistonspotify looks at how language is associated with music streaming on Spotify. Our ability to gather such data enables us to answer questions that had to be considered experimentally in the past. For instance, @salganik2006experimental had to use experimental data to analyze the social aspect of what makes a hit song, rather than the observational data we can now access.
That said, it is worth thinking about what valence is purporting to measure. Little information is available in the Spotify documentation about how it was created. It is doubtful that one number can completely represent how positive a song is. And what about the songs from these artists that are not on Spotify, or even publicly released? This is a nice example of how measurement and sampling pervade all aspects of telling stories with data.\index{Spotify!measurement}
## Web scraping
### Principles
Web scraping is a way to get data from websites.\index{web scraping} Rather than going to a website using a browser and then saving a copy of it, we write code that does it for us. This opens considerable data to us, but on the other hand, these are not typically data that are being made available for these purposes. This means that it is especially important to be respectful. While generally not illegal, the specifics about the legality of web scraping depend on jurisdictions and what we are doing, and so it is also important to be mindful. Even if our use is not commercially competitive, of particular concern is the conflict between the need for our work to be reproducible and the need to respect terms of service that may disallow data republishing [@luscombe2021algorithmic].
Privacy often trumps reproducibility.\index{ethics!privacy vs reproducibility} There is also a considerable difference between data being publicly available on a website and being scraped, cleaned, and prepared into a dataset which is then publicly released. For instance, @kirkegaard2016okcupid scraped publicly available OKCupid\index{ethics!OKCupid} profiles and then made the resulting dataset easily available [@hackett2016researchers]. @zimmer2018addressing details some of the important considerations that were overlooked including "minimizing harm", "informed consent", and ensuring those in the dataset maintain "privacy and confidentiality".\index{ethics!context specific} While it is correct to say that OKCupid made data public, they did so in a certain context, and when their data was scraped that context was changed.
:::{.callout-note}
## Oh, you think we have good data on that!
Police violence is particularly concerning because of the need for trust between the police and society.\index{police violence!measurement of} Without good data it is difficult to hold police departments accountable, or know whether there is an issue, but getting data is difficult [@bronnerpolicenordata]. The fundamental problem is that there is no way to easily simplify an encounter that results in violence into a dataset. Two popular datasets draw on web scraping:
1) "Mapping Police Violence"; and
2) "Fatal Force Database".
@Bor2018 use "Mapping Police Violence" to examine police killings of Black Americans, especially when unarmed, and find a substantial effect on the mental health of Black Americans. Responses to the paper, such as @Nix2020, have special concern with the coding of the dataset, and after re-coding draw different conclusions. An example of a coding difference is the unanswerable question, because it depends on context and usage, of whether to code an individual who was killed with a toy firearm as "armed" or "unarmed". We may want a separate category, but some simplification is necessary for the construction of a quantitative dataset. *The Washington Post* writes many articles using the "Fatal Force Database" [@washpostfatalforce]. @washpostfatalforcemethods describes their methodology and the challenges of standardization. @Comer2022 compare the datasets and find similarities, but document ways in which the datasets are different.
:::
Web scraping is an invaluable source of data. But the resulting datasets are typically created as a by-product of someone trying to achieve another aim. And web scraping imposes a cost on the website host, so we should reduce this to the extent possible. For instance, a retailer may have a website with their products and their prices. That has not been created deliberately as a source of data, but we can scrape it to create a dataset. The following principles may be useful to guide web scraping.\index{web scraping!principles}
1. Avoid it. Try to use an API wherever possible.
2. Abide by their desires. Some websites have a "robots.txt" file that contains information about what they are comfortable with scrapers doing. In general, if it exists, a "robots.txt" file can be accessed by appending "robots.txt" to the base URL. For instance, the "robots.txt" file for https://www.google.com, can be accessed at https://www.google.com/robots.txt. Note if there are folders listed against "Disallow:". These are the folders that the website would not like to be scraped. And also note any instances of "Crawl-delay:". This is the number of seconds the website would like you to wait between visits.
3. Reduce the impact.
1. Slow down the scraper, for instance, rather than having it visit the website every second, slow it down using `Sys.sleep()` (see the sketch after this list). If you only need a few hundred files, then why not just have it visit the website a few times a minute, running in the background overnight?
2. Consider the timing of when you run the scraper. For instance, if you are scraping a retailer then maybe set the script to run from 10pm through to the morning, when fewer customers are likely using the site. Similarly, if it is a government website and they have a regular monthly release, then it might be polite to avoid that day.
4. Take only what is needed. For instance, you do not need to scrape the entirety of Wikipedia if all you need is the names of the ten largest cities in Croatia. This reduces the impact on the website, and allows us to more easily justify our actions.
5. Only scrape once. This means you should save everything as you go so that you do not have to re-collect data when the scraper inevitably fails at some point. For instance, you will typically spend considerable time getting a scraper working on one page, but typically the page structure will change at some point and the scraper will need to be updated. Once you have the data, you should save that original, unedited data separately to the modified data. If you need data over time then you will need to go back, but this is different than needlessly re-scraping a page.
6. Do not republish the pages that were scraped (this contrasts with datasets that you create from it).
7. Take ownership and ask permission if possible. At a minimum all scripts should have contact details in them. Depending on the circumstances, it may be worthwhile asking for permission before you scrape.
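As a minimal sketch of putting a few of these principles into practice, assuming a hypothetical base URL and page paths, we could inspect the "robots.txt" file before doing anything else, and then pause between visits while saving each page as we go:

```{r}
#| eval: false
# Hypothetical base URL and pages, used only for illustration
base_url <- "https://www.example.com"
pages_to_visit <- paste0(base_url, c("/page-1", "/page-2"))

# Check what the website is comfortable with before scraping
readLines(paste0(base_url, "/robots.txt"))

# Visit each page slowly, saving the raw HTML so we only scrape once
for (page in pages_to_visit) {
  Sys.sleep(5)
  raw_page <- read_html(page)
  write_html(raw_page, paste0(basename(page), ".html"))
}
```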
### HTML/CSS essentials
Web scraping is possible by taking advantage of the underlying structure of a webpage. We use patterns in the HTML/CSS to get the data that we want.\index{HTML!web scraping} To look at the underlying HTML/CSS we can either:
1) open a browser, right-click, and choose something like "Inspect"; or
2) save the website and then open it with a text editor rather than a browser.
HTML/CSS is a markup language based on matching tags. If we want text to be bold, then we would use something like:
```text
<b>My bold text</b>
```
Similarly, if we want a list, then we start and end the list as well as indicating each item.
```text
<ul>
<li>Learn webscraping</li>
<li>Do data science</li>
<li>Profit</li>
</ul>
```
When scraping we will search for these tags.
To get started, we can pretend that we obtained some HTML from a website, and that we want to get the name from it. We can see that the name is in bold, so we want to focus on that feature and extract it.
```{r}
#| eval: true
#| echo: true
website_extract <- "<p>Hi, I'm <b>Rohan</b> Alexander.</p>"
```
`rvest` is part of the `tidyverse` so it does not have to be installed, but it is not part of the core, so it does need to be loaded. After that, use `read_html()` to read in the data.
```{r}
#| eval: true
#| echo: true
#| warning: false
#| message: false
rohans_data <- read_html(website_extract)
rohans_data
```
The language used by `rvest` to look for tags is "node", so we focus on bold nodes. By default `html_elements()` returns the tags as well. We extract the text with `html_text()`.
```{r}
#| eval: true
#| echo: true
rohans_data |>
html_elements("b")
rohans_data |>
html_elements("b") |>
html_text()
```
Web scraping is an exciting source of data, and we will now go through some examples. But in contrast to these examples, information is not usually all on one page. Web scraping quickly becomes a difficult art form that requires practice. For instance, we distinguish between an index scrape and a contents scrape. The former is scraping to build the list of URLs that have the content you want, while the latter is to get the content from those URLs. An example is provided by @luscombe2022jumpstarting. If you end up doing much web scraping, then `polite` [@polite] may be helpful to better optimize your workflow. And GitHub Actions can be used to allow for larger and slower scrapes over time.
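As a minimal sketch of that two-step pattern, with a hypothetical index page and CSS selector, the index scrape builds the list of URLs and the contents scrape then visits each of them, slowly, to get the text:

```{r}
#| eval: false
# Hypothetical index page and selector, for illustration only
index_page <- read_html("https://www.example.com/archive")

# Index scrape: build the list of URLs that have the content we want
article_urls <-
  index_page |>
  html_elements("a.article-link") |>
  html_attr("href")

# Contents scrape: visit each URL politely and keep the paragraph text
article_text <-
  map(article_urls, \(url) {
    Sys.sleep(5)
    read_html(url) |>
      html_elements("p") |>
      html_text()
  })
```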
### Book information
In this case study we will scrape a list of books available [here](https://rohansbooks.com). We will then clean the data and look at the distribution of the first letters of author surnames. It is slightly more complicated than the example above, but the underlying workflow is the same: download the website, look for the nodes of interest, extract the information, and clean it.\index{web scraping!books example}
We use `rvest` to download a website, and to then navigate the HTML to find the aspects that we are interested in. And we use the `tidyverse` to clean the dataset. We first need to go to the website and then save a local copy.
```{r}
#| include: false
#| eval: false
#| echo: true
# INTERNAL
books_data <- read_html("https://rohansbooks.com")
write_html(books_data, "inputs/my_website/raw_data.html")
```
```{r}
#| echo: true
#| eval: false
#| include: true
books_data <- read_html("https://rohansbooks.com")
write_html(books_data, "raw_data.html")
```
We need to navigate the HTML to get the aspects that we want.\index{web scraping!HTML} And then try to get the data into a tibble as quickly as possible because this will allow us to more easily use `dplyr` verbs and other functions from the `tidyverse`.
::: {.content-visible when-format="pdf"}
See ["R essentials"](https://tellingstorieswithdata.com/20-r_essentials.html) if this is unfamiliar to you.
:::
::: {.content-visible unless-format="pdf"}
See [Online Appendix -@sec-r-essentials] if this is unfamiliar to you.
:::
```{r}
#| eval: false
#| echo: true
#| include: true
#| message: false
#| warning: false
books_data <- read_html("raw_data.html")
```
```{r}
#| eval: true
#| echo: false
#| include: true
#| message: false
#| warning: false
books_data <- read_html("inputs/my_website/raw_data.html")
```
```{r}
#| eval: true
#| include: true
#| message: false
#| warning: false
#| echo: true
books_data
```
To get the data into a tibble we first need to use HTML tags to identify the data that we are interested in. If we look at the website then we know we need to focus on list items (@fig-rohansbooks-display). And we can look at the source, focusing particularly on looking for a list (@fig-rohansbooks-html).
::: {#fig-rohansbooks layout-ncol=2}
![Books website as displayed](figures/books_display.png){#fig-rohansbooks-display width=50%}
![HTML for the top of the books website and the list of books](figures/books_source.png){#fig-rohansbooks-html width=50%}
Screen captures from the books website as at 16 June 2022
:::
The tag for a list item is "li", so we can use that to focus on the list.
```{r}
#| eval: true
#| include: true
#| echo: true
text_data <-
books_data |>
html_elements("li") |>
html_text()
all_books <-
tibble(books = text_data)
head(all_books)
```
We now need to clean the data. First we want to separate the title and the author using `separate()` and then clean up the author and title columns. We can take advantage of the fact that the year is present and separate based on that.
```{r}
#| eval: true
#| include: true
#| message: false
#| warning: false
#| echo: true
all_books <-
all_books |>
mutate(books = str_squish(books)) |>
separate(books, into = c("author", "title"), sep = "\\, [[:digit:]]{4}\\, ")
head(all_books)
```
Finally, we could make, say, a table of the distribution of the first letter of the names (@tbl-lettersofbooks).
```{r}
#| label: tbl-lettersofbooks
#| eval: true
#| echo: true
#| tbl-cap: "Distribution of first letter of author names in a collection of books"
all_books |>
mutate(
first_letter = str_sub(author, 1, 1)
) |>
count(first_letter) |>
tt() |>
style_tt(j = 1:2, align = "lr") |>
setNames(c("First letter", "Number of times"))
```
### Prime Ministers of the United Kingdom
In this case study we are interested in how long prime ministers\index{United Kingdom!prime ministers} of the United Kingdom lived, based on the year they were born. We will scrape data from Wikipedia\index{web scraping!Wikipedia} using `rvest`, clean it, and then make a graph.\index{Wikipedia!web scraping}\index{Wikipedia!UK prime ministers} From time to time a website will change. This makes many scrapes largely bespoke, even if we can borrow some code from earlier projects. It is normal to feel frustrated at times. It helps to begin with an end in mind.
To that end, we can start by generating some simulated data.\index{simulation!UK prime ministers} Ideally, we want a table that has a row for each prime minister, a column for their name, and a column each for the birth and death years. If they are still alive, then that death year can be empty. We know that birth and death years should be somewhere between 1700 and 1990, and that death year should be larger than birth year. Finally, we also know that the years should be integers, and the names should be characters. We want something that looks roughly like this:
```{r}
#| echo: true
#| message: false
#| warning: false
set.seed(853)
simulated_dataset <-
tibble(
prime_minister = babynames |>
filter(prop > 0.01) |>
distinct(name) |>
unlist() |>
sample(size = 10, replace = FALSE),
birth_year = sample(1700:1990, size = 10, replace = TRUE),
years_lived = sample(50:100, size = 10, replace = TRUE),
death_year = birth_year + years_lived
) |>
select(prime_minister, birth_year, death_year, years_lived) |>
arrange(birth_year)
simulated_dataset
```
One of the advantages of generating a simulated dataset is that if we are working in groups then one person can start making the graph, using the simulated dataset, while the other person gathers the data. In terms of a graph, we are aiming for something like @fig-pmsgraphexample.
![Sketch of planned graph showing how long United Kingdom prime ministers lived](figures/pms_graph_plan.png){#fig-pmsgraphexample width=75% fig-align="center"}
We are starting with a question that is of interest, which is how long each prime minister of the United Kingdom lived. As such, we need to identify a source of data. While there are plenty of data sources that have the births and deaths of each prime minister, we want one that we can trust, and as we are going to be scraping, we want one that has some structure to it. The [Wikipedia page about prime ministers of the United Kingdom](https://en.wikipedia.org/wiki/List_of_prime_ministers_of_the_United_Kingdom) fits both these criteria. As it is a popular page the information is likely to be correct, and the data are available in a table.
We load `rvest` and then download the page using `read_html()`. Saving it locally provides us with a copy that we need for reproducibility in case the website changes, and means that we do not have to keep visiting the website. But it is not ours, and so this is typically not something that should be publicly redistributed.
```{r}
#| echo: true
#| eval: false
#| include: false
# INTERNAL
raw_data <-
read_html(
"https://en.wikipedia.org/wiki/List_of_prime_ministers_of_the_United_Kingdom"
)
write_html(raw_data, "inputs/wiki/pms.html") # Note that we save the file as a HTML file.
```
```{r}
#| echo: true
#| eval: false
#| include: true
raw_data <-
read_html(
"https://en.wikipedia.org/wiki/List_of_prime_ministers_of_the_United_Kingdom"
)
write_html(raw_data, "pms.html")
```
As with the earlier case study, we are looking for patterns in the HTML that we can use to help us get closer to the data that we want. This is an iterative process and involves trial and error. Even simple examples will take time.
One tool that may help is the [SelectorGadget](https://rvest.tidyverse.org/articles/articles/selectorgadget.html). This allows us to pick and choose the elements that we want, and then gives us the input for `html_element()` (@fig-selectorgadget). By default, SelectorGadget\index{web scraping!SelectorGadget} uses CSS selectors. These are not the only way to specify the location of the information you want, and using an alternative, such as XPath, can be a useful option to consider.
![Using the Selector Gadget to identify the tag, as at 12 February 2023](figures/07-wikipedia_selectorgadget_screenshot.png){#fig-selectorgadget width=75% fig-align="center"}
```{r}
#| eval: false
#| include: true
raw_data <- read_html("pms.html")
```
```{r}
#| eval: true
#| include: false
raw_data <- read_html("inputs/wiki/pms.html")
raw_data
```
```{r}
#| eval: true
#| include: true
parse_data_selector_gadget <-
raw_data |>
html_element(".wikitable") |>
html_table()
head(parse_data_selector_gadget)
```
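The same table could also be located with an XPath expression rather than a CSS selector. As a sketch, and assuming the table keeps its "wikitable" class, something like the following should return the same result:

```{r}
#| eval: false
raw_data |>
  html_element(xpath = "//table[contains(@class, 'wikitable')]") |>
  html_table()
```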
In this case there are many columns that we do not need, and some duplicated rows.
```{r}
#| eval: true
#| include: true
parsed_data <-
parse_data_selector_gadget |>
clean_names() |>
rename(raw_text = prime_minister_office_lifespan) |>
select(raw_text) |>
filter(raw_text != "Prime ministerOffice(Lifespan)") |>
distinct()
head(parsed_data)
```
Now that we have the parsed data, we need to clean it to match what we wanted. We want a names column, as well as columns for birth year and death year. We use `separate()` to take advantage of the fact that it looks like the names and dates are distinguished by brackets. The argument in `str_extract()` is a regular expression. It looks for four digits in a row, followed by a dash, followed by four more digits in a row. We use a slightly different regular expression for those prime ministers who are still alive.
```{r}
#| eval: true
#| include: true
initial_clean <-
parsed_data |>
separate(
raw_text, into = c("name", "not_name"), sep = "\\[", extra = "merge",
) |>
mutate(date = str_extract(not_name, "[[:digit:]]{4}–[[:digit:]]{4}"),
born = str_extract(not_name, "born[[:space:]][[:digit:]]{4}")
) |>
select(name, date, born)
head(initial_clean)
```
Finally, we need to clean up the columns.
```{r}
#| message: false
#| warning: false
cleaned_data <-
initial_clean |>
separate(date, into = c("birth", "died"),
sep = "–") |> # PMs who have died have their birth and death years
# separated by a hyphen, but we need to be careful with the hyphen as it seems
# to be a slightly odd type of hyphen and we need to copy/paste it.
mutate(
born = str_remove_all(born, "born[[:space:]]"),
birth = if_else(!is.na(born), born, birth)
) |> # Alive PMs have slightly different format
select(-born) |>
rename(born = birth) |>
mutate(across(c(born, died), as.integer)) |>
mutate(Age_at_Death = died - born) |>
distinct() # Some of the PMs had two goes at it.
head(cleaned_data)
```
Our dataset looks similar to the one that we said we wanted at the start (@tbl-canadianpmscleanddata).
```{r}
#| echo: true
#| eval: true
#| label: tbl-canadianpmscleanddata
#| tbl-cap: "UK Prime Ministers, by how old they were when they died"
cleaned_data |>
head() |>
tt() |>
style_tt(j = 1:4, align = "lrrr") |>
setNames(c("Prime Minister", "Birth year", "Death year", "Age at death"))
```
At this point we would like to make a graph that illustrates how long each prime minister lived (@fig-pmslives). If they are still alive then we would like them to run to the end, but we would like to color them differently.
```{r}
#| echo: true
#| fig-height: 8
#| fig-cap: "How long each prime minister of the United Kingdom lived"
#| label: fig-pmslives
cleaned_data |>
mutate(
still_alive = if_else(is.na(died), "Yes", "No"),
died = if_else(is.na(died), as.integer(2023), died)
) |>
mutate(name = as_factor(name)) |>
ggplot(
aes(x = born, xend = died, y = name, yend = name, color = still_alive)
) +
geom_segment() +
labs(
x = "Year of birth", y = "Prime minister", color = "PM is currently alive"
) +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
```
### Iteration
Considering text as data is exciting and allows us to explore many different research questions.\index{text!gathering} We will draw on it in @sec-text-as-data. Many guides assume that we already have a nicely formatted text dataset, but that is rarely actually the case. In this case study we will download files from a few different pages. While we have already seen two examples of web scraping, those were focused on just one page, whereas we often need many. Here we will focus on this iteration. We will use `download.file()` to do the download, and use `purrr` to apply this function across multiple sites.\index{web scraping!multiple files} You do not need to install or load that package because it is part of the core `tidyverse` so it is loaded when you load the `tidyverse`.
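As a minimal sketch of the pattern, with hypothetical URLs and file names, we might put the pairs in a tibble and then use `walk2()` from `purrr` to call `download.file()` on each pair:

```{r}
#| eval: false
# Hypothetical URLs and local file names, used only for illustration
files_to_download <-
  tibble(
    url = c(
      "https://www.example.com/report-feb-2023.pdf",
      "https://www.example.com/report-may-2023.pdf"
    ),
    local_file = c("report-feb-2023.pdf", "report-may-2023.pdf")
  )

# Download each file, writing in binary mode because these are PDFs
walk2(
  files_to_download$url,
  files_to_download$local_file,
  \(url, file) download.file(url, destfile = file, mode = "wb")
)
```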
The Reserve Bank of Australia (RBA) is Australia's central bank.\index{Australia!Reserve Bank of Australia} It has responsibility for setting the cash rate, which is the interest rate used for loans between banks. This interest rate is an especially important one and has a large impact on the other interest rates in the economy. Four times a year---February, May, August, and November---the RBA publishes a statement on monetary policy, and these are available as PDFs. In this example we will download two statements published in 2023.
First we set up a tibble that has the information that we need. We will take advantage of commonalities in the structure of the URLs. We need to specify both a URL and a local file name for each statement.
```{r}
#| eval: true
#| message: false