---
title: "Open Case Studies: Exploring CO2 emissions across time"
css: style.css
output:
  html_document:
    includes:
      in_header: GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---
<style>
#TOC {
background: url("https://opencasestudies.github.io/img/icon-bahi.png");
background-size: contain;
padding-top: 240px !important;
background-repeat: no-repeat;
}
</style>
<!-- Open all links in new tab-->
<base target="_blank"/>
<div id="google_translate_element"></div>
<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>
<script type="text/javascript">
function googleTranslateElementInit() {
new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = FALSE,
fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)
rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```
#### {.outline }
```{r, echo = FALSE, out.width = "800 px", dpi=300}
knitr::include_graphics(here::here("img", "mainplot.png"))
```
####
#### {.disclaimer_block}
**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.
####
#### {.license_block}
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.
####
#### {.reference_block}
To cite this case study please use:
Wright, Carrie and Ontiveros, Michael and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-co2-emissions. Exploring CO2 emissions across time (Version v1.0.0).
####
To access the GitHub repository for this case study see here: https://github.com//opencasestudies/ocs-bp-co2-emissions.
You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:
```{r, eval=FALSE}
install.packages("OCSdata")
```
This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).
***
The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below:
```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```
***
**Readability Score:**
A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.
```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>% # reading table into dataframe, selecting relevant factors
tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
age = round(as.numeric(age))
)
knitr::kable(df, format="markdown")
```
***
Please help us by filling out our survey.
<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>
# **Motivation**
***
This case study explores how different countries have contributed to Carbon Dioxide (CO2) emissions over time and how CO2 emission rates may relate to increasing global temperatures and increased rates of natural disasters and storms.
We used this [report from the EPA](https://www.epa.gov/report-environment/greenhouse-gases){target="_blank"} as the basis for motivating this case study, as it provides background information about how CO2 emissions and other greenhouse gases have influenced the climate and weather patterns.
CO2 makes up the largest proportion of greenhouse gas emissions in the United States:
```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "emissions.jpg"))
```
##### [[source]](https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks){target="_blank"}
A variety of sources and sectors contribute to greenhouse gas emissions:
```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "sector.png"))
```
##### [[source]](https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks){target="_blank"}
Transportation and Electricity contribute the most metric tons of CO2:
```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "sources_pie.jpg"))
```
##### [[source]](https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks){target="_blank"}
So why should we pay attention to greenhouse gases?
According to the [US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}:
> Greenhouse gases absorb infrared radiation, thereby trapping heat in the atmosphere and making the planet warmer. The most important greenhouse gases directly emitted by humans include carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), and several fluorine-containing halogenated substances. Although CO2, CH4, and N2O occur naturally in the atmosphere, human activities have changed their atmospheric concentrations. From the pre-industrial era (i.e., ending about 1750) to 2018, concentrations of these greenhouse gases have increased globally by 46, 165, and 23 percent, respectively (IPCC 2013; NOAA/ESRL 2019a, 2019b, 2019c).
\* IPCC stands for the Intergovernmental Panel on Climate Change
In fact, there are many signs that our planet is experiencing warmer temperatures:
```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "warming.png"))
```
##### [[source]](https://data.globalchange.gov/report/nca3-overview){target="_blank"}
The connection between greenhouse gas levels and global temperatures, and the influence of increased global temperatures on human health, are described in these reports, which motivated this case study:
#### {.reference_block}
- Melillo, J.M., T.C. Richmond, and G.W. Yohe (eds.). 2014. Climate change impacts in the United States: The third National Climate Assessment. U.S. Global Change Research Program.
- U.S. Environmental Protection Agency. 2020. “Inventory of US Greenhouse Gas Emissions and Sinks: 1990--2018.” EPA 430-R-20-002, Tech. Rep. https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks.
####
The [National Climate Assessment Report](https://data.globalchange.gov/report/nca3-overview){target="_blank"} states that:
> Heat-trapping gases already in the atmosphere have committed us to a hotter future with more climate-related impacts over the next few decades. The magnitude of climate change beyond the next few decades depends primarily on the amount of heat-trapping gases that human activities emit globally, now and in the future.
See the following links for more information about how greenhouse gases have influenced global temperatures:
1) The EPA [report](https://www.epa.gov/report-environment/greenhouse-gases){target="_blank"} on greenhouse gases
2) The National Climate Assessment (NCA) [summary from 2014](https://nca2014.globalchange.gov/){target="_blank"}
3) The [World101 website](https://world101.cfr.org/global-era-issues/climate-change/climate-change-adaptations){target="_blank"} about how countries are adapting to climate change
# **Main Questions**
***
#### {.main_question_block}
<b><u> Our main questions: </u></b>
1. How have global CO2 emission rates changed over time, particularly for the US, and how does the US compare to other countries?
2. Are CO2 emissions in the US, global temperatures, and natural disaster rates in the US associated?
####
# **Learning Objectives**
***
In this case study, we will explore CO2 emission data from around the world.
We will also focus on the US specifically to evaluate patterns of temperatures and natural disaster activity.
This case study will particularly focus on how to use different datasets that span different ranges of time, as well as how to create visualizations of patterns over time.
We will especially focus on using packages and functions from the [`tidyverse`](https://www.tidyverse.org/){target="_blank"}, such as `dplyr`, `tidyr`, and `ggplot2`.
The tidyverse is a collection of packages created by RStudio.
While some students may be familiar with previous R programming packages, these packages make data science in R especially legible and intuitive.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. How to import data from various types of Excel files and CSV files
2. How to apply action verbs in `dplyr` for data wrangling
3. How to pivot between "long" and "wide" datasets
4. How to join together multiple datasets using `dplyr`
5. How to create effective longitudinal data visualizations with `ggplot2`
6. How to add text, color, and labels to `ggplot2` plots
7. How to create faceted `ggplot2` plots
<u>**Statistical Learning Objectives:**</u>
1. Introduction to correlation coefficient as a summary statistic
2. Relationship between correlation and linear regression
3. Correlation is not causation
```{r, out.width = "20%", echo = FALSE, fig.align = "center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```
***
We will begin by loading the packages that we will need:
```{r}
library(here)
library(readxl)
library(readr)
library(dplyr)
library(magrittr)
library(stringr)
library(purrr)
library(tidyr)
library(forcats)
library(ggplot2)
library(directlabels)
library(ggrepel)
library(broom)
library(patchwork)
library(OCSdata)
```
<u>**Packages used in this case study:** </u>
Package | Use in this case study
---------- |-------------
[`here`](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[`readxl`](https://readxl.tidyverse.org/){target="_blank"} | to import the Excel file data
[`readr`](https://readr.tidyverse.org/){target="_blank"} | to import the csv file data
[`dplyr`](https://dplyr.tidyverse.org/){target="_blank"} | to view and wrangle the data, by modifying variables, renaming variables, selecting variables, creating variables, and arranging values within a variable
[`magrittr`](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use and reassign data objects using the `%<>%` pipe operator
[`stringr`](https://stringr.tidyverse.org/){target="_blank"} | to select only the first 4 characters of date data
[`purrr`](https://purrr.tidyverse.org/){target="_blank"} | to apply a function on a list of tibbles (tibbles are the tidyverse version of a data frame)
[`tidyr`](https://tidyr.tidyverse.org/){target="_blank"} | to drop rows with `NA` values from a tibble
[`forcats`](https://forcats.tidyverse.org/){target="_blank"} | to reorder the levels of a factor
[`ggplot2`](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations
[`directlabels`](http://directlabels.r-forge.r-project.org/docs/index.html){target="_blank"} | to add labels to plots easily
[`ggrepel`](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"} | to add labels that don't overlap to plots
[`broom`](https://www.tidyverse.org/blog/2018/07/broom-0-5-0/) | to make the output from statistical tests easier to work with
[`patchwork`](https://github.com/thomasp85/patchwork){target="_blank"} | to combine plots
[`OCSdata`](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files
The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
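
As a minimal sketch of the `::` notation, the following calls `filter()` explicitly from `dplyr`. The data frame `toy` and its values are made up for illustration and are not part of the case study data. Both `dplyr` and the `stats` package define a function named `filter()`, which is the kind of overlap where the prefix removes ambiguity:

```r
# Toy data frame invented for this illustration (not case study data)
toy <- data.frame(year  = c(1990, 2000, 2010),
                  value = c(1.5, 2.5, 3.5))

# Explicitly call filter() from dplyr rather than stats::filter()
recent <- dplyr::filter(toy, year >= 2000)

# `::` looks the function up in the dplyr namespace, so this works
# even if dplyr has not been attached with library()
nrow(recent)
```

Writing `package::function()` costs a few extra characters, but it documents where each function comes from.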
# **Context**
***
Now we will describe a bit more background about greenhouse gas emissions and the potential influence of these emissions on public health.
Greenhouse gas emissions are due to both natural processes and anthropogenic (human-derived) activities.
These emissions are one of the contributing factors to rising global temperatures, which can have a great influence on [public health](https://www.epa.gov/climate-indicators/understanding-connections-between-climate-change-and-human-health){target="_blank"} as illustrated in the following image:
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img", "climate_change_health_impacts.jpg"))
```
##### [[source]](https://www.cdc.gov/climateandhealth/effects/default.htm){target="_blank"}
According to the [US Environmental Protection Agency (EPA) Inventory of U.S. Greenhouse Gas Emissions and Sinks 2020 Report](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}:
> Gases in the atmosphere can contribute to climate change both directly and indirectly. Direct effects occur when the gas itself absorbs radiation. Indirect radiative forcing occurs when chemical transformations of the substance produce other greenhouse gases, when a gas influences the atmospheric lifetimes of other gases, and/or when a gas affects atmospheric processes that alter the radiative balance of the earth (e.g., affect cloud formation or [albedo](https://en.wikipedia.org/wiki/Albedo){target="_blank"}).
The **Global Warming Potential (GWP)** compares the **ability of a greenhouse gas to trap heat in the atmosphere relative to another gas**.
>The GWP of a greenhouse gas is defined as the ratio of the accumulated radiative forcing within a specific time horizon caused by emitting 1 kilogram of the gas, relative to that of the reference gas CO2 (IPCC 2013). Therefore GWP-weighted emissions are provided in million metric tons of CO2 equivalent (MMT CO2 Eq.)
##### [[source]](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}
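
In symbols, the definition quoted above can be sketched as follows. The notation is ours, introduced for illustration rather than taken from the report: $RF_x(t)$ denotes the radiative forcing at time $t$ after emitting 1 kilogram of gas $x$, $H$ is the time horizon, and $m_x$ is the emitted mass of gas $x$ in kilotonnes.

$$
\mathrm{GWP}_x(H) = \frac{\int_0^H RF_x(t)\,dt}{\int_0^H RF_{\mathrm{CO_2}}(t)\,dt}
$$

$$
\text{Emissions in MMT CO}_2\text{ Eq.} = \frac{m_x \times \mathrm{GWP}_x}{1000}
$$

By this definition the GWP of CO2 itself is 1, and dividing by 1000 converts kilotonnes to million metric tons.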
Per unit of mass, CO2 is actually the least heat-trapping of the greenhouse gases listed here:
```{r, echo = FALSE, out.width="800px"}
knitr::include_graphics(here::here("img", "GWP.png"))
```
##### [[source]](https://www.epa.gov/sites/production/files/2020-04/documents/us-ghg-inventory-2020-main-text.pdf){target="_blank"}
However, because CO2 is so much more abundant and stays in the atmosphere so much longer than other greenhouse gases, it has been the largest contributor to global warming.
See [here](https://www.ucsusa.org/resources/why-does-co2-get-more-attention-other-gases#:~:text=CO2%20sticks%20around,oxide%20(N2O)){target="_blank"} for more details.
It is also important to keep in mind that there is a [lag](https://earthobservatory.nasa.gov/blogs/climateqa/would-gw-stop-with-greenhouse-gases/) between greenhouse gas emissions and temperature changes that we experience because much of Earth's thermal energy (and CO2) gets stored in the ocean.
Because of the ocean's [thermal inertia](https://en.wikipedia.org/wiki/Volumetric_heat_capacity#Thermal_inertia), the heat stored in the ocean will eventually be transferred to the surface of the Earth long after the emission of the gases that caused the increased ocean temperature.
See [here](https://earthobservatory.nasa.gov/blogs/climateqa/would-gw-stop-with-greenhouse-gases/) for more explanation.
Furthermore, rising CO2 levels in the ocean also influence ocean acidity:
```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "oceans.png"))
```
##### [[source]](https://data.globalchange.gov/report/nca3-overview){target="_blank"}
As CO2 levels rise in the ocean, the pH decreases (the water becomes more acidic), which makes it difficult for organisms to maintain their shells or skeletons that are made of calcium carbonate. This makes it more difficult for these organisms to survive and affects their role in the ecosystem and food chain.
Furthermore, greenhouse gas emissions are believed to influence weather patterns as shown in this [report](https://data.globalchange.gov/report/nca3-overview){target="_blank"}.
Indeed, events with high levels of precipitation, which can induce flooding and property damage, are generally increasing around the country:
```{r, echo = FALSE, out.width="500px"}
knitr::include_graphics(here::here("img", "storms.png"))
```
##### [[source]](https://data.globalchange.gov/report/nca3-overview){target="_blank"}
# **Limitations**
***
An important limitation of this data analysis to keep in mind is that the datasets only include countries and years in which countries were reporting such information to the agencies that collected the data.
Thus, the data are incomplete.
For example, while we have a fairly good sense of CO2 emissions globally for later years, additional emissions were also produced by countries that are not included in the data.
# **What are the data?**
***
In this case study we will be using data related to CO2 emissions, as well as other data that may influence, be influenced by, or relate to CO2 emissions.
Most of our data are from [Gapminder](https://www.gapminder.org/data/){target="_blank"} and were originally obtained from the [World Bank](https://www.worldbank.org/en/what-we-do){target="_blank"}.
In addition, we will use some data that is specific to the United States from the [National Oceanic and Atmospheric Administration (NOAA)](https://www.noaa.gov/){target="_blank"}, which is an agency that collects weather and climate data.
Data | Time span | Source | Original Source | Description | Citation
-----------|---------------|-------------|-------------|----------------------------|--------
**CO2 emissions** |1751-2014 | [Gapminder](https://www.gapminder.org/data/){target="_blank"} | [Carbon Dioxide Information Analysis Center (CDIAC)](https://cdiac.ess-dive.lbl.gov/){target="_blank"} | CO2 emissions in tonnes or metric tons (equivalent to approximately 2,204.6 pounds) per person by country| NA
**GDP per capita (percent yearly growth)** | 1801-2019| [Gapminder](https://www.gapminder.org/data/){target="_blank"} | [World Bank](https://data.worldbank.org/indicator/NY.GDP.PCAP.KD.ZG){target="_blank"} | [Gross Domestic Product](https://www.investopedia.com/terms/g/gdp.asp#:~:text=Gross%20Domestic%20Product%20(GDP)%20is%20the%20monetary%20value%20of%20all,expenditures%2C%20production%2C%20or%20incomes.){target="_blank"} (which is an overall measure of the health of a nation's economy) per person by country| NA
**Energy use per person** |1960-2015 | [Gapminder](https://www.gapminder.org/data/){target="_blank"} | [World Bank](https://data.worldbank.org/indicator/EG.USE.PCAP.KG.OE){target="_blank"} | Use of primary energy before transformation to other end-use fuels, by country | NA
**US Natural Disasters** | 1980-2019 | [The National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/billions/time-series){target="_blank"}| [The National Oceanic and Atmospheric Administration (NOAA) ](https://www.ncdc.noaa.gov/billions/time-series){target="_blank"}| US data about: <br> -- Droughts <br> -- Floods <br> -- Freezes <br> -- Severe Storms <br> -- Tropical Cyclones <br> -- Wildfires<br> -- Winter Storms | NOAA National Centers for Environmental Information (NCEI) U.S. Billion-Dollar Weather and Climate Disasters (2020). https://www.ncdc.noaa.gov/billions/, DOI: 10.25921/stkw-7w73
**Temperature** | 1895-2019| [The National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/cag/national/time-series){target="_blank"} | [The National Oceanic and Atmospheric Administration (NOAA)](https://www.ncdc.noaa.gov/cag/national/time-series){target="_blank"} | US National yearly average temperature (in Fahrenheit) from 1895 to 2019 | NOAA National Centers for Environmental information, Climate at a Glance: National Time Series, published June 2020, retrieved on June 26, 2020 from https://www.ncdc.noaa.gov/cag/
To obtain the temperature data, the annual average temperatures were selected as shown in this image:
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "temp.png"))
```
##### [[source]](https://www.ncdc.noaa.gov/cag/national/time-series){target="_blank"}
Importantly, notice that the data we would like to use span different time periods:
Data | Time span
---------- |-------------
**CO2 emissions** |1751 to 2014
**GDP per capita (yearly growth)** | 1801 to 2019
**Energy use per person** |1960 to 2015
**US Natural Disasters** | 1980 to 2019
**Temperature** | 1895 to 2019
We will explore more about this a bit later.
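
As a small sketch of why these differing spans matter, the code below computes the window of years that all five datasets share, using only the time spans listed in the table above. The object names (`spans`, `overlap_start`, `overlap_end`) are our own. This is only arithmetic on the listed spans; whether every country has a value for every year inside that window is a separate question.

```r
# Time spans copied from the table above
spans <- data.frame(
  data  = c("CO2 emissions", "GDP per capita (yearly growth)",
            "Energy use per person", "US Natural Disasters", "Temperature"),
  start = c(1751, 1801, 1960, 1980, 1895),
  end   = c(2014, 2019, 2015, 2019, 2019)
)

# The shared window starts at the latest start year
# and ends at the earliest end year
overlap_start <- max(spans$start)
overlap_end   <- min(spans$end)

c(start = overlap_start, end = overlap_end)
```

The shared window runs from 1980 (when the US natural disaster data begin) to 2014 (when the CO2 emissions data end).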
#### {.think_question_block}
<b><u> Question Opportunity </u></b>
What concerns might arise about reliability and variation of measurement practices over time?
####
# **Data Import**
***
In our case, we downloaded the data for the files from the various sources as indicated in the table above and put them within a "raw" subdirectory of a "data" directory for our project. If you use an RStudio project, then you can use the `here()` function of the `here` package to make the path for importing this data simpler. The `here` package automatically starts looking for files based on where you have a `.Rproj` file which is created when you start a new RStudio project. We can specify that we want to look for the "yearly_co2_emissions_1000_tonnes.xlsx" file within the "raw" directory within the "data" directory within a directory where our `.Rproj` file is located by separating the names of these directories using commas and listing "data" first.
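
The path construction described above can be sketched as follows. The object name `co2_path` is our own, and the exact root that `here()` reports depends on where your project lives, so the printed path will differ between computers.

```r
# Build the path piece by piece, listing "data" first.
# here::here() prepends the project root that it detects
# (for example, the directory containing the .Rproj file).
co2_path <- here::here("data", "raw", "yearly_co2_emissions_1000_tonnes.xlsx")

# The result is a single character string holding the full path
co2_path

# file.exists() is a base R check that the file is where we expect it
file.exists(co2_path)
```

If `file.exists()` returns `FALSE`, the file is most likely in a different directory or the project root is not the one you expected.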
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.
</details>
***
To read in the files that were downloaded from the various sources as indicated in the table above, we will use the `read_xlsx()` and `read_xls()` functions of the `readxl` package to import the data from the `.xlsx` and `.xls` files, respectively. We will also use the `here()` function of the `here` package to more easily specify the path to our files relative to the directory where the .Rproj file is located.
```{r}
CO2_emissions <- readxl::read_xlsx(here("data","raw", "yearly_co2_emissions_1000_tonnes.xlsx"))
gdp_growth <- readxl::read_xlsx(here("data", "raw", "gdp_per_capita_yearly_growth.xlsx"))
energy_use <- readxl::read_xlsx(here("data", "raw", "energy_use_per_person.xlsx"))
```
If you had trouble downloading these files, you can do so at our [GitHub repo](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/raw/) or more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/yearly_co2_emissions_1000_tonnes.xlsx), [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/gdp_per_capita_yearly_growth.xlsx), and [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/energy_use_per_person.xlsx).
You may also download these files using the `OCSdata` package:
```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
raw_data("ocs-bp-co2-emissions", outpath = getwd())
# This will save the raw data files in a "OCSdata/data/raw/" subfolder
# in your current working directory
```
We will use the `read_csv()` function of the `readr` package to import the data from the `.csv` files.
However, these files contain some lines that we do not want to import because the number of columns differs for some rows. If we don't account for this, then we may end up importing fewer columns of the data than we would like.
In the first 5 rows of the `data/raw/disasters.csv` file shown below, you can see that the first two rows do not have the same number of columns as the subsequent rows and are just (sub)titles.
```{r, echo = FALSE, out.width = "600 px"}
knitr::include_graphics(here::here("img", "Disasters.png"))
```
Because these first two rows are only (sub)titles, we can skip them using the `skip = 2` argument of the `read_csv()` function.
```{r}
us_disaster <- readr::read_csv(here("data", "raw", "disasters.csv"), skip = 2)
```
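
To see the effect of `skip` in isolation, here is a minimal sketch using a small stand-in file. The file contents, column names, and object names (`toy_file`, `toy_disaster`) are invented for illustration and do not reproduce the real `disasters.csv`.

```r
# Write a small stand-in file; its contents are invented for this sketch
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Toy disaster counts",               # title line (one column)
  "Made-up values for illustration",   # subtitle line (one column)
  "Year,Drought Count,Flooding Count", # real header (three columns)
  "1980,1,1",
  "1981,0,0"
), toy_file)

# skip = 2 drops the two title lines, so the third line becomes the header.
# col_types = readr::cols() lets readr guess the column types quietly.
toy_disaster <- readr::read_csv(toy_file, skip = 2, col_types = readr::cols())

names(toy_disaster)
nrow(toy_disaster)
```

With the two title lines skipped, all three columns are imported and the two data rows are kept.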
If you had trouble downloading this file, you can do so at our [GitHub repo](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/raw) or more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/disasters.csv).
Now looking at the `data/raw/temperature.csv` file, we see that the first four lines do not have the same number of columns as the subsequent lines.
```{r, echo = FALSE, out.width = "600 px"}
knitr::include_graphics(here::here("img", "tempdata.png"))
```
We will skip importing all 4 lines by using `skip = 4`.
We can also replace all instances of `"-99"` with `NA` using the `na = "-99"` argument of the `read_csv()` function.
The "-99" needs to be in quotation marks because this argument expects characters.
***
<details> <summary> Click here for an explanation about data types in R and about character strings.</summary>
There are several [classes of data in R programming](https://en.wikipedia.org/wiki/R_(programming_language)), meaning that certain objects will be treated or interpreted differently. Character is one of these classes. A character string is an individual data value made up of characters. This can be a paragraph, like the legend for the table, or it can be a single letter or number like the letter "a" or the number "3". If data are of class character, then the numeric values will not be processed like a numeric value in a mathematical sense. If you want your numeric values to be interpreted that way, they need to be converted to a numeric class. The options typically used are integer (which has no decimal place) and double precision (which has a decimal place).
A variable that is a factor has a set of particular values called levels (this can be numbers or characters). Even if these are numeric, they will be interpreted as levels (i.e., as if they were characters) not as mathematical numbers. The values of a factor are assumed to have a particular ordering; by default the order is alphabetical, but this is not always the correct/intuitive ordering. You can modify the order of these levels with the `forcats` package.
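As a small illustration of these classes, here is a minimal sketch. The `toy_factor` object is made up for this example and is not part of the case study data.

```{r}
# A number in quotation marks is a character string, not a numeric value
class("3")
class(3)    # double precision
class(3L)   # integer

# Convert a character string to a numeric class before doing math with it
as.numeric("3") + 1

# Factor levels are ordered alphabetically by default
toy_factor <- factor(c("low", "high", "medium"))
levels(toy_factor)

# The forcats package can put the levels in a more intuitive order
toy_factor <- forcats::fct_relevel(toy_factor, "low", "medium", "high")
levels(toy_factor)
```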
</details>
***
```{r}
us_temperature <- readr::read_csv(here("data", "raw", "temperature.csv"), skip = 4, na = "-99")
```
If you had trouble downloading this file, you can do so at our [GitHub repo](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/raw) or more directly by clicking [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/raw/temperature.csv).
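To see how `skip` and `na` behave together, here is a minimal sketch on a made-up miniature file. The contents of `toy_csv` are hypothetical and only mimic the layout described above (title lines, then a header, then data); it assumes `readr` version 2.0 or later, where `I()` marks a string as literal data.

```{r}
library(readr)

# Hypothetical file contents: two title lines, then the header and two data rows
toy_csv <- I("Toy title line\nToy subtitle line\nYear,Value\n2000,1.5\n2001,-99\n")

# Skip the two title lines and treat "-99" as a missing value
toy_data <- read_csv(toy_csv, skip = 2, na = "-99", show_col_types = FALSE)
toy_data
```

The `-99` entry in the second data row is imported as `NA`, and the header is taken from the first line that was not skipped.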
Great! Now we have imported all of the data that we will need.
To allow users to skip the import step, we will save the data as an RDA file:
```{r, eval = FALSE}
save(CO2_emissions,
gdp_growth,
energy_use,
us_disaster,
us_temperature,
file = here::here("data", "imported", "co2_data_imported.rda"))
```
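If you have not used `save()` and `load()` before, the following sketch shows the round trip on two made-up objects (`toy_a` and `toy_b` are hypothetical) written to a temporary file, so nothing in the project directory is touched.

```{r}
# Two hypothetical objects to store together in one RDA file
toy_a <- data.frame(x = 1:3)
toy_b <- c("a", "b")

# Save both objects to a temporary file
toy_path <- tempfile(fileext = ".rda")
save(toy_a, toy_b, file = toy_path)

# Remove them from the environment, then restore them under their original names
rm(toy_a, toy_b)
load(toy_path)
exists("toy_a")
exists("toy_b")
```

Note that `load()` restores objects under the names they had when they were saved, so it can overwrite objects with the same names in your environment.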
# **Data Wrangling**
***
If you have been following along but stopped, you can load the imported data like so:
```{r}
load(here::here("data", "imported", "co2_data_imported.rda"))
```
***
<details> <summary> If you skipped the data import section click here. </summary>
First you need to install and load the `OCSdata` package:
```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```
Then, you may load the imported data using the following code:
```{r, eval=FALSE}
imported_data("ocs-bp-co2-emissions", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "co2_data_imported.rda"))
```
If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found [here](https://github.com//opencasestudies/ocs-bp-co2-emissions/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-co2-emissions/master/data/imported/co2_data_imported.rda). Download this file and place it in a subdirectory called "imported" within a directory called "data" in your current working directory, so that you can copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.
```{r}
load(here::here("data", "imported", "co2_data_imported.rda"))
```
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.
</details>
***
</details>
***
Next, we take a look at our data that we just imported.
We will need to do some data wrangling to allow us to evaluate how CO2 emissions have changed over time and how emissions may relate to energy use, GDP, etc.
Let's explore how to do that with useful functions and packages from the `tidyverse`.
## **Yearly CO~2~ Emissions**
***
First, let's take a look at the CO2 data (`CO2_emissions`).
We can use the `slice_head()` function of the `dplyr` package to see just the first rows of our data.
We can specify how many rows we would like to see by using the `n =` argument.
We will use the `%>%` pipe from the `magrittr` package (although it is also imported by other `tidyverse` packages, like `dplyr`), which can be used to define the input for later sequential steps.
This will make more sense when we have multiple sequential steps using the same data object.
```{r}
CO2_emissions %>%
slice_head(n = 3)
```
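To make the role of the pipe concrete, here is a minimal sketch on a made-up tibble (`toy_tbl` is hypothetical). The piped call and the ordinary nested call do the same thing: the object on the left of `%>%` becomes the first argument of the function on the right.

```{r}
library(dplyr)

# A small hypothetical tibble
toy_tbl <- tibble(id = 1:5, value = c(10, 20, 30, 40, 50))

# With the pipe: toy_tbl is passed as the first argument of slice_head()
piped <- toy_tbl %>%
  slice_head(n = 3)

# Without the pipe: the same call written the usual way
nested <- slice_head(toy_tbl, n = 3)

identical(piped, nested)
```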
Another useful function is `slice_sample()`, which shows a **selection of random rows** using [pseudorandom](https://en.wikipedia.org/wiki/Pseudorandomness){target="_blank"} numbers for the index of rows to show. To keep getting the same random rows, or for others to get the same rows, we need to set a seed first. We can do this with the base `set.seed()` function. We just specify a number with this function, and that will allow us to get the same subset of rows from the `slice_sample()` function. If two different people ran this code without `set.seed()`, they would each see a different subset of rows. For data exploration, this isn't a huge deal, but if we'd like separate analysts running the same code to see the same output, we should use `set.seed()`. If we changed `set.seed(123)` to `set.seed(333)`, we would obtain a different random sample of rows.
```{r}
set.seed(123)
CO2_emissions %>%
slice_sample(n = 3)
```
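The effect of the seed can be checked directly. In this minimal sketch on a made-up tibble (`toy_tbl` is hypothetical), resetting the same seed before each call gives the same rows both times.

```{r}
library(dplyr)

# A small hypothetical tibble with 20 rows
toy_tbl <- tibble(id = 1:20)

# Same seed before each draw, so both draws select the same rows
set.seed(123)
first_draw <- toy_tbl %>% slice_sample(n = 3)

set.seed(123)
second_draw <- toy_tbl %>% slice_sample(n = 3)

identical(first_draw, second_draw)
```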
#### {.think_question_block}
<b><u> Question Opportunity </u></b>
Try setting a different seed to see the difference in the output.
####
OK, we see each country is represented along one row and each column contains yearly CO2 emissions.
We also see that there are a lot of `NA` values.
We can also use the `glimpse()` function of the `dplyr` package to view our data.
This allows us to see all of our variables at once.
We will see a tiny bit of each variable/column with the data displayed on the right.
#### {.scrollable }
```{r}
# Scroll through the output!
CO2_emissions %>%
dplyr::glimpse()
```
####
We can also see that we have a large [tibble](https://tibble.tidyverse.org/).
```{r}
CO2_emissions %>%
class()
```
This is the object that is created when we read in the data with `readr`.
A tibble (or `tbl_df`) is the `tidyverse` version of a `data.frame` object.
Similar to `data.frame`, it is a table with variable information arranged as columns, and individual observations arranged as rows.
However, some nice differences are that tibbles do not change variable names or data types, and they give more messages when something is wrong (e.g., when a variable does not exist), which forces the analyst to confront problems earlier.
Tibbles also give us information about the class of each variable.
For example the `country` variable is made up of character (abbreviated as `chr`) values.
```{r}
CO2_emissions %>%
select(country)
```
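As a quick illustration of the point about variable names, the sketch below builds the same made-up column (the name `my var` is hypothetical) as a `data.frame` and as a tibble. With its default settings, `data.frame()` rewrites the name into a syntactically valid one, while `tibble()` keeps it as typed.

```{r}
library(tibble)

# data.frame() changes "my var" into a syntactically valid name by default
toy_df <- data.frame(`my var` = 1:2)
names(toy_df)

# tibble() leaves the name exactly as it was written
toy_tb <- tibble(`my var` = 1:2)
names(toy_tb)

# A tibble is still a data.frame, with extra classes on top
class(toy_tb)
```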
We see that we have `r nrow(CO2_emissions)` rows (one for each country) and CO2 emission values for `r ncol(CO2_emissions) - 1` different years (from 1751 to 2014).
```{r}
names(CO2_emissions)
```
Recall, the values are emissions in metric tons, also called tonnes.
Scrolling through the `glimpse()` function above, we can also see that there are fewer `NA` values for later years.
In this next code chunk, we will introduce the `%<>%` operator from the `magrittr` package.
This allows us to use our `CO2_emissions` data and reassign it to a modified version at the same time.
Let's modify `CO2_emissions` to make it more usable for making visualizations.
Specifically, we will use the `pivot_longer()` function of the `tidyr` package to convert our data into what is called **"long"** format. This is also sometimes referred to as **"narrow"** format.
This means that we will have more rows and fewer columns than our current format.
Right now our data is in what is called **"wide"** format.
In wide format, each variable is listed as its own column.
In contrast, in long format, variables may be collapsed into a column that identifies the variables and a column of values.
See [here](https://en.wikipedia.org/wiki/Wide_and_narrow_data){target="_blank"} for more information about the difference between the two formats.
We want to collapse all of the values for the emission data across the different individual year variables into one new `Emissions` variable. We will identify what year they are from by creating a new `Year` variable. The `cols =` argument allows us to specify which columns we want to pivot (or not pivot) to create these new columns. We want to keep our `country` data as an ID variable, so we will exclude it using the `-` sign; all other columns will then be pivoted.
```{r}
CO2_emissions %<>%
pivot_longer(cols = -country,
names_to = "Year",
values_to = "Emissions")
set.seed(123)
CO2_emissions %>%
slice_sample(n = 6)
```
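The same reshaping can be followed more easily on a tiny made-up table. In this minimal sketch, `toy_wide` and its values are hypothetical and only mimic the layout of the emissions data (one `country` column plus one column per year).

```{r}
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
toy_wide <- tibble(country = c("A", "B"),
                   `2000` = c(1, 2),
                   `2001` = c(3, NA))

# Pivot every column except country into Year/Emissions pairs
toy_long <- toy_wide %>%
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "Emissions")

toy_long
```

Each country now appears once per year, and the former column names are stored as character values in `Year`.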
#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>
Think a moment about what the dimensions of the `CO2_emissions` tibble are now and why? How would you check this?
<b><u> Hint </u></b>: Checking has something to do with a unique aspect about tibbles.
####
Let's say we also want to rename the `country` variable to be capitalized.
To do this, we can use the `rename()` function of the `dplyr` package to rename this variable.
When renaming variables, the syntax is `new_name = old_name`, where the new name is listed first before the `=`.
You may also note that the `Year` variable is currently of class type character. We would like to change it to be numeric. To do this we will use the `mutate()` function, which is also part of the `dplyr` package. This function allows us to create and modify variables. We will also use this function to create a variable called `Label` which will have `"CO2 Emissions (Metric Tons)"` as the value for every row, to be used when we create plots later.
```{r}
CO2_emissions %<>%
dplyr::rename(Country = country) %>%
dplyr::mutate(Year = as.numeric(Year),
Label = "CO2 Emissions (Metric Tons)")
```
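To see what `%<>%` does in isolation, here is a minimal sketch on a made-up tibble (`toy_long` and its values are hypothetical). The object on the left is piped through the steps on the right, and the result is then assigned back to the same name.

```{r}
library(dplyr)
library(magrittr)

# Hypothetical long-format data with Year stored as character
toy_long <- tibble(country = c("A", "A"),
                   Year = c("2000", "2001"),
                   Emissions = c(1, 3))

# Pipe toy_long through rename() and mutate(), then overwrite toy_long with the result
toy_long %<>%
  rename(Country = country) %>%
  mutate(Year = as.numeric(Year),
         Label = "Toy label")

toy_long
```

This is equivalent to writing `toy_long <- toy_long %>% ...`, just more compactly.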
Now let's take a look to see how our data has changed:
```{r}
set.seed(123)
CO2_emissions %>%
slice_sample(n = 6)
```
Great, we can see that now the `Year` variable is of class double (abbreviated `dbl`), which is a numeric class.
Now, let's take a look at the `Country` variable to check if there is anything unexpected.
We will use the `distinct()` function of the `dplyr` package to view the unique values only.
Finally, we use the `pull()` function of the `dplyr` package to extract the values from the column (this is similar to using the `$` base R syntax, e.g. `CO2_emissions$Country`).
#### {.scrollable }
```{r}
# Scroll through the output!
CO2_emissions %>%
distinct(Country) %>%
pull()
```
####
These all look as expected!
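For comparison, here is a minimal sketch of the same check on a made-up tibble (`toy_countries` is hypothetical), next to one possible base R equivalent.

```{r}
library(dplyr)

# Hypothetical data with a repeated country name
toy_countries <- tibble(Country = c("A", "A", "B"))

# dplyr approach: keep the distinct rows, then extract the column as a vector
tidy_values <- toy_countries %>%
  distinct(Country) %>%
  pull()

# Base R approach: extract the column with $ and remove duplicates
base_values <- unique(toy_countries$Country)

identical(tidy_values, base_values)
```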
## **Yearly Growth in GDP per Capita**
***
Let's take a look at the next dataset (`gdp_growth`) that we imported.
```{r}
gdp_growth %>%
slice_head(n = 3)
```
How many rows and columns are there? We can easily check by using the base `dim()` function, which evaluates the dimensions of an object.
```{r}
dim(gdp_growth)
```
Interesting, it's `r nrow(gdp_growth)` rows, one per country (as opposed to the `r dplyr::n_distinct(CO2_emissions$Country)` countries in the CO2 data above).
We will deal with this and other differences in the sets of countries a bit later on.
There are also `r ncol(gdp_growth)` columns with a `country` column and a set of columns corresponding to different years.
```{r}
names(gdp_growth)
```
Yes, no other columns in this dataset.
Next, we will use the `pivot_longer()` function to transform the data to long format, similar to what we did in the previous section.
We will also again change the `country` variable to be `Country` by using the `rename()` function, and we will make the `Year` variable numeric using the `mutate()` function.
#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>
Using what you just learned about `pivot_longer()`, `rename()`, and `mutate()` and without scrolling up, try to come up with the code to do the wrangling for this data.
####
***
<details> <summary> Click here to reveal the code. </summary>
```{r, eval = FALSE}
gdp_growth %<>%
pivot_longer(cols = -country,
names_to = "Year",
values_to = "gdp_growth") %>%
rename(Country = country) %>%
mutate(Year = as.numeric(Year),
Label = "GDP Growth/Capita (%)") %>%
rename(GDP = gdp_growth)
```
</details>
***
```{r, echo = FALSE}
gdp_growth %<>%
pivot_longer(cols = -country,
names_to = "Year",
values_to = "gdp_growth") %>%
rename(Country = country) %>%
mutate(Year = as.numeric(Year),
Label = "GDP Growth/Capita (%)") %>%
rename(GDP = gdp_growth)
```
Now let's see how this data has changed:
```{r}
gdp_growth %>%
slice_head(n = 6)
gdp_growth %>%
count(Year)
```
Again let's check that the `Country` variable only contains values we would expect.
#### {.scrollable }
```{r}
# Scroll through the output!
gdp_growth %>%
distinct(Country) %>%
pull()
```
####
Also looks good!
## **Energy Use per Person**
***
Now let's take a look at the energy use per person data (`energy_use`) using `slice_head()` and `glimpse()`.
```{r}
energy_use %>%
slice_head(n = 3)
```
#### {.scrollable}
```{r}
energy_use %>%
glimpse()
```
####
Looks like we have `r nrow(energy_use)` rows and `r ncol(energy_use)` columns where we have a `country` column and again a set of years.
To wrangle the `energy_use` data, we will again convert the data to long format, rename some variables, and mutate the `Year` data to be numeric.
#### {.recall_code_question_block}
<b><u> Question Opportunity </u></b>
Again try to come up with the code on your own to wrangle the data.
####
***
<details> <summary> Click here to reveal the code. </summary>
```{r, eval = FALSE}
energy_use %<>%
pivot_longer(cols = -country,
names_to = "Year",
values_to = "energy_use") %>%
rename(Country = country) %>%
mutate(Year = as.numeric(Year),
Label = "Energy Use (kg, oil-eq./capita)") %>%
rename(Energy = energy_use)
```
</details>
***
```{r, echo = FALSE}
energy_use %<>%
pivot_longer(cols = -country,
names_to = "Year",
values_to = "energy_use") %>%
rename(Country = country) %>%
mutate(Year = as.numeric(Year),
Label = "Energy Use (kg, oil-eq./capita)") %>%
rename(Energy = energy_use)
```
```{r}
set.seed(123)
energy_use %>%
slice_sample(n = 3)
```
Now we will check the `Country` variable:
#### {.scrollable }
```{r}
# Scroll through the output!
energy_use %>%
distinct(Country) %>%
pull()
```
####
Looks good!
## **US Specific Data**
***
Now we will take a look at the US data about disasters and temperature.
### **Disasters**
***
First, we consider the disasters that have occurred in the US.
```{r}
us_disaster
```
We are specifically interested in the `Year` and the variables that contain the word `"Count"`. The other variables represent an estimate of the economic cost in billions of dollars, as well as the upper and lower bounds for simulations used to estimate the economic cost, which show the level of uncertainty in these estimates (at three different levels of confidence) as the true cost is unknown. See [here](https://www.ncdc.noaa.gov/billions/time-series) for more information about the data. For this analysis, we will focus just on the number of disasters that occurred each year.
We will select our variables of interest using the `select()` and `contains()` functions in the `dplyr` package.
Since we are selecting for variables with the word `"Count"` we need to use quotation marks around it.
Selecting for the variable `Year` does not require quotes because it is the full name of one of the existing variables.
```{r}
us_disaster %<>%
select(Year, contains("Count"))
us_disaster %>%
slice_head(n = 6)
```
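The way `contains()` picks columns can be seen on a small made-up tibble. The column names in `toy_disaster` below are hypothetical and only imitate the count and cost columns described above.

```{r}
library(dplyr)

# Hypothetical data with count and cost columns for two disaster types
toy_disaster <- tibble(Year = c(1980, 1981),
                       `Drought Count` = c(1, 0),
                       `Drought Cost` = c(2.5, 0),
                       `Flood Count` = c(0, 2))

# Keep Year plus every column whose name contains "Count"
toy_counts <- toy_disaster %>%
  select(Year, contains("Count"))

names(toy_counts)
```

Only `Year` and the two count columns remain; the cost column is dropped because its name does not contain `"Count"`.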
Now we want to create a new variable that will be the sum of all the different types of disasters for each year.
We don't want to include the `Year` variable in our sum, so we can exclude it using the `select()` function. To perform the sum for each year, we can use the base `rowSums()` function.
```{r}
yearly_disasters <- us_disaster %>%
select(-Year) %>%
rowSums()
yearly_disasters
```
We could then add this to our `us_disaster` tibble using the `bind_cols()` function of the `dplyr` package, like so:
```{r}
us_disaster %>% bind_cols(Disasters = yearly_disasters)
```
However, we can actually create and add this new variable directly to the `us_disaster` tibble by using the `mutate()` function of `dplyr` and using the `.` notation.
The `.` is a placeholder for the data on the left side of the pipe, which in this case is the entire `us_disaster` tibble. Inside `mutate()`, we pass `.` to the `select()` function to drop the `Year` variable, and the output from the `select()` function is then used as the input for the `rowSums()` function.
```{r}
us_disaster %<>%
mutate(Disasters = rowSums(select(., -Year)))
us_disaster %>%
glimpse()
```
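Here is a minimal sketch of the `.` approach on made-up counts (`toy_counts` and its column names are hypothetical). It also shows one possible alternative, `rowSums(across(-Year))`, which assumes `dplyr` version 1.0 or later and avoids the `.` placeholder.

```{r}
library(dplyr)

# Hypothetical yearly counts for two disaster types
toy_counts <- tibble(Year = c(1980, 1981),
                     `Drought Count` = c(1, 0),
                     `Flood Count` = c(0, 2))

# The `.` placeholder stands for toy_counts, the data on the left side of the pipe
with_dot <- toy_counts %>%
  mutate(Disasters = rowSums(select(., -Year)))

# Alternative: across() selects the columns from within mutate()
with_across <- toy_counts %>%
  mutate(Disasters = rowSums(across(-Year)))

with_dot
identical(with_dot$Disasters, with_across$Disasters)
```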
Great, now we are going to remove some of these variables and just keep the variables of interest using `select()`.
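We are also going to add a new variable called `Country` to indicate that this data is from the United States. This will create a new variable where every value is `United States`. We will then use `pivot_longer()` to collapse the `Disasters` column into `Indicator` and `Value` columns, and add a `Label` variable for use in plots.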
```{r}
us_disaster %<>%
dplyr::select(Year, Disasters) %>%
mutate(Country = "United States") %>%
pivot_longer(cols = c(-Country, -Year),
names_to = "Indicator",
values_to = "Value") %>%
mutate(Label = "Number of Disasters")
us_disaster %>%
slice_head(n = 6)
```
Great, this looks good now.
#### {.think_question_block}
<b><u> Question Opportunity </u></b>
This dataset was slightly different from the other datasets and therefore required slightly different wrangling.
Why was it necessary to exclude the `Year` variable from the `pivot_longer()` function?
What would happen if we did not exclude `Year`?
####
### **Temperature**
***
Next, we consider the temperature in the US over time.
```{r}
us_temperature %>%
slice_head(n = 6)
```
So a few things need to be fixed here.
First, the `Date` column looks a bit strange. The numbers look like the year followed by the number 12 (representing 12 months).
We want to change this to only keep the first 4 characters in the `Date` variable string values.
However, first let's make sure that all of the `Date` values are indeed 6 characters long and that they all end with the number 12.
We can use a couple of functions in the `stringr` package to do this. This package is used for working with character strings. The `str_length()` function can be used to check the length of each value, while the `str_ends()` function can be used to check that all the values end with `"12"`.
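To get a feel for these two functions before applying them to our data, here is a minimal sketch on a made-up character vector (`toy_dates` is hypothetical and not part of `us_temperature`; the last value is deliberately malformed). It assumes `stringr` version 1.4 or later, where `str_ends()` is available.

```{r}
library(stringr)

# Hypothetical date strings; the last one is too short and does not end in "12"
toy_dates <- c("189512", "189612", "1897")

# Number of characters in each value
str_length(toy_dates)

# Does each value end with "12"?
str_ends(toy_dates, "12")

# Collapse each check into a single TRUE/FALSE answer
all(str_length(toy_dates) == 6)
all(str_ends(toy_dates, "12"))
```

Wrapping a check in `all()` gives a single answer for the whole vector, which is convenient when there are too many values to inspect by eye.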
Let's start with the `str_length()` function. These functions in the `stringr` package require a character vector. Thus we need to pull the values for the `Date` variable first using the `pull()` function of the `dplyr` package.
```{r}
us_temperature %>%
pull(Date) %>%