-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathae-10a-lifecycle.qmd
102 lines (61 loc) · 3.45 KB
/
ae-10a-lifecycle.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
---
title: "Live Coding - Data Science Lifecycle"
format:
html:
toc: true
editor: visual
bibliography: references.bib
execute:
echo: true
message: false
warning: false
editor_options:
chunk_output_type: console
---
# Data import
Data can be imported from many different sources. In this exercise, we import data from:
1. a CSV file that is stored locally in our repository,
2. an R Package that is loaded via the `library()` function.
## CSV file stored locally
**Waste Characterization Data, Abidjan, Cote d'Ivoire**
The data in this exercise come from a study undertaken in Abidjan, Cote d'Ivore [@waste2worthinnovations2020]. The data is stored in the sub-folder "raw_data" within the "data" folder. We are using a function from the `readr` R Package to import the data.
## Stored in a R data package
**Gapminder**
The data is an [excerpt of the Gapminder data](https://www.gapminder.org/data/documentation/) on life expectancy, GDP per capita, and population by country. Compiled for the purpose of teaching with data until 2007. It is not a definitive source of socioeconomic data and it is not updated [@bryan2017].
# Data tidying
Once you imported the data, the next step in any data science lifecycle is the tidying of data. It's about bringing your data into a consistent structure that let's you focus on the analysis. Depending on how messy the data structure is, this can be tedious task and take up a great part of the data scienbce lifecycle process.
## Waste characterization data
**Look at the output in the console**
1. Which principles for data organisation in spreadsheets are not followed?
## Gapminder data
# Data transformation
Once your data is tidy, a common first step is to transform it. This includes:
1. narrowing in on what interests you (e.g. all people without toilets in one district, or all data from last year)
2. creating new variables from existing (like toilet density, as in number of people per toilet)
3. calculating summary statistics (like counts or the mean)
## Waste characterization data
**Goal:** Calculate the percentages of waste types for sample number 23-A1.
## Gapminder data
**Goal:** Calculate the median life expectancy at birth by continent for 2007.
# Data visualisation (and tables)
## Waste characterization data
Half of the waste in sample 23-A1 was found to be in the Food/Organic category (see @tbl-one-sample).
```{r}
#| label: tbl-one-sample
#| tbl-cap: "Mass in kilogram and percentages of waste categories in sample 23-A1"
```
## Gapminder data
Africa has the lowst life expectancy and greatest spread in life expectancy globally (compare @tbl-median-continent and @fig-boxplot-continent).
```{r}
#| label: tbl-median-continent
#| tbl-cap: "Median life expectancy by continent in 2007."
```
**Goal:** Use a boxplot to visualize the median life expectancy by continent in 2007. In the same plot, use a jitter plot to visualise the spread of life expectancy for all countries.
```{r}
#| label: fig-boxplot-continent
#| fig-cap: "Median life expectancy by continent in 2007. Each dot represents one individual country."
```
# Data communication
Your rendered html file is a great data communication product. You will learn a bit later in the course how you could make it accessible to others as a website, but you could also just share the HTML file with a collaborator or render the document to the 'pdf' or 'docx' formats.
**Finale:** Render, Add, Commit, and Push changes back to GitHub.
# References