Some final minor edits
rafalab committed Nov 24, 2023
1 parent cda2c17 commit 1308179
Showing 7 changed files with 80 additions and 76 deletions.
40 changes: 20 additions & 20 deletions docs/search.json

Large diffs are not rendered by default.

56 changes: 28 additions & 28 deletions docs/sitemap.xml
@@ -2,114 +2,114 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/index.html</loc>
<lastmod>2023-11-24T17:05:14.904Z</lastmod>
<lastmod>2023-11-24T17:29:34.104Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/intro.html</loc>
<lastmod>2023-11-24T17:05:14.907Z</lastmod>
<lastmod>2023-11-24T17:29:34.108Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/intro-to-R.html</loc>
<lastmod>2023-11-24T17:05:14.910Z</lastmod>
<lastmod>2023-11-24T17:29:34.111Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/getting-started.html</loc>
<lastmod>2023-11-24T17:05:14.916Z</lastmod>
<lastmod>2023-11-24T17:29:34.116Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/R-basics.html</loc>
<lastmod>2023-11-24T17:05:14.948Z</lastmod>
<lastmod>2023-11-24T17:29:34.149Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/programming-basics.html</loc>
<lastmod>2023-11-24T17:05:14.959Z</lastmod>
<lastmod>2023-11-24T17:29:34.160Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html</loc>
<lastmod>2023-11-24T17:05:14.992Z</lastmod>
<lastmod>2023-11-24T17:29:34.184Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/data-table.html</loc>
<lastmod>2023-11-24T17:05:15.004Z</lastmod>
<lastmod>2023-11-24T17:29:34.196Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/R/importing-data.html</loc>
<lastmod>2023-11-24T17:05:15.014Z</lastmod>
<lastmod>2023-11-24T17:29:34.205Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/intro-dataviz.html</loc>
<lastmod>2023-11-24T17:05:15.018Z</lastmod>
<lastmod>2023-11-24T17:29:34.210Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/distributions.html</loc>
<lastmod>2023-11-24T17:05:15.025Z</lastmod>
<lastmod>2023-11-24T17:29:34.217Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/ggplot2.html</loc>
<lastmod>2023-11-24T17:05:15.044Z</lastmod>
<lastmod>2023-11-24T17:29:34.247Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/dataviz-principles.html</loc>
<lastmod>2023-11-24T17:05:15.059Z</lastmod>
<lastmod>2023-11-24T17:29:34.260Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/dataviz-in-practice.html</loc>
<lastmod>2023-11-24T17:05:15.090Z</lastmod>
<lastmod>2023-11-24T17:29:34.291Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/intro-to-wrangling.html</loc>
<lastmod>2023-11-24T17:05:15.093Z</lastmod>
<lastmod>2023-11-24T17:29:34.293Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/reshaping-data.html</loc>
<lastmod>2023-11-24T17:05:15.104Z</lastmod>
<lastmod>2023-11-24T17:29:34.305Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/joining-tables.html</loc>
<lastmod>2023-11-24T17:05:15.117Z</lastmod>
<lastmod>2023-11-24T17:29:34.318Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/dates-and-times.html</loc>
<lastmod>2023-11-24T17:05:15.125Z</lastmod>
<lastmod>2023-11-24T17:29:34.325Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/data-table-wrangling.html</loc>
<lastmod>2023-11-24T17:05:15.133Z</lastmod>
<lastmod>2023-11-24T17:29:34.332Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/web-scraping.html</loc>
<lastmod>2023-11-24T17:05:15.142Z</lastmod>
<lastmod>2023-11-24T17:29:34.342Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/string-processing.html</loc>
<lastmod>2023-11-24T17:10:41.829Z</lastmod>
<lastmod>2023-11-24T17:29:34.377Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/text-mining.html</loc>
<lastmod>2023-11-24T17:05:15.211Z</lastmod>
<lastmod>2023-11-24T17:29:34.393Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/intro-productivity.html</loc>
<lastmod>2023-11-24T17:05:15.215Z</lastmod>
<lastmod>2023-11-24T17:29:34.396Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/installing-r-and-rstudio.html</loc>
<lastmod>2023-11-24T17:05:15.219Z</lastmod>
<lastmod>2023-11-24T17:29:34.401Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/installing-git.html</loc>
<lastmod>2023-11-24T17:05:15.225Z</lastmod>
<lastmod>2023-11-24T17:29:34.407Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/unix.html</loc>
<lastmod>2023-11-24T17:05:15.234Z</lastmod>
<lastmod>2023-11-24T17:29:34.416Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/git.html</loc>
<lastmod>2023-11-24T17:05:15.242Z</lastmod>
<lastmod>2023-11-24T17:29:34.423Z</lastmod>
</url>
<url>
<loc>http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/reproducible-projects.html</loc>
<lastmod>2023-11-24T17:05:15.251Z</lastmod>
<lastmod>2023-11-24T17:29:34.433Z</lastmod>
</url>
</urlset>
4 changes: 1 addition & 3 deletions docs/wrangling/dates-and-times.html
@@ -332,9 +332,7 @@
</div>
<!-- main -->
<main class="content" id="quarto-document-content"><header id="title-block-header" class="quarto-title-block default"><div class="quarto-title">
<h1 class="title">
<span class="chapter-number">13</span>&nbsp; <span class="chapter-title">Parsing dates and times</span>
</h1>
<h1 class="title"><span id="sec-dates-and-times" class="quarto-section-identifier"><span class="chapter-number">13</span>&nbsp; <span class="chapter-title">Parsing dates and times</span></span></h1>
</div>


25 changes: 12 additions & 13 deletions docs/wrangling/reshaping-data.html
@@ -334,9 +334,7 @@
</div>
<!-- main -->
<main class="content" id="quarto-document-content"><header id="title-block-header" class="quarto-title-block default"><div class="quarto-title">
<h1 class="title">
<span class="chapter-number">11</span>&nbsp; <span class="chapter-title">Reshaping data</span>
</h1>
<h1 class="title"><span id="sec-reshape" class="quarto-section-identifier"><span class="chapter-number">11</span>&nbsp; <span class="chapter-title">Reshaping data</span></span></h1>
</div>


@@ -473,19 +471,20 @@ <h1 class="title">
<span><span class="co">#&gt; 5 Germany 1962 fertility 2.47</span></span>
<span><span class="co">#&gt; # ℹ 219 more rows</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>But we are not done yet. We need to create a column for each variable. As we learned, the <code>pivot_wider</code> function can do this:</p>
<div class="cell" data-layout-align="center" data-hash="reshaping-data_cache/html/unnamed-chunk-13_5029f55e371b147070e8fb0f639d3858">
<p>But we are not done yet. We need to create a column for each variable and change <code>year</code> to a number. As we learned, the <code>pivot_wider</code> function handles the first task, and <code>mutate</code> with <code>as.integer</code> handles the second:</p>
<div class="cell" data-layout-align="center" data-hash="reshaping-data_cache/html/unnamed-chunk-13_28f79e46a9546c0dad30a0c9ec106507">
<div class="sourceCode" id="cb14"><pre class="downlit sourceCode r code-with-copy"><code class="sourceCode R"><span><span class="va">dat</span> <span class="op">|&gt;</span> </span>
<span> <span class="fu"><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim</a></span><span class="op">(</span><span class="va">name</span>, delim <span class="op">=</span> <span class="st">"_"</span>, names <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/c.html">c</a></span><span class="op">(</span><span class="st">"year"</span>, <span class="st">"name"</span><span class="op">)</span>, too_many <span class="op">=</span> <span class="st">"merge"</span><span class="op">)</span> <span class="op">|&gt;</span></span>
<span> <span class="fu"><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider</a></span><span class="op">(</span><span class="op">)</span></span>
<span> <span class="fu"><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider</a></span><span class="op">(</span><span class="op">)</span> <span class="op">|&gt;</span></span>
<span> <span class="fu"><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate</a></span><span class="op">(</span>year <span class="op">=</span> <span class="fu"><a href="https://rdrr.io/r/base/integer.html">as.integer</a></span><span class="op">(</span><span class="va">year</span><span class="op">)</span><span class="op">)</span></span>
<span><span class="co">#&gt; # A tibble: 112 × 4</span></span>
<span><span class="co">#&gt; country year fertility life_expectancy</span></span>
<span><span class="co">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;</span></span>
<span><span class="co">#&gt; 1 Germany 1960 2.41 69.3</span></span>
<span><span class="co">#&gt; 2 Germany 1961 2.44 69.8</span></span>
<span><span class="co">#&gt; 3 Germany 1962 2.47 70.0</span></span>
<span><span class="co">#&gt; 4 Germany 1963 2.49 70.1</span></span>
<span><span class="co">#&gt; 5 Germany 1964 2.49 70.7</span></span>
<span><span class="co">#&gt; country year fertility life_expectancy</span></span>
<span><span class="co">#&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;</span></span>
<span><span class="co">#&gt; 1 Germany 1960 2.41 69.3</span></span>
<span><span class="co">#&gt; 2 Germany 1961 2.44 69.8</span></span>
<span><span class="co">#&gt; 3 Germany 1962 2.47 70.0</span></span>
<span><span class="co">#&gt; 4 Germany 1963 2.49 70.1</span></span>
<span><span class="co">#&gt; 5 Germany 1964 2.49 70.7</span></span>
<span><span class="co">#&gt; # ℹ 107 more rows</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>The data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.</p>
22 changes: 14 additions & 8 deletions wrangling/data-table-wrangling.qmd
@@ -25,7 +25,7 @@ filename <- file.path(path, "fertility-two-countries-example.csv")
```


### pivot_long is melt
### `pivot_longer` is `melt`

If in **tidyverse** we write

@@ -39,7 +39,7 @@ in **data.table** we use the `melt` function

```{r}
dt_wide_data <- fread(filename)
dt_new_tidy_data <- melt(as.data.table(dt_wide_data),
dt_new_tidy_data <- melt(dt_wide_data,
measure.vars = 2:ncol(dt_wide_data),
variable.name = "year",
value.name = "fertility")
@@ -54,13 +54,13 @@ If in **tidyverse** we write
```{r}
new_wide_data <- new_tidy_data |>
pivot_wider(names_from = year, values_from = fertility)
select(new_wide_data, country, `1960`:`1967`)
```

in **data.table** we write:

```{r}
dt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... ~ year, value.var = "fertility")
dt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... ~ year,
value.var = "fertility")
```


@@ -77,17 +77,21 @@ In **tidyverse** we wrangled using
```{r, message=FALSE}
raw_dat <- read_csv(filename)
dat <- raw_dat |> pivot_longer(-country) |>
separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge") |>
pivot_wider()
separate_wider_delim(name, delim = "_", names = c("year", "name"),
too_many = "merge") |>
pivot_wider() |>
mutate(year = as.integer(year))
```

In **data.table** we can use the `tstrsplit` function:

```{r}
dt_raw_dat <- fread(filename)
dat_long <- melt(dt_raw_dat, measure.vars = which(names(dt_raw_dat) != "country"),
dat_long <- melt(dt_raw_dat,
measure.vars = which(names(dt_raw_dat) != "country"),
variable.name = "name", value.name = "value")
dat_long[, c("year", "name", "name2") := tstrsplit(name, "_", fixed = TRUE, type.convert = TRUE)]
dat_long[, c("year", "name", "name2") :=
tstrsplit(name, "_", fixed = TRUE, type.convert = TRUE)]
dat_long[is.na(name2), name2 := ""]
dat_long[, name := paste(name, name2, sep = "_")][, name2 := NULL]
dat_wide <- dcast(dat_long, country + year ~ name, value.var = "value")
@@ -127,5 +131,7 @@ Other similar functions are `second`, `minute`, `hour`, `wday`, `week`,
The package also includes the classes `IDate` and `ITime`, which store dates and times more efficiently, a convenience for large files with date stamps. You can convert dates and times in the usual R formats to these classes using `as.IDate` and `as.ITime`.
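
As a minimal sketch of these conversions, the chunk below applies `as.IDate` and `as.ITime` to two timestamps. The object names `stamps`, `dates`, and `times` and the timestamp values are made up for illustration and are not part of the chapter's data.

```r
library(data.table)

# Hypothetical timestamps, made up for illustration
stamps <- as.POSIXct(c("2017-11-02 08:30:00", "2017-11-03 17:45:10"),
                     tz = "UTC")

# Split each timestamp into an integer-backed date and time of day
dates <- as.IDate(stamps)
times <- as.ITime(stamps)

class(dates)          # "IDate" "Date"
class(times)          # "ITime"
typeof(dates)         # "integer"
as.integer(times[1])  # seconds since midnight: 30600
```

Because both classes are stored as integers, they can take less memory than the default double-based representations and can serve as grouping or key columns in a **data.table**.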


## Exercises

Repeat the exercises in @sec-reshape, @sec-joins, and @sec-dates-and-times using **data.table** instead of **tidyverse**.

2 changes: 1 addition & 1 deletion wrangling/dates-and-times.qmd
@@ -1,4 +1,4 @@
# Parsing dates and times
# Parsing dates and times {#sec-dates-and-times}


We have described three main types of vectors: numeric, character, and logical. When analyzing data, we often encounter variables that are dates. Although we can represent a date with a string, for example `November 2, 2017`, once we pick a reference day, referred to as the _epoch_ by computer programmers, dates can be converted to numbers by calculating the number of days since the epoch. In R and Unix, the epoch is defined as January 1, 1970. So, for example, January 2, 1970 is day 1, December 31, 1969 is day -1, and November 2, 2017, is day 17,472.
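
The day counts can be checked directly in base R. This is a small sketch, not code from the chapter, and the variable name `day_count` is introduced here only for illustration.

```r
# Dates are stored as the number of days since the epoch, 1970-01-01
as.numeric(as.Date("1970-01-01"))  # 0
as.numeric(as.Date("1970-01-02"))  # 1
as.numeric(as.Date("1969-12-31"))  # -1

# Any other date converts the same way
day_count <- as.numeric(as.Date("2017-11-02"))
day_count

# The date can be recovered from the day count and the epoch
as.Date(day_count, origin = "1970-01-01")
```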
7 changes: 4 additions & 3 deletions wrangling/reshaping-data.qmd
@@ -1,4 +1,4 @@
# Reshaping data
# Reshaping data {#sec-reshape}

As we have seen throughout the book, having data in *tidy* format is what makes the tidyverse flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The **tidyr** package, part of **tidyverse**, includes several functions that are useful for tidying data.

@@ -115,12 +115,13 @@ However, this line of code will give an error. This is because the life expectan
dat |> separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge")
```

But we are not done yet. We need to create a column for each variable. As we learned, the `pivot_wider` function can do this:
But we are not done yet. We need to create a column for each variable and change `year` to a number. As we learned, the `pivot_wider` function handles the first task, and `mutate` with `as.integer` handles the second:

```{r}
dat |>
separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge") |>
pivot_wider()
pivot_wider() |>
mutate(year = as.integer(year))
```

The data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.
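
To see the whole pipeline on something small enough to inspect by eye, here is a sketch that runs the same steps on a toy table. The object names `toy` and `tidy`, the country labels, and all values are made up for illustration; only the column-naming pattern (`year_variable`) is meant to mirror the example file. It assumes a version of **tidyr** that provides `separate_wider_delim`.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per year/variable combination
toy <- tibble(
  country = c("A", "B"),
  `1960_fertility` = c(2.4, 6.1),
  `1960_life_expectancy` = c(69.3, 53.0),
  `1961_fertility` = c(2.5, 5.9),
  `1961_life_expectancy` = c(69.8, 53.7)
)

tidy <- toy |>
  pivot_longer(-country) |>                 # columns: country, name, value
  separate_wider_delim(name, delim = "_", names = c("year", "name"),
                       too_many = "merge") |>  # keeps "life_expectancy" whole
  pivot_wider() |>                          # one column per variable
  mutate(year = as.integer(year))           # year was character until here

tidy
```

The result has one row per country and year, with `fertility` and `life_expectancy` as separate columns and `year` stored as an integer.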
