From 130817912223327d2cc1dfa735bc6b11ff3f9cd5 Mon Sep 17 00:00:00 2001 From: Rafael A Irizarry Date: Fri, 24 Nov 2023 12:30:30 -0500 Subject: [PATCH] Some final minor edits --- docs/search.json | 40 ++++++++++----------- docs/sitemap.xml | 56 ++++++++++++++--------------- docs/wrangling/dates-and-times.html | 4 +-- docs/wrangling/reshaping-data.html | 25 +++++++------ wrangling/data-table-wrangling.qmd | 22 +++++++----- wrangling/dates-and-times.qmd | 2 +- wrangling/reshaping-data.qmd | 7 ++-- 7 files changed, 80 insertions(+), 76 deletions(-) diff --git a/docs/search.json b/docs/search.json index 049c142..75eadf7 100644 --- a/docs/search.json +++ b/docs/search.json @@ -744,35 +744,35 @@ { "objectID": "wrangling/reshaping-data.html#pivot_longer", "href": "wrangling/reshaping-data.html#pivot_longer", - "title": "\n11  Reshaping data\n", + "title": "11  Reshaping data", "section": "\n11.1 pivot_longer\n", "text": "11.1 pivot_longer\n\nOne of the most used functions in the tidyr package is pivot_longer, which is useful for converting wide data into tidy data.\nAs with most tidyverse functions, the pivot_longer function’s first argument is the data frame that will be converted. Here we want to reshape the wide_data dataset so that each row represents a fertility observation, which implies we need three columns to store the year, country, and the observed value. In its current form, data from different years are in different columns with the year values stored in the column names. Through the names_to and values_to argument we will tell pivot_longer the column names we want to assign to the columns containing the current column names and observations, respectively. The default names are name and value, which are often usable choices. In this case a better choice for these two arguments would be year and fertility. Note that nowhere in the data file does it tell us this is fertility data. Instead, we deciphered this from the file name. 
Through cols, the second argument we specify the columns containing observed values; these are the columns that will be pivoted. The default is to pivot all columns so, in most cases, we have to specify the columns. In our example we want columns 1960, 1961 up to 2015.\nThe code to pivot the fertility data therefore looks like this:\n\nnew_tidy_data <- wide_data |>\n pivot_longer(`1960`:`2015`, names_to = \"year\", values_to = \"fertility\")\n\nWe can see that the data have been converted to tidy format with columns year and fertility:\n\nhead(new_tidy_data)\n#> # A tibble: 6 × 3\n#> country year fertility\n#> <chr> <chr> <dbl>\n#> 1 Germany 1960 2.41\n#> 2 Germany 1961 2.44\n#> 3 Germany 1962 2.47\n#> 4 Germany 1963 2.49\n#> 5 Germany 1964 2.49\n#> # ℹ 1 more row\n\nand that each year resulted in two rows since we have two countries and this column was not pivoted. A somewhat quicker way to write this code is to specify which column will not include in the pivot, rather than all the columns that will be pivoted:\n\nnew_tidy_data <- wide_data |>\n pivot_longer(-country, names_to = \"year\", values_to = \"fertility\")\n\nThe new_tidy_data object looks like the original tidy_data we defined this way\n\ntidy_data <- gapminder |> \n filter(country %in% c(\"South Korea\", \"Germany\") & !is.na(fertility)) |>\n select(country, year, fertility)\n\nwith just one minor difference. Can you spot it? Look at the data type of the year column. The pivot_longer function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. 
We need to convert the year column to be numbers:\n\nnew_tidy_data <- wide_data |>\n pivot_longer(-country, names_to = \"year\", values_to = \"fertility\") |>\n mutate(year = as.integer(year))\n\nNow that the data is tidy, we can use this relatively simple ggplot code:\n\nnew_tidy_data |> ggplot(aes(year, fertility, color = country)) + geom_point()" }, { "objectID": "wrangling/reshaping-data.html#pivot_wider", "href": "wrangling/reshaping-data.html#pivot_wider", - "title": "\n11  Reshaping data\n", + "title": "11  Reshaping data", "section": "\n11.2 pivot_wider\n", "text": "11.2 pivot_wider\n\nAs we will see in later examples, it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The pivot_wider function is basically the inverse of pivot_longer. The first argument is for the data, but since we are using the pipe, we don’t show it. The names_from argument tells pivot_wider which variable will be used as the column names. The values_from argument specifies which variable to use to fill out the cells.\n\nnew_wide_data <- new_tidy_data |> \n pivot_wider(names_from = year, values_from = fertility)\nselect(new_wide_data, country, `1960`:`1967`)\n#> # A tibble: 2 × 9\n#> country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`\n#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37\n#> 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85\n\nSimilar to pivot_wider, names_from and values_from default to name and value." }, { "objectID": "wrangling/reshaping-data.html#sec-separate", "href": "wrangling/reshaping-data.html#sec-separate", - "title": "\n11  Reshaping data\n", + "title": "11  Reshaping data", "section": "\n11.3 Separating variables", - "text": "11.3 Separating variables\nThe data wrangling shown above was simple compared to what is usually required. 
In our example spreadsheet files, we include an illustration that is slightly more complicated. It contains two variables: life expectancy and fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.\n\npath <- system.file(\"extdata\", package = \"dslabs\")\n\nfilename <- \"life-expectancy-and-fertility-two-countries-example.csv\"\nfilename <- file.path(path, filename)\n\nraw_dat <- read_csv(filename)\nselect(raw_dat, 1:5)\n#> # A tibble: 2 × 5\n#> country `1960_fertility` `1960_life_expectancy` `1961_fertility`\n#> <chr> <dbl> <dbl> <dbl>\n#> 1 Germany 2.41 69.3 2.44\n#> 2 South Korea 6.16 53.0 5.99\n#> # ℹ 1 more variable: `1961_life_expectancy` <dbl>\n\nFirst, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. Encoding information in the column names is not recommended but, unfortunately, it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion.\nWe can start the data wrangling with the pivot_longer function, but we should no longer use the column name year for the new column since it also contains the variable type. We will call it name, the default, for now:\n\ndat <- raw_dat |> pivot_longer(-country)\nhead(dat)\n#> # A tibble: 6 × 3\n#> country name value\n#> <chr> <chr> <dbl>\n#> 1 Germany 1960_fertility 2.41\n#> 2 Germany 1960_life_expectancy 69.3 \n#> 3 Germany 1961_fertility 2.44\n#> 4 Germany 1961_life_expectancy 69.8 \n#> 5 Germany 1962_fertility 2.47\n#> # ℹ 1 more row\n\nThe result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the name column into the year and the variable type. 
Notice that the entries in this column separate the year from the variable name with an underscore:\n\ndat$name[1:5]\n#> [1] \"1960_fertility\" \"1960_life_expectancy\" \"1961_fertility\" \n#> [4] \"1961_life_expectancy\" \"1962_fertility\"\n\nEncoding multiple variables in a column name is such a common problem that the tidyr package includes function to separate these columns into two or more. The separate_wider_delim function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So, a first attempt at separating the variable name from the year might be:\n\ndat |> separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"))\n\nHowever, this line of code will give an error. This is because the life expectancy names have three string separated by _ and the fertility names have two. This is a common problem so the separate_wider_delim function has arguments too_few and too_many to handle these situations. We see in the help file that the option too_many = merge will merge together any additional pieces. The following line does what we want:\n\ndat |> separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"), too_many = \"merge\")\n#> # A tibble: 224 × 4\n#> country year name value\n#> <chr> <chr> <chr> <dbl>\n#> 1 Germany 1960 fertility 2.41\n#> 2 Germany 1960 life_expectancy 69.3 \n#> 3 Germany 1961 fertility 2.44\n#> 4 Germany 1961 life_expectancy 69.8 \n#> 5 Germany 1962 fertility 2.47\n#> # ℹ 219 more rows\n\nBut we are not done yet. We need to create a column for each variable. 
As we learned, the pivot_wider function can do this:\n\ndat |> \n separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"), too_many = \"merge\") |>\n pivot_wider()\n#> # A tibble: 112 × 4\n#> country year fertility life_expectancy\n#> <chr> <chr> <dbl> <dbl>\n#> 1 Germany 1960 2.41 69.3\n#> 2 Germany 1961 2.44 69.8\n#> 3 Germany 1962 2.47 70.0\n#> 4 Germany 1963 2.49 70.1\n#> 5 Germany 1964 2.49 70.7\n#> # ℹ 107 more rows\n\nThe data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.\nThree related function are separate_wider_position, separate_wider_regex, and unite. separate_wider_position takes a width instead of delimiter. separate_wider_regex, described in Section 16.4.13, provides much more control over how we separate and what we keep. The untie function can be tought of as the inverse of the separate function: it combines two columns into one." + "text": "11.3 Separating variables\nThe data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files, we include an illustration that is slightly more complicated. It contains two variables: life expectancy and fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.\n\npath <- system.file(\"extdata\", package = \"dslabs\")\n\nfilename <- \"life-expectancy-and-fertility-two-countries-example.csv\"\nfilename <- file.path(path, filename)\n\nraw_dat <- read_csv(filename)\nselect(raw_dat, 1:5)\n#> # A tibble: 2 × 5\n#> country `1960_fertility` `1960_life_expectancy` `1961_fertility`\n#> <chr> <dbl> <dbl> <dbl>\n#> 1 Germany 2.41 69.3 2.44\n#> 2 South Korea 6.16 53.0 5.99\n#> # ℹ 1 more variable: `1961_life_expectancy` <dbl>\n\nFirst, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. 
Encoding information in the column names is not recommended but, unfortunately, it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion.\nWe can start the data wrangling with the pivot_longer function, but we should no longer use the column name year for the new column since it also contains the variable type. We will call it name, the default, for now:\n\ndat <- raw_dat |> pivot_longer(-country)\nhead(dat)\n#> # A tibble: 6 × 3\n#> country name value\n#> <chr> <chr> <dbl>\n#> 1 Germany 1960_fertility 2.41\n#> 2 Germany 1960_life_expectancy 69.3 \n#> 3 Germany 1961_fertility 2.44\n#> 4 Germany 1961_life_expectancy 69.8 \n#> 5 Germany 1962_fertility 2.47\n#> # ℹ 1 more row\n\nThe result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the name column into the year and the variable type. Notice that the entries in this column separate the year from the variable name with an underscore:\n\ndat$name[1:5]\n#> [1] \"1960_fertility\" \"1960_life_expectancy\" \"1961_fertility\" \n#> [4] \"1961_life_expectancy\" \"1962_fertility\"\n\nEncoding multiple variables in a column name is such a common problem that the tidyr package includes function to separate these columns into two or more. The separate_wider_delim function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So, a first attempt at separating the variable name from the year might be:\n\ndat |> separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"))\n\nHowever, this line of code will give an error. This is because the life expectancy names have three string separated by _ and the fertility names have two. 
This is a common problem so the separate_wider_delim function has arguments too_few and too_many to handle these situations. We see in the help file that the option too_many = merge will merge together any additional pieces. The following line does what we want:\n\ndat |> separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"), too_many = \"merge\")\n#> # A tibble: 224 × 4\n#> country year name value\n#> <chr> <chr> <chr> <dbl>\n#> 1 Germany 1960 fertility 2.41\n#> 2 Germany 1960 life_expectancy 69.3 \n#> 3 Germany 1961 fertility 2.44\n#> 4 Germany 1961 life_expectancy 69.8 \n#> 5 Germany 1962 fertility 2.47\n#> # ℹ 219 more rows\n\nBut we are not done yet. We need to create a column for each variable and change year to a number. As we learned, the pivot_wider function can do this:\n\ndat |> \n separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"), too_many = \"merge\") |>\n pivot_wider() |>\n mutate(year = as.integer(year))\n#> # A tibble: 112 × 4\n#> country year fertility life_expectancy\n#> <chr> <int> <dbl> <dbl>\n#> 1 Germany 1960 2.41 69.3\n#> 2 Germany 1961 2.44 69.8\n#> 3 Germany 1962 2.47 70.0\n#> 4 Germany 1963 2.49 70.1\n#> 5 Germany 1964 2.49 70.7\n#> # ℹ 107 more rows\n\nThe data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.\nThree related functions are separate_wider_position, separate_wider_regex, and unite. separate_wider_position takes a width instead of a delimiter. separate_wider_regex, described in Section 16.4.13, provides much more control over how we separate and what we keep. The unite function can be thought of as the inverse of the separate function: it combines two columns into one." 
}, { "objectID": "wrangling/reshaping-data.html#the-janitor-package", "href": "wrangling/reshaping-data.html#the-janitor-package", - "title": "\n11  Reshaping data\n", + "title": "11  Reshaping data", "section": "\n11.4 The janitor package", "text": "11.4 The janitor package\nThe janitor package includes function for some of the most common steps needed to wrangle data. These are particularly useful as these tasks that are often repetitive and time-consuming. Key features include functions for examining and cleaning column names, removing empty or duplicate rows, and converting data types. It also offers capabilities to generate frequency tables and perform cross tabulations with ease. The package is designed to work seamlessly with the tidyverse. Here we show four examples.\nSpreadsheets often use names that are not compatible with programming. The most common problem is column names with spaces. The clean_names() function attempts to fix this and other common problems. By default it forces varaible names to be lower case and with underscore instead of space. In this example we change the variable names of the object dat created in the previous section and then demonstrate how this function works:\n\nlibrary(janitor)\n#> \n#> Attaching package: 'janitor'\n#> The following objects are masked from 'package:stats':\n#> \n#> chisq.test, fisher.test\nnames(dat) <- c(\"Country\", \"Year\", \"Fertility\", \"Life Expectancy\")\n#> Warning: The `value` argument of `names<-` must have the same length as `x` as\n#> of tibble 3.0.0.\nclean_names(dat) |> names()\n#> [1] \"country\" \"year\" \"fertility\"\n\nAnother very common challenging reality is that numeric matrices are saved in spreadsheets and include a column with characters defining the row names. To fix this we have to remove the first column, but only after assigning them as vector that we will use to define rownames after converting the data frame to a matrix. 
The function column_to_rows does these operations for us and all we have to do is specify which column contains the rownames:\n\ndata.frame(ids = letters[1:3], x = 1:3, y = 4:6) |> \n column_to_rownames(\"ids\") |>\n as.matrix() \n#> x y\n#> a 1 4\n#> b 2 5\n#> c 3 6\n\nAnother common challenge is that spreadsheets include the columnnames as a first row. To quickly fix this we can `row_to_names``:\n\nx <- read.csv(file.path(path, \"murders.csv\"), header = FALSE) |> row_to_names(1)\nnames(x)\n#> [1] \"state\" \"abb\" \"region\" \"population\" \"total\"\n\nOur final example relates to finding duplicates. A very common error in the creation of speadsheets is that rows are duplicated. The get_dups function finds and reports duplicate records. By default it considers all varaibles, but you can also specificy which ones.\n\nx <- bind_rows(x, x[1,])\nget_dupes(x)\n#> No variable names specified - using all columns.\n#> state abb region population total dupe_count\n#> 1 Alabama AL South 4779736 135 2\n#> 2 Alabama AL South 4779736 135 2" }, { "objectID": "wrangling/reshaping-data.html#exercises", "href": "wrangling/reshaping-data.html#exercises", - "title": "\n11  Reshaping data\n", + "title": "11  Reshaping data", "section": "\n11.5 Exercises", "text": "11.5 Exercises\n1. Run the following command to define the co2_wide object:\n\nco2_wide <- data.frame(matrix(co2, ncol = 12, byrow = TRUE)) |> \n setNames(1:12) |>\n mutate(year = as.character(1959:1997))\n\nUse the pivot_longer function to wrangle this into a tidy dataset. Call the column with the CO2 measurements co2 and call the month column month. Call the resulting object co2_tidy.\n2. Plot CO2 versus month with a different curve for each year using this code:\n\nco2_tidy |> ggplot(aes(month, co2, color = year)) + geom_line()\n\nIf the expected plot is not made, it is probably because co2_tidy$month is not numeric:\n\nclass(co2_tidy$month)\n\nRewrite your code to make sure the month column is numeric. 
Then make the plot.\n3. What do we learn from this plot?\n\nCO2 measures increase monotonically from 1959 to 1997.\nCO2 measures are higher in the summer and the yearly average increased from 1959 to 1997.\nCO2 measures appear constant and random variability explains the differences.\nCO2 measures do not have a seasonal trend.\n\n4. Now load the admissions data set, which contains admission information for men and women across six majors and keep only the admitted percentage column:\n\nload(admissions)\ndat <- admissions |> select(-applicants)\n\nIf we think of an observation as a major, and that each observation has two variables (men admitted percentage and women admitted percentage) then this is not tidy. Use the pivot_wider function to wrangle into tidy shape: one row for each major.\n5. Now we will try a more advanced wrangling challenge. We want to wrangle the admissions data so that for each major we have 4 observations: admitted_men, admitted_women, applicants_men and applicants_women. The trick we perform here is actually quite common: first use pivot_longer to generate an intermediate data frame and then pivot_wider to obtain the tidy data we want. We will go step by step in this and the next two exercises.\nUse the pivot_longer function to create a tmp data.frame with a column containing the type of observation admitted or applicants. Call the new columns name and value.\n6. Now you have an object tmp with columns major, gender, name and value. Note that if you combine the name and gender, we get the column names we want: admitted_men, admitted_women, applicants_men and applicants_women. Use the function unite to create a new column called column_name.\n7. Now use the pivot_wider function to generate the tidy data with four variables for each major.\n8. Now use the pipe to write a line of code that turns admissions to the table produced in the previous exercise." 
}, @@ -814,21 +814,21 @@ { "objectID": "wrangling/dates-and-times.html#the-date-data-type", "href": "wrangling/dates-and-times.html#the-date-data-type", - "title": "\n13  Parsing dates and times\n", + "title": "13  Parsing dates and times", "section": "\n13.1 The date data type", "text": "13.1 The date data type\nWe can see an example of the data type R uses for data here:\n\nlibrary(tidyverse)\nlibrary(dslabs)\npolls_us_election_2016$startdate |> head()\n#> [1] \"2016-11-03\" \"2016-11-01\" \"2016-11-02\" \"2016-11-04\" \"2016-11-03\"\n#> [6] \"2016-11-03\"\n\nThe dates look like strings, but they are not:\n\nclass(polls_us_election_2016$startdate)\n#> [1] \"Date\"\n\nLook at what happens when we convert them to numbers:\n\nas.numeric(polls_us_election_2016$startdate) |> head()\n#> [1] 17108 17106 17107 17109 17108 17108\n\nIt turns them into days since the epoch. The as.Date function can convert a character into a date. So to see that the epoch is day 0 we can type\n\nas.Date(\"1970-01-01\") |> as.numeric()\n#> [1] 0\n\nPlotting functions, such as those in ggplot, are aware of the date format. This means that, for example, a scatterplot can use the numeric representation to decide on the position of the point, but include the string in the labels:\n\npolls_us_election_2016 |> filter(pollster == \"Ipsos\" & state == \"U.S.\") |>\n ggplot(aes(startdate, rawpoll_trump)) +\n geom_line()\n\n\n\n\n\n\n\nNote in particular that the month names are displayed, a very convenient feature." 
}, { "objectID": "wrangling/dates-and-times.html#sec-lubridate", "href": "wrangling/dates-and-times.html#sec-lubridate", - "title": "\n13  Parsing dates and times\n", + "title": "13  Parsing dates and times", "section": "\n13.2 The lubridate package", "text": "13.2 The lubridate package\nThe lubridate package provides tools to work with date and times.\n\nlibrary(lubridate)\n\nWe will take a random sample of dates to show some of the useful things one can do:\n\nset.seed(2002)\ndates <- sample(polls_us_election_2016$startdate, 10) |> sort()\ndates\n#> [1] \"2016-05-31\" \"2016-08-08\" \"2016-08-19\" \"2016-09-22\" \"2016-09-27\"\n#> [6] \"2016-10-12\" \"2016-10-24\" \"2016-10-26\" \"2016-10-29\" \"2016-10-30\"\n\nThe functions year, month and day extract those values:\n\ntibble(date = dates, month = month(dates), day = day(dates), year = year(dates))\n#> # A tibble: 10 × 4\n#> date month day year\n#> <date> <dbl> <int> <dbl>\n#> 1 2016-05-31 5 31 2016\n#> 2 2016-08-08 8 8 2016\n#> 3 2016-08-19 8 19 2016\n#> 4 2016-09-22 9 22 2016\n#> 5 2016-09-27 9 27 2016\n#> # ℹ 5 more rows\n\nWe can also extract the month labels:\n\nmonth(dates, label = TRUE)\n\n\n#> [1] May Aug Aug Sep Sep Oct Oct Oct Oct Oct\n#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < ... < Dec\n\nAnother useful set of functions are the parsers that convert strings into dates. The function ymd assumes the dates are in the format YYYY-MM-DD and tries to parse as well as possible.\n\nx <- c(20090101, \"2009-01-02\", \"2009 01 03\", \"2009-1-4\",\n \"2009-1, 5\", \"Created on 2009 1 6\", \"200901 !!! 07\")\nymd(x)\n#> [1] \"2009-01-01\" \"2009-01-02\" \"2009-01-03\" \"2009-01-04\" \"2009-01-05\"\n#> [6] \"2009-01-06\" \"2009-01-07\"\n\nA further complication comes from the fact that dates often come in different formats in which the order of year, month, and day are different. 
The preferred format is to show year (with all four digits), month (two digits), and then day, or what is called the ISO 8601. Specifically we use YYYY-MM-DD so that if we order the string, it will be ordered by date. You can see the function ymd returns them in this format.\nBut, what if you encounter dates such as “09/01/02”? This could be September 1, 2002 or January 2, 2009 or January 9, 2002. In these cases, examining the entire vector of dates will help you determine what format it is by process of elimination. Once you know, you can use the many parses provided by lubridate.\nFor example, if the string is:\n\nx <- \"09/01/02\"\n\nThe ymd function assumes the first entry is the year, the second is the month, and the third is the day, so it converts it to:\n\nymd(x)\n#> [1] \"2009-01-02\"\n\nThe mdy function assumes the first entry is the month, then the day, then the year:\n\nmdy(x)\n#> [1] \"2002-09-01\"\n\nThe lubridate package provides a function for every possibility. Here are the other common one:\n\ndmy(x)\n#> [1] \"2002-01-09\"\n\nThe lubridate package is also useful for dealing with times. In R base, you can get the current time typing Sys.time(). The lubridate package provides a slightly more advanced function, now, that permits you to define the time zone:\n\nnow()\n#> [1] \"2023-11-24 12:04:40 EST\"\nnow(\"GMT\")\n#> [1] \"2023-11-24 17:04:40 GMT\"\n\nYou can see all the available time zones with OlsonNames() function.\nWe can also extract hours, minutes, and seconds:\n\nnow() |> hour()\n#> [1] 12\nnow() |> minute()\n#> [1] 4\nnow() |> second()\n#> [1] 40.1\n\nThe package also includes a function to parse strings into times as well as parsers for time objects that include dates:\n\nx <- c(\"12:34:56\")\nhms(x)\n#> [1] \"12H 34M 56S\"\nx <- \"Nov/2/2012 12:34:56\"\nmdy_hms(x)\n#> [1] \"2012-11-02 12:34:56 UTC\"\n\nThis package has many other useful functions. 
We describe two of these here that we find particularly useful.\nThe make_date function can be used to quickly create a date object. It takes three arguments: year, month, day, hour, minute, seconds, and time zone defaulting to the epoch values on UTC time. To create an date object representing, for example, July 6, 2019 we write:\n\nmake_date(2019, 7, 6)\n#> [1] \"2019-07-06\"\n\nTo make a vector of January 1 for the 80s we write:\n\nmake_date(1980:1989)\n#> [1] \"1980-01-01\" \"1981-01-01\" \"1982-01-01\" \"1983-01-01\" \"1984-01-01\"\n#> [6] \"1985-01-01\" \"1986-01-01\" \"1987-01-01\" \"1988-01-01\" \"1989-01-01\"\n\nAnother very useful function is the round_date. It can be used to round dates to nearest year, quarter, month, week, day, hour, minutes, or seconds. So if we want to group all the polls by week of the year we can do the following:\n\npolls_us_election_2016 |> \n mutate(week = round_date(startdate, \"week\")) |>\n group_by(week) |>\n summarize(margin = mean(rawpoll_clinton - rawpoll_trump)) |>\n ggplot(aes(week, margin)) +\n geom_point()" }, { "objectID": "wrangling/dates-and-times.html#exercises", "href": "wrangling/dates-and-times.html#exercises", - "title": "\n13  Parsing dates and times\n", + "title": "13  Parsing dates and times", "section": "\n13.3 Exercises", "text": "13.3 Exercises\n1. We want to make a plot of death counts versus date. Confirm that the date variable are in fact dates and not strings.\n2. Plot deaths versus date.\n\nWhat time period is represented in these data?\n\n4. Note that after May 31, 2018, the deaths are all 0. The data is probably not entered yet. We also see a drop off starting around May 1. Redefine dat to exclude observations taken on or after May 1, 2018. Then, remake the plot.\n5. Repeat the plot but use the day of the year on the x-axis instead of date.\n6. Compute the deaths per day by month.\n7. Show the deaths per days for July and for September. What do you notice?\n8. 
Compute deaths per week and make a plot." }, @@ -837,14 +837,14 @@ "href": "wrangling/data-table-wrangling.html#reshaping-data", "title": "\n14  Wrangling with data.table\n", "section": "\n14.1 Reshaping data", - "text": "14.1 Reshaping data\nPreviously we used this example:\n\nlibrary(dslabs)\npath <- system.file(\"extdata\", package = \"dslabs\")\nfilename <- file.path(path, \"fertility-two-countries-example.csv\")\n\n\n14.1.1 pivot_long is melt\nIf in tidyeverse we write\n\nwide_data <- read_csv(filename)\n#> Rows: 2 Columns: 57\n#> ── Column specification ────────────────────────────────────────────────\n#> Delimiter: \",\"\n#> chr (1): country\n#> dbl (56): 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969...\n#> \n#> ℹ Use `spec()` to retrieve the full column specification for this data.\n#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\nnew_tidy_data <- wide_data |>\n pivot_longer(-1, names_to = \"year\", values_to = \"fertility\")\n\nin data.table we use the melt function\n\ndt_wide_data <- fread(filename) \ndt_new_tidy_data <- melt(as.data.table(dt_wide_data), \n measure.vars = 2:ncol(dt_wide_data), \n variable.name = \"year\", \n value.name = \"fertility\")" + "text": "14.1 Reshaping data\nPreviously we used this example:\n\nlibrary(dslabs)\npath <- system.file(\"extdata\", package = \"dslabs\")\nfilename <- file.path(path, \"fertility-two-countries-example.csv\")\n\n\n14.1.1 pivot_longer is melt\n\nIf in tidyeverse we write\n\nwide_data <- read_csv(filename)\nnew_tidy_data <- wide_data |>\n pivot_longer(-1, names_to = \"year\", values_to = \"fertility\")\n\nin data.table we use the melt function\n\ndt_wide_data <- fread(filename) \ndt_new_tidy_data <- melt(dt_wide_data, \n measure.vars = 2:ncol(dt_wide_data), \n variable.name = \"year\", \n value.name = \"fertility\")" }, { "objectID": "wrangling/data-table-wrangling.html#pivot_wider-is-dcast", "href": 
"wrangling/data-table-wrangling.html#pivot_wider-is-dcast", "title": "\n14  Wrangling with data.table\n", "section": "\n14.2 pivot_wider is dcast\n", - "text": "14.2 pivot_wider is dcast\n\nIf in tidyeverse we write\n\nnew_wide_data <- new_tidy_data |> \n pivot_wider(names_from = year, values_from = fertility)\nselect(new_wide_data, country, `1960`:`1967`)\n#> # A tibble: 2 × 9\n#> country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`\n#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37\n#> 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85\n\nin data.table we write:\n\ndt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... ~ year, value.var = \"fertility\")\n\n\n14.2.1 Separating variables\n\npath <- system.file(\"extdata\", package = \"dslabs\")\nfilename <- \"life-expectancy-and-fertility-two-countries-example.csv\"\nfilename <- file.path(path, filename)\n\nIn tidyverse we wrangled using\n\nraw_dat <- read_csv(filename)\ndat <- raw_dat |> pivot_longer(-country) |>\n separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"), too_many = \"merge\") |>\n pivot_wider()\n\nIn data.table we can use the tstrsplit function:\n\ndt_raw_dat <- fread(filename)\ndat_long <- melt(dt_raw_dat, measure.vars = which(names(dt_raw_dat) != \"country\"), \n variable.name = \"name\", value.name = \"value\")\ndat_long[, c(\"year\", \"name\", \"name2\") := tstrsplit(name, \"_\", fixed = TRUE, type.convert = TRUE)]\ndat_long[is.na(name2), name2 := \"\"]\ndat_long[, name := paste(name, name2, sep = \"_\")][, name2 := NULL]\ndat_wide <- dcast(dat_long, country + year ~ name, value.var = \"value\")" + "text": "14.2 pivot_wider is dcast\n\nIf in tidyeverse we write\n\nnew_wide_data <- new_tidy_data |> \n pivot_wider(names_from = year, values_from = fertility)\n\nin data.table we write:\n\ndt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... 
~ year,\n value.var = \"fertility\")\n\n\n14.2.1 Separating variables\n\npath <- system.file(\"extdata\", package = \"dslabs\")\nfilename <- \"life-expectancy-and-fertility-two-countries-example.csv\"\nfilename <- file.path(path, filename)\n\nIn tidyverse we wrangled using\n\nraw_dat <- read_csv(filename)\ndat <- raw_dat |> pivot_longer(-country) |>\n separate_wider_delim(name, delim = \"_\", names = c(\"year\", \"name\"), \n too_many = \"merge\") |>\n pivot_wider() |>\n mutate(year = as.integer(year))\n\nIn data.table we can use the tstrsplit function:\n\ndt_raw_dat <- fread(filename)\ndat_long <- melt(dt_raw_dat, \n measure.vars = which(names(dt_raw_dat) != \"country\"), \n variable.name = \"name\", value.name = \"value\")\ndat_long[, c(\"year\", \"name\", \"name2\") := \n tstrsplit(name, \"_\", fixed = TRUE, type.convert = TRUE)]\ndat_long[is.na(name2), name2 := \"\"]\ndat_long[, name := paste(name, name2, sep = \"_\")][, name2 := NULL]\ndat_wide <- dcast(dat_long, country + year ~ name, value.var = \"value\")" }, { "objectID": "wrangling/data-table-wrangling.html#joins", @@ -860,6 +860,13 @@ "section": "\n14.4 Dates and times", "text": "14.4 Dates and times\nThe data.table package also includes some of the functionality to lubridate. For example, it includes the mday, month, and year functions:\n\ndata.table::mday(now())\n#> [1] 24\ndata.table::month(now())\n#> [1] 11\ndata.table::year(now())\n#> [1] 2023\n\nOther similar functions are second, minute, hour, wday, week, isoweek quarter, yearmon, yearqtr.\nThe package also includes the class IDate and ITime, which store dates and times more efficiently, convenient for large files with date stamps. You convert dates in the usual R format using as.IDate and as.ITime." 
}, + { + "objectID": "wrangling/data-table-wrangling.html#exercises", + "href": "wrangling/data-table-wrangling.html#exercises", + "title": "\n14  Wrangling with data.table\n", + "section": "\n14.5 Exercises", + "text": "14.5 Exercises\nRepeat the exercises in Chapter 11, Section 12.1, and Chapter 13 using data.table instead of tidyverse." + }, { "objectID": "wrangling/web-scraping.html#html", "href": "wrangling/web-scraping.html#html", @@ -938,8 +945,8 @@ "text": "16.3 Escaping\nTo define strings in R, we can use either double quotes or single quotes:\n\ns <- \"Hello!\"\ns <- 'Hello!' \n\nMake sure you choose the correct single quote, as opposed to the back quote `.\nNow, what happens if the string we want to define includes double quotes? For example, if we want to write 10 inches like this 10\"? In this case you can’t use:\n\ns <- \"10\"\"\n\nbecause this is just the string 10 followed by a double quote. If you type this into R, you get an error because you failed to close the double quote. To avoid this, we can use the single quotes:\n\ns <- '10\"'\n\nIf we print out s we see that the double quotes are escaped with the backslash \\.\n\ns\n#> [1] \"10\\\"\"\n\nIn fact, escaping with the backslash provides a way to define the string while still using the double quotes to define strings:\n\ns <- \"10\\\"\"\n\nIn R, the function cat lets us see what the string actually looks like:\n\ncat(s)\n#> 10\"\n\nNow, what if we want our string to be 5 feet written like this 5'? In this case, we can use the double quotes or escape the single quote:\n\ns <- \"5'\"\ns <- '5\\''\n\nSo we’ve learned how to write 5 feet and 10 inches separately, but what if we want to write them together to represent 5 feet and 10 inches like this 5'10\"? In this case, neither the single nor double quotes will work since '5'10\"' closes the string after 5 and this \"5'10\"\" closes the string after 10. 
Keep in mind that if we type one of the above code snippets into R, it will get stuck waiting for you to close the open quote and you will have to exit the execution with the esc button.\nTo achieve the desired result we need to escape, with the backslash \\, the quote that matches the one used to delimit the string. You can escape either character that can be confused with a closing quote. These are the two options:\n\ns <- '5\\'10\"'\ns <- \"5'10\\\"\"\n\nEscaping characters is something we often have to do when processing strings. Another character that often needs escaping is the escape character itself. We can do this with \\\\. When using regular expressions, the topic of the next section, we often have to escape the special characters used in this approach." }, { - "objectID": "wrangling/string-processing.html#regex", - "href": "wrangling/string-processing.html#regex", + "objectID": "wrangling/string-processing.html#sec-regex", + "href": "wrangling/string-processing.html#sec-regex", "title": "\n16  String processing\n", "section": "\n16.4 Regular expressions", "text": "16.4 Regular expressions\nA regular expression (regex) is a way to describe specific patterns of characters in text. Regular expressions can be used to determine if a given string matches the pattern. A set of rules has been defined to do this efficiently and precisely and here we show some examples. We can learn more about these rules by reading detailed tutorials1 2. This RStudio cheat sheet3 is also very useful.\nThe patterns supplied to the stringr functions can be a regex rather than a standard string. We will learn how this works through a series of examples.\nThroughout this section you will see that we create strings to test out our regex. To do this, we define strings that we know should match and also strings that we know should not. We will call them yes and no, respectively. 
This permits us to check for the two types of errors: failing to match and incorrectly matching.\n\n16.4.1 Strings are a regexp\nTechnically any string is a regex, perhaps the simplest example is a single character. So the comma , used in the next code example is a simple example of searching with regex.\n\npattern <- \",\"\nstr_detect(c(\"1\", \"10\", \"100\", \"1,000\", \"10,000\"), pattern) \n#> [1] FALSE FALSE FALSE TRUE TRUE\n\nAbove, we noted that an entry included a cm. This is also a simple example of a regex. We can show all the entries that used cm like this:\n\nstr_subset(reported_heights$height, \"cm\")\n#> [1] \"165cm\" \"170 cm\"\n\n\n16.4.2 Special characters\nNow let’s consider a slightly more complicated example. Which of the following strings contain the pattern cm or inches?\n\nyes <- c(\"180 cm\", \"70 inches\")\nno <- c(\"180\", \"70''\")\ns <- c(yes, no)\n\nWe can do this with two searches:\n\nstr_detect(s, \"cm\") | str_detect(s, \"inches\")\n#> [1] TRUE TRUE FALSE FALSE\n\nHowever, we don’t need to do this. The main feature that distinguishes the regex language from plain strings is that we can use special characters. These are characters with a meaning. We start by introducing | which means or. So if we want to know if either cm or inches appears in the strings, we can use the regex cm|inches:\n\nstr_detect(s, \"cm|inches\")\n#> [1] TRUE TRUE FALSE FALSE\n\nand obtain the correct answer.\nAnother special character that will be useful for identifying feet and inches values is \\d which means any digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The backslash is used to distinguish it from the character d. In R, we have to escape the backslash \\ so we actually have to use \\\\d to represent digits. 
Here is an example:\n\nyes <- c(\"5\", \"6\", \"5'10\", \"5 feet\", \"4'11\")\nno <- c(\"\", \".\", \"Five\", \"six\")\ns <- c(yes, no)\npattern <- \"\\\\d\"\nstr_detect(s, pattern)\n#> [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n\nWe take this opportunity to introduce the str_view function, which is helpful for troubleshooting as it shows us the first match for each string:\n\nstr_view(s, pattern)\n\n\nand str_view_all shows us all the matches, so 3'2 has two matches and 5'10 has three.\n\nstr_view_all(s, pattern)\n\n\nThere are many other special characters. We will learn some others below, but you can see most or all of them in the cheat sheet4 mentioned earlier.\nFinally, a useful special character is \\w which stands for word character and it matches any letter, number, or underscore.\n\n16.4.3 Character classes\nCharacter classes are used to define a series of characters that can be matched. We define character classes with square brackets []. So, for example, if we want the pattern to match only if we have a 5 or a 6, we use the regex [56]:\n\nstr_view(s, \"[56]\")\n\n\nSuppose we want to match values between 4 and 7. A common way to define character classes is with ranges. So, for example, [0-9] is equivalent to \\\\d. The pattern we want is therefore [4-7].\n\nyes <- as.character(4:7)\nno <- as.character(1:3)\ns <- c(yes, no)\nstr_detect(s, \"[4-7]\")\n#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n\nHowever, it is important to know that in regex everything is a character; there are no numbers. So 4 is the character 4 not the number four. Notice, for example, that [1-20] does not mean 1 through 20, it means the characters 1 through 2 or the character 0. So [1-20] simply means the character class composed of 0, 1, and 2.\nKeep in mind that characters do have an order and the digits do follow the numeric order. So 0 comes before 1 which comes before 2 and so on. 
For the same reason, we can define lower case letters as [a-z], upper case letters as [A-Z], and [a-zA-Z] as both.\nNotice that \w is equivalent to [a-zA-Z0-9_].\n\n16.4.4 Anchors\nWhat if we want a match when we have exactly 1 digit? This will be useful in our case study since feet are never more than 1 digit so a restriction will help us. One way to do this with regex is by using anchors, which let us define patterns that must start or end at a specific place. The two most common anchors are ^ and $ which represent the beginning and end of a string, respectively. So the pattern ^\\d$ is read as “start of the string followed by one digit followed by end of string”.\nThis pattern now only detects the strings with exactly one digit:\n\npattern <- \"^\\\\d$\"\nyes <- c(\"1\", \"5\", \"9\")\nno <- c(\"12\", \"123\", \" 1\", \"a4\", \"b\")\ns <- c(yes, no)\nstr_view_all(s, pattern)\n\n\nThe string \" 1\" does not match because it starts with a space rather than a digit, which is not easy to see.\n\n16.4.5 Bounded quantifiers\nFor the inches part, we can have one or two digits. This can be specified in regex with quantifiers. This is done by following the pattern with curly brackets containing the number of times the previous entry can be repeated. We call these quantifiers bounded because the number of repetitions is limited by the numbers in the curly brackets. Later we learn about unbounded quantifiers.\nWe use an example to illustrate. The pattern for one or two digits is:\n\npattern <- \"^\\\\d{1,2}$\"\nyes <- c(\"1\", \"5\", \"9\", \"12\")\nno <- c(\"123\", \"a4\", \"b\")\nstr_view(c(yes, no), pattern)\n\n\nIn this case, 123 does not match, but 12 does. 
So to look for our feet and inches pattern, we can add the symbols for feet ' and inches \" after the digits.\nWith what we have learned, we can now construct an example for the pattern x'y\\\" with x feet and y inches.\n\npattern <- \"^[4-7]'\\\\d{1,2}\\\"$\"\n\nThe pattern is now getting complex, but you can look at it carefully and break it down:\n\n\n^ = start of the string\n\n[4-7] = one digit, either 4,5,6 or 7\n\n' = feet symbol\n\n\\\\d{1,2} = one or two digits\n\n\\\" = inches symbol\n\n$ = end of the string\n\nLet’s test it out:\n\nyes <- c(\"5'7\\\"\", \"6'2\\\"\", \"5'12\\\"\")\nno <- c(\"6,2\\\"\", \"6.2\\\"\",\"I am 5'11\\\"\", \"3'2\\\"\", \"64\")\nstr_detect(yes, pattern)\n#> [1] TRUE TRUE TRUE\nstr_detect(no, pattern)\n#> [1] FALSE FALSE FALSE FALSE FALSE\n\nFor now, we are permitting the inches to be 12 or larger. We will add a restriction later as the regex for this is a bit more complex than we are ready to show.\n\n16.4.6 White space \\s\n\nAnother problem we have are spaces. For example, our pattern does not match 5' 4\" because there is a space between ' and 4 which our pattern does not permit. Spaces are characters and R does not ignore them:\n\nidentical(\"Hi\", \"Hi \")\n#> [1] FALSE\n\nIn regex, \\s represents white space. To find patterns like 5' 4, we can change our pattern to:\n\npattern_2 <- \"^[4-7]'\\\\s\\\\d{1,2}\\\"$\"\nstr_subset(problems, pattern_2)\n#> [1] \"5' 4\\\"\" \"5' 11\\\"\" \"5' 7\\\"\"\n\nHowever, this will not match the patterns with no space. So do we need more than one regex pattern? It turns out we can use a quantifier for this as well.\n\n16.4.7 Unbounded quantifiers: *, ?, +\n\nWe want the pattern to permit spaces but not require them. Even if there are several spaces, like in this example 5' 4, we still want it to match. There is a quantifier for exactly this purpose. In regex, the character * means zero or more instances of the previous character. 
Here is an example:\n\nyes <- c(\"AB\", \"A1B\", \"A11B\", \"A111B\", \"A1111B\")\nno <- c(\"A2B\", \"A21B\")\nstr_detect(yes, \"A1*B\")\n#> [1] TRUE TRUE TRUE TRUE TRUE\nstr_detect(no, \"A1*B\")\n#> [1] FALSE FALSE\n\nThe above matches the first string which has zero 1s and all the strings with one or more 1. We can then improve our pattern by adding the * after the space character \s.\nThere are two other similar quantifiers. For none or once, we can use ?, and for one or more, we can use +. You can see how they differ with this example:\n\ndata.frame(string = c(\"AB\", \"A1B\", \"A11B\", \"A111B\", \"A1111B\"),\n none_or_more = str_detect(yes, \"A1*B\"),\n none_or_once = str_detect(yes, \"A1?B\"),\n once_or_more = str_detect(yes, \"A1+B\"))\n#> string none_or_more none_or_once once_or_more\n#> 1 AB TRUE TRUE FALSE\n#> 2 A1B TRUE TRUE TRUE\n#> 3 A11B TRUE FALSE TRUE\n#> 4 A111B TRUE FALSE TRUE\n#> 5 A1111B TRUE FALSE TRUE\n\nWe will actually use all three in our reported heights example, but we will see these in a later section.\n\n16.4.8 Not\nTo specify patterns that we do not want to detect, we can use the ^ symbol but only inside square brackets. Remember that outside the square bracket ^ means the start of the string. So, for example, if we want to detect digits that are preceded by anything except a letter we can do the following:\n\npattern <- \"[^a-zA-Z]\\\\d\"\nyes <- c(\".3\", \"+2\", \"-0\",\"*4\")\nno <- c(\"A3\", \"B2\", \"C0\", \"E4\")\nstr_detect(yes, pattern)\n#> [1] TRUE TRUE TRUE TRUE\nstr_detect(no, pattern)\n#> [1] FALSE FALSE FALSE FALSE\n\nAnother way to generate a pattern that searches for everything except a given class is to use the upper case version of the special character. For example \\D means anything other than a digit, \\S means anything except a space, and so on.\n\n16.4.9 Groups\nGroups are a powerful aspect of regex that permits the extraction of values. Groups are defined using parentheses. They don’t affect the pattern matching per se. 
Instead, it permits tools to identify specific parts of the pattern so we can extract them.\nWe want to change heights written like 5.6 to 5'6.\nTo avoid changing patterns such as 70.2, we will require that the first digit be between 4 and 7 [4-7] and that the second be none or more digits \\\\d*. Let’s start by defining a simple pattern that matches this:\n\npattern_without_groups <- \"^[4-7],\\\\d*$\"\n\nWe want to extract the digits so we can then form the new version using a period. These are our two groups, so we encapsulate them with parentheses:\n\npattern_with_groups <- \"^([4-7]),(\\\\d*)$\"\n\nWe encapsulate the part of the pattern that matches the parts we want to keep for later use. Adding groups does not affect the detection, since it only signals that we want to save what is captured by the groups. Note that both patterns return the same result when using str_detect:\n\nyes <- c(\"5,9\", \"5,11\", \"6,\", \"6,1\")\nno <- c(\"5'9\", \",\", \"2,8\", \"6.1.1\")\ns <- c(yes, no)\nstr_detect(s, pattern_without_groups)\n#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE\nstr_detect(s, pattern_with_groups)\n#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n\nOnce we define groups, we can use the function str_match to extract the values these groups define:\n\nstr_match(s, pattern_with_groups)\n#> [,1] [,2] [,3]\n#> [1,] \"5,9\" \"5\" \"9\" \n#> [2,] \"5,11\" \"5\" \"11\"\n#> [3,] \"6,\" \"6\" \"\" \n#> [4,] \"6,1\" \"6\" \"1\" \n#> [5,] NA NA NA \n#> [6,] NA NA NA \n#> [7,] NA NA NA \n#> [8,] NA NA NA\n\nNotice that the second and third columns contain feet and inches, respectively. The first column is the part of the string matching the pattern. If no match occurred, we see an NA.\nNow we can understand the difference between the functions str_extract and str_match. 
str_extract extracts only strings that match a pattern, not the values defined by groups:\n\nstr_extract(s, pattern_with_groups)\n#> [1] \"5,9\" \"5,11\" \"6,\" \"6,1\" NA NA NA NA\n\n\n16.4.10 Search and replace\nEarlier we defined the object problems containing the strings that do not appear to be in inches. We can see that not too many of our problematic strings match the pattern:\n\npattern <- \"^[4-7]'\\\\d{1,2}\\\"$\"\nsum(str_detect(problems, pattern))\n#> [1] 14\n\nTo see why this is, we show some examples that expose why we don’t have more matches:\n\nproblems[c(2, 10, 11, 12, 15)] |> str_view(pattern)\n\n\nAn initial problem we see immediately is that some students wrote out the words “feet” and “inches”. We can see the entries that did this with the str_subset function:\n\nstr_subset(problems, \"inches\")\n#> [1] \"5 feet and 8.11 inches\" \"Five foot eight inches\"\n#> [3] \"5 feet 7inches\" \"5ft 9 inches\" \n#> [5] \"5 ft 9 inches\" \"5 feet 6 inches\"\n\nWe also see that some entries used two single quotes '' instead of a double quote \".\n\nstr_subset(problems, \"''\")\n#> [1] \"5'9''\" \"5'10''\" \"5'10''\" \"5'3''\" \"5'7''\" \"5'6''\" \n#> [7] \"5'7.5''\" \"5'7.5''\" \"5'10''\" \"5'11''\" \"5'10''\" \"5'5''\"\n\nTo correct this, we can replace the different ways of representing inches and feet with a uniform symbol. We will use ' for feet, whereas for inches we will simply not use a symbol since some entries were of the form x'y. 
Now, if we no longer use the inches symbol, we have to change our pattern accordingly:\n\npattern <- \"^[4-7]'\\\\d{1,2}$\"\n\nIf we do this replacement before the matching, we get many more matches:\n\nproblems |> \n str_replace(\"feet|ft|foot\", \"'\") |> # replace feet, ft, foot with ' \n str_replace(\"inches|in|''|\\\"\", \"\") |> # remove all inches symbols\n str_detect(pattern) |> \n sum()\n#> [1] 48\n\nHowever, we still have many cases to go.\nNote that in the code above, we leveraged the stringr consistency and used the pipe.\nFor now, we improve our pattern by adding \\\\s* in front of and after the feet symbol ' to permit space between the feet symbol and the numbers. Now we match a few more entries:\n\npattern <- \"^[4-7]\\\\s*'\\\\s*\\\\d{1,2}$\"\nproblems |> \n str_replace(\"feet|ft|foot\", \"'\") |> # replace feet, ft, foot with ' \n str_replace(\"inches|in|''|\\\"\", \"\") |> # remove all inches symbols\n str_detect(pattern) |> \n sum()\n#> [1] 53\n\nWe might be tempted to avoid doing this by removing all the spaces with str_replace_all. However, when doing such an operation we need to make sure that it does not have unintended effects. In our reported heights examples, this will be a problem because some entries are of the form x y with space separating the feet from the inches. If we remove all spaces, we will incorrectly turn x y into xy which implies that a 6 1 would become 61 inches instead of 73 inches.\nThe second large type of problematic entries were of the form x.y, x,y and x y. We want to change all these to our common format x'y. But we can’t just do a search and replace because we would change values such as 70.5 into 70'5. 
Our strategy will therefore be to search for a very specific pattern that assures us feet and inches are being provided and then, for those that match, replace appropriately.\n\n16.4.11 Search and replace using groups\nAnother powerful aspect of groups is that you can refer to the extracted values in a regex when searching and replacing.\nThe regex special character for the i-th group is \\\\i. So \\\\1 is the value extracted from the first group, \\\\2 the value from the second and so on. As a simple example, note that the following code will replace a comma with period, but only if it is between two digits:\n\npattern_with_groups <- \"^([4-7]),(\\\\d*)$\"\nyes <- c(\"5,9\", \"5,11\", \"6,\", \"6,1\")\nno <- c(\"5'9\", \",\", \"2,8\", \"6.1.1\")\ns <- c(yes, no)\nstr_replace(s, pattern_with_groups, \"\\\\1'\\\\2\")\n#> [1] \"5'9\" \"5'11\" \"6'\" \"6'1\" \"5'9\" \",\" \"2,8\" \"6.1.1\"\n\nWe can use this to convert cases in our reported heights.\nWe are now ready to define a pattern that helps us convert all the x.y, x,y and x y to our preferred format. We need to adapt pattern_with_groups to be a bit more flexible and capture all the cases.\n\npattern_with_groups <- \"^([4-7])\\\\s*[,\\\\.\\\\s+]\\\\s*(\\\\d*)$\"\n\nLet’s break this one down:\n\n\n^ = start of the string\n\n[4-7] = one digit, either 4, 5, 6, or 7\n\n\\\\s* = none or more white space\n\n[,\\\\.\\\\s+] = feet symbol is either ,, . 
or at least one space\n\n\\s* = none or more white space\n\n\\d* = none or more digits\n\n$ = end of the string\n\nWe can see that it appears to be working:\n\nstr_subset(problems, pattern_with_groups) |> head()\n#> [1] \"5.3\" \"5.25\" \"5.5\" \"6.5\" \"5.8\" \"5.6\"\n\nand will be able to perform the search and replace:\n\nstr_subset(problems, pattern_with_groups) |> \n str_replace(pattern_with_groups, \"\\\\1'\\\\2\") |> head()\n#> [1] \"5'3\" \"5'25\" \"5'5\" \"6'5\" \"5'8\" \"5'6\"\n\nAgain, we will deal with the inches-larger-than-twelve challenge later.\n\n16.4.12 Lookarounds\nLookarounds provide a way to ask for one or more conditions to be satisfied without moving the search forward or including the checked text in the match. For example, you might want to check for multiple conditions and, if they are all satisfied, return the pattern or part of the pattern that matched.\nThere are four types of lookarounds: lookahead (?=pattern), lookbehind (?<=pattern), negative lookahead (?!pattern), and negative lookbehind (?<!pattern).\nThe conventional example is checking a password that must satisfy several conditions such as 1) 8-16 word characters, 2) starts with a letter, and 3) has at least one digit. You can concatenate lookarounds to check for multiple conditions. For our example we can write\n\npattern <- \"(?=\\\\w{8,16})(?=^[a-zA-Z].*)(?=.*\\\\d+.*)\"\n\nA simpler example is changing all superman to supergirl without changing every man to girl. We could use a lookaround like this:\n\ns <- \"Superman saved a man. The man thanked superman.\"\nstr_replace_all(s, \"(?<=[Ss]uper)man\", \"girl\")\n#> [1] \"Supergirl saved a man. The man thanked supergirl.\"\n\n\n16.4.13 Separating variables\nIn Section 11.3 we introduced functions that can split columns into new ones. The separate_wider_regex function uses regex instead of delimiters to separate variables in data frames. 
It uses an approach similar to regex groups to specify which parts of the original column we keep in which new column.\nSuppose we have a data frame like this:\n\ntab <- data.frame(x = c(\"5'10\", \"6' 1\", \"5 ' 9\", \"5'11\\\"\"))\n\nNote that using separate_wider_delim won’t work here because the delimiter can vary across entries. With separate_wider_regex we can define flexible patterns that are matched to define each column.\n\npatterns <- c(feet = \"\\\\d\", \"\\\\s*'\\\\s*\", inches = \"\\\\d{1,2}\", \".*\")\ntab |> separate_wider_regex(x, patterns = patterns)\n#> # A tibble: 4 × 2\n#> feet inches\n#> <chr> <chr> \n#> 1 5 10 \n#> 2 6 1 \n#> 3 5 9 \n#> 4 5 11\n\nBy not naming the second and fourth entries of patterns we tell the function not to keep the values that match those patterns." @@ -1237,12 +1244,5 @@ "title": "22  Reproducible projects", "section": "", "text": "https://github.com/rairizarry/murders↩︎" - }, - { - "objectID": "wrangling/string-processing.html#sec-regex", - "href": "wrangling/string-processing.html#sec-regex", - "title": "\n16  String processing\n", - "section": "\n16.4 Regular expressions", - "text": "16.4 Regular expressions\nA regular expression (regex) is a way to describe specific patterns of characters of text. They can be used to determine if a given string matches the pattern. A set of rules has been defined to do this efficiently and precisely and here we show some examples. We can learn more about these rules by reading a detailed tutorials1 2. This RStudio cheat sheet3 is also very useful.\nThe patterns supplied to the stringr functions can be a regex rather than a standard string. We will learn how this works through a series of examples.\nThroughout this section you will see that we create strings to test out our regex. To do this, we define patterns that we know should match and also patterns that we know should not. We will call them yes and no, respectively. 
This permits us to check for the two types of errors: failing to match and incorrectly matching.\n\n16.4.1 Strings are a regexp\nTechnically any string is a regex, perhaps the simplest example is a single character. So the comma , used in the next code example is a simple example of searching with regex.\n\npattern <- \",\"\nstr_detect(c(\"1\", \"10\", \"100\", \"1,000\", \"10,000\"), pattern) \n#> [1] FALSE FALSE FALSE TRUE TRUE\n\nAbove, we noted that an entry included a cm. This is also a simple example of a regex. We can show all the entries that used cm like this:\n\nstr_subset(reported_heights$height, \"cm\")\n#> [1] \"165cm\" \"170 cm\"\n\n\n16.4.2 Special characters\nNow let’s consider a slightly more complicated example. Which of the following strings contain the pattern cm or inches?\n\nyes <- c(\"180 cm\", \"70 inches\")\nno <- c(\"180\", \"70''\")\ns <- c(yes, no)\n\nWe can do this with two searches:\n\nstr_detect(s, \"cm\") | str_detect(s, \"inches\")\n#> [1] TRUE TRUE FALSE FALSE\n\nHowever, we don’t need to do this. The main feature that distinguishes the regex language from plain strings is that we can use special characters. These are characters with a meaning. We start by introducing | which means or. So if we want to know if either cm or inches appears in the strings, we can use the regex cm|inches:\n\nstr_detect(s, \"cm|inches\")\n#> [1] TRUE TRUE FALSE FALSE\n\nand obtain the correct answer.\nAnother special character that will be useful for identifying feet and inches values is \\d which means any digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The backslash is used to distinguish it from the character d. In R, we have to escape the backslash \\ so we actually have to use \\\\d to represent digits. 
Here is an example:\n\nyes <- c(\"5\", \"6\", \"5'10\", \"5 feet\", \"4'11\")\nno <- c(\"\", \".\", \"Five\", \"six\")\ns <- c(yes, no)\npattern <- \"\\\\d\"\nstr_detect(s, pattern)\n#> [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n\nWe take this opportunity to introduce the str_view function, which is helpful for troubleshooting as it shows us the first match for each string:\n\nstr_view(s, pattern)\n\n\nand str_view_all shows us all the matches, so 3'2 has two matches and 5'10 has three.\n\nstr_view_all(s, pattern)\n\n\nThere are many other special characters. We will learn some others below, but you can see most or all of them in the cheat sheet4 mentioned earlier.\nFinally, a useful special character is \\w which stands for word character and it matches any letter, number, or underscore.\n\n16.4.3 Character classes\nCharacter classes are used to define a series of characters that can be matched. We define character classes with square brackets []. So, for example, if we want the pattern to match only if we have a 5 or a 6, we use the regex [56]:\n\nstr_view(s, \"[56]\")\n\n\nSuppose we want to match values between 4 and 7. A common way to define character classes is with ranges. So, for example, [0-9] is equivalent to \\\\d. The pattern we want is therefore [4-7].\n\nyes <- as.character(4:7)\nno <- as.character(1:3)\ns <- c(yes, no)\nstr_detect(s, \"[4-7]\")\n#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n\nHowever, it is important to know that in regex everything is a character; there are no numbers. So 4 is the character 4 not the number four. Notice, for example, that [1-20] does not mean 1 through 20, it means the characters 1 through 2 or the character 0. So [1-20] simply means the character class composed of 0, 1, and 2.\nKeep in mind that characters do have an order and the digits do follow the numeric order. So 0 comes before 1 which comes before 2 and so on. 
For the same reason, we can define lower case letters as [a-z], upper case letters as [A-Z], and [a-zA-z] as both.\nNotice that \\w is equivalent to [a-zA-Z0-9_].\n\n16.4.4 Anchors\nWhat if we want a match when we have exactly 1 digit? This will be useful in our case study since feet are never more than 1 digit so a restriction will help us. One way to do this with regex is by using anchors, which let us define patterns that must start or end at a specific place. The two most common anchors are ^ and $ which represent the beginning and end of a string, respectively. So the pattern ^\\\\d$ is read as “start of the string followed by one digit followed by end of string”.\nThis pattern now only detects the strings with exactly one digit:\n\npattern <- \"^\\\\d$\"\nyes <- c(\"1\", \"5\", \"9\")\nno <- c(\"12\", \"123\", \" 1\", \"a4\", \"b\")\ns <- c(yes, no)\nstr_view_all(s, pattern)\n\n\nThe 1 does not match because it does not start with the digit but rather with a space, which is actually not easy to see.\n\n16.4.5 Bounded quantifiers\nFor the inches part, we can have one or two digits. This can be specified in regex with quantifiers. This is done by following the pattern with curly brackets containing the number of times the previous entry can be repeated. We call the bounded because the numbers in the quantifier are limited by the numbers in the curly brackets. Later we learn about unbounded quantifiers.\nWe use an example to illustrate. The pattern for one or two digits is:\n\npattern <- \"^\\\\d{1,2}$\"\nyes <- c(\"1\", \"5\", \"9\", \"12\")\nno <- c(\"123\", \"a4\", \"b\")\nstr_view(c(yes, no), pattern)\n\n\nIn this case, 123 does not match, but 12 does. 
So to look for our feet and inches pattern, we can add the symbols for feet ' and inches \" after the digits.\nWith what we have learned, we can now construct an example for the pattern x'y\\\" with x feet and y inches.\n\npattern <- \"^[4-7]'\\\\d{1,2}\\\"$\"\n\nThe pattern is now getting complex, but you can look at it carefully and break it down:\n\n\n^ = start of the string\n\n[4-7] = one digit, either 4,5,6 or 7\n\n' = feet symbol\n\n\\\\d{1,2} = one or two digits\n\n\\\" = inches symbol\n\n$ = end of the string\n\nLet’s test it out:\n\nyes <- c(\"5'7\\\"\", \"6'2\\\"\", \"5'12\\\"\")\nno <- c(\"6,2\\\"\", \"6.2\\\"\",\"I am 5'11\\\"\", \"3'2\\\"\", \"64\")\nstr_detect(yes, pattern)\n#> [1] TRUE TRUE TRUE\nstr_detect(no, pattern)\n#> [1] FALSE FALSE FALSE FALSE FALSE\n\nFor now, we are permitting the inches to be 12 or larger. We will add a restriction later as the regex for this is a bit more complex than we are ready to show.\n\n16.4.6 White space \\s\n\nAnother problem we have are spaces. For example, our pattern does not match 5' 4\" because there is a space between ' and 4 which our pattern does not permit. Spaces are characters and R does not ignore them:\n\nidentical(\"Hi\", \"Hi \")\n#> [1] FALSE\n\nIn regex, \\s represents white space. To find patterns like 5' 4, we can change our pattern to:\n\npattern_2 <- \"^[4-7]'\\\\s\\\\d{1,2}\\\"$\"\nstr_subset(problems, pattern_2)\n#> [1] \"5' 4\\\"\" \"5' 11\\\"\" \"5' 7\\\"\"\n\nHowever, this will not match the patterns with no space. So do we need more than one regex pattern? It turns out we can use a quantifier for this as well.\n\n16.4.7 Unbounded quantifiers: *, ?, +\n\nWe want the pattern to permit spaces but not require them. Even if there are several spaces, like in this example 5' 4, we still want it to match. There is a quantifier for exactly this purpose. In regex, the character * means zero or more instances of the previous character. 
Here is an example:\n\nyes <- c(\"AB\", \"A1B\", \"A11B\", \"A111B\", \"A1111B\")\nno <- c(\"A2B\", \"A21B\")\nstr_detect(yes, \"A1*B\")\n#> [1] TRUE TRUE TRUE TRUE TRUE\nstr_detect(no, \"A1*B\")\n#> [1] FALSE FALSE\n\nThe above matches the first string which has zero 1s and all the strings with one or more 1. We can then improve our pattern by adding the * after the space character \\s.\nThere are two other similar quantifiers. For none or once, we can use ?, and for one or more, we can use +. You can see how they differ with this example:\n\ndata.frame(string = c(\"AB\", \"A1B\", \"A11B\", \"A111B\", \"A1111B\"),\n none_or_more = str_detect(yes, \"A1*B\"),\n nore_or_once = str_detect(yes, \"A1?B\"),\n once_or_more = str_detect(yes, \"A1+B\"))\n#> string none_or_more nore_or_once once_or_more\n#> 1 AB TRUE TRUE FALSE\n#> 2 A1B TRUE TRUE TRUE\n#> 3 A11B TRUE FALSE TRUE\n#> 4 A111B TRUE FALSE TRUE\n#> 5 A1111B TRUE FALSE TRUE\n\nWe will actually use all three in our reported heights example, but we will see these in a later section.\n\n16.4.8 Not\nTo specify patterns that we do not want to detect, we can use the ^ symbol but only inside square brackets. Remember that outside the square bracket ^ means the start of the string. So, for example, if we want to detect digits that are preceded by anything except a letter we can do the following:\n\npattern <- \"[^a-zA-Z]\\\\d\"\nyes <- c(\".3\", \"+2\", \"-0\",\"*4\")\nno <- c(\"A3\", \"B2\", \"C0\", \"E4\")\nstr_detect(yes, pattern)\n#> [1] TRUE TRUE TRUE TRUE\nstr_detect(no, pattern)\n#> [1] FALSE FALSE FALSE FALSE\n\nAnother way to generate a pattern that searches for everything except is to use the upper case of the special character. For example \\\\D means anything other than a digit, \\\\S means anything except a space, and so on.\n\n16.4.9 Groups\nGroups are a powerful aspect of regex that permits the extraction of values. Groups are defined using parentheses. They don’t affect the pattern matching per se. 
Instead, they permit tools to identify specific parts of the pattern so we can extract them.\nWe want to change heights written like 5.6 or 5,6 to 5'6.\nTo avoid changing patterns such as 70.2, we will require that the first digit be between 4 and 7 [4-7] and that the second be none or more digits \\d*. Let’s start by defining a simple pattern that matches the comma version:\n\npattern_without_groups <- \"^[4-7],\\\\d*$\"\n\nWe want to extract the digits so we can then form the new version using the feet symbol '. These are our two groups, so we encapsulate them with parentheses:\n\npattern_with_groups <- \"^([4-7]),(\\\\d*)$\"\n\nWe encapsulate the part of the pattern that matches the parts we want to keep for later use. Adding groups does not affect the detection, since it only signals that we want to save what is captured by the groups. Note that both patterns return the same result when using str_detect:\n\nyes <- c(\"5,9\", \"5,11\", \"6,\", \"6,1\")\nno <- c(\"5'9\", \",\", \"2,8\", \"6.1.1\")\ns <- c(yes, no)\nstr_detect(s, pattern_without_groups)\n#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE\nstr_detect(s, pattern_with_groups)\n#> [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n\nOnce we define groups, we can use the function str_match to extract the values these groups define:\n\nstr_match(s, pattern_with_groups)\n#> [,1] [,2] [,3]\n#> [1,] \"5,9\" \"5\" \"9\" \n#> [2,] \"5,11\" \"5\" \"11\"\n#> [3,] \"6,\" \"6\" \"\" \n#> [4,] \"6,1\" \"6\" \"1\" \n#> [5,] NA NA NA \n#> [6,] NA NA NA \n#> [7,] NA NA NA \n#> [8,] NA NA NA\n\nNotice that the second and third columns contain feet and inches, respectively. The first column is the part of the string matching the pattern. If no match occurred, we see an NA.\nNow we can understand the difference between the functions str_extract and str_match. 
str_extract extracts only strings that match a pattern, not the values defined by groups:\n\nstr_extract(s, pattern_with_groups)\n#> [1] \"5,9\" \"5,11\" \"6,\" \"6,1\" NA NA NA NA\n\n\n16.4.10 Search and replace\nEarlier we defined the object problems containing the strings that do not appear to be in inches. We can see that not too many of our problematic strings match the pattern:\n\npattern <- \"^[4-7]'\\\\d{1,2}\\\"$\"\nsum(str_detect(problems, pattern))\n#> [1] 14\n\nTo see why this is, we show some examples that expose why we don’t have more matches:\n\nproblems[c(2, 10, 11, 12, 15)] |> str_view(pattern)\n\n\nAn initial problem we see immediately is that some students wrote out the words “feet” and “inches”. We can see the entries that did this with the str_subset function:\n\nstr_subset(problems, \"inches\")\n#> [1] \"5 feet and 8.11 inches\" \"Five foot eight inches\"\n#> [3] \"5 feet 7inches\" \"5ft 9 inches\" \n#> [5] \"5 ft 9 inches\" \"5 feet 6 inches\"\n\nWe also see that some entries used two single quotes '' instead of a double quote \".\n\nstr_subset(problems, \"''\")\n#> [1] \"5'9''\" \"5'10''\" \"5'10''\" \"5'3''\" \"5'7''\" \"5'6''\" \n#> [7] \"5'7.5''\" \"5'7.5''\" \"5'10''\" \"5'11''\" \"5'10''\" \"5'5''\"\n\nTo correct this, we can replace the different ways of representing inches and feet with a uniform symbol. We will use ' for feet, whereas for inches we will simply not use a symbol since some entries were of the form x'y. 
Now, if we no longer use the inches symbol, we have to change our pattern accordingly:\n\npattern <- \"^[4-7]'\\\\d{1,2}$\"\n\nIf we do this replacement before the matching, we get many more matches:\n\nproblems |> \n str_replace(\"feet|ft|foot\", \"'\") |> # replace feet, ft, foot with ' \n str_replace(\"inches|in|''|\\\"\", \"\") |> # remove all inches symbols\n str_detect(pattern) |> \n sum()\n#> [1] 48\n\nHowever, we still have many cases to go.\nNote that in the code above, we leveraged the stringr consistency and used the pipe.\nFor now, we improve our pattern by adding \\\\s* in front of and after the feet symbol ' to permit space between the feet symbol and the numbers. Now we match a few more entries:\n\npattern <- \"^[4-7]\\\\s*'\\\\s*\\\\d{1,2}$\"\nproblems |> \n str_replace(\"feet|ft|foot\", \"'\") |> # replace feet, ft, foot with ' \n str_replace(\"inches|in|''|\\\"\", \"\") |> # remove all inches symbols\n str_detect(pattern) |> \n sum()\n#> [1] 53\n\nWe might be tempted to avoid doing this by removing all the spaces with str_replace_all. However, when doing such an operation we need to make sure that it does not have unintended effects. In our reported heights examples, this will be a problem because some entries are of the form x y with space separating the feet from the inches. If we remove all spaces, we will incorrectly turn x y into xy which implies that a 6 1 would become 61 inches instead of 73 inches.\nThe second large type of problematic entries were of the form x.y, x,y and x y. We want to change all these to our common format x'y. But we can’t just do a search and replace because we would change values such as 70.5 into 70'5. 
Our strategy will therefore be to search for a very specific pattern that assures us feet and inches are being provided and then, for those that match, replace appropriately.\n\n16.4.11 Search and replace using groups\nAnother powerful aspect of groups is that you can refer to the extracted values in a regex when searching and replacing.\nThe regex special character for the i-th group is \\\\i. So \\\\1 is the value extracted from the first group, \\\\2 the value from the second and so on. As a simple example, note that the following code will replace a comma with the feet symbol ', but only if the comma follows a single digit between 4 and 7 and is followed by nothing but digits:\n\npattern_with_groups <- \"^([4-7]),(\\\\d*)$\"\nyes <- c(\"5,9\", \"5,11\", \"6,\", \"6,1\")\nno <- c(\"5'9\", \",\", \"2,8\", \"6.1.1\")\ns <- c(yes, no)\nstr_replace(s, pattern_with_groups, \"\\\\1'\\\\2\")\n#> [1] \"5'9\" \"5'11\" \"6'\" \"6'1\" \"5'9\" \",\" \"2,8\" \"6.1.1\"\n\nWe can use this to convert cases in our reported heights.\nWe are now ready to define a pattern that helps us convert all the x.y, x,y and x y to our preferred format. We need to adapt pattern_with_groups to be a bit more flexible and capture all the cases.\n\npattern_with_groups <- \"^([4-7])\\\\s*[,\\\\.\\\\s+]\\\\s*(\\\\d*)$\"\n\nLet’s break this one down:\n\n\n^ = start of the string\n\n[4-7] = one digit, either 4, 5, 6, or 7\n\n\\\\s* = none or more white space\n\n[,\\\\.\\\\s+] = feet symbol is either ,, . 
or at least one space\n\n\\\\s* = none or more white space\n\n\\\\d* = none or more digits\n\n$ = end of the string\n\nWe can see that it appears to be working:\n\nstr_subset(problems, pattern_with_groups) |> head()\n#> [1] \"5.3\" \"5.25\" \"5.5\" \"6.5\" \"5.8\" \"5.6\"\n\nand that we can perform the search and replace:\n\nstr_subset(problems, pattern_with_groups) |> \n str_replace(pattern_with_groups, \"\\\\1'\\\\2\") |> head()\n#> [1] \"5'3\" \"5'25\" \"5'5\" \"6'5\" \"5'8\" \"5'6\"\n\nAgain, we will deal with the inches-larger-than-twelve challenge later.\n\n16.4.12 Lookarounds\nLookarounds provide a way to ask for one or more conditions to be satisfied without consuming characters or including them in the match. For example, you might want to check for multiple conditions and if they are matched, then return the pattern or part of the pattern that matched.\nThere are four types of lookarounds: lookahead (?=pattern), lookbehind (?<=pattern), negative lookahead (?!pattern), and negative lookbehind (?<!pattern).\nThe conventional example is checking a password that must satisfy several conditions, such as 1) it consists of 8-16 word characters, 2) it starts with a letter, and 3) it has at least one digit. You can concatenate lookarounds to check for multiple conditions. For our example we can write\n\npattern <- \"(?=^\\\\w{8,16}$)(?=^[a-zA-Z].*)(?=.*\\\\d+.*)\"\n\nA simpler example is changing all superman to supergirl without changing every other man to girl. We could use a lookaround like this:\n\ns <- \"Superman saved a man. The man thanked superman.\"\nstr_replace_all(s, \"(?<=[Ss]uper)man\", \"girl\")\n#> [1] \"Supergirl saved a man. The man thanked supergirl.\"\n\n\n16.4.13 Separating variables\nIn Section 11.3 we introduced functions that can split columns into new ones. The separate_wider_regex function uses regex instead of delimiters to separate variables in data frames. 
It uses an approach similar to regex groups: named patterns define which parts of the original column we keep in which new column.\nSuppose we have a data frame like this:\n\ntab <- data.frame(x = c(\"5'10\", \"6' 1\", \"5 ' 9\", \"5'11\\\"\"))\n\nNote that using separate_wider_delim won’t work here because the delimiter can vary across entries. With separate_wider_regex we can define flexible patterns that are matched to define each column.\n\npatterns <- c(feet = \"\\\\d\", \"\\\\s*'\\\\s*\", inches = \"\\\\d{1,2}\", \".*\")\ntab |> separate_wider_regex(x, patterns = patterns)\n#> # A tibble: 4 × 2\n#> feet inches\n#> <chr> <chr> \n#> 1 5 10 \n#> 2 6 1 \n#> 3 5 9 \n#> 4 5 11\n\nBy not naming the second and fourth entries of patterns we tell the function not to keep the values that match that pattern." } ] \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index e7721c0..142baa8 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -2,114 +2,114 @@ http://rafalab.dfci.harvard.edu/dsbook-part-1/index.html - 2023-11-24T17:05:14.904Z + 2023-11-24T17:29:34.104Z http://rafalab.dfci.harvard.edu/dsbook-part-1/intro.html - 2023-11-24T17:05:14.907Z + 2023-11-24T17:29:34.108Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/intro-to-R.html - 2023-11-24T17:05:14.910Z + 2023-11-24T17:29:34.111Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/getting-started.html - 2023-11-24T17:05:14.916Z + 2023-11-24T17:29:34.116Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/R-basics.html - 2023-11-24T17:05:14.948Z + 2023-11-24T17:29:34.149Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/programming-basics.html - 2023-11-24T17:05:14.959Z + 2023-11-24T17:29:34.160Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/tidyverse.html - 2023-11-24T17:05:14.992Z + 2023-11-24T17:29:34.184Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/data-table.html - 2023-11-24T17:05:15.004Z + 2023-11-24T17:29:34.196Z http://rafalab.dfci.harvard.edu/dsbook-part-1/R/importing-data.html - 2023-11-24T17:05:15.014Z + 
2023-11-24T17:29:34.205Z http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/intro-dataviz.html - 2023-11-24T17:05:15.018Z + 2023-11-24T17:29:34.210Z http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/distributions.html - 2023-11-24T17:05:15.025Z + 2023-11-24T17:29:34.217Z http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/ggplot2.html - 2023-11-24T17:05:15.044Z + 2023-11-24T17:29:34.247Z http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/dataviz-principles.html - 2023-11-24T17:05:15.059Z + 2023-11-24T17:29:34.260Z http://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/dataviz-in-practice.html - 2023-11-24T17:05:15.090Z + 2023-11-24T17:29:34.291Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/intro-to-wrangling.html - 2023-11-24T17:05:15.093Z + 2023-11-24T17:29:34.293Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/reshaping-data.html - 2023-11-24T17:05:15.104Z + 2023-11-24T17:29:34.305Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/joining-tables.html - 2023-11-24T17:05:15.117Z + 2023-11-24T17:29:34.318Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/dates-and-times.html - 2023-11-24T17:05:15.125Z + 2023-11-24T17:29:34.325Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/data-table-wrangling.html - 2023-11-24T17:05:15.133Z + 2023-11-24T17:29:34.332Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/web-scraping.html - 2023-11-24T17:05:15.142Z + 2023-11-24T17:29:34.342Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/string-processing.html - 2023-11-24T17:10:41.829Z + 2023-11-24T17:29:34.377Z http://rafalab.dfci.harvard.edu/dsbook-part-1/wrangling/text-mining.html - 2023-11-24T17:05:15.211Z + 2023-11-24T17:29:34.393Z http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/intro-productivity.html - 2023-11-24T17:05:15.215Z + 2023-11-24T17:29:34.396Z http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/installing-r-and-rstudio.html - 2023-11-24T17:05:15.219Z + 2023-11-24T17:29:34.401Z 
http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/installing-git.html - 2023-11-24T17:05:15.225Z + 2023-11-24T17:29:34.407Z http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/unix.html - 2023-11-24T17:05:15.234Z + 2023-11-24T17:29:34.416Z http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/git.html - 2023-11-24T17:05:15.242Z + 2023-11-24T17:29:34.423Z http://rafalab.dfci.harvard.edu/dsbook-part-1/productivity/reproducible-projects.html - 2023-11-24T17:05:15.251Z + 2023-11-24T17:29:34.433Z diff --git a/docs/wrangling/dates-and-times.html b/docs/wrangling/dates-and-times.html index 0ad141c..26bb53b 100644 --- a/docs/wrangling/dates-and-times.html +++ b/docs/wrangling/dates-and-times.html @@ -332,9 +332,7 @@
-

-13  Parsing dates and times -

+

13  Parsing dates and times

diff --git a/docs/wrangling/reshaping-data.html b/docs/wrangling/reshaping-data.html index aae97e9..3d60681 100644 --- a/docs/wrangling/reshaping-data.html +++ b/docs/wrangling/reshaping-data.html @@ -334,9 +334,7 @@
-

-11  Reshaping data -

+

11  Reshaping data

@@ -473,19 +471,20 @@

#> 5 Germany 1962 fertility 2.47 #> # ℹ 219 more rows -

But we are not done yet. We need to create a column for each variable. As we learned, the pivot_wider function can do this:

-
+

But we are not done yet. We need to create a column for each variable and change year to a number. As we learned, the pivot_wider function can create the columns; mutate with as.integer then converts year:

+
dat |> 
   separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge") |>
-  pivot_wider()
+  pivot_wider() |>
+  mutate(year = as.integer(year))
 #> # A tibble: 112 × 4
-#>   country year  fertility life_expectancy
-#>   <chr>   <chr>     <dbl>           <dbl>
-#> 1 Germany 1960       2.41            69.3
-#> 2 Germany 1961       2.44            69.8
-#> 3 Germany 1962       2.47            70.0
-#> 4 Germany 1963       2.49            70.1
-#> 5 Germany 1964       2.49            70.7
+#>   country  year fertility life_expectancy
+#>   <chr>   <int>     <dbl>           <dbl>
+#> 1 Germany  1960      2.41            69.3
+#> 2 Germany  1961      2.44            69.8
+#> 3 Germany  1962      2.47            70.0
+#> 4 Germany  1963      2.49            70.1
+#> 5 Germany  1964      2.49            70.7
 #> # ℹ 107 more rows

The data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.

diff --git a/wrangling/data-table-wrangling.qmd b/wrangling/data-table-wrangling.qmd index 7f9a693..016f3e8 100644 --- a/wrangling/data-table-wrangling.qmd +++ b/wrangling/data-table-wrangling.qmd @@ -25,7 +25,7 @@ filename <- file.path(path, "fertility-two-countries-example.csv") ``` -### pivot_long is melt +### `pivot_longer` is `melt` If in **tidyverse** we write @@ -39,7 +39,7 @@ in **data.table** we use the `melt` function ```{r} dt_wide_data <- fread(filename) -dt_new_tidy_data <- melt(as.data.table(dt_wide_data), +dt_new_tidy_data <- melt(dt_wide_data, measure.vars = 2:ncol(dt_wide_data), variable.name = "year", value.name = "fertility") @@ -54,13 +54,13 @@ If in **tidyverse** we write ```{r} new_wide_data <- new_tidy_data |> pivot_wider(names_from = year, values_from = fertility) -select(new_wide_data, country, `1960`:`1967`) ``` in **data.table** we write: ```{r} -dt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... ~ year, value.var = "fertility") +dt_new_wide_data <- dcast(dt_new_tidy_data, formula = ... 
~ year, + value.var = "fertility") ``` @@ -77,17 +77,21 @@ In **tidyverse** we wrangled using ```{r, message=FALSE} raw_dat <- read_csv(filename) dat <- raw_dat |> pivot_longer(-country) |> - separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge") |> - pivot_wider() + separate_wider_delim(name, delim = "_", names = c("year", "name"), + too_many = "merge") |> + pivot_wider() |> + mutate(year = as.integer(year)) ``` In **data.table** we can use the `tstrsplit` function: ```{r} dt_raw_dat <- fread(filename) -dat_long <- melt(dt_raw_dat, measure.vars = which(names(dt_raw_dat) != "country"), +dat_long <- melt(dt_raw_dat, + measure.vars = which(names(dt_raw_dat) != "country"), variable.name = "name", value.name = "value") -dat_long[, c("year", "name", "name2") := tstrsplit(name, "_", fixed = TRUE, type.convert = TRUE)] +dat_long[, c("year", "name", "name2") := + tstrsplit(name, "_", fixed = TRUE, type.convert = TRUE)] dat_long[is.na(name2), name2 := ""] dat_long[, name := paste(name, name2, sep = "_")][, name2 := NULL] dat_wide <- dcast(dat_long, country + year ~ name, value.var = "value") @@ -127,5 +131,7 @@ Other similar functions are `second`, `minute`, `hour`, `wday`, `week`, The package also includes the class `IDate` and `ITime`, which store dates and times more efficiently, convenient for large files with date stamps. You convert dates in the usual R format using `as.IDate` and `as.ITime`. +## Exercises +Repeat the exercises in @sec-reshape, @sec-joins, and @sec-dates-and-times using **data.table** instead of **tidyverse**. diff --git a/wrangling/dates-and-times.qmd b/wrangling/dates-and-times.qmd index e912d6f..3a73a2a 100644 --- a/wrangling/dates-and-times.qmd +++ b/wrangling/dates-and-times.qmd @@ -1,4 +1,4 @@ -# Parsing dates and times +# Parsing dates and times {#sec-dates-and-times} We have described three main types of vectors: numeric, character, and logical. When analyzing data, we often encounter variables that are dates. 
Although we can represent a date with a string, for example `November 2, 2017`, once we pick a reference day, referred to as the _epoch_ by computer programmers, dates can be converted to numbers by calculating the number of days since the epoch. In R and Unix, the epoch is defined as January 1, 1970. So, for example, January 2, 1970 is day 1, December 31, 1969 is day -1, and November 2, 2017, is day 17,472. diff --git a/wrangling/reshaping-data.qmd b/wrangling/reshaping-data.qmd index c245eb7..1242263 100644 --- a/wrangling/reshaping-data.qmd +++ b/wrangling/reshaping-data.qmd @@ -1,4 +1,4 @@ -# Reshaping data +# Reshaping data {#sec-reshape} As we have seen throughout the book, having data in *tidy* format is what makes the tidyverse flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The **tidyr** package, part of **tidyverse**, includes several functions that are useful for tidying data. @@ -115,12 +115,13 @@ However, this line of code will give an error. This is because the life expectan dat |> separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge") ``` -But we are not done yet. We need to create a column for each variable. As we learned, the `pivot_wider` function can do this: +But we are not done yet. We need to create a column for each variable and change `year` to a number. As we learned, the `pivot_wider` function can create the columns; `mutate` with `as.integer` then converts `year`: ```{r} dat |> separate_wider_delim(name, delim = "_", names = c("year", "name"), too_many = "merge") |> - pivot_wider() + pivot_wider() |> + mutate(year = as.integer(year)) ``` The data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.