From 217cf543247735a90a7ab9947cc4d5c8ae19c77d Mon Sep 17 00:00:00 2001 From: Richard Hanna Date: Mon, 30 Sep 2024 11:29:39 -0400 Subject: [PATCH] Draft edits, point to classic project --- pkgdown/_pkgdown.yml | 2 + tests/testthat/_snaps/write.md | 154 ++++++++++++++++-- .../{ => images}/labelled_approaches.png | Bin vignettes/articles/labelled.Rmd | 111 ++++++++----- 4 files changed, 210 insertions(+), 57 deletions(-) rename vignettes/articles/{ => images}/labelled_approaches.png (100%) diff --git a/pkgdown/_pkgdown.yml b/pkgdown/_pkgdown.yml index ede73fed..a188ddc1 100644 --- a/pkgdown/_pkgdown.yml +++ b/pkgdown/_pkgdown.yml @@ -31,6 +31,8 @@ navbar: - text: "Exporting to Excel" desc: "Convert Data Tibbles to XLSX Sheets" href: articles/export_to_xlsx.html + - text: "Using Labelled Vectors with REDCapTidieR" + href: articles/labelled.html search: exclude: ['news/index.html'] diff --git a/tests/testthat/_snaps/write.md b/tests/testthat/_snaps/write.md index d77388fc..c6943f93 100644 --- a/tests/testthat/_snaps/write.md +++ b/tests/testthat/_snaps/write.md @@ -16,7 +16,8 @@ 9 api_no_access_2 API No Access 2 10 survey Survey 11 repeat_survey Repeat Survey - 12 REDCap Metadata + 12 labelled_vignette Labelled Vignette + 13 REDCap Metadata Repeating or Nonrepeating? # of Rows in Data # of Columns in Data 2 structure data_rows data_cols 3 nonrepeating 4 4 @@ -28,19 +29,21 @@ 9 nonrepeating 4 5 10 nonrepeating 4 9 11 repeating 3 10 - 12 + 12 nonrepeating 4 7 + 13 Data size in Memory % of Data Missing NA Sheet # 2 data_size data_na_pct form_complete_pct Sheet # 3 2.28 kB 0.25 0 1 4 1.94 kB 0.5 0 2 5 2.58 kB 0 0 3 - 6 7.71 kB 0.293103448275862 0 4 + 6 7.71 kB 0.28448275862069 0 4 7 7.40 kB 0.75 0 5 8 1.78 kB 1 0 6 9 2.06 kB 1 0 7 10 3.73 kB 0.392857142857143 0 8 11 3.94 kB 0.142857142857143 0 9 - 12 10 + 12 3.04 kB 0 0 10 + 13 11 [[1]][[2]] Record ID Text Box Input Text Box Input REDCap Instrument Completed? 
@@ -77,7 +80,7 @@ 2 record_id text note calculated dropdown_single radio_single 3 1 text notes 2 one B 4 2 2 three C - 5 3 + 5 3 2 6 4 2 NA NA NA 2 radio_duplicate_label checkbox_multiple___1 checkbox_multiple___2 @@ -225,6 +228,20 @@ 5 2022-11-09 12:21:04 Complete [[1]][[11]] + Record ID Text Box Radio Buttons Checkbox: A Checkbox: B Checkbox: C + 2 record_id text_box_1 radio_buttons_1 checkbox___1 checkbox___2 checkbox___3 + 3 1 Record 1 A Checked Unchecked Unchecked + 4 2 Record 2 B Checked Checked Unchecked + 5 3 Record 3 C Unchecked Checked Checked + 6 4 Record 4 A Unchecked Unchecked Unchecked + REDCap Instrument Completed? + 2 form_status_complete + 3 Complete + 4 Complete + 5 Complete + 6 Complete + + [[1]][[12]] REDCap Instrument Name REDCap Instrument Description 2 redcap_form_name redcap_form_label 3 @@ -293,6 +310,11 @@ 66 repeat_survey Repeat Survey 67 repeat_survey Repeat Survey 68 repeat_survey Repeat Survey + 69 labelled_vignette Labelled Vignette + 70 labelled_vignette Labelled Vignette + 71 labelled_vignette Labelled Vignette + 72 labelled_vignette Labelled Vignette + 73 labelled_vignette Labelled Vignette Variable / Field Name 2 field_name 3 record_id @@ -361,6 +383,11 @@ 66 repeatsurvey_checkbox_v2___one 67 repeatsurvey_checkbox_v2___two 68 repeatsurvey_checkbox_v2___three + 69 text_box_1 + 70 radio_buttons_1 + 71 checkbox___1 + 72 checkbox___2 + 73 checkbox___3 Field Label Field Type 2 field_label field_type 3 Record ID text @@ -429,6 +456,11 @@ 66 Checkbox Field: Choice 1 checkbox 67 Checkbox Field: Choice 2 checkbox 68 Checkbox Field: Choice 3 checkbox + 69 Text Box text + 70 Radio Buttons radio + 71 Checkbox: A checkbox + 72 Checkbox: B checkbox + 73 Checkbox: C checkbox Section Header Prior to this Field 2 section_header 3 @@ -497,6 +529,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Choices, Calculations, or Slider Labels 2 select_choices_or_calculations 3 @@ -565,6 +602,11 @@ 66 one, Choice 1 | two, Choice 2 | three, Choice 3 67 one, Choice 1 | 
two, Choice 2 | three, Choice 3 68 one, Choice 1 | two, Choice 2 | three, Choice 3 + 69 + 70 1, A | 2, B | 3, C + 71 1, A | 2, B | 3, C + 72 1, A | 2, B | 3, C + 73 1, A | 2, B | 3, C Field Note Text Validation Type OR Show Slider Number 2 field_note text_validation_type_or_show_slider_number 3 @@ -633,6 +675,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Minimum Accepted Value for Text Validation 2 text_validation_min 3 @@ -701,6 +748,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Maximum Accepted Value for Text Validation Is this Field an Identifier? 2 text_validation_max identifier 3 @@ -769,6 +821,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Branching Logic (Show field only if...) Is this Field Required? 2 branching_logic required_field 3 @@ -837,6 +894,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Custom Alignment Question Number (surveys only) Matrix Group Name 2 custom_alignment question_number matrix_group_name 3 @@ -905,6 +967,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Matrix Ranking? Field Annotation Data Type Count of Missing Values 2 matrix_ranking field_annotation skim_type n_missing 3 @@ -916,7 +983,7 @@ 9 character 0 10 character 3 11 character 3 - 12 numeric 1 + 12 numeric 0 13 factor 2 14 factor 2 15 factor 4 @@ -973,6 +1040,11 @@ 66 logical 0 67 logical 0 68 logical 0 + 69 character 0 + 70 factor 0 + 71 logical 0 + 72 logical 0 + 73 logical 0 Proportion of Non-Missing Values Shortest Value (Fewest Characters) 2 complete_rate character.min 3 @@ -984,7 +1056,7 @@ 9 1 1 10 0.25 4 11 0.25 5 - 12 0.75 + 12 1 13 0.5 14 0.5 15 0 @@ -1041,6 +1113,11 @@ 66 1 67 1 68 1 + 69 1 8 + 70 1 + 71 1 + 72 1 + 73 1 Longest Value (Most Characters) Count of Empty Values Count of Unique Values 2 character.max character.empty character.n_unique 3 @@ -1109,6 +1186,11 @@ 66 67 68 + 69 8 0 4 + 70 + 71 + 72 + 73 Count of Values that are all Whitespace Mean Standard Deviation 2 character.whitespace numeric.mean numeric.sd 3 @@ -1177,6 +1259,11 @@ 66 67 68 + 69 0 + 70 + 71 + 72 + 73 Minimum 25th 
Percentile Median 75th Percentile Maximum 2 numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 3 @@ -1245,6 +1332,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Histogram Is the Categorical Value Ordered? Count of Unique Values 2 numeric.hist factor.ordered factor.n_unique 3 @@ -1313,6 +1405,11 @@ 66 67 68 + 69 + 70 FALSE 3 + 71 + 72 + 73 Most Frequent Values Proportion of TRUE Values Count of Logical Values 2 factor.top_counts logical.mean logical.count 3 @@ -1381,6 +1478,11 @@ 66 0.666666666666667 TRU: 2, FAL: 1 67 0.333333333333333 FAL: 2, TRU: 1 68 0.333333333333333 FAL: 2, TRU: 1 + 69 + 70 A: 2, B: 1, C: 1 + 71 0.5 FAL: 2, TRU: 2 + 72 0.5 FAL: 2, TRU: 2 + 73 0.25 FAL: 3, TRU: 1 Earliest Latest Median Count of Unique Values Earliest 2 Date.min Date.max Date.median Date.n_unique POSIXct.min 3 @@ -1449,6 +1551,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Latest Median Count of Unique Values Minimum 2 POSIXct.max POSIXct.median POSIXct.n_unique difftime.min 3 @@ -1517,6 +1624,11 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 Maximum Median Count of Unique Values 2 difftime.max difftime.median difftime.n_unique 3 @@ -1585,11 +1697,16 @@ 66 67 68 + 69 + 70 + 71 + 72 + 73 [[2]] tab_name tab_sheet tab_ref - 1 Table1 1 A2:I12 + 1 Table1 1 A2:I13 2 Table2 2 A2:D6 3 Table3 3 A2:D6 4 Table4 4 A2:E6 @@ -1599,9 +1716,10 @@ 8 Table8 8 A2:E6 9 Table9 9 A2:I6 10 Table10 10 A2:J5 - 11 Table11 11 A2:AZ68 + 11 Table11 11 A2:G6 + 12 Table12 12 A2:AZ73 tab_xml - 1
+ 1
2
3
4
@@ -1611,7 +1729,8 @@ 8
9
10
- 11
+ 11
+ 12
tab_act 1 1 2 1 @@ -1624,6 +1743,7 @@ 9 1 10 1 11 1 + 12 1 [[3]] [[3]]$fileVersion @@ -1661,7 +1781,8 @@ [8] "" [9] "" [10] "" - [11] "" + [11] "" + [12] "" [[3]]$functionGroups NULL @@ -1715,11 +1836,12 @@ [9] "" [10] "" [11] "" - [12] "" - [13] "" + [12] "" + [13] "" + [14] "" [[5]] - [1] 1 2 3 4 5 6 7 8 9 10 11 + [1] 1 2 3 4 5 6 7 8 9 10 11 12 [[6]] [1] "Table of Contents" "Nonrepeated" @@ -1727,6 +1849,6 @@ [5] "Data Field Types" "Text Input Validation Types" [7] "API No Access" "API No Access 2" [9] "Survey" "Repeat Survey" - [11] "REDCap Metadata" + [11] "Labelled Vignette" "REDCap Metadata" diff --git a/vignettes/articles/labelled_approaches.png b/vignettes/articles/images/labelled_approaches.png similarity index 100% rename from vignettes/articles/labelled_approaches.png rename to vignettes/articles/images/labelled_approaches.png diff --git a/vignettes/articles/labelled.Rmd b/vignettes/articles/labelled.Rmd index 3ad027d0..1f8a2991 100644 --- a/vignettes/articles/labelled.Rmd +++ b/vignettes/articles/labelled.Rmd @@ -1,5 +1,5 @@ --- -title: "Using labelled vectors (`haven_labelled` class) with REDCapTidieR" +title: "Using Labelled Vectors with REDCapTidieR" output: rmarkdown::html_document --- @@ -14,100 +14,129 @@ knitr::opts_chunk$set( knitr::knit_exit() ``` -## Several options for importing categorical variables +## Options for Importing Categorical Variables -When importing data from REDCap using `read_redcap()`, you have several options determining how to import coded values. +When importing data from REDCap using `read_redcap()`, you have several options for handling coded categorical variables. These options determine how the coded values are represented in your R environment. 
+ +For this vignette, we will be using a sample [classic project](https://chop-cgtinformatics.github.io/REDCapTidieR/articles/glossary.html#classic-project) with a [form](https://chop-cgtinformatics.github.io/REDCapTidieR/articles/glossary.html#form) that comprises most common REDCap data types. ```{r, include = FALSE} # Load credentials redcap_uri <- Sys.getenv("REDCAP_URI") -superheroes_token <- Sys.getenv("SUPERHEROES_REDCAP_API") +token <- Sys.getenv("REDCAPTIDIER_CLASSIC_API") library(REDCapTidieR) ``` ``` r library(REDCapTidieR) -superheroes_token <- "123456789ABCDEF123456789ABCDEF04" +token <- "123456789ABCDEF123456789ABCDEF04" redcap_uri <- "https://my.institution.edu/redcap/api/" ``` -If you use `raw_or_label = "raw"`, you will get the raw coded values for categorical variables, keeping the original coding of your data. However, you will use the information regarding the meaning of each code. You will have to get from REDCap a dictionary table explaining the meaning of each code. +Using `raw_or_label = "raw"` retrieves the raw coded values for categorical variables. This approach preserves the original coding, but you'll need to separately reference the data dictionary from REDCap to interpret the meaning of each code. -```{r} -superheroes <- +```{r, warning=FALSE} +redcap_form <- read_redcap( redcap_uri, - superheroes_token, + token, raw_or_label = "raw" ) |> - extract_tibble("heroes_information") -superheroes + extract_tibble("labelled_vignette") + +redcap_form ``` -Alternatively, you could opt for `raw_or_label = "label"` (the default) where each code will be replaced the corresponding label and all categorical variables will be transformed into factors, ready to be used for analysis. But, here, you will lose the original coding of the data. It could be problematic if you need to keep a track of original codes (e.g. for data cleaning) or if you intend to re-export the data at a latter step (e.g. 
in Stata or SPSS format) where it would be relevant to keep the original coding. +The default option, `raw_or_label = "label"`, replaces each code with its corresponding label and converts categorical variables into factors. This is convenient for analysis but discards the original numeric codes, which may be necessary for tasks like data cleaning or re-exporting to other formats (e.g., Stata or SPSS). -```{r} -superheroes <- +```{r, warning=FALSE} +redcap_form <- read_redcap( redcap_uri, - superheroes_token, + token, raw_or_label = "label" ) |> - extract_tibble("heroes_information") -superheroes + extract_tibble("labelled_vignette") + +redcap_form ``` -A third and final option is to opt for `raw_or_label = "haven_labelled"`. In that case, categorical variables will be imported as labelled vectors, using the `"haven_labelled"` class introduced by the `{haven}` package (cf. `vignette("semantics", package = "haven")`). In this case, your categorical variables will be imported using their original coding and the corresponding value labels will be attached to them as meta-data. +A third option, `raw_or_label = "haven"`, imports categorical variables as labelled vectors using the `"haven_labelled"` class from the haven package (cf. `vignette("semantics", package = "haven")`). This method imports your categorical variables using their original coding and attaches the corresponding value labels to them as metadata. -```{r} -superheroes <- +```{r, warning=FALSE} +redcap_form <- read_redcap( redcap_uri, - superheroes_token, + token, raw_or_label = "haven" ) |> - extract_tibble("heroes_information") -superheroes + extract_tibble("labelled_vignette") + +redcap_form ``` -## Pros & Cons of labelled vectors +## Pros & Cons of Labelled Vectors + +The `"haven_labelled"` class was originally developed to import data from statistical software like SPSS, Stata, or SAS, which use value labels for categorical variables. 
This format allows you to store both the original coding and the labels attached to each value. -The `"haven_labelled"` was initially developed for importing data from SPSS, Stata or SAS who use values labels to store categorical variables. This format allows to store both the original coding and the labels attached to each value. +### Advantages -The `{labelled}` package provides several functions to manipulate value labels, such as `labelled::set_value_labels()`, `labelled::get_value_labels()`, `labelled::add_value_labels()` or `labelled::remove_value_labels()`. +- **Preservation of Original Coding**: Both numeric codes and labels are retained, which is useful for data cleaning and re-exporting. +- **Metadata Management**: The labelled package offers functions to manage value labels effectively. -It is possible to search through the variables and/or to generate a variable dictionary using `labelled::look_for()` (cf. `vignette("look_for", package = "labelled")`). +You can manipulate value labels using functions such as: + +- `labelled::set_value_labels()` +- `labelled::get_value_labels()` +- `labelled::add_value_labels()` +- `labelled::remove_value_labels()` + +Additionally, you can search through variables or generate a variable dictionary with `labelled::look_for()` (cf. `vignette("look_for", package = "labelled")`): ```{r} library(labelled) -superheroes |> look_for() +redcap_form |> + look_for() ``` -However, labelled vectors are not intended for data analysis. For descriptive statistics, plots, or model computing, categorical variables should be coded as factors. It could be easily done with `labelled::to_factor()` or `labelled::unlabelled()` (both could be applied to a full data frame). If you opt for importing your data as labelled vectors, you should therefore chose one of the two following approaches. +### Disadvantages -![](labelled_approaches.png) +Labelled vectors are not optimized for data analysis tasks like descriptive statistics, plotting, or modeling. 
For these purposes, categorical variables should be converted to factors or numeric vectors. -In **approach A**, `haven_labelled` vectors are converted into factors or into numeric/character vectors just after data import, using `labelled::unlabelled()`, `labelled::to_factor()` or `unclass()`. Then, data cleaning, recoding and analysis are performed using classic **R** vector types. +### Recommended Approaches -In **approach B**, `haven_labelled` vectors are kept for data cleaning and coding, allowing to preserved original recoding, in particular if data should be re-exported after that step. Functions provided by `{labelled}` will be useful for managing value labels. However, as in approach A, `haven_labelled` vectors will have to be converted into classic factors or numeric vectors before data analysis (in particular modelling) as this is the way categorical and continuous variables should be coded for analysis functions. +![labelled Approaches](images/labelled_approaches.png) -## Variable labels +**Approach A**: Convert `haven_labelled` vectors to factors or numeric/character vectors just after import using functions like `labelled::unlabelled()`, `labelled::to_factor()`, or `unclass()`. Proceed with data cleaning, recoding, and analysis using standard R vector types. -Variable labels should not be confounded with value labels. A variable label is a textual description of a variable and does not modify the class of the vector, while values labels are a textual description of certain values of a vector. Adding a value label modifies the class of the vector into `"haven_labelled"`. +**Approach B**: Retain `haven_labelled` vectors for data cleaning and coding to preserve the original coding, especially if you plan to re-export the data. Use labelled functions to manage value labels, but convert the vectors to factors or numeric types before performing analysis or modeling. 
-The `{labelled}` package also provides function to manipulate variable labels, such as `labelled::set_variable_labels()` or `labelled::get_variable_labels()`. +## Managing Variable Labels -The function `REDCapTidieR::make_labelled()` allows to add variable labels to data frames exported from REDCap. +It's important to distinguish between value labels and variable labels: -```{r} -superheroes <- +- **Value Labels**: Describe the meaning of specific values within a vector and change the vector's class to `"haven_labelled"`. +- **Variable Labels**: Provide a textual description of the entire variable without altering its class. + +The labelled package offers functions to handle variable labels, such as: + +- `labelled::set_variable_labels()` +- `labelled::get_variable_labels()` + +Using `REDCapTidieR::make_labelled()` allows you to add variable labels to data frames exported from REDCap: + +```{r, warning=FALSE} +redcap_form <- read_redcap( redcap_uri, - superheroes_token, + token, raw_or_label = "haven" ) |> make_labelled() |> - extract_tibble("heroes_information") + extract_tibble("labelled_vignette") -superheroes |> look_for() +redcap_form |> + look_for() ``` + +This ensures that your data not only retains value labels but also includes descriptive labels for each variable, enhancing the readability and usability of your dataset.