Merge pull request #126 from molgenis/chore/vignette

docs: vignette
molgenis · May 24, 2024 · 5d4b52a · 5d4b52a
2 parents 37c4b85 + c43f30e
commit 5d4b52a
Showing 8 changed files with 194 additions and 135 deletions.
diff --git a/R/subset.R b/R/subset.R
@@ -28,7 +28,7 @@
 #' \dontrun{
 #' armadillo.subset(
 #'   source_project = "gecko",
-#'   new_project = "study1",
+#'   target_project = "study1",
 #'   subset_def = local_subset
 #' )
 #' }
@@ -75,7 +75,12 @@ armadillo.subset <- function(input_source = NULL, subset_def = NULL, source_proj
 
 #' Builds an R object containing info required to make subsets
 #'
-#' @param reference_csv \code{.csv} file containing vars to subset
+#' @param reference_csv \code{.csv} file containing details of the variables to subset. Must contain
+#' 5 columns: 'source_folder' specifying the folder from which to subset, 'source_table' specifying the
+#' table from which to subset, 'target_folder' specifying the folder in which to create the subset,
+#' 'target_table' specifying the name of the subset and 'variable' specifying the variable(s) to
+#' include in the subset. Note that 'source_project' and 'target_project' are specified as arguments
+#' to \code{armadillo.subset}.
 #' @param vars Deprecated: use \code{reference_csv} instead
 #'
 #' @return A dataframe containing variables that is used for input in the

diff --git a/man/armadillo.subset.Rd b/man/armadillo.subset.Rd
diff --git a/man/armadillo.subset_definition.Rd b/man/armadillo.subset_definition.Rd
diff --git a/man/dot-get_linkfile_content.Rd b/man/dot-get_linkfile_content.Rd
diff --git a/man/dot-load_linked_table.Rd b/man/dot-load_linked_table.Rd
diff --git a/vignettes/creating_data_subsets.Rmd b/vignettes/creating_data_subsets.Rmd
@@ -9,119 +9,113 @@ vignette: >
 
 
 
-When researchers request access to your data they may in many cases not be granted access to the whole dataset, but only to a subset. In Armadilllo, access is regulated on the project level, so you will need to create a new project using a subset of the data. Here are the required steps to create these subsets.
+
+When researchers request access to your data, they may be granted access not to the whole dataset but only to the
+variables they will use in their project. In Armadillo, access is regulated at the project level, so you will need to create a view containing only these variables.
 
 ## Install and load the package
-You need to install and load the package first to be able to create the subsets.
+You first need to install and load the package to be able to create the subsets.
+
 
 
 ```r
 install.packages("MolgenisArmadillo")
 ```
 
 
+
 ```r
 library(MolgenisArmadillo)
 ```
 
 ## Logging in
-In order to access the files you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
+In order to access the files, you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
+
 
 
 ```r
 armadillo.login("https://armadillo-demo.molgenis.net/")
+#> [1] "We're opening a browser so you can log in with code 5FLGYF"
 ```
 
-A session will be created and the credentials are stored in the environment.
+A session will be created and the credentials stored in the environment.
+
+## Creating the subset
+Let's assume you are in a consortium whose data cannot be shared in its entirety with researchers. You want to share a subset of the whole dataset with certain researchers who applied for access to your data. There
+are two ways to do this.
 
-## Defining the subset
-Let's assume you are in a consortium which has data that can not be shared in total to the researchers. You want to share a subset of the whole dataset with certain researchers that applied for access to your data.
+### Specifying the required variables in a separate .csv file
 
-For each research project, we need to define a .csv file containing 3 columns:
+For each research project, you first create a .csv file containing 5 columns:
 
-| folder       | table       | variable      |
-| ------------ | ----------- | ------------- |
-| 2_1_core_1_0 | yearly_rep  | green_dist_   |
-| 2_1_core_1_0 | yearly_rep  | green_size_   |
-| 2_1_core_1_0 | yearly_rep  | green_access_ |
+| source_folder | source_table | target_folder | target_table | variable      |
+| ------------- | -----------  | ------------- | ------------ | ------------- |
+| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_dist_   |
+| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_size_   |
+| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_access_ |
 
-'folder' refers a folder within the master project; 'table' refers to the name of a table within this folder, and 'variable' refers to one or more variables within this table. Note that these columns need to be named exactly as above.
+'source_folder' refers to a folder within the master project; 'source_table' refers to the name of a table within this folder;
+'target_folder' refers to the name of the new folder within the target project; 'target_table' refers to the name of the
+new table within 'target_folder'; and 'variable' refers to one or more variables within 'source_table' ('source_project' and 'target_project' are specified later).
+
+Note that these columns need to be named exactly as above.
+
+Once you have defined the tables, you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.
 
-If you defined the tables then you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.
 
 
 ```r
 subset_definition <- armadillo.subset_definition(
-  vars = "data/subset/vars.csv")
+  reference_csv = "data/subset/vars.csv")
 subset_definition
-#> # A tibble: 3 × 3
-#>   folder          table     vars_to_subset   
-#>   <chr>           <chr>     <list>           
-#> 1 2_1-core-1_0    yearlyrep <tibble [15 × 1]>
-#> 2 1_1-outcome-1_0 yearlyrep <tibble [9 × 1]> 
-#> 3 2_1-core-1_0    nonrep    <tibble [7 × 1]>
+#> # A tibble: 3 × 5
+#>   source_folder   source_table target_folder target_table target_vars      
+#>   <chr>           <chr>        <chr>         <chr>        <list>           
+#> 1 2_1-core-1_0    yearlyrep    core          year_rep     <tibble [14 × 1]>
+#> 2 1_1-outcome-1_0 yearlyrep    outcome       year_rep     <tibble [9 × 1]> 
+#> 3 2_1-core-1_0    nonrep       core          non_rep      <tibble [5 × 1]>
 ```
 
-After this you can create a new subset using the subset method on the Armadillo. First we are going to perform a dry-run to check whether the required folder, tables and variables are present.
+After this, you can create the new subset using the `armadillo.subset` function.
+
 
 
 ```r
-not_available <- armadillo.subset(
+armadillo.subset(
+  input_source = "subset_def",
 	source_project = "gecko",
-	new_project = "study1",
-	subset_def = subset_definition,
-	dry_run = TRUE
+	target_project = "study1",
+	subset_def = subset_definition
 )
-not_available
-#> # A tibble: 23 × 3
-#>    folder       table     missing      
-#>    <chr>        <chr>     <chr>        
-#>  1 2_1-core-1_0 yearlyrep green_dist_  
-#>  2 2_1-core-1_0 yearlyrep green_size_  
-#>  3 2_1-core-1_0 yearlyrep green_access_
-#>  4 2_1-core-1_0 yearlyrep ndvi100_     
-#>  5 2_1-core-1_0 yearlyrep ndvi300_     
-#>  6 2_1-core-1_0 yearlyrep ndvi500_     
-#>  7 2_1-core-1_0 yearlyrep blue_dist_   
-#>  8 2_1-core-1_0 yearlyrep blue_size_   
-#>  9 2_1-core-1_0 yearlyrep blue_access_ 
-#> 10 2_1-core-1_0 yearlyrep no2_         
-#> # ℹ 13 more rows
+#> Created project 'study1' without users
+#> ✔ All views were successfully created!
+#> ✔ View 'core/year_rep' successfully created
+#> ✔ View 'outcome/year_rep' successfully created
+#> ✔ View 'core/non_rep' successfully created
 ```
 
-This outputs a tibble with details of any variables that are missing within the actual data. You can check whether you suspected this or this is an anomaly.
+This method is generally the best choice if you need to create subsets for multiple tables.
 
-If you are confident that it will work you can run the subset method without dry_run.
+### Specifying the subset via arguments
+An alternative is to specify the subset in R, via arguments to the `armadillo.subset` function:
 
 
 ```r
 armadillo.subset(
+  input_source = "arguments",
 	source_project = "gecko",
-	new_project = "study1",
-	subset_def = subset_definition, 
-	dry_run = FALSE
+	source_folder = "2_1-core-1_0", 
+	source_table = "yearlyrep",
+	target_project = "study2",
+	target_folder = "core",
+	target_table = "year_rep", 
+	target_vars = c("occup_f1_", "occupcode_f2_", "edu_f1_", "edu_f1_fath", "edu_f2_", "edu_f2_fath", "pets_", "cats_", "cats_quant_", "dogs_")
 )
-#> Created project 'study1'
-#> Compressing...
-#> Uploaded 2_1-core-1_0/yearlyrep
-#> Compressing...
-#> Uploaded 1_1-outcome-1_0/yearlyrep
-#> Compressing...
-#> Uploaded 2_1-core-1_0/nonrep
-#> # A tibble: 23 × 3
-#>    folder       table     missing      
-#>    <chr>        <chr>     <chr>        
-#>  1 2_1-core-1_0 yearlyrep green_dist_  
-#>  2 2_1-core-1_0 yearlyrep green_size_  
-#>  3 2_1-core-1_0 yearlyrep green_access_
-#>  4 2_1-core-1_0 yearlyrep ndvi100_     
-#>  5 2_1-core-1_0 yearlyrep ndvi300_     
-#>  6 2_1-core-1_0 yearlyrep ndvi500_     
-#>  7 2_1-core-1_0 yearlyrep blue_dist_   
-#>  8 2_1-core-1_0 yearlyrep blue_size_   
-#>  9 2_1-core-1_0 yearlyrep blue_access_ 
-#> 10 2_1-core-1_0 yearlyrep no2_         
-#> # ℹ 13 more rows
+#> Created project 'study2' without users
+#> ✔ All views were successfully created!
+#> ✔ View 'core/year_rep' successfully created
 ```
+This method may be easier if you only need to create one small subset.
 
+## Checking subsets
 Now you can also take a look at the files in the armadillo user interface, if you open it in a browser window.
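
The five-column reference file described in the vignette could be produced from R itself. The following is a minimal sketch using base R only. The folder, table and variable names are the illustrative values from the vignette's example table, and the temporary file path is an assumption made so the example runs anywhere.

```r
# Sketch: build the five-column reference file described in the vignette.
# All values below are the illustrative ones from the example table.
reference <- data.frame(
  source_folder = "2_1_core_1_0",
  source_table  = "yearly_rep",
  target_folder = "project1",
  target_table  = "yearly_vars",
  variable      = c("green_dist_", "green_size_", "green_access_"),
  stringsAsFactors = FALSE
)

# Write to a temporary location (an assumption; any path would do).
csv_path <- tempfile(fileext = ".csv")
write.csv(reference, csv_path, row.names = FALSE)

# Read the file back to confirm the header matches the required names exactly.
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
names(round_trip)
```

The resulting path could then be passed as the `reference_csv` argument of `armadillo.subset_definition`.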
diff --git a/vignettes/creating_data_subsets.Rmd.orig b/vignettes/creating_data_subsets.Rmd.orig
@@ -14,72 +14,92 @@ knitr::opts_chunk$set(
 )
 ```
 
-When researchers request access to your data they may in many cases not be granted access to the whole dataset, but only to a subset. In Armadilllo, access is regulated on the project level, so you will need to create a new project using a subset of the data. Here are the required steps to create these subsets.
+
+When researchers request access to your data, they may be granted access not to the whole dataset but only to the
+variables they will use in their project. In Armadillo, access is regulated at the project level, so you will need to create a view containing only these variables.
 
 ## Install and load the package
-You need to install and load the package first to be able to create the subsets.
+You first need to install and load the package to be able to create the subsets.
+
 
-```{r, install the package, eval = FALSE}
+```{r eval = FALSE}
 install.packages("MolgenisArmadillo")
 ```
 
-```{r, load the package}
+
+```{r}
 library(MolgenisArmadillo)
 ```
 
 ## Logging in
-In order to access the files you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
+In order to access the files, you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
 
-```{r, login to armadillo}
+
+```{r}
 armadillo.login("https://armadillo-demo.molgenis.net/")
 ```
 
-A session will be created and the credentials are stored in the environment.
+A session will be created and the credentials stored in the environment.
+
+## Creating the subset
+Let's assume you are in a consortium whose data cannot be shared in its entirety with researchers. You want to share a subset of the whole dataset with certain researchers who applied for access to your data. There
+are two ways to do this.
 
-## Defining the subset
-Let's assume you are in a consortium which has data that can not be shared in total to the researchers. You want to share a subset of the whole dataset with certain researchers that applied for access to your data.
+### Specifying the required variables in a separate .csv file
 
-For each research project, we need to define a .csv file containing 3 columns:
+For each research project, you first create a .csv file containing 5 columns:
 
-| folder       | table       | variable      |
-| ------------ | ----------- | ------------- |
-| 2_1_core_1_0 | yearly_rep  | green_dist_   |
-| 2_1_core_1_0 | yearly_rep  | green_size_   |
-| 2_1_core_1_0 | yearly_rep  | green_access_ |
+| source_folder | source_table | target_folder | target_table | variable      |
+| ------------- | -----------  | ------------- | ------------ | ------------- |
+| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_dist_   |
+| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_size_   |
+| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_access_ |
 
-'folder' refers a folder within the master project; 'table' refers to the name of a table within this folder, and 'variable' refers to one or more variables within this table. Note that these columns need to be named exactly as above.
+'source_folder' refers to a folder within the master project; 'source_table' refers to the name of a table within this folder;
+'target_folder' refers to the name of the new folder within the target project; 'target_table' refers to the name of the
+new table within 'target_folder'; and 'variable' refers to one or more variables within 'source_table' ('source_project' and 'target_project' are specified later).
 
-If you defined the tables then you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.
+Note that these columns need to be named exactly as above.
 
-```{r, create the subset definition object}
+Once you have defined the tables, you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.
+
+
+```{r}
 subset_definition <- armadillo.subset_definition(
-  vars = "data/subset/vars.csv")
+  reference_csv = "data/subset/vars.csv")
 subset_definition
 ```
 
-After this you can create a new subset using the subset method on the Armadillo. First we are going to perform a dry-run to check whether the required folder, tables and variables are present.
+After this, you can create the new subset using the `armadillo.subset` function.
 
-```{r, create a subset in dry-run mode}
-not_available <- armadillo.subset(
+
+```{r}
+armadillo.subset(
+  input_source = "subset_def",
 	source_project = "gecko",
-	new_project = "study1",
-	subset_def = subset_definition,
-	dry_run = TRUE
+	target_project = "study1",
+	subset_def = subset_definition
 )
-not_available
 ```
 
-This outputs a tibble with details of any variables that are missing within the actual data. You can check whether you suspected this or this is an anomaly.
+This method is generally the best choice if you need to create subsets for multiple tables.
 
-If you are confident that it will work you can run the subset method without dry_run.
+### Specifying the subset via arguments
+An alternative is to specify the subset in R, via arguments to the `armadillo.subset` function:
 
-```{r, running the subset for real}
+```{r}
 armadillo.subset(
+  input_source = "arguments",
 	source_project = "gecko",
-	new_project = "study1",
-	subset_def = subset_definition, 
-	dry_run = FALSE
+	source_folder = "2_1-core-1_0", 
+	source_table = "yearlyrep",
+	target_project = "study2",
+	target_folder = "core",
+	target_table = "year_rep", 
+	target_vars = c("occup_f1_", "occupcode_f2_", "edu_f1_", "edu_f1_fath", "edu_f2_", "edu_f2_fath", "pets_", "cats_", "cats_quant_", "dogs_")
 )
 ```
+This method may be easier if you only need to create one small subset.
 
+## Checking subsets
 Now you can also take a look at the files in the armadillo user interface, if you open it in a browser window.
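
Because the vignette requires the reference columns to be named exactly, a small check before building the subset definition may catch misspelt headers or empty cells early. The sketch below uses base R only. `check_reference` and `required_cols` are hypothetical names introduced for illustration and are not part of MolgenisArmadillo.

```r
# Hypothetical helper, not part of MolgenisArmadillo: checks that a reference
# data frame has the five required columns and no empty cells.
required_cols <- c("source_folder", "source_table", "target_folder",
                   "target_table", "variable")

check_reference <- function(reference) {
  # Stop early if any required column name is absent or misspelt.
  missing_cols <- setdiff(required_cols, names(reference))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Flag rows where any required cell is NA or an empty string.
  cells <- reference[, required_cols, drop = FALSE]
  incomplete <- !complete.cases(cells) | rowSums(cells == "", na.rm = TRUE) > 0
  # Return the row numbers of incomplete rows (empty if all rows are fine).
  unname(which(incomplete))
}

# Example input using the illustrative values from the vignette's table.
good <- data.frame(
  source_folder = "2_1_core_1_0",
  source_table  = "yearly_rep",
  target_folder = "project1",
  target_table  = "yearly_vars",
  variable      = c("green_dist_", "green_size_", "green_access_"),
  stringsAsFactors = FALSE
)
bad_rows <- check_reference(good)
```

Here `bad_rows` holds the row numbers that need attention, and a misspelt header such as `souce_table` would raise an error instead of returning a result.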