
Commit

Merge pull request #126 from molgenis/chore/vignette
docs: vignette
timcadman authored May 24, 2024
2 parents 37c4b85 + c43f30e commit 5d4b52a
Showing 8 changed files with 194 additions and 135 deletions.
9 changes: 7 additions & 2 deletions R/subset.R
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
#' \dontrun{
#' armadillo.subset(
#' source_project = "gecko",
#' new_project = "study1",
#' target_project = "study1",
#' subset_def = local_subset
#' )
#' }
Expand Down Expand Up @@ -75,7 +75,12 @@ armadillo.subset <- function(input_source = NULL, subset_def = NULL, source_proj

#' Builds an R object containing info required to make subsets
#'
#' @param reference_csv \code{.csv} file containing vars to subset
#' @param reference_csv \code{.csv} file containing details of the variables to subset. Must contain
#' 5 columns: 'source_folder' specifying the folder from which to subset, 'source_table' specifying the
#' table from which to subset, 'target_folder' specifying the folder in which to create the subset,
#' 'target_table' specifying the name of the subset and 'variable' specifying the variable(s) to
#' include in the subset. Note that 'source_project' and 'target_project' are specified as arguments
#' to `armadillo.subset`.
#' @param vars Deprecated: use \code{reference_csv} instead
#'
#' @return A dataframe containing variables that is used for input in the
Expand Down
2 changes: 1 addition & 1 deletion man/armadillo.subset.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 6 additions & 1 deletion man/armadillo.subset_definition.Rd


19 changes: 19 additions & 0 deletions man/dot-get_linkfile_content.Rd


19 changes: 19 additions & 0 deletions man/dot-load_linked_table.Rd


128 changes: 61 additions & 67 deletions vignettes/creating_data_subsets.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -9,119 +9,113 @@ vignette: >



When researchers request access to your data they may in many cases not be granted access to the whole dataset, but only to a subset. In Armadillo, access is regulated on the project level, so you will need to create a new project using a subset of the data. Here are the required steps to create these subsets.

When researchers request access to your data, they may not be granted access to the whole dataset, but only to the
variables they will use in their project. In Armadillo, access is regulated at the project level, so you will need to create a view containing only these variables.

## Install and load the package
You need to install and load the package first to be able to create the subsets.
You first need to install and load the package to be able to create the subsets.



```r
install.packages("MolgenisArmadillo")
```



```r
library(MolgenisArmadillo)
```

## Logging in
In order to access the files you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
In order to access the files, you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.



```r
armadillo.login("https://armadillo-demo.molgenis.net/")
#> [1] "We're opening a browser so you can log in with code 5FLGYF"
```

A session will be created and the credentials are stored in the environment.
A session will be created and the credentials stored in the environment.

## Creating the subset
Let's assume you are in a consortium which has data that cannot be shared in its entirety with researchers. You want to share a subset of the whole dataset with certain researchers who applied for access to your data. There
are two ways that you can do this.

## Defining the subset
Let's assume you are in a consortium which has data that can not be shared in total to the researchers. You want to share a subset of the whole dataset with certain researchers that applied for access to your data.
### Specify the required variables in a separate .csv file

For each research project, we need to define a .csv file containing 3 columns:
For each research project, you first create a .csv file containing 5 columns:

| folder | table | variable |
| ------------ | ----------- | ------------- |
| 2_1_core_1_0 | yearly_rep | green_dist_ |
| 2_1_core_1_0 | yearly_rep | green_size_ |
| 2_1_core_1_0 | yearly_rep | green_access_ |
| source_folder | source_table | target_folder | target_table | variable      |
| ------------- | ------------ | ------------- | ------------ | ------------- |
| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_dist_   |
| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_size_   |
| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_access_ |

'folder' refers a folder within the master project; 'table' refers to the name of a table within this folder, and 'variable' refers to one or more variables within this table. Note that these columns need to be named exactly as above.
'source_folder' refers to a folder within the master project; 'source_table' refers to the name of a table within this folder;
'target_folder' refers to the name of the new folder within the target project; 'target_table' refers to the name of the
new table within 'target_folder'; and 'variable' refers to one or more variables within 'source_table'. ('source_project' and 'target_project' are specified later.)

Note that these columns need to be named exactly as above.
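As a minimal sketch, the .csv file could also be written from R rather than by hand. The rows below are illustrative assumptions, not values from a real project; only the five column names come from the description above. The example writes to a temporary path and reads the file back to confirm the header is as required.

```r
# Hypothetical rows for illustration only: the folder, table and variable
# names must match what actually exists in your own source project.
subset_rows <- data.frame(
  source_folder = "2_1-core-1_0",
  source_table  = "yearlyrep",
  target_folder = "core",
  target_table  = "year_rep",
  variable      = c("green_dist_", "green_size_", "green_access_"),
  stringsAsFactors = FALSE
)

# Write without row names so that only the five columns end up in the file.
csv_path <- file.path(tempdir(), "vars.csv")
write.csv(subset_rows, csv_path, row.names = FALSE)

# Read the file back and check that every required column name is present.
required_cols <- c(
  "source_folder", "source_table", "target_folder", "target_table", "variable"
)
written <- read.csv(csv_path, stringsAsFactors = FALSE)
missing_cols <- setdiff(required_cols, names(written))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Each row holds one variable, so a table with many variables simply repeats the first four columns.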

Once you have defined the tables, you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.

If you defined the tables then you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.


```r
subset_definition <- armadillo.subset_definition(
vars = "data/subset/vars.csv")
reference_csv = "data/subset/vars.csv")
subset_definition
#> # A tibble: 3 × 3
#> folder table vars_to_subset
#> <chr> <chr> <list>
#> 1 2_1-core-1_0 yearlyrep <tibble [15 × 1]>
#> 2 1_1-outcome-1_0 yearlyrep <tibble [9 × 1]>
#> 3 2_1-core-1_0 nonrep <tibble [7 × 1]>
#> # A tibble: 3 × 5
#> source_folder source_table target_folder target_table target_vars
#> <chr> <chr> <chr> <chr> <list>
#> 1 2_1-core-1_0 yearlyrep core year_rep <tibble [14 × 1]>
#> 2 1_1-outcome-1_0 yearlyrep outcome year_rep <tibble [9 × 1]>
#> 3 2_1-core-1_0 nonrep core non_rep <tibble [5 × 1]>
```
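The printed tibble has one row per table, with the variables for that table nested in a list column. The base R sketch below only illustrates that idea of collapsing per-variable rows into one entry per table; it is not the package's implementation, and the variable `sex` is a made-up example.

```r
# Hypothetical per-variable rows, as they might appear in the .csv file.
rows <- data.frame(
  source_folder = c("2_1-core-1_0", "2_1-core-1_0", "2_1-core-1_0"),
  source_table  = c("yearlyrep", "yearlyrep", "nonrep"),
  variable      = c("green_dist_", "green_size_", "sex"),
  stringsAsFactors = FALSE
)

# Group the variables by the folder/table they belong to. The result is a
# named list with one character vector of variables per table.
table_key <- paste(rows$source_folder, rows$source_table, sep = "/")
vars_by_table <- split(rows$variable, table_key)

# Number of variables requested from each table.
vars_per_table <- lengths(vars_by_table)
```

Here three input rows collapse into two groups, one for each distinct folder and table combination.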

After this you can create a new subset using the subset method on the Armadillo. First we are going to perform a dry-run to check whether the required folder, tables and variables are present.
After this, you can create the subset using the `armadillo.subset` function.



```r
not_available <- armadillo.subset(
armadillo.subset(
input_source = "subset_def",
source_project = "gecko",
new_project = "study1",
subset_def = subset_definition,
dry_run = TRUE
target_project = "study1",
subset_def = subset_definition
)
not_available
#> # A tibble: 23 × 3
#> folder table missing
#> <chr> <chr> <chr>
#> 1 2_1-core-1_0 yearlyrep green_dist_
#> 2 2_1-core-1_0 yearlyrep green_size_
#> 3 2_1-core-1_0 yearlyrep green_access_
#> 4 2_1-core-1_0 yearlyrep ndvi100_
#> 5 2_1-core-1_0 yearlyrep ndvi300_
#> 6 2_1-core-1_0 yearlyrep ndvi500_
#> 7 2_1-core-1_0 yearlyrep blue_dist_
#> 8 2_1-core-1_0 yearlyrep blue_size_
#> 9 2_1-core-1_0 yearlyrep blue_access_
#> 10 2_1-core-1_0 yearlyrep no2_
#> # ℹ 13 more rows
#> Created project 'study1' without users
#> ✔ All views were successfully created!
#> ✔ View 'core/year_rep' successfully created
#> ✔ View 'outcome/year_rep' successfully created
#> ✔ View 'core/non_rep' successfully created
```

This outputs a tibble with details of any variables that are missing within the actual data. You can check whether you suspected this or this is an anomaly.
This method is generally the best choice if you need to create subsets for multiple tables.

If you are confident that it will work you can run the subset method without dry_run.
### Specifying the subset via arguments
An alternative is to specify the subset in R, via arguments to the `armadillo.subset` function:


```r
armadillo.subset(
input_source = "arguments",
source_project = "gecko",
new_project = "study1",
subset_def = subset_definition,
dry_run = FALSE
source_folder = "2_1-core-1_0",
source_table = "yearlyrep",
target_project = "study2",
target_folder = "core",
target_table = "year_rep",
target_vars = c("occup_f1_", "occupcode_f2_", "edu_f1_", "edu_f1_fath", "edu_f2_", "edu_f2_fath", "pets_", "cats_", "cats_quant_", "dogs_")
)
#> Created project 'study1'
#> Compressing...
#> Uploaded 2_1-core-1_0/yearlyrep
#> Compressing...
#> Uploaded 1_1-outcome-1_0/yearlyrep
#> Compressing...
#> Uploaded 2_1-core-1_0/nonrep
#> # A tibble: 23 × 3
#> folder table missing
#> <chr> <chr> <chr>
#> 1 2_1-core-1_0 yearlyrep green_dist_
#> 2 2_1-core-1_0 yearlyrep green_size_
#> 3 2_1-core-1_0 yearlyrep green_access_
#> 4 2_1-core-1_0 yearlyrep ndvi100_
#> 5 2_1-core-1_0 yearlyrep ndvi300_
#> 6 2_1-core-1_0 yearlyrep ndvi500_
#> 7 2_1-core-1_0 yearlyrep blue_dist_
#> 8 2_1-core-1_0 yearlyrep blue_size_
#> 9 2_1-core-1_0 yearlyrep blue_access_
#> 10 2_1-core-1_0 yearlyrep no2_
#> # ℹ 13 more rows
#> Created project 'study2' without users
#> ✔ All views were successfully created!
#> ✔ View 'core/year_rep' successfully created
```
This method may be easier if you only need to create one small subset.

### Checking subsets
You can also check the files in the Armadillo user interface by opening it in a browser window.
84 changes: 52 additions & 32 deletions vignettes/creating_data_subsets.Rmd.orig
Original file line number Diff line number Diff line change
Expand Up @@ -14,72 +14,92 @@ knitr::opts_chunk$set(
)
```

When researchers request access to your data they may in many cases not be granted access to the whole dataset, but only to a subset. In Armadillo, access is regulated on the project level, so you will need to create a new project using a subset of the data. Here are the required steps to create these subsets.

When researchers request access to your data, they may not be granted access to the whole dataset, but only to the
variables they will use in their project. In Armadillo, access is regulated at the project level, so you will need to create a view containing only these variables.

## Install and load the package
You need to install and load the package first to be able to create the subsets.
You first need to install and load the package to be able to create the subsets.


```{r, install the package, eval = FALSE}
```{r eval = F}
install.packages("MolgenisArmadillo")
```

```{r, load the package}

```{r}
library(MolgenisArmadillo)
```

## Logging in
In order to access the files you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.
In order to access the files, you need to log in using the URL of the Armadillo server. A browser window will be opened where you can identify yourself with the ID provider.

```{r, login to armadillo}

```{r}
armadillo.login("https://armadillo-demo.molgenis.net/")
```

A session will be created and the credentials are stored in the environment.
A session will be created and the credentials stored in the environment.

## Creating the subset
Let's assume you are in a consortium which has data that cannot be shared in its entirety with researchers. You want to share a subset of the whole dataset with certain researchers who applied for access to your data. There
are two ways that you can do this.

## Defining the subset
Let's assume you are in a consortium which has data that can not be shared in total to the researchers. You want to share a subset of the whole dataset with certain researchers that applied for access to your data.
### Specify the required variables in a separate .csv file

For each research project, we need to define a .csv file containing 3 columns:
For each research project, you first create a .csv file containing 5 columns:

| folder | table | variable |
| ------------ | ----------- | ------------- |
| 2_1_core_1_0 | yearly_rep | green_dist_ |
| 2_1_core_1_0 | yearly_rep | green_size_ |
| 2_1_core_1_0 | yearly_rep | green_access_ |
| source_folder | source_table | target_folder | target_table | variable      |
| ------------- | ------------ | ------------- | ------------ | ------------- |
| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_dist_   |
| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_size_   |
| 2_1_core_1_0  | yearly_rep   | project1      | yearly_vars  | green_access_ |

'folder' refers a folder within the master project; 'table' refers to the name of a table within this folder, and 'variable' refers to one or more variables within this table. Note that these columns need to be named exactly as above.
'source_folder' refers to a folder within the master project; 'source_table' refers to the name of a table within this folder;
'target_folder' refers to the name of the new folder within the target project; 'target_table' refers to the name of the
new table within 'target_folder'; and 'variable' refers to one or more variables within 'source_table'. ('source_project' and 'target_project' are specified later.)

If you defined the tables then you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.
Note that these columns need to be named exactly as above.

```{r, create the subset definition object}
Once you have defined the tables, you can construct the `subset_definition`. This creates a tibble within R holding the details from the .csv file.


```{r}
subset_definition <- armadillo.subset_definition(
vars = "data/subset/vars.csv")
reference_csv = "data/subset/vars.csv")
subset_definition
```

After this you can create a new subset using the subset method on the Armadillo. First we are going to perform a dry-run to check whether the required folder, tables and variables are present.
After this, you can create the subset using the `armadillo.subset` function.

```{r, create a subset in dry-run mode}
not_available <- armadillo.subset(

```{r}
armadillo.subset(
input_source = "subset_def",
source_project = "gecko",
new_project = "study1",
subset_def = subset_definition,
dry_run = TRUE
target_project = "study1",
subset_def = subset_definition
)
not_available
```

This outputs a tibble with details of any variables that are missing within the actual data. You can check whether you suspected this or this is an anomaly.
This method is generally the best choice if you need to create subsets for multiple tables.

If you are confident that it will work you can run the subset method without dry_run.
### Specifying the subset via arguments
An alternative is to specify the subset in R, via arguments to the `armadillo.subset` function:

```{r, running the subset for real}
```{r}
armadillo.subset(
input_source = "arguments",
source_project = "gecko",
new_project = "study1",
subset_def = subset_definition,
dry_run = FALSE
source_folder = "2_1-core-1_0",
source_table = "yearlyrep",
target_project = "study2",
target_folder = "core",
target_table = "year_rep",
target_vars = c("occup_f1_", "occupcode_f2_", "edu_f1_", "edu_f1_fath", "edu_f2_", "edu_f2_fath", "pets_", "cats_", "cats_quant_", "dogs_")
)
```
This method may be easier if you only need to create one small subset.

### Checking subsets
You can also check the files in the Armadillo user interface by opening it in a browser window.