Many (if not all) data-analytical projects integrate raw input data from upstream producers with tools and scripts aiding in their interpretation. Where `R` is used as the platform for such analysis, the canonical `R` packaging infrastructure (as e.g. described here in detail) is well suited for this integration, and a package tracking input data and its analysis in parallel might (minimally) look something like this in the file system:
- `analysis.of.x.2017` (top-level directory named after the resulting package)
    - `DESCRIPTION` (standardized file containing basic information about the package)
    - `NAMESPACE` (file describing what functions/objects the resulting package provides and requires from other packages; see here for details)
    - `R` (directory containing `R` functions)
        - `Generalizable_Function_A.R`
        - `Generalizable_Function_B.R`
        - ...
    - `data` (directory containing `R` objects saved using `save(...)` and lazy-loaded on attaching the resulting package)
        - `raw_data.Rda`
    - `inst` (directory holding other material installed into `R`'s infrastructure along with the package)
        - `extdata` (directory holding compressed external raw data)
            - `raw_data.csv.zip`
        - `scripts` (directory holding workflow-documenting scripts)
            - `00_Script_Documenting_Data_Import.R`
            - `01_Script_Documenting_Exploratory_Preliminary_Analysis.R`
            - ...
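Such a layout is handled like any other `R` source package; a minimal sketch of building and attaching it with `devtools` (assuming the example directory above sits in the current working directory and `LazyData` is enabled in `DESCRIPTION`) might be:

```r
requireNamespace('devtools')
devtools::install('analysis.of.x.2017')  # build & install the example package
library(analysis.of.x.2017)              # attach it; 'raw_data' is now lazy-loaded
```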
After building and loading, such a package makes readily available from within `R` a) the raw data (`inst/extdata/raw_data.csv.zip`), b) its parsed, immediately `R`-accessible counterpart (`data/raw_data.Rda`, i.e. a `raw_data` object), c) generalizable custom `R` functionality (`R/...`), as well as d) workflow documentation (`inst/scripts/...`) in a neat, integrated and easily distributed unit. The structure is easily extended and further customized, e.g. to include a manuscript when using `rmarkdown` and/or `bookdown` (e.g. in `inst/manuscript/manuscript.Rmd`).
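Once attached, the individual components can be reached directly from `R`; an illustration (object and path names follow the example layout above):

```r
# illustrative only; names follow the example layout above
head(raw_data)                            # b) the lazy-loaded, parsed data object
system.file(
  'extdata', 'raw_data.csv.zip',
  package = 'analysis.of.x.2017')         # a) path to the shipped raw data
system.file(
  'scripts',
  package = 'analysis.of.x.2017')         # d) directory holding the workflow scripts
# c) functions from R/ are simply available by name, e.g. Generalizable_Function_A
```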
When using `R`'s packaging infrastructure in this manner, two major challenges are

- the complexity of establishing the project-specific infrastructure and
- obstacles to distributing the data underlying the documented project, which may, among other issues, be rooted in
    - size restrictions on shipped data in a targeted package repository (CRAN, for example, as of 23.10.2017 states "... As a general rule, neither data nor documentation should exceed 5MB.") and
    - the need to enforce data ownership/licensing while freely providing the analytical methodology and strategies employed for peer review.
Building heavily on the toolkit provided by the excellent `devtools` package, `datapackageR` aims to alleviate these difficulties and provides a simple interface for managing raw data and derived objects within the packaging infrastructure, emphasizing cryptographically strong data-integrity assurance (using `digest`) and mechanisms for the separate distribution of data and analyzing code.
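The integrity assurance rests on file checksums of the kind computed by `digest`; the following sketch illustrates the underlying idea only (it is not `datapackageR`'s API, and the algorithm choice is arbitrary):

```r
library(digest)

# toy data file standing in for any raw input
tmp <- tempfile(fileext = '.csv')
write.csv(head(iris), tmp, row.names = FALSE)

# record a checksum when the file is first included ...
stored_hash  <- digest::digest(tmp, algo = 'sha512', file = TRUE)

# ... and re-derive it after distribution/retrieval to verify the file is unchanged
current_hash <- digest::digest(tmp, algo = 'sha512', file = TRUE)
identical(stored_hash, current_hash)  # TRUE as long as the data are intact
```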
`datapackageR` is targeted to eventually be released on CRAN, but until that happens, and for the latest features and fixes, the following installation procedure should be followed:

```r
install.packages('devtools')
devtools::install_bitbucket('graumannlabtools/datapackageR')
```
A quick walk-through of the core functionality:

- Load the required tools

  ```r
  library(datapackageR)
  library(magrittr)
  library(readxl)
  requireNamespace('devtools')
  ```

- Define a temporary working directory & packaging root

  ```r
  pkg_root <- '~' %>%
    file.path('packagetest')
  ```

- Generate a (dummy) data file

  ```r
  data.frame(
      x   = 1,
      y   = 1:10,
      fac = sample(LETTERS[1:3], 10, replace = TRUE)) %>%
    write.table(
      file      = file.path(dirname(pkg_root), 'data_dummy.tsv'),
      sep       = '\t',
      col.names = TRUE,
      row.names = FALSE)
  ```
- Create the packaging skeleton, at the same time inserting the dummy data file from above:

  ```r
  data_catalogue <- init(
    objects_to_include = file.path(dirname(pkg_root), 'data_dummy.tsv'),
    root               = pkg_root,
    parsing_function   = 'read.csv',
    parsing_options    = list(sep = '\t', stringsAsFactors = FALSE))
  ```

- Investigate the resulting `data_catalogue`

  ```r
  str(data_catalogue)
  ```

- (Crudely) investigate the result in the file system

  ```r
  list.files(pkg_root, recursive = TRUE)
  ```
- Addition of other object classes:

  - Add a remote file that needs parsing (from Billing et al. (2016). Comprehensive transcriptomic and proteomic characterization of human mesenchymal stem cells reveals source specific cellular markers. Sci Rep 6, 21507; licensed under the Creative Commons Attribution 4.0 International License). Note that the file is excluded from built packages (via `.Rbuildignore`) and wouldn't be tracked by `git` (using `.gitignore`).

    ```r
    data_catalogue <- include_data(
      object_to_include    = 'http://www.nature.com/article-assets/npg/srep/2016/160209/srep21507/extref/srep21507-s4.xls',
      root                 = pkg_root,
      parsing_function     = 'read_excel',
      parsing_options      = list(skip = 1),
      package_dependencies = 'readxl',
      distributable        = FALSE)
    ```

  - Add a local object

    ```r
    local_data_object <- list(A = LETTERS, B = letters)
    data_catalogue <- include_data(
      object_to_include = 'local_data_object',
      root              = pkg_root)
    ```

  - Add a remote `*.Rda`

    ```r
    data_catalogue <- include_data(
      object_to_include = 'https://bitbucket.org/graumannlabtools/datapackager/downloads/remote_rda.Rda',
      root              = pkg_root)
    ```

  - Add a remote `*.Rds`

    ```r
    data_catalogue <- include_data(
      object_to_include = 'https://bitbucket.org/graumannlabtools/datapackager/downloads/remote_rds.Rds',
      root              = pkg_root)
    ```

- Investigate the resulting `data_catalogue` & file system structure

  ```r
  str(data_catalogue)
  list.files(pkg_root, recursive = TRUE)
  ```
- Remove one of the tracked data sets

  ```r
  data_catalogue <- remove_data(
    object_to_remove = 'data_dummy.tsv',
    root             = pkg_root)
  ```

- Investigate the resulting `data_catalogue`

  ```r
  str(data_catalogue)
  ```

- (Crudely) investigate the result in the file system

  ```r
  list.files(pkg_root, recursive = TRUE)
  ```

- (Remotely) install the result & use the internal functionality to test data integrity

  ```r
  devtools::install(pkg_root)
  devtools::test(pkg_root)
  ```
- Clean up the example package & packaging root

  ```r
  pkg_root %>%
    basename() %>%
    remove.packages()
  unlink(pkg_root, recursive = TRUE)
  ```

- Create an empty packaging skeleton:

  ```r
  pkg_root <- tempdir() %>%
    file.path('packagetest')
  data_catalogue <- init(
    root = pkg_root)
  ```

- Attempt to add (access-restricted) remote data

  ```r
  data_catalogue <- include_data(
    object_to_include = 'https://bitbucket.org/graumannlabtools/datapackager-restricted-access/downloads/remote_rda.rda',
    root              = pkg_root)
  # --> Can't access URL: Client error: (401) Unauthorized
  ```

- Add the remote data with authentication

  ```r
  data_catalogue <- include_data(
    object_to_include = 'https://bitbucket.org/graumannlabtools/datapackager-restricted-access/downloads/remote_rda.rda',
    root              = pkg_root,
    user              = 'datapackageR_user',
    password          = 'datapackageR_user',
    distributable     = FALSE)
  data_catalogue <- include_data(
    object_to_include = 'https://bitbucket.org/graumannlabtools/datapackager-restricted-access/downloads/remote_rds.Rds',
    root              = pkg_root,
    user              = 'datapackageR_user',
    password          = 'datapackageR_user',
    distributable     = FALSE)
  ```
- Investigate the resulting `data_catalogue`

  ```r
  str(data_catalogue)
  ```

- (Crudely) investigate the result in the file system

  ```r
  list.files(pkg_root, recursive = TRUE)
  ```

- Simulate a package infrastructure shared without the remote & authentication-protected data

  ```r
  list.files(
      pkg_root,
      recursive  = TRUE,
      pattern    = '^data[[:punct:]]remote',
      full.names = TRUE) %>%
    unlink()
  ```

- (Crudely) investigate the result in the file system

  ```r
  list.files(pkg_root, recursive = TRUE)
  ```
- Use `datapackageR` for attempted retrieval of the missing data

  ```r
  retrieve_missing_remote_data(pkg_root)
  # --> Can't access URL: Client error: (401) Unauthorized
  ```

- Repeat with proper authentication (& implicitly check the downloads against the stored hashes)

  ```r
  retrieve_missing_remote_data(
    pkg_root,
    user     = 'datapackageR_user',
    password = 'datapackageR_user')
  ```

- (Crudely) investigate the result in the file system

  ```r
  list.files(pkg_root, recursive = TRUE)
  ```

- Clean up the example package & packaging root

  ```r
  pkg_root %>%
    basename() %>%
    remove.packages()
  unlink(pkg_root, recursive = TRUE)
  ```