Arcus Project Template Orientation

Overview

@comment

Is this module right for me?

@long_description

Details

Estimated time to completion: @estimated_time

Prerequisites:

@pre_reqs

Learning Objectives:

@learning_objectives

Hi! This document is still under construction and testing. We apologize in advance for any broken links or unclear language. We invite your feedback. Please add a support ticket or email Arcus Library Science to let us know what we can improve or suggest additional topics.

Please note that many of the links here will only work if you're on the CHOP network.

Audience

This module introduces the CHOP community to the Project Template, a structured file directory for managing research data. It is useful to ANYONE at CHOP involved in creating, managing or analyzing research data.

The Project Template offers an easy-to-use and flexible structure for organizing data, and ensures necessary context is preserved for project files and documents. It provides a shared, documented framework for organizing a research effort. This achieves multiple goals:

  • It assists research teams in building transparency and reproducible workflows
  • It establishes a structure for easier long-term preservation
  • It preserves context needed for future reuse of research by collaborators or for the original researcher
  • It organizes a research project for archiving within Arcus
Learning connection

The Project Template is useful in creating reproducible, generalized, and reusable research. To learn more about these principles in research, check out the following module from the Arcus Education team.

In-progress data contributors

Thank you for agreeing to contribute your data. All contributed research data is arranged in CHOP's Project Template structure. This module describes each section of the Project Template structure, what data goes in each section, and shows examples of research data arranged in the template.

This module can be used as a reference while you navigate the archiving process; please reach out to the Digital Archivist if you have further questions. Please share this module with others on your research team involved with preparing the data contribution, or with other researchers who may be interested in archiving data with Arcus.

Future data contributors

This module is an overview of CHOP's Project Template structure; all contributed data is arranged in this structure for archiving in Arcus. The Project Template is useful at all stages of research, and we suggest implementing it as early as possible, as it provides a shared, documented framework for organizing a research effort.

Arcus's Library Science team is happy to meet with researchers at all phases of research for research data management consultations and planning for future archival contributions. This includes early in your research project, when we can help set up a project template file directory structure for storing data and recommend metadata and organization best practices. In addition to this module, there are additional data management resources available on CHOP’s Arcus resources page.

If after viewing this module, you are prepared to archive data with Arcus, please fill out a data contribution request to start the process.

Arcus Lab Users

The Project Template is added to all Arcus Scientific Computing Labs (Arcus Lab), so all you have to do is set up your workflows to conform to the template. This module describes all sections of the project template structure, what data goes in each section, and shows examples of research data arranged in the template.

As part of the Arcus Lab deployment, your team should have received a Project Template orientation from a member of the Library Science team. If you missed the orientation or need a refresher, please see this video. Much of the content covered in the video is also in this module.

When appropriate, archiving your research in Arcus is expected of Scientific Projects with an Arcus Lab. This is documented in the Arcus Terms of Use. Archiving is required if you would like to move any data created within an Arcus Lab to a new Scientific Project, or if other research teams would like to reuse your data.

When you are ready to archive your lab data, please submit the following request in the Arcus Help Center to begin the data contribution process.

Arcus Data Lifecycle

Arcus's goal for research data management and the project template is to provide tools that are relevant throughout the entire lifecycle of research data. The project template is designed to be adaptable and iterative, to capture the wide range of research activities at CHOP. It combines the flexibility to encompass diverse data capturing needs with a consistent structure for all archived data, which facilitates communication among projects spanning different domains and promotes effective data sharing.

How was this structure developed?

The CHOP project template file directory structure was adapted from DrivenData’s Cookiecutter Data Science template. It was adapted by former Arcus Digital Archivist, Christiana Dobrzynski, and former CHOP Bioinformatician, Perry Evans. Both Arcus’s and DrivenData’s templates aim to organize research data and tools for accuracy and reproducibility. See DrivenData's introduction to learn more about the goals and purpose of project template structures for data preservation and sharing.

The CHOP project template evolved through iterations and feedback from CHOP researchers. A multi-disciplinary group of practitioners was consulted during the template adaptation and development, including:

  • Bioinformatics
  • Cancer research
  • Microbiome center
  • Research IT
  • Clinical sequencing unit
  • Medical Informatics Unit

How is the Project Template used at Arcus?

The Project Template prioritizes streamlined archiving and reproducible research pathways. It is used to archive a wide range of research types from the Research Institute, making them discoverable through tools like Arcus Cohort Discovery, Gene, and the Arcus Variant Browser (all available as applications on the Arcus website). The Project Template facilitates organizing diverse research data in a single directory structure, enabling automated archiving, metadata management, and data delivery throughout the research data lifecycle. This file directory structure is used for the entire lifecycle of research data within Arcus:

Flowchart of the lifecycle of research data organized in the project template in Arcus. The main steps are delivering research data into Arcus labs in the project template structure, automated processing to archive the data, and redelivery of that data into an Arcus lab for reuse.

The project template provides a shared structure so that institutional knowledge previously held locally by various members of the data creation team becomes centralized.

The utility of the project template for lab drive organization and integration with the Arcus archives is summarized in the graphic below.

Flow chart of the workflow of archiving research data using the project template. The main steps are Organizing the file system, adding the research data, applying a standard pipeline, and automated Arcus data ingest

Project Template

The project template structure includes directories for capturing three major aspects of a research effort: the data (data), the tools needed to work with that data (access tools), and the contextual information needed to understand the effort and its constituent parts (contextual). The high-level directories are as follows (items marked with an asterisk are required):

  • Configs (contextual)
  • Data (data)*
  • Manifests (data)*
  • Models (access tools)
  • References (contextual)
  • Reports (contextual)
  • Requirements (contextual)
  • SRC (access tools)

Below is an image of the entire Project Template Directory, with more detail about each section:

Graphic representation of the Project template with a short explanation of all the different sections. These sections are described in detail later in this module.
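If you are building the template by hand outside an Arcus Lab, the directory skeleton can be created with a few lines of code. The sketch below is a minimal illustration using Python's standard library; the directory names come from the list above, and the project root `my_research_project` is a made-up example, not an Arcus convention.

```python
# A minimal sketch (not an official Arcus tool) that creates an empty
# Project Template skeleton. Directory names follow the list above.
from pathlib import Path

TEMPLATE_DIRS = [
    "configs",         # contextual: configuration files for workflows/applications
    "data/raw",        # data (required): original, unmodified source data
    "data/interim",    # data: intermediate processing and analysis outputs
    "data/endpoints",  # data: final results of the analysis
    "data/ref-data",   # data: external or public reference datasets
    "manifests",       # data (required): file and participant inventories
    "models",          # access tools: machine learning models
    "references",      # contextual: protocols, data dictionaries, IRB documents
    "reports",         # contextual: papers, figures, presentations
    "requirements",    # contextual: module/library dependencies
    "src/notebooks",   # access tools: Jupyter, WDL, CWL, etc.
    "src/scripts",     # access tools: custom software, code, tools
    "src/rules",       # access tools: computational workflow rules
    "src/tests",       # access tools: unit tests for code
]

for d in TEMPLATE_DIRS:
    Path("my_research_project", d).mkdir(parents=True, exist_ok=True)
```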

Research Data

The Project Template brings together three categories of information: Research Data, Access Tools, and Contextual Files. Research data is the data collected during the course of research and used for analysis. The manifests describe this data, crosswalking files to participants. Research data (with manifests) is the minimum required information for all Arcus data contributions.

Graphic representation of the Project template highlighting the research data directories described above. Research data directories are data and manifests.

Access Tools

Access Tools are the code used to do the analysis. This can include machine learning models, scripts, and Jupyter notebooks.

Graphic representation of the Project template highlighting the access tool directories, models and src. Subdirectories of src are notebooks, rules, scripts and tests.

Contextual Files

Contextual Files provide information needed to understand the data and analysis. This can include omics protocols, data dictionaries, reports and diagrams.

Graphic representation of the Project template highlighting the contextual files directories, configs, references, reports and requirements.

Project Template Directories

The next part of this module walks through each sub-directory of the project template in detail. Though the project template is flexible enough to handle a wide range of research data, its application and the file types in each directory will differ depending on the type of project. For this reason, we provide two different examples: clinical data and omics data. In many of the following sections, you can select the option to view examples and specific information for each data type.

File Naming and File Type Standards

Regardless of project type, Arcus follows industry-standard guidelines for digital archiving, applying these standards to incoming data contributions. File names should follow a consistent and clear schema, and should not contain any spaces, periods, or special characters. Further recommendations are below:

Whenever feasible, Arcus prefers to archive non-proprietary file formats as opposed to proprietary ones. Proprietary formats necessitate specific software for access or utilization, while non-proprietary formats are frequently open-source. Whenever you have the option, it's advisable to store data in a non-proprietary (open) file format. This choice enhances the accessibility of your content to others, enabling effortless reuse across various software platforms. Furthermore, this approach guarantees the continued utility of the file in the long term. In contrast, proprietary files carry the risk of becoming obsolete due to potential software incompatibility or restricted access.

Important note

When it is necessary to save files in a proprietary format, consider including a README file that documents the name and version of the software used to generate the file, as well as the company that made the software. This documentation can help down the road if we need to figure out how to open these files again.
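To illustrate the naming guidance above, here is a small check written in Python. The regular expression encodes one reasonable schema (letters, digits, hyphens, and underscores in the base name, periods only in the extension); it is an example, not an official Arcus validation rule.

```python
# Illustrative file name check: no spaces, no periods outside the extension,
# no special characters. The pattern is one reasonable schema; adjust it to
# your team's naming convention.
import re

VALID_NAME = re.compile(r"^[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)+$")

def is_valid_filename(name: str) -> bool:
    """Return True when the file name matches the schema."""
    return VALID_NAME.match(name) is not None

print(is_valid_filename("subject01_visit-2.csv"))   # True
print(is_valid_filename("sample01.fastq.gz"))       # True (multi-part extension)
print(is_valid_filename("final results (v2).csv"))  # False: space and parentheses
```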

Preferred File Formats

For both the clinical data and omics examples in the Project Template walk through, we reference our preferred data formats for each type of data. Below are some general resources to help in choosing file formats:

data/

The data folder is where the data files are organized. Data is the information collected during the course of research and used for analysis. The data directory maintains descriptions of authoritative source data and their associated files and metadata in both raw and processed formats. There are four sub-directories within the data folder for organizing the data: raw/, interim/, endpoints/, and ref-data/.

All files within the data/ folder and its subdirectories will be listed in the file_manifest.csv. The manifests are detailed in the manifests section of this course.

data/raw

This directory holds authoritative source data that should never be deleted. This folder is where the original, unmodified data for the research project is stored. In a research process, this is the data used for the initial analysis. Further sub-directories can be added to organize data, if necessary.

Within an Arcus Scientific Lab

  • Arcus delivered archival data will be found here.
  • Study team generated data brought into Arcus goes here.
  • This data is managed by Arcus, and should not be modified by the research team.

Raw data differs depending on the type of research. Please select what type(s) of data you would like more information about; you can select both:

  • omics data
  • clinical data
<script output="data_type">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type`) if(data_type[0]) { send.liascript(`## Omics Data ![Gif showing omics data arranged in the raw directory. Read files (cram/fastq/bam) are stored in data/raw.](media/project_template/raw_omics.gif) For omics data, the raw folder contains the sequencing data. Most sequencing providers will generate a [**fastq**](https://maq.sourceforge.net/fastq.shtml) file or [**cram**](https://samtools.github.io/hts-specs/CRAMv3.pdf) file, we prefer these file types for archiving. These files contain genomic sequences called reads. With paired reads, there are two fastq files per sample. Cram files are single files aligned to a reference genome. - We collect the compressed fastq form fastq.gz, which can be made with lossless compression utility tools like gzip. fastq metadata are described in the fastq directory. - Cram files are human readable and highly space efficient by using reference-based compression of sequence data. These files enable us to run a complete re-analysis of the data. Cram files require a companion index file, .crai. `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type`) if(data_type[1]) { send.liascript(`## Clinical Data ![Gif showing clinical data arranged in the raw directory. Data directly collected from subjects in CSV/TSV/XML or similar format is stored in data/raw.](media/project_template/raw_clinical.gif) For a registry, database, or any other type of clinical dataset the raw data will be the research directly collected from subjects whether managed by automated processes or via manual entry. This version of the dataset often contains identifiable information and is most critical for secondary use. - CSV and TSV formats are a great option for archiving flat file documents from an external data source to be contributed for archiving. - Structured data in the [HL7's FHIR format](https://www.hl7.org/fhir/) or another ontology in the XML or JSON standard are also a preferred option. **Within an [Arcus Scientific Lab](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/Arcus_Labs_Orientation/main/arcus_orientation.md#1)** - Clinical data is often delivered to users in the BigQuery format to optimize search capabilities and performance. This data is accessed using SQL pad, which is preloaded into the lab. For Electronic Health Record (EHR) data, we have an existing workflow with the Arcus Data Repository (ADR) team for preserving work in this format.
Important note
REDCap is a great application for clinical data projects of all sizes, available to all CHOP personnel. The REDCap team at CHOP has great resources for [data collection best practices](https://storage.googleapis.com/arcus-edu-libsci/PDFs/Best%20Practices%20for%20REDCap%20Data%20Collection.pdf) for new projects, and for how to [import data](https://storage.googleapis.com/arcus-edu-libsci/PDFs/REDCap_Data_Import_Instructions.pdf) residing in a different application for complete projects ready to be archived. If you automate data collection directly from patient encounters in the EHR, there are options to feed that data directly into [REDCap via an API](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/using_redcap_api/using_redcap_api.md#1). If you collect data in REDCap, there is an option both to tag data with an identifiability label at the onset of a project and to export data with all identifiable fields tagged.
`) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type`) if(!data_type[0] & !data_type[1]) { send.liascript(`**Nothing is selected**
`) } else send.clear() } catch(e) { } </script>

data/interim

The interim directory is for storing outputs of data processing and analysis completed using the original, unmodified data stored in data/raw. It is generally used for files that do not need to be stored long-term. Further sub-directories can be added to organize data, if necessary.

Within an Arcus Scientific Lab

  • Data in this directory is managed by the study team.
  • It should be used as an unregulated space for intermediate and temporary files.
  • We recommend establishing retention schedules for regular review and clean-up of data in this folder.

Interim data differs depending on the type of research. Please select what type(s) of data you would like more information about; you can select both:

  • omics data
  • clinical data
<script output="data_type_interim">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_interim`) if(data_type[0]) { send.liascript(` ## Omics Data ![Gif showing omics data arranged in the interim directory. Metrics and reports (recalibration reports and QC metrics) produced as part of a bioinformatics analysis workflow stored in data/interim.](media/project_template/interim_omics.gif) This directory is for the quality control and other reporting created during a bioinformatics workflow, using the sequence files stored in the \\_data/raw\\_ directory. - QC metrics Quality Control (QC) metrics are reported at various stages of analysis pipelines and give information about the quality of the data generated. QC metrics files should be in a tabular file format, with .type\\_metrics or .duplicate\\_metrics as the extension. `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_interim`) if(data_type[1]) { send.liascript(` ## Clinical Data ![Gif showing clinical data arranged in the interim directory. Scratch or alternatively formatted data (CSV/JSON/XML or similar) created during analysis or sharing stored in data/interim.](media/project_template/interim_clinical.gif) This directory is for practice work generated during clinical research when analyzing and sharing original, unmodified data saved in \\_data/raw\\_. This can provide be a good insight into the research process and will be archived on a case by case basis. Additionally, alternatively formatted data, or excluded data can be saved in the interim directory. Data should be saved as a tsv, csv, xml or json file if possible. `) } else send.clear() } catch(e) { } </script>

data/endpoints

The endpoints directory holds the final results created as part of a research analysis. Often, these are files created to support papers or grants, and other dissemination. Further sub-directories can be added to organize data, if necessary.

Within an Arcus Scientific Lab

  • Data in this directory is managed by the study team.
  • Data in this directory will be saved if the project is archived in Arcus.

Endpoints data differs depending on the type of research. Please select what type(s) of data you would like more information about; you can select both:

  • omics data
  • clinical data
<script output="data_type_endpoints">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_endpoints`) if(data_type[0]) { send.liascript(` ## Omics Data ![Gif showing omics data arranged in the endpoints directory. Gene sequence variation files (vcf or gvcf) created using a bioinformatics workflow stored in data/endpoints.](media/project_template/endpoints_omics.gif) This directory is for files created at the end of a bioinformatics workflow, like gvcf and vcf files. - **gvcf** files contain variant information, describing genomic regions with no variants for a single sample. They are used to compare variant calls across samples to make vcf files. We collect them to enable the easy construction of larger cohorts. We collect the compressed version of a gvcf file (g.vcf.gz) which can be made with a lossless compression utility tool like [gzip](https://www.gzip.org). All gvcf files should include an index file, .tbi - **vcf** files usually contain multiple samples, and are the starting point for most research project's analysis. They are not appropriate for constructing cohorts from multiple projects because they are missing necessary information contained in the gvcf files. We collect the compressed version of a vcf file (vcf.gz), which can be made with a lossless compression utility tool like [gzip](https://www.gzip.org). The variant calls in these files might differ from those in a gvcf file because they are made by considering information from all samples. A researcher would want to use these files when considering each project separately. All vcf files should include an index file, .tbi `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_endpoints`) if(data_type[1]) { send.liascript(` ## Clinical Data ![Gif showing clinical data arranged in the endpoints directory. Analyzed or deidentified versions of data (CSV/JSON/XML or similar) stored in data/endpoints.](media/project_template/endpoints_clinical.gif) This directory contains an analyzed version of a dataset or deidentified datasets. - Sometimes these files can be more cohort scoped or present a refinement into initial research questions. - This data is highly valuable for reuse because often errors in the clinical record emerge from this process and can be illustrative for researchers in an overlapping or similar specialty. - Data should be saved as a tsv, csv, xml or json file if possible. `) } else send.clear() } catch(e) { } </script>

data/ref-data

This directory is for any external or public datasets, not created by the study team, that are necessary to understand or repeat the analysis for the project.

Within an Arcus Scientific Lab

  • External or public datasets not supplied by Research IS or your lab, such as census data, will be available in this directory.

Ref-data differs depending on the type of research. Please select below if you need more information about either data type for this directory:

  • omics data
  • clinical data
<script output="data_type_ref">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_ref`) if(data_type[0]) { send.liascript(` ## Omics Data - As cram files are a compressed format, some information is needed as a separate file. If possible we collect a fasta file for the reference genome used. The fasta file describes offsets for each contig, to compute exactly where to find a particular reference base at specific genomic coordinates. Each fasta file requires an index file as fasta.fai - bed files are a text file format used to store genomic regions as coordinates and associated annotations, see [documentation](https://samtools.github.io/hts-specs/BEDv1.pdf) for more information on the format. If available, we collect this file as a .bed extension. `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_ref`) if(data_type[1]) { send.liascript(` ## Clinical Data Arcus Archives does not collect clinical reference data sets but there are ways to tap into the power of these datasets within Arcus. - The Arcus Applied Data Sciences team offers [GIS Data](https://chop.alationcloud.com/domain/30/) of many high value publicly available regularly refreshed Geographical Information System Datasets. - There is also an option via [Jira Service Desk](https://jira.arcus.chop.edu:8443/servicedesk/customer/portal/6/create/355?q=upload&q_time=1696276238126) to upload reference data directly to your lab following disclosure and approval. `) } else send.clear() } catch(e) { } </script>

manifests/

Manifests are an inventory of all data in the collection, and provide a mapping between research data in the data folders and participant information. The manifests also create a mapping between data and associated pipeline and technical information about workflows. Three main manifests are mandatory for every archival collection:

  • file_manifest.csv
  • participant_manifest.csv
  • participant-crosswalk.txt

Additional manifests are only required if needed for the data or collection type. These files are detailed in the next sections. The graphic below illustrates the linking between the files:

ID Crosswalks between file_manifest.csv, participant_manifest.csv and participant-crosswalk.txt

Within an Arcus Scientific Lab

  • These are managed by Arcus; you will not need to create them yourselves.
  • They will only appear in the lab if archival data is delivered.

manifests/file_manifest

The file_manifest.csv matches the biosample_id to each file in the data folders. Below is more detail about each column in the file; a small generation sketch follows the list:

  • biosample_id is an ID number for each file. For some studies, each file is derived from a specific biosample, so we suggest using the sample ID. Ideally, biosample_id links to the CHOP biobank. When you cannot link to the biobank, treat biosample_id as the IDs you use for samples taken from participants. For studies where there are no biosamples, the biosample_id can be the file name.
  • file_type is the type of file, indicated by the file extension
  • protocol is only for omics data; select the omics data example below for more information
  • file_path is the file path for each file in the data/ folders. File paths should start with data/ and end with the full file name with extension
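As a concrete illustration, the sketch below walks the data/ folders and writes a minimal file_manifest.csv with the three columns common to all contributions. It is a simplified, hypothetical example: the biosample_id fallback (the file name) follows the guidance above, and omics contributions would add the protocol and file-group columns described below.

```python
# Hypothetical sketch: generate a minimal file_manifest.csv by walking the
# data/ folders. Real manifests may need more columns (e.g. protocol and
# file groups for omics data); paths and IDs here are illustrative.
import csv
from pathlib import Path

rows = []
for path in sorted(Path("data").rglob("*")):
    if path.is_file():
        rows.append({
            # Use the sample ID when files derive from biosamples; fall back
            # to the file name when there are no biosamples.
            "biosample_id": path.stem,
            "file_type": path.suffix.lstrip("."),  # type indicated by extension
            "file_path": path.as_posix(),          # paths start with data/
        })

with open("manifests/file_manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["biosample_id", "file_type", "file_path"])
    writer.writeheader()
    writer.writerows(rows)
```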

The file_manifest.csv may look different depending on the type of research. Please select below if you need more information about either omics or clinical data for this directory:

  • omics data
  • clinical data
<script output="data_type_fm">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_fm`) if(data_type[0]) { send.liascript(`## Omics Data For Omics data, the **file\\_manifest.csv** matches biosample IDs to data files and experimental protocols, described in _yaml_ files. More information about the protocols files is available in the references section of this module. Many files might share the same experimental protocol. These _yaml_ protocol files describe experiment and data processing details. See below for the columns in a file\\_manifest: | column | definition | type | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ | | biosample\\_id | This ID links to PARTICIPANT\\_MANIFEST. | String | | file\\_type | Each experiment template has a list of required file types. Use those terms. | String | | protocol | Each experiment template has protocol yaml files or capture kit information used to describe experiment metadata. This column points to the file path of the protocol or the capture kit information for this file. Paths should start with references/procotols/ or data/ref-data/platform-data | String | | file\\_path | Use one file path per row. It should start with data/. | String | | file\\_groups | Files in the same group are related. Paired fastq files belong in the same group. A bam file and its index belong in the same group. Plink files belong in the same group. | String | | derived\\_from\\_file\\_group | This column describes relations between file groups. We want to capture consecutive pipeline steps. For example, a bam file is derived from a paired fastq group. Use the name of the file_groups used to construct this file. Delimit multiple groups with a semicolon. Use NA when there are no prior step files to reference. | String | `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_fm`) if(data_type[1]) { send.liascript(` ## Clinical Data Since clinical research efforts don't always collect biospecimen data, the columns in the file\\_manifest are simpler than an Omics contribution. See below for the columns required in the file\\_manifest: | column | definition | type | | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ | | biosample\\_id | This ID links to PARTICIPANT\\_MANIFEST. | String | | file\\_type | Each experiment template has a list of required file types. Use those terms. | String | | file\\_path | Use one file path per row. It should start with data/. | String | `) } else send.clear() } catch(e) { } </script>

manifests/participant_manifest

The participant_manifest.csv identifies which participant's information links to each of the files in the file_manifest. Below is more detail about each column of the file; a small consistency-check sketch follows the list:

  • local_participant_id is a local identifier the study team used to identify the patient
  • The biosample_id will be the same as the one listed in the file_manifest.csv. Linking a local_participant_id to a biosample_id identifies which patient's information is related to the file.
  • cohort is optional; please fill this in if additional cohort information or identification is needed.
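Because the two manifests link through biosample_id, a quick consistency check can catch files that cannot be traced to a participant. The sketch below is illustrative only, assuming both manifests live in a manifests/ folder:

```python
# Illustrative consistency check (not an Arcus tool): every biosample_id in
# file_manifest.csv should also appear in participant_manifest.csv so each
# data file can be traced back to a participant.
import csv

def ids_from(path: str, column: str) -> set:
    """Collect the values of one column from a CSV manifest."""
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}

file_ids = ids_from("manifests/file_manifest.csv", "biosample_id")
participant_ids = ids_from("manifests/participant_manifest.csv", "biosample_id")

orphans = file_ids - participant_ids
if orphans:
    print("Files with no matching participant:", sorted(orphans))
```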

The participant_manifest.csv may look different depending on the type of research. Please select below which type of data you need more information about for this directory:

  • omics data
  • clinical data
<script output="data_type_pm">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_pm`) if(data_type[0]) { send.liascript(`## Omics Data The **participant\\_manifest.csv** matches participants (or patients) to cohorts and biosample IDs from the **file\\_manifest.csv**. Ideally, biosample\\_id links to the CHOP biobank. When you cannot link to the biobank, treat biosample\\_id as the IDs you use for samples taken from participants. If you deal with only one sample type, you might use the participant\\_id. If you run a treatment/control experiment, you might use participant\\_id and treat participant\\_id\\_control as as a biosample ID scheme. If you work with different tissue samples from participants, you might use participant\\_id\\_tissue as a biosample ID scheme. | column | definition | type | | -------------------- | ----------------------------------------------------------------------------------------------------------------------------- | ------ | | local\\_participant\\_id | This ID uniquely defined a person, and can be linked to an MRN. | String | | cohort | Use this column to group participants into cohorts that will be compared (For example, case vs healthy control). | String | | biosample\\_id | Ideally, this ID can link to the CHOP biobank. When this is not possible, use the sample ID from your project. | String | | family\\_id | When participants are related, use family_id to group related participants. With trios or duos, the proband ID is often used. | String | | family\\_role | Use a term from [eHB\\_relationship\\_types\\_as\\_of\\_10\\_30.json](https://github.research.chop.edu/arcus/rdm-project-template/blob/master/manifests/data_dicts/eHB_relationship_types_as_of_10_30.json) to indicate mother, father, proband, sister, etc.. | String | `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_pm`) if(data_type[1]) { send.liascript(`## Clinical Data Since clinical research efforts don't always collect biospecimen data, you may not use a Biorepository sample ID. When you cannot link the the biobank, treat instance\\_id as the IDs you use for samples taken from participants. The Epic Patient ID (start with Z) can be used as the local\\_participant\\_id. The list of required files we collect for this file are as follows: - participant\\_manifest.csv - local\\_participant\\_id - cohort - instance\\_id `) } else send.clear() } catch(e) { } </script>

manifests/participant_crosswalk

The participant-crosswalk.txt manifest is a tab delimited file with no header that links local_participant_id in the participant_manifest.csv to MRN (Medical Record Number). See below for the terms in the file:

| column | definition | type | notes |
| ------ | ---------- | ---- | ----- |
| local_id_type | The type of participant ID (local). | String | This will always be local. |
| local_participant_id | ID that is used in PARTICIPANT_MANIFEST. | String | |
| auth_id_type | The type of participant ID (CHOP). | String | This will always be chop. |
| auth_participant_id | Authoritative ID of the participant (often the MRN). | String | Use an 8-digit MRN. Left-pad the MRN with zeroes as necessary. |
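As a worked example of the format above, the sketch below writes a tab-delimited participant-crosswalk.txt with no header and left-pads each MRN to 8 digits. All IDs and MRNs shown are invented for illustration:

```python
# Hypothetical example of writing participant-crosswalk.txt: tab-delimited,
# no header, MRNs left-padded to 8 digits. All IDs below are invented.
crosswalk = [
    ("local", "participant1", "chop", "123456"),
    ("local", "participant2", "chop", "7654321"),
]

with open("manifests/participant-crosswalk.txt", "w") as f:
    for local_id_type, local_id, auth_id_type, mrn in crosswalk:
        # zfill pads "123456" to "00123456", as described above
        f.write("\t".join([local_id_type, local_id, auth_id_type, mrn.zfill(8)]) + "\n")
```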

manifests/participant_family_role

The participant_family_role.csv file is only needed for some omics data. If you have family data (i.e., sequencing data from related family members), use this file to describe relationships. See below for an example.

| local_participant_id | local_relative_id | relative_family_role |
| -------------------- | ----------------- | -------------------- |
| participant1 | participant2 | biological mother |
| participant2 | participant1 | biological son |
| participant1 | participant3 | biological father |
| participant3 | participant1 | biological son |
| participant1 | participant4 | biological sister |
| participant4 | participant1 | biological brother |

| column | definition | type |
| ------ | ---------- | ---- |
| local_participant_id | The local ID of a participant. | String |
| local_relative_id | The local ID of a relative of the participant. | String |
| relative_family_role | The familial relationship of the relative to the participant. Use terms from eHB_relationship_types_as_of_10_30.json. | String |

manifests/familyid_crosswalk

  • This manifest should be used with trio and cohort omics data. A trio contains three participants; a cohort can contain hundreds. This file crosswalks the name of the trio or cohort file with the local_participant_ids included in it.
  • This is also called a PED or pedigree file in bioinformatics workflows. A pedigree is a structured description of the familial relationships between samples; see this link for more information.
| family_id | individual_id | paternal_id | maternal_id | sex |
| --------- | ------------- | ----------- | ----------- | --- |
| LML100 | 101354 | | | 2 |
| LML100 | 101355 | | 101354 | 1 |
| LML101 | 102454 | | | 2 |
| LML101 | 102455 | 102456 | 102454 | 1 |
| LML101 | 102456 | | | 1 |
| LML102 | 103767 | | | 1 |
| LML102 | 103768 | | | 2 |
| LML102 | 103769 | 103767 | 103768 | 2 |
| LML103 | 108976 | 108977 | 108978 | 1 |
| LML103 | 108977 | | | 1 |
| LML103 | 108978 | | | 2 |
| LML104 | 104666 | 104667 | 104668 | 2 |
| LML104 | 104667 | -9 | -9 | 1 |
| LML104 | 104668 | -9 | -9 | 2 |

| column | definition |
| ------ | ---------- |
| family_id | Required; the family_id for the trio data. |
| local_participant_id | Required; the local_participant_id for each of the participants included in the trio. |
| paternal_id | Optional; the local_participant_id for the father of the participant. |
| maternal_id | Optional; the local_participant_id for the mother of the participant. |
| sex | Optional; the sex of the participant: 1 for male, 2 for female. |
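If you need to work with this file programmatically, a reader might look like the sketch below. The file name familyid_crosswalk.txt and the treatment of blank or -9 parent IDs as "unknown" are assumptions based on the example table above:

```python
# Illustrative reader for a tab-delimited, PED-style familyid_crosswalk.
# The file name and the handling of missing parents are assumptions.
import csv

UNKNOWN = {"", "-9"}

with open("manifests/familyid_crosswalk.txt", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        father = None if row["paternal_id"] in UNKNOWN else row["paternal_id"]
        mother = None if row["maternal_id"] in UNKNOWN else row["maternal_id"]
        print(row["family_id"], row["individual_id"], "father:", father, "mother:", mother)
```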

manifests/file_derivation.csv

The file_derivation.csv manifest is only required for omics contributions with multiple filetypes generated through a bioinformatics pipeline or workflow.

file_derivation.csv describes the relationships between files in a pipeline or workflow.

| column | definition | type |
| ------ | ---------- | ---- |
| destination_file_group | The files in this file group are derived from source_file_group. | String |
| source_file_group | File group used to derive the destination_file_group. | String |
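An invented example may help: in a pipeline where paired fastq files are aligned into a bam and variants are then called into a vcf, file_derivation.csv could be written like this (all group names are hypothetical):

```python
# Invented example rows for file_derivation.csv: a bam group derived from a
# paired-fastq group, then a vcf group derived from the bam group.
import csv

rows = [
    {"destination_file_group": "sample1_bam", "source_file_group": "sample1_fastq_pair"},
    {"destination_file_group": "cohort_vcf",  "source_file_group": "sample1_bam"},
]

with open("manifests/file_derivation.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["destination_file_group", "source_file_group"])
    writer.writeheader()
    writer.writerows(rows)
```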

manifests/env manifest

For each script/notebook in src/, and each model in models/, there should be an env file (here env refers to a file named env with any extension, so env.yaml or env.txt, for example) that describes the environment in which it was created or run. Environment files should be named as follows: descriptiveName_env, and placed in a folder called environments within the configs/ directory. Either individual files or entire folders (whichever is the appropriate level) of the scripts and notebooks within the src/ directory, or of the models/ directory, will need to be added to the env manifest file, matching them with their related environment file. See below for more information about this file; a small row-building sketch follows the table:

| column | definition | type |
| ------ | ---------- | ---- |
| programming_filegroup | Enter the highest-level folder that the environment file relates to. If the file relates to an entire directory, put the whole directory file path. If the file relates to a subdirectory, enter that file path. If it relates to a single file, enter the file path and file name. | String |
| related_environment | Enter the environment file name. Some environment files will be entered multiple times, as they relate to multiple files. | String |
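A small sketch of building the env manifest follows. The file-to-environment pairs are invented, and the manifest location manifests/env_manifest.csv is an assumption based on this section's name; the environment file names follow the descriptiveName_env pattern above:

```python
# Hypothetical sketch of an env manifest: pair scripts/notebooks in src/ and
# models in models/ with their environment files. The pairs and the manifest
# location are assumptions for illustration.
import csv

pairs = [
    ("src/notebooks/",            "analysisNotebooks_env.yaml"),  # whole folder
    ("src/scripts/clean_data.py", "cleanData_env.yaml"),          # single file
    ("models/",                   "cohortModel_env.yaml"),        # whole directory
]

with open("manifests/env_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["programming_filegroup", "related_environment"])
    writer.writerows(pairs)
```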

src/

The src, or sources, folder stores the access tools required to work with the research data and repeat the analysis. The need for access tools depends on the type of research; not all research has rules, scripts, or notebooks. Any scripts saved in the src folder require an environment manifest to document the computing environment the code is run in; see the environment manifest section of this module for more information. Subdirectories can be customized and added as needed. Below are the common directories used in scientific research:

  • notebooks: Jupyter, Beaker, Zeppelin, WDL, CWL etc.
  • scripts: custom software, code, tools
  • rules: for computational workflows
  • test: unit testing for code, customizable to team needs
Important note
Version Control is important when working collaboratively with access tools like scripts and workflows. For a description of version control and version control systems, see the Arcus Education module, [Intro to version control](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/git_intro/git_intro.md#1)

src files may look different depending on the type of research. Please select below which type of data you need more information about for this directory:

  • omics data
  • clinical data
<script output="data_type_src">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_src`) if(data_type[0]) { send.liascript(`## Omics Data [Workflow Development Language](https://terra.bio/deciphering-a-mystery-workflow-written-in-wdl/) or WDL and [Common Workflow Language](https://www.commonwl.org) or CWL are open standard tools for managing computationally intensive bioinformatics workflows. These are not definitionally programming languages but more clearly and interoperably explain parameters for running complex omics command line operations across bioinformatics pipelines. WDL and CWL are commonly used in bioinformatics pipelines to describe and share data processing and analysis workflows. For example, the diagram below displays a Genomic Data Commons Pipeline that converts reads (CRAM or BAM) to FASTQ and (re)aligns them to the latest human reference genome. ![Genomic Data Commons Pipeline that converts reads (CRAM or BAM) to FASTQ and (re)aligns them to the latest human reference genome ](media/sample_gatk_WDL.png) Workflows documented in WDL or CWL are incredibly useful for both understanding and recreating a bioinformatics workflow. These scripts should be saved in the _src/notebooks_ directory. `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_src`) if(data_type[1]) { send.liascript(`## Clinical Data Reproducibility in research is a major goal of the Arcus program and organizing and documenting code so that it can be used beyond the confines of the originally collected dataset is critical to achieving this aim. See the [DART Module on Reproducibility in Research](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/reproducibility/reproducibility.md#5) for more information on this topic. Below are some examples of the type of information commonly saved in the _src/_ directory: - History of saved queries in SQLPad can be extracted directly from the sqlite database into the _src/scripts_ directory - Code written in the command line interface can be saved in the _src/scripts_ directory - Work done in the R or Jupyter notebook applications can be saved in _src/notebooks_ directory - Though Arcus prefers to archive non-proprietary filetypes, some common data analysis programs are not easily exported out of their proprietary formats, such as Stata or SAS. If you work in Stata or SAS, a non proprietary text version of your .dta or .SAS file can be saved in the src folder so that your analysis workflow can be accessed and interpreted by a secondary user. Additionally for Sata and SAS analysis, if possible export the workflow steps as a txt file. `) } else send.clear() } catch(e) { } </script>

models/

The models directory is for saving any type of machine learning model, as well as model predictions, model summaries, and data sheets for model training data. Please consult with the Library Science team on your specific model type for more information about formats and directory structure for archiving.

references/

This directory contains general information about the research effort such as IRB documents, reference papers, sample information, lab prep information, and data dictionaries. This directory holds the technical information needed to understand the research data. Further subdirectories can be customized depending on the collection.

References files may look different depending on the type of research. Please select below what type of data you need more information about for this directory:

  • omics data
  • clinical data
<script output="data_type_refer">"@input"</script> <script modify="false"> try { let data_type = @input(`data_type_refer`) if(data_type[0]) { send.liascript(` ## Omics Data For Omics data, protocols are the metadata that document the processes, tools and standards used to generate sequencing data. Protocols are important to complete, as the information will be needed for future analysis, pipelines or workflows with the data. Arcus documents the protocol information in structured [yaml](https://yaml.org/spec/1.2.2/) files. The YAML structure and information captured depends on the sequencing method (such as High-Throughput sequencing, Microarray or Metabolics) and the file type (like FASTQ, CRAM, VCF, etc.). High-Throughput sequencing data includes whole genome sequencing, whole exome sequencing and RNA-seq data. Microarrays include SNP data. A new yaml file should be created whenever the process, tools or standards are different for a set of files. The protocol yaml is linked to the files in the file_manifest.csv file. Below are descriptions of the information requested in the protocols, a template for all of these is downloadable in a public GitHub repository, [arcus/omics-protocols](https://github.research.chop.edu/arcus/omics-protocols). This repository also describes the metadata collected, and provides resources for finding the information needed. `) } else send.clear() } catch(e) { } </script> <script modify="false"> try { let data_type = @input(`data_type_refer`) if(data_type[1]) { send.liascript(` ## Clinical Data If you are contributing a dataset, you should include a data dictionary in the _references/_ directory, in a subdirectory titled _data\\_dictionary_. A data dictionary documents the scope, purpose, and nuance of the data you are collecting. Some data dictionaries are extensively detailed, but even a basic minimal data dictionary is better than none at all. Data dictionaries are usually used for tabular datasets, but can be used for data in other formats as well. Below are the fields often included in your data dictionaries. Only a few are considered truly **required** - the rest are optional but can be extremely helpful, so you should consider whether they make sense to collect in your case. You may also have additional fields to include that are not listed here; you know your data best! - If your data model follows a specific ontology it is crucial to denote that in your included documentation. - NIH has fantastic tools through its Unified Medical Language System (UMLS). These include [extensive vocabulary lists of nearly 200 ontologies](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html) and a [metathesaurus application](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html) that crosswalks between validated ontologies. **Name​**: (required) provide the name of the data element you are describing as it appears in the dataset. **Description**​: (required) provide a brief description of the data element. Things to include here if applicable: source of data element, units of measure, formulas for calculated fields, nuances of data capture environment, anything else relevant to understanding what this field means and how to use it. **Human-readable Name**​: (optional) provide a human readable name / title for the data element. This can be handy if the names in your dataset are hard to parse or hard to understand. **Type**​: (optional) indicates the type of data i.e. numeric, string, date, etc. 
If you are collecting data in a database that enforces types, sometimes this can be easily extracted. This is useful to know for future transformation or integration of datasets **IsNull**​: (optional) this a yes/no field that indicates whether the item can be null (absent of information) or not. If this is set to “no”, this indicates data should ​always​ be present in the field, and is helpful to users who may wonder whether an absence of data is to be expected or indicative of a problem **Values**​: (optional) if there is a restricted list of values that can populate this field, include that here. An example might be “eye color” with the value list “blue, brown, grey”. Providing the list of values helps users understand why data may be absent. E.g. are there no green-eyed people in the dataset because there were none in the population or because “green” was not one of the available values for this field? **Date added**​: (optional) indicate the date on which this particular field was added to the dataset. This can be useful for understanding why data may be absent. **Date removed / Date deprecated**​: (optional) indicate the date on which this particular field ceased to be collected as part of the dataset. This can be useful for understanding why data may be absent. **Relationships**​: (optional) indicate whether the data element is related to other data elements in your data set (in relational database terms, this is where you would indicate a “foreign key”) **Additional Documentation** - In addition to documenting data fields, your data dictionary should also include an overarching description of the dataset itself. - If you have more than one dataset, or more than one data table, include a description for each. The descriptions for data tables and datasets should include some idea of the scope of data included, as well as the purpose in collecting or creating the data and its intended use. #### Sample Data Dictionary from the [PEDSnet Data Contribution](https://chop.alationcloud.com/article/3810/) | table_name | field_name | description | type | phi | ordinal\\_position | crosswalk\\_needed | crosswalk\\_note | | ---------- | ------------------------ | ---------------------------------------------------------------------------------------------------------- | ---------- | --- | ---------------- | ---------------- | -------------- | | person | person\\_id | A unique identifier for each person; this is created by each contributing site. | BigInteger | 1 | 1 | 1 | | | person | gender\\_concept\\_id | A foreign key that refers to a standard concept identifier in the Vocabulary for the gender of the person. | Integer | 0 | 2 | 0 | | | person | gender\\_source\\_concept\\_id | A foreign key to the gender concept that refers to the code used in the source. 
| Integer | 0 | 3 | 0 | | #### Sample Explanation of Tables from the PEDSnet Data Contribution | name | description | | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | adt\\_occurrence | The adt\\_occurrence table contains information about distinct admission, discharge, or transfer events that occur as part of a clinical visit. The typical use case is to identify portions of an inpatient admission that represent different levels of care or locations within a facility, but it can be used for additional characteristics of a visits (e.g. specialty consultation). The time of each event must fall between the start and end times of the associated visit_occurrence. | | care\\_site | The Care Site domain contains a list of uniquely identified physical or organizational units where healthcare delivery is practiced (offices, wards, hospitals, clinics, etc.). | | concept\\_ancestor | The CONCEPT\\_ANCESTOR table is designed to simplify observational analysis by providing the complete hierarchical relationships between Concepts. Only direct parent-child relationships between Concepts are stored in the CONCEPT\\_RELATIONSHIP table. To determine higher level ancestry connections, all individual direct relationships would have to be navigated at analysis time. The CONCEPT\\_ANCESTOR table includes records for all parent-child relationships, as well as grandparent-grandchild relationships and those of any other level of lineage. Using the CONCEPT\\_ANCESTOR table allows for querying for all descendants of a hierarchical concept. For example, drug ingredients and drug products are all descendants of a drug class ancestor. This table is entirely derived from the CONCEPT, CONCEPT\\_RELATIONSHIP and RELATIONSHIP tables. | `) } else send.clear() } catch(e) { } </script>

reports/

The reports directory holds published papers and content used for producing papers, presentations, websites, metrics, etc. It can additionally hold the following information:

  • Figures & tables: generated metrics and graphics for supporting reports
  • Log.md: computational notebook (if one was used to create the content)
  • Methods.md: version controlled methods section for the project
  • Further subdirectories can be customized based on the needs of the collection.

requirements/

  • Any module or library dependencies for workflows.
  • Additional requirements files can be added as needed.

configs/

This directory holds configuration files for workflows or applications. Sub-directories can be added as needed for the collection.

configs/environments

Environment means the analysis environment for a script or model.

  • For each script/notebook in src, and each model in models, there should be an environment manifest and file that describes the environment in which it was created or run.
  • Environment files should be named as follows: descriptiveName_env.* and placed in a folder called environments within the configs/ directory.
  • Either individual files or entire folders (whichever is the appropriate level) in scripts and notebooks within the src/ directory, the models/ directory, or the data/endpoints folder will need to be added to the env_manifest.csv file, matching them with their related environment file.
  • All environment files should be documented in an environment manifest; see the manifests/env manifest section for more information.

Arcus Lab Images

  • There should also be a file named lab-image-tag within a folder titled lab-image within the configs/ directory that contains the tag of the Arcus Lab Image that the Lab was using.
  • Though unlikely, if artifacts use more than one image, then follow the directions above (in the environments section): add a descriptive name to each lab-image-tag file, and add the file paths and related files or directories to the env_manifest, linking them together.

Concluding Quiz

  1. True or False: Data in the raw directory can be edited.

[( )] TRUE
[(X)] FALSE


FALSE. This directory holds authoritative source data that should never be deleted. This folder is where the original, unmodified data for the research project is stored.

***
  1. True or False: MRNs are preferred for use as both the local biosample or participant ID and the authoritative ID.

[( )] TRUE
[(X)] FALSE


FALSE. MRNs should only be used as the authoritative id in the participant-crosswalk.txt to protect patient privacy and to minimize data leaks.

***
  1. Which folders in the project template are minimally required?

[[ ]] references
[[ ]] src
[[X]] data
[[X]] manifests
[[ ]] configs
[[ ]] models
[[ ]] reports
[[ ]] requirements


Only data and manifests are required but more is always better. References and src are probably the two most common non-required folders generated over the course of research.

***
  1. I am contributing a clinical dataset to Arcus. I have a Python script that was used to transform my raw dataset for further analysis. What directory should this file be saved in?

[[src]]


The src or sources folder stores the access tools required to work with the research data and repeat the analysis, like scripts. Remember, any scripts saved in the src folder require an environment manifest to document the computing environment the code is run in; see the environment manifest section for more information.

***