@comment
@long_description
Estimated time to completion: @estimated_time
Prerequisites:
@pre_reqs
Learning Objectives:
@learning_objectives
Hi! This document is still under construction and testing. We apologize in advance for any broken links or unclear language. We invite your feedback. Please add a support ticket or email Arcus Library Science to let us know what we can improve or suggest additional topics.
Please note that many of the links here will only work if you're on the CHOP network.
This module introduces the CHOP community to the Project Template, a structured file directory for managing research data. It is useful to ANYONE at CHOP involved in creating, managing or analyzing research data.
The Project Template offers an easy-to-use and flexible structure for organizing data, and ensures necessary context is preserved for project files and documents. It provides a shared, documented framework for organizing a research effort. This achieves multiple goals:
- It assists research teams in building transparency and reproducible workflows
- It establishes a structure for easier long-term preservation
- It preserves context needed for future reuse of research by collaborators or for the original researcher
- It organizes a research project for archiving within Arcus
The Project Template is useful in creating reproducible, generalized and reusable research. To learn more about reproducible, generalized and reusable principles in research, check out the following module from the Arcus Education team.
Thank you for agreeing to contribute your data. All contributed research data will be arranged in CHOP's Project Template structure. This module describes all sections of the project template structure, what data goes in each section, and shows examples of research data arranged in the template.
This module can be used as a reference while you navigate the archiving process; please reach out to the Digital Archivist if you have further questions. Please share this module with others on your research team involved with preparing the data contribution, or with other researchers who may be interested in archiving data with Arcus.
This module is an overview of CHOP's Project Template structure; all contributed data is arranged in this structure for archiving in Arcus. The Project Template is useful at all stages of research, and we suggest implementing it as early as possible in your research, as it provides a shared, documented framework for organizing a research effort.
Arcus's Library Science team is happy to meet with researchers at all phases of research for research data management consultations and planning for future archival contributions. This includes early in your research project, as we can help set up a project template file directory structure for storing data and recommend metadata and organization best practices. In addition to this module, there are additional data management resources available on CHOP’s Arcus resources page.
If after viewing this module, you are prepared to archive data with Arcus, please fill out a data contribution request to start the process.
The Project Template is added to all Arcus Scientific Computing Labs (Arcus Lab), so all you have to do is set up your workflows to conform to the template. This module describes all sections of the project template structure, what data goes in each section, and shows examples of research data arranged in the template.
As part of the Arcus Lab deployment, your team should have received a Project Template orientation from a member of the Library Science team. If you missed the orientation or need a refresher, please see this video of the orientation. Much of the content covered in the video is also in this module.
When appropriate, Scientific Projects with an Arcus Lab are expected to archive their research in Arcus. This is documented in the Arcus Terms of Use. Archiving is required if you would like to move any data created within an Arcus Lab to a new Scientific Project or if other research teams would like to reuse your data.
When you are ready to archive your lab data, please submit the following request in the Arcus Help Center to begin the data contribution process.
Arcus's goal for research data management and the project template is to provide tools that are relevant throughout the entire lifecycle of research data. The project template is designed to be adaptable and iterative to capture the wide range of research activities at CHOP. It combines the flexibility to encompass diverse data capture needs with a consistent structure for all archived data, which facilitates communication among projects spanning different domains and promotes effective data sharing.
How was this structure developed?
The CHOP project template file directory structure was adapted from DrivenData’s Cookiecutter Data Science template. It was adapted by former Arcus Digital Archivist, Christiana Dobrzynski, and former CHOP Bioinformatician, Perry Evans. Both Arcus’s and DrivenData’s templates aim to organize research data and tools for accuracy and reproducibility. See DrivenData's introduction to learn more about the goals and purpose of project template structures for data preservation and sharing.
The CHOP project template evolved through iterations and feedback from CHOP researchers. A multi-disciplinary group of practitioners was consulted during the template adaptation and development, including:
- Bioinformatics
- Cancer research
- Microbiome center
- Research IT
- Clinical sequencing unit
- Medical Informatics Unit
How is the Project Template used at Arcus?
The Project Template prioritizes streamlined archiving and reproducible research pathways. It archives a wide range of research types from the Research Institute, making them discoverable through tools like Arcus Cohort Discovery, Gene, and the Arcus Variant Browser (all available as applications on the Arcus website). The Project Template facilitates organizing diverse research data in a single directory structure, enabling automated archiving, metadata management, and data delivery throughout the research data lifecycle. This file directory structure is used for the entire lifecycle of research data within Arcus:
The project template provides a shared structure so that institutional knowledge previously held locally by various members of the data creation team becomes centralized.
The utility of the project template for lab drive organization and integration with the Arcus archives is summarized in the graphic below.
The project template structure includes directories for capturing three major aspects of a research effort: the data (data), the tools needed to work with that data (access tools), and the contextual information needed to understand the effort and its constituent parts (contextual). The high-level directories are as follows (items with an asterisk are required); a short sketch that creates this layout follows the list:
- Configs (contextual)
- Data (data)*
- Manifests (data)*
- Models (access tools)
- References (contextual)
- Reports (contextual)
- Requirements (contextual)
- SRC (access tools)
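To make the layout concrete, here is a minimal sketch in Python (standard library only) that creates this high-level structure on a project drive. The lowercase folder names and the `my_project` root are assumptions for illustration; an Arcus Lab will already have the template in place, so you would only need something like this for local or pre-Arcus organization.

```python
from pathlib import Path

# Hypothetical project root; adjust to your lab drive location.
project_root = Path("my_project")

# High-level Project Template directories (data/ and manifests/ are required).
top_level = [
    "configs",       # contextual: configuration and environment files
    "data",          # data: raw/, interim/, endpoints/, ref-data/
    "manifests",     # data: file, participant, and crosswalk manifests
    "models",        # access tools: machine learning models
    "references",    # contextual: IRB documents, data dictionaries, protocols
    "reports",       # contextual: papers, figures, methods
    "requirements",  # contextual: module/library dependencies
    "src",           # access tools: scripts, notebooks, rules, tests
]

for name in top_level:
    (project_root / name).mkdir(parents=True, exist_ok=True)

# The data/ folder has its own standard sub-directories.
for sub in ["raw", "interim", "endpoints", "ref-data"]:
    (project_root / "data" / sub).mkdir(parents=True, exist_ok=True)
```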
Below is an image of the entire Project Template Directory, with more detail about each section:
The Project Template brings together three categories of information: Research Data, Access Tools, and Contextual Files. Research data is the actual data collected during the course of research and used for analysis. The manifests describe this data, crosswalking files to participants. Research data (with manifests) is the minimum required information for all Arcus data contributions.
Access Tools are the code used to do the analysis. This can include machine models, scripts and Jupyter notebooks.
Contextual Files provide information needed to understand the data and analysis. This can include omics protocols, data dictionaries, reports and diagrams.
The next part of this module will walk through each sub-directory of the project template in detail. Though the project template is flexible enough to handle a wide range of research data, its application and the filetypes in each directory will differ depending on the type of project. For this reason, we have two different examples: clinical data and omics data. In many of the following sections, you can select the option to view examples and specific information for the data type.
Regardless of project type, Arcus follows industry-standard guidelines for digital archiving and applies these standards to incoming data contributions. File names should follow a consistent and clear schema, and should not contain any spaces, periods, or special characters. Further recommendations for file naming are below, followed by a small illustrative check:
- File naming tips sheet
- File naming conventions and activity sheet
- Recommended practices for README files
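As a quick, unofficial illustration of these conventions, the sketch below flags file names that contain spaces, extra periods, or special characters. The allowed-character pattern is an assumption based on the guidance above, not an Arcus requirement.

```python
import re

# Allow letters, digits, hyphens, and underscores, followed by a file extension.
GOOD_NAME = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9.]+$")

examples = [
    "participant01_visit2_labs.csv",   # clear, consistent schema
    "final data (v2).xlsx",            # spaces and parentheses: avoid
]
for name in examples:
    status = "ok" if GOOD_NAME.match(name) else "rename"
    print(f"{name}: {status}")
```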
Whenever feasible, Arcus prefers to archive non-proprietary file formats as opposed to proprietary ones. Proprietary formats necessitate specific software for access or utilization, while non-proprietary formats are frequently open-source. Whenever you have the option, it's advisable to store data in a non-proprietary (open) file format. This choice enhances the accessibility of your content to others, enabling effortless reuse across various software platforms. Furthermore, this approach guarantees the continued utility of the file in the long term. In contrast, proprietary files carry the risk of becoming obsolete due to potential software incompatibility or restricted access.
When it is necessary to save files in a proprietary format, consider including a README file that documents the name and version of the software used to generate the file, as well as the company that made the software. This could help down the road if we need to figure out how to open these files again.
For both the clinical data and omics examples in the Project Template walk through, we reference our preferred data formats for each type of data. Below are some general resources to help in choosing file formats:
- Library of Congress Recommended Formats Statement
- UCSC Glossary of Omics File Formats
- NIH Clinical Trials Data Formats Overview
The data folder is where the data files are organized. Data is the information collected during the course of research and used for analysis. The data directory maintains descriptions of authoritative source data and their associated files and metadata in both raw and processed formats. There are four sub-directories within the data folder for organizing the data: raw/, interim/, endpoints/, and ref-data/.
All files within the data/ folder and its subdirectories will be listed in the file_manifest.csv. The manifests are detailed in the manifests section of this course.
This directory holds authoritative source data that should never be deleted. This folder is where the original, unmodified data for the research project is stored. In a research process, this is the data used for the initial analysis. Further sub-directories can be added to organize data, if necessary.
Within an Arcus Scientific Lab
- Arcus delivered archival data will be found here.
- Study team generated data brought into Arcus goes here.
- This data is managed by Arcus, and should not be modified by the research team.
Raw data is different depending on the type of research. Please select what type(s) of data you would like more information about; you can select both:
- omics data
- clinical data
REDCap is a great application for clinical data projects of all sizes available to all CHOP personnel. The REDCap team at CHOP has great resources for [data collection best practices](https://storage.googleapis.com/arcus-edu-libsci/PDFs/Best%20Practices%20for%20REDCap%20Data%20Collection.pdf) for new projects and how to [import data](https://storage.googleapis.com/arcus-edu-libsci/PDFs/REDCap_Data_Import_Instructions.pdf) residing in a different application for complete projects ready to be archived. If you automate data collection directly from patients encounters in the EHR, there are options to feed that data directly into [REDCap via an API](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/using_redcap_api/using_redcap_api.md#1). If you collect data in REDCap, there is an option to both tag data with an identifiability label at the onset of a project as well as export data with all identifiable fields tagged.
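If you do pull data through the REDCap API, a minimal record export looks roughly like the sketch below (Python with the `requests` library). The URL and token are placeholders; use your project's API URL and token from REDCap's API page, and see the linked module above for details.

```python
import requests

# Placeholder values: use your project's API URL and token from the REDCap API page.
REDCAP_API_URL = "https://redcap.example.org/api/"
REDCAP_API_TOKEN = "YOUR_PROJECT_TOKEN"

# Export all records for the project as JSON (a "flat" export, one row per record/event).
payload = {
    "token": REDCAP_API_TOKEN,
    "content": "record",
    "format": "json",
    "type": "flat",
}

response = requests.post(REDCAP_API_URL, data=payload, timeout=30)
response.raise_for_status()
records = response.json()
print(f"Exported {len(records)} records")
```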
The interim directory is for storing outputs of data processing and analysis completed using the original, unmodified data stored in data/raw. It is generally used for files that do not need to be stored long-term. Further sub-directories can be added to organize data, if necessary.
Within an Arcus Scientific Lab
- Data in this directory is managed by the study team
- It should be used as an unregulated space for intermediate and temporary files.
- We recommend establishing retention schedules for regular review and clean-up of data in this folder.
Interim data is different depending on the type of research. Please select what type(s) of data you would like more information about; you can select both:
- omics data
- clinical data
The endpoints directory holds the final results created as part of a research analysis. Often, these are files created to support papers, grants, and other dissemination. Further sub-directories can be added to organize data, if necessary.
Within an Arcus Scientific Lab
- Data in this directory is managed by the study team.
- Data in this directory will be saved if the project is archived in Arcus.
Endpoints data is different depending on the type of research. Please select what type(s) of data you would like more information about; you can select both:
- omics data
- clinical data
This directory is for any external or public datasets not created by the study team that are necessary to understand or repeat the analysis for the project.
Within an Arcus Scientific Lab
- External or public datasets not supplied by Research IS or your lab, such as census data, will be available in this directory.
Ref-data is different depending on the type of research. Please select below what type(s) of data you would like more information about for this directory:
- omics data
- clinical data
Manifests are an inventory of all data in the collection, and provide a mapping between research data in the data folders, and participant information. The manifests also create a mapping between data and associated pipeline and technical information about workflows. There are three main manifests that are mandatory for every archival collection:
- file_manifest.csv
- participant_manifest.csv
- participant-crosswalk.txt
Additional manifests are only required if needed for the data or collection type. These files are detailed in the next sections. The graphic below illustrates the linking between the files:
Within an Arcus Scientific Lab
- These are managed by Arcus; you will not need to create them yourself
- This will only appear in the lab if archival data is delivered
The file_manifest.csv matches the biosample_id to each file in the data folders. Below is more detail about each column in the file, followed by a small example:
- biosample_id is an ID number for each file. For some studies, each file is derived from specific biosamples, so we suggest using the sample ID. Ideally, biosample_id links to the CHOP biobank. When you cannot link to the biobank, treat biosample_id as the IDs you use for samples taken from participants. For studies where there are no biosamples, the biosample_id can be the file name.
- file_type is the type of file, indicated by the file extension
- protocol is only for omics data; select the omics data example below for more information
- file_path is the file path for each file in the data/ folders. File paths should start with data/ and end with the full file name with extension
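For illustration, the sketch below writes a minimal file_manifest.csv with Python's csv module. The sample IDs, protocol value, and file paths are hypothetical, and placing the manifest in the top-level manifests/ folder is an assumption based on the directory list above.

```python
import csv
import os

# Hypothetical rows illustrating the columns described above.
rows = [
    {
        "biosample_id": "SAMPLE_001",
        "file_type": "fastq.gz",
        "protocol": "wgs-protocol-1",        # omics only; leave blank otherwise
        "file_path": "data/raw/SAMPLE_001_R1.fastq.gz",
    },
    {
        "biosample_id": "demographics.csv",  # no biosample: file name used as the id
        "file_type": "csv",
        "protocol": "",
        "file_path": "data/raw/demographics.csv",
    },
]

os.makedirs("manifests", exist_ok=True)
with open("manifests/file_manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["biosample_id", "file_type", "protocol", "file_path"])
    writer.writeheader()
    writer.writerows(rows)
```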
The file_manifest.csv may look different depending on the type of research. Please select below if you need more information about either omics or clinical data for this directory:
- omics data
- clinical data
The participant_manifest.csv identifies which participant's information links to each of the files in the file_manifest. Below is more detail about each column of the file, with a short example after the list:
- local_participant_id is a local identifier the study team used to identify the patient
- The biosample_id will be the same as the one listed in the file_manifest.csv. Linking a local_participant_id to a biosample_id identifies which patient's information is related to the file.
- cohort is optional; please fill this in if there is additional cohort information or identification needed.
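As a short example of that linkage, the sketch below joins the two manifests on biosample_id using pandas. The file locations are hypothetical and assume both manifests sit in the top-level manifests/ folder.

```python
import pandas as pd

# Hypothetical paths; both manifests live in the top-level manifests/ directory.
files = pd.read_csv("manifests/file_manifest.csv")
participants = pd.read_csv("manifests/participant_manifest.csv")

# Joining on biosample_id shows which participant each data file relates to.
linked = files.merge(participants, on="biosample_id", how="left")
print(linked[["local_participant_id", "biosample_id", "file_path"]].head())
```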
The participant_manifest.csv may look different depending on the type of research. Please select which type of data you need more information about for this directory:
- omics data
- clinical data
The participant-crosswalk.txt manifest is a tab-delimited file with no header that links the local_participant_id in the participant_manifest.csv to the MRN (Medical Record Number). See below for the terms in the file, followed by a short example of writing one:
column | definition | type | notes |
---|---|---|---|
local_id_type | The type of participant id (local). | String | This will always be local. |
local_participant_id | Id that is used in PARTICIPANT_MANIFEST | String | |
auth_id_type | The type of participant id (chop). | String | This will always be chop. |
auth_participant_id | Authoritative id of the participant (often the MRN). | String | Use an 8-digit MRN. Left-pad the MRN with zeroes as necessary. |
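To illustrate the layout, here is a small sketch that writes a crosswalk file with tab-delimited columns in the order above and left-pads each MRN to 8 digits. The participant IDs and MRNs are made up, and the output location in manifests/ is an assumption.

```python
import os

# Hypothetical participant list: (local_participant_id, MRN). These MRNs are made up.
participants = [
    ("participant1", "1234567"),
    ("participant2", "7654321"),
]

os.makedirs("manifests", exist_ok=True)
with open("manifests/participant-crosswalk.txt", "w") as f:
    for local_id, mrn in participants:
        # Tab-delimited, no header: local_id_type, local_participant_id,
        # auth_id_type, auth_participant_id (MRN left-padded to 8 digits).
        f.write(f"local\t{local_id}\tchop\t{mrn.zfill(8)}\n")
```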
The participant_family_role.csv file is only needed for some omics data. If you have family data (i.e., sequencing data from related family members), use this file to describe relationships. See below for an example.
local_participant_id | local_relative_id | relative_family_role |
---|---|---|
participant1 | participant2 | biological mother |
participant2 | participant1 | biological son |
participant1 | participant3 | biological father |
participant3 | participant1 | biological son |
participant1 | participant4 | biological sister |
participant4 | participant1 | biological brother |
column | definition | type |
---|---|---|
local_participant_id | The local id of a participant. | String |
local_relative_id | The local id of a relative to the participant. | String |
relative_family_role | The familial relationship of the relative to the participant. Use terms from eHB_relationship_types_as_of_10_30.json. | String |
- This manifest should be used with trio and cohort omics data. A trio will contain three participants; a cohort can contain hundreds. This file crosswalks the name of the trio or cohort file with the local_participant_ids included in it.
- This is also called a PED or pedigree file in bioinformatics workflows. A pedigree is a structured description of the familial relationships between samples; see this link for more information. A small parsing sketch follows the column tables below.
family_id | individual_id | paternal_id | maternal_id | sex |
---|---|---|---|---|
LML100 | 101354 | | | 2 |
LML100 | 101355 | | 101354 | 1 |
LML101 | 102454 | | | 2 |
LML101 | 102455 | 102456 | 102454 | 1 |
LML101 | 102456 | | | 1 |
LML102 | 103767 | | | 1 |
LML102 | 103768 | | | 2 |
LML102 | 103769 | 103767 | 103768 | 2 |
LML103 | 108976 | 108977 | 108978 | 1 |
LML103 | 108977 | | | 1 |
LML103 | 108978 | | | 2 |
LML104 | 104666 | 104667 | 104668 | 2 |
LML104 | 104667 | -9 | -9 | 1 |
LML104 | 104668 | -9 | -9 | 2 |
family_id | Required, the family_id for the trio data |
---|---|
local_participant_id | Required, the local_participant_id for each of the participants included in the trio |
paternal_id | Optional, the local_participant_id for the father of the participant |
maternal_id | Optional, the local_participant_id for the mother of the participant |
sex | Optional, the sex of the participant. 1 for male, 2 for female |
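For teams that handle these pedigree files programmatically, the sketch below reads a tab-delimited PED-style file with the columns shown in the example table above and prints the complete trios (rows where both parent IDs are present). The file name, tab delimiter, and lack of a header row are assumptions; confirm the expected format with the Library Science team.

```python
import csv

# Assumed column order, matching the example table above; assumes no header row.
FIELDS = ["family_id", "individual_id", "paternal_id", "maternal_id", "sex"]
MISSING = {"", "-9", "0"}  # common codes for "no parent recorded"

with open("manifests/pedigree.ped", newline="") as f:
    reader = csv.DictReader(f, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if row["paternal_id"] not in MISSING and row["maternal_id"] not in MISSING:
            print(f"Trio in family {row['family_id']}: "
                  f"child {row['individual_id']}, "
                  f"father {row['paternal_id']}, mother {row['maternal_id']}")
```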
The file_derivation.csv manifest is only required for omics contributions with multiple filetypes generated through a bioinformatics pipeline or workflow.
file_derivation.csv describes the relationships between files in a pipeline or workflow.
column | definition | type |
---|---|---|
destination_file_group | The files in this file group are derived from source_file_group. | String |
source_file_group | File group used to derive the destination_file_group. | String |
For each script/notebook in src/, and each model in models/, there should be an env file (here env refers to a file named env with any extension, so env.yaml or env.txt, for example) that describes the environment in which it was created or run. Environment files should be named as follows: descriptiveName_env and placed in a folder called environments within the configs/ directory. Either individual files or entire folders (whichever is the appropriate level) of scripts and notebooks within the src/ directory, or within the models/ directory, will need to be added to the env manifest file, matching them with their related environment file. See below for more information about this file, followed by a short example:
column | definition | type |
---|---|---|
programming_filegroup | Enter the highest level folder that the environment file relates to. If the file relates to an entire directory then put the whole directory file path. If the file relates to a subdirectory enter that filepath. If it relates to a single file enter the file path and filename. | String |
related_environment | Enter the environment filename. Some environment files will be entered multiple times as they relate to multiple files. | String |
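As one example of how this might look in practice, the sketch below captures a Python environment with pip freeze and records the pairing in an env_manifest.csv. The script path, environment name, and the manifest's location in manifests/ are hypothetical; conda users might capture the environment with `conda env export` instead.

```python
import csv
import subprocess
from pathlib import Path

# Capture the current Python environment (one common approach).
env_dir = Path("configs/environments")
env_dir.mkdir(parents=True, exist_ok=True)
env_file = env_dir / "cohortAnalysis_env.txt"  # hypothetical descriptive name
freeze = subprocess.run(["pip", "freeze"], capture_output=True, text=True, check=True)
env_file.write_text(freeze.stdout)

# Record the pairing between the code and its environment file in the manifest.
Path("manifests").mkdir(exist_ok=True)
with open("manifests/env_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["programming_filegroup", "related_environment"])
    writer.writerow(["src/scripts/cohort_analysis.py", env_file.name])  # hypothetical script path
```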
The src or sources folder stores the access tools required to work with the research data and repeat the analysis. The need for access tools depends on the type of research; not all research has rules, scripts, or notebooks. Any scripts saved in the src folder require an environment manifest to document the computing environment the code is run in; see the environment manifest section of this module for more information. Subdirectory folders can be customized and added as needed. Below are the common directories or data types used in scientific research:
- notebooks: Jupyter, Beaker, Zeppelin, WDL, CWL etc.
- scripts: custom software, code, tools
- rules: for computational workflows
- test: unit testing for code, customizable to team needs
Version Control is important when working collaboratively with access tools like scripts and workflows. For a description of version control and version control systems, see the Arcus Education module, [Intro to version control](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/git_intro/git_intro.md#1)
src files may look different depending on the type of research. Please select which type of data you need more information about for this directory:
- omics data
- clinical data
The models directory is for saving any type of machine learning models, model predictions, model summaries, and data sheets for model training data. Please consult with the Library Science team about your specific model type for more information about formats and directory structure for archiving.
This directory contains general information about the research effort such as IRB documents, reference papers, sample information, lab prep information, and data dictionaries. This directory holds the technical information needed to understand the research data. Further subdirectories can be customized depending on the collection.
references files may look different depending on the type of research. Please select what type of data you need more information about for this directory:
- omics data
- clinical data
The reports directory holds published papers and content used for producing papers, presentations, websites, metrics, etc. It can additionally hold the following information:
- Figures & tables: generated metrics and graphics for supporting reports
- Log.md: computational notebook (if one was used to create the content)
- Methods.md: version controlled methods section for the project
- Further subdirectories can be customized based on the needs of the collection.
The requirements directory holds dependency information for the project:
- Any module or library dependencies for workflows.
- Additional requirements files can be added as needed.
This directory holds configuration files for workflows or applications. Sub-directories can be added as needed for the collection.
Environment means the analysis environment for a script or model.
- For each script/notebook in src and each model in models, there should be an environment manifest and file that describes the environment in which it was created or run.
- Environment files should be named as follows: descriptiveName_env.* and placed in a folder called environments within the configs/ directory.
- Either individual files or entire folders (whichever is the appropriate level) of scripts and notebooks within the src/ directory, the models/ directory, or the data/endpoints folder will need to be added to the env_manifest.csv file, matching them with their related environment file.
- All environment files should be documented in an environment manifest; see the manifests/environment manifests section for more information
Arcus Lab Images
- There should also be a file named lab-image-tag within a folder titled lab-image within the configs/ directory that contains the tag of the Arcus Lab Image that the Lab was using.
- Though unlikely, if artifacts use more than one image, then follow the directions above (in the environments section): add a descriptive name to each lab-image-tag file, and add the file paths and related files or directories to the env_manifest, linking them together. A brief sketch of writing a lab-image-tag file follows below.
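A minimal sketch of recording the tag; the tag value shown is only a placeholder, so substitute the tag of the Arcus Lab Image your lab actually runs.

```python
from pathlib import Path

# Create configs/lab-image/ and write the tag of the Lab Image in use.
lab_image_dir = Path("configs/lab-image")
lab_image_dir.mkdir(parents=True, exist_ok=True)
(lab_image_dir / "lab-image-tag").write_text("your-lab-image-tag-here\n")  # placeholder value
```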
- True or False: Data in the raw directory can be edited.
[( )] TRUE
[(X)] FALSE
FALSE. This directory holds authoritative source data that should never be deleted. This folder is where the original, unmodified data for the research project is stored.
- True or False: MRNs are preferred for use as the local biosample or participant id as well as the authoritative id.
[( )] TRUE
[(X)] FALSE
FALSE. MRNs should only be used as the authoritative id in the participant-crosswalk.txt to protect patient privacy and to minimize data leaks.
- Which folders in the project template are minimally required?
[[]] references
[[]] src
[[X]] data
[[X]] manifests
[[]] configs
[[]] models
[[]] reports
[[]] requirements
Only data and manifests are required but more is always better. References and src are probably the two most common non-required folders generated over the course of research.
- I am contributing a clinical dataset to Arcus. I have a Python script that was used to transform my raw dataset for further analysis. What directory should this file be saved in?
[[src]]
The src or sources folder stores the access tools required to work with the research data and repeat the analysis, like scripts. Remember, any scripts saved in the src folder require an environment manifest to document the computing environment the code is run in; see the environment manifest page for more information.