Feat/bulk methylation templates (#80)
* Add BulkMethylation-seq templates for more levels

* Add kit enum

* Add more enums

* Add slots

* Cleanup

* Lint stray tabs

* Refactor

* Cleanup

* Copy over current dca-template-config.json per #79

* Add bulk methylation-seq templates to templates config

* Lint

* Update README

* Update test

* Update README

* Remove unneeded step of setting up remote config in test

* Use ParentDataFileID

* Wording

* Increase throttle setting

* Build jsonld

* Fix

* Update throttle parameter again

* Build jsonld

* Fix extra spaces

* More cleanup

* Build jsonld

* Update imaging.yaml

* Build jsonld

* Update slots.yaml

remove LabID since it's duplicated by ParticipantID

* Build jsonld

* Update imaging.yaml

* Build jsonld

* Update imaging.yaml

* Update imaging.yaml

* Build jsonld

* Update imaging.yaml

* Update imaging.yaml

* change to participantID

* Build jsonld

---------

Co-authored-by: gf-dcc-service <[email protected]>
Co-authored-by: Christina Conrad <[email protected]>
3 people authored Nov 29, 2023
1 parent 1415ca3 commit d53bda4
Showing 8 changed files with 7,990 additions and 7,585 deletions.
15,346 changes: 7,793 additions & 7,553 deletions GF.jsonld

Large diffs are not rendered by default.

17 changes: 15 additions & 2 deletions README.md
@@ -1,13 +1,26 @@
# Gray Foundation Data Model

This repository hosts the data model used for the Gray Foundation project.
This repository hosts the data model used for the Gray Foundation project.
The data model defines metadata that we collect for patients and data files, as templates that are created and validated by the Data Curator App (see related section [Updating DCA configuration](#Updating-DCA-configuration)).

## Internal development
## Development

We use a branching model for development of the source.
Create a branch starting with `dm/**` to make changes in one of the files in `modules`, then submit a pull request.
This will run some automated checks and "compile" the full model files (`.csv` and `.jsonld`).

### Using the LinkML framework

For now, please reference [the docs in our sister DCC repo](https://github.com/nf-osi/nf-metadata-dictionary/#data-model-framework) to understand the LinkML framework and files.

### Updating DCA configuration

The `dca-template-config.json` defines which templates in the data model can be generated through the Data Curator App.
If you've cooked up a new template that's ready for production, you will need to add it to the menu through updating this file.
This will make sure that:
- It shows up in the DCA menu.
- Continuous integration tests are run for this template; if a change breaks a user-facing template, it will show up in the tests for the pull request.

## External comments, questions, contributions

For questions and suggestions please submit an issue.
56 changes: 56 additions & 0 deletions dca-template-config.json
@@ -0,0 +1,56 @@
{
"manifest_schemas": [
{
"display_name": "Imaging Level 2",
"schema_name": "ImagingLevel2",
"type": "file"
},
{
"display_name": "Imaging Level 2 Channels",
"schema_name": "ImagingLevel2Channels",
"type": "file"
},
{
"display_name": "ScRNA-seq Level 1",
"schema_name": "ScRNA-seqLevel1",
"type": "file"
},
{
"display_name": "ScRNA-seq Level 2",
"schema_name": "ScRNA-seqLevel2",
"type": "file"
},
{
"display_name": "ScRNA-seq Level 3",
"schema_name": "ScRNA-seqLevel3",
"type": "file"
},
{
"display_name": "ScRNA-seq Level 4",
"schema_name": "ScRNA-seqLevel4",
"type": "file"
},
{
"display_name": "Bulk Methylation-seq Level 1",
"schema_name": "BulkMethylation-seqLevel1",
"type": "file"
},
{
"display_name": "Bulk Methylation-seq Level 2",
"schema_name": "BulkMethylation-seqLevel2",
"type": "file"
},
{
"display_name": "Bulk Methylation-seq Level 3",
"schema_name": "BulkMethylation-seqLevel3",
"type": "file"
},
{
"display_name": "Patient Cohort Data",
"schema_name": "CohortCoreTemplate",
"type": "record"
}
],
"service_version": "v23.1.1",
"schema_version": ""
}
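
The README's instruction to add a new template to the menu amounts to appending one entry to `manifest_schemas`. A minimal sketch in Python, assuming the three-field entry shape shown above and that `file` and `record` are the only valid `type` values (only those two appear in this config). `add_template` is a hypothetical helper, not part of the Data Curator App or this repository.

```python
import copy

# Assumption: only these two values appear in the config above.
ALLOWED_TYPES = {"file", "record"}


def add_template(config, display_name, schema_name, template_type="file"):
    """Return a new config dict with one extra manifest_schemas entry."""
    if template_type not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {template_type!r}")
    existing = {entry["schema_name"] for entry in config["manifest_schemas"]}
    if schema_name in existing:
        raise ValueError(f"{schema_name!r} is already in the config")
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    updated["manifest_schemas"].append(
        {
            "display_name": display_name,
            "schema_name": schema_name,
            "type": template_type,
        }
    )
    return updated


config = {
    "manifest_schemas": [
        {
            "display_name": "Bulk Methylation-seq Level 1",
            "schema_name": "BulkMethylation-seqLevel1",
            "type": "file",
        },
    ],
    "service_version": "v23.1.1",
    "schema_version": "",
}

updated = add_template(
    config, "Bulk Methylation-seq Level 2", "BulkMethylation-seqLevel2"
)
print([entry["schema_name"] for entry in updated["manifest_schemas"]])
```

Rejecting a duplicate `schema_name` is a design choice of this sketch; the source does not say how the app handles duplicates.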
4 changes: 1 addition & 3 deletions modules/classes/imaging.yaml
@@ -9,7 +9,7 @@ classes:
- Component
- Filename
- FileFormat
- LabID
- ParticipantID
- ChannelMetadataFilename
- ImagingAssayType
- ProtocolLink
@@ -78,5 +78,3 @@ classes:
- OligoBarcodeLowerStrand
- Dilution
- Concentration


79 changes: 67 additions & 12 deletions modules/classes/sequencing.yaml
@@ -1,15 +1,23 @@
classes:

Sequencing:
File:
abstract: true
description: Base template for some type of sequencing data, for which below properties are relevant/in common.
description: Basic metadata that any file entity containing data (whether sequencing, imaging, etc.) will inherit.
annotations:
requiresComponent: ''
required: false
slots:
- Component
- Filename
- FileFormat

SequencingData:
is_a: File
description: Base template for some type of sequencing data file, for which below properties are relevant/in common.
annotations:
requiresComponent: ''
required: false
slots:
- ParticipantID
- SampleID
- AltSampleID
@@ -30,8 +38,10 @@ classes:
- LaneNumber
- TechnicalReplicateGroup

# Single-cell base ######################################################################################

SingleCellSequencing:
is_a: Sequencing
is_a: SequencingData
annotations:
requiresComponent: ''
required: false
@@ -45,6 +55,8 @@ classes:
- SpikeIn
- EndBias

####### ScRNA-seq ########################################################################################

ScRNA-seqLevel1:
is_a: SingleCellSequencing
annotations:
@@ -109,6 +121,18 @@ classes:
- ScRNAseqWorkflowParametersDescription
- WorkflowLink
- WorkflowVersion

CEL-seq2:
is_a: ScRNA-seqLevel1
description: Highly-multiplexed plate-based single-cell RNA-Seq assay (so a subtype of scRNA-seq, and some additional metadata parameters are collected)
annotations:
requiresComponent: ''
required: false
slots:
- EmptyWellBarcode
- WellIndex

####### ScATAC-seq ########################################################################################

ScATAC-seqLevel1:
is_a: SingleCellSequencing
@@ -131,30 +155,61 @@ classes:
# - TotalReads


# BulkDNA-seq ##############################################################################################

BulkDNA-seqLevel1:
is_a: Sequencing
is_a: SequencingData
annotations:
requiresComponent: ''
required: false
slots:
- TargetCaptureKit



# Bulk Methylation ##############################################################################################

BulkMethylation-seqLevel1:
is_a: Sequencing
is_a: File
annotations:
requiresComponent: ''
required: false
description: Raw data for bulk methylation sequencing, such as FASTQs and unaligned BAMs
slots:
- GenomicCoverageType
- ParticipantID
- SampleID
- AltSampleID
- NucleicAcidSource
- BisulfiteConversionKit
- BulkMethylationAssayType
- SequencingPlatform
- TechnicalReplicateGroup


CEL-seq2:
is_a: ScRNA-seqLevel1
BulkMethylation-seqLevel2:
is_a: File
annotations:
requiresComponent: ''
required: false
description: Aligned primary data for bulk methylation sequencing, such as gene expression matrix files, VCFs, etc.
slots:
- EmptyWellBarcode
- WellIndex

- WorkflowVersion
- WorkflowLink
- GenomicReference
- GenomicReferenceURL
- ProportionofMinimumCpGCoverage10X
# Omitting optional metrics as these can be referenced/re-generated during reprocessing

BulkMethylation-seqLevel3:
is_a: File
annotations:
requiresComponent: ''
required: false
description: Sample level summary data for bulk methylation sequencing, such as t-SNE plot coordinates, etc.
slots:
- DMCallingTool
- WorkflowLink
- PUC19methylationratio
- Lambdamethylationratio
# - DMCdatafileformat -- omitting this optional metadata as unnecessary/low value
# - DMRdatafileFormat -- omitting this optional metadata as unnecessary/low value
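
The refactor above moves the shared file slots into an abstract `File` class that the new bulk methylation templates extend via `is_a`. A minimal sketch in Python of how the full slot list for a template could be resolved, assuming slots are inherited along the `is_a` chain with parent slots listed first. The ordering and the `resolve_slots` helper are assumptions for illustration; the real column order in generated templates may differ.

```python
# Class definitions copied from the diff above, reduced to is_a and slots.
CLASSES = {
    "File": {
        "is_a": None,
        "slots": ["Component", "Filename", "FileFormat"],
    },
    "BulkMethylation-seqLevel2": {
        "is_a": "File",
        "slots": [
            "WorkflowVersion",
            "WorkflowLink",
            "GenomicReference",
            "GenomicReferenceURL",
            "ProportionofMinimumCpGCoverage10X",
        ],
    },
}


def resolve_slots(name, classes):
    """Collect slots from the root ancestor down to the named class."""
    chain = []
    current = name
    while current is not None:
        chain.append(current)
        current = classes[current].get("is_a")
    slots = []
    for cls in reversed(chain):  # root ancestor first
        for slot in classes[cls]["slots"]:
            if slot not in slots:  # skip slots already inherited
                slots.append(slot)
    return slots


print(resolve_slots("BulkMethylation-seqLevel2", CLASSES))
```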

14 changes: 14 additions & 0 deletions modules/enums/Sequencing.yaml
@@ -89,3 +89,17 @@ enums:
DNA Insert:
Sample Index:
Cell Barcode:

BisulfiteConversionKitEnum:
permissible_values:
Zimo EZ DNA Methylation Kit:
Zimo EZ-96 DNA Methylation Shallow Kit:
Zimo EZ-96 DNA Methylation Deep Kit:
NEBNext Enzymatic Methyl-seq Kit:
Agilent SureSelectXT Methyl-Seq:

SequencingStrategy:
permissible_values:
Whole genome:
Targeted Genome:
Beadchip Array:
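
The two enums above are referenced by the `BisulfiteConversionKit` and `BulkMethylationAssayType` slots through `annotations.range` in `modules/slots.yaml`. A minimal sketch in Python of checking one manifest row against them, assuming validation is an exact, case-sensitive string match. That matching rule and the `invalid_values` helper are assumptions; the validation actually performed by the Data Curator App may behave differently.

```python
# Permissible values copied verbatim from the enums above.
ENUMS = {
    "BisulfiteConversionKitEnum": [
        "Zimo EZ DNA Methylation Kit",
        "Zimo EZ-96 DNA Methylation Shallow Kit",
        "Zimo EZ-96 DNA Methylation Deep Kit",
        "NEBNext Enzymatic Methyl-seq Kit",
        "Agilent SureSelectXT Methyl-Seq",
    ],
    "SequencingStrategy": [
        "Whole genome",
        "Targeted Genome",
        "Beadchip Array",
    ],
}

# Slot -> enum mapping, as declared under annotations.range in slots.yaml.
SLOT_RANGES = {
    "BisulfiteConversionKit": "BisulfiteConversionKitEnum",
    "BulkMethylationAssayType": "SequencingStrategy",
}


def invalid_values(row):
    """Return {slot: value} for values outside the slot's enum."""
    problems = {}
    for slot, value in row.items():
        enum_name = SLOT_RANGES.get(slot)
        if enum_name is not None and value not in ENUMS[enum_name]:
            problems[slot] = value
    return problems


row = {
    "BisulfiteConversionKit": "NEBNext Enzymatic Methyl-seq Kit",
    "BulkMethylationAssayType": "Whole Genome",  # differs in case from "Whole genome"
}
print(invalid_values(row))
```

Under an exact-match rule, the mixed capitalization in `SequencingStrategy` (`Whole genome` next to `Targeted Genome`) is easy to trip over when filling in a manifest by hand.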
47 changes: 42 additions & 5 deletions modules/slots.yaml
@@ -168,6 +168,12 @@ slots:
description: Biospecimen Type
annotations:
range: BiospecimenTypeEnum
BisulfiteConversionKit:
title: Bisulfite Conversion Kit
required: true
description: Name of the kit used in bisulfite conversion
annotations:
range: BisulfiteConversionKitEnum
BloodTestNormalRangeLower:
title: Blood Test Normal Range Lower
required: false
@@ -268,6 +274,12 @@ slots:
description: The breast quadrant or structure from which the breast tissue specimen was removed for microscopic examination.
annotations:
range: BreastQuadrantSiteEnum
BulkMethylationAssayType:
title: Bulk Methylation Assay Type
required: true
description: Assay types normally determine genomic coverage
annotations:
range: SequencingStrategy
COVID19AntibodyTesting:
title: COVID19 Antibody Testing
required: false
@@ -634,6 +646,11 @@ slots:
description: The degree to which the lesion has been cut out, or resected.
annotations:
range: ExtentofTumorResectionEnum
DMCallingTool:
title: DM Calling Tool
description: Software used for calling differentially methylated CpG (DMC) and/or differentially methylated region (DMR)
required: false
annotations:
FOVX:
title: FOVX
required: false
@@ -842,11 +859,6 @@ slots:
description: A yes/no/unknown indicator to identify whether there is a known genetic predisposition mutation present in the patient.
annotations:
range: IndicatorEnum
LabID:
title: Lab ID
required: false
description: TBD
annotations:
LabTestsforMMRStatus:
title: Lab Tests for MMR Status
required: false
@@ -859,6 +871,11 @@ slots:
description: The text term used to describe the medical testing used to diagnose, treat or further understand a patient's disease.
annotations:
range: LaboratoryTestEnum
Lambdamethylationratio:
title: Lambda methylation ratio
required: true
description: Methylation ratio of mostly unmethylated lambda control, as a percentage
annotations:
LaneNumber:
title: Lane Number
required: true
@@ -1170,6 +1187,11 @@ slots:
required: false
description: Biospecimen identifier indicating the biospecimen(s) from which these files were derived; multiple parent biospecimen should be comma-separated
annotations:
ParentDataFileID:
title: Parent Data File ID
required: false
description: Optional, Synapse ID of a parent data file from which this data file is derived.
annotations:
ParentID:
title: Parent ID
required: false
@@ -1302,6 +1324,11 @@ slots:
required: false
description: TBD
annotations:
ProportionofMinimumCpGCoverage10X:
title: Proportion of Minimum CpG Coverage 10X
required: false
description: (For whole genome sequencing or targeted sequencing) Proportion of all reference bases that achieve 10X or greater coverage per CpG.
annotations:
ProtocolLink:
title: Protocol Link
required: false
@@ -1312,6 +1339,11 @@ slots:
required: false
description: TBD
annotations:
PUC19methylationratio:
title: pUC19 methylation ratio
required: true
description: Methylation ratio of mostly methylated pUC19 control, as a percentage
annotations:
Pyramid:
title: Pyramid
required: false
@@ -1710,6 +1742,11 @@ slots:
required: false
description: Numeric value that represents the number of times the patient uses tobacco each day.
annotations:
TotalDNAInput:
title: Total DNA Input
required: false
description: A sample amount used for the total number of reads, in microgram or nanogram.
annotations:
TotalNumberofInputCells:
title: Total Number of Input Cells
required: false
12 changes: 2 additions & 10 deletions tests/generate/basic_templates.sh
@@ -3,14 +3,13 @@
# Generally for 1) this is run in test directory with other tests

OUTPUT=${1:-google_sheet}
CONFIG=https://raw.githubusercontent.com/Sage-Bionetworks/data_curator_config/staging/GF/dca-template-config.json
TEST_CONFIG=config.json
TEST_CONFIG=../../dca-template-config.json
CREDS=creds.json
DATA_MODEL_PATH=../../GF.jsonld
DATA_MODEL=GF.jsonld
LOG_DIR=logs
TEMPLATE_DIR=../../templates
SLEEP_THROTTLE=17 # API rate-limiting, need to better figure out dynamically based on # of templates
SLEEP_THROTTLE=20 # API rate-limiting, need to better figure out dynamically based on # of templates

# Setup for creds
# If testing locally, it might already be in folder;
@@ -26,13 +25,6 @@ else
exit 1
fi

# Setup config
if [ -f "$TEST_CONFIG" ]; then
echo "Local $TEST_CONFIG present, running test with this local config..."
else
echo "Getting $CONFIG to use as test config..."
wget $CONFIG -O $TEST_CONFIG
fi

TEMPLATES=($(jq '.manifest_schemas[] | .schema_name' $TEST_CONFIG | tr -d '"'))
#TITLES=($(jq '.manifest_schemas[] | .display_name' $TEST_CONFIG | tr -d '"'))
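
For readers less familiar with `jq`, the `TEMPLATES` line above extracts every `schema_name` from the config. A rough Python equivalent follows, together with a sketch of pausing between templates. The rest of the script is collapsed in this diff, so how `SLEEP_THROTTLE` is actually applied is an assumption here; `run_throttled` and its `generate` callback are hypothetical names.

```python
import json
import time


def schema_names(config_text):
    """Equivalent of: jq '.manifest_schemas[] | .schema_name' | tr -d '"'"""
    config = json.loads(config_text)
    return [entry["schema_name"] for entry in config["manifest_schemas"]]


def run_throttled(names, generate, pause_seconds=20, sleep=time.sleep):
    """Call generate(name) for each template, pausing between calls.

    Assumption: a fixed pause between requests, mirroring SLEEP_THROTTLE=20.
    """
    for index, name in enumerate(names):
        generate(name)
        if index < len(names) - 1:  # no pause needed after the last one
            sleep(pause_seconds)


config_text = """
{
  "manifest_schemas": [
    {"display_name": "Bulk Methylation-seq Level 1",
     "schema_name": "BulkMethylation-seqLevel1", "type": "file"},
    {"display_name": "Patient Cohort Data",
     "schema_name": "CohortCoreTemplate", "type": "record"}
  ]
}
"""

names = schema_names(config_text)
generated = []
pauses = []
# Record the pauses instead of really sleeping so the example runs instantly.
run_throttled(names, generated.append, sleep=pauses.append)
print(names, pauses)
```

With a fixed pause, total run time grows linearly with the number of templates in the config, which is the concern behind the script's comment about working out the throttle dynamically.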