Molecular Targets Platform Documentation v1

The Open Pediatric Cancer (OpenPedCan) project at the Children’s Hospital of Philadelphia, in partnership with the National Cancer Institute, is combining and harmonizing pediatric cancer datasets and integrating them into the Molecular Targets Platform https://moleculartargets.ccdi.cancer.gov/ in order to accelerate pediatric cancer target identification and drug development. This is a high-level overview of the Molecular Targets Platform data processing and analysis. For more information on the Molecular Targets Platform itself, see https://moleculartargets.ccdi.cancer.gov/about. Please note that OpenPedCan is in continuous development, so the contents of the GitHub repository main branch may not be identical to the contents of the Molecular Targets Platform site.

Datasets

While adult pan-cancer repositories have existed and accelerated cancer research for a decade, pediatric cancers have been excluded, despite having different genetic and molecular etiologies than adult cancers. Over the past few years, larger pediatric consortia, both disease-specific and pan-cancer, have tried to address this disparity. The Molecular Targets Platform harmonizes the data from these consortia in one unified location where it can be queried for associations between putative targets and pediatric cancers. As the project is ongoing, more data will continue to be added, but the current release includes 3 pediatric consortia datasets as well as GTEx data for comparisons to normal tissue expression:

Image of a table giving the count of biospecimens for each sequencing type (DNA or RNA) for each tumor stage (primary or relapse) for each Dataset

For expanded descriptions of the datasets, please see the Pediatric Cancer Data Sources on the About page on the Molecular Targets Platform https://moleculartargets.ccdi.cancer.gov/about.


DNA Sequencing

Data Processing

DNA-seq Alignment and Haplotype Calling Workflow

For whole genome, whole exome, and targeted panel DNA sequencing, the workflow begins by flagging duplicates and aligning fastq files, or re-aligning previously aligned BAMs, to the reference genome GRCh38 using bwa mem. The majority of Pediatric Molecular Target data is paired-end, but single-end methods are provided if you want to apply the pipeline to your own data. Sequencing quality is checked using FastQC, and tumor/normal pairs are double-checked using NGSCheckMate to confirm they are from the same individual. For more details on sample identity confirmation, please see the Kids First NGS Checkmate Workflow. Variants are called using GATK4 HaplotypeCaller. For more details on the alignment, see the GitHub release at Kids First Alignment and Haplotype Calling Workflow; to run the pipeline yourself, see the CAVATICA App. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation.

Somatic Variant Calling

Small variants are called using multiple tools: Strelka2 for single nucleotide variants (SNVs) and small insertions/deletions (INDELs); GATK Mutect2 for SNVs, multinucleotide variants greater than 1 bp in length (MNVs), and INDELs; Lancet for SNVs, MNVs, and INDELs; and VarDict Java for SNVs, MNVs, INDELs, and more. Larger copy number variants are also called using multiple tools: ControlFreeC, CNVkit, and GATK CNV. CNVkit calls are adjusted for purity estimates using THetA2. Manta is used to call structural variants (SVs) and INDELs. All calls are made using GRCh38 references, and variants are then annotated using gnomAD and for cancer hotspots. Publicly available files are further subjected to “germline masking,” which removes low-frequency variants that could be used to identify the sample donor. For more details, see the GitHub release at Kids First Somatic Variant Workflow; to run the pipeline, see the CAVATICA App. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation.

Somatic Alteration Data

Small Variants

Multiple callers were used to determine single nucleotide variants (SNVs), since the literature suggests this reduces false positives. Using custom R scripts, a consensus SNV file was constructed, consisting only of SNVs that were called by 2 or more of the variant callers: GATK Mutect2, Strelka2, Lancet, and VarDict Java. See the consensus calling documentation for more detail on how the calls were combined. Annotations, including alternative gene and protein IDs and cancer references, were also added; see the annotation calling workflow for more details.
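The "called by 2 or more callers" rule can be illustrated with a minimal Python sketch. This is not the project's R implementation: the function name `consensus_snvs`, the caller labels, and the `(chrom, pos, ref, alt)` input shape are assumptions made for illustration, and all coordinates other than the chr12 example are made up.

```python
from collections import defaultdict


def consensus_snvs(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` distinct callers.

    `calls_by_caller` maps a caller label to an iterable of variant keys,
    here (chrom, pos, ref, alt) tuples. This input shape is hypothetical.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Keep only variants with enough independent support.
    return {v: sorted(c) for v, c in support.items() if len(c) >= min_callers}


# Toy input: two variants are seen by two callers, two by only one.
calls = {
    "mutect2": [("chr12", 94581668, "T", "C"), ("chr1", 100, "A", "G")],
    "strelka2": [("chr12", 94581668, "T", "C")],
    "lancet": [("chr2", 200, "G", "A")],
    "vardict": [("chr1", 100, "A", "G"), ("chr2", 300, "C", "T")],
}
consensus = consensus_snvs(calls)
```

Counting distinct callers per variant, rather than total calls, keeps a variant reported twice by the same tool from passing the threshold.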

A unique variant ID consisting of the hg38 coordinates and the reference and alternative alleles was created for consistency. Several variant frequencies were then calculated for each of those IDs within each cancer group and cohort. The frequency in the overall dataset, for each unique variant and gene, is the percentage of patients in the given cohort that have that variant or gene, out of all patients in that cohort. The frequency in primary or relapse tumors, for each unique variant and gene, is the percentage of samples in the given cohort that have that variant or gene, out of all samples in that cohort. Note that the frequencies and counts may not tally as expected for several reasons. First, the total columns use unique patients, while the primary/relapse tumor columns use unique samples. Second, some submitters did not include information about the primary/relapse status of the samples, so those samples are omitted from the primary/relapse counts. Last, some patients or samples are included in multiple cohorts and may be counted multiple times. See the SNV frequencies documentation for details of how the unique variant ID, variant frequencies, and annotations were generated using custom R scripts, and see the hotspot detection documentation for how SNV hotspots were called. Summarized tables are returned in response to queries on the Molecular Targets Platform (as described below). If you follow the links to Pediatric cBioPortal, sample-level data is available to view in the OncoPrint and Mutation tabs there, as well as to download.
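The following Python sketch illustrates why the overall and primary/relapse counts need not add up. It is an illustration under assumptions, not the project's R code: the record layout (`patient`, `sample`, `stage`, `variants`) and the function names are hypothetical, and the toy cohort is invented.

```python
def variant_id(chrom, pos, ref, alt):
    """Join hg38 coordinates and alleles, e.g. chr12_94581668_T_C."""
    return f"{chrom}_{pos}_{ref}_{alt}"


def count_mutated(samples, vid, unit, stage=None):
    """Return (mutated, total) counted over unique `unit` values.

    unit  -- "patient" or "sample": which identifier is counted
    stage -- None for the whole cohort, else "primary" or "relapse"
    """
    rows = [s for s in samples if stage is None or s["stage"] == stage]
    total = {s[unit] for s in rows}
    mutated = {s[unit] for s in rows if vid in s["variants"]}
    return len(mutated), len(total)


vid = variant_id("chr12", 94581668, "T", "C")

# Toy cohort: P1 has two samples; P3's sample has no primary/relapse status.
cohort = [
    {"patient": "P1", "sample": "S1", "stage": "primary", "variants": {vid}},
    {"patient": "P1", "sample": "S2", "stage": "relapse", "variants": {vid}},
    {"patient": "P2", "sample": "S3", "stage": "primary", "variants": set()},
    {"patient": "P3", "sample": "S4", "stage": None, "variants": {vid}},
]

overall = count_mutated(cohort, vid, "patient")            # unique patients
primary = count_mutated(cohort, vid, "sample", "primary")  # unique samples
relapse = count_mutated(cohort, vid, "sample", "relapse")  # unique samples
```

Here the overall count is 2 of 3 patients, while the primary and relapse counts are 1 of 2 and 1 of 1 samples: patient P1 contributes one patient but two samples, and sample S4 is left out of both stage-specific counts.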


The following table describes the fields and corresponding values for SNV data at both the gene and variant level within MTP.

| Column Name | Description | Values |
| --- | --- | --- |
| Gene symbol | HGNC symbol for the given gene | |
| Variant ID hg38 | Specific name for the variant in human genome hg38 coordinates; for example, chr12_94581668_T_C means that base 94581668 on chromosome 12 is not the reference thymine (T) but is mutated to a cytosine (C) | |
| Protein change | Amino acid change if the mutation causes one; for example, p.R317G means that the 317th amino acid is changed from arginine (R) to glycine (G) | |
| PMTL | Whether the gene is a relevant target on the PMTL (Pediatric Molecular Target List) | Binary; either an R for relevant target or NR for non-relevant target, and left blank if no data |
| Dataset | See the Dataset section in this document for more details | All Cohorts = all datasets combined, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PBTA = Pediatric Brain Tumor Atlas, GMFK = Gabriella Miller Kids First Neuroblastoma |
| Disease | Cancer type | See disease table |
| dbSNP ID | ID for the variant in NCBI’s dbSNP database https://www.ncbi.nlm.nih.gov/snp/ if one exists | dbSNP ID starting with “rs” if it exists, blank if the variant is not in dbSNP but is in other variant databases, and novel if the variant is not in any database used |
| VEP impact | Predicted mutation impact from Ensembl Variant Effect Predictor; only mutations predicted to have some impact are reported | high = predicted to cause complete or nearly complete loss of function, moderate = predicted to reduce protein effectiveness, modifier = affects a non-coding region where predictions are difficult or there is no evidence of impact |
| SIFT impact | Predicted mutation impact from SIFT, with the score in parentheses. The closer the score is to 0, the more deleterious the mutation is predicted to be. If there is insufficient reference material available at that position, SIFT will warn that there’s low confidence in the predicted impact. SIFT only makes predictions for missense variants. | deleterious, deleterious_low_confidence = SIFT score between 0 and 0.05, where the mutation is predicted to decrease protein function; tolerated, tolerated_low_confidence = SIFT score between 0.05 and 1, where the mutation probably doesn’t affect protein function (the closer to 1, the more likely that is); left blank if SIFT cannot be applied to the variant |
| PolyPhen impact | Predicted mutation impact from PolyPhen, with the score in parentheses. The closer the score is to 1, the more deleterious the impact is predicted to be. If there isn’t sufficient data to make a prediction, it’s reported as “unknown.” PolyPhen only makes predictions for missense variants. | probably_damaging = mutation predicted to reduce or eliminate protein function, with a score between 0.909 and 1; possibly_damaging = mutation may affect protein function, with a score between 0.446 and 0.908; benign = no effect on protein function, with a score between 0.001 and 0.445; unknown = no prediction due to lack of data, assigned a score of 0; left blank if PolyPhen cannot be applied to the variant |
| Variant classification | Variant effect on the protein: whether the frame is maintained, whether it’s a mutation or insertion/deletion, and whether it affects specific translation regions | Frame_Shift_Del = deletion that changes reading frame, Frame_Shift_Ins = insertion that changes reading frame, In_Frame_Del = deletion but reading frame is unchanged, In_Frame_Ins = insertion but reading frame is unchanged, Missense_Mutation = small variant that changes the amino acid coded for, Nonsense_Mutation = small variant that adds a stop codon, Nonstop_Mutation = mutation in the stop codon so that it no longer functions as a stop codon, Splice_Site = mutation at an exon-intron boundary at a splice site, Translation_Start_Site = mutation at the translation start site |
| Variant type | The type of small variant: how many bases are affected or whether there was a small insertion or deletion | DEL = deletion, DNP = double nucleotide polymorphism, INS = insertion, ONP = oligo-nucleotide polymorphism, SNP = single nucleotide polymorphism, TNP = triple nucleotide polymorphism |
| Gene full name | Full name of the gene from HGNC | |
| Gene type | A limited set of simplified annotations on what type of gene it is, especially whether it’s a known cancer gene | CosmicCensus = gene is in COSMIC, Kinase = gene is a kinase, Oncogene = gene is a known oncogene, TranscriptionFactor = gene is a transcription factor, TumorSuppressorGene = gene is a known tumor suppressor, left blank if no annotations apply |
| Protein RefSeq ID | RefSeq ID for the protein (not the gene) | |
| Gene Ensembl ID | Ensembl ID for gene | |
| Protein Ensembl ID | Ensembl ID for protein | |
| Total mutations / Subjects in dataset | Total number of samples with the SNV over the total number of disease samples in the given dataset | |
| Frequency in overall dataset | Fraction of the samples for the given disease in the given dataset that have the SNV | |
| Total primary tumors mutated / Primary tumors in dataset | Same as Total mutations, but for primary tumors only | |
| Frequency in primary tumors | Same as Frequency in overall, but for primary tumors only | |
| Total relapse tumors mutated / Relapse tumors in dataset | Same as Total mutations, but for relapse tumors only | |
| Frequency in relapse tumors | Same as Frequency in overall, but for relapse tumors only | |
| HotSpot | Whether this is a known recurrently occurring (hotspot) cancer mutation | Binary; Y, N |
| OncoKB cancer gene | Whether the gene is an annotated cancer gene listed in OncoKB https://www.oncokb.org/ | Binary; Y, N |
| OncoKB Oncogene\|TSG | Whether the gene is annotated as an oncogene or tumor suppressor (TSG) in OncoKB https://www.oncokb.org/ | oncogene = contributes to cancer development, TSG = tumor suppressor gene that suppresses cancer development, oncogene,TSG = if gene can be both, left blank if neither |
| PedcBio PedOT oncoprint plot | Link to oncoprint plot at Pediatric cBioPortal | |
| PedcBio PedOT mutation plot | Link to mutation plot at Pediatric cBioPortal | |
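For readers working with downloaded tables, the Variant ID and Protein change formats described above can be split back into their parts. The sketch below is a convenience illustration only; the function names are hypothetical, and the protein-change pattern deliberately covers only simple single-residue substitutions such as p.R317G.

```python
import re


def parse_variant_id(vid):
    """Split an ID such as chr12_94581668_T_C into its four fields.

    rsplit from the right so a contig name containing underscores
    would stay intact (an assumption; the example ID has none).
    """
    chrom, pos, ref, alt = vid.rsplit("_", 3)
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


# One-letter reference residue, position, one-letter new residue.
SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def parse_substitution(change):
    """Return (ref_aa, position, alt_aa), or None for any other notation."""
    match = SUBSTITUTION.match(change)
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa


parsed = parse_variant_id("chr12_94581668_T_C")
change = parse_substitution("p.R317G")
```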

Copy Number Variants (CNVs)

Multiple callers were used to determine copy number variants (CNVs). A consensus CNV file was constructed, consisting of CNVs that were called by 2 or more copy number callers: ControlFreeC, CNVkit, and GATK CNV. See https://github.com/kids-first/kf-somatic-workflow/blob/master/docs/kfdrc-consensus-calling.md for more detail on how the calls were combined. The various nomenclatures used by the different callers are harmonized to standard descriptions of CNVs: deep deletion for when copy number equals zero, loss for when copy number is less than ploidy, neutral if copy number is the same as the genome copy number (cancer genomes may have a ploidy other than two), gain for up to two times ploidy, and amplification for a gain of more than two times ploidy. Copy number variants were retained only if they overlapped by at least 1 base pair with a gene’s exon. Further annotations, including alternative gene and protein IDs and cancer references, were also added.
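The harmonized categories and the exon-overlap rule can be written down as a short Python sketch. The function names are hypothetical, and the overlap check assumes 1-based inclusive coordinates, which the text does not specify.

```python
def cnv_status(copy_number, ploidy=2):
    """Map a copy number to the harmonized category, relative to ploidy."""
    if copy_number == 0:
        return "deep deletion"
    if copy_number < ploidy:
        return "loss"
    if copy_number == ploidy:
        return "neutral"
    if copy_number <= 2 * ploidy:
        return "gain"           # up to two times ploidy
    return "amplification"      # more than two times ploidy


def overlaps(start_a, end_a, start_b, end_b):
    """True if two intervals share at least 1 base pair.

    Assumes 1-based, inclusive coordinates on the same chromosome.
    """
    return start_a <= end_b and start_b <= end_a


statuses = [cnv_status(n) for n in (0, 1, 2, 4, 5)]
```

Because the categories are defined relative to ploidy, the same copy number can map to different labels: 3 copies is a gain in a diploid sample but neutral in a triploid one.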

Several variant frequencies were then calculated for each of those genes within each cancer group and cohort. The frequency in the overall dataset is the percentage of patients in the given cohort that have a CNV affecting that gene, out of all patients in that cohort. The frequency in primary or relapse tumors is the percentage of samples in the given cohort that have a CNV affecting that gene, out of all samples in that cohort. Note that the frequencies and counts may not tally as expected for several reasons. First, the total columns use unique patients, while the primary/relapse tumor columns use unique samples. Second, some submitters did not include information about the primary/relapse status of the samples, so those samples are omitted from the primary/relapse counts. Last, some patients or samples are included in multiple cohorts and may be counted multiple times. See the CNV frequencies documentation for details of how the unique variant ID, variant frequencies, and annotations were generated using custom R scripts.


The following table describes the fields and corresponding values for CNV data within MTP.

| Column Name | Description | Values |
| --- | --- | --- |
| Gene symbol | HGNC symbol for the given gene | |
| Gene Ensembl ID | Ensembl ID for gene | |
| Variant type | Categorical description of the variant type; cancer genomes may have a ploidy other than diploid, which is why the categories are described in terms of the ploidy of the sample | deep deletion = 0 copies, loss = fewer copies than ploidy, neutral = same as ploidy, gain = up to 2 times ploidy, amplification = more than 2 times ploidy |
| Variant category | | |
| Dataset | See the Dataset section in this document for more details | All Cohorts = all datasets combined, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PBTA = Pediatric Brain Tumor Atlas, GMFK = Gabriella Miller Kids First Neuroblastoma |
| Disease | Cancer type | See disease table |
| Total alterations / Subjects in dataset | Total number of samples with the CNV over the total number of disease samples in the given dataset | |
| Frequency in overall dataset | Fraction of the samples for the given disease in the given dataset that have the CNV | |
| Total primary tumors altered / Primary tumors in dataset | Same as Total alterations, but for primary tumors only | |
| Frequency in primary tumors | Same as Frequency in overall, but for primary tumors only | |
| Total relapse tumors altered / Relapse tumors in dataset | Same as Total alterations, but for relapse tumors only | |
| Frequency in relapse tumors | Same as Frequency in overall, but for relapse tumors only | |
| Gene full name | Full name of gene from HGNC | |
| PMTL | Whether the gene is a relevant target on the PMTL (Pediatric Molecular Target List) | Binary; either an R for relevant target or NR for non-relevant target, and left blank if no data |
| OncoKB cancer gene | Whether the gene is an annotated cancer gene listed in OncoKB https://www.oncokb.org/ | Binary; Y, N |
| OncoKB Oncogene\|TSG | Whether the gene is annotated as an oncogene or tumor suppressor (TSG) in OncoKB https://www.oncokb.org/ | oncogene = contributes to cancer development, TSG = tumor suppressor gene that suppresses cancer development, oncogene,TSG = if gene can be both, left blank if neither |

RNA Sequencing

Data Processing

The RNA-seq Alignment Workflow begins by trimming adapters using Cutadapt, but only if adapters are provided. Sequencing quality is checked using FastQC, and tumor/normal pairs are double-checked using NGSCheckMate to confirm they are from the same individual. For more details on sample identity confirmation, please see the Kids First NGS Checkmate Workflow. Reads are then aligned to reference genome GRCh38 using STAR in two-pass mode. While all MTP data is paired-end, methods are provided for single-end alignment if you are interested in processing your data in the same manner. Transcripts are quantified using RSEM with the GENCODE v27 annotation, except for the GTEx samples, which were not re-processed and are annotated using GENCODE v26. Fusion calling is done using both Arriba and STAR-Fusion, and the calls are then filtered for high-confidence fusions using annoFuse. QC metrics for the alignment are summarized using RNA-SeQC. If you would like to view the code in more detail, please see the GitHub release Kids First RNA-seq Workflow, and if you would like to run the pipeline, please see the CAVATICA App. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation.

RNA Sequencing Data

Fusions

Gene fusions are called solely from RNA sequencing using the programs above. Fusions are filtered using custom R scripts. Fusion calls are retained if they are called by both STAR-Fusion and Arriba and if the fusion is specific and present in 3 or more samples in a single disease. Fusions are then annotated with gene- and fusion-specific information, as well as whether the genes are known cancer genes in OncoKB, TCGA, and COSMIC. Summary frequencies are calculated using R. See the fusion filtering documentation for specific code and further details.
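Two of the retention criteria above, support from both callers and recurrence in 3 or more samples of a single disease, can be sketched in Python as follows. This is an illustration, not the project's R filtering code: the tuple layout, caller labels, and function name are hypothetical, the gene and disease names are placeholders, and the "specific" criterion is not implemented because the text does not define it.

```python
from collections import defaultdict

REQUIRED_CALLERS = {"arriba", "starfusion"}  # hypothetical labels


def filter_fusions(calls, min_samples=3):
    """Keep fusions called by both callers in >= min_samples samples.

    `calls` is an iterable of (sample, disease, fusion_name, caller).
    Returns {(disease, fusion_name): sorted list of supporting samples}.
    """
    callers_seen = defaultdict(set)
    for sample, disease, fusion, caller in calls:
        callers_seen[(sample, disease, fusion)].add(caller)

    samples_with = defaultdict(set)
    for (sample, disease, fusion), seen in callers_seen.items():
        if REQUIRED_CALLERS <= seen:  # both callers agree in this sample
            samples_with[(disease, fusion)].add(sample)

    return {
        key: sorted(samples)
        for key, samples in samples_with.items()
        if len(samples) >= min_samples
    }


calls = []
for sample in ("S1", "S2", "S3"):          # both callers, 3 samples: kept
    for caller in ("arriba", "starfusion"):
        calls.append((sample, "disease_x", "GENE1--GENE2", caller))
calls.append(("S4", "disease_x", "GENE3--GENE4", "arriba"))  # one caller only
for sample in ("S1", "S2"):                # both callers, only 2 samples
    for caller in ("arriba", "starfusion"):
        calls.append((sample, "disease_x", "GENE5--GENE6", caller))

kept = filter_fusions(calls)
```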


The following table describes the fields and corresponding values for the gene fusion data at both the gene and variant level within MTP.

| Annotation | Description | Values |
| --- | --- | --- |
| Fusion Name | Names of the fused genes, separated by “--”. Gene order is the order in which they fused, 5’ to 3’, with the dashes representing the breakpoint. If the fusion is intergenic, the location is represented by the two closest genes separated by a slash (/). For example, "AACSP1--GABRP/RANBP17" means that AACSP1 fused with intergenic DNA between GABRP and RANBP17. | |
| Fusion Type | Whether the genes have fused in-frame or not | frameshift = reading frame is shifted by 1 or 2 bp, in-frame = reading frame remains the same, other if neither applies |
| Gene Position | Whether the gene is to the 5’ or 3’ of the breakpoint | Gene1A = gene to left of breakpoint, Gene1B = gene to right of breakpoint, Gene2A, Gene2B = same side of breakpoint as others, but for the second gene given in an intergenic fusion |
| Fusion Annotation | Whether the fusion is found in other data sources | TCGAfusion = found in TCGA data, left blank if not |
| Breakpoint Location | Qualitative description of where the breakpoint is located relative to the annotated gene | genic = breakpoint is in exon, intragenic = breakpoint is in gene but not the exon, intergenic = breakpoint is between genes |
| Annotations | Other information on the fusion from callers Arriba and STAR-Fusion, including whether the fusion was previously seen in cancer, more details about the fusion construction, and more description of the type of fusion | See the Arriba docs at https://arriba.readthedocs.io/en/latest/output-files/ and the STAR-Fusion docs at https://github.com/STAR-Fusion/STAR-Fusion/wiki#Outputs for detailed descriptions of the annotations |
| Kinase Domain Retained Gene1A | Whether the kinase domain is retained in the 5’ gene | Yes = kinase domain retained, Partial = some but not all of the kinase domain is present, No = none of the kinase domain is present, left blank if no kinase domain in protein |
| Kinase Domain Retained Gene1B | Whether the kinase domain is retained in the 3’ gene | Yes = kinase domain retained, Partial = some but not all of the kinase domain is present, No = none of the kinase domain is present, left blank if no kinase domain in protein |
| Reciprocal exists either gene kinase | Whether or not the reciprocal fusion, with the gene order around the breakpoint swapped, exists | Binary; TRUE, FALSE |
| Gene1A Annotation | A limited set of simplified annotations for the most 5’ gene in the fusion | CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply |
| Gene1B Annotation | A limited set of simplified annotations for the most 3’ gene in the fusion | CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply |
| Gene2A Annotation | A limited set of simplified annotations for the second gene on the 5’ side of an intergenic fusion | CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply |
| Gene2B Annotation | A limited set of simplified annotations for the second gene on the 3’ side of an intergenic fusion | CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply |
| Gene Ensembl ID | Official gene ID from Ensembl | |
| Dataset | See the Dataset section in this document for more details | All Cohorts = all datasets combined, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PBTA = Pediatric Brain Tumor Atlas, GMFK = Gabriella Miller Kids First Neuroblastoma |
| Disease | Cancer type | See disease table |
| Total alterations / Subjects in dataset | Total number of samples with the given fusion over the total number of disease samples in the given dataset | |
| Frequency in overall dataset | Fraction of the samples for the given disease in the given dataset that have the fusion | |
| Total primary tumors altered / Primary tumors in dataset | Same as Total alterations, but for primary tumors only | |
| Frequency in primary tumors | Same as Frequency in overall, but for primary tumors only | |
| Total relapse tumors altered / Relapse tumors in dataset | Same as Total alterations, but for relapse tumors only | |
| Frequency in relapse tumors | Same as Frequency in overall, but for relapse tumors only | |
| Gene full name | Full name of the gene from HGNC | |
| PMTL | Whether the gene is a relevant target on the PMTL (Pediatric Molecular Target List) | Binary; either an R for relevant target or NR for non-relevant target, and left blank if no data |
| OncoKB cancer gene | Whether the gene is an annotated cancer gene listed in OncoKB https://www.oncokb.org/ | Binary; Y, N |
| OncoKB Oncogene\|TSG | Whether the gene is annotated as an oncogene or tumor suppressor (TSG) in OncoKB https://www.oncokb.org/ | oncogene = contributes to cancer development, TSG = tumor suppressor gene that suppresses cancer development, oncogene,TSG = if gene can be both, left blank if neither |
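The Fusion Name convention in the table above can be unpacked programmatically. The Python sketch below is a convenience illustration: the function name is hypothetical, and the mapping of name parts onto Gene1A/Gene2A (5’ side) and Gene1B/Gene2B (3’ side) is an assumption inferred from the Gene Position row.

```python
def parse_fusion_name(name):
    """Split a fusion name such as AACSP1--GABRP/RANBP17.

    "--" marks the breakpoint; "/" separates the two genes flanking an
    intergenic breakpoint. Missing second genes are returned as None.
    """
    five_prime, three_prime = name.split("--", 1)
    left = five_prime.split("/")
    right = three_prime.split("/")
    return {
        "Gene1A": left[0],
        "Gene2A": left[1] if len(left) > 1 else None,
        "Gene1B": right[0],
        "Gene2B": right[1] if len(right) > 1 else None,
    }


fusion = parse_fusion_name("AACSP1--GABRP/RANBP17")
```

Splitting on the double dash first means gene symbols that contain a single hyphen are left intact.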

Gene Expression

TPMs (transcripts per million) were calculated using RSEM and plotted using R. Please see the CAVATICA App for more details. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation.

OpenPedCan Gene Expression Boxplot

The OpenPedCan gene expression boxplot (Figure 1) summarizes the expression levels of a gene in multiple cancer and normal tissue types. The plotted gene expression levels are obtained from the bulk-tissue RNA-seq data in the OpenPedCan-analysis release. Each box summarizes the expression levels of one cancer or normal tissue type. The x-axis label of each box lists the corresponding cancer or normal tissue type, dataset, biospecimen type, and total number of samples. The y-axis value is the gene expression level in transcripts per million (TPM). The y-axis scale can be either TPM or log10(TPM + 1), selected by clicking the "Linear" (default) or "Log10" tab on the top left side of the boxplot.

Figure 1. OpenPedCan gene expression boxplot that summarizes the expression of NBPF1 gene in pediatric neuroblastoma and normal adult tissues.
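The Log10 scale of the boxplot uses log10(TPM + 1). A short Python sketch of that transform is shown here; the function name is hypothetical and the snippet is not taken from the platform code.

```python
import math


def log10_tpm(tpm):
    """Transform a TPM value to the log10(TPM + 1) scale."""
    return math.log10(tpm + 1)


# TPM values of 0, 9, and 99 land on 0, 1, and 2 on the log scale.
scaled = [log10_tpm(value) for value in (0, 9, 99)]
```

Adding 1 before taking the logarithm keeps a TPM of zero defined and maps it to zero on the transformed axis.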

The OpenPedCan gene expression boxplot widget on Molecular Targets Platform (MTP) “Evidence” and “Gene symbol” pages plots different sets of cancer and normal tissue types. On an MTP “Evidence” page, the widget plots all cancer types that have the same Experimental Factor Ontology (EFO) ID as the page and all normal tissue types. On an MTP “Gene symbol” page, the widget plots all cancer types.

The gene expression levels in each boxplot are also summarized in a table that can be downloaded in different formats for further analysis, by clicking the "JSON", "CSV", or "TSV" button on the top right side of the boxplot. In the summary table, each box in the boxplot is summarized in a row with the following columns:

| Column name | Column description |
| --- | --- |
| xLabel | X-axis label. |
| specimenDescriptorFill | Biospecimen descriptor of the box fill color. |
| boxSampleCount | Number of samples. |
| geneEnsemblId | Ensembl ID of the plotted gene. |
| geneSymbol | Symbol of the plotted gene. |
| pmtl | US Food & Drug Administration Pediatric Molecular Target Lists designation of the plotted gene. |
| dataset | Dataset that contains the samples. |
| disease | Cancer type. |
| gtexTissueSubgroup | Normal tissue type. |
| efo | Cancer Experimental Factor Ontology (EFO) ID. |
| mondo | Cancer Mondo Disease Ontology (Mondo) ID. |
| gtexTissueSubgroupUberon | Normal tissue type Uberon anatomical ontology ID. |
| tpmMean | Mean of TPM values. |
| tpmSd | Standard deviation of TPM values. |
| tpmMin | Minimum TPM value. |
| tpm25thPercentile | 25th percentile of TPM values. |
| tpmMedian | Median of TPM values. |
| tpm75thPercentile | 75th percentile of TPM values. |
| tpmMax | Maximum TPM value. |
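As a rough illustration of how the TPM summary columns relate to the underlying values, the Python sketch below computes them for one box. It is not the platform's implementation: the function name is hypothetical, and the choice of sample standard deviation and of the "inclusive" percentile method are assumptions, since the document does not state which definitions are used.

```python
import statistics


def summarize_tpm(values):
    """Compute the per-box TPM summary columns for a list of TPM values.

    Needs at least two values (required by stdev and quantiles).
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "boxSampleCount": len(values),
        "tpmMean": statistics.mean(values),
        "tpmSd": statistics.stdev(values),  # sample SD (assumption)
        "tpmMin": min(values),
        "tpm25thPercentile": q1,
        "tpmMedian": median,
        "tpm75thPercentile": q3,
        "tpmMax": max(values),
    }


row = summarize_tpm([1.0, 2.0, 3.0, 4.0, 5.0])
```

Values recomputed this way may differ slightly from the downloaded table if the platform uses a different percentile or standard deviation definition.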