Molecular Targets Platform Documentation

The Open Pediatric Cancer (OpenPedCan) project at the Children’s Hospital of Philadelphia, in partnership with the National Cancer Institute, is combining and harmonizing pediatric cancer datasets and integrating them into the pediatric Molecular Targets Platform (MTP) in order to accelerate pediatric cancer target identification and drug development. This is a high-level overview of the MTP data processing and analysis. For more information on the MTP itself, see the about page.

For documentation on previous versions of the Molecular Targets Platform, please see the previous version v1.1 or here for older versions.

Datasets
DNA Sequencing
- Data Processing
  - DNA-seq Alignment and Variant Calling Workflow
  - Somatic CNV Calling
- Somatic Alteration Data
  - Small Variants
  - Copy Number Variants (CNVs)
RNA Sequencing
- Data Processing
- RNA Sequencing Data
Methylation
- Data Processing
  - Methylation Arrays
- Methylation Data
Data Availability
- Harmonized Data Download

Datasets

While adult pan-cancer repositories have existed and accelerated cancer research for a decade, pediatric cancers have been excluded, despite having different genetic and molecular etiologies than adult cancers. Over the past few years, larger pediatric consortia, both disease-specific and pan-cancer, have tried to address this disparity. The Molecular Targets Platform is harmonizing the data from across these different consortia in one unified location where it can be queried for associations between putative targets and pediatric cancers. As the project is ongoing, more data will continue to be added, but this current release includes harmonized pediatric consortia datasets as well as external consortia GTEx and TCGA data for comparisons to normal and cancer tissue expression respectively:

Primary tumors follow the standard definition of the original tumor in the body. However, for analysis purposes, MTP combines progressive, progressive post-mortem, and relapse tumors into our relapse tumor stage. If needed, more detailed sample information is available in the histology data provided through the OpenPedCan-analysis GitHub repository.

For expanded descriptions of the datasets, please see the Pediatric Cancer Data Sources on the About page on the Molecular Targets Platform https://moleculartargets.ccdi.cancer.gov/about.

DNA Sequencing

Data Processing

DNA-seq Alignment and Variant Calling Workflow

For whole genome, whole exome, and targeted panel DNA sequencing, the workflow begins by flagging duplicates and aligning fastq files, or re-aligning previously aligned BAMs, to the reference genome GRCh38 using bwa mem. The majority of Pediatric Molecular Target data is paired-end, but single-end methods are provided if you want to apply the pipeline to your own data. Sequencing quality is checked using FastQC, and NGSCheckMate is used to confirm that tumor/normal DNA and normal DNA/tumor RNA pairs come from the same individual. For more details on sample identity confirmation, please see the NGS Checkmate Workflow. Somatic variants are called using Mutect2, Strelka2, Lancet, and VarDict, followed by consensus calling (variants reported by at least 2 of the 4 callers) and hotspot scavenging. For more detailed alignment and variant calling methods, please refer to the OpenPBTA manuscript. To run the CAVATICA apps yourself, see the OpenPBTA workflows repository and the CAVATICA App. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation.

CHOP P30 targeted sequencing panel data was processed using the GRCh37 reference with CHOP's Division of Genomic Diagnostics published pipeline and then coordinates were lifted over to GRCh38. For more details on the panel design and variant calling pipeline, please see the methods in the Genome Medicine paper.

Somatic CNV Calling

Copy number variants are called using multiple tools: ControlFreeC, CNVkit, and GATK CNV. CNVkit calls are adjusted for sex using a pre-calculated germline sex estimate. Manta is used to call structural variants (SVs). All calls are made using GRCh38 references and variants are then annotated with gnomAD frequencies. Publicly available files are further subjected to “germline masking,” which removes low frequency variants that could be used to identify the sample donor. For more details, see the GitHub release at Kids First Somatic Variant Workflow, or to run the pipeline, see the CAVATICA App. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation. Panel data was not reprocessed and the variant files provided by the submitters were merged with our harmonized data files.

Somatic Alteration Data

Small Variants

Multiple callers were used to determine single nucleotide variants (SNVs) since the literature suggests this reduces false positives. Using custom R scripts, a consensus SNV file was constructed, consisting only of SNVs that were called by 2 or more variant callers: GATK Mutect2, Strelka2, Lancet, and VarDict Java. See the consensus calling documentation for more detail on how the calls were combined. Annotations, including alternative gene and protein IDs and cancer references, were also added; see the annotation calling workflow for more details.
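The "2 or more callers" rule can be illustrated with a minimal Python sketch. This is not the project's R implementation: the function name `consensus_variants`, the tuple layout of a variant key, and the toy coordinates are all assumptions made for illustration, and hotspot scavenging is not modeled.

```python
from collections import defaultdict


def consensus_variants(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` distinct callers.

    `calls_by_caller` maps a caller name to an iterable of variant keys,
    here hypothetical (chrom, pos, ref, alt) tuples.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    # Keep each variant together with the sorted list of callers behind it.
    return {v: sorted(c) for v, c in support.items() if len(c) >= min_callers}


# Toy input with made-up coordinates, not real calls.
calls = {
    "Mutect2": [("chr12", 94581668, "T", "C"), ("chr1", 1000, "A", "G")],
    "Strelka2": [("chr12", 94581668, "T", "C")],
    "Lancet": [("chr2", 2000, "G", "T")],
    "VarDict": [("chr1", 1000, "A", "G"), ("chr2", 2000, "G", "T"),
                ("chr3", 3000, "C", "A")],
}
consensus = consensus_variants(calls)
```

In this toy input the chr3 variant is dropped because only one caller reports it; the other three variants are each supported by two callers and are kept.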

A unique variant ID consisting of the hg38 coordinates and the reference and alternative alleles was created for consistency. Then several variant frequencies were calculated for each of those IDs within each cancer group and cohort. The frequency in the overall dataset, for each unique variant and gene, is the percentage of patients that have that variant or gene in the given cohort out of all patients in that cohort. The frequency in primary or relapse tumors, for each unique variant and gene, is the percentage of samples that have that variant or gene in the given cohort out of all samples in that cohort. Note that the frequencies and counts may not tally as expected for several reasons. First, the total columns use unique patients, while the primary/relapse tumor columns use unique samples. Second, some submitters did not include information about the primary/relapse status of the samples, so those samples are omitted from the primary/relapse counts. Third, panel sample data is not included in the all cohorts calculation. Last, some patients or samples are included in multiple cohorts and may be counted multiple times. See the SNV frequencies documentation for details of how the unique variant ID, variant frequencies, and annotations were done using custom R scripts, and see the hotspot detection documentation for how SNV hotspots were called. Summarized tables are returned in response to queries on the Molecular Targets Platform (as described below). If you follow the links to Pediatric cBioPortal, sample-level data is available to view in the OncoPrint and Mutation tabs there as well as to download.
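To show why the overall and primary/relapse counts need not add up, here is a small Python sketch of the two denominators: unique patients for the overall frequency, unique samples for the tumor-stage frequencies. The record layout (`sample_id`, `patient_id`, `tumor_stage`) and the function names are hypothetical, and the actual calculation is done by the project's R scripts.

```python
def mutation_frequencies(samples, mutated_sample_ids):
    """Count mutated/total per patient (overall) and per sample (by stage).

    `samples` is a list of dicts with hypothetical keys `sample_id`,
    `patient_id`, and `tumor_stage` ("primary", "relapse", or None when the
    submitter did not report a stage).
    """
    mutated = set(mutated_sample_ids)
    all_patients = {s["patient_id"] for s in samples}
    mutated_patients = {s["patient_id"] for s in samples
                        if s["sample_id"] in mutated}
    result = {"overall": (len(mutated_patients), len(all_patients))}
    for stage in ("primary", "relapse"):
        stage_samples = [s for s in samples if s["tumor_stage"] == stage]
        hits = sum(s["sample_id"] in mutated for s in stage_samples)
        result[stage] = (hits, len(stage_samples))
    return result


def as_percent(count, total):
    """Percentage, returning 0.0 for an empty denominator."""
    return 100.0 * count / total if total else 0.0


# Toy cohort: patient P1 has two samples, and S4 has no reported stage.
samples = [
    {"sample_id": "S1", "patient_id": "P1", "tumor_stage": "primary"},
    {"sample_id": "S2", "patient_id": "P1", "tumor_stage": "relapse"},
    {"sample_id": "S3", "patient_id": "P2", "tumor_stage": "primary"},
    {"sample_id": "S4", "patient_id": "P3", "tumor_stage": None},
]
freqs = mutation_frequencies(samples, {"S1", "S2", "S4"})
```

Here three samples carry the mutation but only two patients do, and sample S4 contributes to the overall count while being absent from both tumor-stage counts.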

The following table gives a description of the fields and corresponding values for SNV data both at the gene and variant level within MTP.

Column Name	Description	Values
Gene symbol	HGNC symbol for the given gene
Variant ID hg38	Specific name for the variant in human genome hg38 coordinates; for example chr12_94581668_T_C means that base 94581668 on chromosome 12 is not the reference thymine (T) but mutated to a cytosine (C)
Protein change	Amino acid change if mutation causes one; for example p.R317G means that the 317th amino acid is changed from arginine (R) to glycine (G)
Dataset	See the Dataset section in this document for more details	All Cohorts = all datasets combined except for panel data, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, TARGET panel = same as TARGET but the cohort sequencing panel data only, PBTA = Pediatric Brain Tumor Atlas, GMKF = Gabriella Miller Kids First Neuroblastoma
Disease	Cancer type	See disease table
dbSNP ID	ID for variant in NCBI’s dbSNP database https://www.ncbi.nlm.nih.gov/snp/ if one exists	dbSNP ID starting with “rs” if it exists, blank if the variant is not in dbSNP but is in other variant databases, and novel if the variant is not in any database used
VEP impact	Predicted mutation impact from Ensembl Variant Effect Predictor; only mutations predicted to have some impact are reported	high = predicted to cause complete or nearly complete loss of function, moderate = predicted to reduce protein effectiveness, modifier = affects a non-coding region where predictions are difficult or there is no evidence of impact
SIFT impact	Predicted mutation impact from SIFT, with the score in parentheses. The closer the score is to 0, the more deleterious the mutation is predicted to be. If there is insufficient reference material available at that position, SIFT will warn that there’s low confidence in the predicted impact. SIFT only makes predictions for missense variants.	deleterious, deleterious_low_confidence = SIFT score between 0 and 0.05 where the mutation is predicted to decrease protein function, tolerated, tolerated_low_confidence = SIFT score between 0.05 and 1 where the mutation probably doesn’t affect protein function (the closer to 1, the more likely this is), left blank if SIFT cannot be applied to the variant
PolyPhen impact	Predicted mutation impact from PolyPhen, with the score in parentheses. The closer the score is to 1, the more deleterious the impact is predicted to be. If there isn’t sufficient data to make a prediction it’s reported as “unknown.” PolyPhen only makes predictions for missense variants.	probably_damaging = mutation predicted to reduce or eliminate protein function with a score between 0.909 and 1, possibly_damaging = mutation may affect protein function with a score between 0.446 and 0.908, benign = no effect on protein function with a score between 0.001 and 0.445, unknown = no prediction due to lack of data and assigned a score of 0, left blank if PolyPhen cannot be applied to the variant
Variant classification	Variant effect on protein, whether frame is maintained, whether it’s a mutation or insertion/deletion and whether it affects specific translation regions	Frame_Shift_Del = deletion that changes reading frame, Frame_Shift_Ins = insertion that changes reading frame, In_Frame_Del = deletion but reading frame is unchanged, In_Frame_Ins = insertion but reading frame is unchanged, Missense_Mutation = small variant that changes the amino acid coded for, Nonsense_Mutation = small variant that adds a stop codon, Nonstop_Mutation = mutation in the stop codon so that it no longer functions as a stop codon, Splice_Site = mutation at an exon-intron boundary at a splice site, Translation_Start_Site = mutation at the translation start site
Variant type	The type of small variant; how many bases are affected or whether there was a small insertion or deletion	DEL = deletion, DNP = double nucleotide polymorphism, INS = insertion, ONP = oligo-nucleotide polymorphism, SNP = single nucleotide polymorphism, TNP = triple nucleotide polymorphism
Gene full name	Full name of the gene from HGNC
Gene type	A limited set of simplified annotations on what type of gene it is, especially whether it’s a known cancer gene	CosmicCensus = gene is in COSMIC, Kinase = gene is a kinase, Oncogene = gene is a known oncogene, TranscriptionFactor = gene is a transcription factor, TumorSuppressorGene = gene is a known tumor suppressor, left blank if no annotations apply
Protein RefSeq ID	Refseq ID for the protein (not the gene)
Gene Ensembl ID	Ensembl ID for gene
Protein Ensembl ID	Ensembl ID for protein
Total mutations / Subjects in dataset	Total number of samples with the SNV over the total number of disease samples in the given dataset
Frequency in overall dataset	Fraction of the samples for the given disease in the given dataset that have the SNV
Total primary tumors mutated / Primary tumors in dataset	Same as Total mutations, but for primary tumors only
Frequency in primary tumors	Same as Frequency in overall, but for primary tumors only
Total relapse tumors mutated / Relapse tumors in dataset	Same as Total mutations, but for relapse tumors only
Frequency in relapse tumors	Same as Frequency in overall, but for relapse tumors only
HotSpot	Whether this is a known recurrently occurring (hotspot) cancer mutation	binary: Y, N
OncoKB cancer gene	Whether the gene is an annotated cancer gene listed in OncoKB https://www.oncokb.org/	binary: Y, N
OncoKB Oncogene|TSG	Whether the gene is annotated as an oncogene or tumor suppressor (TSG) in OncoKB https://www.oncokb.org/	oncogene = contributes to cancer development, TSG = tumor suppressor gene that suppresses cancer development, oncogene,TSG = if gene can be both, left blank if neither
PedcBio PedOT oncoprint plot	Link to oncoprint plot at Pediatric cBioPortal
PedcBio PedOT mutation plot	Link to mutation plot at Pediatric cBioPortal
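For readers working with downloaded tables, the following Python sketch shows one way to split a "Variant ID hg38" value such as chr12_94581668_T_C into its parts, and to separate a SIFT or PolyPhen value into its label and score. The helper names are hypothetical, and the `label(score)` layout is an assumption based on the description "with the score in parentheses"; check it against the actual files before relying on it.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id):
    """Split an ID such as 'chr12_94581668_T_C' into its four fields.

    Splitting from the right keeps contig names that themselves contain
    underscores intact.
    """
    chrom, pos, ref, alt = variant_id.rsplit("_", 3)
    return Variant(chrom, int(pos), ref, alt)


def make_variant_id(chrom, pos, ref, alt):
    """Build the ID in the same chromosome_position_ref_alt layout."""
    return f"{chrom}_{pos}_{ref}_{alt}"


# Assumed layout: a label followed by the score in parentheses.
_PREDICTION = re.compile(r"^(?P<label>[A-Za-z_]+)\((?P<score>[0-9.]+)\)$")


def parse_prediction(text):
    """Return (label, score) or None for blank or unparseable values."""
    match = _PREDICTION.match(text.strip())
    if match is None:
        return None
    return match.group("label"), float(match.group("score"))


variant = parse_variant_id("chr12_94581668_T_C")
```

The example ID is the one used in the table above: position 94581668 on chromosome 12, reference T, alternative C.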

Copy Number Variants (CNVs)

Multiple callers were used to determine copy number variants (CNVs). A consensus CNV file was constructed, consisting of CNVs that were called by 2 or more copy number callers: ControlFreeC, CNVkit, and Manta SV. See the copy_number_consensus_call GitHub module for more detail on how the calls were combined. The various nomenclatures used by the different callers are harmonized to standard descriptions of CNVs: deep deletion for when copy number equals zero, loss for when copy number is less than ploidy, neutral if copy number is the same as the genome copy number (cancer genomes may have a ploidy other than two), gain for up to two times ploidy, and amplification for a gain of more than two times ploidy. Copy number variants were retained only if they overlapped by at least 1 base pair with a gene’s exon. Further annotations, including alternative gene and protein IDs and cancer references, were also added.
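The harmonized CNV categories and the exon-overlap rule described above can be sketched in Python as follows. The function names are hypothetical, and the overlap check assumes closed (inclusive) coordinates on the same chromosome; the project's own scripts may use a different convention.

```python
def cnv_status(copy_number, ploidy):
    """Map a copy number to the harmonized category, relative to ploidy."""
    if copy_number == 0:
        return "deep deletion"
    if copy_number < ploidy:
        return "loss"
    if copy_number == ploidy:
        return "neutral"
    if copy_number <= 2 * ploidy:
        return "gain"
    return "amplification"


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base pair."""
    return a_start <= b_end and b_start <= a_end
```

For a diploid sample, 3 or 4 copies is a gain and 5 or more is an amplification; for a sample with ploidy 3, a copy number of 3 is neutral.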

Then several variant frequencies were calculated for each of those genes within each cancer group and cohort. The frequency in the overall dataset is the percentage of patients that have a CNV affecting that gene in the given cohort out of all patients in that cohort. The frequency in primary or relapse tumors is the percentage of samples that have a CNV affecting that gene in the given cohort out of all samples in that cohort. Note that the frequencies and counts may not tally as expected for several reasons. First, the total columns use unique patients, while the primary/relapse tumor columns use unique samples. Second, some submitters did not include information about the primary/relapse status of the samples, so those samples are omitted from the primary/relapse counts. Last, some patients or samples are included in multiple cohorts and may be counted multiple times. See the CNV frequencies documentation for details of how the unique variant ID, variant frequencies, and annotations were done using custom R scripts.

The following table gives a description of the fields and corresponding values for CNV data within MTP.

Column Name	Description	Values
Gene symbol	HGNC symbol for the given gene
Gene Ensembl ID	Ensembl ID for gene
Variant type	Categorical description of the variant type; cancer genomes may have a ploidy other than diploid which is why the categories are described in terms of the ploidy of the sample	deep deletion = 0 copies, loss = fewer copies than ploidy, neutral = same as ploidy, gain = up to 2 times ploidy, amplification = more than 2 times ploidy
Variant category
Dataset	See the Dataset section in this document for more details	All Cohorts = all datasets combined, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PBTA = Pediatric Brain Tumor Atlas, GMKF = Gabriella Miller Kids First Neuroblastoma
Disease	Cancer type	See disease table
Total alterations / Subjects in dataset	Total number of samples with the CNV over the total number of disease samples in the given dataset
Frequency in overall dataset	Fraction of the samples for the given disease in the given dataset that have the CNV
Total primary tumors altered / Primary tumors in dataset	Same as Total alterations, but for primary tumors only
Frequency in primary tumors	Same as Frequency in overall, but for primary tumors only
Total relapse tumors altered / Relapse tumors in dataset	Same as Total alterations, but for relapse tumors only
Frequency in relapse tumors	Same as Frequency in overall, but for relapse tumors only
Gene full name	Full name of gene from HGNC
OncoKB cancer gene	Whether the gene is an annotated cancer gene listed in OncoKB https://www.oncokb.org/	binary: Y, N
OncoKB Oncogene|TSG	Whether the gene is annotated as an oncogene or tumor suppressor (TSG) in OncoKB https://www.oncokb.org/	oncogene = contributes to cancer development, TSG = tumor suppressor gene that suppresses cancer development, oncogene,TSG = if gene can be both, left blank if neither

RNA Sequencing

Data Processing

The RNA-seq Alignment Workflow begins by trimming adapters using Cutadapt, but only if adapters are provided. Sequencing quality is checked using FastQC and tumor/normal pairs are double-checked to confirm they are from the same individual using NGSCheckMate. For more details on sample identity confirmation, please see the Kids First NGS Checkmate Workflow. Reads are then aligned using STAR in two-pass mode to reference genome GRCh38. While all MTP data is paired-end, methods are provided for single-end alignment if you are interested in processing your data in the same manner. Transcripts are quantified using RSEM at both the gene and isoform level with the GENCODE v39 annotation, except for external consortia data from GTEx and TCGA, which were not reprocessed and are annotated using GENCODE v26 and v36 respectively. For data that was not reprocessed, gene symbols were lifted over to the GENCODE v39 gene symbols to align with the rest of the data.

Fusion calling is done using both Arriba and STAR-Fusion and then filtered for high confidence fusion calls using annoFuse. QC metrics for the alignment are summarized using RNA-SeQC. If you would like to view the code in more detail, please see the GitHub release OpenPBTA RNA-seq Workflow, and if you would like to run the pipeline, please see the CAVATICA App. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation.

RNA Sequencing Data

Fusions

Gene fusions are called solely from RNA sequencing using the programs above. Fusions are filtered using custom R scripts. Fusion calls are retained if they are called by both STAR-Fusion and Arriba and if the fusion was specific and present in 3 or more samples in a single disease. Fusion panel data were not reprocessed and were merged with our harmonized fusion calls. Fusions were then annotated with gene- and fusion-specific information as well as whether they are known cancer genes from OncoKB, TCGA, and COSMIC.
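As a rough illustration of the retention rule as worded above, the Python sketch below keeps a fusion within a disease when at least 3 samples have it called by both STAR-Fusion and Arriba. The record layout, the function name, and the placeholder gene and disease names are assumptions; the "specific" criterion is not defined in this document and is not modeled, so the fusion filtering documentation remains the authoritative description.

```python
from collections import defaultdict

REQUIRED_CALLERS = {"STAR-Fusion", "Arriba"}


def filter_fusions(calls, min_samples=3):
    """Keep (fusion, disease) pairs supported by both callers in enough samples.

    `calls` is an iterable of dicts with hypothetical keys `fusion`,
    `sample_id`, `disease`, and `caller`.
    """
    callers = defaultdict(set)  # (fusion, sample) -> callers reporting it
    disease_of = {}             # sample -> disease
    for call in calls:
        callers[(call["fusion"], call["sample_id"])].add(call["caller"])
        disease_of[call["sample_id"]] = call["disease"]

    supported = defaultdict(set)  # (fusion, disease) -> agreeing samples
    for (fusion, sample), names in callers.items():
        if REQUIRED_CALLERS <= names:
            supported[(fusion, disease_of[sample])].add(sample)

    return {key: sorted(found) for key, found in supported.items()
            if len(found) >= min_samples}


# Toy calls with placeholder names, not real fusions.
calls = []
for sample in ("S1", "S2", "S3"):
    for caller in ("STAR-Fusion", "Arriba"):
        calls.append({"fusion": "GENEA--GENEB", "sample_id": sample,
                      "disease": "disease X", "caller": caller})
    calls.append({"fusion": "GENEC--GENED", "sample_id": sample,
                  "disease": "disease X", "caller": "Arriba"})
for caller in ("STAR-Fusion", "Arriba"):
    calls.append({"fusion": "GENEE--GENEF", "sample_id": "S4",
                  "disease": "disease X", "caller": caller})
kept = filter_fusions(calls)
```

Only GENEA--GENEB survives: GENEC--GENED is reported by a single caller, and GENEE--GENEF appears in only one sample.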

Summary frequencies are calculated using R. See the fusion filtering documentation for specific code and further details. Panel data was not reprocessed and the variant files provided by the submitters were merged with our harmonized data files.

The following table gives a description of the fields and corresponding values for the gene fusion data both at the gene and variant level within MTP.

Annotation	Description	Values
Fusion Name	Names of the fused genes separated by “--”. Genes are listed in the order they fused, 5’ to 3’, with the dashes representing the breakpoint. If the fusion is intergenic, the location is represented by the two closest genes separated by a slash, /. For example, "AACSP1--GABRP/RANBP17" means that AACSP1 fused with intergenic DNA between GABRP and RANBP17.
Fusion Type	Whether the genes have fused in-frame or not.	frameshift = reading frame is shifted by 1 or 2 bp, in-frame = reading frame remains the same, other if neither applies
Gene Position	Whether the gene is to the 5’ or 3’ of the breakpoint	Gene1A = gene to left of breakpoint, Gene1B = gene to right of breakpoint, Gene2A, Gene2B = same side of breakpoint as others, but for the second gene given in an intergenic fusion
Fusion Annotation	Whether the fusion is found in other data sources	TCGAfusion = found in TCGA data, left blank if not
Breakpoint Location	Qualitative description of where the breakpoint is located related to the annotated gene	genic = breakpoint is in exon, intragenic = breakpoint is in gene but not the exon, intergenic = breakpoint is between genes
Annotations	Other information on the fusion from callers Arriba and STAR-Fusion that includes if the fusion was previously seen in cancer, more details about the fusion construction and more description about the type of fusion.	See the Arriba docs at https://arriba.readthedocs.io/en/latest/output-files/ and the STAR-Fusion docs at https://github.com/STAR-Fusion/STAR-Fusion/wiki#Outputs for detailed descriptions of the annotations
Kinase Domain Retained Gene1A	Whether the kinase domain is retained in the 5’ gene	Yes = kinase domain retained, Partial = some but not all of the kinase domain is present, No = none of the kinase domain is present, left blank if no kinase domain in protein
Kinase Domain Retained Gene1B	Whether the kinase domain is retained in the 3’ gene	Yes = kinase domain retained, Partial = some but not all of the kinase domain is present, No = none of the kinase domain is present, left blank if no kinase domain in protein
Reciprocal exists either gene kinase	Whether or not the reciprocal fusion with the gene order around the breakpoint swapped exists	Binary; TRUE, FALSE
Gene1A Annotation	A limited set of simplified annotations for the most 5’ gene in the fusion	CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply
Gene1B Annotation	A limited set of simplified annotations for the most 3’ gene in the fusion	CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply
Gene2A Annotation	A limited set of simplified annotations for the second gene on the 5’ side of an intergenic fusion	CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply
Gene2B Annotation	A limited set of simplified annotations for the second gene on the 3’ side of an intergenic fusion	CosmicCensus = fusion is in COSMIC, Kinase = one of the fusion genes is a kinase, Oncogene = one gene is a known oncogene, TranscriptionFactor = one of the fusion genes is a transcription factor, TumorSuppressorGene = one gene is a known tumor suppressor, left blank if no annotations apply
Gene Ensembl ID	Official gene ID from Ensembl
Dataset	See the Dataset section in this document for more details	All Cohorts = all datasets combined except for panel data, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PBTA = Pediatric Brain Tumor Atlas, GMKF = Gabriella Miller Kids First Neuroblastoma, CHOP P30 Panel = CHOP diagnostic fusion panel
Disease	Cancer type	See disease table
Total alterations / Subjects in Dataset	Total number of samples with the given fusions over the total number of disease samples in the given Dataset
Frequency in overall dataset	Fraction of the samples for the given disease in the given dataset that have the fusion
Total primary tumors altered / Primary tumors in dataset	Same as Total alterations, but for primary tumors only
Frequency in primary tumors	Same as overall frequency, but for primary tumors only
Total relapse tumors altered / Relapse tumors in dataset	Same as Total alterations, but for relapse tumors only
Frequency in relapse tumors	Same as overall frequency, but for relapse tumors only
Gene full name	Full name of the gene from HGNC
OncoKB Cancer Gene	Whether the gene is an annotated cancer gene listed in OncoKB https://www.oncokb.org/	binary: Y, N
OncoKB Oncogene|TSG	Whether the gene is annotated as an oncogene or tumor suppressor (TSG) in OncoKB https://www.oncokb.org/	oncogene = contributes to cancer development, TSG = tumor suppressor gene that suppresses cancer development, oncogene,TSG = if gene can be both, left blank if neither
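The "Fusion Name" convention from the table above ("--" at the breakpoint, "/" between the two genes flanking an intergenic location) can be unpacked with a short Python sketch. The function name and the returned dictionary keys are hypothetical.

```python
def parse_fusion_name(name):
    """Split a fusion name into the genes on each side of the breakpoint.

    Each side is returned as a list: one gene for a genic partner, or the
    two closest genes for an intergenic location.
    """
    five_prime, separator, three_prime = name.partition("--")
    if not separator:
        raise ValueError(f"no '--' breakpoint separator in {name!r}")
    return {
        "five_prime": five_prime.split("/"),
        "three_prime": three_prime.split("/"),
    }


parsed = parse_fusion_name("AACSP1--GABRP/RANBP17")
```

For the example from the table, the 5’ side is the single gene AACSP1 and the 3’ side is the intergenic region between GABRP and RANBP17.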

Gene Expression

TPMs (transcripts per million reads) were calculated using RSEM and plotted using R. Please see the CAVATICA App for more details. Once on the CAVATICA workflow page, please click on the "Read All" link to open up the full documentation. TPMs from The Cancer Genome Atlas (TCGA) were pulled directly from that platform and were not recalculated.

OpenPedCan Gene Expression Boxplot

The OpenPedCan gene expression boxplot (Figure 1) summarizes the expression levels of a gene in multiple cancer and normal tissue types. The plotted gene expression levels are obtained from the bulk-tissue RNA-seq data in the OpenPedCan-analysis release. Each box summarizes the expression levels of one cancer or normal tissue type. The x-axis label of each box lists the corresponding cancer or normal tissue type, dataset, biospecimen type, and total number of samples. The y-axis shows the gene expression level in transcripts per million (TPM). The y-axis scale can be either TPM or log10(TPM + 1), selected by clicking the appropriate tab on the top left side of the boxplot. Depending on the view, the same tabs also select whether to show pediatric data alone ("OpenPedCan Linear", "OpenPedCan Log10"), with normal adult tissue expression from GTEx ("GTEx Linear", "GTEx Log10"), or with adult cancer expression from TCGA ("TCGA Linear", "TCGA Log10").

Figure 1. OpenPedCan gene expression boxplot that summarizes the expression of the MYCN gene in pediatric acute myeloid leukemia with normal adult tissue expression from GTEx.

Figure 2. OpenPedCan gene expression boxplot that summarizes the expression of the MYCN gene in pediatric acute myeloid leukemia with adult cancer expression from TCGA.

The OpenPedCan gene expression boxplot widget on Molecular Targets Platform (MTP) “Evidence” and “Gene symbol” pages plots different sets of cancer and normal tissue types. On an MTP “Evidence” page, the widget plots all cancer types that have the same Experimental Factor Ontology (EFO) ID as the page and all normal tissue types. On an MTP “Gene symbol” page, the widget plots all cancer types.

The gene expression levels in each boxplot are also summarized in a table that can be downloaded in different formats for further analysis, by clicking the "JSON", "CSV", or "TSV" button on the top right side of the boxplot. In the summary table, each box in the boxplot is summarized in a row with the following columns:

Column name	Column description
xLabel	X-axis label.
specimenDescriptorFill	Biospecimen descriptor of the box fill color.
boxSampleCount	Number of samples.
geneEnsemblId	Ensembl ID of the plotted gene.
geneSymbol	Symbol of the plotted gene.
pmtl	US Food & Drug Administration Pediatric Molecular Target Lists designation of the plotted gene.
dataset	Dataset that contains the samples.
disease	Cancer type.
gtexTissueSubgroup	Normal tissue type.
efo	Cancer Experimental Factor Ontology (EFO) ID.
mondo	Cancer Mondo Disease Ontology (Mondo) ID.
gtexTissueSubgroupUberon	Normal tissue type Uberon anatomical ontology ID.
tpmMean	Mean of TPM values.
tpmSd	Standard deviation of TPM values.
tpmMin	Minimum TPM value.
tpm25thPercentile	25th percentile of TPM values.
tpmMedian	Median of TPM values.
tpm75thPercentile	75th percentile of TPM values.
tpmMax	Maximum TPM value.
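The per-box statistics in the summary table, and the log10(TPM + 1) scale used by the boxplot, can be reproduced approximately with the Python standard library. This is a sketch under assumptions: the function names are hypothetical, and the choice of sample standard deviation and of the "inclusive" quantile method may differ from what the platform uses, so small differences from the downloaded values are possible.

```python
import math
import statistics


def summarize_tpm(tpm_values):
    """Summarize one box; needs at least two TPM values."""
    q1, median, q3 = statistics.quantiles(tpm_values, n=4, method="inclusive")
    return {
        "boxSampleCount": len(tpm_values),
        "tpmMean": statistics.mean(tpm_values),
        "tpmSd": statistics.stdev(tpm_values),  # sample SD (assumption)
        "tpmMin": min(tpm_values),
        "tpm25thPercentile": q1,
        "tpmMedian": median,
        "tpm75thPercentile": q3,
        "tpmMax": max(tpm_values),
    }


def log10_tpm(tpm):
    """Transform used for the log-scale view: log10(TPM + 1)."""
    return math.log10(tpm + 1)


# Toy TPM values for one box, not real expression data.
summary = summarize_tpm([0, 1, 2, 3, 4])
```

Adding 1 before taking the logarithm keeps genes with a TPM of 0 on the plot, since log10(0 + 1) is 0.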

Methylation

Data Processing

Methylation Arrays

Methylation arrays are aggregated from multiple independent projects, so samples have been measured using either 27K, 450K, or EPIC (850K) Illumina Infinium HumanMethylation BeadChips or Roche Nimblegen HELP microarrays. All Illumina arrays were reprocessed by cancer group from signal intensities to methylation values in R using the minfi Bioconductor package with default arguments. Only probes with gene annotations from Illumina were retained. The TARGET Acute Lymphoblastic Leukemia (ALL) samples were measured using a different platform, the Roche Nimblegen HELP array, so they were not reprocessed. Please see the OpenPedCan-analysis methylation preprocessing documentation for more details.

Methylation Data

Methylation values were summarized only for samples that have both RNA-seq expression data and methylation array data. Quantiles and correlation with RNA-seq TPMs were calculated for each annotated methylation array probe using custom R scripts. Please see the OpenPedCan-analysis methylation-summary module for more details.

The following table gives a description of the fields and corresponding values for the methylation data within MTP.

Column Name	Description	Values
Gene symbol	HGNC symbol for the given gene
Gene Ensembl ID	Ensembl ID for gene
Gene Feature	Description of the relative location of a probe near or within the gene body	five_prime_UTR = 5' untranslated region upstream of the gene, promoter = gene promoter region, exon = in an exon within the gene, intron = in an intron within the gene, three_prime_UTR = 3' untranslated region downstream of the gene, intergenic = not within an annotated gene
Dataset	See the Dataset section in this document for more details	All Cohorts = all datasets combined, TARGET = Therapeutically Applicable Research to Generate Effective Treatments, PBTA = Pediatric Brain Tumor Atlas
Disease	Cancer type	See disease table
Median TPM	Median transcripts per million (TPM) from the RNA-seq for the given gene
RNA Correlation	Correlation with RNA-seq for the given gene
Probe ID	Methylation array probe ID
Chromosome	Chromosomal location of the probe
Location	Location in bases of the probe on the chromosome in the previous column
Beta_Q1	Minimum value for all samples at the given probe
Beta_Q2	Quartile 1/25th percentile value for all samples at the given probe
Beta_Median	Median/50th percentile value for all samples at the given probe
Beta_Q4	Quartile 3/75th percentile value for all samples at the given probe
Beta_Q5	Maximum value for all samples at the given probe
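The per-probe summary columns in the table above can be illustrated with a short Python sketch that computes the five beta-value summaries and a correlation with expression for samples that have both data types. The function names are hypothetical, and two choices are assumptions rather than documented facts: Pearson correlation and the "inclusive" quantile method. The methylation-summary module is the authoritative source for how the values were actually computed.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def summarize_probe(beta_values, tpm_values):
    """Summarize one probe; both lists are aligned by sample."""
    q1, median, q3 = statistics.quantiles(beta_values, n=4, method="inclusive")
    return {
        "Beta_Q1": min(beta_values),      # minimum
        "Beta_Q2": q1,                    # 25th percentile
        "Beta_Median": median,            # 50th percentile
        "Beta_Q4": q3,                    # 75th percentile
        "Beta_Q5": max(beta_values),      # maximum
        "RNA Correlation": pearson(beta_values, tpm_values),
    }


# Toy values for five matched samples, not real data.
probe = summarize_probe([0.1, 0.2, 0.3, 0.4, 0.5], [50, 40, 30, 20, 10])
```

In this toy example expression falls as methylation rises, so the correlation is strongly negative.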

Data Availability

Harmonized Data Download

For publicly available harmonized data, please see the instructions for data download at the OpenPedCan-analysis repository: https://github.com/PediatricOpenTargets/OpenPedCan-analysis#how-to-obtain-openpedcan-data

Citing the Molecular Targets Platform

If you used the Pediatric Molecular Targets Platform or the OpenPedCan-analysis repository data or code, please cite us!

The Molecular Targets Platform (2022). NCI. https://moleculartargets.ccdi.cancer.gov
PediatricOpenTargets (2022). OpenPedCan-analysis (Version 1.1). https://doi.org/10.5281/zenodo.7383318
