Skip to content

Latest commit

 

History

History
550 lines (426 loc) · 65.6 KB

07.methods.md

File metadata and controls

550 lines (426 loc) · 65.6 KB

STAR METHODS

RESOURCE AVAILABILITY

Lead contact

Requests for access to OpenPBTA raw data and/or specimens may be directed to, and will be fulfilled by Jo Lynne Rokita ([email protected]).

Materials availability

This study did not create new, unique reagents.

Data and code availability

Raw and harmonized WGS, WXS, and RNA-Seq data derived from human samples are available within the KidsFirst Portal [@doi:10.24370/OPENPBTA] upon access request to the CBTN (https://cbtn.org/) as of the date of the publication. In addition, merged summary files are openly accessible at https://cavatica.sbgenomics.com/u/cavatica/openpbta or via download script in the https://github.com/AlexsLemonade/OpenPBTA-analysis repository. Summary data are visible within PedcBioPortal at https://pedcbioportal.kidsfirstdrc.org/study/summary?id=openpbta. Associated DOIs are listed in the Key Resources Table. Data underlying manuscript figures are available on Zenodo [@doi:10.5281/zenodo.7805408].

All original code was developed within the following repositories and is publicly available as follows. Primary data analyses can be found at https://github.com/d3b-center/OpenPBTA-workflows. Downstream data analyses can be found at https://github.com/AlexsLemonade/OpenPBTA-analysis. Manuscript code can be found at https://github.com/AlexsLemonade/OpenPBTA-manuscript. Associated DOIs are listed in the Key Resources Table. Software versions are documented in Table S5 as an appendix to the Key Resources Table.

Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Data releases

We maintained a data release folder on Amazon S3, downloadable directly from S3 or our open-access CAVATICA project, with merged files for each analysis (See Data and code availability section). As we produced new results (e.g., tumor mutation burden calculations) that we expected to be used across multiple analyses, or identified data issues, we created new data releases in a versioned manner. We reran all manuscript-specific analysis modules with the latest data release (v23) prior to submission and subsequently created a GitHub repository-tagged release to ensure reproducibility.

EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS

The Pediatric Brain Tumor Atlas specimens are comprised of samples from Children's Brain Tumor Network (CBTN) [@doi:10.1016/j.neo.2022.100846] and the Pediatric Pacific Neuro-Oncology Consortium (PNOC). The CBTN is a collaborative, multi-institutional (32 institutions worldwide) research program dedicated to the study of childhood brain tumors. PNOC is an international consortium dedicated to bringing new therapies to children and young adults with brain tumors. We also include blood and tumor biospecimens from newly-diagnosed diffuse intrinsic pontine glioma (DIPG) patients as part of the PNOC003 clinical trial PNOC003/NCT02274987 [@doi:10.1002/ijc.32258].

Model generation

Previously, CBTN-generated cell lines were derived from either fresh tumor tissue directly obtained from surgery performed at Children’s Hospital of Philadelphia (CHOP) or from prospectively collected tumor specimens stored in Recover Cell Culture Freezing medium (cat# 12648010, Gibco). Tumor tissue was dissociated using enzymatic method with papain as described [@doi:10.1093/neuonc/noz192]. Briefly, we washed tissue with HBSS (cat# 14175095, Gibco), and tissue was minced and incubated with activated papain solution (cat# LS003124, SciQuest) for up to 45 minutes. Ovomucoid solution (cat# 542000, SciQuest) was used to inactivate the papain, tissue was briefly treated tissue with DNase (cat# 10104159001, Roche) and passed through a 100μm cell strainer (cat# 542000, Greiner Bio-One). Two cell culture conditions were initiated based on the number of cells available. For cultures utilizing the fetal bovine serum (FBS), cells were plated a minimum density of 3×105 cells/mL in DMEM/F-12 medium (cat# D8062, Sigma) supplemented with 20% FBS (cat# SH30910.03, Hyclone), 1% GlutaMAX (cat# 35050061, Gibco), Penicillin/Streptomycin-Amphotericin B Mixture (cat# 17-745E, Lonza), and 0.2% Normocin (cat# ant-nr-2, Invivogen). For serum-free media conditions, cells were plated at minimum density of 1×106 cells/mL in DMEM/F12 medium supplemented with 1% GlutaMAX, 1X B-27 supplement minus vitamin A (cat# 12587-010, Gibco), 1x N-2 supplement (cat# 17502001, Gibco), 20 ng/ml epidermal growth factor (cat# PHG0311L, Gibco), 20 ng/mL basic fibroblast growth factor (cat# 100-18B, PeproTech), 2.5μg/mL heparin (cat# H3149, Sigma), Penicillin/Streptomycin-Amphotericin B Mixture, and 0.2% Normocin. All cell lines used for nucleic acid extraction were confirmed to be mycoplasma-free. Guardian Forensic Sciences performed GenePrint 24 (cat# B1870, Promega), short tandem repeat (STR) analysis on cell line extracted DNA to both confirm identity and that they were free of cross-contamination. Additionally, we performed NGSCheckMate [@doi:10.1093/nar/gkx193] on matched DNA and RNA cell line (tumor) and peripheral blood (normal) CRAM files to further confirm identity.

METHOD DETAILS

Nucleic acids extraction and library preparation

PNOC samples

The Translational Genomic Research Institute (TGEN; Phoenix, AZ) performed DNA and RNA extractions on tumor biopsies using a DNA/RNA AllPrep Kit (Qiagen, #80204). All RNA used for library prep had a minimum RIN of seven, but no QC thresholds were implemented for the DNA. For library preparation, 500 ng of nucleic acids were used as input for RNA-Seq, WXS, and targeted DNA panel (panel) sequencing. RNA library preparation was performed using the TruSeq RNA Sample Prep Kit (Illumina, #FC-122-1001) with poly-A selection, and the exome prep was performed using KAPA Library Preparation Kit (Roche, #KK8201) using Agilent's SureSelect Human All Exon V5 backbone with custom probes. The targeted DNA panel developed by Ashion Analytics (formerly known as the GEM Cancer panel) consisted of exonic probes against 541 cancer genes. Both panel and WXS assays contained 44,000 probes across evenly spaced genomic loci used for genome-wide copy number analysis. For the panel, additional probes tiled across intronic regions of 22 known tumor suppressor genes and 22 genes involved in common cancer translocations for structural analysis. All extractions and library preparations were performed according to manufacturer's instructions.

CBTN samples

Blood, tissue, and cell line DNA/RNA extractions were performed at the Biorepository Core at CHOP. Briefly, 10-20 mg frozen tissue, 0.4-1ml of blood, or 2e6 cells pellet was used for extractions. Tissues were lysed using a Qiagen TissueLyser II (Qiagen) with 2×30 sec at 18Hz settings using 5 mm steel beads (cat# 69989, Qiagen). Both tissue and cell pellets processes included a CHCl3 extraction and were run on the QIACube automated platform (Qiagen) using the AllPrep DNA/RNA/miRNA Universal kit (cat# 80224, Qiagen). Blood was thawed and treated with RNase A (cat#, 19101, Qiagen); 0.4-1ml was processed using the Qiagen QIAsymphony automated platform (Qiagen) using the QIAsymphony DSP DNA Midi Kit (cat# 937255, Qiagen). DNA and RNA quantity and quality was assessed by PerkinElmer DropletQuant UV-VIS spectrophotometer (PerkinElmer) and an Agilent 4200 TapeStation (Agilent, USA) for RIN and DIN (RNA Integrity Number and DNA Integrity Number, respectively). The NantHealth Sequencing Center, BGI at CHOP, or the Genomic Clinical Core at Sidra Medical and Research Center performed library preparation and sequencing. BGI at CHOP and Sidra Medical and Research Center used in house, center-specific workflows for sample preparation. At NantHealth Sequencing Center, DNA sequencing libraries were prepared for tumor and matched-normal DNA using the KAPA HyperPrep kit (cat# 08098107702, Roche), and tumor RNA-Seq libraries were prepared using KAPA Stranded RNA-Seq with RiboErase kit (cat# 07962304001, Roche).

Data generation

NantHealth and Sidra performed 2x150 bp WGS on paired tumor (~60X) and constitutive DNA (~30X) samples on an Illumina X/400. BGI at CHOP performed 2x100 bp WGS sequenced at 60X depth for both tumor and normal samples. NantHealth performed ribosomal-depleted whole transcriptome stranded RNA-Seq to an average depth of 200M. BGI at CHOP performed poly-A or ribosomal-depleted whole transcriptome stranded RNA-Seq to an average depth of 100M. The Translational Genomic Research Institute (TGEN; Phoenix, AZ) performed paired tumor (~200X) and constitutive whole exome sequencing (WXS) or targeted DNA panel (panel) and poly-A selected RNA-Seq (~200M reads) for PNOC tumor samples. The panel tumor sample was sequenced to 470X, and the normal panel sample was sequenced to 308X. PNOC 2x100 bp WXS and RNA-Seq libraries were sequenced on an Illumina HiSeq 2500.

DNA WGS Alignment

We used BWA-MEM [@arxiv:1303.3997] to align paired-end DNA-seq reads to the version 38 patch release 12 of the Homo sapiens genome reference, obtained as a FASTA file from UCSC (see Key Resources Table). Next, we used the Broad Institute's Best Practices [@doi:10.1038/ng.806] to process Binary Alignment/Map files (BAMs) in preparation for variant discovery. We marked duplicates using SAMBLASTER [@doi:10/f6kft3], and we merged and sorted BAMs using Sambamba [@doi:10/gfzsfw] We used the BaseRecalibrator submodule of the Broad's Genome Analysis Tool Kit GATK [@doi:10.1101/gr.107524.110] to process BAM files. Lastly, for normal/germline input, we used the GATK HaplotypeCaller [@biorxiv:10.1101/201178] submodule on the recalibrated BAM to generate a genomic variant call format (GVCF) file. This file is used as the basis for germline calling, described in the SNV calling for B-allele Frequency (BAF) generation section.

We obtained references from the Broad Genome References on AWS bucket with a general description of references at https://s3.amazonaws.com/broad-references/broad-references-readme.html.

Quality Control of Sequencing Data

To confirm sample matches and remove mis-matched samples from the dataset, we performed NGSCheckMate [@doi:10.1093/nar/gkx193] on matched tumor/normal CRAM files. Briefly, we processed CRAMs using BCFtools to filter and call 20k common single nucleotide polymorphisms (SNPs) using default parameters. We used the resulting VCFs to run NGSCheckMate. Per NGSCheckMate author recommendations, we used <= 0.61 as a correlation coefficient cutoff at sequencing depths > 10 to predict mis-matched samples. We determined RNA-Seq read strandedness by running the infer_experiment.py script from RNA-SeQC [@doi:10.1093/bioinformatics/bts196] on the first 200k mapped reads. We removed any samples whose calculated strandedness did not match strandedness information provided by the sequencing center. We required that at least 60% of RNA-Seq reads mapped to the human reference for samples to be included in analysis. During OpenPBTA analysis, we identified some samples which were mis-identified or potentially swapped. Through collaborative analyses and pathology review, these samples were removed from our data releases and from the Kids First portal. Sample removal and associated justifications were documented in the OpenPBTA data release notes.

Germline Variant Calling

SNP calling for B-allele Frequency (BAF) generation

We performed germline haplotype calls using the GATK Joint Genotyping Workflow on individual GVCFs from the normal sample alignment workflow. Using only SNPs, we applied the GATK generic hard filter suggestions to the VCF, with an additional requirement of 10 reads minimum depth per SNP. We used the filtered VCF as input to Control-FREEC and CNVkit (below) to generate B-allele frequency (BAF) files. This single-sample workflow is available in the D3b GitHub repository. References can be obtained from the Broad Genome References on AWS bucket, and a general description of references can be found at https://s3.amazonaws.com/broad-references/broad-references-readme.html.

Assessment of germline variant pathogenicity

For patients with hypermutant samples, we first added population frequency of germline variants using ANNOVAR [@doi:10.1093/nar/gkq603] and pathogenicity scoring from ClinVar [@doi:10.1093/nar/gkt1113] using SnpSift [@doi:10.3389/fgene.2012.00035]. We then filtered for variants with read depth >= 15, variant allele fraction >= 0.20, and which were observed at < 0.1% allele frequency across each population in the Genome Aggregation Database (see Key Resources Table). Finally, we retained variants in genes included in the KEGG MMR gene set (see Key Resources Table), POLE, and/or TP53 which were ClinVar-annotated as pathogenic (P) or likely pathogenic (LP) with review status of >= 2 stars. All P/LP variants were manually reviewed by an interdisciplinary team of scientists, clinicians, and genetic counselors. This workflow is available in the D3b GitHub repository.

Somatic Mutation Calling

SNV and indel calling

We used four variant callers to call SNVs and indels from paired tumor/normal samples with Targeted Panel, WXS, and/or WGS data: Strelka2 [@doi:10.1038/s41592-018-0051-x], Mutect2 [@biorxiv:10.1101/861054], Lancet [@doi:10.1038/s42003-018-0023-9], and VarDictJava [@doi:10.1093/nar/gkw227]. VarDictJava-only calls were not retained since ~ 39M calls with low VAF were uniquely called and may be potential false positives. (~1.2M calls were called by Mutect2, Strelka2, and Lancet and included consensus CNV calling as described below.) We used only Strelka2, Mutect2 and Lancet to analyze WXS samples from TCGA. TCGA samples were captured using various WXS target capture kits and we downloaded the BED files from the GDC portal. The manufacturers provided the input interval BED files for both panel and WXS data for PBTA samples. We padded all panel and WXS BED files were by 100 bp on each side for Strelka2, Mutect2, and VarDictJava runs and by 400 bp for the Lancet run. For WGS calling, we utilized the non-padded BROAD Institute interval calling list wgs_calling_regions.hg38.interval_list, comprised of the full genome minus N bases, unless otherwise noted below. We ran Strelka2 [@doi:10/gdwrp4] using default parameters for canonical chromosomes (chr1-22, X,Y,M), as recommended by the authors, and we filtered the final Strelka2 VCF for PASS variants. We ran Mutect2 from GATK according to Broad best practices outlined from their Workflow Description Language (WDL), and we filtered the final Mutect2 VCF for PASS variants. To manage memory issues, we ran VarDictJava [@doi:10.1093/nar/gkw227] using 20 Kb interval chunks of the input BED, padded by 100 bp on each side, such that if an indel occurred in between intervals, it would be captured. Parameters and filtering followed BCBIO standards except that variants with a variant allele frequency (VAF) >= 0.05 (instead of >= 0.10) were retained. The 0.05 VAF increased the true positive rate for indels and decreased the false positive rate for SNVs when using VarDictJava in consensus calling. We filtered the final VarDictJava VCF for PASS variants with TYPE=StronglySomatic. We ran Lancet using default parameters, except for those noted below. For input intervals to Lancet WGS, we created a reference BED from only the UTR, exome, and start/stop codon features of the GENCODE 31 reference, augmented as recommended with PASS variant calls from Strelka2 and Mutect2. We then padded these intervals by 300 bp on each side during Lancet variant calling. Per recommendations for WGS samples, we augmented the Lancet input intervals described above with PASS variant calls from Strelka2 and Mutect2 as validation [@doi:10.1038/s41598-019-55636-3].

VCF annotation and MAF creation

We normalized INDELs with bcftools norm on all PASS VCFs using the kfdrc_annot_vcf_sub_wf.cwl subworkflow, release v3 (See Table S5). The Ensembl Variant Effect Predictor (VEP) [@doi:10/gdz75c], reference release 93, was used to annotate variants and bcftools was used to add population allele frequency (AF) from gnomAD [@doi:10.1038/s41586-020-2308-7]. We annotated SNV and INDEL hotspots from v2 of Memorial Sloan Kettering Cancer Center's (MSKCC) database (See Key Resources Table) as well as the TERT promoter mutations C228T and C250T [@doi:10.3390/ijms21176034]. We annotated SNVs by matching amino acid position (Protein_position column in MAF file) with SNVs in the MSKCC database, we matched splice sites to HGVSp_Short values in the MSKCC database, and we matched INDELs based on amino acid present within the range of INDEL hotspots values in the MSKCC database. We removed non-hotspot annotated variants with a normal depth less than or equal to 7 and/or gnomAD allele frequency (AF) greater than 0.001 as potential germline variants. We matched TERT promoter mutations using hg38 coordinates as indicated in ref. [@doi:10.3390/ijms21176034]: C228T occurs at 5:1295113 is annotated as existing variant s1242535815, COSM1716563, or COSM1716558, and is 66 bp away from the TSS; C250T occurs at Chr5:1295135, is annotated as existing variant COSM1716559, and is 88 bp away from the TSS. We retained variants annotated as PASS or HotSpotAllele=1 in the final set, and we created MAFs using MSKCC's vcf2maf tool.

Gather SNV and INDEL Hotspots

We retained all variant calls from Strelka2, Mutect2, or Lancet that overlapped with an SNV or INDEL hotspot in a hotspot-specific MAF file, which we then used for select analyses as described below.

Consensus SNV Calling

Our SNV calling process led to separate sets of predicted mutations for each caller. We considered mutations to describe the same change if they were identical for the following MAF fields: Chromosome, Start_Position, Reference_Allele, Allele, and Tumor_Sample_Barcode. Strelka2 does not call multinucleotide variants (MNV), but instead calls each component SNV as a separate mutation, so we separated MNV calls from Mutect2 and Lancet into consecutive SNVs before comparing them to Strelka2 calls. We examined VAFs produced by each caller and compared their overlap with each other (Figure S2). VarDictJava calls included many variants that were not identified by other callers (Figure S2C), while the other callers produced results that were relatively consistent with one another. Many of these VarDictJava-specific calls were variants with low allele frequency (Figure S2B). We therefore derived consensus mutation calls as those shared among the other three callers (Strelka2, Mutect2, and Lancet), and we did not further consider VarDictJava calls due to concerns it called a large number of false positives. This decision had minimal impact on results because VarDictJava also identified nearly every mutation that the other three callers identified, in addition to many unique mutations.

Somatic Copy Number Variant Calling (WGS samples only)

We used Control-FREEC [@doi:10/ckt4vz; @doi:10/c6bcps] and CNVkit [@doi:10.1371/journal.pcbi.1004873] for copy number variant calls. For both algorithms, the germline_sex_estimate (described below) was used as input for sample sex and germline variant calls (above) were used as input for BAF estimation. Control-FREEC was run on human genome reference hg38 using the optional parameters of a 0.05 coefficient of variation, ploidy choice of 2-4, and BAF adjustment for tumor-normal pairs. Theta2 [@doi:10.1093/bioinformatics/btu651] used VarDictJava germline and somatic calls, filtered on PASS and strongly somatic, to infer tumor purity. Theta2 purity was added as an optional parameter to CNVkit to adjust copy number calls. CNVkit was run on human genome reference hg38 using the optional parameters of Theta2 purity and BAF adjustment for tumor-normal pairs. We used GISTIC [@doi:10.1186/gb-2011-12-4-r41] on the CNVkit and the consensus CNV segmentation files to generate gene-level copy number abundance (Log R Ratio) as well as chromosomal arm copy number alterations using the parameters specified in the (run-gistic analysis module in the OpenPBTA Analysis repository).

Consensus CNV Calling

For each caller and sample, we called CNVs based on consensus among Control-FREEC [@doi:10/ckt4vz; @doi:10/c6bcps], CNVkit [@doi:10.1371/journal.pcbi.1004873], and Manta [@doi:10/gf3ggb]. We specifically included CNVs called significant by Control-FREEC (p-value < 0.01) and Manta calls that passed all filters in consensus calling. We removed sample and consensus caller files with more than 2,500 CNVs because we expected these to be noisy and derive poor quality samples based on cutoffs used in GISTIC [@doi:10.1186/gb-2011-12-4-r41]. For each sample, we included the regions in the final consensus set: 1) regions with reciprocal overlap of 50% or more between at least two of the callers; 2) smaller CNV regions in which more than 90% of regions are covered by another caller. We did not include any copy number alteration called by a single algorithm in the consensus file. We defined copy number as NA for any regions that had a neutral call for the samples included in the consensus file. We merged CNV regions within 10,000 bp of each other with the same direction of gain or loss into single region. We filtered out any CNVs that overlapped 50% or more with immunoglobulin, telomeric, centromeric, segment duplicated regions, or that were shorter than 3000 bp.

Somatic Structural Variant Calling (WGS samples only)

We used Manta [@doi:10/gf3ggb] for structural variant (SV) calls, and we limited to regions used in Strelka2. The hg38 reference for SV calling used was limited to canonical chromosome regions. We used AnnotSV [@doi:10.1093/bioinformatics/bty304] to annotate Manta output. All associated workflows are available in the workflows GitHub repository.

Gene Expression

Abundance Estimation

We used STAR [@doi:10/f4h523] to align paired-end RNA-seq reads, and we used the associated alignment for all subsequent RNA analysis. We used Ensembl GENCODE 27 "Comprehensive gene annotation" (see Key Resources Table) as a reference. We used RSEM [@doi:10/cwg8n5] for both FPKM and TPM transcript- and gene-level quantification.

Gene Expression Matrices with Unique HUGO Symbols

To enable downstream analyses, we next identified gene symbols that map to multiple Ensembl gene identifiers (in GENCODE v27, 212 gene symbols map to 1866 Ensembl gene identifiers), known as multi-mapped gene symbols, and ensured unique mappings (collapse-rnaseq analysis module in the OpenPBTA Analysis repository). To this end, we first removed genes with no expression from the RSEM abundance data by requiring an FPKM > 0 in at least 1 sample across the PBTA cohort. We computed the mean FPKM across all samples per gene. For each multi-mapped gene symbol, we chose the Ensembl identifier corresponding to the maximum mean FPKM, using the assumption that the gene identifier with the highest expression best represented the expression of the gene. After collapsing gene identifiers, 46,400 uniquely-expressed genes remained in the poly-A dataset, and 53,011 uniquely-expressed genes remained in the stranded dataset.

Gene fusion detection

We set up Arriba [@doi:10.1101/gr.257246.119] and STAR-Fusion [@biorxiv:10.1101/120295] fusion detection tools using CWL on CAVATICA. For both of these tools, we used aligned BAM and chimeric SAM files from STAR as inputs and GRCh38_gencode_v27 GTF for gene annotation. We ran STAR-Fusion with default parameters and annotated all fusion calls with the GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz file from the STAR-Fusion release. For Arriba, we used a blacklist file blacklist_hg38_GRCh38_2018-11-04.tsv.gz from the Arriba release to remove recurrent fusion artifacts and transcripts present in healthy tissue. We provided Arriba with strandedness information for stranded samples, or we set it to auto-detection for poly-A samples. We used FusionAnnotator on Arriba fusion calls to harmonize annotations with those of STAR-Fusion. The RNA expression and fusion workflows can be found in the D3b GitHub repository. The FusionAnnotator workflow we used for this analysis can be found in the D3b GitHub repository.

QUANTIFICATION AND STATISTICAL ANALYSIS

All p-values are two-sided unless otherwise stated. Z-scores were calculated using the formula $z=(x –\mu)/\sigma$ where $x$ is the value of interest, $\mu$ is the mean, and $\sigma$ is the standard deviation.

Tumor purity (tumor-purity-exploration module)

Estimating tumor fraction from RNA directly is challenging because most assume tumor cells comprise all non-immune cells[@doi:10.1038/ncomms3612], which is not a valid assumption for many diagnoses in the PBTA cohort. We therefore used Theta2 (as described in the "Somatic Copy Number Variant Calling section" Methods section) to infer tumor purity from WGS samples, further assuming that co-extracted RNA and DNA samples had the same tumor purity. We then created a set of stranded RNA-Seq data thresholded by median tumor purity of the cancer group to rerun selected transcriptomic analyses: telomerase-activity-prediction, tp53_nf1_score, transcriptomic-dimension-reduction, immune-deconv, and gene-set-enrichment-analysis. Note that these thresholded analyses, which only considered stranded RNA samples that also had co-extracted DNA, were performed in their respective OpenPBTA analyses modules (not within tumor-purity-exploration).

Recurrently mutated genes and co-occurrence of gene mutations (interaction-plots analysis module)

Using the consensus SNV calls, we identified genes that were recurrently mutated in the OpenPBTA cohort, including nonsynonymous mutations with a VAF > 5% among the set of independent samples. We used VEP [@doi:10/gdz75c] annotations, including "High" and "Moderate" consequence types as defined in the R package Maftools [@doi:10.1101/gr.239244.118], to determine the set of nonsynonymous mutations. For each gene, we then tallied the number of samples that had at least one nonsynonymous mutation.

For genes that contained nonsynonymous mutations in multiple samples, we calculated pairwise mutation co-occurrence scores. This score was defined as $I(-\log_{10}(P))$ where $I$ is 1 when the odds ratio is > 1 (indicating co-occurrence), and -1 when the odds ratio is < 1 (indicating mutual exclusivity), with $P$ defined by Fisher's Exact Test.

Focal Copy Number Calling (focal-cn-file-preparation analysis module)

We added the ploidy inferred via Control-FREEC to the consensus CNV segmentation file and used the ploidy and copy number values to define gain and loss values broadly at the chromosome level. We used bedtools coverage [@doi:10.1093/bioinformatics/btq033] to add cytoband status using the UCSC cytoband file [@doi:10.1093/nar/gks1048] (See Key Resources Table). The output status call fractions, which are values of the loss, gain, and callable fractions of each cytoband region, were used to define dominant status at the cytoband-level. We calculated the weighted means of each status call fraction using band length. We used the weighted means to define the dominant status at the chromosome arm-level.

A status was considered dominant if more than half of the region was callable and the status call fraction was greater than 0.9 for that region. We adopted this 0.9 threshold to ensure that the dominant status fraction call was greater than the remaining status fraction calls in a region.

We aimed to define focal copy number units to avoid calling adjacent genes in the same cytoband or arm as copy number losses or gains where it would be more appropriate to call the broader region a loss or gain. To determine the most focal units, we first considered the dominant status calls at the chromosome arm-level. If the chromosome arm dominant status was callable but not clearly defined as a gain or loss, we instead included the cytoband-level status call. Similarly, if a cytoband dominant status call was callable but not clearly defined as a gain or loss, we instead included gene-level status call. To obtain the gene-level data, we used the IRanges package in R [@doi:10.1371/journal.pcbi.1003118] to find overlaps between the segments in the consensus CNV file and the exons in the GENCODE v27 annotation file (See Key Resources Table) . If the copy number value was 0, we set the status to "deep deletion". For autosomes only, we set the status to "amplification" when the copy number value was greater than two times the ploidy value. We plotted genome-wide gains and losses in (Figure {@fig:S3}C) using the R package ComplexHeatmap [@doi:10.1093/bioinformatics/btw313].

Breakpoint Density (WGS samples only; chromosomal-instability analysis module)

We defined breakpoint density as the number of breaks per genome or exome per sample. For Manta SV calls, we filtered to retain "PASS" variants and used breakpoints from the algorithm. For consensus CNV calls, if |log2 ratio| > log2(1), we annotated the segment as a break. We then calculated breakpoint density as:

$$\textrm{breakpoint density} = \frac{\textrm{N breaks}}{\textrm{Size in Mb of }\textit{effectively surveyed} \textrm{ genome}}$$

Chromothripsis Analysis (WGS samples only; chromothripsis analysis module)

Considering only chromosomes 1-22 and X, we identified candidate chromothripsis regions in the set of independent tumor WGS samples with ShatterSeek [@doi:10.1038/s41588-019-0576-7], using Manta SV calls that passed all filters and consensus CNV calls. We modified the consensus CNV data to fit ShatterSeek input requirements as follows: we set CNV-neutral or excluded regions as the respective sample’s ploidy value from Control-FREEC, and we then merged consecutive segments with the same copy number value. We classified candidate chromothripsis regions as high- or low-confidence using the statistical criteria described by the ShatterSeek authors.

Immune Profiling and Deconvolution (immune-deconv analysis module)

We used the R package immunedeconv [@pubmed:31510660] with the method quanTIseq [@doi:10.1186/s13073-019-0638-6] to deconvolute various immune cell types in tumors using collapsed FPKM RNA-seq, with samples batched by library type and then combined. The quanTIseq deconvolution method directly estimates absolute fractions of 10 immune cell types that represent inferred proportions of the cell types in the mixture. Therefore, we utilized quanTIseq for inter-sample, intra-sample, and inter-histology score comparisons.

Gene Set Variation Analysis (gene-set-enrichment-analysis analysis module)

We performed Gene Set Variation Analysis (GSVA) on collapsed, log2-transformed RSEM FPKM data for stranded RNA-Seq samples using the GSVA Bioconductor package [@doi:10.1186/1471-2105-14-7]. We specified the parameter mx.diff=TRUE to obtain Gaussian-distributed scores for each of the MSigDB hallmark gene sets [@doi:10.1016/j.cels.2015.12.004]. We compared GSVA scores among histology groups using ANOVA and subsequent Tukey tests; p-values were Bonferroni-corrected for multiple hypothesis testing. We plotted scores by cancer group using the ComplexHeatmap R package (Figure {@fig:Fig5}B) [@doi:10.1093/bioinformatics/btw313].

Transcriptomic Dimension Reduction (transcriptomic-dimension-reduction analysis module)

We applied Uniform Manifold Approximation and Projection (UMAP) [@https://doi.org/10.48550/arXiv.1802.03426] to log2-transformed FPKM data for stranded RNA-Seq samples using the umap R package (See Key Resources Table). We considered all stranded RNA-Seq samples for this analysis, but we removed genes whose FPKM sum across samples was less than 100. We set the UMAP number of neighbors parameter to 15.

Fusion prioritization (fusion_filtering analysis module)

We performed artifact filtering and additional annotation on fusion calls to prioritize putative oncogenic fusions. Briefly, we considered all in-frame and frameshift fusion calls with at least one junction read and at least one gene partner expressed (TPM > 1) to be true calls. If a fusion call had a large number of spanning fragment reads compared to junction reads (spanning fragment minus junction read greater than ten), we removed these calls as potential false positives. We prioritized a union of fusion calls as true calls if the fused genes were detected by both callers, the same fusion was recurrent within a broad histology grouping (> 2 samples), or the fusion was specific to the given broad histology. If either 5' or 3' genes fused to more than five different genes within a sample, we removed these calls as potential false positives. We annotated putative driver fusions and prioritized fusions based on partners containing known kinases, oncogenes, tumor suppressors, curated transcription factors [@doi:10.1016/j.cell.2018.01.029], COSMIC genes, and/or known TCGA fusions from curated references. Based on pediatric cancer literature review, we added MYBL1 [@doi:10.1073/pnas.1300252110], SNCAIP [@doi:10.1038/nature11327], FOXR2 [@doi:10.1016/j.cell.2016.01.015], TTYH1 [@doi:10.1038/ng.2849], and TERT [@doi:10.1038/ng.3438; @doi:10.1002/gcc.22110; @doi:10.1016/j.canlet.2014.11.057; @doi:10.1007/s11910-017-0722-5] to the oncogene list, and we added BCOR [@doi:10.1016/j.cell.2016.01.015] and QKI [@doi:10.1038/ng.3500] to the tumor suppressor gene list.

Oncoprint figure generation (oncoprint-landscape analysis module)

We used Maftools [@doi:10.1101/gr.239244.118] to generate oncoprints depicting the frequencies of canonical somatic gene mutations, CNVs, and fusions for the top 20 genes mutated across primary tumors within broad histologies of the OpenPBTA dataset. We collated canonical genes from the literature for low-grade gliomas (LGGs) [@doi:10.1186/s40478-020-00902-z], embryonal tumors [@doi:10.1038/nature22973; @doi:10.1007/s00401-020-02182-2; @doi:10.1186/s40478-020-00984-9; @doi:10.1016/j.ccell.2016.02.001; @doi:10.1038/s41598-020-59812-8], high-grade gliomas (HGGs) [@doi:10.1016/j.ccell.2017.08.017; @doi:10.1002/ijc.32258; @doi:10.1093/neuonc/noab106; @doi:10.1186/s40478-020-00905-w], and other tumors: ependymomas, craniopharyngiomas, neuronal-glial mixed tumors, histiocytic tumors, chordoma, meningioma, and choroid plexus tumors [@pubmed:28623522; @doi:10.1016/j.ccell.2015.04.002; @doi:10.1038/nature13109; @doi:10.1038/s41525-017-0014-7; @doi:10.3171/2019.8.JNS191266; @doi:10.1007/s00401-016-1539-z; @doi:10.1093/neuonc/noaa267; @pubmed:12466115; @doi:10.1016/j.jaad.2017.05.059; @doi:10.1186/s40478-020-01056-8].

Mutational Signatures (mutational-signatures analysis module)

We obtained weights (i.e., exposures) for signature sets using the deconstructSigs R package function whichSignatures() [@doi:10.1186/s13059-016-0893-4] from consensus SNVs with the BSgenome.Hsapiens.UCSC.hg38 annotations (see Key Resources Table). Specifically, we estimated signature weights across samples for eight signatures previously identified in the Signal reference set of signatures ("RefSig") as associated with adult central nervous system (CNS) tumors [@doi:10.1038/s43018-020-0027-5]. These eight RefSig signatures are 1, 3, 8, 11, 18, 19, N6, and MMR2. Weights for signatures fall in the range zero to one inclusive. deconstructSigs estimates the weights for each signature across samples and allows for a proportion of unassigned weights referred to as "Other" in the text. These results do not include signatures with small contributions; deconstructSigs drops signature weights that are less than 6% [@doi:10.1186/s13059-016-0893-4]. We plotted mutational signatures for patients with hypermutant tumors (Figure {@fig:Fig4}E) using the R package ComplexHeatmap [@doi:10.1093/bioinformatics/btw313].

Tumor Mutation Burden (snv-callers analysis module)

We consider tumor mutation burden (TMB) to be the number of consensus SNVs per effectively surveyed base of the genome. We considered base pairs to be effectively surveyed if they were in the intersection of the genomic ranges considered by the callers used to generate the consensus and where appropriate, regions of interest, such as coding sequences. We calculated TMB as:

$$\textrm{TMB} = \frac{\textrm{# of coding sequence SNVs}}{\textrm{Size in Mb of }\textit{effectively surveyed} \textrm{ genome} }$$

We used the total number coding sequence consensus SNVs for the numerator and the size of the intersection of the regions considered by Strelka2 and Mutect2 with coding regions (CDS from GENCODE v27 annotation, see Key Resources Table) as the denominator.

Clinical Data Harmonization

WHO Classification of Disease Types

Table S1 contains a README, along with sample technical, clinical, and additional metadata used for this study.

Molecular Subtyping

We performed molecular subtyping on tumors in the OpenPBTA to the extent possible. The molecular_subtype field in pbta-histologies.tsv contains molecular subtypes for tumor types selected from pathology_diagnosis and pathology_free_text_diagnosis fields as described below, following World Health Organization 2016 classification criteria [@doi:10.1007/s00401-016-1545-1]. We further categorized broad tumor histologies into smaller groupings we denote "cancer groups."

Medulloblastoma (MB) subtypes SHH, WNT, Group 3, and Group 4 were predicted using the consensus of two RNA expression classifiers: MedulloClassifier [@doi:10.1371/journal.pcbi.1008263] and MM2S [@doi:10.1186/s13029-016-0053-y] on the RSEM FPKM data (molecular-subtyping-MB analysis module). The 43 "true positive" subtypes were manually curated from pathology reports by two independent reviewers.

High-grade glioma (HGG) subtypes were derived (molecular-subtyping-HGG analysis module) using the following criteria:

  1. If any sample contained an H3F3A p.K28M, HIST1H3B p.K28M, HIST1H3C p.K28M, or HIST2H3C p.K28M mutation and no BRAF p.V600E mutation, it was subtyped as DMG, H3 K28.
  2. If any sample contained an HIST1H3B p.K28M, HIST1H3C p.K28M, or HIST2H3C p.K28M mutation and a BRAF p.V600E mutation, it was subtyped as DMG, H3 K28, BRAF V600E.
  3. If any sample contained an H3F3A p.G35V or p.G35R mutation, it was subtyped asHGG, H3 G35.
  4. If any high-grade glioma sample contained an IDH1 p.R132 mutation, it was subtyped as HGG, IDH.
  5. If a sample was initially classified as HGG, had no defining histone mutations, and a BRAF p.V600E mutation, it was subtyped as BRAF V600E.
  6. All other high-grade glioma samples that did not meet any of these criteria were subtyped as HGG, H3 wildtype.

Embryonal tumors were included in non-MB and non-ATRT embryonal tumor subtyping (molecular-subtyping-embryonal analysis module) if they met any of the following criteria:

  1. A TTYH1 (5' partner) fusion was detected.
  2. A MN1 (5' partner) fusion was detected, with the exception of MN1::PATZ1 since it is an entity separate of CNS HGNET-MN1 tumors [@doi:10.1111/nan.12626].
  3. Pathology diagnoses included "Supratentorial or Spinal Cord PNET" or "Embryonal Tumor with Multilayered Rosettes".
  4. A pathology diagnosis of "Neuroblastoma", where the tumor was not indicated to be peripheral or metastatic and was located in the CNS.
  5. Any sample with "embryonal tumor with multilayer rosettes, ros (who grade iv)", "embryonal tumor, nos, congenital type", "ependymoblastoma" or "medulloepithelioma" in pathology free text.

Non-MB and non-ATRT embryonal tumors identified with the above criteria were further subtyped (molecular-subtyping-embryonal analysis module) using the criteria below [@pubmed:30249036; @doi:10.1007/s00381-017-3551-6; @pubmed:26389418; @doi:10.3390/ijms21051818].

  1. Any RNA-seq biospecimen with LIN28A overexpression, plus a TYH1 fusion (5' partner) with a gene adjacent or within the C19MC miRNA cluster and/or copy number amplification of the C19MC region was subtyped as ETMR, C19MC-altered (Embryonal tumor with multilayer rosettes, chromosome 19 miRNA cluster altered) [@doi:10.1007/s00401-012-1068-3; @doi:10.1038/ng.2849].
  2. Any RNA-seq biospecimen with LIN28A overexpression, a TTYH1 fusion (5' partner) with a gene adjacent or within the C19MC miRNA cluster but no evidence of copy number amplification of the C19MC region was subtyped as ETMR, NOS (Embryonal tumor with multilayer rosettes, not otherwise specified) [@doi:10.1007/s00401-012-1068-3; @doi:10.1038/ng.2849].
  3. Any RNA-seq biospecimen with a fusion having a 5' MN1 and 3' BEND2 or CXXC5 partner were subtyped as CNS HGNET-MN1 [Central nervous system (CNS) high-grade neuroepithelial tumor with MN1 alteration].
  4. Non-MB and non-ATRT embryonal tumors with internal tandem duplication (as defined in [@doi:10.1186/s12859-016-1031-8]) of BCOR were subtyped as CNS HGNET-BCOR (CNS high-grade neuroepithelial tumor with BCOR alteration).
  5. Non-MB and non-ATRT embryonal tumors with over-expression and/or gene fusions in FOXR2 were subtyped as CNS NB-FOXR2 (CNS neuroblastoma with FOXR2 activation).
  6. Non-MB and non-ATRT embryonal tumors with CIC::NUTM1 or other CIC fusions, were subtyped as CNS EFT-CIC (CNS Ewing sarcoma family tumor with CIC alteration) [@doi:10.1016/j.cell.2016.01.015]
  7. Non-MB and non-ATRT embryonal tumors that did not fit any of the above categories were subtyped as CNS Embryonal, NOS (CNS Embryonal tumor, not otherwise specified).

Neurocytoma subtypes central neurocytoma (CNC) and extraventricular neurocytoma (EVN) were assigned (molecular-subtyping-neurocytoma analysis module) based on the primary site of the tumor [@doi:10.1007/978-3-319-33432-5_20]. If the tumor's primary site was "ventricles," we assigned the subtype as CNC; otherwise, we assigned the subtype as EVN.

Craniopharyngiomas (CRANIO) were subtyped (molecular-subtyping-CRANIO analysis module) into adamantinomatous (CRANIO, ADAM), papillary (CRANIO, PAP) or undetermined (CRANIO, To be classified) based on the following criteria [@doi:10.3171/jns.1995.83.2.0206; @doi:10.3171/jns.1998.89.4.0547]:

  1. Craniopharyngiomas from patients over 40 years old with a BRAF p.V600E mutation were subtyped as CRANIO, PAP.
  2. Craniopharyngiomas from patients younger than 40 years old with mutations in exon 3 of CTNNB1 were subtyped as CRANIO, ADAM.
  3. Craniopharyngiomas that did not fall into the above two categories were subtyped as CRANIO, To be classified.

A molecular subtype of EWS was assigned to any tumor with a EWSR1 fusion or with a pathology_diagnosis of Ewings Sarcoma (molecular-subtyping-EWS analysis module).

LGG or glialneuronal tumors (GNT) were subtyped (molecular-subtyping-LGAT analysis module) based on SNV, fusion, and CNV status based on @doi:10.1016/j.ccell.2020.03.011 and as described below.

  1. If a sample contained a NF1 somatic mutation, either nonsense or missense, it was subtyped as LGG, NF1-somatic.
  2. If a sample contained NF1 germline mutation, as indicated by a patient having the neurofibromatosis cancer predisposition, it was subtyped as LGG, NF1-germline.
  3. If a sample contained the IDH p.R132 mutation, it was subtyped as LGG, IDH.
  4. If a sample contained a histone p.K28M mutation in either H3F3A, H3F3B, HIST1H3B, HIST1H3C, or HIST2H3C, or if it contained a p.G35R or p.G35V mutation in H3F3A, it was subtyped as LGG, H3.
  5. If a sample contained BRAF p.V600E or any other non-canonical BRAF mutations in the kinase (PK_Tyr_Ser-Thr) domain PF07714 (see Key Resources Table), it was subtyped as LGG, BRAF V600E.
  6. If a sample contained KIAA1549::BRAF fusion, it was subtyped as LGG, KIAA1549::BRAF.
  7. If a sample contained SNV or indel in either KRAS, NRAS, HRAS, MAP2K1, MAP2K2, MAP2K1, ARAF, RAF1, or non-kinase domain of BRAF, or if it contained RAF1 fusion, or BRAF fusion that was not KIAA1549::BRAF, it was subtyped as LGG, other MAPK.
  8. If a sample contained SNV in either MET, KIT or PDGFRA, or if it contained fusion in ALK, ROS1, NTRK1, NTRK2, NTRK3 or PDGFRA, it was subtyped as LGG, RTK.
  9. If a sample contained FGFR1 p.N546K, p.K656E, p.N577, or p. K687 hotspot mutations, or tyrosine kinase domain tandem duplication (See Key Resources Table), or FGFR1 or FGFR2 fusions, it was subtyped as LGG, FGFR.
  10. If a sample contained MYB or MYBL1 fusion, it was subtyped as LGG, MYB/MYBL1.
  11. If a sample contained focal CDKN2A and/or CDKN2B deletion, it was subtyped as LGG, CDKN2A/B.

For LGG tumors that did not have any of the above molecular alterations, if both RNA and DNA samples were available, it was subtyped as LGG, wildtype. Otherwise, if either RNA or DNA sample was unavailable, it was subtyped as LGG, To be classified.

If pathology diagnosis was Subependymal Giant Cell Astrocytoma (SEGA), the LGG portion of molecular subtype was recoded to SEGA.

Lastly, for all LGG- and GNT- subtyped samples, if the tumors were glialneuronal in origin, based on pathology_free_text_diagnosis entries of desmoplastic infantile,desmoplastic infantile ganglioglioma, desmoplastic infantile astrocytoma or glioneuronal, each was recoded as follows: If pathology diagnosis is Low-grade glioma/astrocytoma (WHO grade I/II) or Ganglioglioma, the LGG portion of the molecular subtype was recoded to GNT.

Ependymomas (EPN) were subtyped (molecular-subtyping-EPN analysis module) into EPN, ST RELA, EPN, ST YAP1, EPN, PF A and EPN, PF B based on evidence for these molecular subgroups as described in Pajtler et al. [@doi:10.1016/j.ccell.2015.04.002]. Briefly, fusion, CNV and gene expression data were used to subtype EPN as follows:

  1. Any tumor with fusions containing RELA as fusion partner, e.g., C11orf95::RELA, LTBP3::RELA, was subtyped as EPN, ST RELA.
  2. Any tumor with fusions containing YAP1 as fusion partner, such as C11orf95::YAP1, YAP1::MAMLD1 and YAP1::FAM118B, was subtyped as EPN, ST YAP1.
  3. Any tumor with the following molecular characterization would be subtyped as EPN, PF A:
  • CXorf67 expression z-score of over 3
  • TKTL1 expression z-score of over 3 and 1q gain
  1. Any tumor with the following molecular characterization would be subtyped as EPN, PF B:
  • GPBP17 expression z-score of over 3 and loss of 6q or 6p
  • IFT46 expression z-score of over 3 and loss of 6q or 6p

Any tumor with the above molecular characteristics would be exclusively subtyped to the designated group.

For all other remaining EPN tumors without above molecular characteristics, they would be subtyped to EPN, ST RELA and EPN, ST YAP1 in a non-exclusive way (e.g., a tumor could have both EPN, ST RELA and EPN, ST YAP1 subtypes) if any of the following alterations were present.

  1. Any tumor with the following alterations was assigned EPN, ST RELA:
  • PTEN::TAS2R1 fusion
  • chromosome 9 arm (9p or 9q) loss
  • RELA expression z-score of over 3
  • L1CAM expression z-score of over 3
  1. Any tumor with the following alterations was assigned EPN, ST YAP1:
  • C11orf95::MAML2 fusion
  • chromosome 11 short arm (11p) loss
  • chromosome 11 long arm (11q) gain
  • ARL4D expression z-score of over 3
  • CLDN1 expression z-score of over 3

After all relevant tumor samples were subtyped by the above molecular subtyping modules, the results from these modules, along with other clinical information (such as pathology diagnosis free text), were compiled in the molecular-subtyping-pathology module and integrated into the OpenPBTA data in the molecular-subtyping-integrate module.

TP53 Alteration Annotation (tp53_nf1_score analysis module)

We annotated TP53 altered HGG samples as either TP53 lost or TP53 activated and integrated this within the molecular subtype. To this end, we applied a TP53 inactivation classifier originally trained on TCGA pan-cancer data [@doi:10.1016/j.celrep.2018.03.076] to the matched RNA expression data, with samples batched by library type. Along with the TP53 classifier scores, we collectively used consensus SNV and CNV, SV, and reference databases that list TP53 hotspot mutations [@doi:10.1158/2159-8290.CD-17-0321; @doi:10.1038/nbt.3391] and functional domains [@doi:10.1038/sj.cdd.4401904] to determine TP53 alteration status for each sample. We adopted the following rules for calling either TP53 lost or TP53 activated:

  1. If a sample had either of the two well-characterized TP53 gain-of-function mutations, p.R273C or p.R248W [@doi:10.1038/ng0593-42], we assigned TP53 activated status.

  2. Samples were annotated as TP53 lost if they contained i) a TP53 hotspot mutation as defined by IARC TP53 database or the MSKCC cancer hotspots database [@doi:10.1158/2159-8290.CD-17-0321; @doi:10.1038/nbt.3391] (see also, Key Resources Table), ii) two TP53 alterations, including SNV, CNV or SV, indicative of probable bi-allelic alterations; iii) one TP53 somatic alteration, including SNV, CNV, or SV or a germline TP53 mutation indicated by the diagnosis of Li-Fraumeni syndrome (LFS) [@doi:10.1101/cshperspect.a026187], or iv) one germline TP53 mutation indicated by LFS and the TP53 classifier score for matched RNA-Seq was greater than 0.5.

Prediction of participants' genetic sex

Participant metadata included a reported gender. We used WGS germline data, in concert with the reported gender, to predict participant genetic sex so that we could identify sexually dimorphic outcomes. This analysis may also indicate samples that may have been contaminated. We used the idxstats utility from SAMtools [@pubmed:19505943] to calculate read lengths, the number of mapped reads, and the corresponding chromosomal location for reads to the X and Y chromosomes. We used the fraction of total normalized X and Y chromosome reads that were attributed to the Y chromosome as a summary statistic. We manually reviewed this statistic in the context of reported gender and determined that a threshold of less than 0.2 clearly delineated female samples. We marked fractions greater than 0.4 as predicted males, and we marked samples with values in the inclusive range 0.2-0.4 as unknown. We performed this analysis through CWL on CAVATICA. We added resulting calls to the histologies file under the column header germline_sex_estimate.

Selection of independent samples (independent-samples analysis module)

Certain analyses required that we select only a single representative specimen for each individual. In these cases, we identified a single specimen by prioritizing primary tumors and those with whole-genome sequencing available. If this filtering still resulted in multiple specimens, we randomly selected a single specimen from the remaining set.

Quantification of Telomerase Activity using Gene Expression Data (telomerase-activity-prediction analysis module)

We predicted telomerase activity of tumor samples using the recently developed EXTEND method [@doi:10.1038/s41467-020-20474-9], with samples batched by library type. Briefly, EXTEND estimates telomerase activity based on the expression of a 13-gene signature. We derived this signature by comparing telomerase-positive tumors and tumors with activated alternative lengthening of telomeres pathway, a group presumably negative of telomerase activity.

Survival models (survival-analysis analysis module)

We calculated overall survival (OS) as days since initial diagnosis and performed several survival analyses on the OpenPBTA cohort using the survival R package. We performed survival analysis for patients by HGG subtype using the Kaplan-Meier estimator [@doi:10.2307/2281868] and a log-rank test (Mantel-Cox test) [@pubmed:5910392] on the different HGG subtypes. Next, we used multivariate Cox (proportional hazards) regression analysis [@doi:10.1111/j.2517-6161.1972.tb00899.x] to model the following: a) tp53 scores + telomerase scores + extent of tumor resection + LGG group + HGG group, in which tp53 scores and telomerase scores are numeric, extent of tumor resection is categorical, and LGG group and HGG group are binary variables indicating whether the sample is in either broad histology grouping, b) tp53 scores + telomerase scores + extent of tumor resection for each cancer_group with an N>=3 deceased patients (DIPG, DMG, HGG, MB, and EPN), and c) quantiseq cell type fractions + CD274 expression + extent of tumor resection for each cancer_group with an N>=3 deceased patients (DIPG, DMG, HGG, MB, and EPN), in which quantiseq cell type fractions and CD274 expression are numeric.

KEY RESOURCES TABLE

REAGENT or RESOURCE SOURCE IDENTIFIER
Chemicals, peptides, and recombinant proteins
Recover Cell Culture Freezing media Gibco Cat# 12648010
Hank's Balanced Salt Solution (HBSS) Gibco Cat# 14175095
Papain SciQuest Cat# LS003124
Ovomucoid SciQuest Cat# 542000
DNase Roche Cat# 10104159001
RNase A Qiagen Cat# 19101
100μm cell strainer Greiner Bio-One Cat# 542000
DMEM/F-12 medium Sigma Cat# D8062
Fetal Bovine Serum (FBS) Hyclone Cat# SH30910.03
GlutaMAX Gibco Cat# 35050061
Penicillin/Streptomycin-Amphotericin B Lonza Cat# 17-745E
Normocin Invivogen Cat# ant-nr-2
B-27 supplement minus vitamin A Gibco Cat# 12587-010
N-2 supplement Gibco Cat# 17502001
Epidermal growth factor Gibco Cat# PHG0311L
Basic fibroblast growth factor PeproTech Cat# 100-18B
Heparin Sigma Cat# H3149
Critical commercial assays
GenePrint 24 STR profiling kit Promega Cat# B1870
DNA/RNA AllPrep Kit Qiagen Cat# 80204
TruSeq RNA Sample Prep Kit Illumina Cat# FC-122-1001
KAPA Library Preparation Kit Roche Cat# KK8201
AllPrep DNA/RNA/miRNA Universal kit Qiagen Cat# 80224
QIAsymphony DSP DNA Midi Kit Qiagen Cat# 937255
KAPA HyperPrep kit Roche Cat# 08098107702
RiboErase kit Roche Cat# 07962304001
Deposited data
Raw and harmonized WGS, WXS, Panel, RNA-Seq KidsFirst Data Resource Center, This project [@doi:10.24370/OPENPBTA]
Merged summary files This project https://cavatica.sbgenomics.com/u/cavatica/openpbta
Merged summary files and downstream analyses This project https://github.com/AlexsLemonade/OpenPBTA-analysis [@doi:10.5281/zenodo.7803335]
Processed data This project https://pedcbioportal.kidsfirstdrc.org/study/summary?id=openpbta
Data underlying figures and molecular alterations This project [@doi:10.5281/zenodo.7877739]
Experimental models: Cell lines
CBTN pediatric brain tumor-derived cell lines [@doi:10.1093/neuonc/noz192] See Table S1 for identifiers
Software and algorithms
Data processing and analysis software Multiple See Table S5 for identifiers
OpenPBTA workflows repository This project https://github.com/d3b-center/OpenPBTA-workflows [@doi:10.5281/zenodo.6968175]
OpenPBTA analysis repository This project https://github.com/AlexsLemonade/OpenPBTA-analysis [@doi:10.5281/zenodo.7877755]
OpenPBTA manuscript repository This project https://github.com/AlexsLemonade/OpenPBTA-manuscript
Other
TCGA WXS dataset NIH The Cancer Genome Atlas (TCGA) dbGAP phs000178.v11.p8
Cancer hotspots MSKCC https://www.cancerhotspots.org/#/download (v2)
Reference genomes Broad Institute https://s3.console.aws.amazon.com/s3/buckets/broad-references/hg38/v0/
Reference genome hg38, patch release 12 UCSC http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
Human Cytoband file UCSC http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz
CDS from GENCODE v27 annotation GENCODE https://www.gencodegenes.org/human/release_27.html
PFAM domains and locations UCSC http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/pfamDesc.txt.gz; https://pfam.xfam.org/family/PF07714
BSgenome.Hsapiens.UCSC.hg38 annotations Bioconductor https://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg38.html
gnomAD v2.1.1 (exome and genome) Genome Aggregation Database https://gnomad.broadinstitute.org/downloads#v2-liftover-variants
KEGG MMR gene set v7.5.1 Broad Institute https://www.gsea-msigdb.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_MISMATCH_REPAIR
ClinVar Database (2022-05-07) NCBI https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2022/clinvar_20220507.vcf.gz