Skip to content

Xinglab/STRIPE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 

Repository files navigation

STRIPE (Sequencing Targeted RNAs Identifies Pathogenic Events)

STRIPE is a versatile workflow that enables deep sequencing of full-length transcripts for any disease-specific gene panel such that a wide range of clinically informative readouts, including transcript aberrations and sequence variants, can be detected at haplotype-level resolution.

Dependencies

Tools:

  • Bcftools (version 1.21)
  • CADD (version 1.6)
  • LongcallR (version 0.1.0)
  • Minimap2 (version 2.26)
  • Python (version 3.8.18)
    • numpy (version 1.24.2)
    • pandas (version 2.0.3)
    • pysam (version 0.21.0)
    • pytabix (version 0.1)
    • tensorflow (version 2.13.1)
  • R (version 4.3.1)
    • argparse (version 2.2.2)
    • dplyr (version 1.1.4)
    • ggplot2 (version 3.4.4)
    • tidyr (version 1.3.1)
  • Samtools (version 1.21)
  • Singularity (version 3.11.1)
    • Clair3 (version 1.0.10)
    • DeepVariant (version 1.6.1)
    • DeepTrio (version 1.6.1)
  • SpliceAI (version 1.3.1)
  • Stringtie (version 2.2.3)

Databases:

Usage:

Building a STRIPE database

python Build_STRIPE.py [-h] --gene_list /path/to/target/gene/list --anno_gtf /path/to/gene/annotation/gtf
    [--gtex_files /path/to/GTEx/files/list] [--tissue_list /path/to/GTEx/tissues/file]
    [--gnomad_files /path/to/gnomAD/files/list] [--only_bed] --outdir /path/to/output/folder
Parameter Description
--gene_list Path to a single-column text file with a list of Ensembl gene IDs for target genes
--anno_gtf Path to a GTF file with reference gene annotations from GENCODE (download here)
--gtex_files (Optional) Path to a single-column text file containing file paths to GTEx v8 files (see Databases)
--tissue_list (Optional) Path to a two-column text file in which column 1 has the official names of GTEx tissues (see here) and column 2 has the desired labels for each tissue
--gnomad_files (Optional) Path to a single-column text file containing file paths to all gnomAD VCF files (chr1-22,X,Y)
--only_bed (Optional) Forces Build_STRIPE.py to only construct a BED file for input target genes
--outdir Path to output directory for STRIPE database files

Build_STRIPE.py will generate the following intermediate folders/files in --outdir that are needed for running STRIPE:

  • target_genes.bed: A BED6 file containing the genomic coordinates of input target genes (based on reference gene annotations from --anno_gtf)
  • GTEx_v8: A folder containing subfolders for each tissue label in --tissue_list. Each subfolder contains the following files:
    • target_genes.gene_tpm.txt: A matrix file containing normalized abundances (TPM) for target genes across GTEx samples of the given tissue type
    • target_genes.junction_counts.txt: A matrix file containing read counts for splice junctions mapping to target genes across GTEx samples of the given tissue type
    • target_genes.haplotype_expression.txt: A matrix file containing haplotype-specific read counts for target genes across GTEx samples of the given tissue type
  • gnomAD_v4: A folder containing subfolders for each target gene. Each subfolder contains a file gnomad_variants.tsv, which describes all known gnomAD variants mapping to the target gene as follows:
    • CHROM, POS, REF, ALT: Chromosome, position, reference allele, alternate allele
    • AF_GRPMAX_JOINT: Maximum allele frequency across gnomAD populations
    • NHOMALT_GRPMAX_JOINT: Maximum number of homozygous individuals across gnomAD populations

Running STRIPE

python Run_STRIPE.py [-h] --infile /path/to/input/TEQUILA/bam --targets /path/to/target/gene/BED --genome /path/to/reference/genome/FASTA
    --gencode /path/to/gencode/GTF --samtools /path/to/samtools --stringtie /path/to/stringtie --longcallr /path/to/longcallr
    --clair3 /path/to/clair3/sif --model /path/to/clair3/model --cadd /path/to/CADD --gnomad /path/to/gnomad/database
    --clinvar /path/to/clinvar/vcf --bcftools /path/to/bcftools --minimap2 /path/to/minimap2 --gtex_splice /path/to/gtex/splice/database
    --gtex_haplotype /path/to/gtex/haplotype/database [--threads <number of threads>] [--proband_vcf /path/to/proband/vcf]
    [--parent1_vcf /path/to/parent1/vcf] [--parent2_vcf /path/to/parent2/vcf] [--only_qc] [--phase_indels] --outdir /path/to/output/directory
Parameter Description
--infile Path to BAM file with TEQUILA-seq read alignments
--targets Path to BED6 file containing genomic coordinates of input target genes (output of running Build_STRIPE.py)
--genome Path to FASTA file with reference genome sequence
--gencode Path to GTF file with reference gene annotations from GENCODE (please use the same file as the one used in running Build_STRIPE.py)
--samtools Path to samtools executable (see here)
--stringtie Path to stringtie executable (see here)
--longcallr Path to longcallr executable (see here)
--clair3 Path to SIF file for Clair3. Run singularity pull on the appropriate SIF file from here
--model Path to folder for Clair3 model. Clair3 models based on the latest nanopore sequencing chemistries can be downloaded from here
--cadd Path to CADD.sh script (download and follow installation instructions here)
--gnomad Path to gnomAD_v4 output folder generated by Build_STRIPE.py
--clinvar Path to VCF file containing the latest release of ClinVar variants (see here)
--bcftools Path to bcftools executable (see here)
--minimap2 Path to minimap2 executable (see here)
--gtex_splice Path to the appropriate target_genes.junction_counts.txt file generated by Build_STRIPE.py
--gtex_haplotype Path to the appropriate target_genes.haplotype_expression.txt file generated by Build_STRIPE.py
--threads (Optional) Number of threads (default: 1)
--proband_vcf (Optional) Path to a VCF file with small variants called from prior exome or genome sequencing data for the proband
--parent1_vcf (Optional) Path to a VCF file with small variants called from prior exome or genome sequencing data for the proband's parent
--parent2_vcf (Optional) Path to a VCF file with small variants called from prior exome or genome sequencing data for the proband's parent
--only_qc (Optional) Forces Run_STRIPE.py to only perform a quality control check on the input BAM file (see below for more details)
--phase_indels (Optional) Allows Run_STRIPE.py to use indels called from prior exome or genome sequencing data to inform phasing of TEQUILA-seq reads
--outdir Path to output directory for STRIPE

Run_STRIPE.py consists of the following steps:

  1. Quality control analyses: Generates the following files:

    • [SAMPLE].mapping_stats.txt: Read mapping statistics and the overall on-target rate for SAMPLE
    • [SAMPLE].top_2000_genes.txt: A three-column text file describing the 2000 most abundant genes in SAMPLE as follows:
      • Column 1: Gene ID
      • Column 2: Indicator for whether the given gene is a target gene or not
      • Column 3: Estimated abundance for the gene (TPM)
    • [SAMPLE].top_2000_genes.pdf: Plot showing the abundances (TPM) of the 2000 most abundant genes in SAMPLE, in which each gene is colored based on whether it is targeted or not. The overall on-target rate for SAMPLE is also indicated on the plot.
    • [SAMPLE].junctions.tsv: A four-column text file describing splice junctions discovered across all target genes in SAMPLE as follows:
      • Columns 1-3: (Chrom, Start, End) for each discovered splice junction
      • Column 4: Read count for each discovered splice junction
    • stringtie/[SAMPLE].gtf: A GTF file describing all genes and transcripts discovered by stringtie in SAMPLE
    • stringtie/[SAMPLE].gene_abundance.tsv: A tab-separated file describing all genes discovered by stringtie in SAMPLE and their abundances
  2. Phase TEQUILA-seq reads mapping to target genes: For each target gene, STRIPE will generate the following files (if TEQUILA-seq reads mapping to the gene can be phased)

    • gene.vcf.gz: A compressed and indexed VCF file containing phased variants used for guiding assignment of TEQUILA-seq reads to individual haplotypes
    • hap1_reads.bam: A BAM file containing alignments of TEQUILA-seq reads assigned to haplotype 1
    • hap2_reads.bam: A BAM file containing alignments of TEQUILA-seq reads assigned to haplotype 2
    • phasing_stats.txt: A text file summarizing the results of phasing TEQUILA-seq reads, including:
      • Number of reads assigned to each haplotype and the number of reads that could not be assigned to a haplotype
      • A phasing "quality score", which is computed as follows. Suppose we have a total of $N$ reads mapping to some gene with start and end genomic coordinates $(S, T)$. The quality score, $Q$, is computed as: $$Q = \frac{\sum_{k=1}^{N}w_kz_k}{\sum_{k=1}^{N}w_k}$$ where $w_k = \max(\min(t_k, T) - \max(s_k, S), 0)/(T-S)$ and $(s_k, t_k)$ denotes the start and end genomic coordinates for the given read alignment. Here, a higher value of $Q$ indicates that more full-length reads mapping to the gene were able to be phased.
  3. Identify rare, deleterious variants in each target gene: For each target gene, STRIPE will generate a merged callset of variants from longcallR, clair3, deepvariant, and bcftools. The merged callset will then be filtered for variants meeting the following criteria: (i) CADD score > 15 or SpliceAI score > 0.1, (ii) maximum allele frequency across gnomAD populations < 1%. The filtered merged callset, merged_variants.tsv, will contain the following fields:

    • CHROM, POS, REF, ALT: Chromosome, position, reference allele, alternate allele
    • LONGCALLR_DP, LONGCALLR_AF, LONGCALLR_GT: Read depth, allele frequency, and genotype from longcallR for given variant
    • CLAIR3_DP, CLAIR3_AF, CLAIR3_GT: Read depth, allele frequency, and genotype from clair3 for given variant
    • DEEPVARIANT_DP, DEEPVARIANT_AF, DEEPVARIANT_GT: Read depth, allele frequency, and genotype from deepvariant for given variant
    • PHASING_DP, PHASING_AF, PHASING_GT: Read depth, allele frequency, and genotype from bcftools for given variant
      • Here, we ran bcftools on haplotype-assigned TEQUILA-seq read sets and constructed a phased genotype
    • CADD: Functional deleteriousness score from CADD (verison 1.6)
    • SPLICEAI: Likelihood of impacting splicing from SpliceAI
    • GNOMAD_AF_GRPMAX_JOINT: Maximum allele frequency across gnomAD populations
    • GNOMAD_NHOMALT_GRPMAX_JOINT: Maximum number of homozygous individuals across gnomAD populations
    • CLINVAR: ClinVar annotations for variant
    • HOMOPOLYMER: Indicator for whether the variant lies in a homopolymer region (useful for flagging false positive variant calls due to basecalling artifacts)
    • END_DISTANCE_PCT: Percentage of reads supporting the variant in which the variant is found within 30 bp of the alignment ends (useful for flagging false positive variant calls due to alignment artifacts)
    • SKIP_FREQ: Percentage of reads covering the variant locus in which the variant position is intronic (useful for flagging false positive variant calls due to local mis-alignments around splice sites)
  4. Detect aberrant splice junctions in each target gene: For each target gene, STRIPE will extract unique splice junctions from all TEQUILA-seq reads (as well as haplotype-specific reads) mapping to the gene locus and evaluate whether usage frequencies for these splice junctions are unusually higher than expected compared to usage frequencies observed among tissue-matched GTEx controls. Each output file will contain the following fields:

    • Junction: Splice junction coordinates
    • Count: Number of reads supporting the splice junction
    • Coverage: Total number of reads covering the corresponding splice sites
    • Usage: Usage frequency of the splice junction in sample
    • Population: Average usage frequency of the splice junction across GTEx controls
    • Shift: Difference in usage frequencies between the sample and GTEx controls
    • PVal: P value for testing whether the sample has a usage frequency that is consistent with those observed in GTEx controls
  5. Find target genes showing strong haplotype dosage imbalance: For each target gene, STRIPE will also assess whether the gene shows unusually strong haplotype dosage imbalance relative to haplotype expression ratios observed among tissue-matched GTEx controls. The output file will contain the following information:

    • Number of reads assigned to haplotypes 1 and 2
    • Quality score for read phasing results
    • P value for testing whether the haplotype expression ratio observed in the sample is consistent with those observed in GTEx controls

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published