STRIPE is a versatile workflow that enables deep sequencing of full-length transcripts for any disease-specific gene panel such that a wide range of clinically informative readouts, including transcript aberrations and sequence variants, can be detected at haplotype-level resolution.
Tools:
- Bcftools (version 1.21)
- CADD (version 1.6)
- LongcallR (version 0.1.0)
- Minimap2 (version 2.26)
- Python (version 3.8.18)
- numpy (version 1.24.2)
- pandas (version 2.0.3)
- pysam (version 0.21.0)
- pytabix (version 0.1)
- tensorflow (version 2.13.1)
- R (version 4.3.1)
- argparse (version 2.2.2)
- dplyr (version 1.1.4)
- ggplot2 (version 3.4.4)
- tidyr (version 1.3.1)
- Samtools (version 1.21)
- Singularity (version 3.11.1)
- Clair3 (version 1.0.10)
- DeepVariant (version 1.6.1)
- DeepTrio (version 1.6.1)
- SpliceAI (version 1.3.1)
- Stringtie (version 2.2.3)
Databases:
- GTEx v8:
- gnomAD v4.1.0:
- Please download the VCF files (chr1-22,X,Y) under "Genomes" here.
- ClinVar:
- Please download the latest VCF file of ClinVar variants here.
Building a STRIPE database
python Build_STRIPE.py [-h] --gene_list /path/to/target/gene/list --anno_gtf /path/to/gene/annotation/gtf
[--gtex_files /path/to/GTEx/files/list] [--tissue_list /path/to/GTEx/tissues/file]
[--gnomad_files /path/to/gnomAD/files/list] [--only_bed] --outdir /path/to/output/folder
Parameter | Description |
---|---|
--gene_list |
Path to a single-column text file with a list of Ensembl gene IDs for target genes |
--anno_gtf |
Path to a GTF file with reference gene annotations from GENCODE (download here) |
--gtex_files |
(Optional) Path to a single-column text file containing file paths to GTEx v8 files (see Databases) |
--tissue_list |
(Optional) Path to a two-column text file in which column 1 has the official names of GTEx tissues (see here) and column 2 has the desired labels for each tissue |
--gnomad_files |
(Optional) Path to a single-column text file containing file paths to all gnomAD VCF files (chr1-22,X,Y) |
--only_bed |
(Optional) Forces Build_STRIPE.py to only construct a BED file for input target genes |
--outdir |
Path to output directory for STRIPE database files |
Build_STRIPE.py
will generate the following intermediate folders/files in --outdir
that are needed for running STRIPE:
target_genes.bed
: A BED6 file containing the genomic coordinates of input target genes (based on reference gene annotations from--anno_gtf
)GTEx_v8
: A folder containing subfolders for each tissue label in--tissue_list
. Each subfolder contains the following files:target_genes.gene_tpm.txt
: A matrix file containing normalized abundances (TPM) for target genes across GTEx samples of the given tissue typetarget_genes.junction_counts.txt
: A matrix file containing read counts for splice junctions mapping to target genes across GTEx samples of the given tissue typetarget_genes.haplotype_expression.txt
: A matrix file containing haplotype-specific read counts for target genes across GTEx samples of the given tissue type
gnomAD_v4
: A folder containing subfolders for each target gene. Each subfolder contains a filegnomad_variants.tsv
, which describes all known gnomAD variants mapping to the target gene as follows:CHROM
,POS
,REF
,ALT
: Chromosome, position, reference allele, alternate alleleAF_GRPMAX_JOINT
: Maximum allele frequency across gnomAD populationsNHOMALT_GRPMAX_JOINT
: Maximum number of homozygous individuals across gnomAD populations
Running STRIPE
python Run_STRIPE.py [-h] --infile /path/to/input/TEQUILA/bam --targets /path/to/target/gene/BED --genome /path/to/reference/genome/FASTA
--gencode /path/to/gencode/GTF --samtools /path/to/samtools --stringtie /path/to/stringtie --longcallr /path/to/longcallr
--clair3 /path/to/clair3/sif --model /path/to/clair3/model --cadd /path/to/CADD --gnomad /path/to/gnomad/database
--clinvar /path/to/clinvar/vcf --bcftools /path/to/bcftools --minimap2 /path/to/minimap2 --gtex_splice /path/to/gtex/splice/database
--gtex_haplotype /path/to/gtex/haplotype/database [--threads <number of threads>] [--proband_vcf /path/to/proband/vcf]
[--parent1_vcf /path/to/parent1/vcf] [--parent2_vcf /path/to/parent2/vcf] [--only_qc] [--phase_indels] --outdir /path/to/output/directory
Parameter | Description |
---|---|
--infile |
Path to BAM file with TEQUILA-seq read alignments |
--targets |
Path to BED6 file containing genomic coordinates of input target genes (output of running Build_STRIPE.py ) |
--genome |
Path to FASTA file with reference genome sequence |
--gencode |
Path to GTF file with reference gene annotations from GENCODE (please use the same file as the one used in running Build_STRIPE.py ) |
--samtools |
Path to samtools executable (see here) |
--stringtie |
Path to stringtie executable (see here) |
--longcallr |
Path to longcallr executable (see here) |
--clair3 |
Path to SIF file for Clair3. Run singularity pull on the appropriate SIF file from here |
--model |
Path to folder for Clair3 model. Clair3 models based on the latest nanopore sequencing chemistries can be downloaded from here |
--cadd |
Path to CADD.sh script (download and follow installation instructions here) |
--gnomad |
Path to gnomAD_v4 output folder generated by Build_STRIPE.py |
--clinvar |
Path to VCF file containing the latest release of ClinVar variants (see here) |
--bcftools |
Path to bcftools executable (see here) |
--minimap2 |
Path to minimap2 executable (see here) |
--gtex_splice |
Path to the appropriate target_genes.junction_counts.txt file generated by Build_STRIPE.py |
--gtex_haplotype |
Path to the appropriate target_genes.haplotype_expression.txt file generated by Build_STRIPE.py |
--threads |
(Optional) Number of threads (default: 1) |
--proband_vcf |
(Optional) Path to a VCF file with small variants called from prior exome or genome sequencing data for the proband |
--parent1_vcf |
(Optional) Path to a VCF file with small variants called from prior exome or genome sequencing data for the proband's parent |
--parent2_vcf |
(Optional) Path to a VCF file with small variants called from prior exome or genome sequencing data for the proband's parent |
--only_qc |
(Optional) Forces Run_STRIPE.py to only perform a quality control check on the input BAM file (see below for more details) |
--phase_indels |
(Optional) Allows Run_STRIPE.py to use indels called from prior exome or genome sequencing data to inform phasing of TEQUILA-seq reads |
--outdir |
Path to output directory for STRIPE |
Run_STRIPE.py
consists of the following steps:
-
Quality control analyses: Generates the following files:
-
[SAMPLE].mapping_stats.txt
: Read mapping statistics and the overall on-target rate forSAMPLE
-
[SAMPLE].top_2000_genes.txt
: A three-column text file describing the 2000 most abundant genes inSAMPLE
as follows:- Column 1: Gene ID
- Column 2: Indicator for whether the given gene is a target gene or not
- Column 3: Estimated abundance for the gene (TPM)
-
[SAMPLE].top_2000_genes.pdf
: Plot showing the abundances (TPM) of the 2000 most abundant genes inSAMPLE
, in which each gene is colored based on whether it is targeted or not. The overall on-target rate forSAMPLE
is also indicated on the plot. -
[SAMPLE].junctions.tsv
: A four-column text file describing splice junctions discovered across all target genes inSAMPLE
as follows:- Columns 1-3: (Chrom, Start, End) for each discovered splice junction
- Column 4: Read count for each discovered splice junction
-
stringtie/[SAMPLE].gtf
: A GTF file describing all genes and transcripts discovered bystringtie
inSAMPLE
-
stringtie/[SAMPLE].gene_abundance.tsv
: A tab-separated file describing all genes discovered bystringtie
inSAMPLE
and their abundances
-
-
Phase TEQUILA-seq reads mapping to target genes: For each target gene, STRIPE will generate the following files (if TEQUILA-seq reads mapping to the gene can be phased)
-
gene.vcf.gz
: A compressed and indexed VCF file containing phased variants used for guiding assignment of TEQUILA-seq reads to individual haplotypes -
hap1_reads.bam
: A BAM file containing alignments of TEQUILA-seq reads assigned to haplotype 1 -
hap2_reads.bam
: A BAM file containing alignments of TEQUILA-seq reads assigned to haplotype 2 -
phasing_stats.txt
: A text file summarizing the results of phasing TEQUILA-seq reads, including:- Number of reads assigned to each haplotype and the number of reads that could not be assigned to a haplotype
- A phasing "quality score", which is computed as follows. Suppose we have a total of
$N$ reads mapping to some gene with start and end genomic coordinates$(S, T)$ . The quality score,$Q$ , is computed as:$$Q = \frac{\sum_{k=1}^{N}w_kz_k}{\sum_{k=1}^{N}w_k}$$ where$w_k = \max(\min(t_k, T) - \max(s_k, S), 0)/(T-S)$ and$(s_k, t_k)$ denotes the start and end genomic coordinates for the given read alignment. Here, a higher value of$Q$ indicates that more full-length reads mapping to the gene were able to be phased.
-
-
Identify rare, deleterious variants in each target gene: For each target gene, STRIPE will generate a merged callset of variants from
longcallR
,clair3
,deepvariant
, andbcftools
. The merged callset will then be filtered for variants meeting the following criteria: (i) CADD score > 15 or SpliceAI score > 0.1, (ii) maximum allele frequency across gnomAD populations < 1%. The filtered merged callset,merged_variants.tsv
, will contain the following fields:-
CHROM
,POS
,REF
,ALT
: Chromosome, position, reference allele, alternate allele -
LONGCALLR_DP
,LONGCALLR_AF
,LONGCALLR_GT
: Read depth, allele frequency, and genotype fromlongcallR
for given variant -
CLAIR3_DP
,CLAIR3_AF
,CLAIR3_GT
: Read depth, allele frequency, and genotype fromclair3
for given variant -
DEEPVARIANT_DP
,DEEPVARIANT_AF
,DEEPVARIANT_GT
: Read depth, allele frequency, and genotype fromdeepvariant
for given variant -
PHASING_DP
,PHASING_AF
,PHASING_GT
: Read depth, allele frequency, and genotype frombcftools
for given variant- Here, we ran
bcftools
on haplotype-assigned TEQUILA-seq read sets and constructed a phased genotype
- Here, we ran
-
CADD
: Functional deleteriousness score from CADD (verison 1.6) -
SPLICEAI
: Likelihood of impacting splicing from SpliceAI -
GNOMAD_AF_GRPMAX_JOINT
: Maximum allele frequency across gnomAD populations -
GNOMAD_NHOMALT_GRPMAX_JOINT
: Maximum number of homozygous individuals across gnomAD populations -
CLINVAR
: ClinVar annotations for variant -
HOMOPOLYMER
: Indicator for whether the variant lies in a homopolymer region (useful for flagging false positive variant calls due to basecalling artifacts) -
END_DISTANCE_PCT
: Percentage of reads supporting the variant in which the variant is found within 30 bp of the alignment ends (useful for flagging false positive variant calls due to alignment artifacts) -
SKIP_FREQ
: Percentage of reads covering the variant locus in which the variant position is intronic (useful for flagging false positive variant calls due to local mis-alignments around splice sites)
-
-
Detect aberrant splice junctions in each target gene: For each target gene, STRIPE will extract unique splice junctions from all TEQUILA-seq reads (as well as haplotype-specific reads) mapping to the gene locus and evaluate whether usage frequencies for these splice junctions are unusually higher than expected compared to usage frequencies observed among tissue-matched GTEx controls. Each output file will contain the following fields:
-
Junction
: Splice junction coordinates -
Count
: Number of reads supporting the splice junction -
Coverage
: Total number of reads covering the corresponding splice sites -
Usage
: Usage frequency of the splice junction in sample -
Population
: Average usage frequency of the splice junction across GTEx controls -
Shift
: Difference in usage frequencies between the sample and GTEx controls -
PVal
: P value for testing whether the sample has a usage frequency that is consistent with those observed in GTEx controls
-
-
Find target genes showing strong haplotype dosage imbalance: For each target gene, STRIPE will also assess whether the gene shows unusually strong haplotype dosage imbalance relative to haplotype expression ratios observed among tissue-matched GTEx controls. The output file will contain the following information:
- Number of reads assigned to haplotypes 1 and 2
- Quality score for read phasing results
- P value for testing whether the haplotype expression ratio observed in the sample is consistent with those observed in GTEx controls