This document describes the individual steps of the workflow. For instructions on installation and usage please see here.
- Table of Contents
- Third-party software used
- Description of workflow steps
- Rule graph
- Preparatory
- Sequencing mode-independent
start
create_index_star
sort_gtf
extract_transcriptome
concatenate_transcriptome_and_genome
create_index_salmon
create_index_kallisto
extract_transcripts_as_bed12
fastqc
fastqc_trimmed
sort_genomic_alignment_samtools
index_genomic_alignment_samtools
star_rpm
rename_star_rpm_for_alfa
sort_bed_4_big
prepare_bigWig
calculate_TIN_scores
salmon_quantmerge_genes
salmon_quantmerge_transcripts
kallisto_merge_genes
kallisto_merge_transcripts
pca_kallisto
pca_salmon
generate_alfa_index
alfa_qc
prepare_multiqc_config
multiqc_report
finish
- Sequencing mode-specific
- Description of SRA download workflow steps
Tag lines were taken from the developers' websites (code repository or manual)
Name | License | Tag line | More info |
---|---|---|---|
ALFA | MIT | "Annotation Landscape For Aligned reads" - "[...] provides a global overview of features distribution composing NGS dataset(s)" | code / manual / publication |
bedGraphToBigWig | MIT | "Convert a bedGraph file to bigWig format" | code / manual |
bedtools | GPLv2 | "[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF" | code / manual |
cutadapt | MIT | "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" | code / manual / publication |
gffread | MIT | "[...] validate, filter, convert and perform various other operations on GFF files" | code / manual |
Entrez Direct | custom | "[...] an advanced method for accessing the NCBI's set of interconnected databases from a UNIX terminal window" | code / manual / publication |
FastQC | GPLv3 | "A quality control analysis tool for high throughput sequencing data" | code / manual |
ImageMagick | custom^ | "[...] create, edit, compose, or convert bitmap images" | code / manual |
kallisto | BSD-2 | "[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads" | code / manual / publication |
MultiQC | GPLv3 | "Aggregate results from bioinformatics analyses across many samples into a single report" | code / manual / publication |
pigz | custom | "[...] parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data" | code / manual |
RSeqC | GPLv3 | "[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure." | code / manual / publication |
Salmon | GPLv3 | "Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment" | code / manual / publication |
SAMtools | MIT | "[...] suite of programs for interacting with high-throughput sequencing data" | code / manual / publication |
SRA Tools | custom | "[...] collection of tools and libraries for using data in the INSDC Sequence Read Archives" | code / manual |
STAR | MIT | "Spliced Transcripts Alignment to a Reference" - "RNA-seq aligner" | code / manual / publication |
^ compatible with GPLv3
The workflow consists of three Snakemake files: A main
Snakefile
and an individual Snakemake file for each sequencing mode (single-end and paired-end), as parameters for some tools differ between sequencing modes. The mainSnakefile
contains general steps for the creation of indices and other required files derived from the annotations, steps that are applicable to both sequencing modes, and steps that deal with gathering, summarizing or combining results. Individual steps of the workflow are described briefly, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table) are also described. Descriptions for steps for which individual "rules" exist for single- and paired-end sequencing libraries are combined, and only differences between the modes are highlighted.
Visual representation of workflow. Automatically prepared with Snakemake.
- tab-separated values (
.tsv
) file - first row has to contain parameter names as in
samples.tsv
- first column used as sample identifiers
Parameter name | Description | Data type(s) |
---|---|---|
sample | Descriptive sample name. NOTE: samples split in multiple fastq files (multilane samples), can be automatically merged by using the same ID |
str |
seqmode | There are two allowed values pe (paired-end) and se (single-end) according to the protocol used. |
str |
fq1 | Path of library file in .fastq.gz format (or mate 1 read file for paired-end libraries). |
str |
fq2 | Path of mate 2 read file in .fastq.gz format. Value ignored for for single-end libraries. |
str |
fq1_3p | Required for Cutadapt. 3' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. |
str |
fq1_5p | Required for Cutadapt. 5' adapter of mate 1. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. |
str |
fq2_3p | Required for Cutadapt. 3' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. |
str |
fq2_5p | Required for Cutadapt. 5' adapter of mate 2. Use value such as XXXXXXXXXXXXXXX if no adapter present or if no trimming is desired. Value ignored for single-end libraries. |
str |
fq1_polya_3p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. |
str |
fq1_polya_5p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. |
str |
fq2_polya_3p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 3' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. |
str |
fq2_polya_5p | Required for Cutadapt. Stretch of A s or T s, depending on read orientation. Trimmed from the 5' end of the read. Use value such as XXXXXXXXXXXXXXX if no poly(A) stretch present or if no trimming is desired. Value ignored for single-end libraries. |
str |
index_size | Required for STAR. Ideally the maximum read length minus 1. (max(ReadLength)-1 ). Values lower than maximum read length may result in lower mapping accuracy, while higher values may result in longer processing times. |
int |
kmer | Required for Salmon. Default value of 31 usually works fine for reads of 75 bp or longer. Consider using lower values if poor mapping is observed. | int |
organism | Name or identifier of organism or organism-specific genome resource version. Has to correspond to the naming of provided genome and gene annotation files and directories, like "ORGANISM" in the path below. Example: GRCh38 |
str |
gtf | Required for STAR. Path to gene annotation .gtf file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/gene_annotations.gtf |
str |
genome | Required for STAR. Path to genome .fa file. File needs to be in subdirectory corresponding to organism field. Example: /path/to/GRCh38/genome.fa |
str |
sd | Required for kallisto and Salmon, but only for single-end libraries. Estimated standard deviation of fragment length distribution. Can be assessed from, e.g., BioAnalyzer profiles | int |
mean | Required for kallisto and Salmon, but only for single-end libraries. Estimated mean of fragment length distribution. Can be assessed, e.g., from BioAnalyzer profiles | int |
libtype | Required for Salmon, and, after internal conversion, for kallisto and ALFA. See Salmon manual for allowed values. WARNING: do NOT use A to automatically infer the salmon library type, this will cause kallisto and ALFA to fail. |
str |
Sets up logging directories for the workflow run environment. Vanilla Python statement, not a Snakemake rule.
Copy and rename read files.
Local rule
- Input
- Reads file (
.fastq.gz
)
- Reads file (
- Output
- Reads file, copied, renamed (
.fastq.gz
); used in fastqc and remove_adapters_cutadapt
- Reads file, copied, renamed (
Create index for STAR short read aligner.
Indices need only be generated once for each combination of genome, set of annotations and index size.
- Input
- Genome sequence file (
.fasta
) - Gene annotation file (
.gtf
)
- Genome sequence file (
- Parameters
- samples.tsv
--sjdbOverhang
: maximum read length - 1; lower values may reduce accuracy, higher values may increase STAR runtime; specify in sample table columnindex_size
- samples.tsv
- Output
- STAR index; used in map_genome_star
- Index includes files:
- Chromosome length table
chrNameLength.txt
; used in generate_alfa_index and prepare_bigWig - Chromosome name list
chrName.txt
; used in create_index_salmon
- Chromosome length table
Sort provided gtf by chromosome, start and end coordinates.
This process is executed once for the provided gtf annotation file.
- Input
- Gene annotation file (
.gtf
)
- Gene annotation file (
- Output
- Sorted Gene annotation file (
.gtf
); used in extract_transcriptome, extract_transcripts_as_bed12, kallisto_merge_genes, generate_alfa_index and quantification_salmon
- Sorted Gene annotation file (
Create transcriptome from genome and gene annotations with gffread.
- Input
- Genome sequence file (
.fasta
) - Sorted Gene annotation file (
.gtf
); from sort_gtf
- Genome sequence file (
- Output
- Transcriptome sequence file (
.fasta
); used in concatenate_transcriptome_and_genome and create_index_kallisto
- Transcriptome sequence file (
Concatenate reference transcriptome and genome.
Required by Salmon
- Input
- Genome sequence file (
.fasta
) - Transcriptome sequence file (
.fasta
); from extract_transcriptome
- Genome sequence file (
- Output
- Transcriptome genome reference file (
.fasta
); used in create_index_salmon
- Transcriptome genome reference file (
Create index for Salmon quantification.
Required if Salmon is to be used in "mapping-based" mode. create_index_salmon Index is built using an auxiliary k-mer hash over k-mers of a specified length (default: 31). While the mapping algorithms will make use of arbitrarily long matches between the query and reference, the selected k-mer size will act as the minimum acceptable length for a valid match. Thus, a smaller value of k may slightly improve sensitivty. Empirically, the default value of k = 31 seems to work well for reads of 75 bp or longer. For shorter reads, consider using a smaller k.
- Input
- Transcriptome genome reference file (
.fasta
); from concatenate_transcriptome_and_genome - Chromosome name list
chrName.txt
; from create_index_star
- Transcriptome genome reference file (
- Parameters
- samples.tsv
--kmerLen
: k-mer length; specify in sample table columnkmer
- samples.tsv
- Output
- Salmon index; used in quantification_salmon
Create index for kallisto quantification.
The default kmer size of 31 is used in this workflow and is not configurable by the user.
- Input
- Transcriptome sequence file (
.fasta
); from extract_transcriptome
- Transcriptome sequence file (
- Output
- kallisto index; used in genome_quantification_kallisto
Convert transcripts from .gtf
to extended 12-column .bed
format with
custom-script. Note that the default transcript type setting is used, which is "protein_coding".
- Input
- Sorted Gene annotation file (
.gtf
); from sort_gtf
- Sorted Gene annotation file (
- Output
- Transcript annotations file (12-column
.bed
); used in calculate_TIN_scores
- Transcript annotations file (12-column
Prepare quality control report for reads library with FastQC.
- Input
- Reads file (
.fastq.gz
); from start
- Reads file (
- Output
- FastQC output directory with report (
.txt
) and figures (.png
); used in multiqc_report
- FastQC output directory with report (
Prepare quality control report for trimmed reads (after adapter and poly(A)-tail removal) with FastQC.
- Input
- Reads file (
.fastq.gz
); from remove_polya_cutadapt
- Reads file (
- Output
- FastQC output directory with report (
.txt
) and figures (.png
); used in multiqc_report
- FastQC output directory with report (
Sort BAM file with SAMtools.
Sort a genome aligned BAM file.
- Input
- Alignemnts file (
.bam
); from map_genome_star
- Alignemnts file (
- Output
- Alignemnts file (
.bam
); used in index_genomic_alignment_samtools & star_rpm & calculate_TIN_scores
- Alignemnts file (
Index BAM file with SAMtools.
Indexing a genome sorted BAM file allows one to quickly extract alignments overlapping particular genomic regions. Moreover, indexing is required by genome viewers such as IGV so that the viewers can quickly display alignments in a genomic region of interest.
- Input
- Alignemnts file (
.bam
); from sort_genomic_alignment_samtools
- Alignemnts file (
- Output
- BAM index file (
.bam.bai
); used in star_rpm & calculate_TIN_scores
- BAM index file (
Create stranded bedGraph coverage (.bg
) with
STAR.
Uses STAR's RPM normalization functionality.
STAR RPM uses SAM flags to correctly tell where the read and its mate mapped to. That is, if mate 1 is mapped to the plus strand, then mate 2 is mapped to the minus strand, and STAR will count mate 1 and mate 2 to the plus strand. This is in contrast to
bedtools genomecov -bg -split
, where a mate is assigned to a strand irrespective of its corresponding mate.
- Input
- Alignments file (
.bam
); from sort_genomic_alignment_samtools - BAM index file (
.bam.bai
); from index_genomic_alignment_samtools
- Alignments file (
- Output
- Coverage file (
.bg
); used in multiqc_report and rename_star_rpm_for_alfa
- Coverage file (
Rename and copy stranded bedGraph coverage tracks such that they comply with ALFA.
Local rule
Renaming to
plus.bg
andminus.bg
depends on library orientation, which is provided by user in sample table columnlibtype
.
- Input
- Coverage file (
.bg
); from star_rpm
- Coverage file (
- Output
- Coverage file, renamed (
.bg
); used in alfa_qc and sort_bed_4_big
- Coverage file, renamed (
Sort bedGraph files with bedtools.
- Input
- Coverage files, renamed (
.bg
); from rename_star_rpm_for_alfa
- Coverage files, renamed (
- Output
- Coverage files, sorted (
.bg
); used in prepare_bigWig
- Coverage files, sorted (
Generate bigWig from bedGraph with bedGraphToBigWig.
- Input
- Coverage files, sorted (
.bg
); from sort_bed_4_big - Chromosome length table
chrNameLength.txt
; from create_index_star
- Coverage files, sorted (
- Output
- Coverage files, one per strand and sample (
.bw
); used in finish
- Coverage files, one per strand and sample (
Calculates the Transcript Integrity Number (TIN) for each transcript with custom script based on RSeqC.
TIN is conceptually similar to RIN (RNA integrity number) but provides transcript-level measurement of RNA quality and is more sensitive for low-quality RNA samples.
- TIN score of a transcript reflects RNA integrity of the transcript
- Median TIN score across all transcripts reflects RNA integrity of library
- TIN ranges from 0 (the worst) to 100 (the best)
- A TIN of 60 means that 60% of the transcript would have been covered if the read coverage were uniform
- A TIN of 0 will be assigned if the transcript has no coverage or coverage is below threshold
- Input
- Alignments file (
.bam
); from sort_genomic_alignment_samtools - BAM index file (
.bam.bai
); from index_genomic_alignment_samtools - Transcript annotations file (12-column
.bed
); from extract_transcripts_as_bed12
- Alignments file (
- Parameters
- rule_config.yaml
-c 0
: minimum number of read mapped to a transcript (default 10)
- rule_config.yaml
- Output
- TIN score table (custom
tsv
); used in merge_TIN_scores
- TIN score table (custom
Merge gene-level expression estimates for all samples with Salmon.
Rule is run once per sequencing mode
- Input
- Gene expression tables (custom
.tsv
) for samples of same sequencing mode; from quantification_salmon
- Gene expression tables (custom
- Output
- Gene TPM table (custom
.tsv
); used in multiqc_report - Gene read count table (custom
.tsv
); used in pca_salmon and pca_kallisto
- Gene TPM table (custom
Merge transcript-level expression estimates for all samples with Salmon.
Rule is run once per sequencing mode
- Input
- Transcript expression tables (custom
.tsv
) for samples of same sequencing mode; from quantification_salmon
- Transcript expression tables (custom
- Output
- Transcript TPM table (custom
.tsv
); used in multiqc_report - Transcript read count table (custom
.tsv
); used in pca_salmon and pca_kallisto
- Transcript TPM table (custom
Merge gene-level expression estimates for all samples with custom script.
Rule is run once per sequencing mode
- Input
- Transcript expression tables (custom
.h5
) for samples of same sequencing mode; from genome_quantification_kallisto - Sorted Gene annotation file (
.gtf
); from sort_gtf
- Transcript expression tables (custom
- Output
- Gene TPM table (custom
.tsv
) - Gene read count table (custom
.tsv
) - Mapping gene/transcript IDs table (custom
.tsv
)
- Gene TPM table (custom
- Non-configurable & non-default
-txOut FALSE
: gene-level summarization (default would be transcript level)
Merge transcript-level expression estimates for all samples with custom script.
Rule is run once per sequencing mode
- Input
- Transcript expression tables (custom
.h5
) for samples of same sequencing mode; from genome_quantification_kallisto
- Transcript expression tables (custom
- Output
- Transcript TPM table (custom
.tsv
) - Transcript read count table (custom
.tsv
)
- Transcript TPM table (custom
Run PCA analysis on kallisto genes and transcripts with custom script.
Rule is run one time for transcript estimates and one time for genes estimates
- Input
- Transcript/Genes TPM table (custom
.tsv
)
- Transcript/Genes TPM table (custom
- Output
- Directory with PCA plots, scree plot and top loading scores.
Run PCA analysis on salmon genes and transcripts with custom script.
Rule is run one time for transcript estimates and one time for genes estimates
- Input
- Transcript/Genes TPM table (custom
.tsv
)
- Transcript/Genes TPM table (custom
- Output
- Directory with PCA plots, scree plot and top loading scores.
Create index for ALFA.
- Input
- Sorted Gene annotation file (
.gtf
); from sort_gtf - Chromosome length table
chrNameLength.txt
; from create_index_star
- Sorted Gene annotation file (
- Output
- ALFA index; stranded and unstranded; used in alfa_qc
Annotate alignments with ALFA.
For details on output plots, see ALFA documentation.
Note: the read orientation of a sample will be inferred from salmonlibtype
specified insamples.tsv
-
Input
- Coverage files, renamed (
.bg
); from rename_star_rpm_for_alfa - ALFA index, stranded; from generate_alfa_index
- Coverage files, renamed (
-
Parameters
-
Output
- Figures for biotypes and feature categories (
.pdf
) - Feature counts table (custom
.tsv
); used in alfa_qc_all_samples
- Figures for biotypes and feature categories (
Prepare config file for MultiQC.
Local rule
- Input
- Directories created during prepare_files_for_report
- Parameters
All parameters for this rule have to be specified in main
config.yaml
--intro-text
--custom-logo
--url
- Output
- Config file (
.yaml
); used in multiqc_report
- Config file (
Prepare interactive report from results and logs with MultiQC.
-
Input
- Config file (
.yaml
); from prepare_multiqc_config - ALFA plot, combined (
.png
); from alfa_concat_results - Coverage file (
.bg
); from star_rpm - FastQC output directory with report (
.txt
) and figures (.png
); from fastqc - Gene TPM table (custom
.tsv
); from salmon_quantmerge_genes - Gene read count table (custom
.tsv
); from salmon_quantmerge_genes - Pseudoalignments file (
.sam
); from genome_quantification_kallisto - TIN score box plots (
.pdf
and.png
); from plot_TIN_scores - Transcript TPM table (custom
.tsv
); from salmon_quantmerge_transcripts - Transcript read count table (custom
.tsv
); from salmon_quantmerge_transcripts
- Config file (
-
Output
- Directory with automatically generated
.html
report
- Directory with automatically generated
Target rule as required by Snakemake.
Local rule
- Input
- MultiQC report; from multiqc_report
- Coverage files, one per strand and sample (
.bw
); used in prepare_bigWig
Steps described here have two variants, one with the specified names for samples prepared with a single-end sequencing protocol, one with
pe_
prepended to the specified name for samples prepared with a paired-end sequencing protocol.
Remove adapter sequences from reads with Cutadapt.
-
Input
- Reads file (
.fastq.gz
); from start
- Reads file (
-
Parameters
- samples.tsv
- Adapters to be removed; specify in sample table columns
fq1_3p
,fq1_5p
,fq2_3p
,fq2_5p
- Adapters to be removed; specify in sample table columns
- rule_config.yaml:
-m 10
: Discard processed reads that are shorter than 10 nt. If specified inrule_config.yaml
, it will override ZARP's default value ofm=1
for this parameter. Note that this is different fromcutadapt
's default behavior (m=0
), which leads to empty reads being retained, causing problems in downstream applications in ZARP. We thus strongly recommend to not set the value ofm
to0
! Refer tocutadapt
's documentation for more information on them
parameter.-n 2
: search for all the given adapter sequences repeatedly, either until no adapter match was found or until 2 rounds have been performed. (default 1)
- samples.tsv
-
Output
- Reads file (
.fastq.gz
); used in remove_polya_cutadapt
- Reads file (
Remove poly(A) tails from reads with Cutadapt.
- Input
- Reads file (
.fastq.gz
); from remove_adapters_cutadapt
- Reads file (
- Parameters
- samples.tsv
- Poly(A) stretches to be removed; specify in sample table columns
fq1_polya
andfq2_polya
- Poly(A) stretches to be removed; specify in sample table columns
- rule_config.yaml
-m 10
: Discard processed reads that are shorter than 10 nt. If specified inrule_config.yaml
, it will override ZARP's default value ofm=1
for this parameter. Note that this is different fromcutadapt
's default behavior (m=0
), which leads to empty reads being retained, causing problems in downstream applications in ZARP. We thus strongly recommend to not set the value ofm
to0
! Refer tocutadapt
's documentation for more information on them
parameter.-O 1
: minimal overlap of 1 (default: 3)
- samples.tsv
- Output
- Reads file (
.fastq.gz
); used in genome_quantification_kallisto, map_genome_star and quantification_salmon
- Reads file (
Align short reads to reference genome and/or transcriptome with STAR.
- Input
- Reads file (
.fastq.gz
); from remove_polya_cutadapt - Index; from create_index_star
- Reads file (
- Parameters
- rule_config.yaml
--outFilterMultimapScoreRange=0
: the score range below the maximum score for multimapping alignments (default 1)--outFilterType=BySJout
: reduces the number of ”spurious” junctions--alignEndsType
: one ofLocal
(standard local alignment with soft-clipping allowed) orEndToEnd
(force end-to-end read alignment, do not soft-clip); specify in sample table columnsoft_clip
--twopassMode
: one ofNone
(1-pass mapping) orBasic
(basic 2-pass mapping, with all 1st-pass junctions inserted into the genome indices on the fly); specify in sample table columnpass_mode
--outFilterMultimapNmax
: maximum number of multiple alignments allowed; if exceeded, read is considered unmapped; specify in sample table columnmultimappers
- rule_config.yaml
- Output
- Aligned reads file (
.bam
); used in sort_genomic_alignment_samtools, - STAR log file
- Aligned reads file (
- Non-configurable & non-default
--outSAMattributes=All
: NH HI AS nM NM MD jM jI MC ch--outStd=BAM_Unsorted
: which output will be directed toSTDOUT
(default 'Log')--outSAMtype=BAM Unsorted
: type of SAM/BAM output (default SAM)--outSAMattrRGline
: ID:rnaseq_pipeline SM: sampleID
Estimate transcript- and gene-level expression with Salmon.
- Input
- Reads file (
.fastq.gz
); from remove_polya_cutadapt - Sorted Gene annotation file (
.gtf
); from sort_gtf - Index; from create_index_salmon
- Reads file (
- Parameters
- samples.tsv
libType
: see Salmon manual for allowed values; specify in sample table columnlibtype
--fldMean
: mean of distribution of fragment lengths; specify in sample table columnmean
(single-end only)--fldSD
: standard deviation of distribution of fragment lengths; specify in sample table columnsd
(single-end only)
- rule_config.yaml
--seqBias
: correct for sequence specific biases--validateMappings
: enables selective alignment of the sequencing reads when mapping them to the transcriptome; this can improve both the sensitivity and specificity of mapping and, as a result, can improve quantification accuracy.--writeUnmappedNames
: write out the names of reads (or mates in paired-end reads) that do not map to the transcriptome. For paired-end this gives flags that indicate how a read failed to map (paired-end only)
- samples.tsv
- Output
- Gene expression table (
quant.sf
); used in salmon_quantmerge_genes - Transcript expression table (
quant.sf
); used in salmon_quantmerge_transcripts meta_info.json
flenDist.txt
- Gene expression table (
Generate pseudoalignments of reads to transcripts with kallisto.
Note: the kallisto strandedness parameter will be inferred from salmon
libtype
specified insamples.tsv
- Input
- Reads file (
.fastq.gz
); from remove_polya_cutadapt - Index; from create_index_kallisto
- Reads file (
- Parameters
- samples.tsv
-l
: mean of distribution of fragment lengths; specify in sample table columnmean
(single-end only)-s
: standard deviation of distribution of fragment lengths; specify in sample table columnsd
(single-end only)
- samples.tsv
- Output
- Pseudoalignments file (
.sam
) and - abundance (
.h5
) used in kallisto_merge_genes
- Pseudoalignments file (
- Non-configurable & non-default
--single
: Quantify single-end reads (single-end only)--pseudobam
: Save pseudoalignments to transcriptome to BAM file
This separate workflow consists of three Snakemake files: A main
sra_download.smk
and an individual Snakemake file for each sequencing mode (single-end and paired-end), as parameters for some tools differ between sequencing modes. The mainsra_download.smk
contains general steps for downloading the samples from the SRA repository and determining the sequencing mode in order to execute the appropriate subsequent rules.Individual steps of the workflow are described briefly, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table) are also described. Descriptions for steps for which individual "rules" exist for single- and paired-end sequencing libraries are combined, and only differences between the modes are highlighted.
Get the library type of each sample (paired or single-end) using efetch Entrez direct.
- Output
- A file with a name either PAIRED or SINGLE which is used downstream to run the appropriate subworkflow.
Download the SRA entry using SRA Tools
Aggregate the fastq file(s) path(s) in a table for all samples.
Converts SRA entry to fastq file(s) using SRA Tools
Compresses fastq file(s) to .gz format. pigz
Keep the fastq.gz file path in a table, later aggregated in one table in add_fq_file_path
.