A Snakemake workflow for viral strain detection.
All dependencies are managed through conda environments included in the repository.
The workflow is currently built around basecalled Nanopore sequencing output. It may also work with PacBio sequencing data, but this has not been tested.
Create the environment for telsvirus using conda:
conda create -c conda-forge -c bioconda -c anaconda -n telsvirus snakemake git git-lfs
Activate the environment and clone the repository:
conda activate telsvirus
git clone https://github.com/jonathan-bravo/TELSVirus.git
Instructions on updating the configuration can be found here.
Make sure to update the cores value in the local or hpc profile, located at workflow/profiles/local/config.yaml or workflow/profiles/hpc/config.yaml, if a different number of CPU cores is available on your system.
Profile | Profile Variable | Default Value |
---|---|---|
local | cores | 6 |
hpc | cores | 120 |
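
As a rough illustration of the adjustment described above, the sketch below rewrites the cores entry of a profile's text to match the CPU count reported by the operating system. It assumes the profile stores the value as a top-level line of the form cores: <number>, which is how Snakemake profiles usually express command-line options. The helper name set_cores and the example profile text are hypothetical and not part of TELSVirus.

```python
import os
import re


def set_cores(config_text: str, cores: int) -> str:
    """Return config_text with the top-level `cores:` value replaced.

    Assumes a line of the form `cores: <integer>`; raises if none is found.
    """
    new_text, count = re.subn(
        r"^(cores:[ \t]*)\d+", rf"\g<1>{cores}", config_text, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError("no top-level 'cores:' entry found")
    return new_text


# Hypothetical profile content, used only to demonstrate the helper.
example_profile = "cores: 6\nuse-conda: true\n"
available = os.cpu_count() or 1  # os.cpu_count() can return None
print(set_cores(example_profile, available))
```

To apply this to a real profile, read the file's text, pass it through set_cores, and write the result back. Editing the file by hand works just as well.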
Running the workflow locally:
cd TELSVirus
snakemake --profile workflow/profiles/local
Running the workflow on an HPC interactively:
cd TELSVirus
snakemake --profile workflow/profiles/hpc
Make sure to update the email, account, and qos values in the slurm profile located at workflow/profiles/slurm/config.yaml:
default-resources:
- mem_mb=32000
- account=
- qos=
- email=
- mail_type="NONE"
Make sure all string values are surrounded by double quotes ("").
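
The quoting rule can be checked mechanically. The sketch below scans list entries of the form - key=value and reports any whose value is neither a bare number nor wrapped in double quotes, which also catches values left empty. The function name unquoted_strings is hypothetical, and the check is a simple text scan that works under these assumptions rather than a full YAML parser.

```python
import re


def unquoted_strings(config_text: str) -> list[str]:
    """Return the keys of `- key=value` entries whose value is neither a
    bare number nor wrapped in double quotes (empty values are reported)."""
    problems = []
    for line in config_text.splitlines():
        match = re.match(r"^\s*-\s*(\w+)=(.*)$", line)
        if not match:
            continue  # not a `- key=value` entry
        key, value = match.group(1), match.group(2).strip()
        is_number = re.fullmatch(r"\d+(\.\d+)?", value) is not None
        is_quoted = len(value) >= 2 and value.startswith('"') and value.endswith('"')
        if not (is_number or is_quoted):
            problems.append(key)
    return problems


# The unedited template from the slurm profile shown above.
template = """default-resources:
- mem_mb=32000
- account=
- qos=
- email=
- mail_type="NONE"
"""
print(unquoted_strings(template))  # ['account', 'qos', 'email']
```

An empty list means every string value is quoted and no value was left blank.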
Move the run.sh script from the resources directory up one level:
mv resources/run.sh .
Make sure to edit the email and time if necessary for your run. (I believe the email is necessary for batch runs.)
#SBATCH --mail-user=<email>
#SBATCH --time=24:00:00
Launching a SLURM job for the workflow:
cd TELSVirus
# Run the workflow
sbatch run.sh
A negative and a positive sample are included in resources/test/reads/.
NOTE: git-lfs is a requirement for the test data to work. Without it, the FASTA and FASTQ files come through as git-lfs pointer files rather than the actual data and will cause the workflow to error out.
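
One way to confirm that the test data was actually downloaded is to look for leftover Git LFS pointer files, which are small text files whose first line is version https://git-lfs.github.com/spec/v1. The sketch below is a minimal check along those lines. The function names are hypothetical and not part of the workflow.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS specification.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER


def find_lfs_pointers(directory: Path) -> list[Path]:
    """List files under directory that are still Git LFS pointers."""
    return sorted(
        p for p in directory.rglob("*") if p.is_file() and is_lfs_pointer(p)
    )


# Run from the repository root; the path is the one given in this README.
reads_dir = Path("resources/test/reads")
if reads_dir.is_dir():
    for pointer in find_lfs_pointers(reads_dir):
        print(f"still a git-lfs pointer: {pointer}")
```

If any files are reported, running git lfs pull inside the repository should replace them with the real data.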
Name | Content |
---|---|
on_target_stats.tsv | A file that contains a row with the number of input reads, number of reads mapped to host, number of reads mapped to viral database, number of unmapped reads, the host reads percent, and the on-target percent for each sample. |
{sample}_add_sample_info.done | A flag file ensuring run id and sample id are added to all metric files. |
{sample}_chimeric_count.txt | A file that contains a single count of reads that were split as chimeras during trimming. |
{sample}_dedup.fastq.gz | The deduplicated input reads. |
{sample}_dup_reads.fastq.gz | The reads removed during deduplication. |
{sample}_duplicates.txt | The ids of reads considered duplicates. |
{sample}_find_duplcates.done | A flag file ensuring deduplication is finished. |
{sample}_hard_trim_count.txt | The number of reads that were removed from analysis for being too short. |
{sample}_non_host.fastq.gz | Deduplicated and host-removed reads. |
{sample}_post_dedup_rl.tsv | A file that contains read lengths after deduplication. |
{sample}_pre_dedup_rl.tsv | A file that contains read lengths before deduplication. |
{sample}_reads_per_strain_filtered.tsv | The number of reads that aligned to each viral strain in the viral_genomes filtered to only those with |
{sample}_reads_per_strain.tsv | The number of reads that aligned to each viral strain in the viral_genomes. |
{sample}_selected_viral_targets.log | A file that contains the selected viral strains from viral_genomes. A strain is selected if it has a horizontal coverage of |
{sample}_start_read_count.txt | A file that contains a single count of reads before any processing. |
{sample}_stats_viruses_sorted_sftclp_REMOVED.bam | Alignments that were removed from the {sample}_stats_viruses_sorted_sftclp.bam for failing the soft-clip check. |
{sample}_trimmed.fastq.gz | The reads after trimming. |
{sample}_trimmed.log | A log file containing the sequences trimmed from each read, the number of bases trimmed from each end, and the full sequence if it was too short and removed from further processing. |
{sample}_viral_target_genomes.fasta | A FASTA file containing all viral sequences that are found in the {sample}_selected_viral_targets.log file. |
{sample}_VIRAL_TARGETS_FOUND OR {sample}_NO_VIRAL_TARGETS | A flag file indicating whether viral targets were found. Originally used for further processing; currently just for information. |
{sample}_viral_targets.log | All viral targets before applying filtering. |
{sample}_viruses_sorted_sftclp_REMOVED.bam | Alignments that were removed from the {sample}_viruses_sorted_sftclp.bam for failing the soft-clip check. |
{sample}_viruses_sorted_sftclp.bam | The alignment file used for determining all viral targets. |
{sample}_viruses_sorted_sftclp.bam.bai | The index file of {sample}_viruses_sorted_sftclp.bam. |
{sample}.mpileup | Pileup file generated for determining the horizontal coverage and mean depth of viral targets. |
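
The flag files listed above make it easy to check a sample's outcome from a script. The sketch below looks for {sample}_VIRAL_TARGETS_FOUND or {sample}_NO_VIRAL_TARGETS in a directory supplied by the caller. It assumes both flag files are written directly into that directory, which may not match the workflow's actual output layout. The function name viral_targets_found is hypothetical.

```python
from pathlib import Path
from typing import Optional


def viral_targets_found(output_dir: Path, sample: str) -> Optional[bool]:
    """Return True or False depending on which flag file exists, or None if
    neither is present (for example, the sample has not finished running).

    Assumes the flag files sit directly inside output_dir.
    """
    if (output_dir / f"{sample}_VIRAL_TARGETS_FOUND").exists():
        return True
    if (output_dir / f"{sample}_NO_VIRAL_TARGETS").exists():
        return False
    return None
```

A call such as viral_targets_found(Path("results"), "sample1") uses a placeholder directory and sample name. Substitute the output location and sample ids from your own run.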
snakemake --forceall --rulegraph | dot -Tsvg > dag.svg