# TELSVirus


A Snakemake workflow for viral strain detection.

## Requirements

All dependencies are managed through conda environments included in the repository.

The workflow is currently built around basecalled Nanopore sequencing output. This does not mean it cannot work for PacBio sequencing data, but it has not been tested on such data.

## Install Snakemake and Clone the Repository

Create the `telsvirus` environment using conda:

```shell
conda create -c conda-forge -c bioconda -c anaconda -n telsvirus snakemake git git-lfs
```

Activate the environment and clone the repository:

```shell
conda activate telsvirus
git clone https://github.com/jonathan-bravo/TELSVirus.git
```

## Update Config

Instructions on updating the configuration can be found here.

## Usage on Local Desktop or Interactive HPC Run

Make sure to update the `cores` value in the local or hpc profile, located at `workflow/profiles/local/config.yaml` or `workflow/profiles/hpc/config.yaml`, if a different number of CPU cores is available on your system.

| Profile | Profile Variable | Default Value |
| ------- | ---------------- | ------------- |
| local   | cores            | 6             |
| hpc     | cores            | 120           |
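If you update the core count often, a small helper can rewrite the profile in place. This is a sketch, not part of the workflow; the helper name `set_profile_cores` is hypothetical, and it assumes the profile `config.yaml` contains a top-level line of the form `cores: 6`.

```shell
# Hypothetical helper: rewrite the "cores:" line of a profile config.yaml.
# Assumes the file contains a top-level line like "cores: 6".
set_profile_cores() {
    profile_yaml=$1  # e.g. workflow/profiles/local/config.yaml
    new_cores=$2
    sed -i "s/^cores:.*/cores: ${new_cores}/" "$profile_yaml"
}
```

For example, `set_profile_cores workflow/profiles/local/config.yaml 8` would set the local profile to 8 cores.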

Running the workflow locally:

```shell
cd TELSVirus
snakemake --profile workflow/profiles/local
```

Running the workflow on an HPC interactively:

```shell
cd TELSVirus
snakemake --profile workflow/profiles/hpc
```

## Usage on Slurm Cluster

Make sure to update the email, account, and qos values in the slurm profile located at `workflow/profiles/slurm/config.yaml`:

```yaml
default-resources:
  - mem_mb=32000
  - account=
  - qos=
  - email=
  - mail_type="NONE"
```

Make sure all string values are surrounded by double quotes ("").
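For example, a filled-in profile might look like the following; the account, qos, and email values here are placeholders, not real settings:

```yaml
default-resources:
  - mem_mb=32000
  - account="my-lab"
  - qos="my-lab-b"
  - email="user@example.edu"
  - mail_type="NONE"
```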

Move `run.sh` from the resources directory up one level:

```shell
mv resources/run.sh .
```

Make sure to edit the email and time if necessary for your run. (I believe the email is necessary for batch runs.)

```shell
#SBATCH --mail-user=<email>
#SBATCH --time=24:00:00
```

Launching a SLURM job for the workflow:

```shell
cd TELSVirus

# Run the workflow
sbatch run.sh
```

## Test Data

A negative and a positive sample are included in `resources/test/reads/`.

NOTE: git-lfs is a requirement for the test data to work. Without it, the FASTA and FASTQ files come through as git-lfs pointer stubs and will cause the workflow to error out.
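A quick way to check whether the test files were fetched properly: un-fetched git-lfs pointer stubs are small text files that begin with a `version https://git-lfs...` line instead of sequence data. The `is_lfs_pointer` helper below is a sketch for this check, not part of the workflow:

```shell
# Hypothetical helper: detect an un-fetched git-lfs pointer stub.
# Pointer files start with "version https://git-lfs.github.com/spec/v1".
is_lfs_pointer() {
    head -n 1 "$1" | grep -q '^version https://git-lfs'
}
```

If a file turns out to still be a pointer, running `git lfs pull` inside the clone should fetch the real data.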

## Output

| Name | Content |
| ---- | ------- |
| `on_target_stats.tsv` | A file containing a row per sample with the number of input reads, the number of reads mapped to the host, the number of reads mapped to the viral database, the number of unmapped reads, the host read percent, and the on-target percent. |
| `{sample}_add_sample_info.done` | A flag file ensuring the run ID and sample ID are added to all metric files. |
| `{sample}_chimeric_count.txt` | A file that contains a single count of reads that were split as chimeras during trimming. |
| `{sample}_dedup.fastq.gz` | The deduplicated input reads. |
| `{sample}_dup_reads.fastq.gz` | The reads removed during deduplication. |
| `{sample}_duplicates.txt` | The IDs of reads considered duplicates. |
| `{sample}_find_duplicates.done` | A flag file ensuring deduplication is finished. |
| `{sample}_hard_trim_count.txt` | A count of reads that were removed from analysis for being too short. |
| `{sample}_non_host.fastq.gz` | Deduplicated reads with host reads removed. |
| `{sample}_post_dedup_rl.tsv` | A file that contains read lengths after deduplication. |
| `{sample}_pre_dedup_rl.tsv` | A file that contains read lengths before deduplication. |
| `{sample}_reads_per_strain_filtered.tsv` | The number of reads that aligned to each viral strain in the viral_genomes, filtered to only strains with > 0 reads. |
| `{sample}_reads_per_strain.tsv` | The number of reads that aligned to each viral strain in the viral_genomes. |
| `{sample}_selected_viral_targets.log` | The selected viral strains from viral_genomes. A strain is selected if it has a horizontal coverage of ≥ 80%. If multiple viral accessions share the same strain, the one with the highest horizontal coverage is chosen; if the horizontal coverage is the same, the accession with the highest mean depth is chosen. |
| `{sample}_start_read_count.txt` | A file that contains a single count of reads before any processing. |
| `{sample}_stats_viruses_sorted_sftclp_REMOVED.bam` | Alignments that were removed from `{sample}_stats_viruses_sorted_sftclp.bam` for failing the soft-clip check. |
| `{sample}_trimmed.fastq.gz` | The reads after trimming. |
| `{sample}_trimmed.log` | A log containing the sequences trimmed from each read, the number of bases trimmed from each end, and the full sequence of any read that was too short and removed from further processing. |
| `{sample}_viral_target_genomes.fasta` | A FASTA file containing all viral sequences listed in the `{sample}_selected_viral_targets.log` file. |
| `{sample}_VIRAL_TARGETS_FOUND` or `{sample}_NO_VIRAL_TARGETS` | A flag file indicating whether viral targets were found. Originally used for further processing; currently informational only. |
| `{sample}_viral_targets.log` | All viral targets before applying filtering. |
| `{sample}_viruses_sorted_sftclp_REMOVED.bam` | Alignments that were removed from `{sample}_viruses_sorted_sftclp.bam` for failing the soft-clip check. |
| `{sample}_viruses_sorted_sftclp.bam` | The alignment file used for determining all viral targets. |
| `{sample}_viruses_sorted_sftclp.bam.bai` | The index file of `{sample}_viruses_sorted_sftclp.bam`. |
| `{sample}.mpileup` | Pileup file generated for determining the horizontal coverage and mean depth of viral targets. |
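As an illustration, the on-target percent in `on_target_stats.tsv` can be recomputed from the raw counts as `100 * viral_mapped / input_reads`. The sketch below assumes tab-separated columns in the order listed above (sample, input reads, host-mapped, viral-mapped, unmapped, then the percents); that column order is an assumption, not something the workflow documents:

```shell
# Hypothetical sketch: recompute on-target percent from an
# on_target_stats.tsv-style row. Assumed column order:
# sample, input, host-mapped, viral-mapped, unmapped, ...
on_target_pct() {
    awk -F'\t' '{ printf "%.2f\n", 100 * $4 / $2 }' "$@"
}
```

With no file argument the function reads stdin, so a single row can be piped through it to cross-check the reported percent.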

## Making a Workflow DAG

```shell
snakemake --forceall --rulegraph | dot -Tsvg > dag.svg
```

## Workflow DAG Image

![Workflow Image](dag.svg)
