A snakemake workflow for analyzing illumina WGS data of Mycobacterium Tuberculosis complex isolates

KevinLYW366/TBSeqPipe

TBSeqPipe

Introduction

TBSeqPipe is a flexible and user-friendly pipeline, built on the Snakemake workflow system, for analyzing WGS data of Mycobacterium tuberculosis complex isolates. Taking Illumina WGS data as input, the workflow performs basic analysis tasks as well as downstream, higher-level analysis steps. TBSeqPipe then generates a final summary report that integrates and presents the results from all analysis modules.

Workflow

[Figure: TBSeqPipe workflow diagram]

Installation

Environment

Conda

Conda is used as the package manager and is available here. If you already have conda, make sure the bioconda and conda-forge channels are added:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Snakemake

The Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. Detailed instructions can be found here. Quick installation:

  • Install mamba first (mamba provides a faster and more robust way to install conda packages):
conda install -n base -c conda-forge mamba
  • Install snakemake using mamba:
conda activate base
mamba create -c conda-forge -c bioconda -n snakemake snakemake

Clone the repository

git clone git@github.com:KevinLYW366/TBSeqPipe.git

Activate the environment

conda activate snakemake

Kraken database

A pre-built 8 GB database MiniKraken DB_8GB is the suggested reference database for TBSeqPipe. It is constructed from complete bacterial, archaeal, and viral genomes in RefSeq.

Set up configuration

To run the complete workflow, do the following:

  • Create a sample list file for all the samples you want to analyze, with one sample ID per line.
  • Copy all FASTQ files of your samples into one directory.
  • Customize the workflow to your needs in config/configfile.yaml. The parameters in the "Required Parameters" section must be entered manually:
    • sample_list: /path/to/sample_list_file
    • data_dir: /path/to/fastq_files
    • fastq_read_id_format, fastq_suffix_format and data_dir_format: give values based on the FASTQ file directory structure and the format of FASTQ file names
    • kraken_db: /path/to/minikraken_20171019_8GB
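
As a minimal sketch of the first step, the sample list can be derived from the FASTQ file names instead of being typed by hand. The function below assumes paired-end files named <sample>_1.fastq.gz and <sample>_2.fastq.gz in one flat directory. Both the function name make_sample_list and that naming scheme are illustrative assumptions, not part of TBSeqPipe, so pass a different suffix if your files follow another pattern.

```shell
# Sketch: build a sample list (one ID per line) from FASTQ file names.
# Assumption: forward reads are named <sample>_1.fastq.gz (hypothetical scheme).
make_sample_list() {
    local data_dir="$1"
    local suffix="${2:-_1.fastq.gz}"
    local f name
    for f in "$data_dir"/*"$suffix"; do
        [ -e "$f" ] || continue            # skip the unexpanded glob when nothing matches
        name="$(basename "$f")"
        printf '%s\n' "${name%"$suffix"}"  # strip the suffix to get the sample ID
    done | sort -u
}
```

For example, make_sample_list /path/to/fastq_files > /path/to/sample_list_file writes the list that sample_list should point to. It is worth checking the result by eye before a run, since a mismatched suffix silently yields an empty list.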

Usage

  1. Move to the TBSeqPipe directory.
cd /path/to/TBSeqPipe
  2. Start with a dry-run to check that everything is in order.
snakemake -r -p -n
  3. If no error message appears, start the formal run (adjust "-j 40", which sets the number of CPU cores used in parallel).
snakemake --use-conda -r -p -j 40
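
The dry-run and the formal run can be chained so that the formal run only starts when the dry-run succeeds. The following is a sketch under stated assumptions: run_tbseqpipe and the SNAKEMAKE variable are hypothetical names introduced here for illustration, and the Snakemake flags are exactly the ones used in the commands of this section.

```shell
# Sketch: only start the formal run when the dry-run exits successfully.
# run_tbseqpipe and SNAKEMAKE are illustrative names, not part of TBSeqPipe.
run_tbseqpipe() {
    local cores="${1:-40}"
    local smk="${SNAKEMAKE:-snakemake}"
    # Dry-run first (-n); stop here if Snakemake reports a problem.
    if ! "$smk" -r -p -n; then
        echo "Dry-run failed; fix the configuration before a formal run." >&2
        return 1
    fi
    # Formal run with the requested number of cores.
    "$smk" --use-conda -r -p -j "$cores"
}
```

Called from the TBSeqPipe directory as run_tbseqpipe 16, this would perform the dry-run and then a formal run on 16 cores. The SNAKEMAKE variable only exists so that a different executable path can be substituted.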

Note

Crashed and burned (Unlocking)

If the workflow was killed without Snakemake shutting down cleanly, the working directory stays locked. Once you are sure that Snakemake is no longer running (check with ps aux | grep snake), unlock the working directory:

snakemake *.snakemake --unlock
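
Before unlocking, it can help to confirm that the directory actually holds stale locks. The sketch below assumes that Snakemake keeps its lock files under .snakemake/locks/ in the working directory; this is an internal layout that may differ between Snakemake versions, and workdir_is_locked is a hypothetical helper name.

```shell
# Sketch: report whether a Snakemake working directory still holds lock files.
# Assumption: locks live under .snakemake/locks/ (internal layout, may change).
workdir_is_locked() {
    local lock_dir="${1:-.}/.snakemake/locks"
    # Locked if the directory exists and contains at least one entry.
    [ -d "$lock_dir" ] && [ -n "$(ls -A "$lock_dir" 2>/dev/null)" ]
}
```

A call such as workdir_is_locked /path/to/TBSeqPipe && echo "locked" only inspects files. It does not tell you whether a Snakemake process is still alive, so the process check above remains necessary.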

Rerun incomplete

If Snakemake marked a file as incomplete after a crash, delete it and produce it again:

snakemake *.snakemake --ri

License

The code is available under the GNU GPLv3 license. The text and data are available under the CC-BY license.

Questions and Issues

To contact the developer or report an issue, please go to Issues.
