SeqPanther is a Python application that provides a suite of tools for interrogating the circumstances under which certain mutations occur or are missed, and that enables the user to modify the consensus as needed. The tool is applicable to non-segmented bacterial and viral genomes where reads are mapped to a reference sequence. SeqPanther generates detailed reports of mutations identified within a genomic segment or at positions of interest, including visualization of genome coverage and depth. The tool is particularly useful for examining multiple next-generation sequencing (NGS) short-read samples. Additionally, we have integrated SeqPatcher (Singh et al.), which supports merging Sanger sequences, or a consensus thereof, into their respective NGS consensus.
SeqPanther consists of the following set of commands:
- codoncounter: performs variant calling, generates nucleotide statistics at variant sites, and reports the impact of nucleotide changes on amino acids in the translated proteins.
- cc2ns (codon counter 2 nucleotide substitution): transforms the codoncounter output into a format in which the user can select the variants to integrate into the reference or assemblies using the nucsubs command. The user is encouraged to inspect and edit the output from codoncounter, keeping only the desired changes, before running the cc2ns command.
- nucsubs: accepts as input the changes file generated by cc2ns and modifies the consensus sequences accordingly, i.e. adds or changes the mutations as specified in the file.
- seqpatcher: integrates Sanger sequences covering missing regions of an incomplete assembly into that assembly. This command is a modification of the SeqPatcher tool.
SeqPanther is a command-line application for Unix and OS X.
The tool relies on multiple external open-source programs and Python modules, as listed below:
- python>=3.7
- Samtools
  - For sorting and indexing BAM files.
  - Install with: conda install -c bioconda samtools
- Bcftools
  - For variant calling and generating consensus sequences.
  - Install with: conda install -c bioconda bcftools
- Muscle (v. 3.8.31)
  - To perform multiple sequence alignment.
  - Install with: conda install -c bioconda muscle=3.8.31
- BLAT
  - To query sequence location in the genome.
  - Install with: conda install -c bioconda ucsc-blat
- MAFFT
  - To align consensus sequences against the reference.
  - Install with: conda install -c bioconda mafft
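Before running SeqPanther, it can help to confirm that the external programs are discoverable on the PATH. The following is a minimal sketch and not part of SeqPanther: it assumes the conda packages above provide executables named samtools, bcftools, muscle, blat and mafft, and the helper name missing_tools is hypothetical.

```python
import shutil
from typing import Iterable, List

# Executable names assumed to be provided by the conda packages listed above.
REQUIRED_TOOLS = ("samtools", "bcftools", "muscle", "blat", "mafft")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external programs:", ", ".join(missing))
    else:
        print("All external programs were found.")
```

The check only looks for the executables; it does not verify versions, so the Muscle 3.8.31 requirement still has to be confirmed separately.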
To install from a local clone of the repository, run:
git clone https://github.com/codemeleon/seqPanther
cd seqPanther
pip install .
To install directly from the GitHub repo, run the command:
pip install git+https://github.com/codemeleon/seqPanther.git
SeqPanther contains four commands, which are listed in the help menu. To view the help menu, type seqpanther --help or just seqpanther.
Help for the codoncounter command is available by running seqpanther codoncounter or seqpanther codoncounter --help.
Help for the cc2ns command is available by running seqpanther cc2ns or seqpanther cc2ns --help.
You might need to convert BAM files to consensus sequences before running seqpanther nucsubs. Consensus sequences can be generated using the following commands:
samtools index <sorted_bamfile>
bcftools mpileup -f <reference_fasta> <sorted_bamfile> | bcftools call -c --ploidy 1 | vcfutils.pl vcf2fq > <sorted_bamfile>.fq
seqtk seq -aQ64 <sorted_bamfile>.fq > <sorted_bamfile>.fasta
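When several BAM files need converting, the commands above can be generated per file and run in a loop. The sketch below is an illustration under stated assumptions, not part of SeqPanther: the function names consensus_commands and build_consensus and the output naming (the BAM file stem plus .fq and .fasta inside an output folder) are hypothetical, and samtools, bcftools, vcfutils.pl and seqtk are assumed to be on the PATH.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def consensus_commands(bam: Path, reference: Path, out_dir: Path) -> List[str]:
    """Build the shell commands for one sorted BAM file (hypothetical helper)."""
    fastq = out_dir / (bam.stem + ".fq")
    fasta = out_dir / (bam.stem + ".fasta")
    q = shlex.quote  # protect paths containing spaces or shell metacharacters
    return [
        f"samtools index {q(str(bam))}",
        f"bcftools mpileup -f {q(str(reference))} {q(str(bam))}"
        f" | bcftools call -c --ploidy 1 | vcfutils.pl vcf2fq > {q(str(fastq))}",
        f"seqtk seq -aQ64 {q(str(fastq))} > {q(str(fasta))}",
    ]


def build_consensus(bam_dir: Path, reference: Path, out_dir: Path) -> None:
    """Run the commands for every *.bam file found directly in `bam_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for bam in sorted(bam_dir.glob("*.bam")):
        for command in consensus_commands(bam, reference, out_dir):
            # shell=True is needed because the commands use pipes and redirection.
            # check=True raises if the shell reports a non-zero exit status
            # (for a pipeline, that is the status of its last command).
            subprocess.run(command, shell=True, check=True)
```

For example, build_consensus(Path("bam"), Path("reference.fasta"), Path("consensus")) would write one FASTA file per BAM file into a consensus folder, where reference.fasta stands in for your own reference file.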
Help for the nucsubs command is available by running seqpanther nucsubs or seqpanther nucsubs --help.
Help for the seqpatcher command is available by running seqpanther seqpatcher or seqpanther seqpatcher --help.
- A sample dataset of five BAM files from South African SARS-CoV-2 samples can be downloaded from the SeqPanther Zenodo page. The BAM files were generated by mapping reads against the reference (NC_045512.2) using BWA; the resulting SAM files were converted to BAM and sorted using samtools.
- Reference sequences in FASTA format and genome annotation GFF files were downloaded from NCBI Genome Fasta and NCBI Genome GFF, respectively. The downloaded files need to be uncompressed.
- Consensus sequences can be calculated from the BAM files using the command
  bcftools mpileup -f GCF_009858895.2_ASM985889v3_genomic.fna bam/K032258-consensus_alignment_sorted.bam | bcftools call -c --ploidy 1 | vcfutils.pl vcf2fq > K032258-consensus_alignment_sorted.fastq
  and
  python fastq2fasta.py -i K032258-consensus_alignment_sorted.fastq -o consensus/K032258-consensus_alignment_sorted.fasta
  fastq2fasta.py can be downloaded from the project repo.
- Store the downloaded BAM files in a folder, e.g. called bam.
- To generate a summary of the codon, nucleotide and indel changes in the Spike gene of these samples, use the command
  seqpanther codoncounter -bam bam -rid NC_045512.2 -ref GCF_009858895.2_ASM985889v3_genomic.fna -gff GCF_009858895.2_ASM985889v3_genomic.gff -coor_range 21563-25384
  This will generate the summaries for all the files in the bam folder.
  The command will generate four outputs in the current folder: sub_output.csv, containing details of the nucleotide substitutions; indel_output.csv, containing details of the indel events; codon_output.csv, containing details of the codon changes; and output.pdf, a plot of genome depth and breadth of coverage annotated with the positions of mutations and indels.
- If you only want to generate the results for a single BAM file, run the command as
  seqpanther codoncounter -bam ./bam/K032282-consensus_alignment_sorted.bam -rid NC_045512.2 -ref GCF_009858895.2_ASM985889v3_genomic.fna -gff GCF_009858895.2_ASM985889v3_genomic.gff -coor_range 21563-25384
  replacing the BAM file name with the name of your own BAM file.
- Outputs can be explored using a text file reader (for the text files) and a PDF reader (e.g. Adobe Reader) for the PDFs. An example command to view the text files would be:
  cat sub_output.csv | sed 's/,/ ,/g' | column -t -s, | less -S
  The user needs to explore those files and remove the changes they do not want integrated. A text editor of your choice, e.g. BBEdit or Notepad++, can be used to edit the files.
- If you decide that certain mutations need to be changed, convert the outputs from codoncounter to the format required by the nucsubs command by running
  seqpanther cc2ns -s sub_output.csv -i sub_output.csv -o changes
  This generates a CSV file for each sample in the ./changes folder.
- Then execute seqpanther as follows:
  seqpanther nucsubs -i NC_045512.2 -r NC_045512.2.fasta -c consensus -t changes -o results
  to integrate the relevant changes into the consensus sequences. The output will be generated in a folder named results.
- Example data for seqpanther seqpatcher is provided in the examples/seqpatcher folder of this project.
- To run seqpanther seqpatcher on the example data, use
  seqpanther seqpatcher -s examples/seqpatcher/ab1 -a examples/seqpatcher/assemblies -o Results -t mmf.csv -O sanger.fasta -g 10 -3 True -x del
  It will generate consensus sequences from the Sanger trace files and integrate them into the NGS consensus. The mmf.csv file tells the application whether the Sanger data were paired or single ab1 files, or a single FASTA file, and sanger.fasta contains all the Sanger sequences.
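The codoncounter, cc2ns and nucsubs steps of the walkthrough can be collected into a small driver script. This is a sketch under stated assumptions, not part of SeqPanther: the command lines are copied as written in the walkthrough, while the function names walkthrough_commands and run_step are hypothetical. Steps are run one at a time so that the CSV files can be inspected and edited between them.

```python
import subprocess
from typing import Dict, List

# File names taken from the walkthrough.
REF = "GCF_009858895.2_ASM985889v3_genomic.fna"
GFF = "GCF_009858895.2_ASM985889v3_genomic.gff"


def walkthrough_commands(bam: str = "bam",
                         coor_range: str = "21563-25384") -> Dict[str, List[str]]:
    """Return the argument lists used in the walkthrough, keyed by step name."""
    return {
        "codoncounter": ["seqpanther", "codoncounter", "-bam", bam,
                         "-rid", "NC_045512.2", "-ref", REF, "-gff", GFF,
                         "-coor_range", coor_range],
        "cc2ns": ["seqpanther", "cc2ns", "-s", "sub_output.csv",
                  "-i", "sub_output.csv", "-o", "changes"],
        "nucsubs": ["seqpanther", "nucsubs", "-i", "NC_045512.2",
                    "-r", "NC_045512.2.fasta", "-c", "consensus",
                    "-t", "changes", "-o", "results"],
    }


def run_step(name: str, dry_run: bool = True) -> List[str]:
    """Print one step's command and, unless dry_run is set, execute it."""
    command = walkthrough_commands()[name]
    print(" ".join(command))
    if not dry_run:
        # check=True raises CalledProcessError if seqpanther exits with an error.
        subprocess.run(command, check=True)
    return command
```

Calling run_step("codoncounter", dry_run=False), editing the resulting CSV files, and then calling run_step("cc2ns", dry_run=False) and run_step("nucsubs", dry_run=False) follows the same order as the steps above.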
Recursive use of newly generated consensus sequences might result in an incorrect final sequence due to the integration of indel events.
To report a bug, request support or propose a new feature, please open an issue.
GNU GPL v3.0
If you use this software please cite:
SeqPanther: Sequence manipulation and mutation statistics toolset. James Emmanuel San, Stephanie Van Wyk, Houriiyah Tegally, Simeon Eche, Eduan Wilkinson, Aquillah M. Kanzi, Tulio de Oliveira, Anmol M. Kiran. bioRxiv 2023. doi: 10.1101/2023.01.26.525629
Zenodo: