Skip to content
/ raspir Public

Raspir, the rare species identifier

License

Notifications You must be signed in to change notification settings

mmpust/raspir

Repository files navigation

Alt text

Background

In shotgun metagenomic sequencing experiments, the total DNA is extracted from complex samples. The DNA of rare species is scarce, and poor coverage of these genomes is often observed. Scientists commonly define thresholds to exclude 1-10% of the least abundant species in a given sample from further analyses. On the one hand, this filtering step allows for robust investigations of core communities. On the other hand, valuable information on the community structure is lost. The rare biosphere harbours more species than the core microbial community and hence provides the environment of interest with high functional flexibility.

However, if only a few short DNA reads are detected that are unique to species A, there are at least three explanations:
a) Sample contamination;
b) Rare species A was present in the environment of interest, so it is a true positive species. In this case, the reads are expected to spread across the entire reference genome in a fairly uniform manner due to random DNA sequencing; or
c) Rare species A was absent (false positive) but rare species B was present, which acquired genes of species A during past events. In this case, the reads are expected to cluster at specific sides of the reference genome of species A.

The raspir tool calculates a position-domain signal based on the distances of reads aligning to a circular reference genome and converts the information into a frequency signal through Discrete Fourier Transforms (DFT). In addition, a reference frequency signal is constructed with the same number of reads, but with an ideal uniform distribution of reads across the reference genome. Both frequency signals are compared using Pearson's correlation measures.

Implementation

Raspir is implemented in Python 3.7. Using this tool, a real-world dataset (5 GB) containing information on hundreds of species can be processed on a single node server. The input data must be structured in the following manner: genome length of the corresponding reference genome, organism, read position, read depth.

See the following section for further information on how to convert .FASTQ files into the .CSV input files to sucessfully execute raspir.

Get started

Set up the environment

Install conda packages

conda create --name raspir_env
conda activate raspir_env

# As combined commands: 
conda install -c bioconda trimmomatic samtools bwa 

# Install trimmomatic [1] 
conda install -c bioconda trimmomatic
# Install samtools [2] 
conda install -c bioconda samtools

# Install your alignment tool of choice
# Burrows-Wheeler aligner [3]
conda install -c bioconda bwa
# Bowtie2 [4]
conda install -c bioconda/label/cf201901 bowtie2

# Install python packages for raspir
conda install pandas
conda install -c conda-forge statsmodels matplotlib -y

Create working directory

mkdir raspir/
cd raspir/

YOURPATH=$PWD
echo "$YOURPATH"

mkdir reference_database/
mkdir run_raspir/

Set-up the reference database

The reference database has to be downloaded and indexed only once. The database should contain only complete, circular genomes with one strain per species. Note: You may use a customised reference database. It is however strongly recommended to avoid draft or low-quality reference genomes and use complete sequences of circular microorganisms only.

# Load database into your working directory
cd reference_database/
# Get high quality (default) database from Wochenende tool:
https://drive.google.com/drive/folders/1q1btJCxtU15XXqfA-iCyNwgKgQq0SrG4
# Source: https://github.com/MHH-RCUG/nf_wochenende.git

# Unzip the reference database 
gunzip 2021_12_human_bact_arch_fungi_vir.fa.gz

# Generate an index of the reference fasta depending on the alignment tool of your choice
samtools faidx 2021_12_human_bact_arch_fungi_vir.fa.gz
bwa index 2021_12_human_bact_arch_fungi_vir.fa.gz
bowtie2-build 2021_12_human_bact_arch_fungi_vir.fa.gz 2021_12_human_bact_arch_fungi_vir
cd ..

Data cleaning, alignment & sorting

Trimmomatic

Load your .FASTQ files into the working directory and run Trimmomatic for quality trimming and adapter clipping.

cd run_raspir/

# Paired-end data
trimmomatic PE \
  R1.fastq R2.fastq \
  R1.trim.fastq R1.un.trim.fastq \
  R2.trim.fastq R2.un.trim.fastq \
  SLIDINGWINDOW:4:20 MINLEN:25

# Single-end reads
trimmomatic SE \
  R.fastq \
  R.trim.fastq \
  SLIDINGWINDOW:4:20 MINLEN:25

Alignment

# Burrows-Wheeler-Aligner (bwa)
# see http://bio-bwa.sourceforge.net/
# For paired-end reads
bwa mem $YOURPATH/reference_database/2021_12_human_bact_arch_fungi_vir.fa.gz \
  R1.trim.fastq R2.trim.fastq > R.trim.bwa.sam
# For single-end reads
bwa mem $YOURPATH/reference_database/2021_12_human_bact_arch_fungi_vir.fa.gz \
  R.trim.fastq > R.trim.bwa.sam

# Bowtie2
# see http://bowtie-bio.sourceforge.net/
# For paired-end reads
bowtie2 -x $YOURPATH/reference_database/complete_bacterialRefSeqs_201910_3 \
  -1 R1.trim.fastq -2 R2.trim.fastq -S R.trim.bowtie2.sam
# For single-end reads
bowtie2 -x $YOURPATH/reference_database/complete_bacterialRefSeqs_201910_3 \
  -U R.trim.fastq -S R.trim.bowtie2.sam

Sorting, indexing & final clean-up

This process can be performed on SAMs or BAMs with the included script run_SLURM_file_prep.sh

# Activate env on cluster node
conda activate raspir_env >> /dev/null

cpus=8
human_chr_pattern="1_1_1"

# Only run SAM section if SAM files exist in the current directory
count=`ls -1 *.sam 2>/dev/null | wc -l`
if [ $count != 0 ]
    then
    # run SAM conversion

	# For pipelines which start from SAM. Many pipelines will start from BAM.
	for items in *.sam
		do
			echo $items
			fname=$(echo ${items} | sed 's/.sam//')
			echo $fname


			# Remove reads with low mapping quality
			samtools view -hM -q 20 $items > ${items%.sam}.mq20.sam

			# Convert file from SAM to BAM format
			samtools view -h -b -S ${items%.sam}.mq20.sam  > ${fname}.bam

			# Discard unmapped sequences 
			samtools view -b -F 4 $items > ${fname}_1.bam

			# Sort bam file
			samtools sort -@ $cpus ${fname}_1.bam -o ${fname}.sorted.bam
		done
fi

# Start from sorted mq20 BAM with no unmapped sequences  
for items in *.bam
	do
		echo $items
		fname=$(echo ${items} | sed 's/.bam//')
		echo $fname

   		# Obtain coverage information
   		samtools depth ${fname}.bam | grep -v $human_chr_pattern  > ${fname}.raspir1.csv

   		# Add genome size, pipe in a BAM header only
   		samtools view -H ${items} | sed 's/LN://g' > ${fname}.genomeSize_1.csv
   		sed -i 's/SN://g' ${fname}.genomeSize_1.csv
   		cut -f2- ${fname}.genomeSize_1.csv > ${fname}.genomeSize.csv

   		# Add column with genome size
		awk -v FS="\t" -v OFS="\t" 'FNR==NR{a[$1]=$2;next;} {if(a[$1]) {print a[$1], $0} else {print "NA",$0}}' \
		${fname}.genomeSize.csv ${fname}.raspir1.csv > ${fname}.raspir.csv

		# Convert into a comma-separated file
		sed -i 's/\t/,/g' ${fname}.raspir.csv
		# Add header
		sed -i '1iGenomeLength,Organism,Position,Depth\' ${fname}.raspir.csv

		# Remove intermediate files
		rm ${fname}.raspir1.csv ${fname}.genomeSize_1.csv ${fname}.genomeSize.csv
	done

Run raspir

Download the python script into the run_raspir/ folder.

python raspir.py input.csv output_prefix 

Output

A table is generated (.CSV format). The assignment output has 6 columns.

Species r_value p_value stError euclidean distribution
Pseudomonas aeruginosa 0.99 0.0 0.00019 0.01 uniform
Streptococcus salivarius 0.97 0.0 0.00016 0.002 uniform
Rothia mucilaginosa 0.99 0.0 0.000002 0.0001 uniform

The output file summarises the statistics of comparing an ideal reference distribution of reads across the genome with the real biological situation. The columns are species name, the Pearson's correlation coefficient r, the correlation p-value, the standard error, and the euclidean distance between reference and biological spectral signals.

Read patterns are classified as uniform if the reference and biological signals exhibit strong Pearson’s correlations (Correlation coefficient > 0.6, p-value < 0.05, standard error of estimates < 0.01) and low Euclidean dissimilarity indices (EDI < 0.5). This is also explained in the paper. In the current version, only species are listed with uniform distribution of reads compared to their reference signal.

Contributors

@mmpust author @colindaven updates @nick-youngblut updates

References

[1] Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
[2] Li H., Handsaker B., Wysoker A. et al. (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.
[3] Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60.
[4] Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

Cite the tool

Pust, MM., Tümmler, B. Identification of core and rare species in metagenome samples based on shotgun metagenomic sequencing, Fourier transforms and spectral comparisons. ISME COMMUN. 1, 2 (2021). https://doi.org/10.1038/s43705-021-00010-6