
Encountering Issues When Processing PacBio Revio 16S Data with the DADA2 Pipeline #2074

Open
emankhalaf opened this issue Jan 9, 2025 · 15 comments

Comments

@emankhalaf

Hi @benjjneb ,

I am currently working on a new batch of 16S data generated using PacBio Revio technology. While I successfully processed the sequences in Qiime2 within a reasonable time, utilizing the DADA2 plugin for the denoising step, I encountered significant challenges when using the DADA2 pipeline in R.

Each step of the pipeline takes an unusually long time, and R has crashed multiple times during the process. After each crash, I resumed the script by uploading the latest output. For example, the denoising step alone took several days to process 56 sequence files, which seems unreasonable. Similarly, the alignment step ran for four days before ultimately causing RStudio to crash.

Given these issues, I am wondering whether there might be an incompatibility between Revio sequences and the algorithms used in the DADA2 pipeline in R. I ran the script several times, both on a physical server and in cloud environments with high computational power, but the problems persisted.

Attached, I’ve included the error plots generated from R. These plots appear unusual compared to those typically generated from PacBio Sequel II technology.

I would greatly appreciate your insights on interpreting these issues and any guidance you can provide to address them.

Thank you for your time and assistance.

error_plot

@benjjneb
Owner

benjjneb commented Jan 9, 2025

I am currently working on a new batch of 16S data generated using PacBio Revio technology. While I successfully processed the sequences in Qiime2 within a reasonable time, utilizing the DADA2 plugin for the denoising step, I encountered significant challenges when using the DADA2 pipeline in R.

Could you provide a bit more information about this? In particular, what is your QIIME2 version? And what are the versions of R and relevant packages in the R environment that is failing to process the same data (e.g. sessionInfo() output)? Also, what is your "pre-processing" workflow on the R side, i.e. what are you doing prior to getting to learnErrors?

@emankhalaf
Author

@benjjneb

I am using qiime2-amplicon-2024.10, R 4.4.2, and dada2 1.32.0.
Code used prior to getting to learnErrors:

fns <- list.files("/home/sequences", pattern="fq", full.names=TRUE)
F27 <- "AGAGTTTGATCMTGGCTCAG"
R1492 <- "TACGGYTACCTTGTTAYGACTT"
rc <- dada2:::rc
theme_set(theme_bw())

#Remove Primers and Filter
nops <- file.path("/home/sequences", "noprimers", basename(fns))
prim <- removePrimers(fns, nops, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)

#filter
filts <- file.path("/home/sequences", "noprimers", "filtered", basename(fns))
track <- filterAndTrim(nops, filts, minQ=3, minLen=1300, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
track

#learn errors
err <- learnErrors(filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)

#Sorry I did not copy other loaded packages since I have been using RStudio for other scripts, so not all loaded packages are relevant to the dada2 script

sessionInfo() #
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Toronto
tzcode source: system (glibc)

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] dada2_1.32.0 Rcpp_1.0.13 gridExtra_2.3 VennDiagram_1.7.3 futile.logger_1.4.3 readxl_1.4.3
[7] dplyr_1.1.4

@cjfields

@emankhalaf see #1892, but note that @benjjneb recently added a new function for dealing with binned quality scores on the 'master' branch. I have found that processing any Revio data (especially from Kinnex kits) requires pretty significant compute resources.

@emankhalaf
Author

@cjfields
Thank you for your reply and your input. From the provided link, it seems I should increase the nbases parameter in the learnErrors step. For example:

errs <- learnErrors(dereps, 
    nbases = 1e9, 
    errorEstimationFunction = PacBioErrfun, 
    BAND_SIZE = 32, 
    multithread = 36, 
    verbose = TRUE)

Alternatively, I might need to use a value as high as 1e10.

I have not been able to find specific updates or recommendations online regarding the DADA2 workflow for processing Revio sequences. The link provided for a previous similar issue is one of the most commonly referenced sources when searching for solutions to my problem. However, I am still unclear about the exact adjustments needed to process Revio sequences properly.

What I understand so far is that the Revio system uses a quality score binning approach, meaning the platform groups quality scores into predefined bins rather than assigning a distinct quality score to each base call, similar to Illumina's NovaSeq, and this affects the error-learning step. However, I am unsure of the precise modifications required to address this issue effectively.
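As an aside (not from the original thread), one quick way to see binning in practice is to tabulate the distinct quality characters in a FASTQ file. Below is a minimal sketch on a synthetic two-read file (the filename and reads are made up for illustration); binned Revio output contains only a handful of distinct characters, while full-resolution HiFi data spans many, up to '~' (Q93).

```shell
# Synthetic two-read FASTQ (hypothetical data, for illustration only).
# In Phred+33 encoding, '~' is Q93, 'I' is Q40, ',' is Q11.
printf '@r1\nACGT\n+\n~~II\n@r2\nTTTT\n+\nII,,\n' > tiny.fastq

# Every 4th line of a FASTQ record is the quality string; list the
# distinct quality characters present, one per line.
awk 'NR % 4 == 0' tiny.fastq | fold -w1 | LC_ALL=C sort -u
```

On binned data the same one-liner run against real files typically returns only a few characters, which is why an error model looking for Q93 cannot find it and falls back to standard fitting.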

Any further guidance or clarification would be greatly appreciated.

@benjjneb
Owner

@emankhalaf We are still evolving our functionality for optimally fitting error models from binned quality score data, and so it is our fault we don't have more specific guidance on that yet (although the thread @cjfields linked above is a great resource).

To me the larger mystery here is this:

While I successfully processed the sequences in Qiime2 within a reasonable time, utilizing the DADA2 plugin for the denoising step, I encountered significant challenges when using the DADA2 pipeline in R.

The QIIME2 plugin is simply running the DADA2 R package, so I'm having trouble understanding these qualitatively different outcomes.

You provided information about your R script and R environment above. Can you say a bit more about your QIIME2 processing? Was there any pre-processing prior to running qiime dada2 denoise-ccs? What was your full command (with parameters) that you called? Was the same compute environment used for the Q2 processing as when you were trying to run DADA2 via R/Rstudio?

@emankhalaf
Author

@benjjneb

Thank you very much for your response. I executed the R scripts on an HPC cloud (200 GB RAM) provided by the Digital Research Alliance of Canada (DRAC). However, the Qiime2 script (outlined below) was run on a physical server with significantly lower computational power (64 GB RAM) compared to the resources utilized on the HPC.

#import
qiime tools import --type 'SampleData[SequencesWithQuality]' \
 --input-path manifest.txt \
 --output-path samples.qza \
 --input-format SingleEndFastqManifestPhred33V2
 
 #summarize the read statistics of each barcode pair by running
 qiime demux summarize --i-data samples.qza \
 --o-visualization samples.demux.summary.qzv

## #Denoising HiFi Reads into ASVs 
 qiime dada2 denoise-ccs --i-demultiplexed-seqs ./samples.qza \
 --o-table dada2-ccs_table.qza \
 --o-representative-sequences dada2-ccs_rep.qza \
 --o-denoising-stats dada2-ccs_stats.qza \
 --p-min-len 1300 --p-max-len 1600 \
 --p-front AGAGTTTGATCMTGGCTCAG --p-adapter TACGGYTACCTTGTTAYGACTT \
 --p-pooling-method pseudo \
 --p-n-threads 8 \
 --verbose

qiime feature-classifier classify-consensus-vsearch \
 --o-classification ./taxonomy.vsearch.qza \
 --o-search-results ./top_hits_blast.qza \
 --i-query dada2-ccs_rep.qza \
 --i-reference-reads ref-seqs.qza \
 --i-reference-taxonomy ref-tax.qza \
 --p-threads 8 \
 --p-maxrejects 100 \
 --p-maxaccepts 10 \
 --p-perc-identity 0.97 \
 --verbose
 
 
 qiime metadata tabulate --m-input-file ./taxonomy.vsearch.qza \
 --o-visualization taxonomy.vsearch.qzv
 
  qiime taxa barplot --i-table dada2-ccs_table.qza \
 --i-taxonomy taxonomy.vsearch.qza \
 --m-metadata-file ./metadata.tsv \
 --o-visualization ./taxa_barplot.qzv

@emankhalaf
Author

@benjjneb I have important updates regarding the Revio sequence files which may help interpret the above issue.

Guided by a bioinformatician from PacBio, I learned that using bamtools to convert BAM files to FASTQ was not the correct approach for Revio sequence files. This method produced FASTQ files with duplicated sequences, as confirmed by comparing line counts between the BAM and FASTQ files, and it also caused errors when submitting these sequence files to the SRA.

To resolve this, I followed these steps:

#Creating a Conda environment and installing required tools
# Create a Conda environment
conda create -n pbtools_env

# Install necessary packages
conda install -c bioconda pbcoretools
conda install -c bioconda pbcore
conda install -c bioconda pbtk  # Most important package, containing the bam2fastq function

# Verify installation
bam2fastq --help

#Converting BAM to FASTQ
# Ensure the .pbi index file is present in the same directory
bam2fastq E1.hifi_reads.bc1104_CS1_F--PBBC08_CS2_R.bam -u -o E1.hifi_reads.bc1104_CS1_F--PBBC08_CS2_R

After conversion, I rechecked the line counts before and after the process, and they were identical. I applied this method to all my sequence files and ran the DADA2 R script on both DRAC and the physical server last night.

- The DRAC job is still running.
- The physical server job crashed at the denoising step, likely due to differences in computational power between the physical server and the cloud.
- The error model plot remains unchanged (duplicated vs. unduplicated sequence files).
- Processing time is significantly reduced when using the correctly converted sequence files instead of the duplicated ones.
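The line-count check described above can be sketched as follows (synthetic data; in practice you would compare the result against `samtools view -c` on the source BAM, which is not run here):

```shell
# A FASTQ record is exactly 4 lines, so read count = line count / 4.
# Hypothetical two-read file standing in for the bam2fastq output.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > converted.fastq
reads=$(( $(wc -l < converted.fastq) / 4 ))
echo "reads: $reads"   # matches the BAM read count if nothing was duplicated
```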

Furthermore, I successfully resubmitted the newly converted sequence files to SRA.

I am looking forward to hearing back from you!

Appreciated!

Eman

@cjfields

cjfields commented Jan 31, 2025

Just so there is record of this: we find PacBio Kinnex samples take considerable resources even when using binned reads, likely due to the ~15x increase in yield for a typical Revio run. Varies considerably based on diversity of samples of course, but we bump cores for critical steps to 24 (like learnErrors, dada, etc), increase memory (one soil run, probably worst case, required 500GB), and increase time to 58-72 hrs per key step (learnErrors, dada, chimera removal, etc). Initially skip pooling and denoise per sample, generate priors, then rerun key steps using those if needed. Binned quality data helps somewhat.

And don't use default-filtered PacBio CCS data, which last I checked was 99% consensus accuracy; it needs to be 99.9%. Check the BAM file: consensus accuracy is included as one of the read tags, rq (https://pacbiofileformats.readthedocs.io/en/11.0/BAM.html#use-of-read-tags-for-per-read-information). You can use numerous tools to filter on this as needed.
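For reference, the rq accuracy tag maps to a Phred-style score as Q = -10 * log10(1 - accuracy), so the 99% default corresponds to Q20 and the 99.9% recommendation to Q30. A quick awk sketch (the accuracy_to_q helper is made up here for illustration, not a tool from the thread):

```shell
# Convert per-read accuracy (the rq BAM tag) to a Phred quality score.
# awk's log() is the natural log, so divide by log(10) to get log10.
accuracy_to_q() {
  awk -v acc="$1" 'BEGIN { printf "%.0f\n", -10 * log(1 - acc) / log(10) }'
}

accuracy_to_q 0.99    # default CCS filter      -> 20
accuracy_to_q 0.999   # recommended HiFi filter -> 30
```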

@emankhalaf
Author

@cjfields thanks so much for your input. Yes, the PacBio bioinformatician recommended filtering BAM files based on quality scores (Q30) before use. I'm currently working on this using SAMtools for general BAM file filtering (please see below). Thanks for confirming!

samtools view -b -q 30 E1.hifi_reads.bcxxx_CS1_F--PBBC08_CS2_R.bam > E1.filtered.bam

But when you say to initially skip pooling, do you mean only at the beginning of the analysis and then apply the pooling parameter later? Pooling has a significant impact on the number of inferred ASVs, especially when working with a novel environmental niche.

I have an additional note: When running DADA2 in R, the error model learning step generated the following message:

err <- learnErrors(filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
102010037 total bases in 71533 reads from 3 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting. 

Here is the error model plot:

Image

I believe this indicates that the BAM files need to be filtered before processing, correct?

One more question: Is this concept specific to sequences generated from Revio, or should it also be applied to sequences from Sequel II? I haven’t encountered any issues with sequences generated from Sequel II.

@cjfields

cjfields commented Jan 31, 2025

@cjfields thanks so much for your input. Yes, the PacBio bioinformatician recommended filtering BAM files based on quality scores (Q30) before use. I'm currently working on this using SAMtools for general BAM file filtering (please see below). Thanks for confirming!

samtools view -b -q 30 E1.hifi_reads.bcxxx_CS1_F--PBBC08_CS2_R.bam > E1.filtered.bam

I don't filter based on the Q score but the rq tag. See this comment for an example filter using this (in the linked case to see how many reads fit the criteria).

But when you say to initially skip pooling, do you mean only at the beginning of the analysis and then apply the pooling parameter later? Pooling has a significant impact on the number of inferred ASVs, especially when working with a novel environmental niche.

@benjjneb has a really nice summary here. You can essentially emulate 'pseudo' pooling by denoising per sample in a first pass, capture ASVs from the pooled data that are present in more than one sample to generate 'pseudo' priors, then use those in a second round of denoising per sample (if you look at the code for dada this is what it is doing when using pool='pseudo').

I have an additional note: When running DADA2 in R, the error model learning step generated the following message:

err <- learnErrors(filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
102010037 total bases in 71533 reads from 3 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting. 

Here is the error model plot:

Image

I believe this indicates that the BAM files need to be filtered before processing, correct?

Nope. You have binned quality score data where the absurdly high Q93 is no longer present, so the warning tells you it's essentially falling back to using standard error fitting (which it turns out may also be an issue). That is why I ended up testing the binned quality models from the NovaSeq ticket linked to previously, which seemed to help.

One more question: Is this concept specific to sequences generated from Revio, or should it also be applied to sequences from Sequel II? I haven’t encountered any issues with sequences generated from Sequel II.

Sequel II/IIe I believe still generates full quality scores. You can also configure the Revio to generate full quality scores (our internal core does this), but the file sizes are much larger and the data processing takes time, even beyond what I mentioned above. The Kinnex ticket has an example of what this looks like.

EDIT: I should add, the above error plots don't look terrible to me. See this comment in that ticket on why I switched

@emankhalaf
Author

emankhalaf commented Feb 7, 2025

@cjfields Hello, I wanted to provide an update on my progress. I began with rq filtering, as you recommended, using the following command:

#filter reads with accuracy (rq) < 0.999:
samtools view -b -e '[rq] >= 0.999' "$bam" > "${sampleID}_filtered.bam"

#generate the .pbi index file:
pbindex "${sampleID}_filtered.bam"

#convert the BAM file into a FASTQ file:
bam2fastq "${sampleID}_filtered.bam" -u -o "$sampleID"

Then, I processed the filtered FASTQ files through the DADA2 pipeline. The analysis was significantly faster than before, with only a minimal reduction in taxa count. This suggests that the previous delays were primarily due to low-quality reads, which may have introduced noise into the algorithm. I did the above with sequence files generated from both the Revio and Sequel II systems.

Thank you!

@cjfields

cjfields commented Feb 7, 2025

@cjfields Hello, I wanted to provide an update on my progress. I began with rq filtering, as you recommended, using the following command:
...
Then, I processed the filtered FASTQ files through the DADA2 pipeline. The analysis was significantly faster than before, with only a minimal reduction in taxa count. This suggests that the previous delays were primarily due to low-quality reads, which may have introduced noise and confusion into the algorithm. #I did the above with sequence files generated from both Revio and Sequel II systems.

Thank you!

No problem @emankhalaf. I suspect many of the PacBio-related issues reported here could be related to including lower quality reads.

Just a question, did you use the standard PacBio error model or the binned model for learnErrors? It works in both cases, but we did see the PacBio error model internally fall back to using the standard (Illumina) model due to missing Q93 data. Worth checking the error model plots.

@emankhalaf
Author

emankhalaf commented Feb 7, 2025

@cjfields I used the standard PacBio error model since I do not have the code for the binned model based on DADA2 pipeline tutorials. I would greatly appreciate any guidance on this.

Below is the plot of the error model generated from the following code:
err <- learnErrors(filts, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
which generated the same warning message:

125095200 total bases in 87924 reads from 4 samples will be used for learning the error rates.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.
The max qual score of 93 was not detected. Using standard error fitting.

Image

This warning is probably due to the filtering step. Also, DADA2 was originally optimized for Illumina data, where quality scores max out around Q40, whereas PacBio HiFi reads use a much higher scale (max Q93). Furthermore, this warning does not indicate a failure: DADA2 still proceeds with error modeling, just using the standard fitting approach. We look forward to any updates on this, as I don't see any alternative solutions for handling such high-quality 16S sequences.

Appreciated!

@cjfields

cjfields commented Feb 7, 2025

@emankhalaf that appears to be identical to the error plot you had posted in the comment earlier which (IIRC) was using the original PacBio data with default accuracy (99%, Q20). What would be good to see is a plot using your newly filtered data (99.9% or Q30). Note the error frequency scaling on both is a couple of orders of magnitude different than the scaling from the ticket I linked to where the reads were filtered to 99.9% by our sequencing core.

The 'dip' in my linked graph on the right side of the plots was my concern for those data. I had seen a similar problem with binned Illumina data for many years now, so I ended up trying one of the Illumina binned models from here, specifically option 4. That did seem to address the fit and looks better, though again it would be best to evaluate a truth set to see if it made much difference; we also need to test @benjjneb's new binned model he just recently added. If you use the linked Illumina binned model, you will need dplyr: library(dplyr).

@emankhalaf
Author

Hello @cjfields, yes, surprisingly, the error model generated from the filtered sequence files is identical to the one generated from the unfiltered sequences. Could this be because, in both cases, the quality scores are still binned? I checked issue #1307, but it pertains to Illumina sequences, where the maximum Q score is around 40. Could you please provide the link to @benjjneb's new binned quality model? Also, is this model specifically designed for processing PacBio HiFi reads?
I am currently re-running the Sequel II sequences after filtering to check whether the error model plot changes. I think this should help determine whether binned quality scores are the main issue.

Appreciated!
