
Problem with long vectors using Pacbio long reads #2066

Open
BigFatPandar opened this issue Dec 17, 2024 · 4 comments

@BigFatPandar

Hi dada2 community,

Background:
I have been using dada2 for quite a while to analyze Illumina 16S short-read sequencing data. However, we have now tried PacBio long-read sequencing (Kinnex). The sequencing center sent us already-filtered sequences (in their words: "CCS step was done automatically on the machine (not manually in SMRTLink). However, the reads are filtered in a later step, so only reads --min-rq≥0.99 are in the hifi files that you received."). I have not worked with pre-filtered data before, but I thought I'd just send it through dada2 using the previously outlined PacBio pipeline as a base. Each sample is in one fastq file, not split between forward and reverse reads. All primers/adapters were removed before the data was sent to us.

The problem I run into when running the code below is "Error in asMethod(object) : long vectors not supported yet: memory.c:3948" during the dereplication step. I have tried playing around with the different variables and even using only one sample, but I still get the same error message. I use the newest version of R and have run this script on an HPC with plenty of memory and CPUs. The wild thing is that it did work once: I got an output containing about 25% of the reads, which I used for some downstream steps without problems.

So my questions if I may:

  1. Is this dada2 pipeline appropriate for the already filtered data that was provided to us?
  2. If not, how would I go about clustering the data into ASVs?
  3. How could I avoid or solve the problem of long vectors mentioned above?
  4. There might be some shortened sequences that we are interested in. How could we extract these shortened sequences without filtering them out?

Any input would be greatly appreciated and thank you for all your help!

Kind regards,
Johann

```r
# Print R version being used
cat("R version: ", R.Version()$version.string, "\n")

# Get command line arguments (FASTQ file path)
args <- commandArgs(trailingOnly=TRUE)
fastq_file <- args[1] # The file passed in by the job array script
cat("Processing file: ", fastq_file, "\n")

# Load the dada2 library
library(dada2); packageVersion("dada2")

# Define the path (adjust if needed)
path <- "L:/Users/deBeer_Johann/datav1/testr"

# List all the fastq files in the directory (single-end reads)
fnFs <- sort(list.files(path, pattern=".fastq", full.names=TRUE))

# Extract sample names from the filenames (assuming SAMPLENAME_xxx.fastq format)
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

# Set up file paths for filtered data
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
print(filtFs)

# Filtering step (adjust length limits based on actual read lengths)
out <- filterAndTrim(fnFs, filtFs,
                     maxN=0, minQ=3, maxEE=20, rm.phix=FALSE,
                     minLen=1400, maxLen=1700, compress=TRUE)
head(out)

# Dereplicate the filtered sequences
derepFs <- derepFastq(filtFs, verbose=TRUE)

# Name dereplicated objects by sample names
names(derepFs) <- sample.names

# Learn the error rates for the single-end reads
errF <- learnErrors(filtFs, multithread=TRUE)
saveRDS(errF, file.path(path, "errPbio.rds"))
plotErrors(errF)

# Apply the DADA2 algorithm to infer ASVs
dadaFs <- dada(derepFs, err=errF, OMEGA_A=1e-10, DETECT_SINGLETONS=TRUE,
               BAND_SIZE=32, multithread=TRUE)
saveRDS(dadaFs, file.path(path, "16sPbio.rds"))

# Track reads through the pipeline
summary_table <- cbind(
  ccs      = out[,1], # Number of input reads (before filtering)
  filtered = out[,2], # Number of reads after filtering
  denoised = sapply(dadaFs, function(x) sum(x$denoised)) # Number of denoised reads
)
print(summary_table)
write.csv(summary_table, file="summary_table.csv")

# Make a sequence table
st2 <- makeSequenceTable(dadaFs)
saveRDS(st2, "L:/Users//datav1/testr/seqtblv1.rds")
dim(st2)

# Assign taxonomy (slowest part)
# taxa <- readRDS("path/to/taxa.rds") # (for restarting here after you have saved with saveRDS)
tax2 <- assignTaxonomy(st2, "L:/Users//datav1/testr/silva_nr99_v138.1_wSpecies_train_set.fa.gz",
                       minBoot=50, multithread=TRUE)
head(unname(tax2))
saveRDS(tax2, "L:/Users//datav1/testr/taxav1.rds")
```

TLDR Problem: When running dada2 (on my local PC or on an HPC cluster) I get "Error in asMethod(object) : long vectors not supported yet: memory.c:3948". If anyone has any advice on how to deal with this, I would greatly appreciate it.
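
One way to keep memory down at the dereplication step, sketched below under the assumption that the `filtFs`, `sample.names`, and `errF` objects exist as in the script above, is the sample-by-sample pattern from the dada2 "big data" workflow; this is a generic sketch, not a fix proposed in this thread:

```r
# Sketch: dereplicate and denoise one sample at a time, so the full derep
# object never has to fit in memory at once (assumes filtFs, sample.names,
# and errF from the script above).
dds <- vector("list", length(sample.names))
names(dds) <- sample.names
for (sam in sample.names) {
  cat("Processing:", sam, "\n")
  derep <- derepFastq(filtFs[sample.names == sam])
  dds[[sam]] <- dada(derep, err=errF, BAND_SIZE=32, multithread=TRUE)
}
st <- makeSequenceTable(dds)
```

dada() (and learnErrors()) also accept fastq file paths directly, so the explicit derepFastq() call can be skipped entirely if memory is still tight.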

@cjfields

@BigFatPandar see #1892, but also note there is a newer way of treating binned-quality reads that @benjjneb added recently

@BigFatPandar
Author

Hi @cjfields

Thanks so much for linking the post that discusses how to treat binned quality scores! I used the error model (option 4) outlined in that post, and the error plots now look as they should.

[Image: error-rate plots]
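
For context, a minimal sketch of how a modified error-estimation function (such as the "option 4" variant from the linked thread) gets plugged in; `binnedQualErrfun` is a hypothetical name standing in for whichever custom function is used, while `errorEstimationFunction` is an actual `learnErrors()` argument:

```r
# Sketch: pass a custom error model for binned quality scores to learnErrors().
# binnedQualErrfun is a placeholder for the chosen custom function (e.g. the
# "option 4" variant from the linked thread).
errF <- learnErrors(filtFs,
                    errorEstimationFunction = binnedQualErrfun,
                    BAND_SIZE = 32,
                    multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)
```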

However, I still lose a lot of my Kinnex reads in the filterAndTrim step. I have played around with the parameters (i.e. maxN=0, minQ=3, maxEE=20, rm.phix=FALSE, minLen=1400, maxLen=1700; I have even tried removing the length requirements), but I still lose about 70% of the original reads. According to the sequencing center, only reads with --min-rq≥0.99 are in the hifi files we received, so we should not be seeing such a huge drop-off in reads? Your insights on this would be highly appreciated!

[Image]
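
As a generic diagnostic (a sketch, assuming the `fnFs` vector from the script above), the length and expected-error distributions of one raw sample show whether the losses come from the length window or from the maxEE cutoff:

```r
# Sketch: check length and expected-error (EE) distributions for one raw sample.
library(ShortRead)

fq   <- readFastq(fnFs[1])          # one raw (unfiltered) sample
lens <- width(fq)
summary(lens)                       # how many reads sit outside 1400-1700 bp?

qmat <- as(quality(fq), "matrix")   # Phred scores, NA-padded for variable lengths
ee   <- rowSums(10^(-qmat / 10), na.rm = TRUE)
summary(ee)                         # how many reads exceed maxEE = 20?
```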

Kind regards,
Johann

@benjjneb
Owner

My understanding is that it is common for a smaller number of reads using the Kinnex approach to pass filtering, however you do filtering. See a bit more here: #2013 (comment)

Also see this comment on the value (or lack thereof) of trying to force reads through a workflow by lowering quality standards: #2013 (comment)

@cjfields

cjfields commented Feb 21, 2025

> My understanding is that it is common for a smaller number of reads using the Kinnex approach to pass filtering, however you do filtering. See a bit more here: #2013 (comment)

Yes, we're a bit spoiled here in that I coordinate with our sequencing core to produce reads filtered to 99.9% (Q30) for these studies. Most sequencing cores with a PacBio (either Sequel IIe or Revio) produce Q20 (99%) by default.
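
For reference, the Phred cutoffs map onto the --min-rq accuracies mentioned above roughly as follows (a quick sketch, not from the comment):

```r
# Phred Q to predicted accuracy: Q20 ~ 0.99 (the --min-rq >= 0.99 filter), Q30 ~ 0.999
q_to_accuracy <- function(q) 1 - 10^(-q / 10)
q_to_accuracy(c(20, 30))
#> [1] 0.990 0.999
```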

> Also see this comment on the value (or lack thereof) of trying to force reads through a workflow by lowering quality standards: #2013 (comment)

Agree 100%. FWIW, when we've seen big drops in sequence retention there have been good explanations behind them; as long as we're dealing with Q30 data, full-length 16S has been very robust, Kinnex or standard.

I also noticed the same interesting pattern with Revio Kinnex data that is in the above error plots (A2G subset here):

[Image: A2G error-rate plot]

Not sure what to make of it or whether it's worth worrying about. I believe it's Kinnex-specific; I don't recall standard 16S on Revio showing this issue, but I would need to check. I suspect it may have something to do with the number of passes or the location of the fragment in the concatemer, etc.
