
Inconsistent ASV inference when running samples together vs. individually with default (pool=FALSE, selfConsist=FALSE) settings. #2085

saharbagheri opened this issue Feb 20, 2025 · 8 comments


Hi,

I have a large NextSeq dataset of approximately 400 million reads across 209 samples. To process these data with DADA2, I developed a Snakemake pipeline. The standard steps of dereplication, sample inference, and merging paired reads (following the DADA2 v1.8 tutorial or a simple for-loop approach) ran very slowly on such a large dataset.

To speed this up, I modified the pipeline so that each sample undergoes dereplication, sample inference, and merging in a separate Snakemake job submitted to the cluster. I tested this approach on a smaller subset of the data, comparing it to the original single-job-run (as per the DADA2 v1.8 tutorial), using DADA2 v1.34.0 in both cases.
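
For reference, here is a stripped-down sketch of the two setups I am comparing (paths are placeholders, filtering is omitted, and in the real pipeline the per-sample block runs as its own Snakemake job):

```r
library(dada2)

filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))
filtRs <- sort(list.files("filtered", pattern = "_R2_filt.fastq.gz", full.names = TRUE))

# Error models shared by both setups
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# Setup A: tutorial-style, all samples in one job (still per-sample inference, pool = FALSE)
drpFs <- derepFastq(filtFs)
drpRs <- derepFastq(filtRs)
ddFs <- dada(drpFs, err = errF, multithread = TRUE)
ddRs <- dada(drpRs, err = errR, multithread = TRUE)
mergers <- mergePairs(ddFs, drpFs, ddRs, drpRs)

# Setup B: one sample per cluster job (index i would come from the Snakemake wildcards)
i <- 1
drpF <- derepFastq(filtFs[[i]])
drpR <- derepFastq(filtRs[[i]])
ddF <- dada(drpF, err = errF, multithread = TRUE)
ddR <- dada(drpR, err = errR, multithread = TRUE)
merger <- mergePairs(ddF, drpF, ddR, drpR)
```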

Although my approach completed much faster, I observed fewer inferred sequence variants when each sample was processed entirely independently. Given the default parameters (pool=FALSE, selfConsist=FALSE), I expected both methods to give similar results, but they didn't.

I even added set.seed(100) before sample inference to make sure random seeding was not contributing to the differences, but the results still differ. Could you please help me figure out why I see these differences? Thanks!!!

Original way (all samples in one job):
Sample 1:
starting number of reads: 480535
Dereplicated: 12725
Inferred: 202

Sample 2:
starting number of reads: 575464
Dereplicated: 24298
Inferred: 148

Sample 3:
starting number of reads: 318852
Dereplicated: 85071
Inferred: 2774

Running samples independently (one job per sample):
Sample 1:
starting number of reads: 480535
Dereplicated: 12725
Inferred: 179

Sample 2:
starting number of reads: 575464
Dereplicated: 24298
Inferred: 143

Sample 3:
starting number of reads: 318852
Dereplicated: 85071
Inferred: 2099

@saharbagheri
Author

saharbagheri commented Feb 20, 2025

Hi,

I discovered that the discrepancy mentioned above arose because I had used different subsampling percentages across projects, so the learned error-rate objects differed between the two runs in the example above. Interestingly, when I subsample a smaller percentage of reads from each sample (so that all samples can be included in the error-learning step), I end up losing more data during the denoising step.

I have done two tests so far and am planning a third with 0.2 of the reads subsampled from each sample. I should mention that I'm running the learnErrors function with all default parameters.
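
For concreteness, this is roughly how I am doing the subsampling before learnErrors (the subsample_fastq helper and the paths are placeholders I wrote for this example, using ShortRead; learnErrors itself is left at its defaults):

```r
library(dada2)
library(ShortRead)

# Placeholder helper: write a random fraction of the reads in one fastq to a new file
subsample_fastq <- function(fin, fout, fraction, seed = 100) {
  reads <- readFastq(fin)                        # simple but loads the whole file
  set.seed(seed)
  keep <- sample(length(reads), ceiling(fraction * length(reads)))
  writeFastq(reads[keep], fout, compress = TRUE)
}

filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))
dir.create("subsampled", showWarnings = FALSE)
subFs <- file.path("subsampled", basename(filtFs))
invisible(mapply(subsample_fastq, filtFs, subFs, MoreArgs = list(fraction = 0.002)))

# learnErrors with all defaults (nbases = 1e8)
errF <- learnErrors(subFs, multithread = TRUE)
```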

Example from another NextSeq dataset with ~100M total reads across 161 samples:

- Subsampling 0.002 of the reads from each sample (error rates learned from all 161 samples in the run):
48388926 total bases in 175658 reads from 161 samples will be used for learning the error rates.
43200473 total bases in 175658 reads from 161 samples will be used for learning the error rates.

Total reads in the run lost in the denoising step: 47372873
Clean non-chimeric reads remaining at the end: 1747531

- Subsampling 0.02 of the reads from each sample (error rates learned from ~20% of the samples):
100833940 total bases in 366776 reads from 30 samples will be used for learning the error rates.
102946776 total bases in 418936 reads from 33 samples will be used for learning the error rates.

Total reads in the run lost in the denoising step: 41140009
Clean non-chimeric reads remaining at the end: 2729601

My questions are:

1. Do you have any experience or insights on learning error rates from all samples versus a smaller subset? Is it better to lose more data by learning from all samples, or is it riskier to learn from fewer samples and potentially base the error model on unrepresentative data? I think this depends on how many reads are used from each sample. Any advice would be greatly appreciated.

2. Do you have any recommendation on what percentage of samples to include for representativeness (considering all samples come from the same environment), so that we can determine an appropriate subsampling percentage before each run? For example, is there a threshold we can apply, say at least 100 million reads from a certain number or percentage of samples, to ensure that the learned error rates are reliable?

3. Does selecting a fixed number of reads from each sample provide better representativeness than using a percentage of the reads in each sample for learning error rates?

I understand that the ideal approach is to experiment with different proportions of reads from various sample subsets to optimize results for each dataset. However, I'm curious whether there is a rough threshold or guideline that would let me standardize my pipeline so it works well in most cases, without having to tweak it every time, especially for large datasets that take more time and resources to complete.

Thanks!!

@benjjneb
Owner

Fewer samples, but keeping more of the reads in each sample (i.e. deeper samples), is preferable for training the error model.

Less diverse samples are also preferable for training (it is easier to jointly infer composition and fit the error model in simpler samples).
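
For illustration only (the file paths and the choice of four samples are placeholders), that could look like handing learnErrors just the deepest few samples:

```r
library(dada2)
library(ShortRead)

filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))

# Count reads per file (simple but loads each file in turn)
depths <- sapply(filtFs, function(f) length(readFastq(f)))

# Train the error model on the few deepest samples rather than a thin slice of every sample
deepest <- names(sort(depths, decreasing = TRUE))[1:4]
errF <- learnErrors(deepest, multithread = TRUE)
```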

@saharbagheri
Author

saharbagheri commented Feb 21, 2025

Thanks for clarifying! It sounds like using most (or all) of the reads from a smaller number of samples can be more effective than simply increasing the total number of samples. However, in larger datasets where a single sample might already provide enough reads to reach 100 million bases, what's the best approach? Is it sufficient to use just one deeply sequenced sample, or is there a more optimal strategy? For example, I could subsample 20-40% of the reads from each sample so that the error rates are learned from more than one sample while still reaching at least 100M bases (how far beyond that depends on the last sample used). Or should I increase nbases to make sure more than one sample is used for learning the error rates?

@benjjneb
Owner

For example, I could subsample 20-40% of the reads from each sample so that the error rates are learned from more than one sample while still reaching at least 100M bases (how far beyond that depends on the last sample used). Or should I increase nbases to make sure more than one sample is used for learning the error rates?

Either is appropriate.

In an ideal situation, it doesn't matter which sample you learn the error rates from. The reads on the sequencer are effectively randomly assigned to different samples, and thus an error model learned from one sample is good for all of them. That said, cases can arise where something goes wrong at the PCR or library-prep stage in a particular sample, so it is not unreasonable to want more than one sample contributing to the learnErrors step. But in that case, aim for a few samples, not some fraction of the total samples in the dataset, because sequencing depth is also helpful for learning the error model.
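
For instance (illustrative only, with placeholder paths), the nbases option is just:

```r
library(dada2)
filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))

# Raise nbases above the 1e8 default so that more than one sample contributes,
# even when the first sample alone would reach 100M bases
errF <- learnErrors(filtFs, nbases = 2e8, multithread = TRUE)
```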

@saharbagheri
Author

Great! Thank you so much for getting back to me so quickly. I really appreciate it.

@saharbagheri
Author

saharbagheri commented Feb 24, 2025

My final question is whether there is an effective solution for learning the error model and subsequently inferring microbial composition when we have samples from different body sites of the same organism, or from the same body site of different organisms, all processed in a single run. Are samples loaded into memory in alphabetical or numerical order? For example, if gut samples are listed first and vaginal samples follow, could errors be predominantly learned from the gut microbial samples? Should we ensure that error rates are learned from both sample types, or is this not a major concern?
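
For context, I currently pass learnErrors a sorted file vector, so my understanding is that the alphabetically first samples would dominate the training set. The workarounds I can think of (assuming the randomize argument does what I expect; paths are placeholders) are:

```r
library(dada2)
filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))

# Shuffle the file order myself before learning the error rates...
set.seed(100)
errF_shuffled <- learnErrors(sample(filtFs), multithread = TRUE)

# ...or ask learnErrors to pick samples at random instead of in the given order
errF_random <- learnErrors(filtFs, randomize = TRUE, multithread = TRUE)
```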

I recall reading in a forum that samples from vastly different environments (e.g., ocean vs. human gut microbiome) should not be run together, and I'm curious how this principle applies to samples such as human gut vs. vaginal microbiome, or human vs. mouse gut microbiome. I'm wondering to what extent your point that "it doesn't matter what sample you learn the error rates from" holds across such different sample types.

Thanks!!!

@benjjneb
Owner

In principle, what DADA2 is doing is learning the error rates introduced by PCR and sequencing. So -- if the same protocol was used on each body site, then it is OK to learn error rates from one and use the model on another. The other principle, that it is better (all else equal) to use less diverse samples, might point one way or the other, but overall validity is not an issue.
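
As an illustrative sketch (the gut/vaginal file groupings and paths here are placeholders), that would amount to:

```r
library(dada2)

gutFs <- sort(list.files("filtered/gut", pattern = "_R1_filt.fastq.gz", full.names = TRUE))
vagFs <- sort(list.files("filtered/vaginal", pattern = "_R1_filt.fastq.gz", full.names = TRUE))

# Error rates reflect the PCR + sequencing protocol, so a model learned on one
# body site can be used to denoise samples from the other (same protocol, same run)
errF <- learnErrors(gutFs, multithread = TRUE)
dd_gut <- dada(gutFs, err = errF, multithread = TRUE)
dd_vag <- dada(vagFs, err = errF, multithread = TRUE)
```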

This is one of the reasons I find the ASV approach in DADA2 (and other tools) clarifying. It is trying to solve one problem -- fixing errors -- not trying to do two things at once -- fixing errors and clustering biological sequences into some analogy of named taxonomic levels, which is what motivated classic OTU methods.

@saharbagheri
Author

Thanks!!
