Inconsistent ASV inference when running samples together vs. individually with default (pool=FALSE, selfConsist=FALSE) settings. #2085
Hi, I discovered that the discrepancy mentioned above arose because I had used different subsampling percentages across projects, so the learned-error object differed in the example above. Interestingly, when I subsample a smaller percentage of reads from each sample (so that all samples can be included in the error-learning step), I end up losing more data during the denoising step. I have run two tests so far and plan a third with 0.2 of the reads subsampled from each sample. I should mention that I am running learnErrors with all default parameters.

Example from another NextSeq dataset with ~100M total reads across 161 samples:
- subsampling 0.002 of reads from each sample (learning from all samples in the run): total reads lost in the denoising step: 47,372,873
- subsampling 0.02 of reads from each sample (learning from ~20% of the samples): total reads lost in the denoising step: 41,140,009

My questions are:
- Do you have a recommendation for what percentage of samples to include for representativeness (given that all come from the same environment), so the appropriate subsampling percentage can be chosen before each run? For example, is there a threshold, say at least 100 million bases from a certain number or percentage of samples, that would ensure the learned error rates are reliable?
- Does taking a fixed number of reads from each sample give better representativeness than taking a fixed percentage of each sample's reads when learning error rates?

I understand that the ideal approach is to experiment with different proportions of reads from various sample subsets to optimize results for each dataset. However, I'm curious whether there is a rough threshold or guideline that would let me standardize my pipeline so it works well in most cases, without tweaking it every time, especially for large datasets that take more time and resources to process. Thanks!!
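For reference, a minimal sketch of the per-sample subsampling step described above, using ShortRead; the file paths and the 0.02 fraction are illustrative assumptions, not a recommendation:

```r
library(ShortRead)

# Subsample a fixed fraction of the reads in one FASTQ file
subsample_fastq <- function(fq_in, fq_out, fraction) {
  n_total <- countFastq(fq_in)$records                    # total reads in this file
  sampler <- FastqSampler(fq_in, n = ceiling(fraction * n_total))
  writeFastq(yield(sampler), fq_out, compress = TRUE)     # draw and write the sample
  close(sampler)
}

# Hypothetical file names
subsample_fastq("sample1_F_filt.fastq.gz",
                "sub/sample1_F_filt.fastq.gz", fraction = 0.02)
```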
Fewer samples, but keeping more of the reads in each sample (deeper samples), is preferable for training the error model. Less diverse samples are also preferable for training, since it is easier to jointly infer composition and fit the error model in simpler samples.
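A minimal sketch of that suggestion: feed learnErrors only a handful of the deepest samples at full depth, rather than thin slices of every sample. The paths and the choice of five samples are assumptions, not a fixed rule:

```r
library(dada2)
library(ShortRead)

filtFs <- sort(list.files("filtered", pattern = "_F_filt.fastq.gz",
                          full.names = TRUE))             # assumed file layout
depths <- sapply(filtFs, function(f) countFastq(f)$records)
train  <- filtFs[order(depths, decreasing = TRUE)][1:5]   # the 5 deepest samples

errF <- learnErrors(train, multithread = TRUE)            # default nbases = 1e8
```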
Thanks for clarifying! It sounds like using most (or all) of the reads from a smaller number of samples can be more effective than simply increasing the number of samples. However, in larger datasets where a single sample might already provide enough reads to reach 100 million bases, what is the best approach? Is it sufficient to use just one deeply sequenced sample, or is there a more optimal strategy? For example, I could subsample 20-40% of the reads in each sample so that the errors are learned from more than one sample while still reaching at least 100M bases. Or should I increase nbases to make sure more than one sample is used for learning the error rates?
Either is appropriate. In an ideal situation, it doesn't matter which sample you learn the error rates from. The reads on the sequencer are effectively randomly assigned to the different samples, so learning the error model from one is as good as learning it from all. That said, cases can arise where something goes wrong at the PCR or library-prep stage in a particular sample, so it is not unreasonable to want more than one sample contributing to the learned error model.
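If the goal is to guarantee that more than one sample contributes, one simple option (a sketch, with assumed file paths) is to raise nbases above a single sample's yield:

```r
library(dada2)

filtFs <- sort(list.files("filtered", pattern = "_F_filt.fastq.gz",
                          full.names = TRUE))   # assumed file layout
errF <- learnErrors(filtFs, nbases = 3e8, multithread = TRUE)
# learnErrors prints "... total bases in ... reads from N samples will be
# used for learning the error rates", so you can confirm that N > 1.
```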
Great! Thank you so much for getting back to me so quickly. I really appreciate it.
My final question is whether there is an effective strategy for learning the error model, and subsequently inferring microbial composition, when samples from different body sites of the same organism, or from the same body site of different organisms, are processed in a single run. Are samples loaded into memory in alphabetical or numerical order? For example, if gut samples are listed first and vaginal samples follow, could the errors be learned predominantly from the gut samples? Should we ensure that error rates are learned from both sample types, or is this not a major concern?

I recall reading in a forum that samples from vastly different environments (e.g., ocean vs. human gut microbiome) should not be run together, and I'm curious how this principle applies to, say, human gut vs. vaginal microbiome, or human vs. mouse gut microbiome. I'm wondering how much your statement that "it doesn't matter which sample you learn the error rates from" might be challenged by different sample types. Thanks!!!
In principle, what DADA2 is learning is the error rates introduced by PCR and sequencing. So, if the same protocol was used on each body site, it is OK to learn error rates from one and use the model on another. The other principle, that it is better (all else equal) to use less diverse samples, might point one way or the other, but overall validity is not an issue.

This is one of the reasons I find the ASV approach in DADA2 (and other tools) clarifying. It is trying to solve one problem, fixing errors, rather than doing two things at once, fixing errors and clustering biological sequences into some analogy of named taxonomic levels, which is what motivated classic OTU methods.
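On the ordering question above: learnErrors reads files in the order of the input vector until nbases is reached, so alphabetically-first samples would dominate by default. Two hedged ways around that, with assumed paths (randomize is a real learnErrors argument; the interleaving vectors are hypothetical):

```r
library(dada2)

filtFs <- sort(list.files("filtered", pattern = "_F_filt.fastq.gz",
                          full.names = TRUE))   # assumed file layout

# Option 1: pick contributing samples in random order rather than file order
errF <- learnErrors(filtFs, randomize = TRUE, multithread = TRUE)

# Option 2: interleave the two sample types explicitly
# (gutFs and vagFs are hypothetical per-site file vectors)
# mixed <- as.vector(rbind(gutFs[1:3], vagFs[1:3]))
# errF  <- learnErrors(mixed, multithread = TRUE)
```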
Thanks!! |
Hi,
I have a large NextSeq dataset of approximately 400 million reads across 209 samples. To process these data with DADA2, I developed a Snakemake pipeline. The standard steps of dereplication, sample inference, and merging paired reads (following the DADA2 v1.8 tutorial or a simple for-loop approach) ran very slowly on such a large dataset.
To speed this up, I modified the pipeline so that each sample undergoes dereplication, sample inference, and merging in a separate Snakemake job submitted to the cluster. I tested this approach on a smaller subset of the data, comparing it to the original single-job run (as per the DADA2 v1.8 tutorial), using DADA2 v1.34.0 in both cases.
Although my approach completed much faster, I observed fewer inferred sequence variants when each sample was processed entirely independently. Given the default parameters (pool=FALSE, selfConsist=FALSE), I expected both methods to give similar results, but they did not.
I even called set.seed(100) before sample inference to make sure randomness was not contributing to the differences, but the results still differ. Could you please help me figure out why I see these differences? Thanks!!!
Original way vs. running samples independently (the starting and dereplicated counts were identical in both runs; only the inferred counts differ):

| Sample | Starting reads | Dereplicated (unique sequences) | Inferred (original) | Inferred (independent) |
| --- | --- | --- | --- | --- |
| 1 | 480,535 | 12,725 | 202 | 179 |
| 2 | 575,464 | 24,298 | 148 | 143 |
| 3 | 318,852 | 85,071 | 2,774 | 2,099 |
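For context, a minimal sketch of the per-sample job described in this issue (one sample per Snakemake job), with assumed file names. With pool=FALSE, inference for a sample depends only on that sample's reads and the supplied error model, so a shared errF/errR should reproduce a tutorial-style loop:

```r
library(dada2)

# errF/errR: error models learned once for the whole run (see above)
process_sample <- function(fF, fR, errF, errR) {
  drpF <- derepFastq(fF)                              # dereplicate forward reads
  drpR <- derepFastq(fR)                              # dereplicate reverse reads
  ddF  <- dada(drpF, err = errF, multithread = TRUE)  # pool = FALSE by default
  ddR  <- dada(drpR, err = errR, multithread = TRUE)
  mergePairs(ddF, drpF, ddR, drpR)                    # merge denoised pairs
}

merged <- process_sample("s1_F_filt.fastq.gz", "s1_R_filt.fastq.gz",
                         errF, errR)
```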