Clarification on umi_count/duplicate_count/consensus_count #743
Comments
I'm glad you asked that question... 8-) Not... Others might provide a more precise answer, but my understanding is that it is for use in non-UMI bulk studies. The different fields stem from trying to capture the different types of counts one might have with techniques as varied as bulk and single-cell. See the associated very long threads as to how this was arrived at. Perhaps we need to update the docs to better reflect this? |
Yes, I think this is correct. See #161 (comment) that |
Exact sequence duplicates, that is, same length and identical nucleotide sequence. Used almost exclusively by pre-processing tools for bulk AIRR-seq. As those duplicate sequences will have identical annotations from (say) IgBlast, collapsing them is an optimization to speed up the analysis workflow. |
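To illustrate the collapsing step described above, here is a minimal sketch, assuming a bulk run without UMIs. The read sequences and the `rearrangements` record layout are made up for illustration; only the `duplicate_count` field name comes from this thread.

```python
from collections import Counter

# Hypothetical bulk reads (no UMIs); the sequences are invented for illustration.
reads = ["TGTGCCAGC", "TGTGCCAGC", "TGTGCCAGT", "TGTGCCAGC", "TGTGCC"]

# Collapse exact duplicates: same length and identical nucleotide sequence.
duplicate_count = Counter(reads)

# Each unique sequence would then be annotated only once,
# carrying the number of collapsed reads along with it.
rearrangements = [
    {"sequence": seq, "duplicate_count": n}
    for seq, n in duplicate_count.most_common()
]
```

The annotation tool then only has to process `len(rearrangements)` sequences instead of `len(reads)`.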
A further point. It is important for downstream analysis tools to be aware and use |
Got it, thanks everyone! |
In the context of single-cell TCR data, I've always been a bit confused about what to put in which field. I've seen that in the latest revision of the AIRR standard, a `umi_count` field has been added to resolve some of this ambiguity.

Just to be sure I got this right:

- `umi_count` should contain the deduplicated read count (i.e. the number of unique UMIs)
- `duplicate_count` now remains empty (what is it actually for?)
- `consensus_count` should contain the raw read count (before UMI deduplication)

Is this correct?
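As a concrete version of the mapping I have in mind, here is a minimal sketch for a single contig from one cell. It only encodes the interpretation asked about above, not an authoritative reading of the standard; the UMIs, read IDs and the `record` layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical reads supporting one contig: (umi, read_id) pairs.
reads = [
    ("AAAC", "r1"), ("AAAC", "r2"), ("AAAC", "r3"),
    ("GGTA", "r4"), ("GGTA", "r5"),
    ("TTCG", "r6"),
]

# Count how many raw reads carry each UMI.
reads_per_umi = defaultdict(int)
for umi, _read_id in reads:
    reads_per_umi[umi] += 1

record = {
    "umi_count": len(reads_per_umi),                  # number of unique UMIs
    "consensus_count": sum(reads_per_umi.values()),   # raw reads before UMI deduplication
    "duplicate_count": None,                          # left empty under this interpretation
}
```

With these six reads spread over three UMIs, `umi_count` would be 3 and `consensus_count` would be 6.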
This came up in scverse/scirpy#478