Clarification on umi_count/duplicate_count/consensus_count #743

Closed
grst opened this issue Jan 24, 2024 · 6 comments

Comments

@grst
Contributor

grst commented Jan 24, 2024

In the context of single-cell TCR data, I've always been a bit confused about what to put in which field. I've seen that a umi_count field has been added in the latest revision of the AIRR standard to resolve some of this ambiguity.

Just to be sure I got this right:

  • umi_count should contain the deduplicated read count (i.e. the number of unique UMIs)
  • duplicate_count now remains empty (what is it actually for?)
  • consensus_count should contain the raw read count (before UMI deduplication)

Is this correct?
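
To make the proposed interpretation concrete, here is a minimal sketch of how the three fields would be filled for one contig. The input layout (one `(cell_id, contig_id, umi)` tuple per raw read) and the helper name `count_fields` are assumptions made for illustration only; they are not part of the AIRR standard or of any tool's API.

```python
from collections import defaultdict


def count_fields(reads):
    """Summarise raw reads into AIRR-style count fields per contig.

    `reads` is a hypothetical input: one (cell_id, contig_id, umi)
    tuple per raw sequencing read.
    """
    umis = defaultdict(set)     # distinct UMIs seen per contig
    n_reads = defaultdict(int)  # raw reads seen per contig
    for cell_id, contig_id, umi in reads:
        key = (cell_id, contig_id)
        umis[key].add(umi)
        n_reads[key] += 1
    return {
        key: {
            "umi_count": len(umis[key]),      # deduplicated: unique UMIs
            "consensus_count": n_reads[key],  # raw reads before deduplication
            "duplicate_count": None,          # left empty for UMI-based data
        }
        for key in n_reads
    }


reads = [
    ("cell1", "TRB_1", "AAAC"),
    ("cell1", "TRB_1", "AAAC"),  # same UMI again: adds a read, not a UMI
    ("cell1", "TRB_1", "GGTT"),
    ("cell1", "TRA_1", "CCGA"),
]
counts = count_fields(reads)
```

In this toy input the TRB contig has three raw reads but only two distinct UMIs, so `consensus_count` is 3 and `umi_count` is 2.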

This came up in scverse/scirpy#478

@bcorrie
Contributor

bcorrie commented Jan 25, 2024

I'm glad you asked that question... 8-) Not...

Others might provide a more precise answer, but my understanding is that duplicate_count doesn't apply to single-cell 10X-style experiments - see #543 (comment)

It is for use in non-UMI bulk studies if I understand correctly.

The different fields stem from trying to capture the different types of counts one might have with techniques as varied as bulk and single-cell.

See the associated very long threads as to how this was arrived at. Perhaps we need to update the docs to better reflect this?

@bcorrie
Contributor

bcorrie commented Jan 25, 2024

I am hoping that @javh @scharch might comment on your use of consensus_count and umi_count. It seems right to me, but I am not an expert.

@scharch
Contributor

scharch commented Jan 25, 2024

Yes, I think this is correct. See #161 (comment): duplicate_count is intended for use when there are no UMIs in the experimental protocol.

@schristley
Member

  • duplicate_count now remains empty (what is it actually for?)

Exact sequence duplicates, that is, same length and identical nucleotide sequence. Used almost exclusively by pre-processing tools for bulk AIRR-seq. As those duplicate sequences will have identical annotations from (say) IgBlast, collapsing them into one record with a duplicate_count is an optimization that speeds up the analysis workflow.
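
The collapsing step described above can be sketched as follows. This is a hedged illustration, not how any particular pre-processing tool implements it; the function name `collapse_duplicates` and the output record layout are assumptions.

```python
from collections import Counter


def collapse_duplicates(sequences):
    """Collapse exact duplicates (same length, identical nucleotides).

    Returns one record per distinct sequence, so that an annotation
    tool only needs to process each distinct sequence once.
    """
    counts = Counter(sequences)  # exact string match; keeps first-seen order
    return [
        {"sequence": seq, "duplicate_count": n}
        for seq, n in counts.items()
    ]


bulk_reads = ["ACGTAC", "ACGTAC", "ACGTA", "ACGTAC", "TTGCAA"]
collapsed = collapse_duplicates(bulk_reads)
```

Note that `ACGTA` is not merged with `ACGTAC`: a sequence that differs in length is a different record, even when it is a prefix of another one.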

@schristley
Member

A further point. It is important for downstream analysis tools to be aware of and use duplicate_count, especially if they are performing counts or statistics that take the number of sequences into account.
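
As a rough illustration of why this matters, the sketch below computes V gene usage from collapsed records, weighting each record by its duplicate_count. The function name `v_gene_usage` is hypothetical, and treating a missing or empty duplicate_count as 1 is an assumption of this sketch, not something the standard prescribes.

```python
from collections import defaultdict


def v_gene_usage(rows):
    """Fraction of sequences per V gene, weighted by duplicate_count."""
    totals = defaultdict(int)
    for row in rows:
        # Assumption: a missing or empty duplicate_count means one observation.
        weight = row.get("duplicate_count") or 1
        totals[row["v_call"]] += weight
    grand_total = sum(totals.values())
    return {gene: n / grand_total for gene, n in totals.items()}


rows = [
    {"v_call": "IGHV1-2", "duplicate_count": 3},
    {"v_call": "IGHV3-23", "duplicate_count": 1},
]
usage = v_gene_usage(rows)
```

A tool that ignored duplicate_count here would report both genes at 50%, whereas the four underlying sequences split 75% to 25%.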

@grst
Contributor Author

grst commented Jan 29, 2024

Got it, thanks everyone!
Scirpy will use the umi_count field by default from the next release on.

grst closed this as completed Jan 29, 2024