Clarification on umi_count/duplicate_count/consensus_count #743

grst · 2024-01-24T08:41:10Z

In the context of single-cell TCR data, I've always been a bit confused about what to put in which field. I've seen that a umi_count field was added in the latest revision of the AIRR standard to resolve some of this ambiguity.

Just to be sure I got this right:

umi_count should contain the deduplicated read count (i.e. the number of unique UMIs)
duplicate_count now remains empty (what is it actually for?)
consensus_count should contain the raw read count (before UMI deduplication)

Is this correct?

This came up in scverse/scirpy#478

bcorrie · 2024-01-25T16:28:17Z

I'm glad you asked that question... 8-) Not...

Others might provide a more precise answer, but my understanding is that duplicate_count doesn't apply to single-cell 10X-style experiments; see #543 (comment)

It is for use in non-UMI bulk studies if I understand correctly.

The different fields stem from trying to capture the different types of counts one might have with techniques as varied as bulk and single-cell sequencing.

See the associated very long threads as to how this was arrived at. Perhaps we need to update the docs to better reflect this?

bcorrie · 2024-01-25T16:35:47Z

I am hoping that @javh @scharch might comment on your use of consensus_count and umi_count. It seems right to me, but I am not an expert.

scharch · 2024-01-25T16:38:53Z

Yes, I think this is correct. See #161 (comment), which notes that duplicate_count is intended for use when there are no UMIs in the experimental protocol.
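As a rough illustration of the mapping confirmed above, the following minimal sketch derives the three fields for UMI-based single-cell data. The input layout (one `(contig_id, umi)` tuple per raw read) and the function name `count_fields` are hypothetical and not part of the AIRR standard or any tool; only the field names come from the thread.

```python
from collections import defaultdict

# Hypothetical input: one (contig_id, umi) tuple per raw read.
reads = [
    ("contig_1", "AAAC"),
    ("contig_1", "AAAC"),  # same UMI seen twice: one molecule, two reads
    ("contig_1", "GGTT"),
    ("contig_2", "CCGA"),
]


def count_fields(reads):
    """Derive per-contig counts following the interpretation in this thread."""
    umis = defaultdict(set)     # unique UMIs observed per contig
    n_reads = defaultdict(int)  # raw reads observed per contig
    for contig_id, umi in reads:
        umis[contig_id].add(umi)
        n_reads[contig_id] += 1
    return {
        contig_id: {
            "umi_count": len(umis[contig_id]),      # deduplicated count
            "consensus_count": n_reads[contig_id],  # raw reads before UMI dedup
            "duplicate_count": None,                # left empty for UMI protocols
        }
        for contig_id in n_reads
    }


counts = count_fields(reads)
```

Here `contig_1` ends up with two unique UMIs supported by three raw reads, while `duplicate_count` stays empty because the protocol uses UMIs.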

schristley · 2024-01-25T17:42:51Z

> duplicate_count now remains empty (what is it actually for?)

Exact sequence duplicates, that is, sequences of the same length with identical nucleotides. It is used almost exclusively by pre-processing tools for bulk AIRR-seq. Because those duplicate sequences will have identical annotations from (say) IgBlast, collapsing them is an optimization that speeds up the analysis workflow.
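A minimal sketch of that collapsing step, assuming the reads are available as plain nucleotide strings. The function name `collapse_exact_duplicates` and the toy sequences are hypothetical; real pre-processing tools will differ in detail.

```python
from collections import Counter

# Toy bulk reads (hypothetical). String equality already implies
# same length and identical nucleotide sequence.
sequences = ["TGTGCC", "TGTGCC", "TGTGCA", "TGTGCC", "TGTGC"]


def collapse_exact_duplicates(sequences):
    """Collapse identical sequences into one record each, keeping the tally."""
    tallies = Counter(sequences)
    return [
        {"sequence": seq, "duplicate_count": n}
        for seq, n in tallies.items()
    ]


collapsed = collapse_exact_duplicates(sequences)
```

Only the three distinct sequences in `collapsed` would then need to be annotated, instead of all five reads.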

schristley · 2024-01-25T17:48:35Z

A further point: it is important for downstream analysis tools to be aware of and use duplicate_count, especially if they compute counts or statistics that take the number of sequences into account.
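To illustrate why this matters, here is a small sketch of a gene-usage tally that weights each collapsed record by its duplicate_count. The rows, the function name `weighted_usage`, and the choice to treat a missing duplicate_count as 1 are all assumptions made for this example, not behavior prescribed by the standard.

```python
from collections import defaultdict

# Hypothetical collapsed records; v_call values are for illustration only.
rows = [
    {"v_call": "TRBV7-9", "duplicate_count": 3},
    {"v_call": "TRBV7-9", "duplicate_count": 1},
    {"v_call": "TRBV20-1", "duplicate_count": None},
]


def weighted_usage(rows, key="v_call"):
    """Sum duplicate_count per gene call instead of counting rows."""
    totals = defaultdict(int)
    for row in rows:
        weight = row.get("duplicate_count")
        # Assumption: an empty duplicate_count counts as a single sequence.
        totals[row[key]] += 1 if weight is None else weight
    return dict(totals)


usage = weighted_usage(rows)
```

Counting rows alone would report two sequences for TRBV7-9, whereas the weighted tally accounts for the four original reads behind them.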

grst · 2024-01-29T09:05:35Z

Got it, thanks everyone!
Scirpy will use the umi_count field by default from the next release on.

grst closed this as completed Jan 29, 2024
