Clone spec mapping for common tools #543

kira-neller · 2021-09-03T20:22:53Z

Hello!

The iReceptor team is hoping to load some clones data, so we've been reviewing the clones spec.

I've created a mapping file with my best guess at mapping clones output from MiXCR, Change-o, 10X, and ImmuneDB.

clone.mapping_with_counts_examples-2021-10-29.xlsx

Hoping for a review from the community to see if we are in agreement. Some specific questions that came out of this exercise:

It seems that most tools do not include v/d/j alignment start/end positions in the output. Am I missing something here - maybe a data processing step or a command-line argument?
For single-cell data, it became clear we'll need to resolve this issue: extend clones to the single-cell context. Specifically for 10X data, clone_id comprises multiple loci, so we should decide if/how to split this out. We suggest modifying clone_id as follows (example is for BCR data):

change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.

In the same vein, we'll need to consider how to count clones derived from single-cell data. I came up with 3 options; option 1 (aggregating clone counts across loci) isn't very useful; option 2 (locus-specific clone count) makes more sense and can be implemented fairly easily; option 3 (cell-specific clone count) is most informative but might be trickier to implement right now, as it links to the cell spec. Any thoughts on how best to proceed?

Thanks!

bussec · 2021-09-06T12:11:58Z

Would it be so difficult to link to the Cell object? The splitting (points 2+3) does not appear to be too meaningful to me. For example, assume you have (single-cell defined) clones in a mouse that use the same IGL rearrangement. If you split those you will either have two IGL "clones" with identical chains, which in a non-single-cell context would appear as one, or a single merged IGL "clone" of which you know that it does not exist. In either case, the cell-based grouping would be lost and require re-processing of the data.

javh · 2021-09-07T16:32:07Z

change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.

This would be an incorrect definition of clone_id, because the "clone" has two chains. So "CLONEA" if you have paired heavy and light chain data. Also, if you were to annotate the chain in the clone_id column like this, I think using _VH and _VL would be more meaningful - for the most part, you shouldn't have both a kappa and lambda chain.

I agree with @bussec on splitting 2+3. The answer shouldn't differ in cases where you have the heavy+light chain and have resolved doublets / allelic inclusion.

For Change-O/SCOPer specifically, you can count clones by just skipping the light chain row (IGK or IGL) for the clone; ie, count only on the IGH data. The algorithm uses both the heavy and light chain to define clones, but the heavy chain is required.

bcorrie · 2021-09-07T16:58:08Z

Would it be so difficult to link to the Cell object?

No, but this requires us to change the spec such that the Clone object to link to the Cell object (#317) and that has become dormant (last comment over a year ago). So maybe this is the driver to make that change...

The suggestions @kira-neller made were to map clone annotation tools to the AIRR Clone spec. Nothing like trying to use a spec to poke holes in it 8-)

bcorrie · 2021-09-07T17:32:02Z

change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.

This would be an incorrect definition of clone_id, because the "clone" has two chains. So "CLONEA" if you have paired heavy and light chain data. Also, if you were to annotate the chain in the clone_id column like this, I think using _VH and _VL would be more meaningful - for the most part, you shouldn't have both a kappa and lambda chain.

I agree with @bussec on splitting 2+3. The answer shouldn't differ in cases where you have the heavy+light chain and have resolved doublets / allelic inclusion.

For Change-O/SCOPer specifically, you can count clones by just skipping the light chain row (IGK or IGL) for the clone; ie, count only on the IGH data. The algorithm uses both the heavy and light chain to define clones, but the heavy chain is required.

So if I understand correctly, one should use the clone_id as assigned by the single cell tools. This means that for a specific clone, if a tool does single cell, there are both VH and VL rearrangements and they will have the same clone_id (and presumably the same cell_id), and that is good and we don't want to break that.

The problem then becomes counting and we need to come up with a mechanism for counting clones for each tool. I think that because our Clone object is not well enough defined in relation to Cells this is ambiguous at the moment. Which is exactly why @kira-neller posted the above... So is this true:

clone.sequence_count = sum(rearrangement.clone_id) - this would include VH and VL sequences...

clone_abundance is then somewhat more complicated based on how the tool calculates clones.

For Change-O/SCOPer clone.clone_abundance = sum(rearrangement.clone_id where rearrangement.locus == IGH)

We would need a rule for each tool in this case?

javh · 2021-09-07T18:09:57Z

We would need a rule for each tool in this case?

I hope not. I don't have a good sense of what the various new single-cell tools that have cropped up recently are doing, so without reviewing the state of the field, I'm not sure, but...

If we're not using the Cell object, and just using Rearrangement, then I assume the general rule of one clone_id per cell_id would hold and you can use that across tools. Ie, count the number of unique cell_id per clone_id.

bussec · 2021-09-07T21:22:07Z

This means that for a specific clone, if a tool does single cell, there are both VH and VL rearrangements and they will have the same clone_id (and presumably the same cell_id), [...]

Agreed. I cannot think of a situation in with a clone_id and cell_id would diverge. Also -- for any given clone inference -- a cell can only be a member of zero or one clone. (I hope such a deterministic view does not cause a Bayesian outcry 😉 )

So is this true:
clone.sequence_count = sum(rearrangement.clone_id) - this would include VH and VL sequences...

Looks ok to me. However, the usefulness of clone.sequence_count might be reduced in a single-cell context, if tools perform consensus builds and you end up with few unique sequences. In that case clone.clone_abundance would be the more useful parameter.

One more question: Would any of these clone statistics exclude out-of-frame rearrangements?

kira-neller · 2021-09-08T20:14:55Z

One more question: Would any of these clone statistics exclude out-of-frame rearrangements?

I think most tools exclude non-productive rearrangements prior to assigning clones, right?

bcorrie · 2021-09-08T23:28:39Z

We would need a rule for each tool in this case?

I hope not. I don't have a good sense of what the various new single-cell tools that have cropped up recently are doing, so without reviewing the state of the field, I'm not sure, but...

If we're not using the Cell object, and just using Rearrangement, then I assume the general rule of one clone_id per cell_id would hold and you can use that across tools. Ie, count the number of unique cell_id per clone_id.

But we aren't guaranteed that every tool that produces a clone_id will produce a cell_id are we? I could see how that might work for a single cell pipeline, but what about the bulk AIRR seq clone tools that don't know anything about cells (e.g. MiXCR, ImmuneDB, and even the Change-O non paired pipeline)? They simply assign a clone_id, no?

javh · 2021-09-08T23:33:57Z

Yeah, I meant in the single-cell context. For bulk, there isn't going to be a single right answer for how to count clones -reasonable approximations are going to differ with the technology too (mRNA vs gDNA, with vs without replicates).

bcorrie · 2021-09-08T23:39:16Z

Possibly a dumb question. I have a study with VH and VL data, not paired (for example two different sample processings, with different PCR targets)

My understanding is that one would often run a clone tool (Change-O, MIXCR) on the VH data.

Presumably one could run a clone tool on the VL data. Would one want to do that?

That would give separate clone_id's than that of the VH data, and there is no way to link the VH and VL data. So you would have two independent sets of clones, one from the VH and VL data. Does that make sense and/or is it useful? I assume this is a case that we might run across so would need to support.

javh · 2021-09-08T23:57:37Z

Presumably one could run a clone tool on the VL data. Would one want to do that?

Mostly, I would say "no". Light chains don't have as much information, so while you could certainly cluster the VLs for some purposes, it's not going to be specific enough to reliably infer decent from a common ancestor. The V:J gene pairing is going to drive the clustering.

Maybe it's workable in a repertoire with a lot of SHM in the VL, but it's not something I've spent a lot of time on... Dunno if anyone has dug into it using single-cell data. This is sort of on topic, but not exactly:
https://www.jimmunol.org/content/203/7/1687

kira-neller · 2021-09-10T03:57:39Z

Thanks all for your comments. I've modified my mapping file to include new fields showing example counts for sequence_count and clone_abundance for 10X, Change-o, and MiXCR. These are the fields highlighted in yellow in the respective tabs. Based on what's been discussed, are these values correct?

clone.mapping_with_counts_examples.xlsx

Note that for the 10X example, I've included both clone_count and number of unique cells (10X refers to this as "frequency"). It seems like frequency is most useful, but not sure if/how we want to include this in the clones specification.

kira-neller · 2021-10-05T22:27:03Z

Hi @schristley @javh @scharch, just pinging you on this issue per yesterday's Standards Call. Thanks!

bcorrie · 2021-10-29T15:59:45Z

For discussion at meeting next week - current proposal that it would be good to get agreement on (from @kira-neller spreadsheet) is the counts as they are the most "problematic":

clone_id:

MiXCR = cloneID
Change-o = clone_id
10X = clonetype_id/clone_id
ImmuneDB = clone_id

sequence_count:

MiXCR = count(readID for each cloneID)
Change-O = sum(umi_count for each clone_id)
10X = sum(duplicate_count for each clone_id)
ImmuneDB = instances

clone_abundance:

MiXCR = cloneCount
Change-o = count(clone_id)
10X = count(clone_id) - ideally somehow done taking into account paired chains???
ImmuneDB = copies

bcorrie · 2021-10-29T16:01:34Z

@kira-neller I notice in your spreadsheet for 10X clone_abundance it says sum(duplicate_count), but I think this should be count(clone_id). Can you check. I used count(clone_id) above even though that is not what is in the spreadsheet 8-)

kira-neller · 2021-10-29T16:27:31Z

@bcorrie confirming the above for 10X clone_abundance. In the original file my description was wrong but in the 10x_rgmt tab for clone_abundance I did use count(clone_id), so we are in agreement. I fixed this and uploaded the new version in the original post.

Also here, for convenience:
clone.mapping_with_counts_examples-2021-10-29.xlsx

bcorrie · 2021-11-08T19:20:37Z

From the Clone spec:

        sequence_count:
            type: integer
            description: Number of Rearrangement records (sequences) included in this clone.
        clone_abundance:
            type: integer
            description: Non-normalized absolute count of the number of members (immune cells) in this clone.

scharch · 2021-11-08T20:04:44Z

From the call:

sequence_count to be removed from the Clone schema
Add umi_count (sum of umi_count from Rearrangement), which will map to
** Change-O = sum(umi_count for each clone_id)
** 10X = sum(duplicate_count for each clone_id)
Add duplicate_count (sum of duplicate_count from Rearrangement), which will map to
** MiXCR = count(readID for each cloneID)
Rename clone_abundance to clone_count for naming consistency, mappings the same as @bcorrie listed above
** Corresponds to # of cells for single cell protocols
** Corresponds to # of unique variants for bulk protocols

bcorrie · 2022-02-09T17:34:29Z

@scharch about to implement this for 10X, and I am on the verge of being confused again 8-) If I understand correctly we now should have:

umi_count:
** Change-O = sum(umi_count for each clone_id)
** 10X = sum(duplicate_count for each clone_id)
clone_count:
** Corresponds to # of cells for single cell protocols
** Corresponds to # of unique variants for bulk protocols
duplicate_count:
** ???

I am not sure what duplicate_count should be, in particular in the 10X case, when umi_count = sum(duplicate_count for each clone_id)

bcorrie · 2022-02-09T17:39:01Z

To be VERY specific for 10X, clone_count = count(unique cell_id for each clone_id)

E.g. For clonotype1 in a 10X airr_rearrangements.tsv

awk -F '\t' '$2=="clonotype1" {print $0}' airr_rearrangement.tsv | awk -F '\t' '{print $1}' | sort -u | wc -l

Gets all the data for a specific clone_id, extracts the cell_id, gets the unique cell_id list, and counts the unique cells...

Assumes $1 is cell_id and $2 is clone_id 8-)

scharch · 2022-02-10T18:23:04Z

duplicate_count:
** ???

PROFIT!!!!

scharch · 2022-02-10T18:28:11Z

@bcorrie I think the actual answer is that duplicate_count doesn't apply to 10x data. But damned if I can figure out what I was in my head 3 months ago from the notes here and at #161...

schristley · 2022-02-10T18:49:11Z

@bcorrie I think the actual answer is that duplicate_count doesn't apply to 10x data. But damned if I can figure out what I was in my head 3 months ago from the notes here and at #161...

duplicate_count == 1 or null, under the assumption that a single cell is sequenced.

javh · 2022-02-10T18:50:01Z

I think the actual answer is that duplicate_count doesn't apply to 10x data. But damned if I can figure out what I was in my head 3 months ago from the notes here and at #161...

And I think you are correct. :)

When you load in the cellranger data, rename duplicate_count to umi_count. Then don't assign duplicate_count because cellranger doesn't remove duplicate sequences. Then things should be consistent.

Quoting @scharch:

add a umi_count field.

revise the documentation of duplicate_count to deprecate its use in favor or umi_count where possible. (duplicate_count would remain available for non-umi contexts.) Per @kira-neller's comment above, this would mean 10x would have to revise/update the output of cellranger.

Ie, we've used duplicate_count for UMI counts from bulk UMI protocols historically. That's not worked out well with single-cell data. So, preferably, the definition of duplicate_count gets narrowed to copy number of identical sequences (still working the same for bulk UMI protocols). You could still have a duplicate_count in 10x data, but you'd have to go through the calculation and it wouldn't have anything to do with UMIs.

The break is that for bulk sequencing, UMI ~ transcript in any cells; for single-cell, UMI ~ transcript within a given cell. So UMIs were (badly) approximating cell count in bulk, whereas they directly reflect intra-cellular expression in single-cell.

As per #543

As per #543 * Update `_count` field language and add `umi_count` to Rearrangement. Co-authored-by: Jason Vander Heiden <[email protected]> Co-authored-by: Scott Christley <[email protected]>

kira-neller mentioned this issue Oct 5, 2021

more documentation for duplicate_count #161

Closed

bcorrie added a commit that referenced this issue Feb 12, 2022

Change counts

74e2733

As per #543

bcorrie mentioned this issue Feb 12, 2022

Clone updates #586

Merged

javh closed this as completed in #586 Apr 25, 2022

javh added a commit that referenced this issue Apr 25, 2022

Clone updates (#586)

746e55e

As per #543 * Update `_count` field language and add `umi_count` to Rearrangement. Co-authored-by: Jason Vander Heiden <[email protected]> Co-authored-by: Scott Christley <[email protected]>

bcorrie mentioned this issue Jan 25, 2024

Clarification on umi_count/duplicate_count/consensus_count #743

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clone spec mapping for common tools #543

Clone spec mapping for common tools #543

kira-neller commented Sep 3, 2021 •

edited

Loading

bussec commented Sep 6, 2021

javh commented Sep 7, 2021 •

edited

Loading

bcorrie commented Sep 7, 2021

bcorrie commented Sep 7, 2021

javh commented Sep 7, 2021 •

edited

Loading

bussec commented Sep 7, 2021

kira-neller commented Sep 8, 2021

bcorrie commented Sep 8, 2021

javh commented Sep 8, 2021 •

edited

Loading

bcorrie commented Sep 8, 2021

javh commented Sep 8, 2021

kira-neller commented Sep 10, 2021

kira-neller commented Oct 5, 2021

bcorrie commented Oct 29, 2021

bcorrie commented Oct 29, 2021

kira-neller commented Oct 29, 2021 •

edited

Loading

bcorrie commented Nov 8, 2021 •

edited

Loading

scharch commented Nov 8, 2021

bcorrie commented Feb 9, 2022

bcorrie commented Feb 9, 2022 •

edited

Loading

scharch commented Feb 10, 2022

scharch commented Feb 10, 2022

schristley commented Feb 10, 2022

javh commented Feb 10, 2022 •

edited

Loading

Clone spec mapping for common tools #543

Clone spec mapping for common tools #543

Comments

kira-neller commented Sep 3, 2021 • edited Loading

bussec commented Sep 6, 2021

javh commented Sep 7, 2021 • edited Loading

bcorrie commented Sep 7, 2021

bcorrie commented Sep 7, 2021

javh commented Sep 7, 2021 • edited Loading

bussec commented Sep 7, 2021

kira-neller commented Sep 8, 2021

bcorrie commented Sep 8, 2021

javh commented Sep 8, 2021 • edited Loading

bcorrie commented Sep 8, 2021

javh commented Sep 8, 2021

kira-neller commented Sep 10, 2021

kira-neller commented Oct 5, 2021

bcorrie commented Oct 29, 2021

bcorrie commented Oct 29, 2021

kira-neller commented Oct 29, 2021 • edited Loading

bcorrie commented Nov 8, 2021 • edited Loading

scharch commented Nov 8, 2021

bcorrie commented Feb 9, 2022

bcorrie commented Feb 9, 2022 • edited Loading

scharch commented Feb 10, 2022

scharch commented Feb 10, 2022

schristley commented Feb 10, 2022

javh commented Feb 10, 2022 • edited Loading

kira-neller commented Sep 3, 2021 •

edited

Loading

javh commented Sep 7, 2021 •

edited

Loading

javh commented Sep 7, 2021 •

edited

Loading

javh commented Sep 8, 2021 •

edited

Loading

kira-neller commented Oct 29, 2021 •

edited

Loading

bcorrie commented Nov 8, 2021 •

edited

Loading

bcorrie commented Feb 9, 2022 •

edited

Loading

javh commented Feb 10, 2022 •

edited

Loading