Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clone spec mapping for common tools #543

Closed
kira-neller opened this issue Sep 3, 2021 · 24 comments · Fixed by #586
Closed

Clone spec mapping for common tools #543

kira-neller opened this issue Sep 3, 2021 · 24 comments · Fixed by #586

Comments

@kira-neller
Copy link

kira-neller commented Sep 3, 2021

Hello!

The iReceptor team is hoping to load some clones data, so we've been reviewing the clones spec.

I've created a mapping file with my best guess at mapping clones output from MiXCR, Change-o, 10X, and ImmuneDB.

clone.mapping_with_counts_examples-2021-10-29.xlsx

Hoping for a review from the community to see if we are in agreement. Some specific questions that came out of this exercise:

  1. It seems that most tools do not include v/d/j alignment start/end positions in the output. Am I missing something here - maybe a data processing step or a command-line argument?

  2. For single-cell data, it became clear we'll need to resolve this issue: extend clones to the single-cell context. Specifically for 10X data, clone_id comprises multiple loci, so we should decide if/how to split this out. We suggest modifying clone_id as follows (example is for BCR data):

change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.

  1. In the same vein, we'll need to consider how to count clones derived from single-cell data. I came up with 3 options; option 1 (aggregating clone counts across loci) isn't very useful; option 2 (locus-specific clone count) makes more sense and can be implemented fairly easily; option 3 (cell-specific clone count) is most informative but might be trickier to implement right now, as it links to the cell spec. Any thoughts on how best to proceed?

Thanks!

@bussec
Copy link
Member

bussec commented Sep 6, 2021

Would it be so difficult to link to the Cell object? The splitting (points 2+3) does not appear to be too meaningful to me. For example, assume you have (single-cell defined) clones in a mouse that use the same IGL rearrangement. If you split those you will either have two IGL "clones" with identical chains, which in a non-single-cell context would appear as one, or a single merged IGL "clone" of which you know that it does not exist. In either case, the cell-based grouping would be lost and require re-processing of the data.

@javh
Copy link
Contributor

javh commented Sep 7, 2021

change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.

This would be an incorrect definition of clone_id, because the "clone" has two chains. So "CLONEA" if you have paired heavy and light chain data. Also, if you were to annotate the chain in the clone_id column like this, I think using _VH and _VL would be more meaningful - for the most part, you shouldn't have both a kappa and lambda chain.

I agree with @bussec on splitting 2+3. The answer shouldn't differ in cases where you have the heavy+light chain and have resolved doublets / allelic inclusion.

For Change-O/SCOPer specifically, you can count clones by just skipping the light chain row (IGK or IGL) for the clone; ie, count only on the IGH data. The algorithm uses both the heavy and light chain to define clones, but the heavy chain is required.

@bcorrie
Copy link
Contributor

bcorrie commented Sep 7, 2021

Would it be so difficult to link to the Cell object?

No, but this requires us to change the spec such that the Clone object to link to the Cell object (#317) and that has become dormant (last comment over a year ago). So maybe this is the driver to make that change...

The suggestions @kira-neller made were to map clone annotation tools to the AIRR Clone spec. Nothing like trying to use a spec to poke holes in it 8-)

@bcorrie
Copy link
Contributor

bcorrie commented Sep 7, 2021

change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.

This would be an incorrect definition of clone_id, because the "clone" has two chains. So "CLONEA" if you have paired heavy and light chain data. Also, if you were to annotate the chain in the clone_id column like this, I think using _VH and _VL would be more meaningful - for the most part, you shouldn't have both a kappa and lambda chain.

I agree with @bussec on splitting 2+3. The answer shouldn't differ in cases where you have the heavy+light chain and have resolved doublets / allelic inclusion.

For Change-O/SCOPer specifically, you can count clones by just skipping the light chain row (IGK or IGL) for the clone; ie, count only on the IGH data. The algorithm uses both the heavy and light chain to define clones, but the heavy chain is required.

So if I understand correctly, one should use the clone_id as assigned by the single cell tools. This means that for a specific clone, if a tool does single cell, there are both VH and VL rearrangements and they will have the same clone_id (and presumably the same cell_id), and that is good and we don't want to break that.

The problem then becomes counting and we need to come up with a mechanism for counting clones for each tool. I think that because our Clone object is not well enough defined in relation to Cells this is ambiguous at the moment. Which is exactly why @kira-neller posted the above... So is this true:

clone.sequence_count = sum(rearrangement.clone_id) - this would include VH and VL sequences...

clone_abundance is then somewhat more complicated based on how the tool calculates clones.

  • For Change-O/SCOPer clone.clone_abundance = sum(rearrangement.clone_id where rearrangement.locus == IGH)

We would need a rule for each tool in this case?

@javh
Copy link
Contributor

javh commented Sep 7, 2021

We would need a rule for each tool in this case?

I hope not. I don't have a good sense of what the various new single-cell tools that have cropped up recently are doing, so without reviewing the state of the field, I'm not sure, but...

If we're not using the Cell object, and just using Rearrangement, then I assume the general rule of one clone_id per cell_id would hold and you can use that across tools. Ie, count the number of unique cell_id per clone_id.

@bussec
Copy link
Member

bussec commented Sep 7, 2021

This means that for a specific clone, if a tool does single cell, there are both VH and VL rearrangements and they will have the same clone_id (and presumably the same cell_id), [...]

Agreed. I cannot think of a situation in with a clone_id and cell_id would diverge. Also -- for any given clone inference -- a cell can only be a member of zero or one clone. (I hope such a deterministic view does not cause a Bayesian outcry 😉 )

So is this true:
clone.sequence_count = sum(rearrangement.clone_id) - this would include VH and VL sequences...

Looks ok to me. However, the usefulness of clone.sequence_count might be reduced in a single-cell context, if tools perform consensus builds and you end up with few unique sequences. In that case clone.clone_abundance would be the more useful parameter.

One more question: Would any of these clone statistics exclude out-of-frame rearrangements?

@kira-neller
Copy link
Author

One more question: Would any of these clone statistics exclude out-of-frame rearrangements?

I think most tools exclude non-productive rearrangements prior to assigning clones, right?

@bcorrie
Copy link
Contributor

bcorrie commented Sep 8, 2021

We would need a rule for each tool in this case?

I hope not. I don't have a good sense of what the various new single-cell tools that have cropped up recently are doing, so without reviewing the state of the field, I'm not sure, but...

If we're not using the Cell object, and just using Rearrangement, then I assume the general rule of one clone_id per cell_id would hold and you can use that across tools. Ie, count the number of unique cell_id per clone_id.

But we aren't guaranteed that every tool that produces a clone_id will produce a cell_id are we? I could see how that might work for a single cell pipeline, but what about the bulk AIRR seq clone tools that don't know anything about cells (e.g. MiXCR, ImmuneDB, and even the Change-O non paired pipeline)? They simply assign a clone_id, no?

@javh
Copy link
Contributor

javh commented Sep 8, 2021

Yeah, I meant in the single-cell context. For bulk, there isn't going to be a single right answer for how to count clones -reasonable approximations are going to differ with the technology too (mRNA vs gDNA, with vs without replicates).

@bcorrie
Copy link
Contributor

bcorrie commented Sep 8, 2021

Possibly a dumb question. I have a study with VH and VL data, not paired (for example two different sample processings, with different PCR targets)

My understanding is that one would often run a clone tool (Change-O, MIXCR) on the VH data.

Presumably one could run a clone tool on the VL data. Would one want to do that?

That would give separate clone_id's than that of the VH data, and there is no way to link the VH and VL data. So you would have two independent sets of clones, one from the VH and VL data. Does that make sense and/or is it useful? I assume this is a case that we might run across so would need to support.

@javh
Copy link
Contributor

javh commented Sep 8, 2021

Presumably one could run a clone tool on the VL data. Would one want to do that?

Mostly, I would say "no". Light chains don't have as much information, so while you could certainly cluster the VLs for some purposes, it's not going to be specific enough to reliably infer decent from a common ancestor. The V:J gene pairing is going to drive the clustering.

Maybe it's workable in a repertoire with a lot of SHM in the VL, but it's not something I've spent a lot of time on... Dunno if anyone has dug into it using single-cell data. This is sort of on topic, but not exactly:
https://www.jimmunol.org/content/203/7/1687

@kira-neller
Copy link
Author

Thanks all for your comments. I've modified my mapping file to include new fields showing example counts for sequence_count and clone_abundance for 10X, Change-o, and MiXCR. These are the fields highlighted in yellow in the respective tabs. Based on what's been discussed, are these values correct?

clone.mapping_with_counts_examples.xlsx

Note that for the 10X example, I've included both clone_count and number of unique cells (10X refers to this as "frequency"). It seems like frequency is most useful, but not sure if/how we want to include this in the clones specification.

@kira-neller
Copy link
Author

Hi @schristley @javh @scharch, just pinging you on this issue per yesterday's Standards Call. Thanks!

@bcorrie
Copy link
Contributor

bcorrie commented Oct 29, 2021

For discussion at meeting next week - current proposal that it would be good to get agreement on (from @kira-neller spreadsheet) is the counts as they are the most "problematic":

clone_id:

  • MiXCR = cloneID
  • Change-o = clone_id
  • 10X = clonetype_id/clone_id
  • ImmuneDB = clone_id

sequence_count:

  • MiXCR = count(readID for each cloneID)
  • Change-O = sum(umi_count for each clone_id)
  • 10X = sum(duplicate_count for each clone_id)
  • ImmuneDB = instances

clone_abundance:

  • MiXCR = cloneCount
  • Change-o = count(clone_id)
  • 10X = count(clone_id) - ideally somehow done taking into account paired chains???
  • ImmuneDB = copies

@bcorrie
Copy link
Contributor

bcorrie commented Oct 29, 2021

@kira-neller I notice in your spreadsheet for 10X clone_abundance it says sum(duplicate_count), but I think this should be count(clone_id). Can you check. I used count(clone_id) above even though that is not what is in the spreadsheet 8-)

@kira-neller
Copy link
Author

kira-neller commented Oct 29, 2021

@bcorrie confirming the above for 10X clone_abundance. In the original file my description was wrong but in the 10x_rgmt tab for clone_abundance I did use count(clone_id), so we are in agreement. I fixed this and uploaded the new version in the original post.

Also here, for convenience:
clone.mapping_with_counts_examples-2021-10-29.xlsx

@bcorrie
Copy link
Contributor

bcorrie commented Nov 8, 2021

From the Clone spec:

        sequence_count:
            type: integer
            description: Number of Rearrangement records (sequences) included in this clone.
        clone_abundance:
            type: integer
            description: Non-normalized absolute count of the number of members (immune cells) in this clone.

@scharch
Copy link
Contributor

scharch commented Nov 8, 2021

From the call:

  • sequence_count to be removed from the Clone schema
  • Add umi_count (sum of umi_count from Rearrangement), which will map to
    ** Change-O = sum(umi_count for each clone_id)
    ** 10X = sum(duplicate_count for each clone_id)
  • Add duplicate_count (sum of duplicate_count from Rearrangement), which will map to
    ** MiXCR = count(readID for each cloneID)
  • Rename clone_abundance to clone_count for naming consistency, mappings the same as @bcorrie listed above
    ** Corresponds to # of cells for single cell protocols
    ** Corresponds to # of unique variants for bulk protocols

@bcorrie
Copy link
Contributor

bcorrie commented Feb 9, 2022

@scharch about to implement this for 10X, and I am on the verge of being confused again 8-) If I understand correctly we now should have:

  • umi_count:
    ** Change-O = sum(umi_count for each clone_id)
    ** 10X = sum(duplicate_count for each clone_id)
  • clone_count:
    ** Corresponds to # of cells for single cell protocols
    ** Corresponds to # of unique variants for bulk protocols
  • duplicate_count:
    ** ???

I am not sure what duplicate_count should be, in particular in the 10X case, when umi_count = sum(duplicate_count for each clone_id)

@bcorrie
Copy link
Contributor

bcorrie commented Feb 9, 2022

To be VERY specific for 10X, clone_count = count(unique cell_id for each clone_id)

E.g. For clonotype1 in a 10X airr_rearrangements.tsv

awk -F '\t' '$2=="clonotype1" {print $0}' airr_rearrangement.tsv | awk -F '\t' '{print $1}' | sort -u | wc -l

Gets all the data for a specific clone_id, extracts the cell_id, gets the unique cell_id list, and counts the unique cells...

Assumes $1 is cell_id and $2 is clone_id 8-)

@scharch
Copy link
Contributor

scharch commented Feb 10, 2022

  • duplicate_count:
    ** ???

PROFIT!!!!

@scharch
Copy link
Contributor

scharch commented Feb 10, 2022

@bcorrie I think the actual answer is that duplicate_count doesn't apply to 10x data. But damned if I can figure out what I was in my head 3 months ago from the notes here and at #161...

@schristley
Copy link
Member

@bcorrie I think the actual answer is that duplicate_count doesn't apply to 10x data. But damned if I can figure out what I was in my head 3 months ago from the notes here and at #161...

duplicate_count == 1 or null, under the assumption that a single cell is sequenced.

@javh
Copy link
Contributor

javh commented Feb 10, 2022

I think the actual answer is that duplicate_count doesn't apply to 10x data. But damned if I can figure out what I was in my head 3 months ago from the notes here and at #161...

And I think you are correct. :)

When you load in the cellranger data, rename duplicate_count to umi_count. Then don't assign duplicate_count because cellranger doesn't remove duplicate sequences. Then things should be consistent.

Quoting @scharch:

  • add a umi_count field.
  • revise the documentation of duplicate_count to deprecate its use in favor or umi_count where possible. (duplicate_count would remain available for non-umi contexts.) Per @kira-neller's comment above, this would mean 10x would have to revise/update the output of cellranger.

Ie, we've used duplicate_count for UMI counts from bulk UMI protocols historically. That's not worked out well with single-cell data. So, preferably, the definition of duplicate_count gets narrowed to copy number of identical sequences (still working the same for bulk UMI protocols). You could still have a duplicate_count in 10x data, but you'd have to go through the calculation and it wouldn't have anything to do with UMIs.

The break is that for bulk sequencing, UMI ~ transcript in any cells; for single-cell, UMI ~ transcript within a given cell. So UMIs were (badly) approximating cell count in bulk, whereas they directly reflect intra-cellular expression in single-cell.

bcorrie added a commit that referenced this issue Feb 12, 2022
@bcorrie bcorrie mentioned this issue Feb 12, 2022
@javh javh closed this as completed in #586 Apr 25, 2022
javh added a commit that referenced this issue Apr 25, 2022
As per #543

* Update `_count` field language and add `umi_count` to Rearrangement.

Co-authored-by: Jason Vander Heiden <[email protected]>
Co-authored-by: Scott Christley <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants