Clone spec mapping for common tools #543
Comments
Would it be so difficult to link to the |
This would be an incorrect definition of I agree with @bussec on splitting 2+3. The answer shouldn't differ in cases where you have the heavy+light chain and have resolved doublets / allelic inclusion. For Change-O/SCOPer specifically, you can count clones by just skipping the light chain row (IGK or IGL) for the clone; ie, count only on the IGH data. The algorithm uses both the heavy and light chain to define clones, but the heavy chain is required. |
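A minimal sketch of the heavy-chain-only counting described above, assuming rows are already parsed from an AIRR Rearrangement TSV into dicts with `clone_id` and `locus` keys. The function name `count_heavy_chain_rows` is made up for illustration and is not part of Change-O or SCOPer.

```python
from collections import Counter


def count_heavy_chain_rows(rows):
    """Count rearrangements per clone_id, skipping light chain rows (IGK/IGL)."""
    counts = Counter()
    for row in rows:
        # Only the heavy chain contributes to the count.
        if row["locus"] == "IGH":
            counts[row["clone_id"]] += 1
    return dict(counts)


# Hypothetical example rows; real data would come from a Rearrangement TSV.
rows = [
    {"clone_id": "1", "locus": "IGH"},
    {"clone_id": "1", "locus": "IGK"},
    {"clone_id": "1", "locus": "IGH"},
    {"clone_id": "2", "locus": "IGH"},
    {"clone_id": "2", "locus": "IGL"},
]
heavy_counts = count_heavy_chain_rows(rows)
```

The light chain rows keep their `clone_id`; they are just left out of the tally.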
No, but this requires us to change the spec so that the Clone object links to the Cell object (#317), and that has become dormant (last comment over a year ago). So maybe this is the driver to make that change... The suggestions @kira-neller made were to map clone annotation tools to the AIRR Clone spec. Nothing like trying to use a spec to poke holes in it 8-) |
So if I understand correctly, one should use the clone_id as assigned by the single cell tools. This means that for a specific clone, if a tool does single cell, there are both VH and VL rearrangements and they will have the same clone_id (and presumably the same cell_id), and that is good and we don't want to break that. The problem then becomes counting and we need to come up with a mechanism for counting clones for each tool. I think that because our Clone object is not well enough defined in relation to Cells this is ambiguous at the moment. Which is exactly why @kira-neller posted the above... So is this true: clone.sequence_count = sum(rearrangement.clone_id) - this would include VH and VL sequences... clone_abundance is then somewhat more complicated based on how the tool calculates clones.
We would need a rule for each tool in this case? |
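To make the question concrete, here is a rough sketch of `sequence_count` read as "number of Rearrangement rows carrying a given `clone_id`", which would include both VH and VL rows. This is one reading of the expression in the comment, not an agreed definition, and `sequence_count_by_clone` is a hypothetical helper.

```python
from collections import Counter


def sequence_count_by_clone(rows):
    """Count all rearrangement rows per clone_id, regardless of locus."""
    # Rows without a clone_id (empty string or missing) are ignored.
    return dict(Counter(row["clone_id"] for row in rows if row.get("clone_id")))


# Hypothetical single-cell style rows: VH and VL share a clone_id.
rows = [
    {"clone_id": "CLONEA", "locus": "IGH"},
    {"clone_id": "CLONEA", "locus": "IGK"},
    {"clone_id": "CLONEB", "locus": "IGH"},
    {"clone_id": "", "locus": "IGH"},
]
sequence_counts = sequence_count_by_clone(rows)
```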
I hope not. I don't have a good sense of what the various new single-cell tools that have cropped up recently are doing, so without reviewing the state of the field, I'm not sure, but... If we're not using the Cell object, and just using Rearrangement, then I assume the general rule of one |
Agreed. I cannot think of a situation in which a
Looks ok to me. However, the usefulness of One more question: Would any of these clone statistics exclude out-of-frame rearrangements? |
I think most tools exclude non-productive rearrangements prior to assigning clones, right? |
But we aren't guaranteed that every tool that produces a clone_id will produce a cell_id are we? I could see how that might work for a single cell pipeline, but what about the bulk AIRR seq clone tools that don't know anything about cells (e.g. MiXCR, ImmuneDB, and even the Change-O non paired pipeline)? They simply assign a clone_id, no? |
Yeah, I meant in the single-cell context. For bulk, there isn't going to be a single right answer for how to count clones; reasonable approximations are going to differ with the technology too (mRNA vs gDNA, with vs without replicates). |
Possibly a dumb question. I have a study with VH and VL data, not paired (for example, two different sample processings with different PCR targets). My understanding is that one would often run a clone tool (Change-O, MiXCR) on the VH data. Presumably one could run a clone tool on the VL data. Would one want to do that? That would give clone_ids separate from those of the VH data, and there is no way to link the VH and VL data. So you would have two independent sets of clones, one from the VH data and one from the VL data. Does that make sense and/or is it useful? I assume this is a case that we might run across, so we would need to support it. |
Mostly, I would say "no". Light chains don't have as much information, so while you could certainly cluster the VLs for some purposes, it's not going to be specific enough to reliably infer descent from a common ancestor. The V:J gene pairing is going to drive the clustering. Maybe it's workable in a repertoire with a lot of SHM in the VL, but it's not something I've spent a lot of time on... Dunno if anyone has dug into it using single-cell data. This is sort of on topic, but not exactly: |
Thanks all for your comments. I've modified my mapping file to include new fields showing example counts for sequence_count and clone_abundance for 10X, Change-O, and MiXCR. These are the fields highlighted in yellow in the respective tabs. Based on what's been discussed, are these values correct? clone.mapping_with_counts_examples.xlsx Note that for the 10X example, I've included both clone_count and number of unique cells (10X refers to this as "frequency"). It seems like frequency is most useful, but I'm not sure if/how we want to include this in the clones specification. |
Hi @schristley @javh @scharch, just pinging you on this issue per yesterday's Standards Call. Thanks! |
For discussion at meeting next week - current proposal that it would be good to get agreement on (from @kira-neller spreadsheet) is the counts as they are the most "problematic": clone_id:
sequence_count:
clone_abundance:
|
@kira-neller I notice in your spreadsheet for 10X clone_abundance it says sum(duplicate_count), but I think this should be count(clone_id). Can you check. I used count(clone_id) above even though that is not what is in the spreadsheet 8-) |
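For reference, the two aggregations being compared can give quite different numbers for the same clone. The sketch below computes both side by side, assuming rows parsed from a Rearrangement TSV with string-valued `duplicate_count`; the helper name and the example values are made up, and it does not decide which of the two belongs in `clone_abundance`.

```python
def clone_abundance_candidates(rows, clone_id):
    """Return (count of rows, sum of duplicate_count) for one clone_id."""
    members = [r for r in rows if r["clone_id"] == clone_id]
    row_count = len(members)  # count(clone_id)
    duplicate_sum = sum(int(r["duplicate_count"]) for r in members)  # sum(duplicate_count)
    return row_count, duplicate_sum


# Hypothetical rows; values are strings, as they would be when read from TSV.
rows = [
    {"clone_id": "clonotype1", "duplicate_count": "5"},
    {"clone_id": "clonotype1", "duplicate_count": "4"},
    {"clone_id": "clonotype1", "duplicate_count": "3"},
    {"clone_id": "clonotype2", "duplicate_count": "7"},
]
candidates = clone_abundance_candidates(rows, "clonotype1")
```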
@bcorrie confirming the above for 10X clone_abundance. In the original file my description was wrong but in the 10x_rgmt tab for clone_abundance I did use count(clone_id), so we are in agreement. I fixed this and uploaded the new version in the original post. Also here, for convenience: |
From the Clone spec:
|
From the call:
|
@scharch about to implement this for 10X, and I am on the verge of being confused again 8-) If I understand correctly we now should have:
I am not sure what |
To be VERY specific for 10X, clone_count = count(unique cell_id for each clone_id) E.g. For clonotype1 in a 10X airr_rearrangements.tsv
Gets all the data for a specific clone_id, extracts the cell_id, gets the unique cell_id list, and counts the unique cells... Assumes $1 is |
PROFIT!!!! |
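The same rule, count(unique cell_id for each clone_id), sketched in Python for all clones at once. It assumes the file has `clone_id` and `cell_id` columns as in a 10X airr_rearrangements.tsv; the function name and the in-memory example data are hypothetical.

```python
import csv
import io


def unique_cells_per_clone(tsv_handle):
    """Map each clone_id to the number of distinct cell_id values it contains."""
    cells = {}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        # Skip rows that lack either identifier.
        if row["clone_id"] and row["cell_id"]:
            cells.setdefault(row["clone_id"], set()).add(row["cell_id"])
    return {clone: len(ids) for clone, ids in cells.items()}


# Made-up stand-in for an airr_rearrangements.tsv file.
example = io.StringIO(
    "sequence_id\tcell_id\tclone_id\tlocus\n"
    "s1\tAAAC-1\tclonotype1\tIGH\n"
    "s2\tAAAC-1\tclonotype1\tIGK\n"
    "s3\tGGGT-1\tclonotype1\tIGH\n"
    "s4\tTTTA-1\tclonotype2\tIGH\n"
)
clone_counts = unique_cells_per_clone(example)
```

With a real file, `open(path, newline="")` can be passed in place of the `StringIO` object.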
And I think you are correct. :) When you load in the cellranger data, rename Quoting @scharch:
Ie, we've used The break is that for bulk sequencing, UMI ~ transcript in any cells; for single-cell, UMI ~ transcript within a given cell. So UMIs were (badly) approximating cell count in bulk, whereas they directly reflect intra-cellular expression in single-cell. |
As per #543 * Update `_count` field language and add `umi_count` to Rearrangement. Co-authored-by: Jason Vander Heiden <[email protected]> Co-authored-by: Scott Christley <[email protected]>
Hello!
The iReceptor team is hoping to load some clones data, so we've been reviewing the clones spec.
I've created a mapping file with my best guess at mapping clones output from MiXCR, Change-O, 10X, and ImmuneDB.
clone.mapping_with_counts_examples-2021-10-29.xlsx
Hoping for a review from the community to see if we are in agreement. Some specific questions that came out of this exercise:
It seems that most tools do not include v/d/j alignment start/end positions in the output. Am I missing something here - maybe a data processing step or a command-line argument?
For single-cell data, it became clear we'll need to resolve this issue: extend clones to the single-cell context. Specifically for 10X data, clone_id comprises multiple loci, so we should decide if/how to split this out. We suggest modifying clone_id as follows (example is for BCR data):
change clone_id = “CLONEA” to three clones with clone_id = “CLONEA_IGH”, clone_id = “CLONEA_IGL”, and clone_id = “CLONEA_IGK”.
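A rough illustration of the suggested renaming, assuming each rearrangement row carries `clone_id` and `locus`. The helper `split_clone_id_by_locus` is hypothetical and only shows what the proposal would do to the identifiers.

```python
def split_clone_id_by_locus(rows):
    """Return copies of the rows with the locus appended to clone_id."""
    # dict(row, clone_id=...) copies the row so the input is left untouched.
    return [dict(row, clone_id=f"{row['clone_id']}_{row['locus']}") for row in rows]


# Hypothetical 10X-style rows where one clone_id spans several loci.
rows = [
    {"clone_id": "CLONEA", "locus": "IGH"},
    {"clone_id": "CLONEA", "locus": "IGK"},
    {"clone_id": "CLONEA", "locus": "IGL"},
]
split_ids = [row["clone_id"] for row in split_clone_id_by_locus(rows)]
```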
Thanks!