-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
more documentation for duplicate_count #161
Comments
Agreed, we need to describe this in more detail. To make sure we are all on the same page, my interpretation of what the docs should say is:
Is this correct? |
Actually I think it is switched around based upon what Jason has mentioned before.
However, In the first case of pre-processing, you can imagine that there can be two or more rows of sequences (that differ slightly in length or nucleotides), each with their own Maybe this ambiguity is not good, and we should have a separate I think the intent of |
Hrm. I think you two (@schristley and @bussec) might be saying the same thing? Not sure. I agree with the definitions list by @bussec... but I suspect the meaning of "UMI-collapsed reads" is causing confusion. First, I think we should avoid the word "clone" in this discussion, because its meaning is ambiguous. In a BCR context it usually means a collection of related variants, but in a TCR context it usually means a collection of identical sequences. Same thing - all sequences assumed to represent the same original V(D)J recombination event. But, they are counted in a different way. I think Mathieu means the TCR definition. I see the fields as: consensus_count: The "confidence level" of the sequence. The number of reads used to build a consensus sequence. I only use this when the count is purely technical with no real biological meaning. Usually, this would be the number of reads contributing to a UMI consensus sequence. But, it could also be from some other error correction approach, such as clustering sequences inferred to differ only by sequencing error and aggregating them somehow (see IgReC for an example). duplicate_count: The "copy number" of the sequence. This could be the count of UMIs for each unique sequence (as measured by the number of identical copies) or identical sequence counts for non-UMI protocols. I'm pretty certain the latter is exactly what the Vidjil group wants. So... I think that's the same thing you're both saying? Example: After UMI consensus generation, but before duplicate removal:
After duplicate removal:
For non-UMI protocols, I just wouldn't include I think we should wait on defining a |
In SONAR, I have implemented duplicate_count as the number of pre-VDJ assignment with exact length and nucleotide sequence matches. Because we don't use UMIs, SONAR does error handling by clustering reads at the expected error threshold, and I implemented a custom cluster_count to hold the number of reads that go into each cluster, sort of the opposite of the way @javh defined consensus_count vs duplicate_count above. I've also put in a clone_count field, though I agree that more properly belongs in a lineage schema. |
FYI, we just ran into this in a data set provided by a collaborator... MiXCR's clone assignment process is here: https://mixcr.readthedocs.io/en/master/assemble.html The question is, to which MiAIRR/DataRep field does this get assigned? |
Sorry, in this case it is the cloneCount field from the MiXCR annotation output. Which you might say obviously should be mapped to clone_count, but given the above discussion I was not clear. And I suspect that given that MiXCR is a commonly used annotation tool, I suspect that we might want to make sure we are clear as to what this mapping should be. @emilyvcbarr or @nishanth would be able to comment further on which MiXCR processing steps that produce this output file. |
Hello all, just following up on this issue - hoping to maybe close it out... Any input as to the mappings from MiXCR clone count fields to the AIRR fields? |
We discussed duplicate_count and consensus_count a bit more at MinStd this week. The suggestion was that for each rearrangement record the duplicate_count and consensus_count may or may not exist. The meaning of these fields would depend on the parameters in Section 3 and Section 5 of MiAIRR (PCR prep and Data Processing) and that the actual meaning and interpretation of these counts would likely be ill-defined without considering the higher level metadata. That is, a researcher would need to know information from Sec 3 and Sec 5 in order to interpret (and more importantly compare/analyze) the consensus_count and duplicate_count provided in two data sets (repertoires) from two different studies. This means that when providing simple summary statistics (counts of rearrangements) about aggregate data across many repertoires, it is problematic (and possibly misleading) to aggregate the consensus_count and duplicate_count because of the interpretation required. Does my interpretation make sense given the above discussion. |
Yeah, I think that's accurate. This is further confounded in the case of single-cell data with contig assembly. For example, 10X V(D)J will have reads that went into building the contig and UMI count (expression level). Because I don't have a use for the contig read count, I've been using But, the semantics are misleading in that case, because in bulk UMI projects, the I'm not sure exactly what to do about it, but we should maybe bring this topic up in the DR-WG? I'm starting to think we should maybe have |
Is it possible to separate out the case where a field unambiguously acts as a multiplier for abundance for the rearrangement (versus for clones)? That strikes me as the most important to be unambiguous. If programs need to perform some sort of logic to determine how a field is used, that can be error prone. |
A thought on this as someone who is NOT a domain expert. It seems to me like we are starting to get into "analysis". Admittedly very simple analysis, but analysis just the same... As soon as you start aggregating things with a non-identity (or not precisely defined) equality test then the test you use for equality is open to interpretation. More importantly, for someone to use the aggregation you have they will need to have a description of how the data was aggregated so they can decide if it makes sense for them to use the aggregation in comparisons. I suppose I would argue that if we are talking about data sharing and in particular comparing two data sets, having an aggregation that can't be used across data sets isn't particularly useful from a AIRR/MiAIRR data sharing perspective. Certainly important from an analysis perspective, but... UMI/consensus_count and exact sequence/duplicate_count (equality) seem to me to be well defined. I was hoping that we could define them exactly, but it seems like this is not the case. At the same time, it seems like the confusing thing is not so much that we can't come up with a well defined definition, but it is more that the definition would not be adequate for many different cases (the definition would be too limiting?). Would it make sense to try to define these precisely (so there are a small number of well defined aggregation metrics), recognizing that if other aggregation metrics are required (other count_* fields) to capture other things there would be other mechanisms to define and record such counts? It seems like if we do define such counts, we also need a mechanism to describe how the aggregation was done... I would argue that such aggregation metrics (the count_* fields) are definitely starting to go down the analysis path, and it seems to me that we should be separating the storing of the fundamental annotation data from the analysis artifacts that are created for that data. I suppose what I am wondering is whether analysis results (including aggregations that are loosely defined) should be treated as metadata that is attached to the "basic" or "fundamental" annotation data that we currently have in our existing AIRR DataRep format??? |
This isn't problematic to me. In my mind, "pre-processing" of the raw data is "analysis" because you have to make decisions about quality, and what to include/exclude with your filters. Though I do agree that leaving a field open for interpretation is best avoided.
Are these different cases listed anywhere? I cannot think of the top of head what they would be. We've done this before so it's reasonable that we don't try to represent everything. Related to this is the issue that some experimental protocols aren't "quantitative" in the sense that any counts or quantities don't directly relate to the number of receptors/cells in the biology. They might be counting something else like the read coverage, or the number of RNA molecules for example. Comparisons with other data might produce incorrect results. |
I think that is the issue I raise... We are already capturing analysis steps, and I think we are going to hit a brick wall soon because of the explosion of possible different analysis steps. For the pre-processing you speak of (which we can consider analysis) we have an explicit, well defined process (quality control) and a metadata section that captures the parameters used for that process (e.g. the metadata we capture in SequencingRun and/or DataProcessing) to produce some data. We have been able to capture that because we think it is well defined enough to capture. This is similar if you consider the processing that is done for annotation in that we have a metadata section that describes how that was done (e.g. in DataProcessing). In this instance we are talking about collapsing sequences that have been annotated. My argument is that if we can define this precisely, then we can capture it in our current metadata. If duplicate_count is defined to be an exact sequence match, then we make it clear that that is what we mean by duplicate_count and we don't need to do any more and we can use duplicate_count as is. If we are considering other types of processing/analysis (which we should be) such as collapsing sequence reads using non precise or fuzzy equality metrics or clone counts that produce other analysis artifacts (e.g. @javh count_clone or my new count_duplicate_fuzzy) then we need a metadata section that describes how the process (e.g. clone construction or fuzzy duplicate count) was performed and the parameters used for that process (e.g. number of NT that can differ in my fuzzy duplicate count match). In the general case, we need a mechanism that links 1) an analysis step (an algorithm) with 2) a set of parameters used in the analysis step with 3) a set of analysis artifacts produced using a specific combination of 1) and 2). This is basically a mechanism to incrementally add analysis layers to a set of data, in principle adding more and more complexity to the analysis. My understanding is that because the domain starts to get so rich and complex at this point (everyone has their own favorite tool or technique), a general, flexible, and extensible mechanism is required to capture all of the cases. It seems like now might be the time to consider a general solution, as it doesn't seem practical to keep extending well defined analysis parameters - because they aren't well defined any more. |
The cases that I was talking about were what @javh mentioned (below):
There are many count_* fields that one could imagine, @javh mentioned four here. I am suggesting that for any count_* field that isn't well defined (is algorithmic) for it to be useful you need to be able to describe how it was calculated so that users of the data know whether things are comparable or not. count_clone is probably the best example, because it probably has a wide variety of viewpoints on what it means.
I think we are agreeing 8-) I am suggesting that in this more general analysis world we need to have a general mechanism for linking emerging "DataRep fields" (e.g. count_clone) with a general analysis description of how they were calculated (my clones are exact CDR3 matches) so that a researcher can decide whether it makes sense to compare clone counts across two different analyses. I think we want to do that without pre-defining the count_* fields (or all the other analysis fields that will arise). If a researcher can't compare two data sets because the clone counts are not comparable, then they either need access to the underlying data so that they can generate the "desired" count_clone that they can use for comparison or the won't be able to use that data for their comparison. As you suggest, what we definitely want to avoid is having someone compare count_clone fields between two data sets because they exist when the mechanism for determining count_clone in the two data sets was not comparable... |
Yes, but isn't this already taken care of by including software versions and parameters/command lines in the metadata? I think it's fine to expect users to lookup how ChangeO defines clones versus SONAR, instead of requiring that to be explicitly repeated into the metadata. In fact, as we get further down the pathway of "analysis," the metadata would explode if the details of every step had to be spelled out. Certainly there are exceptions to this, like undocumented tools. But in that case, what are the odds that the person using that tools is going to back-fill the metadata, anyway? Maybe there needs to be a field for cases of manual curation, but I would expect those to be fairly rare. |
I agree with this. This can be formalized more but I think there are diminishing returns for the usefulness of that metadata, and IMO this gets beyond the scope of DRWG.
Yes true but also no. From a scientific standpoint these comparisons are done all the time as part of data exploration, comparison of methods and so forth. If Regardless, I understand your point, if you want to show some sort of graph, which uses abundance numbers, is there any way to insure those numbers are comparable across studies? I think this can be done for quantitative experimental protocols with some well-defined fields. What I worry about are the non-quantitative protocols, which can still be used for comparison, but they might need to be normalized in a certain way to make that comparison valid or useful. It seems to me that the fields
|
I was just chatting with @psathyrella and we might have another problem with the
|
Interesting, it never occurred to me to interpret |
That's a good point about |
I agree, better not to change the field name at this point. |
I wasn't actually confused because of the name -- it's somewhat ambiguous, but I think most choices also would be. I was confused because to me in the description:
"copy number" sounded like the total number of copies, i.e. including the "main" sequence, while "number of duplicate observations" sounded like it excluded it. I think there's other ways to read this description so it's not contradictory, but either way it seems easy to just make it super unambiguous. |
@javh This has come up again as I've been processing single cell datasets. 10x is putting a read count in For example, say I do genomic DNA bulk sequencing, with little to no amplification, the number of sequences is pretty close to the number of genomes. |
@schristley, I think 10x is using it correctly and the semantics of They are using In changeo, we've been using the custom field If we had it to do over again, and maybe we should, something like this seems more compatible with the current menu of protocols:
The problem with the above is that fields have overlapping semantics for some protocols. |
Can we keep
I know of some 5' RACE that doesn't use UMIs, even though they probably should...
This might actually be ok as it could be a quick qualitative check about the clonal assignment process. |
Yeah, we could probably skip |
That's true,
Maybe it would help in the documentation to provide invariants?
|
@kira-neller can you have a look and comment on this - both from your recent 10X experiences as well as how we have curated data in the past. Does @schristley's algorithm above work for what we do - in particular when we "unroll" based on duplicate count? I think it does... @schristley what is |
Yes, potential rename of |
To clarify for those of us who are slow: so, just adding a new
I would changed the
Yes, I think this is fine
I have trouble getting worked up about this, let's just pick one and then make sure the documentation is clear.
Took me a couple tries to parse these. I think they're useful, but I would also advocate including something like @javh's demo above (#161 (comment)), because that seems like the easiest way of clarifying things, to me. |
Well, I'm really averse to breaking backwards compatibility if we can avoid it for the sake of all people's codes. :)
Just adding |
I thought we decided up-thread that |
Makes sense to me. We could just put a note in the docs that you can copy |
@bcorrie per 10X documentation they define the two fields as: duplicate_count | The number of unique molecular identifiers associated with this rearrangement. For 10X data, each sequence in the AIRR.tsv is an assembled contig. Let's say we have a contig sequence Z built from reads associated with 2 different UMIs, A and B. As I understand it, if there are 3 UMI-A reads, they contribute +3 to consensus_count, but only +1 to duplicate_count. If there are 5 UMI-B reads, this adds +5 to consensus_count and +1 to duplicate_count. So for contig Z you end up with consensus_count = 3 + 5 = 8; duplicate_count = 1 + 1 = 2. When we unroll, it is based on duplicate_count. Per the example above, our unrolled AIRR.tsv would have the annotation for contig Z present twice. I'm not completely sure if this agrees with @schristley's definition above; is an assembled contig from a single-cell equivalent to a "sequence in post-processed RAW reads"? |
Hi @schristley @javh @scharch, just pinging you on this issue per yesterday's Standards Call. This is also related to #543 Thanks! |
We didn't really get into this one in the November call, but related discussion at: #543 (comment) My personal sense, based on both the desire for parallelism to the
I think |
From a comment by Mathieu of vidjil group:
In particular, we used the
duplicate_count
key to describe the number of reads gathered in each clone, as you suggested us in your previous email. To enable a better interoperabilityn maybe this suggestion could be clearly stated in AIRR documentation of this field? For example, taking inspiration on what was already on thesequence_id
field, "duplicate_count (...) This may also be a total number of gathered sequences in cases where query sequences have been combined in some fashion by the tool."The text was updated successfully, but these errors were encountered: