Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clone updates #586

Merged
merged 16 commits into from
Apr 25, 2022
Merged

Clone updates #586

merged 16 commits into from
Apr 25, 2022

Conversation

bcorrie
Copy link
Contributor

@bcorrie bcorrie commented Feb 12, 2022

Closes #543
Closes #161

@bcorrie
Copy link
Contributor Author

bcorrie commented Feb 12, 2022

@javh @scharch would one of you be able to update the "description" field in the spec for these two counts. I don't know what the right wording should be as per the discussion here: #543 (comment)

The old descriptions (which are still in the spec) are:

        umi_count:
            type: integer
            description: Number of Rearrangement records (sequences) included in this clone.
        clone_count:
            type: integer
            description: Non-normalized absolute count of the number of members (immune cells) in this clone.

I think maybe the clone_count is OK, but the umi_count I am pretty sure needs an update.

If you want you could add duplicate_count and a description as well... 8-)

@javh
Copy link
Contributor

javh commented Feb 21, 2022

Thanks, @bcorrie. I pushed some updates for us to hammer on. I also added umi_count to Rearrangement and changes some of the wording in there.

One of the things we need to figure out is whether clone_count should be the count of:

  1. Total sequences.
  2. Unique sequence.
  3. Either of the above, depending upon tool and analysis context.

I'm certain we talked about this before - I'll try to find where later.

@schristley
Copy link
Member

I put the description for clone_count back to its original, which is to imply that the clonal assignment tool, or companion tool, decides what is the clonal abundance (count).

@javh
Copy link
Contributor

javh commented Feb 22, 2022

@schristley, I don't think the original description for clone_count is correct (which is why I changed it). Two examples where this is plainly wrong are:

  1. Classical bulk BCR sequencing clone sizes, which are often calculated as the number of unique variants.
  2. Adaptive's data which does not provide non-normalized counts - only counts normalized to their standard curves.

In both cases, it's a best-effort to approximate cell count, but it's also not cell count nor raw counts.

@schristley
Copy link
Member

@schristley, I don't think the original description for clone_count is correct (which is why I changed it). Two examples where this is plainly wrong are:

@javh Okay, but your description makes it sound like it's just a count of rearrangement records. The original intent #471 of clone_abundance which got renamed to clone_count was to record what the analysis tool (and/or experimental protocol) inferred to be the clone size. "Non-normalized absolute" is meant to indicate that a frequency shouldn't be provided, as other analysis might want to perform their own normalization. Is this better?

Non-normalized absolute count of the inferred number of members (immune cells) in this clone.

@javh
Copy link
Contributor

javh commented Feb 22, 2022

"just a count of rearrangement records" will often be correct. Whether it's total sequences or unique sequences will depend upon whether duplicates were removed before calculating clone_count.

"Non-normalized", "inferred" and "immune cells" are all problematic language.

@javh
Copy link
Contributor

javh commented Feb 22, 2022

Some calculations to cover for clone_count:

  1. Total number of Rearrangements.
  2. Number of unique variants within the clone. Same as above if you've removed duplicates.
  3. Total number of unique cell_id. In practice this is the same as (1), but with cell_id grouping.
  4. Total number of raw reads and Adaptive's read count corrected by their standard curve. These two are the same thing w/ and w/out PCR amplification bias correction.

These are the most common ones that occur to me; excluding statistical inference methods (model fitting, unseen species correction, etc).

My wording didn't cover all these cases either...

@schristley
Copy link
Member

"just a count of rearrangement records" will often be correct. Whether it's total sequences or unique sequences will depend upon whether duplicates were removed before calculating clone_count.

That may be so, but if it's not always true then it's trouble because somebody like Brian is going to read it literally and implement it that way everywhere without considering cases when it isn't true.

"Non-normalized", "inferred" and "immune cells" are all problematic language.

How about something like this:

Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, or it may be a more sophisticated analysis calculation intertwined with an experimental protocol. Absolute count is provided versus a frequency so that downstream analysis tools can perform their own normalization. The standard frequency can be calculated by dividing by the clone_count sum for all clones in the repertoire.

@javh
Copy link
Contributor

javh commented Feb 22, 2022

@schristley, I like it. That seems to cover everything better than the original wording and my wording. We can probably drop the last sentence, as we don't need to provide suggestions for how to perform relative abundance calculations that aren't covered by a standard field.

How about this edit?

Absolute count of the size (number of members) of this clone in the repertoire. This could simply be the number of sequences (Rearrangement records) observed in this clone, the number of distinct cell barcodes (unique cell_id values), or a more sophisticated calculation appropriate to the experimental protocol. Absolute count is provided instead of a frequency so that downstream analysis tools can perform their own normalization.

@schristley
Copy link
Member

We already starting making changes for #161 so we should make the other changes as well.

@scharch
Copy link
Contributor

scharch commented Feb 22, 2022

NOTE: This PR does not address #317

@javh javh self-requested a review April 18, 2022 18:28
@javh javh added this to the AIRR v1.4.0 milestone Apr 18, 2022
@bcorrie
Copy link
Contributor Author

bcorrie commented Apr 22, 2022

@javh there is a the Travis CI check that is not passing. Expected but waiting - is that expected?

bcorrie added 6 commits April 22, 2022 16:43
As required by consistency checks.
Really it checks white space!!! 8-)
Trying to get consistency to pass, does it really require the nullable property?
@bcorrie
Copy link
Contributor Author

bcorrie commented Apr 22, 2022

Finally - synced all the specs so they pass consistency check 8-)

Hope I didn't screw up any definitions in the progress - @javh you are on for review, can you double check...

@javh
Copy link
Contributor

javh commented Apr 25, 2022

Looks good to me. Thanks, @bcorrie. I'll merge this.

That weird Travis "expected but waiting" thing should be gone in every PR now.

Also, we need to get in touch with 10x about the umi_count field being in v1.4.

@javh javh merged commit 746e55e into master Apr 25, 2022
@javh javh deleted the clone-update branch April 25, 2022 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clone spec mapping for common tools more documentation for duplicate_count
4 participants