Variant method to get all alleles as a string #2181

Closed
jeromekelleher opened this issue Apr 5, 2022 · 9 comments
Labels
Python API Issue is about the Python API
Milestone

Comments

@jeromekelleher
Member

jeromekelleher commented Apr 5, 2022

There are often times when we'd like to get the genotypes for a given site as the actual alleles rather than the indexes into the alleles list. (This is what #2168 was about, I assume.)

We can add this as a method to the Variant class easily enough. We can just raise an error if there are any non-1 length alleles in there.

What do we call it?

This would also replace the old as_bytes option, which did essentially the same thing.

It might look something like

def genotypes_as_alleles(self) -> str:
    """
    Returns a string s in which s[j] is the value of ``var.alleles[var.genotypes[j]]``.
    Raises an error if any allele is not of length 1.
    """
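
For illustration only, here is a minimal sketch of that behaviour as a free function over plain Python sequences. It is not the tskit implementation; the standalone signature and the choice of `ValueError` are assumptions, and missing data is not handled.

```python
from typing import Sequence


def genotypes_as_alleles(alleles: Sequence[str], genotypes: Sequence[int]) -> str:
    """
    Return a string s in which s[j] == alleles[genotypes[j]].

    Hypothetical free-function version of the proposed method. Raises
    ValueError if any allele is not exactly one character long.
    """
    for allele in alleles:
        if len(allele) != 1:
            raise ValueError(f"allele {allele!r} is not of length 1")
    # Look up the allele for each genotype index and join into one string.
    return "".join(alleles[g] for g in genotypes)


print(genotypes_as_alleles(("A", "T"), [0, 1, 1, 0]))  # ATTA
```

A multi-character allele such as `"AT"` makes the function raise, since `s[j]` could then no longer correspond to sample `j`.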
@jeromekelleher jeromekelleher added the Python API Issue is about the Python API label Apr 5, 2022
@jeromekelleher jeromekelleher added this to the Python 0.5.0 milestone Apr 5, 2022
@jeromekelleher
Member Author

Note that some of the changes made to the test suite for #2172 can be made simpler using this functionality.

@benjeffery
Member

Might it be nice to have a general method:
tskit.genotypes_as_alleles(alleles, genotypes), tskit.genotypes_as_ragged_alleles(alleles, genotypes) then
Variant.genotypes_as_alleles would use this?

@jeromekelleher
Member Author

Good idea - I guess the distinction is whether you want the returned value to be a numpy array or a string. There's definitely value in getting a numpy array back too.

@jeromekelleher
Member Author

@hyanwong - any thoughts on names here? I feel like this is quite a handy feature and we should support "encoding" the variant data in string or numpy format. We definitely shouldn't call it "encode" though, as we already have the "decode" method which does something quite different.

@hyanwong
Member

hyanwong commented Apr 6, 2022

The reason I wanted this was to be able to compare sites from different encodings of the same tree sequence (which therefore could have alleles in a different order). I "hacked" around this in tskit-dev/tsinfer#652 (comment) by using ts.genotype_matrix(alleles=tskit.ALLELES_ACGT), but it isn't an obvious thing to do, and I can imagine it catching out other people who want to compare the sites-by-samples matrix, or do sitewise comparisons. So I think this is handy, yes. Another possibility for my use case would be the ability to compare the variant genotypes with each other, so that you could do (say) the equivalent of ts1.variant(4).identical_genotypes(ts2.variant(4)) or similar notation.
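
To make the problem concrete, here is a small sketch using plain Python lists rather than tskit objects. The helper names `allelic_states` and `identical_states` are hypothetical and only stand in for the kind of comparison described above.

```python
def allelic_states(alleles, genotypes):
    """Map each genotype index to the allele it refers to."""
    return [alleles[g] for g in genotypes]


def identical_states(alleles1, genotypes1, alleles2, genotypes2):
    """Compare two variants by allelic state, ignoring allele order."""
    return allelic_states(alleles1, genotypes1) == allelic_states(alleles2, genotypes2)


# The same site in two encodings whose allele lists are in a different order.
a1, g1 = ("A", "T"), [0, 1, 1, 0]
a2, g2 = ("T", "A"), [1, 0, 0, 1]

print(g1 == g2)                          # False: the raw indexes differ
print(identical_states(a1, g1, a2, g2))  # True: the allelic states agree
```

Comparing the raw index arrays reports a difference even though every sample carries the same allele in both encodings.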

Re naming: genotypes_as_alleles does what it says on the tin, so that seems OK. My only reservation is that it might be a bit cryptic to those not well versed in the tskit distinction between "genotype" (an index value) and "alleles" (the list being indexed). Also note that we are using "genotype" in a rather nonstandard way here, because that's usually considered an unphased property in a diploid (your genotype can be homozygous A, heterozygous, or homozygous B).

By the way, you say "encode", but I wonder if most people would think of this as a decoding of the indexing scheme? I can't think of any brilliant alternatives which avoid "encode/decode", but here are some ideas: tskit.translate_genotypes(a, g), tskit.true_genotypes(a, g), tskit.site_variation(a, g), tskit.allelic_variation(a, g), tskit.allele_states(a, g), tskit.allele_values(a, g), tskit.genotype_values(a, g).

@hyanwong
Member

hyanwong commented Apr 6, 2022

p.s. a numpy array would be great: then you wouldn't need to bomb out on non-single-character alleles, I guess?
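
As a rough sketch of why that works, assuming only numpy (the function name `genotypes_as_allele_array` is hypothetical and this is not the tskit implementation): indexing an array of allele strings with the genotype indexes gives one string per sample, whatever the allele lengths.

```python
import numpy as np


def genotypes_as_allele_array(alleles, genotypes):
    """
    Return a numpy array of strings a in which a[j] == alleles[genotypes[j]].

    Hypothetical helper: alleles may have any length, including zero,
    so no length check is needed. Missing data is not handled.
    """
    allele_array = np.array(alleles, dtype=str)
    # Integer-array indexing picks out the allele for each sample.
    return allele_array[np.asarray(genotypes, dtype=np.int64)]


states = genotypes_as_allele_array(("A", "ACGT", ""), [0, 1, 2, 1])
print(states.tolist())  # ['A', 'ACGT', '', 'ACGT']
```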

@hyanwong
Member

hyanwong commented Feb 21, 2023

This is what #2617 is about. In that PR we simply return a numpy array of strings, and after discussion, called this variant.states()

@hyanwong
Member

Fixed in 0.6.0

@benjeffery
Member

Fixed in #2617
