Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add states() method to return actual genotype states as characters #2617

Merged
merged 1 commit into from
Oct 16, 2024

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented Oct 30, 2022

Description

Adds variant.genotype_values(). See tskit-dev/tsinfer#739

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@codecov
Copy link

codecov bot commented Oct 30, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.07%. Comparing base (05f8bef) to head (0e3f1fa).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2617   +/-   ##
=======================================
  Coverage   87.07%   87.07%           
=======================================
  Files          11       11           
  Lines       24666    24666           
  Branches     4556     4556           
=======================================
  Hits        21478    21478           
  Misses       1824     1824           
  Partials     1364     1364           
Flag Coverage Δ
c-tests 86.69% <ø> (ø)
lwt-tests 80.78% <ø> (ø)
python-c-tests 89.05% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@hyanwong
Copy link
Member Author

I'm not sure why this is failing CI: everything is covered AFAICS.

Ready to merge, I reckon, pending approval of the name of the function.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - not sure about the name though.

python/tskit/genotypes.py Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

Also not clear that an Object array is what people would want. Would it be better to have a "missing data symbol" or something like we do for haplotypes?

@hyanwong
Copy link
Member Author

hyanwong commented Nov 1, 2022

Also not clear that an Object array is what people would want. Would it be better to have a "missing data symbol" or something like we do for haplotypes?

ISWYM, but using any particular character will mean that we can't use that for an allele type (which we might conceivably want to do). This is only going to bite for missing data, so I don't think it's too bad TBH.

@jeromekelleher
Copy link
Member

Hmm - how about writing some client code for this with and without missing data? Something like the verify function in tsinfer would be a good one. We definitely don't want to have to have different code paths for dtypes with and without missing data.

We use "N" as the default missing data character elsewhere and its fine. We should be able to use this function to compare easily with e.g. alignments.

Is the np.array(list(ts.alignments()) equal to np.array(var.states() for var in ts.variants()).T with and without missing data?

@benjeffery
Copy link
Member

@hyanwong Would you like this to be in 0.5.4?

@hyanwong
Copy link
Member Author

hyanwong commented Jan 4, 2023

Ah, good point. There's no hurry from my PoV. Is it just a question of my changing the name? I can do that easily enough.

@hyanwong hyanwong force-pushed the variant-strings branch 2 times, most recently from 9f178fe to 3b884b0 Compare January 7, 2023 18:17
@jeromekelleher
Copy link
Member

I don't think we should hurry this in as we don't want to get stuck with Object arrays unless we have to. They have subtly different semantics to arrays of unicode.

We should have some example client code for this before committing ourselves to a particular representation.

@hyanwong
Copy link
Member Author

hyanwong commented Jan 8, 2023

Sure. I have updated this so it should never return an object array anyway. But no hurry to get this in. It was to help @szhan and others who spent a fair bit of time comparing integer arrays between tree sequence before realising that they weren't actually comparable.

@hyanwong hyanwong changed the title Add genotype_values() method Add states() method to return actual genotype states as characters Jan 24, 2023
@hyanwong
Copy link
Member Author

hyanwong commented Sep 24, 2024

I think this also fixes #2181 . The name was changed to .states() after discussion with @jeromekelleher, but in #2181 he also suggested genotypes_as_alleles?

@hyanwong hyanwong force-pushed the variant-strings branch 2 times, most recently from e21faaa to efa8d08 Compare September 24, 2024 12:22
@hyanwong
Copy link
Member Author

This no longer returns object arrays, and so should be ready to merge, IMO, assuming the name .states() is acceptable.

@benjeffery benjeffery merged commit edf6a3c into tskit-dev:main Oct 16, 2024
6 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants