Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug assertion in ts.allele_frequency_spectrum #2923

Open
nspope opened this issue Apr 11, 2024 · 8 comments
Open

Bug assertion in ts.allele_frequency_spectrum #2923

nspope opened this issue Apr 11, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@nspope
Copy link
Contributor

nspope commented Apr 11, 2024

with: afs_error.trees.gz

import tskit
print(tskit.__version__) # '0.5.7.dev0'
ts = tskit.load("afs_error.trees")
ts.allele_frequency_spectrum() # works
ts.allele_frequency_spectrum(sample_sets=[[x for x in range(64)]]) # segfaults
# Bug detected in lib/tskit/trees.c at line 2721. If you are using tskit directly please open an issue on 
# GitHub, ideally with a reproducible example. (https://github.com/tskit-dev/tskit/issues) 
# Aborted (core dumped)

This example has one tree and one site. The segfault seems to be linked to the fact that there are multiple mutations at the site, but mutations.parent is tskit.NULL even when mutations are beneath another-- that is, if I dump tables and correct the mutations.parent column to reflect what mutation replaced what, then everything works.

This is output from SINGER, so I'll let them know that mutations.parent should be set during conversion to a tree sequence. But, it'd be nice to have the above example run through, or at least throw an informative error (note that ts.diversity works, although it returns a negative number for the "sample_sets" case above).

@petrelharp
Copy link
Contributor

That's here. That code is only called if it's unpolarized, however, doing it polarized hits essentially the same bug assert here.

You can just tell them to run tables.compute_mutation_parents(); that will do it.

@nspope nspope changed the title Segfault in ts.allele_frequency_spectrum Bug assertion in ts.allele_frequency_spectrum Apr 11, 2024
@petrelharp
Copy link
Contributor

Here's a MWE:

import tskit

tables = tskit.TableCollection(sequence_length=1)
tables.nodes.add_row(flags=1, time=0)
tables.nodes.add_row(time=1)
tables.edges.add_row(parent=1, child=0, left=0, right=1)
for _ in range(2):
    i = tables.nodes.add_row(flags=1, time=0)
    tables.edges.add_row(parent=1, child=i, left=0, right=1)

tables.sites.add_row(position=0, ancestral_state='a')
tables.mutations.add_row(site=0, node=0, derived_state='b', time=0)
tables.mutations.add_row(site=0, node=0, derived_state='b', time=0)

tables.sort()
ts = tables.tree_sequence()
ts.allele_frequency_spectrum(sample_sets=[[0]], polarised=True)

The issue here is that all the mutations to the same allele (here b) get counted here; so the algorithm thinks "okay here's an allele that's been inherited by 2 out of these 1 samples", whoops.

The mutation.parent is what's letting us properly count mutations that are nested in others; Even in cases without the bug assert, afs is reporting the wrong thing (heck, probably all the stats are).

@nspope
Copy link
Contributor Author

nspope commented Apr 11, 2024

Thanks Peter -- so, is the tskit_bug_assert the right exit point here, or should it raise a more specific error? If the former I'll go ahead and close this.

@petrelharp
Copy link
Contributor

Don't close it, I think? It would be nice to give a more informative error. The problem is definitely not with allele_frequency_spectrum, since having correct mutation parents is something we assume in a bunch of places. The right thing to do would be to check mutation parents when loading the tree sequence - and the way to do this is to re-compute them and see if we get the same answer. IIRC we discussed this, but decided it was kinda costly, so didn't want to do it. A more informative error here wouldn't fix the problem because other operations would also turn up wrong info. So, maybe we just need to check this on load, maybe with a flag to disable the check? Either that or decide not to fix it?

@petrelharp
Copy link
Contributor

For context, I think we decided previously that "gee there aren't very many softwares producing tree sequences; they'll just all need to learn to use compute_mutation_parents". Which is I guess what's happening.

@jeromekelleher
Copy link
Member

There's a general issue here that we've been too lax with checking mutation parents. We'll probably just have to figure out how to detect whether mutation parents are set properly at load time, and start raising errors if not. Otherwise these types of error will start creeping in more and more.

@jeromekelleher jeromekelleher added the bug Something isn't working label Apr 11, 2024
@petrelharp
Copy link
Contributor

We'll probably just have to figure out how to detect whether mutation parents are set properly at load time

I'm pretty sure the only reasonable way to do that is to just run compute_mutation_parents again: to check mutation parents are right, we've got to build and traverse the trees; and that's what compute_mutation_parents does. I don't see any simpler way to check for correctness, other than e.g. "if there's only one mutation per site then it's fine".

@jeromekelleher
Copy link
Member

Your probably right. We could check for cheaper things like if we have two mutations on a branch, the parent must be set, but that's a fairly limited check.

So, I guess the proposal would be to run compute_mutation_parents if we see more than one mutation at any site, and check?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants