Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add individual statistics (eg heterozygosity) #166

Open
petrelharp opened this issue Apr 1, 2019 · 24 comments
Open

add individual statistics (eg heterozygosity) #166

petrelharp opened this issue Apr 1, 2019 · 24 comments
Labels
enhancement New feature or request statistics

Comments

@petrelharp
Copy link
Contributor

Something like

def individual_heterozygosity(ts, ind):
   ts.pairwise_diversity(ind.nodes)

should do it - very simple, but we want to make it easy and natural for people to deal with individuals.

@hyanwong
Copy link
Member

hyanwong commented Apr 5, 2019

Out of interest, is this generalizable to non-diploids? And you might need to catch cases like the X chromosome.

@jeromekelleher
Copy link
Member

Do we want this for 0.2.0 @petrelharp, or can we push it down the road to 0.2.1? I'd like to ship a reasonably complete set of statistics as soon as we can - the question is where to draw the line.

@petrelharp
Copy link
Contributor Author

Push it down the road, unless there's someone else wanting willing to do this.

@jeromekelleher jeromekelleher added this to the Python 0.2.1 milestone Jul 3, 2019
@jeromekelleher jeromekelleher added the enhancement New feature or request label Sep 29, 2020
@jeromekelleher jeromekelleher removed this from the Python upcoming milestone Sep 29, 2020
@petrelharp
Copy link
Contributor Author

petrelharp commented Oct 7, 2020

Richard Kerr has asked about this on the email list, so I'm going to write out here how to do this in at least an inefficient way. Concretely, we want to compute mean individual heterozygosity: in other words, we want to (a) compute mean diversity (pi) among all the chromosomes of each individual, and (b) average these values. To do this for a tree sequence ts:

import numpy as np

def mean_heterozygosity(ts):
  ind_nodes = []
  for ind in ts.individuals():
    ind_nodes.append(ind.nodes)

  # the vector of per-individual heterozygosities:
  ind_het = ts.diversity(ind_nodes, mode="site")
  mean_het = np.mean(ind_het)
  return mean_het

There is interest in computing F_IS, as well. Referring to wikipedia,

  (1 - F_IT) = (1 - F_IS)(1 - F_ST) ,

where

  F_IT = 1 - (observed heterozygosity) / (expected heterozygosity)
           = 1 - (mean_het) / ts.diveristy(mode='site')

So, we can define

def F_IS(ts, sample_sets):
   mean_het = mean_heterozygosity(ts)
   pi = ts.diversity(mode="site")
   F_IT = 1 - mean_het / pi
   F_ST = ts.Fst(sample_sets, mode="site")
   return 1 - (1 - F_IT) / (1 - F_ST)

Note that this would be a just fine way to implement this in tskit if anyone wants to do it. It would be nice to do it in a more efficient way - if there are a lot of individuals in the tree sequence, this will take longer than it needs to, but it should be fine in many cases.

Comments or corrections welcome!

@benjeffery benjeffery modified the milestones: Python upcoming, C Upcoming Oct 8, 2020
@castedo
Copy link
Member

castedo commented Oct 23, 2020

I've been studying F_ST this year so this sounds like a good study project/homework for me to do. I've been reading about F_ST material in Weir's Genetic Data Analysis II book and the foundational papers by Wright, Cockerham, Hill and Weir. I've also written up a very formal mathy definition in F_ST: http://www.castedo.com/osa/136/

Does it make sense that this be coded into Python so it is most readable or in C so that it is more performant?

@benjeffery
Copy link
Member

@castedo There's an example in the dev docs of how to code in python first (as you need it for the tests) then once that has been through review to check it's doing the right thing, you do it in C.

@castedo castedo self-assigned this Oct 27, 2020
@castedo
Copy link
Member

castedo commented Oct 27, 2020

@benjeffery Thanks for that guidance, I'll follow that.

Not sure what the usual process is for assigning issues. I just assigned this to myself. I'm reading the paper "Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes" before starting to work on this. This looks like the high-level background and motivation for this area of the code.

Feel free to ping me if anyone wants to know how far along I am on this.

If anybody knows of specific computation scenarios where speed of computing this stat is relevant, let me know. Doesn't need to be something so well established as a benchmark. Just anything that is vaguely realistic. I just want to avoid imagining a perf scenario that doesn't really exist in research.

@jeromekelleher
Copy link
Member

Thanks @castedo, that sounds ideal. The benchmarks we did in the paper you mentioned will give a good idea of the sort of scenarios we're imagining, we'd be applying individual stats in fairly similar cases.

@petrelharp
Copy link
Contributor Author

This is excellent, ping for input on how to proceed! But sounds like you are going on the right direction.

@castedo
Copy link
Member

castedo commented Oct 30, 2020

I see many directions in the scope of this issue #166 :

  • A) add function(s) in the Python API with "heterozygosity" in the name
    • A.i) make sure works for non-autosomal chromosome pairs (e.g. X&Y, W&Z)
  • B) define and code a function which generalizes heterozygosity to non-diploid species
  • C) add function(s) in the Python API with "F_IT" and "F_IS" in the name
  • D) enhance documentation and Python API to show how many individual-based statistics can be achieved with "efficient building blocks" of the API

I can express a number of mathematical thoughts regarding B) and C). From a mathematical standpoint I see a number of intersting proproperties of these calculations which maybe folks would like to see. But I think B) and C) would be best tracked separately and made out of scope for this issue #166. Especially with the F statistics I see muliple definitions in the liturature depending on the model being estimated, whether the F statistic is an estimator or a proability model parameter (to be estimated) and how F statistics are combined across sites.

So that leaves A) and D) for which I'm torn on how to proceed. My inclination is to proceed on A) since it's focused and probably not a lot of work and discussion. For A) I can just do it and get it done. On the other hand, part of me thinks D) is the right thing to do. If folks would like to discuss why I think D) might be the right thing to do, just holler.

@petrelharp, input on proceeding?

I've read "Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes" (great work!) and now I'm going to do the tskit tutorial.

I do not yet see how non-autosomal chromosome pairs are handled in tskit (especially the pseudo-autosomal regions of the XY chromsomes). I think I'll see soon. If not, any pointers or tips appreciated.

@gregorgorjanc
Copy link
Member

@castedo my understannding is that there is no notion of calculating statistics per individual, yet, so there are only haploid chromosomes everywhere. @petrelharp correct me if I am wrong.

@castedo
Copy link
Member

castedo commented Oct 30, 2020

@gregorgorjanc one can view individuals as very small sample sets (sub-populations)(i.e. each individuals is a tiny weeny population of haploids/monoploids). So from that view, there are statistics, since there are statistics for sample sets. But yes, not at a high-level.

@petrelharp
Copy link
Contributor Author

petrelharp commented Oct 30, 2020

Let's see! First, ploidy is naturally encoded based on how many chromosomes an individual has. And I'm not aware of any individual-level statistics that are defined differently for, or specific to, non-diploids. For instance, heterozygosity is "what's the chance two randomly chosen alleles match"; this works fine for polyploids, and for haploids the value is nan. If we were working with, say, an X chromosome, some individuals would have 2 copies and some would have one.

I agree, I think we should get an API going, for (A). Note: by "heterozygosity" here we mean "observed heterozygosity", because there's also "expected heterozygosity", which is the diversity() of the individual's chromosomes all pooled together. This method would take an individual_sets argument, that would be a list of lists of individuals, and would work just like sample_sets: the kth component of the output would be obtained by (a) computing the observed heterozygosity (i.e., diversity) of every individual in individual_sets[k], and (b) averaging these values. For instance, here's an inelegant way to do it:

def observed_heterozygosity(self, individual_sets, **kwargs):
        out = []
        for iset in individual_sets:
            het = self.diversity(sample_sets=[self.individual(i).nodes for i in iset], **kwargs)
            out.append(np.mean(het, axis=<the axis that averages across individuals>))
        return np.array(out)

I toyed with the idea of just adding individual_sets to diversity, but that doesn't seem to be a natural thing for most other statistics, so decided against it. I also thought about calling it individual_heterozygosity, but that seems redundant, as the argument is called individual_sets.

Thinking about (D) it would help to have a list of what other individual-level statistics we want. As for F_IS I think we should not be afraid to pick what we think is the right definition; and we can always add others later by adding a method=xxx argument. Note that we shouldn't be afraid to pick a very simple one (e.g., not Weir & Cockerham) because the complicated ones matter much more when you don't have much data; here, we have a lot of data so there's no need to be so fancy.

@castedo
Copy link
Member

castedo commented Oct 30, 2020

@petrelharp thx! Quick question on the first part:

If we were working with, say, an X chromosome, some individuals would have 2 copies and some would have one.

Suppose a human zygote with Klinefelter syndrome (XXY) splits into almost identical twins where by some freak event, one twin gets copies of both distinct X chromosomes but not the Y and the other twin gets just one X chromosome and the Y. So they are identical twins normal in every way except one is female and the other male missing one of the X chromosomes of the sister twin.

Is there a consistent approach in the field as how heterozygosity is defined on the sex chromosomes such that one can say the female twin most likely has higher vs lower vs equal heterozygosity as the male twin?

It would make sense to me that the male twin would have much higher heterozygosity because the Y chromosome would essentially be accounted as being an X with a crazy massive set of mutations. It would not make sense to me to say heterozygosity is undefined for the sex chromsomes in the male twin or that the male twin has lower heterozygosity. But that's me, any sense what is normal in the field? Also @gregorgorjanc

@castedo
Copy link
Member

castedo commented Oct 30, 2020

And for simplicity of illustration, let say heterozygosity is 0.1 across all the non-sex chromosomes and 0.2 for the X and the pseudo-autosomal regions. And for numerical simplicity, the X chromosome is weighted at exactly 5% in the female genome.

So I think the female twin would have heterozygosity of 0.105 = (0.95 * 0.1 + 0.05 * 0.2). And the question is whether there is consistent definition in the field that would define heterozygosity for the male twin.

@petrelharp
Copy link
Contributor Author

Hm, well, it's safe to say that there is no standard definition of heterozygosity in that situation. But, I think that this would not be the way forward:

the Y chromosome would essentially be accounted as being an X with a crazy massive set of mutations
since the Y chromosome (apart from the PAR, as you say), is just not homologous to the X. It's a good point, that if you want to measure "at what proportion of sites does an individual have more than one distinct allele" then treating males as having zero heterozygosity on the Y would make sense. But I think it's a lot safer to return nan, since asking for heterozygosity of a single chromosome is usually an error (or should be). And with the generalizable definition ("probability that two random alleles differ"), you need to return nan.

@jeromekelleher
Copy link
Member

I don't think we can cover every possible corner case here @castedo - nature is pretty creative with coming up with weird stuff. "Does what you'd want in most cases" is what we're looking for initially. If there's a need to cover the more exotic cases then we can deal with that later.

@castedo
Copy link
Member

castedo commented Oct 31, 2020

So wrapping up on the scope of this:

  • A) add function(s) in the Python API with "heterozygosity" in the name

With @petrelharp's observed_heterozygosity Python example, I've got a great specification for A), thanks! The name is unambiguous and matches nicely and in an analogous way to all the other stats functions.

  • A.i) make sure works for non-autosomal chromosome pairs (e.g. X&Y, W&Z)

Handling sex chromosomes is way out of scope right now. There are challenging issues in trying to deal with this and I'll not worry about.

  • B) define and code a function which generalizes heterozygosity to non-diploid species

"observed heterozygosity" has a consistent meaning in the field (I think, I hope!) and @petrelharp's clatification is a totally logical extension of it and NaN is the value for monoploids. (As a side note, this definition of heterozygosity seems sensible for diploids, because it has a max value of 1. But it seem wonky to me to use it rather than 'gene diversity' (see #961), as a statistic to use across many types of ploidy. Just a sort of esthetic personal reaction.)

  • C) add function(s) in the Python API with "F_IT" and "F_IS" in the name

Hammer out the details in (or after) #858 and make a separate GitHub issue to track implementing F_IT and F_IS.

  • D) enhance documentation and Python API to show how many individual-based statistics can be achieved with "efficient building blocks" of the API

This is too hard to hash out now in Github issues. More appropriate for brainstorming face-(to-screen-to-)face and/or after all this other stuff become more clear.

@castedo castedo removed their assignment Mar 28, 2021
@castedo
Copy link
Member

castedo commented Mar 28, 2021

It's been 5 months now, so I should acknowledge I'm unlikely to get to this (so un-assigning myself). I hope I didn't cause too much useless noise here. Sorry about that if so.

@benjeffery
Copy link
Member

No worries @castedo! Thanks for updating the issue.

@petrelharp
Copy link
Contributor Author

Following on the algorithm for computing additive breeding values in #1547, here's an outline of an algorithm to compute the number of homozygous mutations per individual (which we could then use to compute individual heterozygosity). Or, more generally, given a vector s of length equal to the number of mutations, compute for each individual, the sum s[i] over all mutations i for which the individual is homozygous. This is what we'd need to do efficient computation of diploid fitness or phenotypes with dominance.

In the algorithm described here, partial sums for haplotypes are propagated down the tree, until they reach the samples. But now, output corresponds to individuals, not to samples; and furthermore mutations only count towards the output if they are inherited by the MRCA of the nodes of the individual. So, we can use that algorithm, nearly unchanged, as long as we also maintain, for each node in the tree, the set of individual indices for which that node is the MRCA.

The output will be out, and suppose that the output for individual i is stored in out[i]; x will be a vector of length equal to the number of nodes, maintained just as in the other algorithm. Also, for each node u, the possibly empty list ind[u] contains individual indices; individual i is in ind[u] iff u is the MRCA of all nodes in individual i; and uncoal is a list of all individuals not currently finding an MRCA.

  1. Initialize x to 0 and set uncoal to the list of all individuals.
  2. After adding an edge with parent p and child c:
    a. traverse from the root down to p, and for each node v on this path, add x[v] to x[u] for each child u of v. For each i in ind[v] add x[v] to out[i]. Then set x[v] to zero.
    b. Check which of uncoal find an MRCA in p or a parent of p thanks to the addition of this edge, and update accordingly.
  3. Update x in a similar way before removing an edge with parent p and child c. Also append ind[p] to uncoal and empty ind[p].
  4. If a mutation i appears on node u, add s[i] to x[u].

Step 1b is the hard one - I'm not clear on how efficiently it's possible to do that yet.
Athough most nodes in a tree are not MRCAs to many individuals, the higher-up nodes both turn over faster and are MRCAs to many individuals. So, step 1b could involve re-finding a lot of MRCAs quite often. However, we only have to consider a limited number of nodes.

@jeromekelleher
Copy link
Member

Sounds good @petrelharp. Maybe worth knocking up a simple Python version where we don't think too hard about efficiency?

@benjeffery benjeffery added this to the Python upcoming milestone Sep 7, 2021
@petrelharp
Copy link
Contributor Author

As this is showing up in the SLiM list, here's how to compute individual heterozygosities:

ts.diversity([i.nodes for i in ts.individuals()])

@hyanwong
Copy link
Member

As this is showing up in the SLiM list, here's how to compute individual heterozygosities:

ts.diversity([i.nodes for i in ts.individuals()])

Is there a neat way to get the mean here over all individuals, other than doing np.mean(ts.diversity([i.nodes for i in ts.individuals()]), which is rather slow for e.g. the 1000 genomes dataset? I presume doing this efficiently is what this issue is about, however. And it's not been revisited for a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request statistics
Projects
None yet
Development

No branches or pull requests

6 participants