Richard Kerr asked about this on the email list, so I'm going to write out here how to do it, at least in an inefficient way. Concretely, we want to compute mean individual heterozygosity: in other words, we want to (a) compute mean diversity (pi) among the chromosomes of each individual, and (b) average these values across individuals. To do this for a tree sequence ts:

import numpy as np

def mean_heterozygosity(ts):
  # one sample set per individual, holding that individual's nodes
  # (these nodes must be samples for ts.diversity to accept them)
  ind_nodes = [ind.nodes for ind in ts.individuals()]

  # the vector of per-individual heterozygosities:
  ind_het = ts.diversity(ind_nodes, mode="site")
  mean_het = np.mean(ind_het)
  return mean_het
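
To make steps (a) and (b) concrete without a tree sequence, here is a minimal numpy sketch on a made-up haplotype matrix. The data, the names `haplotypes`, `ind_columns`, and `toy_individual_het` are all hypothetical, and it assumes diploids and normalises per site (tskit's site mode normalises by sequence length by default), so the numbers are only illustrative.

```python
import numpy as np

# Hypothetical toy data: rows are sites, columns are chromosomes
# (0 = ancestral allele, 1 = derived allele).
# Columns (0, 1) belong to individual A, columns (2, 3) to individual B.
haplotypes = np.array([
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
])
ind_columns = [(0, 1), (2, 3)]

def toy_individual_het(haplotypes, ind_columns):
    """Fraction of sites at which each diploid's two chromosomes differ."""
    het = []
    for a, b in ind_columns:
        # step (a): pairwise diversity between the individual's two chromosomes
        het.append(np.mean(haplotypes[:, a] != haplotypes[:, b]))
    return np.array(het)

ind_het = toy_individual_het(haplotypes, ind_columns)
# step (b): average the per-individual values
mean_het = ind_het.mean()
```

Here individual A differs at two of four sites and individual B at one of four, so the per-individual values are 0.5 and 0.25.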

There is interest in computing F_IS as well. Referring to Wikipedia,

  (1 - F_IT) = (1 - F_IS)(1 - F_ST) ,

where

  F_IT = 1 - (observed heterozygosity) / (expected heterozygosity)
       = 1 - (mean_het) / ts.diversity(mode='site')

Solving the first identity for F_IS gives F_IS = 1 - (1 - F_IT) / (1 - F_ST), so we can define

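def F_IS(ts, sample_sets):
  # observed heterozygosity: mean within-individual diversity
  mean_het = mean_heterozygosity(ts)
  # expected heterozygosity: diversity among all samples
  pi = ts.diversity(mode="site")
  F_IT = 1 - mean_het / pi
  # with exactly two sample sets, ts.Fst needs no `indexes` argument
  F_ST = ts.Fst(sample_sets, mode="site")
  return 1 - (1 - F_IT) / (1 - F_ST)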
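
As a quick check of the arithmetic only, here is a sketch with made-up numbers. The helper `f_is_from_components` and the values passed to it are hypothetical, not results from any data set; it just confirms that the returned F_IS satisfies the identity above.

```python
def f_is_from_components(mean_het, pi, f_st):
    """Solve (1 - F_IT) = (1 - F_IS)(1 - F_ST) for F_IS (hypothetical helper)."""
    f_it = 1 - mean_het / pi
    return 1 - (1 - f_it) / (1 - f_st)

# Made-up illustrative values: observed heterozygosity 0.3,
# expected heterozygosity (pi) 0.4, and F_ST 0.1.
f_is = f_is_from_components(mean_het=0.3, pi=0.4, f_st=0.1)

# Both sides of the identity, recomputed from the same inputs.
lhs = 0.3 / 0.4                # 1 - F_IT
rhs = (1 - f_is) * (1 - 0.1)   # (1 - F_IS)(1 - F_ST)
```

With these inputs F_IT is 0.25 and F_IS comes out at roughly 0.167, and `lhs` and `rhs` agree up to floating-point error.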

Note that this would be a perfectly fine way to implement this in tskit if anyone wants to do it. It would be nice to do it more efficiently: if there are a lot of individuals in the tree sequence, this will take longer than it needs to, but it should be fine in many cases.

Comments or corrections welcome!

add individual statistics (eg heterozygosity) #166
