update mutation model to mimic an HKY85 model #21

Haddox · 2022-08-09T21:27:39Z

In antigen, there is a mutation rate $\mu$. In the original version of antigen, which did not model viral sequences, mutations changed a virus's antigenic phenotype. In our updated version, which does model sequences, each mutation also results in a single nucleotide change.

When a mutation event occurs in antigen, we need to decide how to mutate the sequence. Our current strategy involves two basic steps. First, we randomly choose a site to mutate in the gene (e.g., C12). Second, we randomly choose a nucleotide-level mutation given the wildtype nucleotide and a pre-specified transition/transversion ratio $\kappa$. This approach is similar to nucleotide-mutation models used in phylogenetics, but differs in a few ways.

The goal of this issue is to update the mutation model in antigen so that mutations occur at rates proportional to ones from the HKY85 mutation model. In this model, the rate at which nucleotide $x$ mutates to nucleotide $y$ is:

$$Q_{xy} = \begin{cases} \kappa\pi_y & \text{if } x \text{ and } y \text{ differ by a transition}\\ \pi_y & \text{if } x \text{ and } y \text{ differ by a transversion}\\ \end{cases}$$

where $\pi_y$ gives the expected frequency of nucleotide $y$ in the absence of selection. Our current mutation model in antigen differs from the HKY85 model in two ways. First, it does not account for $\pi_y$ parameters in choosing $y$. Second, by choosing a random site to mutate in the first step from above, it ignores that a site's mutation rate depends on the wildtype nucleotide $x$.

Below is an idea for how to update the mutation model in antigen. A simple idea is to replace the single $\mu$ parameter with $Q_{xy}$ parameters. However, in antigen, $\mu$ accounts for both mutation and selection. The virus population in a host is represented by a single virus with a defined antigenic phenotype and gene sequence. According to the literature, mutations seem to rarely fix in hosts. Thus, $\mu$ could be interpreted as the rate at which a mutation first occurs, then increases to an appreciable frequency, and then is involved in a transmission event. Since it is difficult to model each component of this process, the below idea retains the concept of $\mu$, but ensures that mutations occur at rates proportional to the HKY85 model. Specifically, when a mutation event occurs, the new model uses the following workflow to determine how to mutate the nucleotide sequence.

Step 1: choose a site to mutate: In the HKY85 model, the probability that a site mutates depends on its wildtype nucleotide. Specifically, the rate at which a nucleotide $x$ mutates to any of the other three nucleotides is:

$$R_x = \sum_y{Q_{xy}}$$

where the expression sums over each possible mutant nucleotide. Under this model, if there is a single nucleotide mutation somewhere in a gene sequence, then the probability that the mutation involves a wildtype nucleotide of $x$ is:

$$P_x = \frac{R_x n_x}{\sum_i{R_i n_i}}$$

where $n_x$ is the number of counts of nucleotide $x$ in the gene sequence, and the expression in the denominator sums over each of the four nucleotides.

We will mimic this process in the following way. When there is a mutation event in antigen, we will first randomly select one of the four nucleotides to be the wildtype nucleotide in the mutation, with selection probabilities of $P_x$. Then, we will make a list of all sites in the gene sequence with the randomly selected wildtype nucleotide (e.g., C12, C24, C28, etc.), and randomly choose one of those sites to mutate (e.g., C12). This list could be stored in a dictionary and updated for computational efficiency.

Step 2: choose a mutant nucleotide: Once we have chosen a site to mutate, we will randomly select a mutant nucleotide $y$ with probabilities proportional to mutation rates $Q_{xy}$.

Using these two steps, $Q_{xy}$ values in antigen should be proportional to $Q_{xy}$ values in HKY85, unless I am mistaken. What do you all think? Any other ideas?

The text was updated successfully, but these errors were encountered:

Haddox · 2022-08-09T23:42:06Z

Here is a simplified version of the above idea, including a new term that accounts for functional constraint. It is inspired by equations 1-3 from a recent phydms paper, where now, $x$ and $y$ represent codons. When a mutation event happens in antigen, we randomly select the mutation it corresponds to by drawing from the following probability distribution:

$$P_{r,xy} = \frac{Q_{xy} \times F_{r,xy}}{Z}$$

where:

$x$ is the wildtype codon
$r$ is the codon number
$y$ is the mutant codon (single nucleotide change away)
$Q_{r,xy}$ is the rate of the mutation according to the HKY85 model
$F_{r,xy}$ is the rate that the mutation fixes, as determined by amino-acid preferences
$Z$ is a partition function that ensures that all probabilities sum to one

Specifically, we define $F_{r, xy}$ in the same way as the phydms paper:
$$F_{r,xy} = \frac{\ln{[ (\pi_{r,y}/\pi_{r,x}})^\beta ]}{1-(\pi_{r,y}/\pi_{r,x})^\beta}$$

where:

$\pi_{r,x}$ is the preference for the amino acid encoded by codon $x$ at site $r$
$\pi_{r,y}$ is the preference for the amino acid encoded by codon $y$ at site $r$
$\beta$ is the stringency parameter

One way that the above idea violates the HYK model is that viruses can only get a maximum of one mutation per gene per day, since there is a maximum of one mutation event per day. Though, the mutation rate in antigen is quite low ( $10^{-4}$ mutations per host per day), so this might not be very much of a problem.

Erick pointed out that a more accurate approach would be to take a gene, cycle over each nucleotide, and apply the rate matrix (e.g., $Q_{r,xy} \times F_{r,xy}$) to each site. This would solve the above violation, as well as make things more interpretable. Though we don't know how much this would slow down the simulation since we'd need a for-loop over each site in the sequence.

matsen · 2022-08-10T12:52:06Z

I think this all looks good. The only thing that I can add is that in general, if one has a continuous-time Markov chain, there is a notion of a "jump chain" which is the discrete transition story after a single event. See, e.g. section 4 of this.

The definition is quite simple: the probability of moving to another state $j$ from state $i$ assuming that there is a single event is the normalized rate of going from $j$ from $i$ in the CTMC. This is an exact statement-- not an approximation. If one wanted to have multiple mutations one could sample the first and then the second conditioned on the first.

I believe that's exactly the statement being made here... if not let's discuss more.

If we're on the right track, what is the right computing formalism that will allow us to use these complex models without too much overhead?

Haddox · 2022-08-10T18:51:02Z

Thanks, Erick. Good to know about "jump chain". I also believe that's the statement being made here.

Thien and I just discussed an idea for how to implement this in the code without too much overhead. Either she or I will post something soon.

Haddox assigned matsen, Haddox, zorian15 and thienktran Aug 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update mutation model to mimic an HKY85 model #21

update mutation model to mimic an HKY85 model #21

Haddox commented Aug 9, 2022 •

edited

Loading

Haddox commented Aug 9, 2022 •

edited

Loading

matsen commented Aug 10, 2022

Haddox commented Aug 10, 2022

update mutation model to mimic an HKY85 model #21

update mutation model to mimic an HKY85 model #21

Comments

Haddox commented Aug 9, 2022 • edited Loading

Haddox commented Aug 9, 2022 • edited Loading

matsen commented Aug 10, 2022

Haddox commented Aug 10, 2022

Haddox commented Aug 9, 2022 •

edited

Loading

Haddox commented Aug 9, 2022 •

edited

Loading