Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update mutation model to mimic an HKY85 model #21

Open
Haddox opened this issue Aug 9, 2022 · 3 comments
Open

update mutation model to mimic an HKY85 model #21

Haddox opened this issue Aug 9, 2022 · 3 comments
Assignees

Comments

@Haddox
Copy link

Haddox commented Aug 9, 2022

In antigen, there is a mutation rate $\mu$. In the original version of antigen, which did not model viral sequences, mutations changed a virus's antigenic phenotype. In our updated version, which does model sequences, each mutation also results in a single nucleotide change.

When a mutation event occurs in antigen, we need to decide how to mutate the sequence. Our current strategy involves two basic steps. First, we randomly choose a site to mutate in the gene (e.g., C12). Second, we randomly choose a nucleotide-level mutation given the wildtype nucleotide and a pre-specified transition/transversion ratio $\kappa$. This approach is similar to nucleotide-mutation models used in phylogenetics, but differs in a few ways.

The goal of this issue is to update the mutation model in antigen so that mutations occur at rates proportional to ones from the HKY85 mutation model. In this model, the rate at which nucleotide $x$ mutates to nucleotide $y$ is:

$$Q_{xy} = \begin{cases} \kappa\pi_y & \text{if } x \text{ and } y \text{ differ by a transition}\\ \pi_y & \text{if } x \text{ and } y \text{ differ by a transversion}\\ \end{cases}$$

where $\pi_y$ gives the expected frequency of nucleotide $y$ in the absence of selection. Our current mutation model in antigen differs from the HKY85 model in two ways. First, it does not account for $\pi_y$ parameters in choosing $y$. Second, by choosing a random site to mutate in the first step from above, it ignores that a site's mutation rate depends on the wildtype nucleotide $x$.

Below is an idea for how to update the mutation model in antigen. A simple idea is to replace the single $\mu$ parameter with $Q_{xy}$ parameters. However, in antigen, $\mu$ accounts for both mutation and selection. The virus population in a host is represented by a single virus with a defined antigenic phenotype and gene sequence. According to the literature, mutations seem to rarely fix in hosts. Thus, $\mu$ could be interpreted as the rate at which a mutation first occurs, then increases to an appreciable frequency, and then is involved in a transmission event. Since it is difficult to model each component of this process, the below idea retains the concept of $\mu$, but ensures that mutations occur at rates proportional to the HKY85 model. Specifically, when a mutation event occurs, the new model uses the following workflow to determine how to mutate the nucleotide sequence.

Step 1: choose a site to mutate: In the HKY85 model, the probability that a site mutates depends on its wildtype nucleotide. Specifically, the rate at which a nucleotide $x$ mutates to any of the other three nucleotides is:

$$R_x = \sum_y{Q_{xy}}$$

where the expression sums over each possible mutant nucleotide. Under this model, if there is a single nucleotide mutation somewhere in a gene sequence, then the probability that the mutation involves a wildtype nucleotide of $x$ is:

$$P_x = \frac{R_x n_x}{\sum_i{R_i n_i}}$$

where $n_x$ is the number of counts of nucleotide $x$ in the gene sequence, and the expression in the denominator sums over each of the four nucleotides.

We will mimic this process in the following way. When there is a mutation event in antigen, we will first randomly select one of the four nucleotides to be the wildtype nucleotide in the mutation, with selection probabilities of $P_x$. Then, we will make a list of all sites in the gene sequence with the randomly selected wildtype nucleotide (e.g., C12, C24, C28, etc.), and randomly choose one of those sites to mutate (e.g., C12). This list could be stored in a dictionary and updated for computational efficiency.

Step 2: choose a mutant nucleotide: Once we have chosen a site to mutate, we will randomly select a mutant nucleotide $y$ with probabilities proportional to mutation rates $Q_{xy}$.

Using these two steps, $Q_{xy}$ values in antigen should be proportional to $Q_{xy}$ values in HKY85, unless I am mistaken. What do you all think? Any other ideas?

@Haddox
Copy link
Author

Haddox commented Aug 9, 2022

Here is a simplified version of the above idea, including a new term that accounts for functional constraint. It is inspired by equations 1-3 from a recent phydms paper, where now, $x$ and $y$ represent codons. When a mutation event happens in antigen, we randomly select the mutation it corresponds to by drawing from the following probability distribution:

$$P_{r,xy} = \frac{Q_{xy} \times F_{r,xy}}{Z}$$

where:

  • $x$ is the wildtype codon
  • $r$ is the codon number
  • $y$ is the mutant codon (single nucleotide change away)
  • $Q_{r,xy}$ is the rate of the mutation according to the HKY85 model
  • $F_{r,xy}$ is the rate that the mutation fixes, as determined by amino-acid preferences
  • $Z$ is a partition function that ensures that all probabilities sum to one

Specifically, we define $F_{r, xy}$ in the same way as the phydms paper:
$$F_{r,xy} = \frac{\ln{[ (\pi_{r,y}/\pi_{r,x}})^\beta ]}{1-(\pi_{r,y}/\pi_{r,x})^\beta}$$

where:

  • $\pi_{r,x}$ is the preference for the amino acid encoded by codon $x$ at site $r$
  • $\pi_{r,y}$ is the preference for the amino acid encoded by codon $y$ at site $r$
  • $\beta$ is the stringency parameter

One way that the above idea violates the HYK model is that viruses can only get a maximum of one mutation per gene per day, since there is a maximum of one mutation event per day. Though, the mutation rate in antigen is quite low ( $10^{-4}$ mutations per host per day), so this might not be very much of a problem.

Erick pointed out that a more accurate approach would be to take a gene, cycle over each nucleotide, and apply the rate matrix (e.g., $Q_{r,xy} \times F_{r,xy}$) to each site. This would solve the above violation, as well as make things more interpretable. Though we don't know how much this would slow down the simulation since we'd need a for-loop over each site in the sequence.

@matsen
Copy link

matsen commented Aug 10, 2022

I think this all looks good. The only thing that I can add is that in general, if one has a continuous-time Markov chain, there is a notion of a "jump chain" which is the discrete transition story after a single event. See, e.g. section 4 of this.

The definition is quite simple: the probability of moving to another state $j$ from state $i$ assuming that there is a single event is the normalized rate of going from $j$ from $i$ in the CTMC. This is an exact statement-- not an approximation. If one wanted to have multiple mutations one could sample the first and then the second conditioned on the first.

I believe that's exactly the statement being made here... if not let's discuss more.

If we're on the right track, what is the right computing formalism that will allow us to use these complex models without too much overhead?

@Haddox
Copy link
Author

Haddox commented Aug 10, 2022

Thanks, Erick. Good to know about "jump chain". I also believe that's the statement being made here.

Thien and I just discussed an idea for how to implement this in the code without too much overhead. Either she or I will post something soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants