diff --git a/examples/bart/bart_categorical_hawks.ipynb b/examples/bart/bart_categorical_hawks.ipynb
index ac2a62f12..b017fd8fe 100644
--- a/examples/bart/bart_categorical_hawks.ipynb
+++ b/examples/bart/bart_categorical_hawks.ipynb
@@ -1473,7 +1473,7 @@
"source": [
"So far we have a very good result concerning the classification of the species based on the 5 covariables. However, if we want to select a subset of covariable to perform future classifications is not very clear which of them to select. Maybe something sure is that `Tail` could be eliminated. At the beginning when we plot the distribution of each covariable we said that the most important variables to make the classification could be `Wing`, `Weight` and, `Culmen`, nevertheless after running the model we saw that `Hallux`, `Culmen` and, `Wing`, proved to be the most important ones.\n",
"\n",
- "Unfortunatelly, the partial dependence plots show a very wide dispersion, making results look suspicious. One way to reduce this variability is adjusting independent trees, below we will see how to do this and get a more accurate result. "
+ "Unfortunately, the partial dependence plots show a very wide dispersion, making results look suspicious. One way to reduce this variability is adjusting independent trees, below we will see how to do this and get a more accurate result. "
]
},
{
diff --git a/examples/bart/bart_categorical_hawks.myst.md b/examples/bart/bart_categorical_hawks.myst.md
index 5704b0b5e..b4557f765 100644
--- a/examples/bart/bart_categorical_hawks.myst.md
+++ b/examples/bart/bart_categorical_hawks.myst.md
@@ -195,7 +195,7 @@ all
So far we have a very good result concerning the classification of the species based on the 5 covariables. However, if we want to select a subset of covariable to perform future classifications is not very clear which of them to select. Maybe something sure is that `Tail` could be eliminated. At the beginning when we plot the distribution of each covariable we said that the most important variables to make the classification could be `Wing`, `Weight` and, `Culmen`, nevertheless after running the model we saw that `Hallux`, `Culmen` and, `Wing`, proved to be the most important ones.
-Unfortunatelly, the partial dependence plots show a very wide dispersion, making results look suspicious. One way to reduce this variability is adjusting independent trees, below we will see how to do this and get a more accurate result.
+Unfortunately, the partial dependence plots show a very wide dispersion, making the results look suspicious. One way to reduce this variability is to fit independent trees; below we will see how to do this and get a more accurate result.
+++
diff --git a/examples/bart/bart_introduction.ipynb b/examples/bart/bart_introduction.ipynb
index d2f7fece1..a8375c905 100644
--- a/examples/bart/bart_introduction.ipynb
+++ b/examples/bart/bart_introduction.ipynb
@@ -8,7 +8,7 @@
"(BART_introduction)=\n",
"# Bayesian Additive Regression Trees: Introduction\n",
":::{post} Dec 21, 2021\n",
- ":tags: BART, non-parametric, regression \n",
+ ":tags: BART, nonparametric, regression \n",
":category: intermediate, explanation\n",
":author: Osvaldo Martin\n",
":::"
@@ -210,7 +210,7 @@
"id": "7eb4c307",
"metadata": {},
"source": [
- "Before checking the result, we need to discuss one more detail, the BART variable always samples over the real line, meaning that in principle we can get values that go from $-\\infty$ to $\\infty$. Thus, we may need to transform their values as we would do for standard Generalized Linear Models, for example in the `model_coal` we computed `pm.math.exp(μ_)` because the Poisson distribution is expecting values that go from 0 to $\\infty$. This is business as usual, the novelty is that we may need to apply the inverse transformation to the values of `Y`, as we did in the previous model where we took $\\log(Y)$. The main reason to do this is that the values of `Y` are used to get a reasonable initial value for the sum of trees and also the variance of the leaf nodes. Thus, applying the inverse transformation is a simple way to improve the efficiency and accuracy of the result. Should we do this for every possible likelihood? Well, no. If we are using BART for the location parameter of distributions like Normal, StudentT, or AssymetricLaplace, we don't need to do anything as the support of these parameters is also the real line. A nontrivial exception is the Bernoulli likelihood (or Binomial with n=1), in that case, we need to apply the logistic function to the BART variable, but there is no need to apply its inverse to transform `Y`, PyMC-BART already takes care of that particular case.\n",
+ "Before checking the result, we need to discuss one more detail, the BART variable always samples over the real line, meaning that in principle we can get values that go from $-\\infty$ to $\\infty$. Thus, we may need to transform their values as we would do for standard Generalized Linear Models, for example in the `model_coal` we computed `pm.math.exp(μ_)` because the Poisson distribution is expecting values that go from 0 to $\\infty$. This is business as usual, the novelty is that we may need to apply the inverse transformation to the values of `Y`, as we did in the previous model where we took $\\log(Y)$. The main reason to do this is that the values of `Y` are used to get a reasonable initial value for the sum of trees and also the variance of the leaf nodes. Thus, applying the inverse transformation is a simple way to improve the efficiency and accuracy of the result. Should we do this for every possible likelihood? Well, no. If we are using BART for the location parameter of distributions like Normal, StudentT, or AsymmetricLaplace, we don't need to do anything as the support of these parameters is also the real line. A nontrivial exception is the Bernoulli likelihood (or Binomial with n=1), in that case, we need to apply the logistic function to the BART variable, but there is no need to apply its inverse to transform `Y`, PyMC-BART already takes care of that particular case.\n",
"\n",
"OK, now let's see the result of `model_coal`."
]
diff --git a/examples/bart/bart_introduction.myst.md b/examples/bart/bart_introduction.myst.md
index f2c978c03..a01898040 100644
--- a/examples/bart/bart_introduction.myst.md
+++ b/examples/bart/bart_introduction.myst.md
@@ -14,7 +14,7 @@ kernelspec:
(BART_introduction)=
# Bayesian Additive Regression Trees: Introduction
:::{post} Dec 21, 2021
-:tags: BART, non-parametric, regression
+:tags: BART, nonparametric, regression
:category: intermediate, explanation
:author: Osvaldo Martin
:::
@@ -98,7 +98,7 @@ with pm.Model() as model_coal:
idata_coal = pm.sample(random_seed=RANDOM_SEED)
```
-Before checking the result, we need to discuss one more detail, the BART variable always samples over the real line, meaning that in principle we can get values that go from $-\infty$ to $\infty$. Thus, we may need to transform their values as we would do for standard Generalized Linear Models, for example in the `model_coal` we computed `pm.math.exp(μ_)` because the Poisson distribution is expecting values that go from 0 to $\infty$. This is business as usual, the novelty is that we may need to apply the inverse transformation to the values of `Y`, as we did in the previous model where we took $\log(Y)$. The main reason to do this is that the values of `Y` are used to get a reasonable initial value for the sum of trees and also the variance of the leaf nodes. Thus, applying the inverse transformation is a simple way to improve the efficiency and accuracy of the result. Should we do this for every possible likelihood? Well, no. If we are using BART for the location parameter of distributions like Normal, StudentT, or AssymetricLaplace, we don't need to do anything as the support of these parameters is also the real line. A nontrivial exception is the Bernoulli likelihood (or Binomial with n=1), in that case, we need to apply the logistic function to the BART variable, but there is no need to apply its inverse to transform `Y`, PyMC-BART already takes care of that particular case.
+Before checking the result, we need to discuss one more detail: the BART variable always samples over the real line, meaning that in principle we can get values that go from $-\infty$ to $\infty$. Thus, we may need to transform their values as we would do for standard Generalized Linear Models; for example, in `model_coal` we computed `pm.math.exp(μ_)` because the Poisson distribution expects values that go from 0 to $\infty$. This is business as usual; the novelty is that we may need to apply the inverse transformation to the values of `Y`, as we did in the previous model where we took $\log(Y)$. The main reason to do this is that the values of `Y` are used to get a reasonable initial value for the sum of trees and also the variance of the leaf nodes. Thus, applying the inverse transformation is a simple way to improve the efficiency and accuracy of the result. Should we do this for every possible likelihood? Well, no. If we are using BART for the location parameter of distributions like Normal, StudentT, or AsymmetricLaplace, we don't need to do anything, as the support of these parameters is also the real line. A nontrivial exception is the Bernoulli likelihood (or Binomial with n=1): in that case, we need to apply the logistic function to the BART variable, but there is no need to apply its inverse to transform `Y`; PyMC-BART already takes care of that particular case.
OK, now let's see the result of `model_coal`.
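To make the transformation logic above concrete, here is a minimal sketch of a Poisson model with a BART prior on the log-rate, in the style of `model_coal`. The stand-in data, variable names, and `m=20` are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
import pymc as pm
import pymc_bart as pmb

# Illustrative stand-in data: binned event counts (the real notebook uses the coal-mining data).
rng = np.random.default_rng(42)
x_data = np.linspace(0, 100, 50)[:, None]
y_data = rng.poisson(3, size=50)

with pm.Model() as model_coal_sketch:
    # BART samples on the real line, so we model the log-rate.
    # Passing a log-transformed Y (log1p here, to guard against zero counts) gives the sampler
    # a reasonable initial value for the sum of trees and the leaf-node variance.
    μ_ = pmb.BART("μ_", X=x_data, Y=np.log1p(y_data), m=20)
    # Inverse link back to the positive scale expected by the Poisson likelihood.
    μ = pm.Deterministic("μ", pm.math.exp(μ_))
    pm.Poisson("y", mu=μ, observed=y_data)
    idata_sketch = pm.sample(random_seed=42)
```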
diff --git a/examples/bart/bart_quantile_regression.ipynb b/examples/bart/bart_quantile_regression.ipynb
index 1b04f5d71..299efe66f 100644
--- a/examples/bart/bart_quantile_regression.ipynb
+++ b/examples/bart/bart_quantile_regression.ipynb
@@ -8,7 +8,7 @@
"(BART_quantile)=\n",
"# Quantile Regression with BART\n",
":::{post} Jan 25, 2023\n",
- ":tags: BART, non-parametric, quantile, regression \n",
+ ":tags: BART, nonparametric, quantile, regression \n",
":category: intermediate, explanation\n",
":author: Osvaldo Martin\n",
":::"
@@ -468,7 +468,7 @@
"id": "8e963637",
"metadata": {},
"source": [
- "We can see that when we use a Normal likelihood, and from that fit we compute the quantiles, the quantiles q=0.1 and q=0.9 are symetrical with respect to q=0.5, also the shape of the curves is essentially the same just shifted up or down. Additionally the Asymmetric Laplace family allows the model to account for the increased variability in BMI as the age increases, while for the Gaussian family that variability always stays the same."
+ "We can see that when we use a Normal likelihood, and from that fit we compute the quantiles, the quantiles q=0.1 and q=0.9 are symmetrical with respect to q=0.5, also the shape of the curves is essentially the same just shifted up or down. Additionally the Asymmetric Laplace family allows the model to account for the increased variability in BMI as the age increases, while for the Gaussian family that variability always stays the same."
]
},
{
diff --git a/examples/bart/bart_quantile_regression.myst.md b/examples/bart/bart_quantile_regression.myst.md
index 60d4094ba..f3adbf649 100644
--- a/examples/bart/bart_quantile_regression.myst.md
+++ b/examples/bart/bart_quantile_regression.myst.md
@@ -14,7 +14,7 @@ kernelspec:
(BART_quantile)=
# Quantile Regression with BART
:::{post} Jan 25, 2023
-:tags: BART, non-parametric, quantile, regression
+:tags: BART, nonparametric, quantile, regression
:category: intermediate, explanation
:author: Osvaldo Martin
:::
@@ -141,7 +141,7 @@ plt.xlabel("Age")
plt.ylabel("BMI");
```
-We can see that when we use a Normal likelihood, and from that fit we compute the quantiles, the quantiles q=0.1 and q=0.9 are symetrical with respect to q=0.5, also the shape of the curves is essentially the same just shifted up or down. Additionally the Asymmetric Laplace family allows the model to account for the increased variability in BMI as the age increases, while for the Gaussian family that variability always stays the same.
+We can see that when we use a Normal likelihood and compute the quantiles from that fit, the quantiles q=0.1 and q=0.9 are symmetrical with respect to q=0.5, and the shape of the curves is essentially the same, just shifted up or down. Additionally, the Asymmetric Laplace family allows the model to account for the increased variability in BMI as the age increases, while for the Gaussian family that variability always stays the same.
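For concreteness, here is a minimal sketch of the Asymmetric Laplace approach for a single quantile; the simulated data, variable names, and priors are assumptions for illustration rather than the notebook's actual code.

```python
import numpy as np
import pymc as pm
import pymc_bart as pmb

# Illustrative stand-in data; the notebook itself models BMI as a function of Age.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(2, 20, 200))[:, None]
Y = 15 + 0.5 * X[:, 0] + rng.normal(0, 0.1 * X[:, 0])

q = 0.9  # target quantile
# Under PyMC's AsymmetricLaplace parameterization, P(Y <= mu) = kappa^2 / (1 + kappa^2),
# so kappa = sqrt(q / (1 - q)) makes mu the q-th conditional quantile.
kappa = np.sqrt(q / (1 - q))

with pm.Model() as model_quantile_sketch:
    μ = pmb.BART("μ", X=X, Y=Y, m=50)
    σ = pm.HalfNormal("σ", 5)
    pm.AsymmetricLaplace("y", mu=μ, b=σ, kappa=kappa, observed=Y)
    idata_q = pm.sample(random_seed=0)
```

Refitting with q=0.1, 0.5, and 0.9 yields quantile curves that need not be symmetric around the median, which is exactly the asymmetry discussed above.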
+++
diff --git a/examples/case_studies/CFA_SEM.ipynb b/examples/case_studies/CFA_SEM.ipynb
index 4922dae8e..75eb1fd23 100644
--- a/examples/case_studies/CFA_SEM.ipynb
+++ b/examples/case_studies/CFA_SEM.ipynb
@@ -4019,7 +4019,7 @@
"source": [
"### Intermediate Cross-Loading Model\n",
"\n",
- "The idea of a measurment model is maybe a little opaque when we only see models that fit well. Instead we want to briefly show how an in-apt measurement model gets reflected in the estimated parameters for the factor loadings. Here we specify a measurement model which attempts to couple the `se_social` and `sup_parents` indicators and bundle them into the same factor. "
+ "The idea of a measurement model is maybe a little opaque when we only see models that fit well. Instead we want to briefly show how an in-apt measurement model gets reflected in the estimated parameters for the factor loadings. Here we specify a measurement model which attempts to couple the `se_social` and `sup_parents` indicators and bundle them into the same factor. "
]
},
{
@@ -7890,7 +7890,7 @@
"source": [
"# Conclusion\n",
"\n",
- "We've just seen how we can go from thinking about the measurment of abstract psychometric constructs, through the evaluation of complex patterns of correlation and covariances among these latent constructs to evaluating hypothetical causal structures amongst the latent factors. This is a bit of whirlwind tour of psychometric models and the expressive power of SEM and CFA models, which we're ending by linking them to the realm of causal inference! This is not an accident, but rather evidence that causal concerns sit at the heart of most modeling endeavours. When we're interested in any kind of complex joint-distribution of variables, we're likely interested in the causal structure of the system - how are the realised values of some observed metrics dependent on or related to others? Importantly, we need to understand how these observations are realised without confusing simple correlation for cause through naive or confounded inference.\n",
+ "We've just seen how we can go from thinking about the measurement of abstract psychometric constructs, through the evaluation of complex patterns of correlation and covariances among these latent constructs to evaluating hypothetical causal structures amongst the latent factors. This is a bit of whirlwind tour of psychometric models and the expressive power of SEM and CFA models, which we're ending by linking them to the realm of causal inference! This is not an accident, but rather evidence that causal concerns sit at the heart of most modeling endeavours. When we're interested in any kind of complex joint-distribution of variables, we're likely interested in the causal structure of the system - how are the realised values of some observed metrics dependent on or related to others? Importantly, we need to understand how these observations are realised without confusing simple correlation for cause through naive or confounded inference.\n",
"\n",
"Mislevy and Levy highlight this connection by focusing on the role of De Finetti's theorem in the recovery of exchangeablility through Bayesian inference. By De Finetti’s theorem a distribution of exchangeable sequence of variables be expressed as mixture of conditional independent variables.\n",
"\n",
diff --git a/examples/case_studies/CFA_SEM.myst.md b/examples/case_studies/CFA_SEM.myst.md
index 731602695..59f78ce1e 100644
--- a/examples/case_studies/CFA_SEM.myst.md
+++ b/examples/case_studies/CFA_SEM.myst.md
@@ -282,7 +282,7 @@ Which shows a relatively sound recovery of the observed data.
### Intermediate Cross-Loading Model
-The idea of a measurment model is maybe a little opaque when we only see models that fit well. Instead we want to briefly show how an in-apt measurement model gets reflected in the estimated parameters for the factor loadings. Here we specify a measurement model which attempts to couple the `se_social` and `sup_parents` indicators and bundle them into the same factor.
+The idea of a measurement model is maybe a little opaque when we only see models that fit well. Instead, we want to briefly show how an inapt measurement model gets reflected in the estimated parameters for the factor loadings. Here we specify a measurement model which attempts to couple the `se_social` and `sup_parents` indicators and bundle them into the same factor.
```{code-cell} ipython3
coords = {
@@ -1035,7 +1035,7 @@ compare_df
# Conclusion
-We've just seen how we can go from thinking about the measurment of abstract psychometric constructs, through the evaluation of complex patterns of correlation and covariances among these latent constructs to evaluating hypothetical causal structures amongst the latent factors. This is a bit of whirlwind tour of psychometric models and the expressive power of SEM and CFA models, which we're ending by linking them to the realm of causal inference! This is not an accident, but rather evidence that causal concerns sit at the heart of most modeling endeavours. When we're interested in any kind of complex joint-distribution of variables, we're likely interested in the causal structure of the system - how are the realised values of some observed metrics dependent on or related to others? Importantly, we need to understand how these observations are realised without confusing simple correlation for cause through naive or confounded inference.
+We've just seen how we can go from thinking about the measurement of abstract psychometric constructs, through the evaluation of complex patterns of correlation and covariances among these latent constructs, to evaluating hypothetical causal structures amongst the latent factors. This is a bit of a whirlwind tour of psychometric models and the expressive power of SEM and CFA models, which we're ending by linking them to the realm of causal inference! This is not an accident, but rather evidence that causal concerns sit at the heart of most modeling endeavours. When we're interested in any kind of complex joint distribution of variables, we're likely interested in the causal structure of the system - how are the realised values of some observed metrics dependent on or related to others? Importantly, we need to understand how these observations are realised without confusing simple correlation for cause through naive or confounded inference.
Mislevy and Levy highlight this connection by focusing on the role of De Finetti's theorem in the recovery of exchangeablility through Bayesian inference. By De Finetti’s theorem a distribution of exchangeable sequence of variables be expressed as mixture of conditional independent variables.
diff --git a/examples/case_studies/GEV.ipynb b/examples/case_studies/GEV.ipynb
index 8bf0240c2..e5c678c13 100644
--- a/examples/case_studies/GEV.ipynb
+++ b/examples/case_studies/GEV.ipynb
@@ -127,7 +127,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "And now set up the model using priors estimated from a quick review of the historgram above:\n",
+ "And now set up the model using priors estimated from a quick review of the histogram above:\n",
"\n",
"- $\\mu$: there is no real basis for considering anything other than a `Normal` distribution with a standard deviation limiting negative outcomes;\n",
"- $\\sigma$: this must be positive, and has a small value, so use `HalfNormal` with a unit standard deviation;\n",
diff --git a/examples/case_studies/GEV.myst.md b/examples/case_studies/GEV.myst.md
index c0312f16b..d0234c712 100644
--- a/examples/case_studies/GEV.myst.md
+++ b/examples/case_studies/GEV.myst.md
@@ -85,7 +85,7 @@ Consider then, the 10-year return period, for which $p = 1/10$:
p = 1 / 10
```
-And now set up the model using priors estimated from a quick review of the historgram above:
+And now set up the model using priors estimated from a quick review of the histogram above:
- $\mu$: there is no real basis for considering anything other than a `Normal` distribution with a standard deviation limiting negative outcomes;
- $\sigma$: this must be positive, and has a small value, so use `HalfNormal` with a unit standard deviation;
diff --git a/examples/case_studies/reliability_and_calibrated_prediction.ipynb b/examples/case_studies/reliability_and_calibrated_prediction.ipynb
index eb30ed396..e1d377c8a 100644
--- a/examples/case_studies/reliability_and_calibrated_prediction.ipynb
+++ b/examples/case_studies/reliability_and_calibrated_prediction.ipynb
@@ -442,7 +442,7 @@
"
Standard_Error | \n",
" CI_95_lb | \n",
" CI_95_ub | \n",
- " ploting_position | \n",
+ " plotting_position | \n",
" logit_CI_95_lb | \n",
" logit_CI_95_ub | \n",
" \n",
@@ -512,7 +512,7 @@
"1 2 5 197 0.025381 0.974619 0.961624 0.039131 0.038376 \n",
"2 3 2 97 0.020619 0.979381 0.941797 0.059965 0.058203 \n",
"\n",
- " V_hat Standard_Error CI_95_lb CI_95_ub ploting_position \\\n",
+ " V_hat Standard_Error CI_95_lb CI_95_ub plotting_position \\\n",
"0 0.000044 0.006622 0.000354 0.026313 0.013333 \n",
"1 0.000164 0.012802 0.013283 0.063468 0.038376 \n",
"2 0.000350 0.018701 0.021550 0.094856 0.058203 \n",
@@ -578,7 +578,7 @@
" actuarial_table[\"CI_95_ub\"] = np.where(\n",
" actuarial_table[\"CI_95_ub\"] > 1, 1, actuarial_table[\"CI_95_ub\"]\n",
" )\n",
- " actuarial_table[\"ploting_position\"] = actuarial_table[\"F_hat\"].rolling(1).median()\n",
+ " actuarial_table[\"plotting_position\"] = actuarial_table[\"F_hat\"].rolling(1).median()\n",
" actuarial_table = logit_transform_interval(actuarial_table)\n",
" return actuarial_table\n",
"\n",
@@ -715,7 +715,7 @@
" Standard_Error | \n",
" CI_95_lb | \n",
" CI_95_ub | \n",
- " ploting_position | \n",
+ " plotting_position | \n",
" logit_CI_95_lb | \n",
" logit_CI_95_ub | \n",
" \n",
@@ -1604,7 +1604,7 @@
"36 1.000000 0.287376 1.246964 0.712624 0.022828 0.151089 \n",
"37 1.000000 0.287376 1.246964 0.712624 0.022828 0.151089 \n",
"\n",
- " CI_95_lb CI_95_ub ploting_position logit_CI_95_lb logit_CI_95_ub \n",
+ " CI_95_lb CI_95_ub plotting_position logit_CI_95_lb logit_CI_95_ub \n",
"0 0.000000 0.000000 0.000000 NaN NaN \n",
"1 0.000000 0.077212 0.026316 0.003694 0.164570 \n",
"2 0.000000 0.077212 0.026316 0.003694 0.164570 \n",
@@ -2477,7 +2477,7 @@
" Standard_Error | \n",
" CI_95_lb | \n",
" CI_95_ub | \n",
- " ploting_position | \n",
+ " plotting_position | \n",
" logit_CI_95_lb | \n",
" logit_CI_95_ub | \n",
" \n",
@@ -3090,7 +3090,7 @@
"24 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 \n",
"25 0.944506 0.057093 0.055494 2.140430e-03 0.046265 0.0 \n",
"\n",
- " CI_95_ub ploting_position logit_CI_95_lb logit_CI_95_ub \n",
+ " CI_95_ub plotting_position logit_CI_95_lb logit_CI_95_ub \n",
"0 0.000000 0.000000 NaN NaN \n",
"1 0.000000 0.000000 NaN NaN \n",
"2 0.000000 0.000000 NaN NaN \n",
@@ -3466,7 +3466,7 @@
")\n",
"ax.scatter(\n",
" np.log(actuarial_table_bearings[\"t\"]),\n",
- " norm.ppf(actuarial_table_bearings[\"ploting_position\"]),\n",
+ " norm.ppf(actuarial_table_bearings[\"plotting_position\"]),\n",
" label=\"Non-Parametric CDF\",\n",
" color=\"black\",\n",
")\n",
@@ -3529,7 +3529,7 @@
")\n",
"ax2.scatter(\n",
" np.log(actuarial_table_bearings[\"t\"]),\n",
- " sev_ppf(actuarial_table_bearings[\"ploting_position\"]),\n",
+ " sev_ppf(actuarial_table_bearings[\"plotting_position\"]),\n",
" label=\"Non-Parametric CDF\",\n",
" color=\"black\",\n",
")\n",
@@ -3596,12 +3596,12 @@
"\n",
"We've now seen how to model and visualise the parametric model fits to sparse reliability using a frequentist or MLE framework. We want to now show how the same style of inferences can be achieved in the Bayesian paradigm. \n",
"\n",
- "As in the MLE paradigm we need to model the censored liklihood. For most log-location distributions we've seen above the likelihood is expressed as a function of a combination of the distribution pdf $\\phi$ and cdf $\\Phi$ applied as appropriately depending on whether or not the data point was fully observed in the time window or censored. \n",
+ "As in the MLE paradigm we need to model the censored likelihood. For most log-location distributions we've seen above the likelihood is expressed as a function of a combination of the distribution pdf $\\phi$ and cdf $\\Phi$ applied as appropriately depending on whether or not the data point was fully observed in the time window or censored. \n",
"\n",
"\n",
"$$ L(\\mu, \\sigma) = \\prod_{i = 1}^{n} \\Bigg(\\dfrac{1}{\\sigma t_{i}} \\phi\\Bigg[ \\dfrac{log(t_{i}) - \\mu}{\\sigma} \\Bigg] \\Bigg)^{\\delta_{i}} \\cdot \\Bigg(1 - \\Phi \\Bigg[ \\dfrac{log(t_{i}) - \\mu}{\\sigma} \\Bigg] \\Bigg)^{1-\\delta}$$\n",
"\n",
- "where $\\delta_{i}$ is an indicator for whether the observation is a faiure or a right censored observation. More complicated types of censoring can be included with similar modifications of the CDF depending on the nature of the censored observations."
+ "where $\\delta_{i}$ is an indicator for whether the observation is a failure or a right censored observation. More complicated types of censoring can be included with similar modifications of the CDF depending on the nature of the censored observations."
]
},
{
diff --git a/examples/case_studies/reliability_and_calibrated_prediction.myst.md b/examples/case_studies/reliability_and_calibrated_prediction.myst.md
index 2b28cdb83..f6145706a 100644
--- a/examples/case_studies/reliability_and_calibrated_prediction.myst.md
+++ b/examples/case_studies/reliability_and_calibrated_prediction.myst.md
@@ -232,7 +232,7 @@ def make_actuarial_table(actuarial_table):
actuarial_table["CI_95_ub"] = np.where(
actuarial_table["CI_95_ub"] > 1, 1, actuarial_table["CI_95_ub"]
)
- actuarial_table["ploting_position"] = actuarial_table["F_hat"].rolling(1).median()
+ actuarial_table["plotting_position"] = actuarial_table["F_hat"].rolling(1).median()
actuarial_table = logit_transform_interval(actuarial_table)
return actuarial_table
@@ -743,7 +743,7 @@ ax.plot(
)
ax.scatter(
np.log(actuarial_table_bearings["t"]),
- norm.ppf(actuarial_table_bearings["ploting_position"]),
+ norm.ppf(actuarial_table_bearings["plotting_position"]),
label="Non-Parametric CDF",
color="black",
)
@@ -806,7 +806,7 @@ ax2.plot(
)
ax2.scatter(
np.log(actuarial_table_bearings["t"]),
- sev_ppf(actuarial_table_bearings["ploting_position"]),
+ sev_ppf(actuarial_table_bearings["plotting_position"]),
label="Non-Parametric CDF",
color="black",
)
@@ -866,12 +866,12 @@ We can see here how neither MLE fit covers the range of observed data.
We've now seen how to model and visualise the parametric model fits to sparse reliability using a frequentist or MLE framework. We want to now show how the same style of inferences can be achieved in the Bayesian paradigm.
-As in the MLE paradigm we need to model the censored liklihood. For most log-location distributions we've seen above the likelihood is expressed as a function of a combination of the distribution pdf $\phi$ and cdf $\Phi$ applied as appropriately depending on whether or not the data point was fully observed in the time window or censored.
+As in the MLE paradigm we need to model the censored likelihood. For most log-location distributions we've seen above, the likelihood is expressed as a function of a combination of the distribution pdf $\phi$ and cdf $\Phi$, applied as appropriate depending on whether or not the data point was fully observed in the time window or censored.
$$ L(\mu, \sigma) = \prod_{i = 1}^{n} \Bigg(\dfrac{1}{\sigma t_{i}} \phi\Bigg[ \dfrac{log(t_{i}) - \mu}{\sigma} \Bigg] \Bigg)^{\delta_{i}} \cdot \Bigg(1 - \Phi \Bigg[ \dfrac{log(t_{i}) - \mu}{\sigma} \Bigg] \Bigg)^{1-\delta}$$
-where $\delta_{i}$ is an indicator for whether the observation is a faiure or a right censored observation. More complicated types of censoring can be included with similar modifications of the CDF depending on the nature of the censored observations.
+where $\delta_{i}$ is an indicator for whether the observation is a failure or a right censored observation. More complicated types of censoring can be included with similar modifications of the CDF depending on the nature of the censored observations.
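As a concrete illustration of this censored likelihood, here is a minimal sketch using `pm.Censored` with a log-normal base distribution; the simulated failure times, the study-end cutoff, and all variable names are placeholders rather than the notebook's actual bearings data.

```python
import numpy as np
import pymc as pm

# Placeholder failure-time data, right-censored at an assumed end-of-study time.
rng = np.random.default_rng(7)
t_true = rng.lognormal(mean=3.0, sigma=0.5, size=100)
study_end = 30.0
t = np.minimum(t_true, study_end)  # censored units are recorded at the cutoff

with pm.Model() as censored_lognormal:
    mu = pm.Normal("mu", 3.0, 1.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    # pm.Censored contributes the pdf term for observed failures and the (1 - CDF) term
    # for values clamped at `upper`, matching the delta-indicator expression above,
    # so no explicit failure indicator is needed.
    dist = pm.LogNormal.dist(mu=mu, sigma=sigma)
    pm.Censored("obs", dist, lower=None, upper=study_end, observed=t)
    idata_censored = pm.sample(random_seed=7)
```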
+++
diff --git a/examples/case_studies/rugby_analytics.ipynb b/examples/case_studies/rugby_analytics.ipynb
index 9b894b0c8..afb63f5d6 100644
--- a/examples/case_studies/rugby_analytics.ipynb
+++ b/examples/case_studies/rugby_analytics.ipynb
@@ -405,7 +405,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let us first loook at a Pivot table with a sum of this, broken down by year."
+ "Let us first look at a Pivot table with a sum of this, broken down by year."
]
},
{
diff --git a/examples/case_studies/rugby_analytics.myst.md b/examples/case_studies/rugby_analytics.myst.md
index 73d3993d3..81d983e97 100644
--- a/examples/case_studies/rugby_analytics.myst.md
+++ b/examples/case_studies/rugby_analytics.myst.md
@@ -135,7 +135,7 @@ Let's look country by country.
df_all["difference_non_abs"] = df_all["home_score"] - df_all["away_score"]
```
-Let us first loook at a Pivot table with a sum of this, broken down by year.
+Let us first look at a pivot table of this difference, broken down by year.
```{code-cell} ipython3
df_all.pivot_table("difference_non_abs", "home_team", "year")
diff --git a/examples/causal_inference/GLM-simpsons-paradox.ipynb b/examples/causal_inference/GLM-simpsons-paradox.ipynb
index da9ccff10..89fe58cf2 100644
--- a/examples/causal_inference/GLM-simpsons-paradox.ipynb
+++ b/examples/causal_inference/GLM-simpsons-paradox.ipynb
@@ -24,7 +24,7 @@
"\n",
"Another way of describing this is that we wish to estimate the causal relationship $x \\rightarrow y$. The seemingly obvious approach of modelling `y ~ 1 + x` will lead us to conclude (in the situation above) that increasing $x$ causes $y$ to decrease (see Model 1 below). However, the relationship between $x$ and $y$ is confounded by a group membership variable $group$. This group membership variable is not included in the model, and so the relationship between $x$ and $y$ is biased. If we now factor in the influence of $group$, in some situations (e.g. the image above) this can lead us to completely reverse the sign of our estimate of $x \\rightarrow y$, now estimating that increasing $x$ causes $y$ to _increase_. \n",
"\n",
- "In short, this 'paradox' (or simply ommitted variable bias) can be resolved by assuming a causal DAG which includes how the main predictor variable _and_ group membership (the confounding variable) influence the outcome variable. We demonstrate an example where we _don't_ incorporate group membership (so our causal DAG is wrong, or in other words our model is misspecified; Model 1). We then show 2 ways to resolve this by including group membership as causal influence upon $x$ and $y$. This is shown in an unpooled model (Model 2) and a hierarchical model (Model 3)."
+ "In short, this 'paradox' (or simply omitted variable bias) can be resolved by assuming a causal DAG which includes how the main predictor variable _and_ group membership (the confounding variable) influence the outcome variable. We demonstrate an example where we _don't_ incorporate group membership (so our causal DAG is wrong, or in other words our model is misspecified; Model 1). We then show 2 ways to resolve this by including group membership as causal influence upon $x$ and $y$. This is shown in an unpooled model (Model 2) and a hierarchical model (Model 3)."
]
},
{
diff --git a/examples/causal_inference/GLM-simpsons-paradox.myst.md b/examples/causal_inference/GLM-simpsons-paradox.myst.md
index 8a5585dcc..77cac9c04 100644
--- a/examples/causal_inference/GLM-simpsons-paradox.myst.md
+++ b/examples/causal_inference/GLM-simpsons-paradox.myst.md
@@ -27,7 +27,7 @@ kernelspec:
Another way of describing this is that we wish to estimate the causal relationship $x \rightarrow y$. The seemingly obvious approach of modelling `y ~ 1 + x` will lead us to conclude (in the situation above) that increasing $x$ causes $y$ to decrease (see Model 1 below). However, the relationship between $x$ and $y$ is confounded by a group membership variable $group$. This group membership variable is not included in the model, and so the relationship between $x$ and $y$ is biased. If we now factor in the influence of $group$, in some situations (e.g. the image above) this can lead us to completely reverse the sign of our estimate of $x \rightarrow y$, now estimating that increasing $x$ causes $y$ to _increase_.
-In short, this 'paradox' (or simply ommitted variable bias) can be resolved by assuming a causal DAG which includes how the main predictor variable _and_ group membership (the confounding variable) influence the outcome variable. We demonstrate an example where we _don't_ incorporate group membership (so our causal DAG is wrong, or in other words our model is misspecified; Model 1). We then show 2 ways to resolve this by including group membership as causal influence upon $x$ and $y$. This is shown in an unpooled model (Model 2) and a hierarchical model (Model 3).
+In short, this 'paradox' (or simply omitted variable bias) can be resolved by assuming a causal DAG which includes how the main predictor variable _and_ group membership (the confounding variable) influence the outcome variable. We demonstrate an example where we _don't_ incorporate group membership (so our causal DAG is wrong, or in other words our model is misspecified; Model 1). We then show two ways to resolve this by including group membership as a causal influence upon $x$ and $y$. This is shown in an unpooled model (Model 2) and a hierarchical model (Model 3).
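The following is a minimal sketch contrasting a confounded `y ~ 1 + x` regression with an unpooled model that conditions on group; the simulated data, priors, and variable names are assumptions for illustration and not the notebook's actual code.

```python
import numpy as np
import pymc as pm

# Simulated data (assumed for illustration): within each group the x -> y slope is
# positive, but the group means shift so that the pooled slope comes out negative.
rng = np.random.default_rng(1)
group = np.repeat(np.arange(3), 30)
x = rng.normal(loc=2.0 * group, scale=0.5)
y = 0.5 * (x - 2.0 * group) - 2.0 * group + rng.normal(0, 0.3, size=x.size)

# Model 1: y ~ 1 + x, ignoring group membership (confounded estimate of the slope)
with pm.Model() as model_1:
    beta0 = pm.Normal("beta0", 0, 5)
    beta1 = pm.Normal("beta1", 0, 5)
    sigma = pm.HalfNormal("sigma", 2)
    pm.Normal("y", mu=beta0 + beta1 * x, sigma=sigma, observed=y)
    idata_1 = pm.sample(random_seed=1)

# Model 2: unpooled intercepts and slopes per group (conditions on the confounder)
with pm.Model(coords={"group": np.arange(3)}) as model_2:
    beta0_g = pm.Normal("beta0_g", 0, 5, dims="group")
    beta1_g = pm.Normal("beta1_g", 0, 5, dims="group")
    sigma = pm.HalfNormal("sigma", 2)
    pm.Normal("y", mu=beta0_g[group] + beta1_g[group] * x, sigma=sigma, observed=y)
    idata_2 = pm.sample(random_seed=1)
```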
```{code-cell} ipython3
import arviz as az
diff --git a/examples/causal_inference/bayesian_nonparametric_causal.ipynb b/examples/causal_inference/bayesian_nonparametric_causal.ipynb
index aabcbe398..61155c8af 100644
--- a/examples/causal_inference/bayesian_nonparametric_causal.ipynb
+++ b/examples/causal_inference/bayesian_nonparametric_causal.ipynb
@@ -8,7 +8,7 @@
"# Bayesian Non-parametric Causal Inference\n",
"\n",
":::{post} January, 2024\n",
- ":tags: bart, propensity scores, debiased machine learning, mediation\n",
+ ":tags: BART, propensity scores, debiased machine learning, mediation\n",
":category: advanced, reference\n",
":author: Nathaniel Forde\n",
":::"
@@ -98,9 +98,9 @@
"\n",
"Firstly, and somewhat superficially, the propensity score is a dimension reduction technique. We take a complex covariate profile $X_{i}$ representing an individual's measured attributes and reduce it to a scalar $p^{i}_{T}(X)$. It is also a tool for thinking about the potential outcomes of an individual under different treatment regimes. In a policy evaluation context it can help partial out the degree of incentives for policy adoption across strata of the population. What drives adoption or assignment in each niche of the population? How can different demographic strata be induced towards or away from adoption of the policy? Understanding these dynamics is crucial to gauge why selection bias might emerge in any sample data. Paul Goldsmith-Pinkham's [lectures](https://www.youtube.com/watch?v=8gWctYvRzk4&list=PLWWcL1M3lLlojLTSVf2gGYQ_9TlPyPbiJ&index=3) are particularly clear on this last point, and why this perspective is appealing to structural econometricians.\n",
"\n",
- "The pivotal idea when thinking about propensity scores is that we cannot license causal claims unless (i) the treatment assignment is independent of the covariate profiles i.e $T \\perp\\!\\!\\!\\perp X$ and (ii) the outcomes $Y(0)$, and $Y(1)$ are similarly conditionally independent of the treatement $T | X$. If these conditions hold, then we say that $T$ is __strongly ignorable__ given $X$. This is also occasionally noted as the __unconfoundedness__ or __exchangeability__ assumption. For each strata of the population defined by the covariate profile $X$, we require that, after controlling for $X$, it's as good as random which treatment status an individual adopts. This means that after controlling for $X$, any differences in outcomes between the treated and untreated groups can be attributed to the treatment itself rather than confounding variables.\n",
+ "The pivotal idea when thinking about propensity scores is that we cannot license causal claims unless (i) the treatment assignment is independent of the covariate profiles i.e $T \\perp\\!\\!\\!\\perp X$ and (ii) the outcomes $Y(0)$, and $Y(1)$ are similarly conditionally independent of the treatment $T | X$. If these conditions hold, then we say that $T$ is __strongly ignorable__ given $X$. This is also occasionally noted as the __unconfoundedness__ or __exchangeability__ assumption. For each strata of the population defined by the covariate profile $X$, we require that, after controlling for $X$, it's as good as random which treatment status an individual adopts. This means that after controlling for $X$, any differences in outcomes between the treated and untreated groups can be attributed to the treatment itself rather than confounding variables.\n",
"\n",
- "It is a theorem that if $T$ is strongly ignorable given $X$, then (i) and (ii) hold given $p_{T}(X)$ too. So valid statistical inference proceeds in a lower dimensional space using the propensity score as a proxy for the higher dimensional data. This is useful because some of the strata of a complex covariate profile may be sparsely populated so substituting a propensity score enables us to avoid the risks of high dimensional missing data. Causal inference is unconfounded when we have controlled for enough of drivers for policy adoption, that selection effects within each covariate profiles $X$ seem essentially random. The insight this suggests is that when you want to estimate a causal effect you are only required to control for the covariates which impact the probability of treatement assignment. More concretely, if it's easier to model the assignment mechanism than the outcome mechanism this can be substituted in the case of causal inference with observed data.\n",
+ "It is a theorem that if $T$ is strongly ignorable given $X$, then (i) and (ii) hold given $p_{T}(X)$ too. So valid statistical inference proceeds in a lower dimensional space using the propensity score as a proxy for the higher dimensional data. This is useful because some of the strata of a complex covariate profile may be sparsely populated so substituting a propensity score enables us to avoid the risks of high dimensional missing data. Causal inference is unconfounded when we have controlled for enough of drivers for policy adoption, that selection effects within each covariate profiles $X$ seem essentially random. The insight this suggests is that when you want to estimate a causal effect you are only required to control for the covariates which impact the probability of treatment assignment. More concretely, if it's easier to model the assignment mechanism than the outcome mechanism this can be substituted in the case of causal inference with observed data.\n",
"\n",
"Given the assumption that we are measuring the right covariate profiles to induce __strong ignorability__, then propensity scores can be used thoughtfully to underwrite causal claims."
]
@@ -184,7 +184,7 @@
"\n",
"If our treatment status is such that individuals will more or less actively select themselves into the status, then a naive comparisons of differences between treatment groups and control groups will be misleading to the degree that we have over-represented types of individual (covariate profiles) in the population.Randomisation solves this by balancing the covariate profiles across treatment and control groups and ensuring the outcomes are independent of the treatment assignment. But we can't always randomise. Propensity scores are useful because they can help emulate _as-if_ random assignment of treatment status in the sample data through a specific transformation of the observed data. \n",
"\n",
- "This type of assumption and ensuing tests of balance based on propensity scores is often substituted for elaboration of the structural DAG that systematically determine the treatment assignment. The idea being that if we can achieve balance across covariates conditional on a propensity score we have emulated an as-if random allocation we can avoid the hassle of specifying too much structure and remain agnostic about the strucuture of the mechanism. This can often be a useful strategy but, as we will see, elides the specificity of the causal question and the data generating process. "
+ "This type of assumption and ensuing tests of balance based on propensity scores is often substituted for elaboration of the structural DAG that systematically determine the treatment assignment. The idea being that if we can achieve balance across covariates conditional on a propensity score we have emulated an as-if random allocation we can avoid the hassle of specifying too much structure and remain agnostic about the structure of the mechanism. This can often be a useful strategy but, as we will see, elides the specificity of the causal question and the data generating process. "
]
},
{
@@ -3212,7 +3212,7 @@
" m1_pred = m1.predict(X)\n",
" X[\"trt\"] = t\n",
" X[\"y\"] = y\n",
- " ## Compromise between outcome and treatement assignment model\n",
+ " ## Compromise between outcome and treatment assignment model\n",
" weighted_outcome0 = (1 - X[\"trt\"]) * (X[\"y\"] - m0_pred) / (1 - X[\"ps\"]) + m0_pred\n",
" weighted_outcome1 = X[\"trt\"] * (X[\"y\"] - m1_pred) / X[\"ps\"] + m1_pred\n",
"\n",
@@ -3769,7 +3769,7 @@
"source": [
"### Considerations when choosing between models\n",
"\n",
- "It is one thing to evalute change in average over the population, but we might want to allow for the idea of effect heterogenity across the population and as such the BART model is generally better at ensuring accurate predictions across the deeper strata of our data. But the flexibility of machine learning models for prediction tasks do not guarantee that the propensity scores attributed across the sample are well calibrated to recover the true-treatment effects when used in causal effect estimation. We have to be careful in how we use the flexibility of non-parametric models in the causal context. \n",
+ "It is one thing to evaluate change in average over the population, but we might want to allow for the idea of effect heterogenity across the population and as such the BART model is generally better at ensuring accurate predictions across the deeper strata of our data. But the flexibility of machine learning models for prediction tasks do not guarantee that the propensity scores attributed across the sample are well calibrated to recover the true-treatment effects when used in causal effect estimation. We have to be careful in how we use the flexibility of non-parametric models in the causal context. \n",
"\n",
"First observe the hetereogenous accuracy induced by the BART model across increasingly narrow strata of our sample. "
]
@@ -3841,7 +3841,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Observations like this go a long way to motivating the use of flexible machine learning methods in causal inference. The model used to capture the outcome distribution or the propensity score distribution ought to be sensitive to variation across extremities of the data. We can see above that the predictive power of the simpler logistic regression model deterioriates as we progress down the partitions of the data. We will see an example below where the flexibility of machine learning models such as BART becomes a problem. We'll also see and how it can be fixed. Paradoxical as it sounds, a more perfect model of the propensity scores will cleanly seperate the treatment classes making re-balancing harder to achieve. In this way, flexible models like BART (which are prone to overfit) need to be used with care in the case of inverse propensity weighting schemes. "
+ "Observations like this go a long way to motivating the use of flexible machine learning methods in causal inference. The model used to capture the outcome distribution or the propensity score distribution ought to be sensitive to variation across extremities of the data. We can see above that the predictive power of the simpler logistic regression model deterioriates as we progress down the partitions of the data. We will see an example below where the flexibility of machine learning models such as BART becomes a problem. We'll also see and how it can be fixed. Paradoxical as it sounds, a more perfect model of the propensity scores will cleanly separate the treatment classes making re-balancing harder to achieve. In this way, flexible models like BART (which are prone to overfit) need to be used with care in the case of inverse propensity weighting schemes. "
]
},
{
@@ -3968,7 +3968,7 @@
],
"source": [
"def make_prop_reg_model(X, t, y, idata_ps, covariates=None, samples=1000):\n",
- " ### Note the simplication for specifying the mean estimate in the regression\n",
+ " ### Note the simplification for specifying the mean estimate in the regression\n",
" ### rather than post-processing the whole posterior\n",
" ps = idata_ps[\"posterior\"][\"p\"].mean(dim=(\"chain\", \"draw\")).values\n",
" X_temp = pd.DataFrame({\"ps\": ps, \"trt\": t, \"trt*ps\": t * ps})\n",
@@ -11190,7 +11190,7 @@
"source": [
"### How does Regression Help?\n",
"\n",
- "We've just seen an example of how a mis-specfied machine learning model can wildly bias the causal estimates in a study. We've seen one means of fixing it, but how would things work out if we just tried simpler exploratory regression modelling? Regression automatically weights the data points by their extremity of their covariate profile and their prevalence in the sample. So in this sense adjusts for the outlier propensity scores in a way that the inverse weighting approach cannot."
+ "We've just seen an example of how a mis-specified machine learning model can wildly bias the causal estimates in a study. We've seen one means of fixing it, but how would things work out if we just tried simpler exploratory regression modelling? Regression automatically weights the data points by their extremity of their covariate profile and their prevalence in the sample. So in this sense adjusts for the outlier propensity scores in a way that the inverse weighting approach cannot."
]
},
{
@@ -11726,7 +11726,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This is much better and we can see that modelling the propensity score feature in conjunction with the health factors leads to a more sensible treatement effect estimate. This kind of finding echoes the lesson reported in Angrist and Pischke that:\n",
+ "This is much better and we can see that modelling the propensity score feature in conjunction with the health factors leads to a more sensible treatment effect estimate. This kind of finding echoes the lesson reported in Angrist and Pischke that:\n",
"\n",
"> \"Regression control for the right covariates does a reasonable job of eliminating selection effects...\" pg 91 _Mostly Harmless Econometrics_ {cite:t}`angrist2008harmless`\n",
"\n",
@@ -11741,18 +11741,18 @@
"\n",
"To recap - we've seen two examples of causal inference with inverse probability weighted adjustments. We've seen when it works when the propensity score model is well-calibrated. We've seen when it fails and how the failure can be fixed. These are tools in our tool belt - apt for different problems, but come with the requirement that we think carefully about the data generating process and the type of appropriate covariates. \n",
"\n",
- "In the case where the simple propensity modelling approach failed, we saw a data set in which our treatment assignment did not distinguish an average treatment effect. We also saw how if we augment our propensity based estimator we can improve the identification properties of the technique. Here we'll show another example of how propensity models can be combined with an insight from regression based modelling to take advantage of the flexible modelling possibilities offered by machine learning approaches to causal inference. In this secrion we draw heavily on the work of Matheus Facure, especially his book _Causal Inference in Python_ {cite:t}`facure2023causal` but the original ideas are to be found in {cite:t}`ChernozhukovDoubleML`\n",
+ "In the case where the simple propensity modelling approach failed, we saw a data set in which our treatment assignment did not distinguish an average treatment effect. We also saw how if we augment our propensity based estimator we can improve the identification properties of the technique. Here we'll show another example of how propensity models can be combined with an insight from regression based modelling to take advantage of the flexible modelling possibilities offered by machine learning approaches to causal inference. In this section we draw heavily on the work of Matheus Facure, especially his book _Causal Inference in Python_ {cite:t}`facure2023causal` but the original ideas are to be found in {cite:t}`ChernozhukovDoubleML`\n",
"\n",
"### The Frisch-Waugh-Lovell Theorem\n",
"\n",
- "The idea of the theorem is that for any OLS fitted linear model with a focal parameter $\\beta_{1}$ and the auxilary parameters $\\gamma_{i}$ \n",
+ "The idea of the theorem is that for any OLS fitted linear model with a focal parameter $\\beta_{1}$ and the auxiliary parameters $\\gamma_{i}$ \n",
"\n",
"$$\\hat{Y} = \\beta_{0} + \\beta_{1}D_{1} + \\gamma_{1}Z_{1} + ... + \\gamma_{n}Z_{n} $$\n",
"\n",
"We can retrieve the same values $\\beta_{i}, \\gamma_{i}$ in a two step procedure: \n",
"\n",
- "- Regress $Y$ on the auxilary covariates i.e. $\\hat{Y} = \\gamma_{1}Z_{1} + ... + \\gamma_{n}Z_{n} $\n",
- "- Regress $D_{1}$ on the same auxilary terms i.e. $\\hat{D_{1}} = \\gamma_{1}Z_{1} + ... + \\gamma_{n}Z_{n}$\n",
+ "- Regress $Y$ on the auxiliary covariates i.e. $\\hat{Y} = \\gamma_{1}Z_{1} + ... + \\gamma_{n}Z_{n} $\n",
+ "- Regress $D_{1}$ on the same auxiliary terms i.e. $\\hat{D_{1}} = \\gamma_{1}Z_{1} + ... + \\gamma_{n}Z_{n}$\n",
"- Take the residuals $r(D) = D_{1} - \\hat{D_{1}}$ and $r(Y) = Y - \\hat{Y}$\n",
"- Fit the regression $r(Y) = \\beta_{0} + \\beta_{1}r(D)$ to find $\\beta_{1}$\n",
"\n",
@@ -11933,7 +11933,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Estimated Treament Effect in K-fold 0: -0.0016336648336909886\n"
+ "Estimated Treatment Effect in K-fold 0: -0.0016336648336909886\n"
]
},
{
@@ -12056,7 +12056,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Estimated Treament Effect in K-fold 1: -0.03578615474390446\n"
+ "Estimated Treatment Effect in K-fold 1: -0.03578615474390446\n"
]
},
{
@@ -12179,7 +12179,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Estimated Treament Effect in K-fold 2: -0.02477987896421092\n"
+ "Estimated Treatment Effect in K-fold 2: -0.02477987896421092\n"
]
},
{
@@ -12302,7 +12302,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
- "Estimated Treament Effect in K-fold 3: -4.86938979011845e-05\n"
+ "Estimated Treatment Effect in K-fold 3: -4.86938979011845e-05\n"
]
}
],
@@ -12373,7 +12373,7 @@
" m0 = sm.OLS(y_resid[j, :].values, covariates).fit()\n",
" t_effects.append(m0.params[\"t_resid\"])\n",
" model_fits[i] = [m0, t_effects]\n",
- " print(f\"Estimated Treament Effect in K-fold {i}: {np.mean(t_effects)}\")"
+ " print(f\"Estimated Treatment Effect in K-fold {i}: {np.mean(t_effects)}\")"
]
},
{
diff --git a/examples/causal_inference/bayesian_nonparametric_causal.myst.md b/examples/causal_inference/bayesian_nonparametric_causal.myst.md
index e452bbf1b..98f103ea5 100644
--- a/examples/causal_inference/bayesian_nonparametric_causal.myst.md
+++ b/examples/causal_inference/bayesian_nonparametric_causal.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Bayesian Non-parametric Causal Inference
:::{post} January, 2024
-:tags: bart, propensity scores, debiased machine learning, mediation
+:tags: BART, propensity scores, debiased machine learning, mediation
:category: advanced, reference
:author: Nathaniel Forde
:::
@@ -83,9 +83,9 @@ With observational data we cannot re-run the assignment mechanism but we can est
Firstly, and somewhat superficially, the propensity score is a dimension reduction technique. We take a complex covariate profile $X_{i}$ representing an individual's measured attributes and reduce it to a scalar $p^{i}_{T}(X)$. It is also a tool for thinking about the potential outcomes of an individual under different treatment regimes. In a policy evaluation context it can help partial out the degree of incentives for policy adoption across strata of the population. What drives adoption or assignment in each niche of the population? How can different demographic strata be induced towards or away from adoption of the policy? Understanding these dynamics is crucial to gauge why selection bias might emerge in any sample data. Paul Goldsmith-Pinkham's [lectures](https://www.youtube.com/watch?v=8gWctYvRzk4&list=PLWWcL1M3lLlojLTSVf2gGYQ_9TlPyPbiJ&index=3) are particularly clear on this last point, and why this perspective is appealing to structural econometricians.
-The pivotal idea when thinking about propensity scores is that we cannot license causal claims unless (i) the treatment assignment is independent of the covariate profiles i.e $T \perp\!\!\!\perp X$ and (ii) the outcomes $Y(0)$, and $Y(1)$ are similarly conditionally independent of the treatement $T | X$. If these conditions hold, then we say that $T$ is __strongly ignorable__ given $X$. This is also occasionally noted as the __unconfoundedness__ or __exchangeability__ assumption. For each strata of the population defined by the covariate profile $X$, we require that, after controlling for $X$, it's as good as random which treatment status an individual adopts. This means that after controlling for $X$, any differences in outcomes between the treated and untreated groups can be attributed to the treatment itself rather than confounding variables.
+The pivotal idea when thinking about propensity scores is that we cannot license causal claims unless (i) the treatment assignment is independent of the covariate profiles i.e. $T \perp\!\!\!\perp X$ and (ii) the outcomes $Y(0)$ and $Y(1)$ are similarly conditionally independent of the treatment $T | X$. If these conditions hold, then we say that $T$ is __strongly ignorable__ given $X$. This is also occasionally noted as the __unconfoundedness__ or __exchangeability__ assumption. For each stratum of the population defined by the covariate profile $X$, we require that, after controlling for $X$, it's as good as random which treatment status an individual adopts. This means that after controlling for $X$, any differences in outcomes between the treated and untreated groups can be attributed to the treatment itself rather than confounding variables.
-It is a theorem that if $T$ is strongly ignorable given $X$, then (i) and (ii) hold given $p_{T}(X)$ too. So valid statistical inference proceeds in a lower dimensional space using the propensity score as a proxy for the higher dimensional data. This is useful because some of the strata of a complex covariate profile may be sparsely populated so substituting a propensity score enables us to avoid the risks of high dimensional missing data. Causal inference is unconfounded when we have controlled for enough of drivers for policy adoption, that selection effects within each covariate profiles $X$ seem essentially random. The insight this suggests is that when you want to estimate a causal effect you are only required to control for the covariates which impact the probability of treatement assignment. More concretely, if it's easier to model the assignment mechanism than the outcome mechanism this can be substituted in the case of causal inference with observed data.
+It is a theorem that if $T$ is strongly ignorable given $X$, then (i) and (ii) hold given $p_{T}(X)$ too. So valid statistical inference proceeds in a lower dimensional space using the propensity score as a proxy for the higher dimensional data. This is useful because some of the strata of a complex covariate profile may be sparsely populated, so substituting a propensity score enables us to avoid the risks of high dimensional missing data. Causal inference is unconfounded when we have controlled for enough of the drivers of policy adoption that selection effects within each covariate profile $X$ seem essentially random. The insight this suggests is that when you want to estimate a causal effect, you are only required to control for the covariates which impact the probability of treatment assignment. More concretely, if it's easier to model the assignment mechanism than the outcome mechanism, the former can be substituted for the latter when doing causal inference with observed data.
-Given the assumption that we are measuring the right covariate profiles to induce __strong ignorability__, then propensity scores can be used thoughtfully to underwrite causal claims.
+Given the assumption that we are measuring the right covariate profiles to induce __strong ignorability__, propensity scores can be used thoughtfully to underwrite causal claims.
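Concretely, the theorem licenses a workflow like the following minimal sketch (illustrative only, not the notebook's code; it assumes `X`, `t` and `y` are a numeric covariate frame, a binary treatment indicator and the outcome, as used later in the example): model the assignment mechanism to get $p_{T}(X)$, then weight observations by the fitted score instead of conditioning on the full covariate profile.

```python
import numpy as np
import statsmodels.api as sm

# Fit the assignment mechanism: a simple logistic model for p(T=1 | X)
ps_model = sm.Logit(t, sm.add_constant(X)).fit()
ps = ps_model.predict(sm.add_constant(X))

# Inverse-probability weighting by 1/ps (treated) and 1/(1-ps) (control)
# emulates the covariate balance a randomised assignment would give us
ate_ipw = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
```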
@@ -138,7 +138,7 @@ This is a useful perspective on the assumption of __strong ignorability__ becaus
-If our treatment status is such that individuals will more or less actively select themselves into the status, then a naive comparisons of differences between treatment groups and control groups will be misleading to the degree that we have over-represented types of individual (covariate profiles) in the population.Randomisation solves this by balancing the covariate profiles across treatment and control groups and ensuring the outcomes are independent of the treatment assignment. But we can't always randomise. Propensity scores are useful because they can help emulate _as-if_ random assignment of treatment status in the sample data through a specific transformation of the observed data.
+If our treatment status is such that individuals will more or less actively select themselves into the status, then naive comparisons of differences between treatment and control groups will be misleading to the degree that we have over-represented types of individual (covariate profiles) in the population. Randomisation solves this by balancing the covariate profiles across treatment and control groups and ensuring the outcomes are independent of the treatment assignment. But we can't always randomise. Propensity scores are useful because they can help emulate _as-if_ random assignment of treatment status in the sample data through a specific transformation of the observed data.
-This type of assumption and ensuing tests of balance based on propensity scores is often substituted for elaboration of the structural DAG that systematically determine the treatment assignment. The idea being that if we can achieve balance across covariates conditional on a propensity score we have emulated an as-if random allocation we can avoid the hassle of specifying too much structure and remain agnostic about the strucuture of the mechanism. This can often be a useful strategy but, as we will see, elides the specificity of the causal question and the data generating process.
+This type of assumption, and the ensuing tests of balance based on propensity scores, is often substituted for an elaboration of the structural DAG that systematically determines the treatment assignment. The idea is that if we can achieve balance across covariates conditional on a propensity score, we have emulated an as-if random allocation and can avoid the hassle of specifying too much structure, remaining agnostic about the structure of the mechanism. This can often be a useful strategy but, as we will see, it elides the specificity of the causal question and the data generating process.
+++
@@ -543,7 +543,7 @@ def make_doubly_robust_adjustment(X, t, y):
m1_pred = m1.predict(X)
X["trt"] = t
X["y"] = y
- ## Compromise between outcome and treatement assignment model
+ ## Compromise between outcome and treatment assignment model
weighted_outcome0 = (1 - X["trt"]) * (X["y"] - m0_pred) / (1 - X["ps"]) + m0_pred
weighted_outcome1 = X["trt"] * (X["y"] - m1_pred) / X["ps"] + m1_pred
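Each weighted outcome above is the augmented inverse probability weighting (AIPW) construction: the outcome model's prediction plus an inverse-probability-weighted correction from the observed residual, which is what makes the estimator doubly robust. A hedged sketch of the final step such a function presumably performs (not necessarily its actual return value):

```python
# Doubly robust ATE: the average contrast of the two augmented, weighted outcomes
# computed above (weighted_outcome0 / weighted_outcome1 as in the function body)
ate_dr = (weighted_outcome1 - weighted_outcome0).mean()
```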
@@ -804,7 +804,7 @@ Note the tighter variance of the measures using the doubly robust method. This i
### Considerations when choosing between models
-It is one thing to evalute change in average over the population, but we might want to allow for the idea of effect heterogenity across the population and as such the BART model is generally better at ensuring accurate predictions across the deeper strata of our data. But the flexibility of machine learning models for prediction tasks do not guarantee that the propensity scores attributed across the sample are well calibrated to recover the true-treatment effects when used in causal effect estimation. We have to be careful in how we use the flexibility of non-parametric models in the causal context.
+It is one thing to evaluate a change in the average over the population, but we might want to allow for the idea of effect heterogeneity across the population, and as such the BART model is generally better at ensuring accurate predictions across the deeper strata of our data. But the flexibility of machine learning models for prediction tasks does not guarantee that the propensity scores attributed across the sample are well calibrated to recover the true treatment effects when used in causal effect estimation. We have to be careful in how we use the flexibility of non-parametric models in the causal context.
-First observe the hetereogenous accuracy induced by the BART model across increasingly narrow strata of our sample.
+First, observe the heterogeneous accuracy induced by the BART model across increasingly narrow strata of our sample.
@@ -838,7 +838,7 @@ axs[7].set_title("Race/Gender/Active Specific PPC - BART")
plt.suptitle("Posterior Predictive Checks - Heterogenous Effects", fontsize=20);
```
-Observations like this go a long way to motivating the use of flexible machine learning methods in causal inference. The model used to capture the outcome distribution or the propensity score distribution ought to be sensitive to variation across extremities of the data. We can see above that the predictive power of the simpler logistic regression model deterioriates as we progress down the partitions of the data. We will see an example below where the flexibility of machine learning models such as BART becomes a problem. We'll also see and how it can be fixed. Paradoxical as it sounds, a more perfect model of the propensity scores will cleanly seperate the treatment classes making re-balancing harder to achieve. In this way, flexible models like BART (which are prone to overfit) need to be used with care in the case of inverse propensity weighting schemes.
+Observations like this go a long way to motivating the use of flexible machine learning methods in causal inference. The model used to capture the outcome distribution or the propensity score distribution ought to be sensitive to variation across extremities of the data. We can see above that the predictive power of the simpler logistic regression model deteriorates as we progress down the partitions of the data. We will see an example below where the flexibility of machine learning models such as BART becomes a problem. We'll also see how it can be fixed. Paradoxical as it sounds, a more perfect model of the propensity scores will cleanly separate the treatment classes, making re-balancing harder to achieve. In this way, flexible models like BART (which are prone to overfit) need to be used with care in the case of inverse propensity weighting schemes.
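One standard guard against this failure mode, and not necessarily the fix adopted below, is to clip (or trim) the fitted propensity scores away from 0 and 1 before forming inverse weights, so that near-perfect separation cannot produce a handful of explosive weights. A minimal sketch, assuming `ps` holds the fitted scores and `t` the treatment indicator:

```python
import numpy as np

ps_clipped = np.clip(ps, 0.05, 0.95)  # bound the scores away from the extremes
ipw = t / ps_clipped + (1 - t) / (1 - ps_clipped)  # stabilised inverse-probability weights
```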
+++
@@ -848,7 +848,7 @@ Another perhaps more direct method of causal inference is to just use regression
```{code-cell} ipython3
def make_prop_reg_model(X, t, y, idata_ps, covariates=None, samples=1000):
- ### Note the simplication for specifying the mean estimate in the regression
+ ### Note the simplification for specifying the mean estimate in the regression
### rather than post-processing the whole posterior
ps = idata_ps["posterior"]["p"].mean(dim=("chain", "draw")).values
X_temp = pd.DataFrame({"ps": ps, "trt": t, "trt*ps": t * ps})
@@ -1276,7 +1276,7 @@ plot_balance(temp, "bmi", t)
### How does Regression Help?
-We've just seen an example of how a mis-specfied machine learning model can wildly bias the causal estimates in a study. We've seen one means of fixing it, but how would things work out if we just tried simpler exploratory regression modelling? Regression automatically weights the data points by their extremity of their covariate profile and their prevalence in the sample. So in this sense adjusts for the outlier propensity scores in a way that the inverse weighting approach cannot.
+We've just seen an example of how a mis-specified machine learning model can wildly bias the causal estimates in a study. We've seen one means of fixing it, but how would things work out if we just tried simpler exploratory regression modelling? Regression automatically weights the data points by the extremity of their covariate profile and their prevalence in the sample. In this sense it adjusts for the outlier propensity scores in a way that the inverse weighting approach cannot.
```{code-cell} ipython3
model_ps_reg_expend, idata_ps_reg_expend = make_prop_reg_model(X, t, y, idata_expend_bart)
@@ -1302,7 +1302,7 @@ model_ps_reg_expend_h, idata_ps_reg_expend_h = make_prop_reg_model(
az.summary(idata_ps_reg_expend_h, var_names=["b"])
```
-This is much better and we can see that modelling the propensity score feature in conjunction with the health factors leads to a more sensible treatement effect estimate. This kind of finding echoes the lesson reported in Angrist and Pischke that:
+This is much better and we can see that modelling the propensity score feature in conjunction with the health factors leads to a more sensible treatment effect estimate. This kind of finding echoes the lesson reported in Angrist and Pischke that:
> "Regression control for the right covariates does a reasonable job of eliminating selection effects..." pg 91 _Mostly Harmless Econometrics_ {cite:t}`angrist2008harmless`
@@ -1314,18 +1314,18 @@ So we're back to the question of the right controls. There is a no real way to a
-To recap - we've seen two examples of causal inference with inverse probability weighted adjustments. We've seen when it works when the propensity score model is well-calibrated. We've seen when it fails and how the failure can be fixed. These are tools in our tool belt - apt for different problems, but come with the requirement that we think carefully about the data generating process and the type of appropriate covariates.
+To recap - we've seen two examples of causal inference with inverse probability weighted adjustments. We've seen that it works when the propensity score model is well-calibrated. We've seen when it fails and how the failure can be fixed. These are tools in our tool belt - apt for different problems, but they come with the requirement that we think carefully about the data generating process and the appropriate type of covariates.
-In the case where the simple propensity modelling approach failed, we saw a data set in which our treatment assignment did not distinguish an average treatment effect. We also saw how if we augment our propensity based estimator we can improve the identification properties of the technique. Here we'll show another example of how propensity models can be combined with an insight from regression based modelling to take advantage of the flexible modelling possibilities offered by machine learning approaches to causal inference. In this secrion we draw heavily on the work of Matheus Facure, especially his book _Causal Inference in Python_ {cite:t}`facure2023causal` but the original ideas are to be found in {cite:t}`ChernozhukovDoubleML`
+In the case where the simple propensity modelling approach failed, we saw a data set in which our treatment assignment did not distinguish an average treatment effect. We also saw how, if we augment our propensity-based estimator, we can improve the identification properties of the technique. Here we'll show another example of how propensity models can be combined with an insight from regression-based modelling to take advantage of the flexible modelling possibilities offered by machine learning approaches to causal inference. In this section we draw heavily on the work of Matheus Facure, especially his book _Causal Inference in Python_ {cite:t}`facure2023causal`, but the original ideas are to be found in {cite:t}`ChernozhukovDoubleML`.
### The Frisch-Waugh-Lovell Theorem
-The idea of the theorem is that for any OLS fitted linear model with a focal parameter $\beta_{1}$ and the auxilary parameters $\gamma_{i}$
+The idea of the theorem is that for any OLS fitted linear model with a focal parameter $\beta_{1}$ and the auxiliary parameters $\gamma_{i}$
$$\hat{Y} = \beta_{0} + \beta_{1}D_{1} + \gamma_{1}Z_{1} + ... + \gamma_{n}Z_{n} $$
-We can retrieve the same values $\beta_{i}, \gamma_{i}$ in a two step procedure:
+We can retrieve the same value of the focal parameter $\beta_{1}$ with the following procedure:
-- Regress $Y$ on the auxilary covariates i.e. $\hat{Y} = \gamma_{1}Z_{1} + ... + \gamma_{n}Z_{n} $
-- Regress $D_{1}$ on the same auxilary terms i.e. $\hat{D_{1}} = \gamma_{1}Z_{1} + ... + \gamma_{n}Z_{n}$
+- Regress $Y$ on the auxiliary covariates i.e. $\hat{Y} = \gamma_{1}Z_{1} + ... + \gamma_{n}Z_{n} $
+- Regress $D_{1}$ on the same auxiliary terms i.e. $\hat{D_{1}} = \gamma_{1}Z_{1} + ... + \gamma_{n}Z_{n}$
- Take the residuals $r(D) = D_{1} - \hat{D_{1}}$ and $r(Y) = Y - \hat{Y}$
- Fit the regression $r(Y) = \beta_{0} + \beta_{1}r(D)$ to find $\beta_{1}$
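As a quick illustrative check of the theorem on synthetic data (not part of the notebook), the coefficient on $D_{1}$ from the one-shot regression coincides with the slope from the residual-on-residual regression:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
Z = rng.normal(size=(n, 2))                       # auxiliary covariates
D = Z @ [0.5, -0.3] + rng.normal(size=n)          # "treatment" driven by Z
Y = 2.0 * D + Z @ [1.0, 0.7] + rng.normal(size=n)

# One-shot regression of Y on D and Z
full = sm.OLS(Y, sm.add_constant(np.column_stack([D, Z]))).fit()

# FWL: residualise Y and D on Z, then regress residual on residual
r_y = Y - sm.OLS(Y, sm.add_constant(Z)).fit().fittedvalues
r_d = D - sm.OLS(D, sm.add_constant(Z)).fit().fittedvalues
partial = sm.OLS(r_y, sm.add_constant(r_d)).fit()

print(full.params[1], partial.params[1])  # the two estimates of beta_1 agree
```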
@@ -1435,7 +1435,7 @@ for i in range(4):
m0 = sm.OLS(y_resid[j, :].values, covariates).fit()
t_effects.append(m0.params["t_resid"])
model_fits[i] = [m0, t_effects]
- print(f"Estimated Treament Effect in K-fold {i}: {np.mean(t_effects)}")
+ print(f"Estimated Treatment Effect in K-fold {i}: {np.mean(t_effects)}")
```
```{code-cell} ipython3
diff --git a/examples/causal_inference/interventional_distribution.ipynb b/examples/causal_inference/interventional_distribution.ipynb
index bd79dfbcd..48c96e8a7 100644
--- a/examples/causal_inference/interventional_distribution.ipynb
+++ b/examples/causal_inference/interventional_distribution.ipynb
@@ -36,7 +36,7 @@
"source": [
"[PyMC](https://github.com/pymc-devs/pymc) is a pivotal component of the open source Bayesian statistics ecosystem. It helps solve real problems across a wide range of industries and academic research areas every day. And it has gained this level of utility by being accessible, powerful, and practically useful at solving _Bayesian statistical inference_ problems.\n",
"\n",
- "But times are changing. There's a [causal revolution](https://en.wikipedia.org/wiki/The_Book_of_Why) underway and there's a growing recognition that to answer some of the most interesting and challenging questions requires us to intergrate causal reasoning into our efforts.\n",
+ "But times are changing. There's a [causal revolution](https://en.wikipedia.org/wiki/The_Book_of_Why) underway and there's a growing recognition that to answer some of the most interesting and challenging questions requires us to integrate causal reasoning into our efforts.\n",
"\n",
"PyMC is rising to this challenge! While there are many novel causal concepts to learn, Bayesians will find that they are not starting from scratch. They are already pretty familiar with [Directed Acyclic Graphs (DAGs)](https://en.wikipedia.org/wiki/Directed_acyclic_graph) and so this gives a good jumping off point to gain relatively easy access into the world of **Bayesian causal inference**.\n",
"\n",
@@ -559,7 +559,7 @@
"\n",
"There are two important changes here:\n",
"1. Note that $x_3$ was previously a random variable, but this has now been 'locked' at a particular value, $x_3=1$, because of our intervention.\n",
- "2. Note the absense of the $P(x_3|x_1)$ term, because $x_1$ no longer has any causal influence over $x_3$.\n",
+ "2. Note the absence of the $P(x_3|x_1)$ term, because $x_1$ no longer has any causal influence over $x_3$.\n",
"\n",
"So in summary, this is pretty cool. We can use the $\\operatorname{do}$ operator to make in intervention in our model of the world. We can then observe the consequences of this intervention and make much better predictions of what will happen when we are active and intervene (actually or hypothetically) in the world. The accuracy is of course subject to how well our causal DAG reflects the real processes in the world.\n",
"\n",
@@ -1800,7 +1800,7 @@
"source": [
"## Summary\n",
"\n",
- "Hopefuly, I've established a strong case for why we need to expand our skillset beyond the realm of Bayesian statistics alone. While these approaches are, and will always be, at the core of PyMC, the ecosystem is embracing causal reasoning.\n",
+ "Hopefully, I've established a strong case for why we need to expand our skillset beyond the realm of Bayesian statistics alone. While these approaches are, and will always be, at the core of PyMC, the ecosystem is embracing causal reasoning.\n",
"\n",
"In particular, we've seen how we can use the new $\\operatorname{do}$ operator to implement realised or hypothetical interventions on causal models of the world to obtain interventional distributions. Understanding the underlying causal DAG and how interventions change this DAG are crucial components in building our understanding of causal reasoning.\n",
"\n",
diff --git a/examples/causal_inference/interventional_distribution.myst.md b/examples/causal_inference/interventional_distribution.myst.md
index d726dcb4f..2189ef8ad 100644
--- a/examples/causal_inference/interventional_distribution.myst.md
+++ b/examples/causal_inference/interventional_distribution.myst.md
@@ -26,7 +26,7 @@ kernelspec:
[PyMC](https://github.com/pymc-devs/pymc) is a pivotal component of the open source Bayesian statistics ecosystem. It helps solve real problems across a wide range of industries and academic research areas every day. And it has gained this level of utility by being accessible, powerful, and practically useful at solving _Bayesian statistical inference_ problems.
-But times are changing. There's a [causal revolution](https://en.wikipedia.org/wiki/The_Book_of_Why) underway and there's a growing recognition that to answer some of the most interesting and challenging questions requires us to intergrate causal reasoning into our efforts.
+But times are changing. There's a [causal revolution](https://en.wikipedia.org/wiki/The_Book_of_Why) underway and there's a growing recognition that to answer some of the most interesting and challenging questions requires us to integrate causal reasoning into our efforts.
PyMC is rising to this challenge! While there are many novel causal concepts to learn, Bayesians will find that they are not starting from scratch. They are already pretty familiar with [Directed Acyclic Graphs (DAGs)](https://en.wikipedia.org/wiki/Directed_acyclic_graph) and so this gives a good jumping off point to gain relatively easy access into the world of **Bayesian causal inference**.
@@ -225,7 +225,7 @@ $$
There are two important changes here:
1. Note that $x_3$ was previously a random variable, but this has now been 'locked' at a particular value, $x_3=1$, because of our intervention.
-2. Note the absense of the $P(x_3|x_1)$ term, because $x_1$ no longer has any causal influence over $x_3$.
+2. Note the absence of the $P(x_3|x_1)$ term, because $x_1$ no longer has any causal influence over $x_3$.
-So in summary, this is pretty cool. We can use the $\operatorname{do}$ operator to make in intervention in our model of the world. We can then observe the consequences of this intervention and make much better predictions of what will happen when we are active and intervene (actually or hypothetically) in the world. The accuracy is of course subject to how well our causal DAG reflects the real processes in the world.
+So in summary, this is pretty cool. We can use the $\operatorname{do}$ operator to make an intervention in our model of the world. We can then observe the consequences of this intervention and make much better predictions of what will happen when we are active and intervene (actually or hypothetically) in the world. The accuracy is of course subject to how well our causal DAG reflects the real processes in the world.
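As a rough sketch of what this looks like in code, here is a toy three-node model with the $x_1 \rightarrow x_3 \rightarrow y$ chain, assuming `pm.do` accepts a mapping from variable names to intervention values as in recent PyMC releases:

```python
import pymc as pm

with pm.Model() as model:
    x1 = pm.Normal("x1")
    x3 = pm.Normal("x3", mu=x1, sigma=1.0)  # x1 -> x3
    y = pm.Normal("y", mu=x3, sigma=1.0)    # x3 -> y

# Intervene: lock x3 at 1, which removes the P(x3 | x1) factor (the x1 -> x3 edge)
model_do = pm.do(model, {"x3": 1.0})

with model_do:
    idata_do = pm.sample_prior_predictive(2_000)  # draws from P(y | do(x3=1))
```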
@@ -570,7 +570,7 @@ $P(y|\operatorname{do}(x=2))$ for DAG 2 and DAG 3 will actually be the same in t
## Summary
-Hopefuly, I've established a strong case for why we need to expand our skillset beyond the realm of Bayesian statistics alone. While these approaches are, and will always be, at the core of PyMC, the ecosystem is embracing causal reasoning.
+Hopefully, I've established a strong case for why we need to expand our skillset beyond the realm of Bayesian statistics alone. While these approaches are, and will always be, at the core of PyMC, the ecosystem is embracing causal reasoning.
In particular, we've seen how we can use the new $\operatorname{do}$ operator to implement realised or hypothetical interventions on causal models of the world to obtain interventional distributions. Understanding the underlying causal DAG and how interventions change this DAG are crucial components in building our understanding of causal reasoning.
diff --git a/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.ipynb b/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.ipynb
index de72956fd..545bc7c7a 100644
--- a/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.ipynb
+++ b/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.ipynb
@@ -2170,7 +2170,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "As shown above, the effective sample size per iteration has drastically improved, and the trace plots no longer show any \"stickyness\". However, we do still see the rare divergence. These infrequent divergences do not seem concentrate anywhere in parameter space, which is indicative of the divergences being false positives."
+ "As shown above, the effective sample size per iteration has drastically improved, and the trace plots no longer show any \"stickiness\". However, we do still see the rare divergence. These infrequent divergences do not seem concentrate anywhere in parameter space, which is indicative of the divergences being false positives."
]
},
{
diff --git a/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md b/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md
index 1ff759ff1..9f9b9ad61 100644
--- a/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md
+++ b/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md
@@ -510,7 +510,7 @@ with NonCentered_eight:
az.summary(fit_ncp80).round(2)
```
-As shown above, the effective sample size per iteration has drastically improved, and the trace plots no longer show any "stickyness". However, we do still see the rare divergence. These infrequent divergences do not seem concentrate anywhere in parameter space, which is indicative of the divergences being false positives.
+As shown above, the effective sample size per iteration has drastically improved, and the trace plots no longer show any "stickiness". However, we do still see the rare divergence. These infrequent divergences do not seem to concentrate anywhere in parameter space, which is indicative of the divergences being false positives.
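One quick way to check that claim, and presumably roughly what the `report_trace` helper below visualises, is to highlight divergent transitions in a pair plot: diffusely scattered divergences point to false positives, whereas divergences piled up in one region (e.g. the neck of the funnel) indicate a real geometry problem. A sketch, assuming the model's parameters are named `mu` and `tau`:

```python
az.plot_pair(fit_ncp80, var_names=["mu", "tau"], divergences=True);
```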
```{code-cell} ipython3
report_trace(fit_ncp80)
diff --git a/examples/diagnostics_and_criticism/model_averaging.ipynb b/examples/diagnostics_and_criticism/model_averaging.ipynb
index 37d5db3f2..55809268e 100644
--- a/examples/diagnostics_and_criticism/model_averaging.ipynb
+++ b/examples/diagnostics_and_criticism/model_averaging.ipynb
@@ -82,7 +82,7 @@
"source": [
"When confronted with more than one model we have several options. One of them is to perform model selection as exemplified by the PyMC examples {ref}`pymc:model_comparison` and the {ref}`GLM-model-selection`, usually is a good idea to also include posterior predictive checks in order to decide which model to keep. Discarding all models except one is equivalent to affirm that, among the evaluated models, one is correct (under some criteria) with probability 1 and the rest are incorrect. In most cases this will be an overstatment that ignores the uncertainty we have in our models. This is somewhat similar to computing the full posterior and then just keeping a point-estimate like the posterior mean; we may become overconfident of what we really know. You can also browse the {doc}`blog/tag/model-comparison` tag to find related posts. \n",
"\n",
- "An alternative to this dilema is to perform model selection but to acknoledge the models we discared. If the number of models are not that large this can be part of a technical discussion on a paper, presentation, thesis, and so on. If the audience is not technical enough, this may not be a good idea.\n",
+ "An alternative to this dilemma is to perform model selection but to acknowledge the models we discarded. If the number of models are not that large this can be part of a technical discussion on a paper, presentation, thesis, and so on. If the audience is not technical enough, this may not be a good idea.\n",
"\n",
"Yet another alternative, the topic of this example, is to perform model averaging. The idea is to weight each model by its merit and generate predictions from each model, proportional to those weights. There are several ways to do this, including the three methods that will be briefly discussed in this notebook. You will find a more thorough explanation in the work by {cite:t}`Yao_2018` and {cite:t}`Yao_2022`. \n",
"\n",
@@ -110,7 +110,7 @@
"* WAIC, Widely Applicable Information Criterion\n",
"* LOO, Pareto-Smooth-Leave-One-Out-Cross-Validation.\n",
"\n",
- "Both requiere and InferenceData with the log-likelihood group and are equally fast to compute. We recommend using LOO because it has better practical properties, and better diagnostics (so we known when we are having issues with the ELPD estimation).\n",
+ "Both require and InferenceData with the log-likelihood group and are equally fast to compute. We recommend using LOO because it has better practical properties, and better diagnostics (so we known when we are having issues with the ELPD estimation).\n",
"\n",
"## Pseudo Bayesian model averaging with Bayesian Bootstrapping\n",
"\n",
@@ -548,7 +548,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Before LOO (or WAIC) to compare and or average models we should check that we do not have sampling issues and posterior predictive checks are resonable. For the sake of brevity we are going to skip these steps and instead jump to the model averaging.\n",
+ "Before LOO (or WAIC) to compare and or average models we should check that we do not have sampling issues and posterior predictive checks are reasonable. For the sake of brevity we are going to skip these steps and instead jump to the model averaging.\n",
"\n",
"First we need to call `az.compare` to compute the LOO values for each model and the weights using `stacking`. These are the default options, if you want to perform pseudo Bayesian model averaging you can use the `method='BB-pseudo-BMA'` that includes the Bayesian Bootstrap estimation of the uncertainty in the ELPD.\n"
]
@@ -1909,7 +1909,7 @@
"\n",
"When not do to model averaging? Many times we can create new models that effectively work as averages of other models. For instance in this example we could have created a new model that includes all the variables. That's actually a very sensible thing to do. Notice that if a model excludes a variable thats equivalent to setting the coefficient of that variable to zero. If we average a model with the variable and without it, it's like setting the coefficient to a value between zero and the value of the coefficient in the model that includes the variable. This is a very simple example, but the same reasoning applies to more complex models.\n",
"\n",
- "Hierarchical models are another example were we build a continous version of a model instead of dealing with discrete versions. A toy example is to imagine that we have a coin and we want to estimated its degree of bias, a number between 0 and 1 having a 0.5 equal chance of head and tails (fair coin). We could think of two separate models: one with a prior biased towards heads and one with a prior biased towards towards tails. We could fit both separate models and then average them. An alternative is to build a hierarchical model to estimate the prior distribution. Instead of contemplating two discrete models, we would be computing a continuous model that considers the discrete ones as particular cases. Which approach is better? That depends on our concrete problem. Do we have good reasons to think about two discrete models, or is our problem better represented with a continuous bigger model?"
+ "Hierarchical models are another example were we build a continuous version of a model instead of dealing with discrete versions. A toy example is to imagine that we have a coin and we want to estimated its degree of bias, a number between 0 and 1 having a 0.5 equal chance of head and tails (fair coin). We could think of two separate models: one with a prior biased towards heads and one with a prior biased towards towards tails. We could fit both separate models and then average them. An alternative is to build a hierarchical model to estimate the prior distribution. Instead of contemplating two discrete models, we would be computing a continuous model that considers the discrete ones as particular cases. Which approach is better? That depends on our concrete problem. Do we have good reasons to think about two discrete models, or is our problem better represented with a continuous bigger model?"
]
},
{
diff --git a/examples/diagnostics_and_criticism/model_averaging.myst.md b/examples/diagnostics_and_criticism/model_averaging.myst.md
index 648cdbe8b..1d3a6df0f 100644
--- a/examples/diagnostics_and_criticism/model_averaging.myst.md
+++ b/examples/diagnostics_and_criticism/model_averaging.myst.md
@@ -56,7 +56,7 @@ az.style.use("arviz-darkgrid")
-When confronted with more than one model we have several options. One of them is to perform model selection as exemplified by the PyMC examples {ref}`pymc:model_comparison` and the {ref}`GLM-model-selection`, usually is a good idea to also include posterior predictive checks in order to decide which model to keep. Discarding all models except one is equivalent to affirm that, among the evaluated models, one is correct (under some criteria) with probability 1 and the rest are incorrect. In most cases this will be an overstatment that ignores the uncertainty we have in our models. This is somewhat similar to computing the full posterior and then just keeping a point-estimate like the posterior mean; we may become overconfident of what we really know. You can also browse the {doc}`blog/tag/model-comparison` tag to find related posts.
+When confronted with more than one model we have several options. One of them is to perform model selection as exemplified by the PyMC examples {ref}`pymc:model_comparison` and the {ref}`GLM-model-selection`; it is usually a good idea to also include posterior predictive checks in order to decide which model to keep. Discarding all models except one is equivalent to affirming that, among the evaluated models, one is correct (under some criteria) with probability 1 and the rest are incorrect. In most cases this will be an overstatement that ignores the uncertainty we have in our models. This is somewhat similar to computing the full posterior and then just keeping a point-estimate like the posterior mean; we may become overconfident of what we really know. You can also browse the {doc}`blog/tag/model-comparison` tag to find related posts.
-An alternative to this dilema is to perform model selection but to acknoledge the models we discared. If the number of models are not that large this can be part of a technical discussion on a paper, presentation, thesis, and so on. If the audience is not technical enough, this may not be a good idea.
+An alternative to this dilemma is to perform model selection but to acknowledge the models we discarded. If the number of models is not that large this can be part of a technical discussion in a paper, presentation, thesis, and so on. If the audience is not technical enough, this may not be a good idea.
Yet another alternative, the topic of this example, is to perform model averaging. The idea is to weight each model by its merit and generate predictions from each model, proportional to those weights. There are several ways to do this, including the three methods that will be briefly discussed in this notebook. You will find a more thorough explanation in the work by {cite:t}`Yao_2018` and {cite:t}`Yao_2022`.
@@ -84,7 +84,7 @@ So far so good, but the ELPD is a theoretical quantity, and in practice we need
* WAIC, Widely Applicable Information Criterion
* LOO, Pareto-Smooth-Leave-One-Out-Cross-Validation.
-Both requiere and InferenceData with the log-likelihood group and are equally fast to compute. We recommend using LOO because it has better practical properties, and better diagnostics (so we known when we are having issues with the ELPD estimation).
+Both require an InferenceData with the log-likelihood group and are equally fast to compute. We recommend using LOO because it has better practical properties and better diagnostics (so we know when we are having issues with the ELPD estimation).
## Pseudo Bayesian model averaging with Bayesian Bootstrapping
@@ -155,7 +155,7 @@ with pm.Model() as model_1:
pm.sample_posterior_predictive(idata_1, extend_inferencedata=True, random_seed=rng)
```
-Before LOO (or WAIC) to compare and or average models we should check that we do not have sampling issues and posterior predictive checks are resonable. For the sake of brevity we are going to skip these steps and instead jump to the model averaging.
+Before using LOO (or WAIC) to compare and/or average models, we should check that we do not have sampling issues and that the posterior predictive checks are reasonable. For the sake of brevity we are going to skip these steps and instead jump to the model averaging.
First we need to call `az.compare` to compute the LOO values for each model and the weights using `stacking`. These are the default options, if you want to perform pseudo Bayesian model averaging you can use the `method='BB-pseudo-BMA'` that includes the Bayesian Bootstrap estimation of the uncertainty in the ELPD.
@@ -201,7 +201,7 @@ Model averaging is a good idea when you want to improve the robustness of your p
-When not do to model averaging? Many times we can create new models that effectively work as averages of other models. For instance in this example we could have created a new model that includes all the variables. That's actually a very sensible thing to do. Notice that if a model excludes a variable thats equivalent to setting the coefficient of that variable to zero. If we average a model with the variable and without it, it's like setting the coefficient to a value between zero and the value of the coefficient in the model that includes the variable. This is a very simple example, but the same reasoning applies to more complex models.
+When not to do model averaging? Many times we can create new models that effectively work as averages of other models. For instance in this example we could have created a new model that includes all the variables. That's actually a very sensible thing to do. Notice that if a model excludes a variable that's equivalent to setting the coefficient of that variable to zero. If we average a model with the variable and without it, it's like setting the coefficient to a value between zero and the value of the coefficient in the model that includes the variable. This is a very simple example, but the same reasoning applies to more complex models.
-Hierarchical models are another example were we build a continous version of a model instead of dealing with discrete versions. A toy example is to imagine that we have a coin and we want to estimated its degree of bias, a number between 0 and 1 having a 0.5 equal chance of head and tails (fair coin). We could think of two separate models: one with a prior biased towards heads and one with a prior biased towards towards tails. We could fit both separate models and then average them. An alternative is to build a hierarchical model to estimate the prior distribution. Instead of contemplating two discrete models, we would be computing a continuous model that considers the discrete ones as particular cases. Which approach is better? That depends on our concrete problem. Do we have good reasons to think about two discrete models, or is our problem better represented with a continuous bigger model?
+Hierarchical models are another example where we build a continuous version of a model instead of dealing with discrete versions. A toy example is to imagine that we have a coin and we want to estimate its degree of bias, a number between 0 and 1, where 0.5 means an equal chance of heads and tails (a fair coin). We could think of two separate models: one with a prior biased towards heads and one with a prior biased towards tails. We could fit both separate models and then average them. An alternative is to build a hierarchical model to estimate the prior distribution. Instead of contemplating two discrete models, we would be computing a continuous model that considers the discrete ones as particular cases. Which approach is better? That depends on our concrete problem. Do we have good reasons to think about two discrete models, or is our problem better represented with a continuous bigger model?
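For the coin example, the hierarchical alternative might look something like this sketch (toy code, not from the notebook): a hyperprior over where the bias prior is centred and how concentrated it is, with the two fixed-prior models recovered as special cases.

```python
import numpy as np
import pymc as pm

flips = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # toy data: 1 = heads

with pm.Model() as hierarchical_coin:
    mu = pm.Beta("mu", 2.0, 2.0)          # where the bias prior is centred
    kappa = pm.HalfNormal("kappa", 10.0)  # how concentrated that prior is
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1 - mu) * kappa)
    pm.Bernoulli("obs", p=theta, observed=flips)
    idata_coin = pm.sample()
```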
+++
diff --git a/examples/fundamentals/data_container.ipynb b/examples/fundamentals/data_container.ipynb
index cf0301a7e..a759536fa 100644
--- a/examples/fundamentals/data_container.ipynb
+++ b/examples/fundamentals/data_container.ipynb
@@ -1403,7 +1403,7 @@
"\n",
"Named dimensions are another powerful benefit of working with data containers. Data containers allow users to keep track of dimensions (like dates or cities) and coordinates (such as the actual date times or city names) of multi-dimensional data. Both allow you to specify the dimension names and coordinates of random variables, instead of specifying the shapes of those random variables as numbers. Notice that in the previous probabilistic graphs, all of the nodes `x_data`, `mu`, `obs` and `y_data` were in a box with the number 100. A natural question for a reader to ask is, \"100 what?\". Dimensions and coordinates help organize models, variables, and data by answering exactly this question.\n",
"\n",
- "In the next example, we generate an artifical dataset of temperatures in 3 cities over 2 months. We will then use named dimensions and coordinates to improve the readability of the model code and the quality of the visualizations."
+ "In the next example, we generate an artificial dataset of temperatures in 3 cities over 2 months. We will then use named dimensions and coordinates to improve the readability of the model code and the quality of the visualizations."
]
},
{
@@ -2656,7 +2656,7 @@
"\n",
" p = pm.Deterministic(\"p\", pm.math.sigmoid(alpha + beta * x_data))\n",
"\n",
- " # Here is were we link the shapes of the inputs (x_data) and the observed varaiable\n",
+ " # Here is were we link the shapes of the inputs (x_data) and the observed variable\n",
" # It will be the shape we tell it, rather than the (constant!) shape of y_data\n",
" obs = pm.Bernoulli(\"obs\", p=p, observed=y_data, shape=x_data.shape[0])\n",
"\n",
@@ -2673,7 +2673,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "A common post-estimation diagonstic is to look at a posterior predictive plot, using {func}`arviz.plot_ppc`. This shows the distribution of data sampled from your model along side the observed data. If they look similar, we have some evidence that the model isn't so bad.\n",
+ "A common post-estimation diagnostic is to look at a posterior predictive plot, using {func}`arviz.plot_ppc`. This shows the distribution of data sampled from your model along side the observed data. If they look similar, we have some evidence that the model isn't so bad.\n",
"\n",
"In this case, however, it can be difficult to interpret a posterior predictive plot. Since we're doing a logistic regression, observed values can only be zero or one. As a result, the posterior predictive graph has this tetris-block shape. What are we to make of it? Evidently our model produces more 1's than 0's, and the mean proportion matches the data. But there's also a lot of uncertainty in that proportion. What else can we say about the model's performance? "
]
@@ -3089,7 +3089,7 @@
"\n",
"The only problem is that by default this function will return predictions for _Length_ for the observed values of _Month_, and $0.5$ months (the value Osvaldo cares about) has not been observed, -- all measures are reported for integer months. The easier way to get predictions for non-observed values of _Month_ is to pass new values to the `Data` container we defined above in our model. To do that, we need to use `pm.set_data` and then we just have to sample from the posterior predictve distribution. We will also have to set `coords` for these new observations, which we are allowed to do in the `pm.set_data` function because we have set the `obs_idx` coord as mutable. \n",
"\n",
- "Note that the actual value we pass for `obs_idx` is totally irrevelant *in this case*, so we give it a value of 0. What is important is that we update it to have the same length as the ages we want to do out-of-sample prediction for, and that each age has a unique index identifier."
+ "Note that the actual value we pass for `obs_idx` is totally irrelevant *in this case*, so we give it a value of 0. What is important is that we update it to have the same length as the ages we want to do out-of-sample prediction for, and that each age has a unique index identifier."
]
},
{
diff --git a/examples/fundamentals/data_container.myst.md b/examples/fundamentals/data_container.myst.md
index ff0c8bcf5..983733313 100644
--- a/examples/fundamentals/data_container.myst.md
+++ b/examples/fundamentals/data_container.myst.md
@@ -124,7 +124,7 @@ idata.constant_data
Named dimensions are another powerful benefit of working with data containers. Data containers allow users to keep track of dimensions (like dates or cities) and coordinates (such as the actual date times or city names) of multi-dimensional data. Both allow you to specify the dimension names and coordinates of random variables, instead of specifying the shapes of those random variables as numbers. Notice that in the previous probabilistic graphs, all of the nodes `x_data`, `mu`, `obs` and `y_data` were in a box with the number 100. A natural question for a reader to ask is, "100 what?". Dimensions and coordinates help organize models, variables, and data by answering exactly this question.
-In the next example, we generate an artifical dataset of temperatures in 3 cities over 2 months. We will then use named dimensions and coordinates to improve the readability of the model code and the quality of the visualizations.
+In the next example, we generate an artificial dataset of temperatures in 3 cities over 2 months. We will then use named dimensions and coordinates to improve the readability of the model code and the quality of the visualizations.
```{code-cell} ipython3
df_data = pd.DataFrame(columns=["date"]).set_index("date")
@@ -302,7 +302,7 @@ with pm.Model() as logistic_model:
p = pm.Deterministic("p", pm.math.sigmoid(alpha + beta * x_data))
- # Here is were we link the shapes of the inputs (x_data) and the observed varaiable
+ # Here is where we link the shapes of the inputs (x_data) and the observed variable
# It will be the shape we tell it, rather than the (constant!) shape of y_data
obs = pm.Bernoulli("obs", p=p, observed=y_data, shape=x_data.shape[0])
@@ -315,7 +315,7 @@ with pm.Model() as logistic_model:
)
```
-A common post-estimation diagonstic is to look at a posterior predictive plot, using {func}`arviz.plot_ppc`. This shows the distribution of data sampled from your model along side the observed data. If they look similar, we have some evidence that the model isn't so bad.
+A common post-estimation diagnostic is to look at a posterior predictive plot, using {func}`arviz.plot_ppc`. This shows the distribution of data sampled from your model alongside the observed data. If they look similar, we have some evidence that the model isn't so bad.
In this case, however, it can be difficult to interpret a posterior predictive plot. Since we're doing a logistic regression, observed values can only be zero or one. As a result, the posterior predictive graph has this tetris-block shape. What are we to make of it? Evidently our model produces more 1's than 0's, and the mean proportion matches the data. But there's also a lot of uncertainty in that proportion. What else can we say about the model's performance?
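One possible follow-up, and not necessarily what this notebook does next, is a separation plot, which orders observations by their predicted probability and marks the observed 1s; for binary outcomes this is often more informative than `plot_ppc`. A sketch, assuming the sampling results live in `idata` and the observed variable is named `obs`:

```python
az.plot_separation(idata, y="obs", y_hat="obs", figsize=(9, 1));
```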
@@ -438,7 +438,7 @@ At the moment of writing Osvaldo's daughter is two weeks ($\approx 0.5$ months)
-The only problem is that by default this function will return predictions for _Length_ for the observed values of _Month_, and $0.5$ months (the value Osvaldo cares about) has not been observed, -- all measures are reported for integer months. The easier way to get predictions for non-observed values of _Month_ is to pass new values to the `Data` container we defined above in our model. To do that, we need to use `pm.set_data` and then we just have to sample from the posterior predictve distribution. We will also have to set `coords` for these new observations, which we are allowed to do in the `pm.set_data` function because we have set the `obs_idx` coord as mutable.
+The only problem is that by default this function will return predictions for _Length_ for the observed values of _Month_, and $0.5$ months (the value Osvaldo cares about) has not been observed -- all measures are reported for integer months. The easiest way to get predictions for non-observed values of _Month_ is to pass new values to the `Data` container we defined above in our model. To do that, we need to use `pm.set_data` and then we just have to sample from the posterior predictive distribution. We will also have to set `coords` for these new observations, which we are allowed to do in the `pm.set_data` function because we have set the `obs_idx` coord as mutable.
-Note that the actual value we pass for `obs_idx` is totally irrevelant *in this case*, so we give it a value of 0. What is important is that we update it to have the same length as the ages we want to do out-of-sample prediction for, and that each age has a unique index identifier.
+Note that the actual value we pass for `obs_idx` is totally irrelevant *in this case*, so we give it a value of 0. What is important is that we update it to have the same length as the ages we want to do out-of-sample prediction for, and that each age has a unique index identifier.
```{code-cell} ipython3
ages_to_check = [0.5]
diff --git a/examples/gaussian_processes/GP-Births.ipynb b/examples/gaussian_processes/GP-Births.ipynb
index 8b548e733..2560c55ea 100644
--- a/examples/gaussian_processes/GP-Births.ipynb
+++ b/examples/gaussian_processes/GP-Births.ipynb
@@ -9,7 +9,7 @@
"# Baby Births Modelling with HSGPs\n",
"\n",
":::{post} January, 2024\n",
- ":tags: gaussian processes, hilbert space approximation,\n",
+ ":tags: gaussian process, hilbert space approximation,\n",
":category: intermediate, how-to\n",
":author: [Juan Orduz](https://juanitorduz.github.io/)\n",
":::"
diff --git a/examples/gaussian_processes/GP-Births.myst.md b/examples/gaussian_processes/GP-Births.myst.md
index 32490daa0..739129c2c 100644
--- a/examples/gaussian_processes/GP-Births.myst.md
+++ b/examples/gaussian_processes/GP-Births.myst.md
@@ -18,7 +18,7 @@ myst:
# Baby Births Modelling with HSGPs
:::{post} January, 2024
-:tags: gaussian processes, hilbert space approximation,
+:tags: gaussian process, hilbert space approximation,
:category: intermediate, how-to
:author: [Juan Orduz](https://juanitorduz.github.io/)
:::
diff --git a/examples/gaussian_processes/GP-Latent.ipynb b/examples/gaussian_processes/GP-Latent.ipynb
index 69807375d..22554776c 100644
--- a/examples/gaussian_processes/GP-Latent.ipynb
+++ b/examples/gaussian_processes/GP-Latent.ipynb
@@ -8,7 +8,7 @@
"# Gaussian Processes: Latent Variable Implementation\n",
"\n",
":::{post} June 6, 2023\n",
- ":tags: gaussian processes, time series\n",
+ ":tags: gaussian process, time series\n",
":category: reference, intermediate\n",
":author: Bill Engels\n",
":::"
diff --git a/examples/gaussian_processes/GP-Latent.myst.md b/examples/gaussian_processes/GP-Latent.myst.md
index 708e25738..e90160aff 100644
--- a/examples/gaussian_processes/GP-Latent.myst.md
+++ b/examples/gaussian_processes/GP-Latent.myst.md
@@ -17,7 +17,7 @@ myst:
# Gaussian Processes: Latent Variable Implementation
:::{post} June 6, 2023
-:tags: gaussian processes, time series
+:tags: gaussian process, time series
:category: reference, intermediate
:author: Bill Engels
:::
diff --git a/examples/gaussian_processes/GP-Marginal.ipynb b/examples/gaussian_processes/GP-Marginal.ipynb
index 15a5316c0..b803998ec 100644
--- a/examples/gaussian_processes/GP-Marginal.ipynb
+++ b/examples/gaussian_processes/GP-Marginal.ipynb
@@ -9,7 +9,7 @@
"# Marginal Likelihood Implementation\n",
"\n",
":::{post} June 4, 2023\n",
- ":tags: gaussian processes, time series\n",
+ ":tags: gaussian process, time series\n",
":category: reference, intermediate\n",
":author: Bill Engels, Chris Fonnesbeck\n",
":::\n",
diff --git a/examples/gaussian_processes/GP-Marginal.myst.md b/examples/gaussian_processes/GP-Marginal.myst.md
index 59d19c712..c8928d1da 100644
--- a/examples/gaussian_processes/GP-Marginal.myst.md
+++ b/examples/gaussian_processes/GP-Marginal.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Marginal Likelihood Implementation
:::{post} June 4, 2023
-:tags: gaussian processes, time series
+:tags: gaussian process, time series
:category: reference, intermediate
:author: Bill Engels, Chris Fonnesbeck
:::
diff --git a/examples/gaussian_processes/GP-TProcess.ipynb b/examples/gaussian_processes/GP-TProcess.ipynb
index d40d60494..8d1a8cc4d 100644
--- a/examples/gaussian_processes/GP-TProcess.ipynb
+++ b/examples/gaussian_processes/GP-TProcess.ipynb
@@ -8,7 +8,7 @@
"# Student-t Process\n",
"\n",
":::{post} August 2017\n",
- ":tags: t-process, gaussian process, bayesian non-parametrics\n",
+ ":tags: t-process, gaussian process, nonparametric\n",
":category: intermediate\n",
":author: Bill Engels\n",
":::\n",
diff --git a/examples/gaussian_processes/GP-TProcess.myst.md b/examples/gaussian_processes/GP-TProcess.myst.md
index 1128c0995..6c05de49c 100644
--- a/examples/gaussian_processes/GP-TProcess.myst.md
+++ b/examples/gaussian_processes/GP-TProcess.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Student-t Process
:::{post} August 2017
-:tags: t-process, gaussian process, bayesian non-parametrics
+:tags: t-process, gaussian process, nonparametric
:category: intermediate
:author: Bill Engels
:::
diff --git a/examples/gaussian_processes/HSGP-Advanced.ipynb b/examples/gaussian_processes/HSGP-Advanced.ipynb
index 84da0b94c..402cfd04b 100644
--- a/examples/gaussian_processes/HSGP-Advanced.ipynb
+++ b/examples/gaussian_processes/HSGP-Advanced.ipynb
@@ -9,7 +9,7 @@
"# Gaussian Processes: HSGP Advanced Usage\n",
"\n",
":::{post} June 28, 2024\n",
- ":tags: gaussian processes\n",
+ ":tags: gaussian process\n",
":category: reference, intermediate\n",
":author: Bill Engels, Alexandre Andorra, Maxim Kochurov\n",
":::"
@@ -657,7 +657,7 @@
"source": [
"Bingo!\n",
"\n",
- "One last thing we also talked about in the first turorial: increasing `c` requires increasing `m` to compensate for the loss of fidelity at smaller lengthscales. So let's err on the side of safety and choose:"
+ "One last thing we also talked about in the first tutorial: increasing `c` requires increasing `m` to compensate for the loss of fidelity at smaller lengthscales. So let's err on the side of safety and choose:"
]
},
{
@@ -2237,7 +2237,7 @@
"\n",
"- As data become sparse, the **long-term trend is reverting back to the overall GP mean** (i.e 0), but hasn't reached it yet, because the length scale on the trend is bigger than the testing period of 5 (`ell_mu_trend_true = 10`).\n",
"- The **short-term variation on the mean GP isn't obvious** because it's small relative to the trend. But it _is_ noticeable: it creates the small wiggles in the orange HDI, and makes this HDI wider in comparison to the individual GPs (the blue ones).\n",
- "- The **individual GPs revert faster to the mean GP** (orange enveloppe) **than to the GP mean** (i.e 0), which is the behavior we want from the hierarchical structure."
+ "- The **individual GPs revert faster to the mean GP** (orange envelope) **than to the GP mean** (i.e 0), which is the behavior we want from the hierarchical structure."
]
},
{
diff --git a/examples/gaussian_processes/HSGP-Advanced.myst.md b/examples/gaussian_processes/HSGP-Advanced.myst.md
index aca73536e..57864ab2b 100644
--- a/examples/gaussian_processes/HSGP-Advanced.myst.md
+++ b/examples/gaussian_processes/HSGP-Advanced.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Gaussian Processes: HSGP Advanced Usage
:::{post} June 28, 2024
-:tags: gaussian processes
+:tags: gaussian process
:category: reference, intermediate
:author: Bill Engels, Alexandre Andorra, Maxim Kochurov
:::
@@ -410,7 +410,7 @@ pm.gp.hsgp_approx.set_boundary(Xs, 4.0)
Bingo!
-One last thing we also talked about in the first turorial: increasing `c` requires increasing `m` to compensate for the loss of fidelity at smaller lengthscales. So let's err on the side of safety and choose:
+One last thing we also talked about in the first tutorial: increasing `c` requires increasing `m` to compensate for the loss of fidelity at smaller lengthscales. So let's err on the side of safety and choose:
```{code-cell} ipython3
m, c = 100, 4.0
@@ -825,7 +825,7 @@ Phew, that's a lot of information! Let's see what we can make of this:
- As data become sparse, the **long-term trend is reverting back to the overall GP mean** (i.e 0), but hasn't reached it yet, because the length scale on the trend is bigger than the testing period of 5 (`ell_mu_trend_true = 10`).
- The **short-term variation on the mean GP isn't obvious** because it's small relative to the trend. But it _is_ noticeable: it creates the small wiggles in the orange HDI, and makes this HDI wider in comparison to the individual GPs (the blue ones).
-- The **individual GPs revert faster to the mean GP** (orange enveloppe) **than to the GP mean** (i.e 0), which is the behavior we want from the hierarchical structure.
+- The **individual GPs revert faster to the mean GP** (orange envelope) **than to the GP mean** (i.e 0), which is the behavior we want from the hierarchical structure.
+++
diff --git a/examples/gaussian_processes/HSGP-Basic.ipynb b/examples/gaussian_processes/HSGP-Basic.ipynb
index 9328b6e8c..fbc1b7ac4 100644
--- a/examples/gaussian_processes/HSGP-Basic.ipynb
+++ b/examples/gaussian_processes/HSGP-Basic.ipynb
@@ -9,7 +9,7 @@
"# Gaussian Processes: HSGP Reference & First Steps\n",
"\n",
":::{post} June 10, 2024\n",
- ":tags: gaussian processes\n",
+ ":tags: gaussian process\n",
":category: reference, intermediate\n",
":author: Bill Engels, Alexandre Andorra\n",
":::"
diff --git a/examples/gaussian_processes/HSGP-Basic.myst.md b/examples/gaussian_processes/HSGP-Basic.myst.md
index dc1553010..c4b3a3312 100644
--- a/examples/gaussian_processes/HSGP-Basic.myst.md
+++ b/examples/gaussian_processes/HSGP-Basic.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Gaussian Processes: HSGP Reference & First Steps
:::{post} June 10, 2024
-:tags: gaussian processes
+:tags: gaussian process
:category: reference, intermediate
:author: Bill Engels, Alexandre Andorra
:::
diff --git a/examples/generalized_linear_models/GLM-discrete-choice_models.ipynb b/examples/generalized_linear_models/GLM-discrete-choice_models.ipynb
index 41931e47e..b30ba6bf3 100644
--- a/examples/generalized_linear_models/GLM-discrete-choice_models.ipynb
+++ b/examples/generalized_linear_models/GLM-discrete-choice_models.ipynb
@@ -366,7 +366,7 @@
"source": [
"## The Basic Model\n",
"\n",
- "We will show here how to incorporate the utility specifications in PyMC. PyMC is a nice interface for this kind of modelling because it can express the model quite cleanly following the natural mathematical expression for this system of equations. You can see in this simple model how we go about constructing equations for the utility measure of each alternative seperately, and then stacking them together to create the input matrix for our softmax transform. "
+ "We will show here how to incorporate the utility specifications in PyMC. PyMC is a nice interface for this kind of modelling because it can express the model quite cleanly following the natural mathematical expression for this system of equations. You can see in this simple model how we go about constructing equations for the utility measure of each alternative separately, and then stacking them together to create the input matrix for our softmax transform. "
]
},
{
@@ -7887,7 +7887,7 @@
"source": [
"### Compare Models\n",
"\n",
- "We'll now evaluate all three model fits on their predictive performance. Predictive performance on the original data is a good benchmark that the model has appropriately captured the data generating process. But it is not (as we've seen) the only feature of interest in these models. These models are sensetive to our theoretical beliefs about the agents making the decisions, the view of the decision process and the elements of the choice scenario."
+ "We'll now evaluate all three model fits on their predictive performance. Predictive performance on the original data is a good benchmark that the model has appropriately captured the data generating process. But it is not (as we've seen) the only feature of interest in these models. These models are sensitive to our theoretical beliefs about the agents making the decisions, the view of the decision process and the elements of the choice scenario."
]
},
{
diff --git a/examples/generalized_linear_models/GLM-discrete-choice_models.myst.md b/examples/generalized_linear_models/GLM-discrete-choice_models.myst.md
index 18f089a66..6a0107f28 100644
--- a/examples/generalized_linear_models/GLM-discrete-choice_models.myst.md
+++ b/examples/generalized_linear_models/GLM-discrete-choice_models.myst.md
@@ -131,7 +131,7 @@ long_heating_df[long_heating_df["idcase"] == 1][columns]
## The Basic Model
-We will show here how to incorporate the utility specifications in PyMC. PyMC is a nice interface for this kind of modelling because it can express the model quite cleanly following the natural mathematical expression for this system of equations. You can see in this simple model how we go about constructing equations for the utility measure of each alternative seperately, and then stacking them together to create the input matrix for our softmax transform.
+We will show here how to incorporate the utility specifications in PyMC. PyMC is a nice interface for this kind of modelling because it can express the model quite cleanly following the natural mathematical expression for this system of equations. You can see in this simple model how we go about constructing equations for the utility measure of each alternative separately, and then stacking them together to create the input matrix for our softmax transform.
```{code-cell} ipython3
N = wide_heating_df.shape[0]
@@ -497,7 +497,7 @@ Here we can, as expected, see that a rise in the operating costs of the electric
### Compare Models
-We'll now evaluate all three model fits on their predictive performance. Predictive performance on the original data is a good benchmark that the model has appropriately captured the data generating process. But it is not (as we've seen) the only feature of interest in these models. These models are sensetive to our theoretical beliefs about the agents making the decisions, the view of the decision process and the elements of the choice scenario.
+We'll now evaluate all three model fits on their predictive performance. Predictive performance on the original data is a good benchmark that the model has appropriately captured the data generating process. But it is not (as we've seen) the only feature of interest in these models. These models are sensitive to our theoretical beliefs about the agents making the decisions, the view of the decision process and the elements of the choice scenario.
```{code-cell} ipython3
compare = az.compare({"m1": idata_m1, "m2": idata_m2, "m3": idata_m3})
diff --git a/examples/generalized_linear_models/GLM-missing-values-in-covariates.ipynb b/examples/generalized_linear_models/GLM-missing-values-in-covariates.ipynb
index c89ab0f5b..95de83a1e 100644
--- a/examples/generalized_linear_models/GLM-missing-values-in-covariates.ipynb
+++ b/examples/generalized_linear_models/GLM-missing-values-in-covariates.ipynb
@@ -8,7 +8,7 @@
"# GLM-missing-values-in-covariates\n",
"\n",
":::{post} Nov 09, 2024\n",
- ":tags: missing-covariate-values, missing-values, auto-imputation, linear-regression, bayesian-workflow\n",
+ ":tags: missing covariate values, missing values, auto-imputation, linear regression, bayesian workflow\n",
":category: intermediate, reference\n",
":author: Jonathan Sedar\n",
":::\n",
@@ -1021,7 +1021,7 @@
"\n",
"+ We have full control over how many observations we create in `df`, so the ratio of this split doesn't really matter,\n",
" and we'll arrange to have `10` observations in the holdout set\n",
- "+ Eyeball the `count` of non-nulls in the below tabels to ensure we have missing values in both features `c`, `d` in \n",
+ "+ Eyeball the `count` of non-nulls in the below tables to ensure we have missing values in both features `c`, `d` in \n",
" both datasets `train` and `holdout`"
]
},
@@ -1671,7 +1671,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "This is a lightly simplifed copy of the same logic / workflow in $\\S0.3$ above. We won't take up any more space here \n",
+ "This is a lightly simplified copy of the same logic / workflow in $\\S0.3$ above. We won't take up any more space here \n",
"with EDA, the only difference is `c` and `d` are now complete"
]
},
@@ -5508,7 +5508,7 @@
"source": [
"**NOTE** \n",
"\n",
- "+ Avoid changing `ida`, instead take a deepcopy `ida_h` , remove uneccessary groups, and we'll use that\n",
+ "+ Avoid changing `ida`, instead take a deepcopy `ida_h`, remove unnecessary groups, and we'll use that\n",
"+ We won't create a bare `az.InferenceData` then add groups, because we have to add all sorts of additional subtle \n",
" info to the object. Easier to copy and remove groups\n",
"+ The `xarray` indexing in `posterior` will be wrong (set according to `dfx_train`, rather than `dfx_holdout`),\n",
@@ -9517,7 +9517,7 @@
" conditional on anything else, so we get pretty much all the same predicted value\n",
"+ This should drive home the understanding that while technically this model **can** handle new missing values,\n",
" and **does** auto-impute values for missing data in an out-of-sample dataset (here `dfx_holdout`), these auto-imputed\n",
- " values for `xk_unobserved` **can't** be any more informative than the posterior distribution of the hierachical \n",
+ " values for `xk_unobserved` **can't** be any more informative than the posterior distribution of the hierarchical \n",
" prior `xk_mu`."
]
},
diff --git a/examples/generalized_linear_models/GLM-missing-values-in-covariates.myst.md b/examples/generalized_linear_models/GLM-missing-values-in-covariates.myst.md
index dfcfa84ba..e69962b3a 100644
--- a/examples/generalized_linear_models/GLM-missing-values-in-covariates.myst.md
+++ b/examples/generalized_linear_models/GLM-missing-values-in-covariates.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# GLM-missing-values-in-covariates
:::{post} Nov 09, 2024
-:tags: missing-covariate-values, missing-values, auto-imputation, linear-regression, bayesian-workflow
+:tags: missing covariate values, missing values, auto-imputation, linear regression, bayesian workflow
:category: intermediate, reference
:author: Jonathan Sedar
:::
@@ -367,7 +367,7 @@ _ = g.fig.tight_layout()
+ We have full control over how many observations we create in `df`, so the ratio of this split doesn't really matter,
and we'll arrange to have `10` observations in the holdout set
-+ Eyeball the `count` of non-nulls in the below tabels to ensure we have missing values in both features `c`, `d` in
++ Eyeball the `count` of non-nulls in the below tables to ensure we have missing values in both features `c`, `d` in
both datasets `train` and `holdout`
```{code-cell} ipython3
@@ -459,7 +459,7 @@ where:
+++
-This is a lightly simplifed copy of the same logic / workflow in $\S0.3$ above. We won't take up any more space here
+This is a lightly simplified copy of the same logic / workflow in $\S0.3$ above. We won't take up any more space here
with EDA, the only difference is `c` and `d` are now complete
+++
@@ -1371,7 +1371,7 @@ mdla_h.debug(fn="random", verbose=True)
**NOTE**
-+ Avoid changing `ida`, instead take a deepcopy `ida_h` , remove uneccessary groups, and we'll use that
++ Avoid changing `ida`, instead take a deepcopy `ida_h`, remove unnecessary groups, and we'll use that
+ We won't create a bare `az.InferenceData` then add groups, because we have to add all sorts of additional subtle
info to the object. Easier to copy and remove groups
+ The `xarray` indexing in `posterior` will be wrong (set according to `dfx_train`, rather than `dfx_holdout`),
@@ -1517,7 +1517,7 @@ _ = plot_xkhat_vs_xk(df_h_xk_unobs, mdlnm="mdla", in_samp=False)
conditional on anything else, so we get pretty much all the same predicted value
+ This should drive home the understanding that while technically this model **can** handle new missing values,
and **does** auto-impute values for missing data in an out-of-sample dataset (here `dfx_holdout`), these auto-imputed
- values for `xk_unobserved` **can't** be any more informative than the posterior distribution of the hierachical
+ values for `xk_unobserved` **can't** be any more informative than the posterior distribution of the hierarchical
prior `xk_mu`.
+++
diff --git a/examples/generalized_linear_models/GLM-ordinal-features.ipynb b/examples/generalized_linear_models/GLM-ordinal-features.ipynb
index 532eab206..4cbf1bf07 100644
--- a/examples/generalized_linear_models/GLM-ordinal-features.ipynb
+++ b/examples/generalized_linear_models/GLM-ordinal-features.ipynb
@@ -10,7 +10,7 @@
"# GLM-ordinal-features\n",
"\n",
":::{post} Oct 27, 2024\n",
- ":tags: ordinal-features, ordinal-regression, glm, bayesian-workflow, r-datasets\n",
+ ":tags: ordinal features, ordinal regression, generalized linear model, bayesian workflow, r-datasets\n",
":category: intermediate, reference\n",
":author: Jonathan Sedar\n",
":::"
@@ -84,7 +84,7 @@
"+ As a totally subjective opinion which can be different between observations (e.g. `[\"bad\", \"medium\", \"good\", \"better\",\n",
" \"way better\", \"best\", \"actually the best\"]`) - these are difficult to work with and a symptom of poor data design\n",
"+ On a subjective but standardized scale (e.g. `[\"strongly disagree\", \"disagree\", \"agree\", \"strongly agree\"]`) \n",
- " this is the approach of the familar [Likert scale](https://en.wikipedia.org/wiki/Likert_scale)\n",
+ " this is the approach of the familiar [Likert scale](https://en.wikipedia.org/wiki/Likert_scale)\n",
"+ As a summary binning of a real objective value on a metric scale (e.g. binning ages into age-groups \n",
" `[\"<30\", \"30 to 60\", \"60+\"]`), or a subjective value that's been mapped to a metric scale (e.g. medical health\n",
" self-scoring `[\"0-10%\", ..., \"90-100%\"]`) - these are typically a misuse of the metric because the data has been\n",
@@ -102,7 +102,7 @@
"infinite degrees of freedom - so we lose ordering / rank information), or as a numeric coefficient (which ignores the \n",
"unequal spacing, non-linear response). Both are poor choices and have subtly negative effects on the model performance.\n",
"\n",
- "A final nuance is that we might not see the occurence of all valid categorial ordinal levels in the training dataset. \n",
+ "A final nuance is that we might not see the occurrence of all valid categorical ordinal levels in the training dataset. \n",
"For example we might know a range is measured `[\"c0\", \"c1\", \"c2\", \"c3\"]` but only see `[\"c0\", \"c1\", \"c3\"]`. This is a \n",
"missing data problem which could further encourage the misuse of a numeric coefficient to average or \"interpolate\" a\n",
"value. What we should do is incorporate our knowledge of the data domain into the model structure to autoimpute a\n",
@@ -270,7 +270,7 @@
"value. What we should do is incorporate our knowledge of the data domain into the model structure to auto-impute a\n",
"coefficient value. This means that our model can make predictions on new data where a `d450=4` value might be seen.\n",
"\n",
- "** _Just for completness (but not needed for this notebook) that study is reported in \n",
+ "** _Just for completeness (but not needed for this notebook) that study is reported in \n",
"Gertheiss, J., Hogger, S., Oberhauser, C., & Tutz, G. (2011). Selection of ordinally\n",
"784 scaled independent variables with applications to international classification of functioning\n",
"785 core sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60 (3),\n",
@@ -296,7 +296,7 @@
"id": "pwNFAyKzJagB"
},
"source": [
- "Annoyingly but not suprisingly for an R project, despite being a small, simple table, the dataset is only available in \n",
+ "Annoyingly but not surprisingly for an R project, despite being a small, simple table, the dataset is only available in \n",
"an obscure R binary format, and tarred, so we'll download, unpack and store locally as a normal CSV file.\n",
"This uses the rather helpful [`pyreadr`](https://github.com/ofajardo/pyreadr) package."
]
@@ -1507,7 +1507,7 @@
"source": [
"**Observe:**\n",
"\n",
- "+ `phcs` is a subjective scored measure of physical healt, see {cite:p}`burkner2018` for details\n",
+ "+ `phcs` is a subjectively scored measure of physical health, see {cite:p}`burkner2018` for details\n",
"+ Seems well-behaved, unimodal, smooth"
]
},
@@ -3083,7 +3083,7 @@
},
"source": [
"Just for completeness, just compare to Figure 3 in the Bürkner paper and Rochford's\n",
- "blogpost. Those plots summarize to a mean though, which seems unneccesary - let's\n",
+ "blogpost. Those plots summarize to a mean though, which seems unnecessary - let's\n",
"improve it a little with full sample posteriors"
]
},
@@ -3469,7 +3469,7 @@
"+ Notably:\n",
" + $\\mathbb{x}_{i,d450}$ is treated as an ordinal feature and used to index $\\nu_{d450}[x_{i,d450}]$\n",
" + $\\mathbb{x}_{i,d455}$ is treated as an ordinal feature and used to index $\\nu_{d455}[x_{i,d455}]$\n",
- "+ NOTE: The above spec is not particuarly optimised / vectorised / DRY to aid explanation"
+ "+ NOTE: The above spec is not particularly optimised / vectorised / DRY to aid explanation"
]
},
{
@@ -3516,7 +3516,7 @@
"outputs": [],
"source": [
"with pm.Model(coords=COORDS) as mdlb:\n",
- " # NOTE: Spec not particuarly optimised / vectorised / DRY to aid explanation\n",
+ " # NOTE: Spec not particularly optimised / vectorised / DRY to aid explanation\n",
"\n",
" # 0. create (Mutable)Data containers for obs (Y, X)\n",
" y = pm.Data(\"y\", dfx[ft_y].values, dims=\"oid\") # (i, )\n",
@@ -3530,7 +3530,7 @@
"\n",
" # 2. define nu\n",
" def _get_nu(nm, dim):\n",
- " \"\"\"Partition continous prior into ordinal chunks\"\"\"\n",
+ " \"\"\"Partition continuous prior into ordinal chunks\"\"\"\n",
" b0 = pm.Normal(f\"beta_{nm}\", mu=0, sigma=b_s) # (1, )\n",
" c0 = pm.Dirichlet(f\"chi_{nm}\", a=np.ones(len(COORDS[dim])), dims=dim) # (lvls, )\n",
" return pm.Deterministic(f\"nu_{nm}\", b0 * c0.cumsum(), dims=dim) # (lvls, )\n",
@@ -5173,7 +5173,7 @@
"Just for completeness, just compare to Figure 3 in the Bürkner paper and Rochford's\n",
"blogpost.\n",
"\n",
- "Those plots summarize to a mean though, which seems unneccesary - let's improve it a little."
+ "Those plots summarize to a mean though, which seems unnecessary - let's improve it a little."
]
},
{
diff --git a/examples/generalized_linear_models/GLM-ordinal-features.myst.md b/examples/generalized_linear_models/GLM-ordinal-features.myst.md
index 92417b5a3..611555b31 100644
--- a/examples/generalized_linear_models/GLM-ordinal-features.myst.md
+++ b/examples/generalized_linear_models/GLM-ordinal-features.myst.md
@@ -16,7 +16,7 @@ kernelspec:
# GLM-ordinal-features
:::{post} Oct 27, 2024
-:tags: ordinal-features, ordinal-regression, glm, bayesian-workflow, r-datasets
+:tags: ordinal features, ordinal regression, generalized linear model, bayesian workflow, r-datasets
:category: intermediate, reference
:author: Jonathan Sedar
:::
@@ -75,7 +75,7 @@ preference or summarizing a metric value, and is particularly common in insuranc
+ As a totally subjective opinion which can be different between observations (e.g. `["bad", "medium", "good", "better",
"way better", "best", "actually the best"]`) - these are difficult to work with and a symptom of poor data design
+ On a subjective but standardized scale (e.g. `["strongly disagree", "disagree", "agree", "strongly agree"]`)
- this is the approach of the familar [Likert scale](https://en.wikipedia.org/wiki/Likert_scale)
+ this is the approach of the familiar [Likert scale](https://en.wikipedia.org/wiki/Likert_scale)
+ As a summary binning of a real objective value on a metric scale (e.g. binning ages into age-groups
`["<30", "30 to 60", "60+"]`), or a subjective value that's been mapped to a metric scale (e.g. medical health
self-scoring `["0-10%", ..., "90-100%"]`) - these are typically a misuse of the metric because the data has been
@@ -93,7 +93,7 @@ These properties can unfortunately encourage modellers to incorporate ordinal fe
infinite degrees of freedom - so we lose ordering / rank information), or as a numeric coefficient (which ignores the
unequal spacing, non-linear response). Both are poor choices and have subtly negative effects on the model performance.
-A final nuance is that we might not see the occurence of all valid categorial ordinal levels in the training dataset.
+A final nuance is that we might not see the occurrence of all valid categorical ordinal levels in the training dataset. 
For example we might know a range is measured `["c0", "c1", "c2", "c3"]` but only see `["c0", "c1", "c3"]`. This is a
missing data problem which could further encourage the misuse of a numeric coefficient to average or "interpolate" a
value. What we should do is incorporate our knowledge of the data domain into the model structure to autoimpute a
@@ -218,7 +218,7 @@ missing data problem which could further encourage the misuse of a numeric coeff
value. What we should do is incorporate our knowledge of the data domain into the model structure to auto-impute a
coefficient value. This means that our model can make predictions on new data where a `d450=4` value might be seen.
-** _Just for completness (but not needed for this notebook) that study is reported in
+** _Just for completeness (but not needed for this notebook) that study is reported in
Gertheiss, J., Hogger, S., Oberhauser, C., & Tutz, G. (2011). Selection of ordinally
784 scaled independent variables with applications to international classification of functioning
785 core sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60 (3),
@@ -234,7 +234,7 @@ more generally useful
+++ {"id": "pwNFAyKzJagB"}
-Annoyingly but not suprisingly for an R project, despite being a small, simple table, the dataset is only available in
+Annoyingly but not surprisingly for an R project, despite being a small, simple table, the dataset is only available in
an obscure R binary format, and tarred, so we'll download, unpack and store locally as a normal CSV file.
This uses the rather helpful [`pyreadr`](https://github.com/ofajardo/pyreadr) package.
@@ -446,7 +446,7 @@ _ = f.tight_layout()
**Observe:**
-+ `phcs` is a subjective scored measure of physical healt, see {cite:p}`burkner2018` for details
++ `phcs` is a subjectively scored measure of physical health, see {cite:p}`burkner2018` for details
+ Seems well-behaved, unimodal, smooth
+++ {"id": "p43qjcvJJagH"}
@@ -966,7 +966,7 @@ f = plot_posterior(ida, "posterior", rvs=RVS_SIMPLE_COMMON, mdlname="mdla", n=5,
+++ {"id": "yrrzYjmhJagK"}
Just for completeness, just compare to Figure 3 in the Bürkner paper and Rochford's
-blogpost. Those plots summarize to a mean though, which seems unneccesary - let's
+blogpost. Those plots summarize to a mean though, which seems unnecessary - let's
improve it a little with full sample posteriors
+++ {"id": "X4XB1eiwJagK"}
@@ -1114,7 +1114,7 @@ where:
+ Notably:
+ $\mathbb{x}_{i,d450}$ is treated as an ordinal feature and used to index $\nu_{d450}[x_{i,d450}]$
+ $\mathbb{x}_{i,d455}$ is treated as an ordinal feature and used to index $\nu_{d455}[x_{i,d455}]$
-+ NOTE: The above spec is not particuarly optimised / vectorised / DRY to aid explanation
++ NOTE: The above spec is not particularly optimised / vectorised / DRY to aid explanation
+++ {"id": "F47aQhT2JagK"}
@@ -1145,7 +1145,7 @@ id: ZyP0P29AJagK
outputId: 2f5e3717-7549-43d0-a334-7875b3871dcd
---
with pm.Model(coords=COORDS) as mdlb:
- # NOTE: Spec not particuarly optimised / vectorised / DRY to aid explanation
+ # NOTE: Spec not particularly optimised / vectorised / DRY to aid explanation
# 0. create (Mutable)Data containers for obs (Y, X)
y = pm.Data("y", dfx[ft_y].values, dims="oid") # (i, )
@@ -1159,7 +1159,7 @@ with pm.Model(coords=COORDS) as mdlb:
# 2. define nu
def _get_nu(nm, dim):
- """Partition continous prior into ordinal chunks"""
+ """Partition continuous prior into ordinal chunks"""
b0 = pm.Normal(f"beta_{nm}", mu=0, sigma=b_s) # (1, )
c0 = pm.Dirichlet(f"chi_{nm}", a=np.ones(len(COORDS[dim])), dims=dim) # (lvls, )
return pm.Deterministic(f"nu_{nm}", b0 * c0.cumsum(), dims=dim) # (lvls, )
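The `_get_nu` helper above relies on a simple trick: a Dirichlet draw is a vector of positive shares that sum to one, so its cumulative sum is monotonically increasing, and scaling it by a single coefficient yields an ordered set of level effects. A minimal self-contained sketch of that construction (the number of levels and all names here are illustrative, not the notebook's):

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

K = 5  # number of ordinal levels (illustrative)
with pm.Model(coords={"level": np.arange(K)}):
    beta = pm.Normal("beta", mu=0, sigma=1)                # overall effect size
    chi = pm.Dirichlet("chi", a=np.ones(K), dims="level")  # positive shares, sum to 1
    # cumulative shares are monotone in [0, 1], so beta * cumsum(chi) gives an
    # ordered set of per-level effects
    nu = pm.Deterministic("nu", beta * pt.cumsum(chi), dims="level")
    prior = pm.sample_prior_predictive()
```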
@@ -1528,7 +1528,7 @@ Here we see the same patterns in more detail, in particular:
Just for completeness, just compare to Figure 3 in the Bürkner paper and Rochford's
blogpost.
-Those plots summarize to a mean though, which seems unneccesary - let's improve it a little.
+Those plots summarize to a mean though, which seems unnecessary - let's improve it a little.
+++ {"id": "b09YNkSkJagM"}
diff --git a/examples/generalized_linear_models/GLM-ordinal-regression.ipynb b/examples/generalized_linear_models/GLM-ordinal-regression.ipynb
index f0b8be3ca..516960352 100644
--- a/examples/generalized_linear_models/GLM-ordinal-regression.ipynb
+++ b/examples/generalized_linear_models/GLM-ordinal-regression.ipynb
@@ -350,7 +350,7 @@
"\n",
"$$ P(Y = j) = \\frac{exp(\\alpha_{j} + \\beta'x)}{1 + exp(\\alpha_{j} + \\beta'x)} - \\frac{exp(\\alpha_{j-1} + \\beta'x)}{1 + exp(\\alpha_{j-1} + \\beta'x)} $$\n",
"\n",
- "One nice feature of ordinal regressions specified in this fashion is that the interpretation of the coefficients on the beta terms remain the same across each interval on the latent space. The interpretaiton of the model parameters is typical: a unit increase in $x_{k}$ corresponds to an increase in $Y_{latent}$ of $\\beta_{k}$ Similar interpretation holds for probit regression specification too. However we must be careful about comparing the interpretation of coefficients across different model specifications with different variables. The above coefficient interpretation makes sense as conditional interpretation based on holding fixed precisely the variables in the model. Adding or removing variables changes the conditionalisation which breaks the comparability of the models due the phenomena of non-collapsability. We'll show below how it's better to compare the models on their predictive implications using the posterior predictive distribution. \n",
+ "One nice feature of ordinal regressions specified in this fashion is that the interpretation of the coefficients on the beta terms remains the same across each interval on the latent space. The interpretation of the model parameters is typical: a unit increase in $x_{k}$ corresponds to an increase in $Y_{latent}$ of $\\beta_{k}$. A similar interpretation holds for the probit regression specification too. However, we must be careful about comparing the interpretation of coefficients across different model specifications with different variables. The above coefficient interpretation makes sense as a conditional interpretation based on holding fixed precisely the variables in the model. Adding or removing variables changes the conditionalisation, which breaks the comparability of the models due to the phenomenon of non-collapsibility. We'll show below how it's better to compare the models on their predictive implications using the posterior predictive distribution. \n",
"\n",
"### Bayesian Particularities \n",
"\n",
@@ -387,7 +387,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The above function, (brainchild of Dr Ben Vincent and Adrian Seyboldt), looks a little indimidating, but it's just a convenience function to specify a prior over the cutpoints in our $Y_{latent}$. The Dirichlet distribution is special in that draws from the distribution must sum to one. The above function ensures that each draw from the prior distribution is a cumulative share of the maximum category greater than the minimum of our ordinal categorisation. "
+ "The above function, (brainchild of Dr Ben Vincent and Adrian Seyboldt), looks a little intimidating, but it's just a convenience function to specify a prior over the cutpoints in our $Y_{latent}$. The Dirichlet distribution is special in that draws from the distribution must sum to one. The above function ensures that each draw from the prior distribution is a cumulative share of the maximum category greater than the minimum of our ordinal categorisation. "
]
},
{
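The constrained cutpoint prior described above can be sketched as follows. This is not necessarily the notebook's exact `constrainedUniform` implementation, just one way to realise the idea: cumulative shares of a Dirichlet draw give strictly increasing cutpoints inside a bounded interval (names and data here are illustrative):

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt


def constrained_cutpoints(K, low=0.0, high=1.0):
    """K-1 strictly increasing cutpoints inside (low, high) for K categories."""
    shares = pm.Dirichlet("cut_shares", a=np.ones(K))  # K positive shares summing to 1
    return pm.Deterministic("cutpoints", low + (high - low) * pt.cumsum(shares)[:-1])


K = 10  # e.g. ratings coded 0..9 (illustrative)
y = np.random.default_rng(0).integers(0, K, size=200)  # synthetic stand-in data

with pm.Model():
    cutpoints = constrained_cutpoints(K, low=0.0, high=K - 1.0)
    pm.OrderedLogistic("y", eta=0.0, cutpoints=cutpoints, observed=y)
```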
@@ -1231,7 +1231,7 @@
"source": [
"### Extracting Individual Probabilities \n",
"\n",
- "We can now for each individual manager's rating, look at the probability associated with each of the available categories. Across the posterior distributions of our cuts which section of the latent continous measure the employee is most likely to fall into."
+ "We can now for each individual manager's rating, look at the probability associated with each of the available categories. Across the posterior distributions of our cuts which section of the latent continuous measure the employee is most likely to fall into."
]
},
{
@@ -2238,7 +2238,7 @@
"source": [
"## Compare Cutpoints: Normal versus Uniform Priors\n",
"\n",
- "Note how the model with unconstrianed cutpoints allows the occurence of a threshold estimated to be below zero. This does not make much conceptual sense, but can lead to a plausible enough posterior predictive distribution."
+ "Note how the model with unconstrained cutpoints allows the occurrence of a threshold estimated to be below zero. This does not make much conceptual sense, but can lead to a plausible enough posterior predictive distribution."
]
},
{
diff --git a/examples/generalized_linear_models/GLM-ordinal-regression.myst.md b/examples/generalized_linear_models/GLM-ordinal-regression.myst.md
index 2f58be878..5ef7ff253 100644
--- a/examples/generalized_linear_models/GLM-ordinal-regression.myst.md
+++ b/examples/generalized_linear_models/GLM-ordinal-regression.myst.md
@@ -168,7 +168,7 @@ and that the probability for belonging within a particular category $j$ is deter
$$ P(Y = j) = \frac{exp(\alpha_{j} + \beta'x)}{1 + exp(\alpha_{j} + \beta'x)} - \frac{exp(\alpha_{j-1} + \beta'x)}{1 + exp(\alpha_{j-1} + \beta'x)} $$
-One nice feature of ordinal regressions specified in this fashion is that the interpretation of the coefficients on the beta terms remain the same across each interval on the latent space. The interpretaiton of the model parameters is typical: a unit increase in $x_{k}$ corresponds to an increase in $Y_{latent}$ of $\beta_{k}$ Similar interpretation holds for probit regression specification too. However we must be careful about comparing the interpretation of coefficients across different model specifications with different variables. The above coefficient interpretation makes sense as conditional interpretation based on holding fixed precisely the variables in the model. Adding or removing variables changes the conditionalisation which breaks the comparability of the models due the phenomena of non-collapsability. We'll show below how it's better to compare the models on their predictive implications using the posterior predictive distribution.
+One nice feature of ordinal regressions specified in this fashion is that the interpretation of the coefficients on the beta terms remains the same across each interval on the latent space. The interpretation of the model parameters is typical: a unit increase in $x_{k}$ corresponds to an increase in $Y_{latent}$ of $\beta_{k}$. A similar interpretation holds for the probit regression specification too. However, we must be careful about comparing the interpretation of coefficients across different model specifications with different variables. The above coefficient interpretation makes sense as a conditional interpretation based on holding fixed precisely the variables in the model. Adding or removing variables changes the conditionalisation, which breaks the comparability of the models due to the phenomenon of non-collapsibility. We'll show below how it's better to compare the models on their predictive implications using the posterior predictive distribution.
### Bayesian Particularities
@@ -192,7 +192,7 @@ def constrainedUniform(N, min=0, max=1):
)
```
-The above function, (brainchild of Dr Ben Vincent and Adrian Seyboldt), looks a little indimidating, but it's just a convenience function to specify a prior over the cutpoints in our $Y_{latent}$. The Dirichlet distribution is special in that draws from the distribution must sum to one. The above function ensures that each draw from the prior distribution is a cumulative share of the maximum category greater than the minimum of our ordinal categorisation.
+The above function, (brainchild of Dr Ben Vincent and Adrian Seyboldt), looks a little intimidating, but it's just a convenience function to specify a prior over the cutpoints in our $Y_{latent}$. The Dirichlet distribution is special in that draws from the distribution must sum to one. The above function ensures that each draw from the prior distribution is a cumulative share of the maximum category greater than the minimum of our ordinal categorisation.
```{code-cell} ipython3
:tags: [hide-output]
@@ -248,7 +248,7 @@ pm.model_to_graphviz(model3)
### Extracting Individual Probabilities
-We can now for each individual manager's rating, look at the probability associated with each of the available categories. Across the posterior distributions of our cuts which section of the latent continous measure the employee is most likely to fall into.
+We can now for each individual manager's rating, look at the probability associated with each of the available categories. Across the posterior distributions of our cuts which section of the latent continuous measure the employee is most likely to fall into.
```{code-cell} ipython3
implied_probs = az.extract(idata3, var_names=["y_probs"])
@@ -334,7 +334,7 @@ az.summary(idata3, var_names=["cutpoints", "beta", "sigma"])
## Compare Cutpoints: Normal versus Uniform Priors
-Note how the model with unconstrianed cutpoints allows the occurence of a threshold estimated to be below zero. This does not make much conceptual sense, but can lead to a plausible enough posterior predictive distribution.
+Note how the model with unconstrained cutpoints allows the occurrence of a threshold estimated to be below zero. This does not make much conceptual sense, but can lead to a plausible enough posterior predictive distribution.
```{code-cell} ipython3
def plot_fit(idata):
diff --git a/examples/howto/LKJ.ipynb b/examples/howto/LKJ.ipynb
index f7725d4ce..c62f9906b 100644
--- a/examples/howto/LKJ.ipynb
+++ b/examples/howto/LKJ.ipynb
@@ -160,7 +160,7 @@
"\n",
"$$f(\\mathbf{x}\\ |\\ \\mu, \\Sigma^{-1}) = (2 \\pi)^{-\\frac{k}{2}} |\\Sigma|^{-\\frac{1}{2}} \\exp\\left(-\\frac{1}{2} (\\mathbf{x} - \\mu)^{\\top} \\Sigma^{-1} (\\mathbf{x} - \\mu)\\right).$$\n",
"\n",
- "The LKJ distribution provides a prior on the correlation matrix, $\\mathbf{C} = \\textrm{Corr}(x_i, x_j)$, which, combined with priors on the standard deviations of each component, [induces](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf) a prior on the covariance matrix, $\\Sigma$. Since inverting $\\Sigma$ is numerically unstable and inefficient, it is computationally advantageous to use the [Cholesky decompositon](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $\\Sigma$, $\\Sigma = \\mathbf{L} \\mathbf{L}^{\\top}$, where $\\mathbf{L}$ is a lower-triangular matrix. This decompositon allows computation of the term $(\\mathbf{x} - \\mu)^{\\top} \\Sigma^{-1} (\\mathbf{x} - \\mu)$ using back-substitution, which is more numerically stable and efficient than direct matrix inversion.\n",
+ "The LKJ distribution provides a prior on the correlation matrix, $\\mathbf{C} = \\textrm{Corr}(x_i, x_j)$, which, combined with priors on the standard deviations of each component, [induces](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf) a prior on the covariance matrix, $\\Sigma$. Since inverting $\\Sigma$ is numerically unstable and inefficient, it is computationally advantageous to use the [Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $\\Sigma$, $\\Sigma = \\mathbf{L} \\mathbf{L}^{\\top}$, where $\\mathbf{L}$ is a lower-triangular matrix. This decomposition allows computation of the term $(\\mathbf{x} - \\mu)^{\\top} \\Sigma^{-1} (\\mathbf{x} - \\mu)$ using back-substitution, which is more numerically stable and efficient than direct matrix inversion.\n",
"\n",
"PyMC supports LKJ priors for the Cholesky decomposition of the covariance matrix via the {class}`pymc.LKJCholeskyCov` distribution. This distribution has parameters `n` and `sd_dist`, which are the dimension of the observations, $\\mathbf{x}$, and the PyMC distribution of the component standard deviations, respectively. It also has a hyperparamter `eta`, which controls the amount of correlation between components of $\\mathbf{x}$. The LKJ distribution has the density $f(\\mathbf{C}\\ |\\ \\eta) \\propto |\\mathbf{C}|^{\\eta - 1}$, so $\\eta = 1$ leads to a uniform distribution on correlation matrices, while the magnitude of correlations between components decreases as $\\eta \\to \\infty$.\n",
"\n",
@@ -187,7 +187,7 @@
"id": "6Cscu-CRr2Pe"
},
"source": [
- "Since the Cholesky decompositon of $\\Sigma$ is lower triangular, `LKJCholeskyCov` only stores the diagonal and sub-diagonal entries, for efficiency:"
+ "Since the Cholesky decomposition of $\\Sigma$ is lower triangular, `LKJCholeskyCov` only stores the diagonal and sub-diagonal entries, for efficiency:"
]
},
{
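As a compact illustration of the API described above, here is a minimal model using `pm.LKJCholeskyCov` on synthetic data (assuming a recent PyMC where `compute_corr=True` returns the Cholesky factor, correlations, and standard deviations; the data and names are stand-ins):

```python
import numpy as np
import pymc as pm

# Synthetic 3-dimensional observations standing in for real data.
rng = np.random.default_rng(0)
x_obs = rng.normal(size=(100, 3))

with pm.Model():
    # LKJ prior on the Cholesky factor of the covariance matrix
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=3, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    mu = pm.Normal("mu", 0.0, 1.0, shape=3)
    pm.MvNormal("obs", mu=mu, chol=chol, observed=x_obs)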
diff --git a/examples/howto/LKJ.myst.md b/examples/howto/LKJ.myst.md
index 826e5091d..7eb95038b 100644
--- a/examples/howto/LKJ.myst.md
+++ b/examples/howto/LKJ.myst.md
@@ -100,7 +100,7 @@ The sampling distribution for the multivariate normal model is $\mathbf{x} \sim
$$f(\mathbf{x}\ |\ \mu, \Sigma^{-1}) = (2 \pi)^{-\frac{k}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mu)^{\top} \Sigma^{-1} (\mathbf{x} - \mu)\right).$$
-The LKJ distribution provides a prior on the correlation matrix, $\mathbf{C} = \textrm{Corr}(x_i, x_j)$, which, combined with priors on the standard deviations of each component, [induces](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf) a prior on the covariance matrix, $\Sigma$. Since inverting $\Sigma$ is numerically unstable and inefficient, it is computationally advantageous to use the [Cholesky decompositon](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $\Sigma$, $\Sigma = \mathbf{L} \mathbf{L}^{\top}$, where $\mathbf{L}$ is a lower-triangular matrix. This decompositon allows computation of the term $(\mathbf{x} - \mu)^{\top} \Sigma^{-1} (\mathbf{x} - \mu)$ using back-substitution, which is more numerically stable and efficient than direct matrix inversion.
+The LKJ distribution provides a prior on the correlation matrix, $\mathbf{C} = \textrm{Corr}(x_i, x_j)$, which, combined with priors on the standard deviations of each component, [induces](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf) a prior on the covariance matrix, $\Sigma$. Since inverting $\Sigma$ is numerically unstable and inefficient, it is computationally advantageous to use the [Cholesky decomposition](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $\Sigma$, $\Sigma = \mathbf{L} \mathbf{L}^{\top}$, where $\mathbf{L}$ is a lower-triangular matrix. This decomposition allows computation of the term $(\mathbf{x} - \mu)^{\top} \Sigma^{-1} (\mathbf{x} - \mu)$ using back-substitution, which is more numerically stable and efficient than direct matrix inversion.
PyMC supports LKJ priors for the Cholesky decomposition of the covariance matrix via the {class}`pymc.LKJCholeskyCov` distribution. This distribution has parameters `n` and `sd_dist`, which are the dimension of the observations, $\mathbf{x}$, and the PyMC distribution of the component standard deviations, respectively. It also has a hyperparamter `eta`, which controls the amount of correlation between components of $\mathbf{x}$. The LKJ distribution has the density $f(\mathbf{C}\ |\ \eta) \propto |\mathbf{C}|^{\eta - 1}$, so $\eta = 1$ leads to a uniform distribution on correlation matrices, while the magnitude of correlations between components decreases as $\eta \to \infty$.
@@ -117,7 +117,7 @@ with pm.Model() as m:
+++ {"id": "6Cscu-CRr2Pe"}
-Since the Cholesky decompositon of $\Sigma$ is lower triangular, `LKJCholeskyCov` only stores the diagonal and sub-diagonal entries, for efficiency:
+Since the Cholesky decomposition of $\Sigma$ is lower triangular, `LKJCholeskyCov` only stores the diagonal and sub-diagonal entries, for efficiency:
```{code-cell} ipython3
---
diff --git a/examples/howto/Missing_Data_Imputation.ipynb b/examples/howto/Missing_Data_Imputation.ipynb
index 06f417dd1..e2db8dc4f 100644
--- a/examples/howto/Missing_Data_Imputation.ipynb
+++ b/examples/howto/Missing_Data_Imputation.ipynb
@@ -843,7 +843,7 @@
"source": [
"### Bootstrapping Sensitivity Analysis\n",
"\n",
- "We may also want to validate the estimated parameters against bootstrapped samples under different speficiations of missing-ness. "
+ "We may also want to validate the estimated parameters against bootstrapped samples under different specifications of missing-ness. "
]
},
{
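One way to carry out such a bootstrap check is sketched below; the dataframe and column names are synthetic stand-ins rather than the notebook's employee data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in data with roughly 20% of `lmx` missing.
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["worksat", "empower", "lmx"])
df.loc[rng.random(500) < 0.2, "lmx"] = np.nan

boot_corrs = []
for _ in range(500):
    boot = df.sample(frac=1.0, replace=True)  # resample rows with replacement
    boot_corrs.append(boot[["empower", "lmx"]].dropna().corr().iloc[0, 1])

print(np.percentile(boot_corrs, [2.5, 50, 97.5]))  # stability of the estimate
```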
@@ -7655,7 +7655,7 @@
"source": [
"### Process the Posterior Predictive Distribution\n",
"\n",
- "Above we estimated a number of likelihood terms in a single PyMC model context. These likelihoods constrained the hyper-parameters which determined the imputation values of the missing terms in the variables used as predictors in our focal regression equation for `empower`. But we could also perform a more manual sequential imputation, where we model each of the subordinate regression equations seperately and extract the imputed values for each variable in turn and then run a simple regression on the imputed values for the focal regression equation. \n",
+ "Above we estimated a number of likelihood terms in a single PyMC model context. These likelihoods constrained the hyper-parameters which determined the imputation values of the missing terms in the variables used as predictors in our focal regression equation for `empower`. But we could also perform a more manual sequential imputation, where we model each of the subordinate regression equations separately and extract the imputed values for each variable in turn and then run a simple regression on the imputed values for the focal regression equation. \n",
"\n",
"We show here how to extract the imputed values for each of the regression equations and augment the observed data."
]
diff --git a/examples/howto/Missing_Data_Imputation.myst.md b/examples/howto/Missing_Data_Imputation.myst.md
index f38482741..712e44fd4 100644
--- a/examples/howto/Missing_Data_Imputation.myst.md
+++ b/examples/howto/Missing_Data_Imputation.myst.md
@@ -247,7 +247,7 @@ pd.DataFrame(mle_sample.corr(), columns=data.columns, index=data.columns)
### Bootstrapping Sensitivity Analysis
-We may also want to validate the estimated parameters against bootstrapped samples under different speficiations of missing-ness.
+We may also want to validate the estimated parameters against bootstrapped samples under different specifications of missing-ness.
```{code-cell} ipython3
data_200 = df_employee[["worksat", "empower", "lmx"]].dropna().sample(200)
@@ -534,7 +534,7 @@ az.plot_ppc(idata_normal)
### Process the Posterior Predictive Distribution
-Above we estimated a number of likelihood terms in a single PyMC model context. These likelihoods constrained the hyper-parameters which determined the imputation values of the missing terms in the variables used as predictors in our focal regression equation for `empower`. But we could also perform a more manual sequential imputation, where we model each of the subordinate regression equations seperately and extract the imputed values for each variable in turn and then run a simple regression on the imputed values for the focal regression equation.
+Above we estimated a number of likelihood terms in a single PyMC model context. These likelihoods constrained the hyper-parameters which determined the imputation values of the missing terms in the variables used as predictors in our focal regression equation for `empower`. But we could also perform a more manual sequential imputation, where we model each of the subordinate regression equations separately and extract the imputed values for each variable in turn and then run a simple regression on the imputed values for the focal regression equation.
We show here how to extract the imputed values for each of the regression equations and augment the observed data.
diff --git a/examples/howto/arbitrary_stochastic.py b/examples/howto/arbitrary_stochastic.py
index 1840e1161..d76e6bf17 100644
--- a/examples/howto/arbitrary_stochastic.py
+++ b/examples/howto/arbitrary_stochastic.py
@@ -4,7 +4,7 @@
import pymc3 as pm
-# custom log-liklihood
+# custom log-likelihood
def logp(failure, lam, value):
return tt.sum(failure * tt.log(lam) - lam * value)
diff --git a/examples/howto/blackbox_external_likelihood_numpy.ipynb b/examples/howto/blackbox_external_likelihood_numpy.ipynb
index e894ffcaf..ad6655e7c 100644
--- a/examples/howto/blackbox_external_likelihood_numpy.ipynb
+++ b/examples/howto/blackbox_external_likelihood_numpy.ipynb
@@ -803,7 +803,7 @@
" outputs[1][0] = grad_wrt_c\n",
"\n",
"\n",
- "# Initalize the Ops\n",
+ "# Initialize the Ops\n",
"loglikewithgrad_op = LogLikeWithGrad()\n",
"loglikegrad_op = LogLikeGrad()"
]
diff --git a/examples/howto/blackbox_external_likelihood_numpy.myst.md b/examples/howto/blackbox_external_likelihood_numpy.myst.md
index 56c6bbb87..6d2d5cfc4 100644
--- a/examples/howto/blackbox_external_likelihood_numpy.myst.md
+++ b/examples/howto/blackbox_external_likelihood_numpy.myst.md
@@ -389,7 +389,7 @@ class LogLikeGrad(Op):
outputs[1][0] = grad_wrt_c
-# Initalize the Ops
+# Initialize the Ops
loglikewithgrad_op = LogLikeWithGrad()
loglikegrad_op = LogLikeGrad()
```
diff --git a/examples/howto/updating_priors.ipynb b/examples/howto/updating_priors.ipynb
index 88f628594..932e26d0d 100644
--- a/examples/howto/updating_priors.ipynb
+++ b/examples/howto/updating_priors.ipynb
@@ -403,7 +403,7 @@
":class: warning\n",
"Observe that, despite the fact that the iterations seems improving, some of them don't look so good, even sometimes it seems it regresses. In addition to reasons noted at the beginning of the notebook, there are a couple key steps in the process where randomness is involved. Thus, things should be expected to improve on average.\n",
"\n",
- "1. New observations are random. If in the initial iterations we get values closer to the bulk of the distribuion and then we get several values in a row from the positive tail, the iterations where we have accumulated a couple draws from the tail will probably be biased and \"look worse\" than previous ones.\n",
+ "1. New observations are random. If in the initial iterations we get values closer to the bulk of the distribution and then we get several values in a row from the positive tail, the iterations where we have accumulated a couple draws from the tail will probably be biased and \"look worse\" than previous ones.\n",
"2. MCMC is random. Even when it converges, MCMC is a random process, so different calls to `pymc.sample` will return values centered around the exact posterior but not always the same; how large a variation we should expect can be checked with {func}`arviz.mcse`. KDEs also incorporate this often negligible yet present source of uncertainty in the posterior estimates, and so will the generated Interpolated distributions.\n",
"\n",
"+++\n",
diff --git a/examples/howto/updating_priors.myst.md b/examples/howto/updating_priors.myst.md
index 960eadac5..2b1c1e3ea 100644
--- a/examples/howto/updating_priors.myst.md
+++ b/examples/howto/updating_priors.myst.md
@@ -191,7 +191,7 @@ What is interesting to note is that the posterior distributions for our paramete
:class: warning
Observe that, despite the fact that the iterations seems improving, some of them don't look so good, even sometimes it seems it regresses. In addition to reasons noted at the beginning of the notebook, there are a couple key steps in the process where randomness is involved. Thus, things should be expected to improve on average.
-1. New observations are random. If in the initial iterations we get values closer to the bulk of the distribuion and then we get several values in a row from the positive tail, the iterations where we have accumulated a couple draws from the tail will probably be biased and "look worse" than previous ones.
+1. New observations are random. If in the initial iterations we get values closer to the bulk of the distribution and then we get several values in a row from the positive tail, the iterations where we have accumulated a couple draws from the tail will probably be biased and "look worse" than previous ones.
2. MCMC is random. Even when it converges, MCMC is a random process, so different calls to `pymc.sample` will return values centered around the exact posterior but not always the same; how large a variation we should expect can be checked with {func}`arviz.mcse`. KDEs also incorporate this often negligible yet present source of uncertainty in the posterior estimates, and so will the generated Interpolated distributions.
+++
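As a quick illustration of the `arviz.mcse` check mentioned above, using a bundled ArviZ example trace as a stand-in for this notebook's own InferenceData:

```python
import arviz as az

# Monte Carlo standard error per variable: a rough guide to how much
# run-to-run variation to expect from the sampler alone.
idata = az.load_arviz_data("centered_eight")
print(az.mcse(idata, var_names=["mu", "tau"]))
```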
diff --git a/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.ipynb b/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.ipynb
index 59599fcaa..eccdc16ea 100644
--- a/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.ipynb
+++ b/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.ipynb
@@ -2021,7 +2021,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The Sequential Monte Carlo (SMC) sampler can be used to sample a regular Bayesian model or to run model without a likelihood (Aproximate Bayesian Computation). Let's try first with a regular model,"
+ "The Sequential Monte Carlo (SMC) sampler can be used to sample a regular Bayesian model or to run a model without a likelihood (Approximate Bayesian Computation). Let's try first with a regular model,"
]
},
{
@@ -2331,7 +2331,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "As outlined in the SMC tutorial on PyMC.io, the SMC sampler can be used for Aproximate Bayesian Computation, i.e. we can use a `pm.Simulator` instead of a explicit likelihood. Here is a rewrite of the PyMC - odeint model for SMC-ABC.\n",
+ "As outlined in the SMC tutorial on PyMC.io, the SMC sampler can be used for Approximate Bayesian Computation, i.e. we can use a `pm.Simulator` instead of an explicit likelihood. Here is a rewrite of the PyMC - odeint model for SMC-ABC.\n",
"\n",
"The simulator function needs to have the correct signature (e.g., accept an rng argument first). "
]
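A minimal sketch of the signature requirement mentioned above, using a toy normal simulator rather than the ODE (names, priors, and values are illustrative):

```python
import numpy as np
import pymc as pm

# Hypothetical simulator: the random generator comes first, then the
# parameters, then `size`.
def simulate(rng, a, b, size=None):
    return rng.normal(a, b, size=size)

observed = np.random.default_rng(1).normal(2.0, 1.0, size=200)

with pm.Model():
    a = pm.Normal("a", 0, 5)
    b = pm.HalfNormal("b", 5)
    pm.Simulator("y", simulate, params=(a, b), epsilon=1.0, observed=observed)
    idata = pm.sample_smc(draws=500)
```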
@@ -4226,7 +4226,7 @@
"metadata": {},
"source": [
"**Notes:** \n",
- "If we ran the samplers for long enough to get good inferences, we would expect them to converge on the same posterior probability distributions. This is not necessarily true for Aproximate Bayssian Computation, unless we first ensure that the approximation too the likelihood is good enough. For instance SMCe=1 is providing a wrong result, we have been warning that this was most likely the case when we use `plot_trace` as a diagnostic. For SMC e=10, we see that posterior mean agrees with the other samplers, but the posterior is wider. This is expected with ABC methods. A smaller value of epsilon, maybe 5, should provide a posterior closer to the true one."
+ "If we ran the samplers for long enough to get good inferences, we would expect them to converge on the same posterior probability distributions. This is not necessarily true for Approximate Bayesian Computation, unless we first ensure that the approximation to the likelihood is good enough. For instance SMC e=1 is providing a wrong result; we had been warned that this was most likely the case when we used `plot_trace` as a diagnostic. For SMC e=10, we see that the posterior mean agrees with the other samplers, but the posterior is wider. This is expected with ABC methods. A smaller value of epsilon, maybe 5, should provide a posterior closer to the true one."
]
},
{
diff --git a/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.myst.md b/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.myst.md
index a6f153dff..9e66ff826 100644
--- a/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.myst.md
+++ b/examples/ode_models/ODE_Lotka_Volterra_multiple_ways.myst.md
@@ -441,7 +441,7 @@ The old-school Metropolis sampler is less reliable and slower than the DEMetropl
+++
-The Sequential Monte Carlo (SMC) sampler can be used to sample a regular Bayesian model or to run model without a likelihood (Aproximate Bayesian Computation). Let's try first with a regular model,
+The Sequential Monte Carlo (SMC) sampler can be used to sample a regular Bayesian model or to run a model without a likelihood (Approximate Bayesian Computation). Let's try first with a regular model,
+++
@@ -479,7 +479,7 @@ At this number of samples and tuning scheme, the SMC algorithm results in wider
+++
-As outlined in the SMC tutorial on PyMC.io, the SMC sampler can be used for Aproximate Bayesian Computation, i.e. we can use a `pm.Simulator` instead of a explicit likelihood. Here is a rewrite of the PyMC - odeint model for SMC-ABC.
+As outlined in the SMC tutorial on PyMC.io, the SMC sampler can be used for Approximate Bayesian Computation, i.e. we can use a `pm.Simulator` instead of an explicit likelihood. Here is a rewrite of the PyMC - odeint model for SMC-ABC.
The simulator function needs to have the correct signature (e.g., accept an rng argument first).
@@ -914,7 +914,7 @@ for var_name in var_names:
```
**Notes:**
-If we ran the samplers for long enough to get good inferences, we would expect them to converge on the same posterior probability distributions. This is not necessarily true for Aproximate Bayssian Computation, unless we first ensure that the approximation too the likelihood is good enough. For instance SMCe=1 is providing a wrong result, we have been warning that this was most likely the case when we use `plot_trace` as a diagnostic. For SMC e=10, we see that posterior mean agrees with the other samplers, but the posterior is wider. This is expected with ABC methods. A smaller value of epsilon, maybe 5, should provide a posterior closer to the true one.
+If we ran the samplers for long enough to get good inferences, we would expect them to converge on the same posterior probability distributions. This is not necessarily true for Approximate Bayesian Computation, unless we first ensure that the approximation to the likelihood is good enough. For instance SMC e=1 is providing a wrong result; we had been warned that this was most likely the case when we used `plot_trace` as a diagnostic. For SMC e=10, we see that the posterior mean agrees with the other samplers, but the posterior is wider. This is expected with ABC methods. A smaller value of epsilon, maybe 5, should provide a posterior closer to the true one.
+++
diff --git a/examples/ode_models/ODE_with_manual_gradients.ipynb b/examples/ode_models/ODE_with_manual_gradients.ipynb
index d5062af4b..0b8a2b469 100644
--- a/examples/ode_models/ODE_with_manual_gradients.ipynb
+++ b/examples/ode_models/ODE_with_manual_gradients.ipynb
@@ -137,7 +137,7 @@
"\n",
"Using forward sensitivity analysis we can obtain both the state $X(t)$ and its derivative w.r.t the parameters, at each time point, as the solution to an initial value problem by augmenting the original ODE system with the sensitivity equations $Z_{kd}$. The augmented ODE system $\\big(X(t), Z(t)\\big)$ can then be solved together using a chosen numerical method. The augmented ODE system needs the initial values for the sensitivity equations. All of these should be set to zero except the ones where the sensitivity of a state w.r.t. its own initial value is sought, that is $ \\frac{\\partial X_k(t)}{\\partial X_k (0)} =1 $. Note that in order to solve this augmented system we have to embark in the tedious process of deriving $ \\frac{\\partial f_k}{\\partial X_i (t)}$, also known as the Jacobian of an ODE, and $\\frac{\\partial f_k}{\\partial \\theta_d}$ terms. Thankfully, many ODE solvers calculate these terms and solve the augmented system when asked for by the user. An example would be the [SUNDIAL CVODES solver suite](https://computation.llnl.gov/projects/sundials/cvodes). A Python wrapper for CVODES can be found [here](https://jmodelica.org/assimulo/). \n",
"\n",
- "However, for this tutorial I would go ahead and derive the terms mentioned above, manually, and solve the Lotka-Volterra ODEs alongwith the sensitivites in the following code block. The functions `jac` and `dfdp` below calculate $ \\frac{\\partial f_k}{\\partial X_i (t)}$ and $\\frac{\\partial f_k}{\\partial \\theta_d}$ respectively for the Lotka-Volterra model. For convenience I have transformed the sensitivity equation in a matrix form. Here I extended the solver code snippet above to include sensitivities when asked for."
+ "However, for this tutorial I will go ahead and derive the terms mentioned above, manually, and solve the Lotka-Volterra ODEs along with the sensitivities in the following code block. The functions `jac` and `dfdp` below calculate $ \\frac{\\partial f_k}{\\partial X_i (t)}$ and $\\frac{\\partial f_k}{\\partial \\theta_d}$ respectively for the Lotka-Volterra model. For convenience I have transformed the sensitivity equation into matrix form. Here I extended the solver code snippet above to include sensitivities when asked for."
]
},
{
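The augmented system described above can be sketched numerically as follows; the parameter ordering (alpha, beta, gamma, delta), initial values, and function names are illustrative assumptions rather than the notebook's exact code:

```python
import numpy as np
from scipy.integrate import odeint

# Augmented Lotka-Volterra system: states (x, y) plus the 2x4 sensitivity
# matrix Z = dX/dtheta for theta = (alpha, beta, gamma, delta).
def rhs_aug(z, t, alpha, beta, gamma, delta):
    x, y = z[:2]
    Z = z[2:].reshape(2, 4)
    f = np.array([alpha * x - beta * x * y, -gamma * y + delta * x * y])
    J = np.array([[alpha - beta * y, -beta * x],
                  [delta * y, -gamma + delta * x]])  # df/dX (Jacobian)
    dfdp = np.array([[x, -x * y, 0.0, 0.0],
                     [0.0, 0.0, -y, x * y]])         # df/dtheta
    dZ = J @ Z + dfdp                                # sensitivity equations
    return np.concatenate([f, dZ.ravel()])

z0 = np.concatenate([[10.0, 5.0], np.zeros(8)])      # states + zero initial parameter sensitivities
t = np.linspace(0, 15, 100)
sol = odeint(rhs_aug, z0, t, args=(0.5, 0.02, 0.8, 0.02))
states, sens = sol[:, :2], sol[:, 2:].reshape(-1, 2, 4)
```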
diff --git a/examples/ode_models/ODE_with_manual_gradients.myst.md b/examples/ode_models/ODE_with_manual_gradients.myst.md
index 5e1bf8d2d..b1c63f706 100644
--- a/examples/ode_models/ODE_with_manual_gradients.myst.md
+++ b/examples/ode_models/ODE_with_manual_gradients.myst.md
@@ -111,7 +111,7 @@ $$Z_{kd}(t)=\frac{d }{d t} \left\{\frac{\partial X_k (t)}{\partial \theta_d}\rig
Using forward sensitivity analysis we can obtain both the state $X(t)$ and its derivative w.r.t the parameters, at each time point, as the solution to an initial value problem by augmenting the original ODE system with the sensitivity equations $Z_{kd}$. The augmented ODE system $\big(X(t), Z(t)\big)$ can then be solved together using a chosen numerical method. The augmented ODE system needs the initial values for the sensitivity equations. All of these should be set to zero except the ones where the sensitivity of a state w.r.t. its own initial value is sought, that is $ \frac{\partial X_k(t)}{\partial X_k (0)} =1 $. Note that in order to solve this augmented system we have to embark in the tedious process of deriving $ \frac{\partial f_k}{\partial X_i (t)}$, also known as the Jacobian of an ODE, and $\frac{\partial f_k}{\partial \theta_d}$ terms. Thankfully, many ODE solvers calculate these terms and solve the augmented system when asked for by the user. An example would be the [SUNDIAL CVODES solver suite](https://computation.llnl.gov/projects/sundials/cvodes). A Python wrapper for CVODES can be found [here](https://jmodelica.org/assimulo/).
-However, for this tutorial I would go ahead and derive the terms mentioned above, manually, and solve the Lotka-Volterra ODEs alongwith the sensitivites in the following code block. The functions `jac` and `dfdp` below calculate $ \frac{\partial f_k}{\partial X_i (t)}$ and $\frac{\partial f_k}{\partial \theta_d}$ respectively for the Lotka-Volterra model. For convenience I have transformed the sensitivity equation in a matrix form. Here I extended the solver code snippet above to include sensitivities when asked for.
+However, for this tutorial I will go ahead and derive the terms mentioned above, manually, and solve the Lotka-Volterra ODEs along with the sensitivities in the following code block. The functions `jac` and `dfdp` below calculate $ \frac{\partial f_k}{\partial X_i (t)}$ and $\frac{\partial f_k}{\partial \theta_d}$ respectively for the Lotka-Volterra model. For convenience I have transformed the sensitivity equation into matrix form. Here I extended the solver code snippet above to include sensitivities when asked for.
```{code-cell} ipython3
n_states = 2
diff --git a/examples/samplers/DEMetropolisZ_EfficiencyComparison.ipynb b/examples/samplers/DEMetropolisZ_EfficiencyComparison.ipynb
index 850e514f6..de2f08aaa 100644
--- a/examples/samplers/DEMetropolisZ_EfficiencyComparison.ipynb
+++ b/examples/samplers/DEMetropolisZ_EfficiencyComparison.ipynb
@@ -1247,7 +1247,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The efficiency advantage for `NUTS` over `DEMetropolisZ` over `DEMetropolis` is more pronounced in higher dimensions. $\\hat{R}$ is also large for `DEMetropolis` for this sample size and number of chains. For `DEMetropolis`, a smaller number of chains ($2N$) with a larger number of samples performed better than more chains with fewer samples. Counter-intuitively, the `NUTS` sampler yeilds $ESS$ values greater than the number of samples, which can occur as discussed [here](https://discourse.pymc.io/t/effective-sample-size-larger-than-number-of-samples-for-nuts/6275)."
+ "The efficiency advantage for `NUTS` over `DEMetropolisZ` over `DEMetropolis` is more pronounced in higher dimensions. $\\hat{R}$ is also large for `DEMetropolis` for this sample size and number of chains. For `DEMetropolis`, a smaller number of chains ($2N$) with a larger number of samples performed better than more chains with fewer samples. Counter-intuitively, the `NUTS` sampler yields $ESS$ values greater than the number of samples, which can occur as discussed [here](https://discourse.pymc.io/t/effective-sample-size-larger-than-number-of-samples-for-nuts/6275)."
]
},
{
diff --git a/examples/samplers/DEMetropolisZ_EfficiencyComparison.myst.md b/examples/samplers/DEMetropolisZ_EfficiencyComparison.myst.md
index 8762f6249..269d5cbd8 100644
--- a/examples/samplers/DEMetropolisZ_EfficiencyComparison.myst.md
+++ b/examples/samplers/DEMetropolisZ_EfficiencyComparison.myst.md
@@ -528,7 +528,7 @@ results_df[cols[~cols.isin(["Trace", "Run"])]].round(2).style.set_caption(
plot_comparison_bars(results_df)
```
-The efficiency advantage for `NUTS` over `DEMetropolisZ` over `DEMetropolis` is more pronounced in higher dimensions. $\hat{R}$ is also large for `DEMetropolis` for this sample size and number of chains. For `DEMetropolis`, a smaller number of chains ($2N$) with a larger number of samples performed better than more chains with fewer samples. Counter-intuitively, the `NUTS` sampler yeilds $ESS$ values greater than the number of samples, which can occur as discussed [here](https://discourse.pymc.io/t/effective-sample-size-larger-than-number-of-samples-for-nuts/6275).
+The efficiency advantage for `NUTS` over `DEMetropolisZ` over `DEMetropolis` is more pronounced in higher dimensions. $\hat{R}$ is also large for `DEMetropolis` for this sample size and number of chains. For `DEMetropolis`, a smaller number of chains ($2N$) with a larger number of samples performed better than more chains with fewer samples. Counter-intuitively, the `NUTS` sampler yields $ESS$ values greater than the number of samples, which can occur as discussed [here](https://discourse.pymc.io/t/effective-sample-size-larger-than-number-of-samples-for-nuts/6275).
```{code-cell} ipython3
plot_forest_compare_analytical(results_df)
diff --git a/examples/samplers/DEMetropolisZ_tune_drop_fraction.ipynb b/examples/samplers/DEMetropolisZ_tune_drop_fraction.ipynb
index 5f054f627..fee845944 100644
--- a/examples/samplers/DEMetropolisZ_tune_drop_fraction.ipynb
+++ b/examples/samplers/DEMetropolisZ_tune_drop_fraction.ipynb
@@ -123,7 +123,7 @@
"metadata": {},
"source": [
"## Problem Statement\n",
- "In this notebook, a 10-dimensional multivariate normal target density will be sampled with `DEMetropolisZ` while varing four parameters to identify efficient sampling schemes. The four parameters are the following:\n",
+ "In this notebook, a 10-dimensional multivariate normal target density will be sampled with `DEMetropolisZ` while varying four parameters to identify efficient sampling schemes. The four parameters are the following:\n",
"* `drop_tuning_fraction`, which determines the number of samples from the tuning phase that are recycled for the purpose of random vector $(x_{R1}-x_{R2})$ selection, \n",
"* `lamb` ($\\gamma$), which scales the size of the jumps relative to the random vector, \n",
"* `scaling` ($b$), which scales the size of the jumps for the noise term $\\epsilon$, and \n",
diff --git a/examples/samplers/DEMetropolisZ_tune_drop_fraction.myst.md b/examples/samplers/DEMetropolisZ_tune_drop_fraction.myst.md
index 4eba98f78..40bd9e248 100644
--- a/examples/samplers/DEMetropolisZ_tune_drop_fraction.myst.md
+++ b/examples/samplers/DEMetropolisZ_tune_drop_fraction.myst.md
@@ -84,7 +84,7 @@ In PyMC, we can tune either `lamb` ($\gamma$), or `scaling` ($b$), and the other
+++
## Problem Statement
-In this notebook, a 10-dimensional multivariate normal target density will be sampled with `DEMetropolisZ` while varing four parameters to identify efficient sampling schemes. The four parameters are the following:
+In this notebook, a 10-dimensional multivariate normal target density will be sampled with `DEMetropolisZ` while varying four parameters to identify efficient sampling schemes. The four parameters are the following:
* `drop_tuning_fraction`, which determines the number of samples from the tuning phase that are recycled for the purpose of random vector $(x_{R1}-x_{R2})$ selection,
* `lamb` ($\gamma$), which scales the size of the jumps relative to the random vector,
* `scaling` ($b$), which scales the size of the jumps for the noise term $\epsilon$, and
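As a rough sketch of how the settings listed above are passed to the sampler (keyword names such as `tune_drop_fraction` follow recent PyMC releases and should be checked against the installed version; the prior and the number of draws are illustrative):

```python
import pymc as pm

with pm.Model() as model:
    x = pm.Normal("x", mu=0, sigma=1, shape=10)  # stand-in 10-dimensional target
    step = pm.DEMetropolisZ(
        lamb=2.38 / (2 * 10) ** 0.5,  # jump scale relative to the random vector
        scaling=0.001,                # scale of the noise term epsilon
        tune="lambda",                # which of the two is adapted while tuning
        tune_drop_fraction=0.9,       # fraction of the tuning history discarded
    )
    idata = pm.sample(draws=5000, tune=2000, step=step, chains=4)
```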
diff --git a/examples/samplers/fast_sampling_with_jax_and_numba.ipynb b/examples/samplers/fast_sampling_with_jax_and_numba.ipynb
index b299cdb63..45139e3b3 100644
--- a/examples/samplers/fast_sampling_with_jax_and_numba.ipynb
+++ b/examples/samplers/fast_sampling_with_jax_and_numba.ipynb
@@ -30,7 +30,7 @@
"\n",
"For the JAX backend there is the NumPyro and BlackJAX NUTS sampler available. To use these samplers, you have to install `numpyro` and `blackjax`. Both of them are available through conda/mamba: `mamba install -c conda-forge numpyro blackjax`.\n",
"\n",
- "For the Numba backend, there is the [Nutpie sampler](https://github.com/pymc-devs/nutpie) writte in Rust. To use this sampler you need `nutpie` installed: `mamba install -c conda-forge nutpie`. "
+ "For the Numba backend, there is the [Nutpie sampler](https://github.com/pymc-devs/nutpie) written in Rust. To use this sampler you need `nutpie` installed: `mamba install -c conda-forge nutpie`. "
]
},
{
diff --git a/examples/samplers/fast_sampling_with_jax_and_numba.myst.md b/examples/samplers/fast_sampling_with_jax_and_numba.myst.md
index efe039fe2..3c3c3ede7 100644
--- a/examples/samplers/fast_sampling_with_jax_and_numba.myst.md
+++ b/examples/samplers/fast_sampling_with_jax_and_numba.myst.md
@@ -33,7 +33,7 @@ However, by compiling to other backends, we can use samplers written in other la
For the JAX backend there is the NumPyro and BlackJAX NUTS sampler available. To use these samplers, you have to install `numpyro` and `blackjax`. Both of them are available through conda/mamba: `mamba install -c conda-forge numpyro blackjax`.
-For the Numba backend, there is the [Nutpie sampler](https://github.com/pymc-devs/nutpie) writte in Rust. To use this sampler you need `nutpie` installed: `mamba install -c conda-forge nutpie`.
+For the Numba backend, there is the [Nutpie sampler](https://github.com/pymc-devs/nutpie) written in Rust. To use this sampler you need `nutpie` installed: `mamba install -c conda-forge nutpie`.
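A minimal sketch of how these backends are selected, assuming a recent PyMC where `pm.sample` accepts the `nuts_sampler` argument:

```python
import pymc as pm

with pm.Model() as model:
    x = pm.Normal("x", mu=0, sigma=1)
    idata_nutpie = pm.sample(nuts_sampler="nutpie")       # Numba backend via nutpie
    idata_numpyro = pm.sample(nuts_sampler="numpyro")     # JAX backend via NumPyro
    idata_blackjax = pm.sample(nuts_sampler="blackjax")   # JAX backend via BlackJAX
```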
```{code-cell} ipython3
import arviz as az
diff --git a/examples/spatial/malaria_prevalence.ipynb b/examples/spatial/malaria_prevalence.ipynb
index f62aa5f36..15579f477 100644
--- a/examples/spatial/malaria_prevalence.ipynb
+++ b/examples/spatial/malaria_prevalence.ipynb
@@ -318,7 +318,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We need to convert our dataframe into a geodataframe. In order to do this we need to know what coordinate reference system (CRS) either geographic coordinate system (GCS) or projected coordinate system (PCS) to use. GCS tells you where your data is on the earth, whereas PCS tells you how to draw your data on a two-dimensional plane. There are many different GCS/PCS because each GCS/PCS is a model of the earth's surface. However, the earth's surface is variable from one location to another. Therefore, different GCS/PCS versions will be more accurate depending on the geography your analysis is based in. Since our analysis is in the Gambia we will use PCS [EPSG 32628](https://epsg.io/32628) and GCS [EPSG 4326](https://epsg.io/4326) when plotting on a globe. Where EPSG stands for European Petroluem Survey Group, which is an organization that maintains geodetic parameters for coordinate systems."
+ "We need to convert our dataframe into a geodataframe. In order to do this we need to know what coordinate reference system (CRS) either geographic coordinate system (GCS) or projected coordinate system (PCS) to use. GCS tells you where your data is on the earth, whereas PCS tells you how to draw your data on a two-dimensional plane. There are many different GCS/PCS because each GCS/PCS is a model of the earth's surface. However, the earth's surface is variable from one location to another. Therefore, different GCS/PCS versions will be more accurate depending on the geography your analysis is based in. Since our analysis is in the Gambia we will use PCS [EPSG 32628](https://epsg.io/32628) and GCS [EPSG 4326](https://epsg.io/4326) when plotting on a globe. Where EPSG stands for European Petroleum Survey Group, which is an organization that maintains geodetic parameters for coordinate systems."
]
},
{
@@ -1327,7 +1327,7 @@
"$$Y_{i} \\sim Binomial(n_{i}, P(x_{i}))$$\n",
"$$logit(P(x_{i})) = \\beta_{0} + \\beta_{1} \\times Elevation + S(x_{i})$$\n",
"\n",
- "Where $n_{i}$ represents an individual tested for malaria, $P(x_{i})$ is the prevalence of malaria at location $x_{i}$, $\\beta_{0}$ is the intercept, $\\beta_{1}$ is the coefficient for the elevation covariate and $S(x_{i})$ is a zero mean field guassian process with a Matérn covariance function with $\\nu=\\frac{3}{2}$ that we will approximate using a Hilbert Space Gaussian Process (HSGP)\n",
+ "Where $n_{i}$ represents an individual tested for malaria, $P(x_{i})$ is the prevalence of malaria at location $x_{i}$, $\\beta_{0}$ is the intercept, $\\beta_{1}$ is the coefficient for the elevation covariate and $S(x_{i})$ is a zero mean field gaussian process with a Matérn covariance function with $\\nu=\\frac{3}{2}$ that we will approximate using a Hilbert Space Gaussian Process (HSGP)\n",
"\n",
"In order to approximate a Gaussian process using an HSGP we need to select the parameters `m` and `c`. To learn more about how to set these parameters please refer to this wonderful ([example](../gaussian_processes/HSGP-Basic.myst.md)) of how to set these parameters."
]
@@ -1926,7 +1926,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can plot our out-of-sample posterior predictions to visualize the estimated prevalence of malaria across the Gambia. In figure below you'll notice that there is a smooth transition of prevalences surrounding the areas where we observed data in a way where nearer areas have more similar prevalences and as you move away you approach zero (the mean of the guassian process)."
+ "We can plot our out-of-sample posterior predictions to visualize the estimated prevalence of malaria across the Gambia. In figure below you'll notice that there is a smooth transition of prevalences surrounding the areas where we observed data in a way where nearer areas have more similar prevalences and as you move away you approach zero (the mean of the gaussian process)."
]
},
{
@@ -2023,7 +2023,7 @@
")\n",
"plt.xlabel(\"Latitude\")\n",
"plt.ylabel(\"Longitude\")\n",
- "plt.title(\"Probability of Malaria Prevelance greater than 20%\")\n",
+ "plt.title(\"Probability of Malaria Prevalence greater than 20%\")\n",
"plt.colorbar(label=\"Posterior mean\");"
]
},
@@ -2032,7 +2032,7 @@
"metadata": {},
"source": [
"# Different Covariance Functions\n",
- "Before we conclude let's talk breifly about why we decided to use the Matérn family of covariance functions instead of the Exponential Quadratic. The Matérn family of covariances is a generalization of the Exponential Quadratic. When the smoothing parameter of the Matérn $\\nu \\to \\infty$ then we have the Exponential Quadratic covariance function. As the smoothing parameter increases the function you are estimating becomes smoother. A few commonly used values for $\\nu$ are $\\frac{1}{2}$, $\\frac{3}{2}$, and $\\frac{5}{2}$. Typically, when estimating a measure that has a spatial dependence we don't want an overly smooth function because that will prevent our estimate to capture abrupt changes in the measurement we are estimating. Below we simulate some data to show how the Matérn is able to capture these abrupt changes, whereas the Exponential Quadratic is overly smooth. For simplicity's sake we will be working in one dimension but these concepts apply with two-dimensional data."
+ "Before we conclude let's talk briefly about why we decided to use the Matérn family of covariance functions instead of the Exponential Quadratic. The Matérn family of covariances is a generalization of the Exponential Quadratic. When the smoothing parameter of the Matérn $\\nu \\to \\infty$ then we have the Exponential Quadratic covariance function. As the smoothing parameter increases the function you are estimating becomes smoother. A few commonly used values for $\\nu$ are $\\frac{1}{2}$, $\\frac{3}{2}$, and $\\frac{5}{2}$. Typically, when estimating a measure that has a spatial dependence we don't want an overly smooth function because that will prevent our estimate to capture abrupt changes in the measurement we are estimating. Below we simulate some data to show how the Matérn is able to capture these abrupt changes, whereas the Exponential Quadratic is overly smooth. For simplicity's sake we will be working in one dimension but these concepts apply with two-dimensional data."
]
},
{
diff --git a/examples/spatial/malaria_prevalence.myst.md b/examples/spatial/malaria_prevalence.myst.md
index 42e94d1e3..f69441dfc 100644
--- a/examples/spatial/malaria_prevalence.myst.md
+++ b/examples/spatial/malaria_prevalence.myst.md
@@ -80,7 +80,7 @@ gambia_agg = (
gambia_agg.head()
```
-We need to convert our dataframe into a geodataframe. In order to do this we need to know what coordinate reference system (CRS) either geographic coordinate system (GCS) or projected coordinate system (PCS) to use. GCS tells you where your data is on the earth, whereas PCS tells you how to draw your data on a two-dimensional plane. There are many different GCS/PCS because each GCS/PCS is a model of the earth's surface. However, the earth's surface is variable from one location to another. Therefore, different GCS/PCS versions will be more accurate depending on the geography your analysis is based in. Since our analysis is in the Gambia we will use PCS [EPSG 32628](https://epsg.io/32628) and GCS [EPSG 4326](https://epsg.io/4326) when plotting on a globe. Where EPSG stands for European Petroluem Survey Group, which is an organization that maintains geodetic parameters for coordinate systems.
+We need to convert our dataframe into a geodataframe. To do this we need to know which coordinate reference system (CRS), either a geographic coordinate system (GCS) or a projected coordinate system (PCS), to use. A GCS tells you where your data is on the earth, whereas a PCS tells you how to draw your data on a two-dimensional plane. There are many different GCS/PCS options because each is a model of the earth's surface, and the earth's surface varies from one location to another, so different GCS/PCS versions will be more accurate depending on the geography your analysis is based in. Since our analysis is in the Gambia we will use PCS [EPSG 32628](https://epsg.io/32628) and GCS [EPSG 4326](https://epsg.io/4326) when plotting on a globe. EPSG stands for European Petroleum Survey Group, an organization that maintains geodetic parameters for coordinate systems.
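A minimal sketch of the conversion described here, assuming the aggregated dataframe stores longitude and latitude in columns named `x` and `y` (the column names are illustrative):

```python
import geopandas as gpd

# Build the GeoDataFrame in the geographic CRS (EPSG 4326)...
gambia_gpdf = gpd.GeoDataFrame(
    gambia_agg,
    geometry=gpd.points_from_xy(gambia_agg["x"], gambia_agg["y"]),
    crs="EPSG:4326",
)
# ...and reproject to the projected CRS (EPSG 32628) for distance-based modelling.
gambia_gpdf_projected = gambia_gpdf.to_crs("EPSG:32628")
```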
```{code-cell} ipython3
# Create a GeoDataframe and set coordinate reference system to EPSG 4326
@@ -221,7 +221,7 @@ We specify the following model:
$$Y_{i} \sim Binomial(n_{i}, P(x_{i}))$$
$$logit(P(x_{i})) = \beta_{0} + \beta_{1} \times Elevation + S(x_{i})$$
-Where $n_{i}$ represents an individual tested for malaria, $P(x_{i})$ is the prevalence of malaria at location $x_{i}$, $\beta_{0}$ is the intercept, $\beta_{1}$ is the coefficient for the elevation covariate and $S(x_{i})$ is a zero mean field guassian process with a Matérn covariance function with $\nu=\frac{3}{2}$ that we will approximate using a Hilbert Space Gaussian Process (HSGP)
+Where $n_{i}$ represents the number of individuals tested for malaria at location $x_{i}$, $P(x_{i})$ is the prevalence of malaria at location $x_{i}$, $\beta_{0}$ is the intercept, $\beta_{1}$ is the coefficient for the elevation covariate, and $S(x_{i})$ is a zero-mean Gaussian process with a Matérn covariance function with $\nu=\frac{3}{2}$ that we will approximate using a Hilbert Space Gaussian Process (HSGP).
In order to approximate a Gaussian process using an HSGP we need to select the parameters `m` and `c`. To learn more about how to set these parameters please refer to this wonderful ([example](../gaussian_processes/HSGP-Basic.myst.md)) of how to set these parameters.
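A hedged sketch of this specification, with made-up data arrays and illustrative values for `m`, `c`, the lengthscale prior, and the regression priors; the notebook's actual cell will differ:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 2))        # stand-in for projected coordinates
elevation = rng.normal(size=65)     # stand-in for the elevation covariate
n_tested = np.full(65, 30)          # number tested per location
positive = rng.binomial(n_tested, 0.3)

with pm.Model() as hsgp_sketch:
    beta_0 = pm.Normal("beta_0", 0, 1)
    beta_1 = pm.Normal("beta_1", 0, 1)
    ls = pm.Gamma("ls", mu=20, sigma=5)
    cov = pm.gp.cov.Matern32(input_dim=2, ls=ls)
    gp = pm.gp.HSGP(m=[25, 25], c=1.5, cov_func=cov)
    s = gp.prior("s", X=X)  # S(x_i), the spatial random effect
    p = pm.Deterministic("p", pm.math.invlogit(beta_0 + beta_1 * elevation + s))
    pm.Binomial("y", n=n_tested, p=p, observed=positive)
```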
@@ -342,7 +342,7 @@ with hsgp_model:
posterior_predictive_prevalence = pp["posterior_predictive"]["p"]
```
-We can plot our out-of-sample posterior predictions to visualize the estimated prevalence of malaria across the Gambia. In figure below you'll notice that there is a smooth transition of prevalences surrounding the areas where we observed data in a way where nearer areas have more similar prevalences and as you move away you approach zero (the mean of the guassian process).
+We can plot our out-of-sample posterior predictions to visualize the estimated prevalence of malaria across the Gambia. In the figure below you'll notice a smooth transition of prevalences surrounding the areas where we observed data, such that nearer areas have more similar prevalences, and as you move away the estimates approach zero (the mean of the Gaussian process).
```{code-cell} ipython3
fig = plt.figure(figsize=(16, 8))
@@ -388,12 +388,12 @@ plt.scatter(
)
plt.xlabel("Latitude")
plt.ylabel("Longitude")
-plt.title("Probability of Malaria Prevelance greater than 20%")
+plt.title("Probability of Malaria Prevalence greater than 20%")
plt.colorbar(label="Posterior mean");
```
# Different Covariance Functions
-Before we conclude let's talk breifly about why we decided to use the Matérn family of covariance functions instead of the Exponential Quadratic. The Matérn family of covariances is a generalization of the Exponential Quadratic. When the smoothing parameter of the Matérn $\nu \to \infty$ then we have the Exponential Quadratic covariance function. As the smoothing parameter increases the function you are estimating becomes smoother. A few commonly used values for $\nu$ are $\frac{1}{2}$, $\frac{3}{2}$, and $\frac{5}{2}$. Typically, when estimating a measure that has a spatial dependence we don't want an overly smooth function because that will prevent our estimate to capture abrupt changes in the measurement we are estimating. Below we simulate some data to show how the Matérn is able to capture these abrupt changes, whereas the Exponential Quadratic is overly smooth. For simplicity's sake we will be working in one dimension but these concepts apply with two-dimensional data.
+Before we conclude, let's talk briefly about why we decided to use the Matérn family of covariance functions instead of the Exponential Quadratic. The Matérn family of covariances is a generalization of the Exponential Quadratic: as the smoothing parameter of the Matérn $\nu \to \infty$, we recover the Exponential Quadratic covariance function. As the smoothing parameter increases the function you are estimating becomes smoother. A few commonly used values for $\nu$ are $\frac{1}{2}$, $\frac{3}{2}$, and $\frac{5}{2}$. Typically, when estimating a measure that has a spatial dependence we don't want an overly smooth function because that will prevent our estimate from capturing abrupt changes in the measurement we are estimating. Below we simulate some data to show how the Matérn is able to capture these abrupt changes, whereas the Exponential Quadratic is overly smooth. For simplicity's sake we will be working in one dimension but these concepts apply to two-dimensional data.
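One quick way to see the difference, separate from the simulation below, is to evaluate both kernels over a small grid; a sketch:

```python
import numpy as np
import pymc as pm

X = np.linspace(0, 2, 5)[:, None]
matern = pm.gp.cov.Matern32(input_dim=1, ls=1.0)
expquad = pm.gp.cov.ExpQuad(input_dim=1, ls=1.0)

# First row of each covariance matrix: correlation of every grid point with x=0.
# ExpQuad sample paths are infinitely differentiable, Matérn 3/2 paths only once
# differentiable, which is why the Matérn can track more abrupt changes.
print(matern(X).eval()[0])
print(expquad(X).eval()[0])
```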
```{code-cell} ipython3
# simulate 1-dimensional data
diff --git a/examples/spatial/nyc_bym.ipynb b/examples/spatial/nyc_bym.ipynb
index 8bc7cd111..2e7a9b986 100644
--- a/examples/spatial/nyc_bym.ipynb
+++ b/examples/spatial/nyc_bym.ipynb
@@ -81,7 +81,7 @@
"\n",
"BYM also scales well with large datasets. A common problem with spatial models is that their computational cost grows rapidly as the size of the dataset increases. This is the case, for example, with PyMC's {ref}`CAR model `. With the BYM model, the growth in computational cost is nearly linear.\n",
"\n",
- "The BYM model works with *areal* data, such as neighboring states, counties, or census tracks. For problems involving spatial points or continuous measures of distance, consider using a {ref}`Gaussian Proccess ` instead."
+ "The BYM model works with *areal* data, such as neighboring states, counties, or census tracks. For problems involving spatial points or continuous measures of distance, consider using a {ref}`Gaussian Process ` instead."
]
},
{
@@ -145,7 +145,7 @@
"\n",
"In this way, ICAR encodes the core assumption of spatial statistics - *nearby areas should be more similar to each other than distant areas*. The most likely outcome is a graph where every node has the same value. In this case, the square distance between neighbors is always zero. The more a graph experiences abrupt changes between neighboring areas, the lower the log density.\n",
"\n",
- "ICAR has a few other special features: it is contrained so all the $\\phi$'s add up to zero. This also implies the mean of the $\\phi$'s is zero. It can be helpful to think of ICAR values as similar to z-scores. They represent relative deviations centered around 0. ICAR is also typically only used as a sub-component of a larger model. Other parts of the model typically adjust the scale (with a variance parameter) or the location (with an intercept parameter). An accessible discussion of the math behind ICAR and its relationship to CAR can be found [here](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) or in the academic paper version {cite:p}`morris2021bym`."
+ "ICAR has a few other special features: it is constrained so all the $\\phi$'s add up to zero. This also implies the mean of the $\\phi$'s is zero. It can be helpful to think of ICAR values as similar to z-scores. They represent relative deviations centered around 0. ICAR is also typically only used as a sub-component of a larger model. Other parts of the model typically adjust the scale (with a variance parameter) or the location (with an intercept parameter). An accessible discussion of the math behind ICAR and its relationship to CAR can be found [here](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) or in the academic paper version {cite:p}`morris2021bym`."
]
},
{
@@ -472,7 +472,7 @@
"id": "11462df0-d136-4bd6-85ae-b2c8bdab4873",
"metadata": {},
"source": [
- "The first `.csv` file just has the spatial structure bits. The rest of the data comes seperately - here we'll pull in the number of accidents `y` and the population size of the census track, `E`. We'll use the population size as an offset - we should expect that more populated areas will have more accidents for trivial reasons. What is more interesting is something like the excess risk associated with an area.\n",
+ "The first `.csv` file just has the spatial structure bits. The rest of the data comes separately - here we'll pull in the number of accidents `y` and the population size of the census track, `E`. We'll use the population size as an offset - we should expect that more populated areas will have more accidents for trivial reasons. What is more interesting is something like the excess risk associated with an area.\n",
"\n",
"Finally, we'll also explore one predictor variable, the social fragmentation index. The index is built out of measures of the number of vacant housing units, people living alone, renters and people who have moved within the previous year. These communities tend to be less integrated and have weaker social support systems. The social epidemiology community is interested in how ecological variables can trickle down into various facets of public health. So we'll see if social fragmentation can explain the pattern of traffic accidents. The measure is standardized to have a mean of zero and standard deviation of 1.\n"
]
@@ -623,7 +623,7 @@
"\n",
"All the parameters of the BYM were already introduced in {ref}`section 1 `. Now it's just a matter of assigning some priors. The priors on $\\theta$ are picky - we need to assign a mean of 0 and a standard deviation 1 so that we can interpret it as comparable with $\\phi$. Otherwise, the priors distributions afford the opportunity to incorporate domain expertise. In this problem, I'll pick some weakly informative priors.\n",
"\n",
- "Lastly, we'll use a Poisson outcome distribution. The number of traffic accidents is a count outcome and the maximium possible value is very large. To ensure our predictions remain positive, we'll exponentiate the linear model before passing it to the Poisson distribution."
+ "Lastly, we'll use a Poisson outcome distribution. The number of traffic accidents is a count outcome and the maximum possible value is very large. To ensure our predictions remain positive, we'll exponentiate the linear model before passing it to the Poisson distribution."
]
},
{
@@ -1440,7 +1440,7 @@
"\n",
"The scaling factor is the trick that ensures the variance of $\\phi$ roughly equals one. When the variance implied by the spatial structure is quite small, say, less than one, dividing $\\rho$ by the scaling factor will give some number greater than one. In other words, we expand the variance of $\\phi$ until it equals one. Now all the other parameters will behave properly. $\\rho$ represents a mixture between two similar things and $\\sigma$ represents the joint variance from random effects.\n",
"\n",
- "A final way to understand the purpose of the scaling factor is to imagine what would happen if we didn't include it. Suppose the graph implied very large variance, like the first preferential attachment graph above. In this case, the mixture parameter, $\\rho$, might pull in more of $\\phi$ because the data has a lot of variance and the model is searching for variance wherever it can find to explain it. But that makes the intepretation of the results challenging. Did $\\rho$ gravitate towards $\\phi$ because there is actually a strong spatial structure? Or because it had higher variance than $\\theta$? We cannot tell unless we rescale the $\\phi$."
+ "A final way to understand the purpose of the scaling factor is to imagine what would happen if we didn't include it. Suppose the graph implied very large variance, like the first preferential attachment graph above. In this case, the mixture parameter, $\\rho$, might pull in more of $\\phi$ because the data has a lot of variance and the model is searching for variance wherever it can find to explain it. But that makes the interpretation of the results challenging. Did $\\rho$ gravitate towards $\\phi$ because there is actually a strong spatial structure? Or because it had higher variance than $\\theta$? We cannot tell unless we rescale the $\\phi$."
]
},
{
diff --git a/examples/spatial/nyc_bym.myst.md b/examples/spatial/nyc_bym.myst.md
index 4555b576c..3eb71901e 100644
--- a/examples/spatial/nyc_bym.myst.md
+++ b/examples/spatial/nyc_bym.myst.md
@@ -59,7 +59,7 @@ This notebook explains why and how to deploy the Besag-York-Mollie (BYM) model i
BYM also scales well with large datasets. A common problem with spatial models is that their computational cost grows rapidly as the size of the dataset increases. This is the case, for example, with PyMC's {ref}`CAR model `. With the BYM model, the growth in computational cost is nearly linear.
-The BYM model works with *areal* data, such as neighboring states, counties, or census tracks. For problems involving spatial points or continuous measures of distance, consider using a {ref}`Gaussian Proccess ` instead.
+The BYM model works with *areal* data, such as neighboring states, counties, or census tracts. For problems involving spatial points or continuous measures of distance, consider using a {ref}`Gaussian Process ` instead.
+++
@@ -96,7 +96,7 @@ So, for example, imagine that the intensity of the color represents the value of
In this way, ICAR encodes the core assumption of spatial statistics - *nearby areas should be more similar to each other than distant areas*. The most likely outcome is a graph where every node has the same value. In this case, the square distance between neighbors is always zero. The more a graph experiences abrupt changes between neighboring areas, the lower the log density.
-ICAR has a few other special features: it is contrained so all the $\phi$'s add up to zero. This also implies the mean of the $\phi$'s is zero. It can be helpful to think of ICAR values as similar to z-scores. They represent relative deviations centered around 0. ICAR is also typically only used as a sub-component of a larger model. Other parts of the model typically adjust the scale (with a variance parameter) or the location (with an intercept parameter). An accessible discussion of the math behind ICAR and its relationship to CAR can be found [here](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) or in the academic paper version {cite:p}`morris2021bym`.
+ICAR has a few other special features: it is constrained so all the $\phi$'s add up to zero. This also implies the mean of the $\phi$'s is zero. It can be helpful to think of ICAR values as similar to z-scores. They represent relative deviations centered around 0. ICAR is also typically only used as a sub-component of a larger model. Other parts of the model typically adjust the scale (with a variance parameter) or the location (with an intercept parameter). An accessible discussion of the math behind ICAR and its relationship to CAR can be found [here](https://mc-stan.org/users/documentation/case-studies/icar_stan.html) or in the academic paper version {cite:p}`morris2021bym`.
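For reference, the intuition above corresponds to the (improper) ICAR log density, written here with $i \sim j$ ranging over pairs of neighbouring areas:

$$
\log p(\phi) \propto -\frac{1}{2}\sum_{i \sim j}(\phi_i - \phi_j)^2, \qquad \text{subject to } \sum_i \phi_i = 0.
$$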
+++
@@ -228,7 +228,7 @@ scaling_factor = scaling_factor_sp(W_nyc)
scaling_factor
```
-The first `.csv` file just has the spatial structure bits. The rest of the data comes seperately - here we'll pull in the number of accidents `y` and the population size of the census track, `E`. We'll use the population size as an offset - we should expect that more populated areas will have more accidents for trivial reasons. What is more interesting is something like the excess risk associated with an area.
+The first `.csv` file just has the spatial structure bits. The rest of the data comes separately - here we'll pull in the number of accidents `y` and the population size of the census tract, `E`. We'll use the population size as an offset - we should expect that more populated areas will have more accidents for trivial reasons. What is more interesting is something like the excess risk associated with an area.
Finally, we'll also explore one predictor variable, the social fragmentation index. The index is built out of measures of the number of vacant housing units, people living alone, renters and people who have moved within the previous year. These communities tend to be less integrated and have weaker social support systems. The social epidemiology community is interested in how ecological variables can trickle down into various facets of public health. So we'll see if social fragmentation can explain the pattern of traffic accidents. The measure is standardized to have a mean of zero and standard deviation of 1.
@@ -315,7 +315,7 @@ nx.draw_networkx(
All the parameters of the BYM were already introduced in {ref}`section 1 `. Now it's just a matter of assigning some priors. The priors on $\theta$ are picky - we need to assign a mean of 0 and a standard deviation 1 so that we can interpret it as comparable with $\phi$. Otherwise, the priors distributions afford the opportunity to incorporate domain expertise. In this problem, I'll pick some weakly informative priors.
-Lastly, we'll use a Poisson outcome distribution. The number of traffic accidents is a count outcome and the maximium possible value is very large. To ensure our predictions remain positive, we'll exponentiate the linear model before passing it to the Poisson distribution.
+Lastly, we'll use a Poisson outcome distribution. The number of traffic accidents is a count outcome and the maximum possible value is very large. To ensure our predictions remain positive, we'll exponentiate the linear model before passing it to the Poisson distribution.
```{code-cell} ipython3
with pm.Model(coords=coords) as BYM_model:
@@ -528,7 +528,7 @@ The goal of the BYM model is that we mix together two different types of random
The scaling factor is the trick that ensures the variance of $\phi$ roughly equals one. When the variance implied by the spatial structure is quite small, say, less than one, dividing $\rho$ by the scaling factor will give some number greater than one. In other words, we expand the variance of $\phi$ until it equals one. Now all the other parameters will behave properly. $\rho$ represents a mixture between two similar things and $\sigma$ represents the joint variance from random effects.
-A final way to understand the purpose of the scaling factor is to imagine what would happen if we didn't include it. Suppose the graph implied very large variance, like the first preferential attachment graph above. In this case, the mixture parameter, $\rho$, might pull in more of $\phi$ because the data has a lot of variance and the model is searching for variance wherever it can find to explain it. But that makes the intepretation of the results challenging. Did $\rho$ gravitate towards $\phi$ because there is actually a strong spatial structure? Or because it had higher variance than $\theta$? We cannot tell unless we rescale the $\phi$.
+A final way to understand the purpose of the scaling factor is to imagine what would happen if we didn't include it. Suppose the graph implied very large variance, like the first preferential attachment graph above. In this case, the mixture parameter, $\rho$, might pull in more of $\phi$ because the data has a lot of variance and the model is searching for variance wherever it can find to explain it. But that makes the interpretation of the results challenging. Did $\rho$ gravitate towards $\phi$ because there is actually a strong spatial structure? Or because it had higher variance than $\theta$? We cannot tell unless we rescale the $\phi$.
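For reference, one common way to write the convolution being described (the BYM2 parameterization; the notebook's exact form may differ) is, with $s$ the scaling factor,

$$
b_i = \sigma\left(\sqrt{\tfrac{\rho}{s}}\,\phi_i + \sqrt{1-\rho}\,\theta_i\right),
$$

so that dividing by $s$ restores $\phi$ to roughly unit variance before it is mixed with $\theta$.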
+++
diff --git a/examples/survival_analysis/bayes_param_survival_pymc3.myst.md b/examples/survival_analysis/bayes_param_survival_pymc3.myst.md
index e6915bda8..e7f425501 100644
--- a/examples/survival_analysis/bayes_param_survival_pymc3.myst.md
+++ b/examples/survival_analysis/bayes_param_survival_pymc3.myst.md
@@ -268,7 +268,7 @@ Below we plot posterior distributions of the parameters.
az.plot_posterior(weibull_trace, lw=0, alpha=0.5);
```
-These are somewhat interesting (espescially the fact that the posterior of $\beta_1$ is fairly well-separated from zero), but the posterior predictive survival curves will be much more interpretable.
+These are somewhat interesting (especially the fact that the posterior of $\beta_1$ is fairly well-separated from zero), but the posterior predictive survival curves will be much more interpretable.
The advantage of using [`theano.shared`](http://deeplearning.net/software/theano_versions/dev/library/compile/shared.html) variables is that we can now change their values to perform posterior predictive sampling. For posterior prediction, we set $X$ to have two rows, one for a subject whose cancer had not metastized and one for a subject whose cancer had metastized. Since we want to predict actual survival times, none of the posterior predictive rows are censored.
diff --git a/examples/survival_analysis/frailty_models.ipynb b/examples/survival_analysis/frailty_models.ipynb
index 1ce983691..c7d9117cb 100644
--- a/examples/survival_analysis/frailty_models.ipynb
+++ b/examples/survival_analysis/frailty_models.ipynb
@@ -8,7 +8,7 @@
"# Frailty and Survival Regression Models\n",
"\n",
":::{post} November, 2023\n",
- ":tags: frailty models, survival analysis, competing risks, model comparison\n",
+ ":tags: frailty model, survival analysis, competing risks, model comparison\n",
":category: intermediate, reference\n",
":author: Nathaniel Forde\n",
":::"
@@ -66,7 +66,7 @@
"\n",
"### Survival Regression Models\n",
"\n",
- "The emphasis here is on the generality of the framework. We are describing the trajectory of state-transitions within time. Anywhere speed or efficiency matters, it is important to understand the inputs to time-to-event trajectories. This is the benefit of survival analysis - clearly articulated models which quantify the impact of demographic characteristics and treatment effects (in terms of speed) on the probability of state-transition. Movement between life and death, hired and fired, ill and cured, subscribed to churned. These state transitions are all tranparently and compellingly modelled using survival regression models. \n",
+ "The emphasis here is on the generality of the framework. We are describing the trajectory of state-transitions within time. Anywhere speed or efficiency matters, it is important to understand the inputs to time-to-event trajectories. This is the benefit of survival analysis - clearly articulated models which quantify the impact of demographic characteristics and treatment effects (in terms of speed) on the probability of state-transition. Movement between life and death, hired and fired, ill and cured, subscribed to churned. These state transitions are all transparently and compellingly modelled using survival regression models. \n",
"\n",
"We will see two varieties of regression modelling with respect to time-to-event data: (1) Cox's Proportional Hazard approach and (2) the Accelerated Failure time models. Both models enable the analyst to combine and assess the impacts of different covariates on the survival time outcomes, but each does so in a slightly different manner. \n",
"\n",
@@ -370,7 +370,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Data Preperation for Survival Regression\n",
+ "## Data Preparation for Survival Regression\n",
"\n",
"The idea behind Cox Proportional Hazard regression models is, put crudely, to treat the temporal component of risk seriously. We imagine a latent baseline hazard of occurrence over the time-interval. Michael Betancourt [asks](https://betanalpha.github.io/assets/case_studies/survival_modeling.html) that we think of the hazard as \"the accumulation of some stimulating resource\" that precedes the occurrence of an event. In failure modelling it can be imagined as sporadic increasing wear and tear. In the context of HR dyanamics it could be imagined as increasing frustration is the work-environment. In philosophy it could viewed as an articulation of the sorites paradox; how do chances change over time, as sand is piled higher, for us to identify a collection of individual grains as a heap?. This term is often denoted:\n",
"\n",
@@ -1545,7 +1545,7 @@
"- If $exp(\\beta)$ < 1: An increase in X is associated with a decreased hazard (lower risk) of the event occurring.\n",
"- If $exp(\\beta)$ = 1: X has no effect on the hazard rate.\n",
"\n",
- "So our case we can see that having an occupation in the fields of Finance or Health would seem to induce a roughly 25% increase in the hazard risk of the event occuring over the baseline hazard. Interestingly we can see that the inclusion of the `intention` predictor seems to be important as a unit increase of the `intention` metric moves the dial similarly - and intention is a 0-10 scale. \n",
+ "So our case we can see that having an occupation in the fields of Finance or Health would seem to induce a roughly 25% increase in the hazard risk of the event occurring over the baseline hazard. Interestingly we can see that the inclusion of the `intention` predictor seems to be important as a unit increase of the `intention` metric moves the dial similarly - and intention is a 0-10 scale. \n",
"\n",
"These are not time-varying - they enter __once__ into the weighted sum that modifies the baseline hazard. This is the proportional hazard assumption - that while the baseline hazard can change over time the difference in hazard induced by different levels in the covariates remains constant over time. The Cox model is popular because it allows us to estimate a changing hazard at each time-point and incorporates the impact of the demographic predictors multiplicatively across the period. The proportional hazards assumption does not always hold, and we'll see some adjustments below that can help deal with violations of the proportional hazards assumption. "
]
@@ -1899,7 +1899,7 @@
"source": [
"### The Sentiment Model\n",
"\n",
- "If we submit the same test to a model unable to account for intention most of the weight falls on the differences specified between the sentiment recorded by the survey participant. Here we also see a seperation in the survival curves, but the effect is much less pronounced. "
+ "If we submit the same test to a model unable to account for intention most of the weight falls on the differences specified between the sentiment recorded by the survey participant. Here we also see a separation in the survival curves, but the effect is much less pronounced. "
]
},
{
@@ -2504,11 +2504,11 @@
"\n",
"where we have the baseline survival function $S_{0} = P(exp(\\mu + \\sigma\\epsilon_{i}) \\geq t)$ modified by additional covariates. The details are largely important for the estimation strategies, but they show how the impact of risk can be decomposed here just as in the CoxPH model. The effects of the covariates are additive on the log-scale towards the acceleration factor induced by the individual's risk profile.\n",
"\n",
- "Below we'll estimate two AFT models: the weibull model and the Log-Logistic model. Ultimately we're just fitting a censored parametric distribution but we've allowed that that one of the parameters of each distribution is specified as a linear function of the explainatory variables. So the log likelihood term is just: \n",
+ "Below we'll estimate two AFT models: the weibull model and the Log-Logistic model. Ultimately we're just fitting a censored parametric distribution but we've allowed that that one of the parameters of each distribution is specified as a linear function of the explanatory variables. So the log likelihood term is just: \n",
"\n",
"$$ log(L) = \\sum_{i}^{n} \\Big[ c_{i}log(f(t)) + (1-c_{i})log(S(t))) \\Big] $$ \n",
"\n",
- "where $f$ is the distribution pdf function , $S$ is the survival fucntion and $c$ is an indicator function for whether the observation is censored - meaning it takes a value in $\\{0, 1\\}$ depending on whether the individual is censored. Both $f$, $S$ are parameterised by some vector of parameters $\\mathbf{\\theta}$. In the case of the Log-Logistic model we estimate it by transforming our time variable to a log-scale and fitting a logistic likelihood with parameters $\\mu, s$. The resulting parameter fits can be adapted to recover the log-logistic survival function as we'll show below. In the case of the Weibull model the parameters are denote $\\alpha, \\beta$ respectively."
+ "where $f$ is the distribution pdf function , $S$ is the survival function and $c$ is an indicator function for whether the observation is censored - meaning it takes a value in $\\{0, 1\\}$ depending on whether the individual is censored. Both $f$, $S$ are parameterised by some vector of parameters $\\mathbf{\\theta}$. In the case of the Log-Logistic model we estimate it by transforming our time variable to a log-scale and fitting a logistic likelihood with parameters $\\mu, s$. The resulting parameter fits can be adapted to recover the log-logistic survival function as we'll show below. In the case of the Weibull model the parameters are denote $\\alpha, \\beta$ respectively."
]
},
{
@@ -3901,7 +3901,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Both models fit comparable estimates for these two individuals. We'll see now how the marginal survival function compares across our entire sample of indivduals. "
+ "Both models fit comparable estimates for these two individuals. We'll see now how the marginal survival function compares across our entire sample of individuals. "
]
},
{
@@ -6562,7 +6562,7 @@
"\n",
"There are roughly two perspectives to be balanced: (i) the \"actuarial\" need to understand expected losses over the lifecycle, and (ii) the \"diagnostic\" needs to understand the causative factors that extend or reduce the lifecycle. Both are ultimately complementary as we need to \"price in\" differential flavours of risk that impact the expected bottom line whenever we plan for the future. Survival regression analysis neatly combines both these perspectives enabling the analyst to understand and take preventative action to offset periods of increased risk.\n",
"\n",
- "We've seen above a number of distinct regression modelling strategies for time-to-event data, but there are more flavours to explore: joint longitidunal models with a survival component, survival models with time-varying covariates, cure-rate models. The Bayesian perspective on these survival models is useful because we often have detailed results from prior years or experiments where our priors add useful perspective on the problem - allowing us to numerically encode that information to help regularise model fits for complex survival modelling. In the case of frailty models like the ones above - we've seen how priors can be added to the frailty terms to describe the influence of unobserved covariates which influence individual trajectories. Similarly the stratified approach to modelling baseline hazards allows us to carefully express trajectories of individual risk. This can be especially important in the human centric disciplines where we seek to understand repeat measurments of the same individual time and again - accounting for the degree to which we can explain individual effects. Which is to say that while the framework of survival analysis suits a wide range of domains and problems, it nevertheless allows us to model, predict and infer aspects of specific and individual risk. "
+ "We've seen above a number of distinct regression modelling strategies for time-to-event data, but there are more flavours to explore: joint longitidunal models with a survival component, survival models with time-varying covariates, cure-rate models. The Bayesian perspective on these survival models is useful because we often have detailed results from prior years or experiments where our priors add useful perspective on the problem - allowing us to numerically encode that information to help regularise model fits for complex survival modelling. In the case of frailty models like the ones above - we've seen how priors can be added to the frailty terms to describe the influence of unobserved covariates which influence individual trajectories. Similarly the stratified approach to modelling baseline hazards allows us to carefully express trajectories of individual risk. This can be especially important in the human centric disciplines where we seek to understand repeat measurements of the same individual time and again - accounting for the degree to which we can explain individual effects. Which is to say that while the framework of survival analysis suits a wide range of domains and problems, it nevertheless allows us to model, predict and infer aspects of specific and individual risk. "
]
},
{
diff --git a/examples/survival_analysis/frailty_models.myst.md b/examples/survival_analysis/frailty_models.myst.md
index bc565d47b..7d0bbf011 100644
--- a/examples/survival_analysis/frailty_models.myst.md
+++ b/examples/survival_analysis/frailty_models.myst.md
@@ -17,7 +17,7 @@ myst:
# Frailty and Survival Regression Models
:::{post} November, 2023
-:tags: frailty models, survival analysis, competing risks, model comparison
+:tags: frailty model, survival analysis, competing risks, model comparison
:category: intermediate, reference
:author: Nathaniel Forde
:::
@@ -57,7 +57,7 @@ We will demonstrate how the concepts of survival based regression analysis, trad
### Survival Regression Models
-The emphasis here is on the generality of the framework. We are describing the trajectory of state-transitions within time. Anywhere speed or efficiency matters, it is important to understand the inputs to time-to-event trajectories. This is the benefit of survival analysis - clearly articulated models which quantify the impact of demographic characteristics and treatment effects (in terms of speed) on the probability of state-transition. Movement between life and death, hired and fired, ill and cured, subscribed to churned. These state transitions are all tranparently and compellingly modelled using survival regression models.
+The emphasis here is on the generality of the framework. We are describing the trajectory of state-transitions within time. Anywhere speed or efficiency matters, it is important to understand the inputs to time-to-event trajectories. This is the benefit of survival analysis - clearly articulated models which quantify the impact of demographic characteristics and treatment effects (in terms of speed) on the probability of state-transition. Movement between life and death, hired and fired, ill and cured, subscribed to churned. These state transitions are all transparently and compellingly modelled using survival regression models.
We will see two varieties of regression modelling with respect to time-to-event data: (1) Cox's Proportional Hazard approach and (2) the Accelerated Failure time models. Both models enable the analyst to combine and assess the impacts of different covariates on the survival time outcomes, but each does so in a slightly different manner.
@@ -157,7 +157,7 @@ Here we've used the Kaplan Meier non-parametric estimate of the survival curve w
+++
-## Data Preperation for Survival Regression
+## Data Preparation for Survival Regression
The idea behind Cox Proportional Hazard regression models is, put crudely, to treat the temporal component of risk seriously. We imagine a latent baseline hazard of occurrence over the time-interval. Michael Betancourt [asks](https://betanalpha.github.io/assets/case_studies/survival_modeling.html) that we think of the hazard as "the accumulation of some stimulating resource" that precedes the occurrence of an event. In failure modelling it can be imagined as sporadic increasing wear and tear. In the context of HR dyanamics it could be imagined as increasing frustration is the work-environment. In philosophy it could viewed as an articulation of the sorites paradox; how do chances change over time, as sand is piled higher, for us to identify a collection of individual grains as a heap?. This term is often denoted:
@@ -318,7 +318,7 @@ Each individual model coefficient records an estimate of the impact on the log h
- If $exp(\beta)$ < 1: An increase in X is associated with a decreased hazard (lower risk) of the event occurring.
- If $exp(\beta)$ = 1: X has no effect on the hazard rate.
-So our case we can see that having an occupation in the fields of Finance or Health would seem to induce a roughly 25% increase in the hazard risk of the event occuring over the baseline hazard. Interestingly we can see that the inclusion of the `intention` predictor seems to be important as a unit increase of the `intention` metric moves the dial similarly - and intention is a 0-10 scale.
+So in our case we can see that having an occupation in the fields of Finance or Health would seem to induce a roughly 25% increase in the hazard risk of the event occurring over the baseline hazard. Interestingly, we can see that the inclusion of the `intention` predictor seems to be important as a unit increase of the `intention` metric moves the dial similarly - and intention is a 0-10 scale.
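A quick numeric sanity check of the "roughly 25%" reading; the coefficient value here is illustrative, not taken from the fitted model:

```python
import numpy as np

beta = 0.22  # hypothetical log-hazard coefficient, e.g. for the Finance occupation
print(np.exp(beta))  # ~1.25, i.e. about a 25% increase over the baseline hazard
```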
These are not time-varying - they enter __once__ into the weighted sum that modifies the baseline hazard. This is the proportional hazard assumption - that while the baseline hazard can change over time the difference in hazard induced by different levels in the covariates remains constant over time. The Cox model is popular because it allows us to estimate a changing hazard at each time-point and incorporates the impact of the demographic predictors multiplicatively across the period. The proportional hazards assumption does not always hold, and we'll see some adjustments below that can help deal with violations of the proportional hazards assumption.
@@ -476,7 +476,7 @@ Focus here on the plot on the right. The baseline cumulative hazard is represent
### The Sentiment Model
-If we submit the same test to a model unable to account for intention most of the weight falls on the differences specified between the sentiment recorded by the survey participant. Here we also see a seperation in the survival curves, but the effect is much less pronounced.
+If we submit the same test to a model unable to account for intention, most of the weight falls on the differences specified between the sentiment recorded by the survey participant. Here we also see a separation in the survival curves, but the effect is much less pronounced.
```{code-cell} ipython3
plot_individuals(test_df, base_idata, [0, 1, 2], intention=False)
@@ -598,11 +598,11 @@ $$ log (T_{i}) = \mu + \alpha_{i}x_{i} + \alpha_{2}x_{2} ... \alpha_{p}x_{p} + \
where we have the baseline survival function $S_{0} = P(exp(\mu + \sigma\epsilon_{i}) \geq t)$ modified by additional covariates. The details are largely important for the estimation strategies, but they show how the impact of risk can be decomposed here just as in the CoxPH model. The effects of the covariates are additive on the log-scale towards the acceleration factor induced by the individual's risk profile.
-Below we'll estimate two AFT models: the weibull model and the Log-Logistic model. Ultimately we're just fitting a censored parametric distribution but we've allowed that that one of the parameters of each distribution is specified as a linear function of the explainatory variables. So the log likelihood term is just:
+Below we'll estimate two AFT models: the Weibull model and the Log-Logistic model. Ultimately we're just fitting a censored parametric distribution but we've allowed that one of the parameters of each distribution is specified as a linear function of the explanatory variables. So the log likelihood term is just:
$$ log(L) = \sum_{i}^{n} \Big[ c_{i}log(f(t)) + (1-c_{i})log(S(t))) \Big] $$
-where $f$ is the distribution pdf function , $S$ is the survival fucntion and $c$ is an indicator function for whether the observation is censored - meaning it takes a value in $\{0, 1\}$ depending on whether the individual is censored. Both $f$, $S$ are parameterised by some vector of parameters $\mathbf{\theta}$. In the case of the Log-Logistic model we estimate it by transforming our time variable to a log-scale and fitting a logistic likelihood with parameters $\mu, s$. The resulting parameter fits can be adapted to recover the log-logistic survival function as we'll show below. In the case of the Weibull model the parameters are denote $\alpha, \beta$ respectively.
+where $f$ is the distribution's pdf, $S$ is the survival function, and $c$ is an indicator function for whether the observation is censored - meaning it takes a value in $\{0, 1\}$ depending on whether the individual is censored. Both $f$, $S$ are parameterised by some vector of parameters $\mathbf{\theta}$. In the case of the Log-Logistic model we estimate it by transforming our time variable to a log-scale and fitting a logistic likelihood with parameters $\mu, s$. The resulting parameter fits can be adapted to recover the log-logistic survival function as we'll show below. In the case of the Weibull model the parameters are denoted $\alpha, \beta$ respectively.
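A minimal sketch of the censored-likelihood idea for the Weibull case, using made-up arrays and a simple AFT-style link from covariates to the scale parameter; the notebook's actual models will differ:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                       # hypothetical covariate matrix
upper = np.full(100, 8.0)                           # right-censoring bound (np.inf if fully observed)
y = np.minimum(rng.weibull(1.5, size=100) * 5, upper)  # times, recorded at the bound when censored

with pm.Model() as weibull_aft_sketch:
    beta = pm.Normal("beta", 0, 1, shape=X.shape[1])
    mu = pm.Normal("mu", 0, 1)
    s = pm.HalfNormal("s", 2.0)
    eta = mu + pm.math.dot(X, beta)                   # linear function of the covariates
    alpha = pm.Deterministic("alpha", 1 / s)          # Weibull shape from the AFT scale
    lam = pm.Deterministic("lam", pm.math.exp(eta))   # Weibull scale per individual
    # Censored rows contribute log S(t); uncensored rows contribute log f(t).
    pm.Censored("obs", pm.Weibull.dist(alpha=alpha, beta=lam), lower=None, upper=upper, observed=y)
```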
```{code-cell} ipython3
coords = {
@@ -792,7 +792,7 @@ loglogistic_predicted_surv = pd.DataFrame(
loglogistic_predicted_surv
```
-Both models fit comparable estimates for these two individuals. We'll see now how the marginal survival function compares across our entire sample of indivduals.
+Both models fit comparable estimates for these two individuals. We'll see now how the marginal survival function compares across our entire sample of individuals.
```{code-cell} ipython3
fig, ax = plt.subplots(figsize=(20, 7))
@@ -1265,7 +1265,7 @@ In this example we've seen how to model time-to-attrition in a employee lifecycl
There are roughly two perspectives to be balanced: (i) the "actuarial" need to understand expected losses over the lifecycle, and (ii) the "diagnostic" needs to understand the causative factors that extend or reduce the lifecycle. Both are ultimately complementary as we need to "price in" differential flavours of risk that impact the expected bottom line whenever we plan for the future. Survival regression analysis neatly combines both these perspectives enabling the analyst to understand and take preventative action to offset periods of increased risk.
-We've seen above a number of distinct regression modelling strategies for time-to-event data, but there are more flavours to explore: joint longitidunal models with a survival component, survival models with time-varying covariates, cure-rate models. The Bayesian perspective on these survival models is useful because we often have detailed results from prior years or experiments where our priors add useful perspective on the problem - allowing us to numerically encode that information to help regularise model fits for complex survival modelling. In the case of frailty models like the ones above - we've seen how priors can be added to the frailty terms to describe the influence of unobserved covariates which influence individual trajectories. Similarly the stratified approach to modelling baseline hazards allows us to carefully express trajectories of individual risk. This can be especially important in the human centric disciplines where we seek to understand repeat measurments of the same individual time and again - accounting for the degree to which we can explain individual effects. Which is to say that while the framework of survival analysis suits a wide range of domains and problems, it nevertheless allows us to model, predict and infer aspects of specific and individual risk.
+We've seen above a number of distinct regression modelling strategies for time-to-event data, but there are more flavours to explore: joint longitudinal models with a survival component, survival models with time-varying covariates, cure-rate models. The Bayesian perspective on these survival models is useful because we often have detailed results from prior years or experiments where our priors add useful perspective on the problem - allowing us to numerically encode that information to help regularise model fits for complex survival modelling. In the case of frailty models like the ones above - we've seen how priors can be added to the frailty terms to describe the influence of unobserved covariates which influence individual trajectories. Similarly the stratified approach to modelling baseline hazards allows us to carefully express trajectories of individual risk. This can be especially important in the human-centric disciplines where we seek to understand repeat measurements of the same individual time and again - accounting for the degree to which we can explain individual effects. Which is to say that while the framework of survival analysis suits a wide range of domains and problems, it nevertheless allows us to model, predict and infer aspects of specific and individual risk.
+++
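To make the frailty idea in the paragraph above a little more concrete, here is a minimal sketch. It is not code from the notebook: the data, variable names and priors are hypothetical stand-ins and censoring is ignored for brevity; the only point is to show a prior on individual frailty terms scaling a Weibull likelihood.

```python
# Hypothetical sketch of a frailty term with a prior shrinking towards 1;
# not the notebook's model, and censoring is omitted to keep it short.
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
n = 50
X = rng.normal(size=(n, 2))            # two stand-in covariates
t_obs = rng.weibull(1.5, size=n) * 10  # stand-in event times

with pm.Model() as frailty_sketch:
    b = pm.Normal("b", 0, 1, shape=2)             # covariate effects
    alpha = pm.Gamma("alpha", 2, 1)               # Weibull shape
    frailty = pm.Gamma("frailty", 5, 5, shape=n)  # individual frailty, prior mean 1
    scale = pm.math.exp(pm.math.dot(X, b)) * frailty  # frailty scales individual risk
    pm.Weibull("t", alpha=alpha, beta=scale, observed=t_obs)
    idata = pm.sample(1000, tune=1000)
```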
diff --git a/examples/time_series/AR.ipynb b/examples/time_series/AR.ipynb
index d0456c1bd..24932a19e 100644
--- a/examples/time_series/AR.ipynb
+++ b/examples/time_series/AR.ipynb
@@ -282,7 +282,7 @@
" y_t = \\rho_0 + \\rho_1 y_{t-1} + \\rho_2 y_{t-2} + \\epsilon_t.\n",
"$$\n",
"\n",
- "The `AR` distribution infers the order of the process thanks to the size the of `rho` argmument passed to `AR` (including the mean). \n",
+ "The `AR` distribution infers the order of the process thanks to the size of the `rho` argument passed to `AR` (including the mean). \n",
"\n",
"We will also use the standard deviation of the innovations (rather than the precision) to parameterize the distribution."
]
diff --git a/examples/time_series/AR.myst.md b/examples/time_series/AR.myst.md
index 17d8dd6b2..f40f8c474 100644
--- a/examples/time_series/AR.myst.md
+++ b/examples/time_series/AR.myst.md
@@ -120,7 +120,7 @@ $$
y_t = \rho_0 + \rho_1 y_{t-1} + \rho_2 y_{t-2} + \epsilon_t.
$$
-The `AR` distribution infers the order of the process thanks to the size the of `rho` argmument passed to `AR` (including the mean).
+The `AR` distribution infers the order of the process thanks to the size of the `rho` argument passed to `AR` (including the mean).
We will also use the standard deviation of the innovations (rather than the precision) to parameterize the distribution.
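As a quick illustration of the sentence being corrected above, the following sketch (stand-in data; assumes the PyMC v5 `pm.AR` signature) shows how a `rho` vector of length 3 with `constant=True` defines an AR(2) process with an intercept, parameterized by the innovation standard deviation.

```python
# Sketch only: stand-in data, assuming the PyMC v5 pm.AR signature.
import numpy as np
import pymc as pm

y = np.random.default_rng(0).normal(size=100)  # stand-in time series

with pm.Model() as ar2_sketch:
    # rho has length 3: [constant, rho_1, rho_2] -> AR(2) with an intercept
    rho = pm.Normal("rho", 0.0, 1.0, shape=3)
    # standard deviation of the innovations, not the precision
    sigma = pm.HalfNormal("sigma", 3)
    pm.AR(
        "y",
        rho=rho,
        sigma=sigma,
        constant=True,
        init_dist=pm.Normal.dist(0, 10, size=2),  # initial values for the 2 lags
        observed=y,
    )
```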
diff --git a/examples/time_series/Time_Series_Generative_Graph.ipynb b/examples/time_series/Time_Series_Generative_Graph.ipynb
index 322704d6c..1ff48ea12 100644
--- a/examples/time_series/Time_Series_Generative_Graph.ipynb
+++ b/examples/time_series/Time_Series_Generative_Graph.ipynb
@@ -10,7 +10,7 @@
"# Time Series Models Derived From a Generative Graph\n",
"\n",
":::{post} January, 2025\n",
- ":tags: time-series, \n",
+ ":tags: time series\n",
":category: intermediate, reference\n",
":author: Jesse Grabowski, Juan Orduz and Ricardo Vieira\n",
":::"
@@ -1059,7 +1059,7 @@
"id": "ff135390",
"metadata": {},
"source": [
- "We can visualize the out-of-sample predictions and compare thee results wth the one from `statsmodels`."
+ "We can visualize the out-of-sample predictions and compare these results with those from `statsmodels`."
]
},
{
diff --git a/examples/time_series/Time_Series_Generative_Graph.myst.md b/examples/time_series/Time_Series_Generative_Graph.myst.md
index bb1ca4982..73233e9af 100644
--- a/examples/time_series/Time_Series_Generative_Graph.myst.md
+++ b/examples/time_series/Time_Series_Generative_Graph.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Time Series Models Derived From a Generative Graph
:::{post} January, 2025
-:tags: time-series,
+:tags: time series
:category: intermediate, reference
:author: Jesse Grabowski, Juan Orduz and Ricardo Vieira
:::
@@ -470,7 +470,7 @@ with pm.Model(coords=coords, check_bounds=False) as forecasting_model:
)
```
-We can visualize the out-of-sample predictions and compare thee results wth the one from `statsmodels`.
+We can visualize the out-of-sample predictions and compare these results with those from `statsmodels`.
```{code-cell} ipython3
forecast_post_pred_ar = post_pred_forecast.posterior_predictive["ar_steps"]
diff --git a/examples/time_series/longitudinal_models.ipynb b/examples/time_series/longitudinal_models.ipynb
index 503080e6f..789607b45 100644
--- a/examples/time_series/longitudinal_models.ipynb
+++ b/examples/time_series/longitudinal_models.ipynb
@@ -8,7 +8,7 @@
"# Longitudinal Models of Change\n",
"\n",
":::{post} April, 2023\n",
- ":tags: hierarchical, longitudinal, time series\n",
+ ":tags: hierarchical model, longitudinal data, time series\n",
":category: advanced, reference\n",
":author: Nathaniel Forde\n",
":::"
@@ -311,7 +311,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "First we'll examine the consumption patterns of a subset of the chidren to see how their reported usage exhibits a range of different trajectories. All the trajectories can be plausibly modelled as a linear phenomena. "
+ "First we'll examine the consumption patterns of a subset of the children to see how their reported usage exhibits a range of different trajectories. All the trajectories can be plausibly modelled as linear phenomena. "
]
},
{
@@ -934,7 +934,7 @@
"source": [
"## Unconditional Growth Model \n",
"\n",
- "Next we will more explictly model the individual contribution to the slope of a regression model where time is the key predictor. The structure of this model is worth pausing to consider. There are various instantiations of this kind of hierarchical model across different domains and disciplines. Economics, political science, psychometrics and ecology all have their own slightly varied vocabulary for naming the parts of the model: fixed effects, random effects, within-estimators, between estimators...etc, the list goes and the discourse is cursed. The terms are ambiguous and used divergingly. Wilett and Singer refer to the Level 1 and Level 2 sub-models, but the precise terminology is not important. \n",
+ "Next we will more explicitly model the individual contribution to the slope of a regression model where time is the key predictor. The structure of this model is worth pausing to consider. There are various instantiations of this kind of hierarchical model across different domains and disciplines. Economics, political science, psychometrics and ecology all have their own slightly varied vocabulary for naming the parts of the model: fixed effects, random effects, within-estimators, between-estimators, etc.; the list goes on and the discourse is cursed. The terms are ambiguous and used divergently. Willett and Singer refer to the Level 1 and Level 2 sub-models, but the precise terminology is not important. \n",
"\n",
"The important thing about these models is the *hierarchy*. There is a global phenomena and a subject specific instantiation of the phenomena. The model allows us to compose the global model with the individual contributions from each subject. This helps the model account for unobserved heterogeneity at the subject level.Resulting in varying slopes and intercepts for each subject where allowed by the model specification. It can't solve all forms of bias but it does help account for this source of skew in the model predictions.\n",
"\n",
@@ -5420,7 +5420,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The model is nicely specified and details the structure of hierarchical and subject level parameters. By default the Bambi model assigns priors and uses a non-centred parameterisation. The Bambi model definition uses the language of common and group level effects as opposed to the global and subject distinction we have beeen using in this example so far. Again, the important point to stress is just the hierarchy of levels, not the names."
+ "The model is nicely specified and details the structure of hierarchical and subject level parameters. By default the Bambi model assigns priors and uses a non-centred parameterisation. The Bambi model definition uses the language of common and group level effects as opposed to the global and subject distinction we have been using in this example so far. Again, the important point to stress is just the hierarchy of levels, not the names."
]
},
{
@@ -5966,7 +5966,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We can see here how the bambi model specification recovers the same parameterisation we derived with PyMC. In practice and in production you should use bambi when you can if you're using a Bayesian hierarchical model. It is flexible for many use-cases and you should likely only need PyMC for highly customised models, where the flexibility of the model specification cannot be accomodated with the constraints of the formula syntax. "
+ "We can see here how the Bambi model specification recovers the same parameterisation we derived with PyMC. In practice and in production, you should use Bambi for Bayesian hierarchical models when you can. It is flexible enough for many use-cases, and you will likely only need PyMC for highly customised models, where the flexibility of the model specification cannot be accommodated within the constraints of the formula syntax. "
]
},
{
@@ -6617,7 +6617,7 @@
"source": [
"## Behaviour over time\n",
"\n",
- "We now model the evolution of the behaviours over time in a hierarchical fashion. We start with a simple hierarhical linear regression with a focal predictor of grade. "
+ "We now model the evolution of the behaviours over time in a hierarchical fashion. We start with a simple hierarchical linear regression with a focal predictor of grade. "
]
},
{
@@ -7827,7 +7827,7 @@
"source": [
"## Comparing Trajectories across Gender\n",
"\n",
- "We'll now allow the model greater flexibility and pull in the gender of the subject to analyse whether and to what degree the gender of the teenager influences their behaviorial changes. "
+ "We'll now allow the model greater flexibility and pull in the gender of the subject to analyse whether and to what degree the gender of the teenager influences their behavioural changes. "
]
},
{
@@ -8952,7 +8952,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "As perhaps expected our final gender based model is deemed to be best according the WAIC ranking. But somewhat suprisingly the Linear model with fixed trajectories is not far behind. "
+ "As perhaps expected, our final gender-based model is deemed to be best according to the WAIC ranking. But somewhat surprisingly, the Linear model with fixed trajectories is not far behind. "
]
},
{
diff --git a/examples/time_series/longitudinal_models.myst.md b/examples/time_series/longitudinal_models.myst.md
index dafbbfc81..f33acad7a 100644
--- a/examples/time_series/longitudinal_models.myst.md
+++ b/examples/time_series/longitudinal_models.myst.md
@@ -14,7 +14,7 @@ kernelspec:
# Longitudinal Models of Change
:::{post} April, 2023
-:tags: hierarchical, longitudinal, time series
+:tags: hierarchical model, longitudinal data, time series
:category: advanced, reference
:author: Nathaniel Forde
:::
@@ -66,7 +66,7 @@ df["peer_hi_lo"] = np.where(df["peer"] > df["peer"].mean(), 1, 0)
df
```
-First we'll examine the consumption patterns of a subset of the chidren to see how their reported usage exhibits a range of different trajectories. All the trajectories can be plausibly modelled as a linear phenomena.
+First we'll examine the consumption patterns of a subset of the children to see how their reported usage exhibits a range of different trajectories. All the trajectories can be plausibly modelled as linear phenomena.
```{code-cell} ipython3
fig, axs = plt.subplots(2, 4, figsize=(20, 8), sharey=True)
@@ -234,7 +234,7 @@ We see here the variation in the implied modification of the grand mean by each
## Unconditional Growth Model
-Next we will more explictly model the individual contribution to the slope of a regression model where time is the key predictor. The structure of this model is worth pausing to consider. There are various instantiations of this kind of hierarchical model across different domains and disciplines. Economics, political science, psychometrics and ecology all have their own slightly varied vocabulary for naming the parts of the model: fixed effects, random effects, within-estimators, between estimators...etc, the list goes and the discourse is cursed. The terms are ambiguous and used divergingly. Wilett and Singer refer to the Level 1 and Level 2 sub-models, but the precise terminology is not important.
+Next we will more explicitly model the individual contribution to the slope of a regression model where time is the key predictor. The structure of this model is worth pausing to consider. There are various instantiations of this kind of hierarchical model across different domains and disciplines. Economics, political science, psychometrics and ecology all have their own slightly varied vocabulary for naming the parts of the model: fixed effects, random effects, within-estimators, between-estimators, etc.; the list goes on and the discourse is cursed. The terms are ambiguous and used divergently. Willett and Singer refer to the Level 1 and Level 2 sub-models, but the precise terminology is not important.
-The important thing about these models is the *hierarchy*. There is a global phenomena and a subject specific instantiation of the phenomena. The model allows us to compose the global model with the individual contributions from each subject. This helps the model account for unobserved heterogeneity at the subject level.Resulting in varying slopes and intercepts for each subject where allowed by the model specification. It can't solve all forms of bias but it does help account for this source of skew in the model predictions.
+The important thing about these models is the *hierarchy*. There is a global phenomenon and a subject-specific instantiation of the phenomenon. The model allows us to compose the global model with the individual contributions from each subject. This helps the model account for unobserved heterogeneity at the subject level, resulting in varying slopes and intercepts for each subject where allowed by the model specification. It can't solve all forms of bias but it does help account for this source of skew in the model predictions.
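To make the two-level structure described above concrete, here is a minimal sketch with synthetic data and hypothetical names (not the notebook's model): global hyper-priors sit above subject-level intercepts and slopes, so each subject gets its own linear trajectory over time.

```python
# Hypothetical sketch of an unconditional growth model: global (Level 2)
# parameters govern subject-specific (Level 1) intercepts and slopes.
import numpy as np
import pandas as pd
import pymc as pm

rng = np.random.default_rng(1)
df_toy = pd.DataFrame(
    {
        "subject": np.repeat(np.arange(20), 5),
        "time": np.tile(np.arange(5), 20),
    }
)
df_toy["y"] = 1 + 0.5 * df_toy["time"] + rng.normal(0, 0.5, len(df_toy))

subj_idx, subjects = pd.factorize(df_toy["subject"])
coords = {"subject": subjects}

with pm.Model(coords=coords) as growth_sketch:
    # Level 2: global parameters describing the population of subjects
    mu_a = pm.Normal("mu_a", 0, 2)
    mu_b = pm.Normal("mu_b", 0, 1)
    sigma_a = pm.HalfNormal("sigma_a", 1)
    sigma_b = pm.HalfNormal("sigma_b", 1)
    # Level 1: subject-specific intercepts and slopes drawn from the global model
    a = pm.Normal("a", mu_a, sigma_a, dims="subject")
    b = pm.Normal("b", mu_b, sigma_b, dims="subject")
    sigma = pm.HalfNormal("sigma", 1)
    mu = a[subj_idx] + b[subj_idx] * df_toy["time"].values
    pm.Normal("y_obs", mu, sigma, observed=df_toy["y"].values)
```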
@@ -653,7 +653,7 @@ model.predict(idata_bambi, kind="pps")
idata_bambi
```
-The model is nicely specified and details the structure of hierarchical and subject level parameters. By default the Bambi model assigns priors and uses a non-centred parameterisation. The Bambi model definition uses the language of common and group level effects as opposed to the global and subject distinction we have beeen using in this example so far. Again, the important point to stress is just the hierarchy of levels, not the names.
+The model is nicely specified and details the structure of hierarchical and subject level parameters. By default the Bambi model assigns priors and uses a non-centred parameterisation. The Bambi model definition uses the language of common and group level effects as opposed to the global and subject distinction we have been using in this example so far. Again, the important point to stress is just the hierarchy of levels, not the names.
```{code-cell} ipython3
model
@@ -703,7 +703,7 @@ az.plot_forest(
);
```
-We can see here how the bambi model specification recovers the same parameterisation we derived with PyMC. In practice and in production you should use bambi when you can if you're using a Bayesian hierarchical model. It is flexible for many use-cases and you should likely only need PyMC for highly customised models, where the flexibility of the model specification cannot be accomodated with the constraints of the formula syntax.
+We can see here how the Bambi model specification recovers the same parameterisation we derived with PyMC. In practice and in production, you should use Bambi for Bayesian hierarchical models when you can. It is flexible enough for many use-cases, and you will likely only need PyMC for highly customised models, where the flexibility of the model specification cannot be accommodated within the constraints of the formula syntax.
+++
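For readers unfamiliar with the formula syntax mentioned above, a hypothetical Bambi specification of the same varying-intercepts-and-slopes structure might look like the sketch below. The data and column names are stand-ins, not the notebook's dataset: `time` enters as a common (global) effect, while `(time | subject)` requests group-level intercepts and slopes per subject.

```python
# Hypothetical sketch of the Bambi formula syntax; column names are stand-ins.
import numpy as np
import pandas as pd
import bambi as bmb

rng = np.random.default_rng(2)
df_toy = pd.DataFrame(
    {
        "subject": np.repeat(np.arange(20), 5).astype(str),
        "time": np.tile(np.arange(5), 20),
    }
)
df_toy["y"] = 1 + 0.5 * df_toy["time"] + rng.normal(0, 0.5, len(df_toy))

# Common effect of time plus group-level intercepts and slopes per subject;
# Bambi assigns default priors and a non-centred parameterisation.
model = bmb.Model("y ~ time + (time | subject)", df_toy)
idata = model.fit()
```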
@@ -798,7 +798,7 @@ ax.set_title("Distribution of Individual Modifications to the Grand Mean");
## Behaviour over time
-We now model the evolution of the behaviours over time in a hierarchical fashion. We start with a simple hierarhical linear regression with a focal predictor of grade.
+We now model the evolution of the behaviours over time in a hierarchical fashion. We start with a simple hierarchical linear regression with a focal predictor of grade.
```{code-cell} ipython3
id_indx, unique_ids = pd.factorize(df_external["ID"])
@@ -978,7 +978,7 @@ Granting the model more flexibility allows it to ascribe more nuanced growth tra
## Comparing Trajectories across Gender
-We'll now allow the model greater flexibility and pull in the gender of the subject to analyse whether and to what degree the gender of the teenager influences their behaviorial changes.
+We'll now allow the model greater flexibility and pull in the gender of the subject to analyse whether and to what degree the gender of the teenager influences their behavioural changes.
```{code-cell} ipython3
:tags: [hide-output]
@@ -1123,7 +1123,7 @@ compare
az.plot_compare(compare, figsize=(10, 4));
```
-As perhaps expected our final gender based model is deemed to be best according the WAIC ranking. But somewhat suprisingly the Linear model with fixed trajectories is not far behind.
+As perhaps expected, our final gender-based model is deemed to be best according to the WAIC ranking. But somewhat surprisingly, the Linear model with fixed trajectories is not far behind.
+++
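The WAIC ranking referred to above is produced with `az.compare`. The sketch below shows only the call pattern, using ArviZ's bundled example datasets rather than the notebook's fitted models.

```python
# Sketch using ArviZ's built-in example traces, not the notebook's models.
import arviz as az

idatas = {
    "centered": az.load_arviz_data("centered_eight"),
    "non_centered": az.load_arviz_data("non_centered_eight"),
}
compare = az.compare(idatas, ic="waic")  # rank models by WAIC
az.plot_compare(compare, figsize=(10, 4));
```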
diff --git a/examples/variational_inference/pathfinder.ipynb b/examples/variational_inference/pathfinder.ipynb
index 44cb1cb03..b483db6bf 100644
--- a/examples/variational_inference/pathfinder.ipynb
+++ b/examples/variational_inference/pathfinder.ipynb
@@ -10,7 +10,7 @@
"# Pathfinder Variational Inference\n",
"\n",
":::{post} Feb 5, 2023 \n",
- ":tags: variational inference, jax \n",
+ ":tags: variational inference, JAX\n",
":category: advanced, how-to\n",
":author: Thomas Wiecki\n",
":::"
diff --git a/examples/variational_inference/pathfinder.myst.md b/examples/variational_inference/pathfinder.myst.md
index bdad65849..0072e2782 100644
--- a/examples/variational_inference/pathfinder.myst.md
+++ b/examples/variational_inference/pathfinder.myst.md
@@ -15,7 +15,7 @@ kernelspec:
# Pathfinder Variational Inference
:::{post} Feb 5, 2023
-:tags: variational inference, jax
+:tags: variational inference, JAX
:category: advanced, how-to
:author: Thomas Wiecki
:::