From 5f0fa796b1b76e4fe1046966606a29b64bd33ae2 Mon Sep 17 00:00:00 2001 From: Meraldo Antonio Date: Sun, 19 Jan 2025 23:28:06 +0800 Subject: [PATCH 1/2] Finished first two Bayesian notebooks and their artefacts --- ...uction to Bayesian Linear Regression.ipynb | 903 +++++++++++++++ .../2. Gaussian Conjugate Prior.ipynb | 1012 +++++++++++++++++ examples/bayesian/test_data.csv | 6 + examples/bayesian/train_data.csv | 21 + examples/bayesian/utils.py | 76 ++ 5 files changed, 2018 insertions(+) create mode 100644 examples/bayesian/1. Introduction to Bayesian Linear Regression.ipynb create mode 100644 examples/bayesian/2. Gaussian Conjugate Prior.ipynb create mode 100644 examples/bayesian/test_data.csv create mode 100644 examples/bayesian/train_data.csv create mode 100644 examples/bayesian/utils.py diff --git a/examples/bayesian/1. Introduction to Bayesian Linear Regression.ipynb b/examples/bayesian/1. Introduction to Bayesian Linear Regression.ipynb new file mode 100644 index 000000000..5cab8c6aa --- /dev/null +++ b/examples/bayesian/1. Introduction to Bayesian Linear Regression.ipynb @@ -0,0 +1,903 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Objective" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This series of notebooks offers an in-depth exploration of the **Bayesian Linear Regression**. \n", + "It includes a comparison with the frequentist approach to linear regression and an introduction to the most common Bayesian inference techniques.\n", + "\n", + "The notebooks are organized as follows:\n", + "\n", + "1. **Foundations of Linear Regression** \n", + " The first notebook (this one) lays the groundwork for understanding linear regression. It introduces the mathematical framework and provides an overview of frequentist approaches to estimate the model weights. This serves as a stepping stone for transitioning into the Bayesian perspective.\n", + "\n", + "\n", + "2. **Conjugate Prior in Bayesian Inference** \n", + " The second notebook delves into the concept of conjugate priors. By using conjugate distributions, we can derive posterior distributions analytically, simplifying Bayesian inference. This notebook also highlights how prior knowledge can influence the model and improve predictions in the presence of limited data.\n", + "\n", + "\n", + "3. **MCMC and Variational Inference** \n", + " The third and fourth notebooks introduce advanced methods for Bayesian inference, including Markov Chain Monte Carlo (MCMC) and Variational Inference. These techniques enable approximate inference when analytical solutions are infeasible, making it possible to handle complex models and large datasets effectively.\n", + "\n", + "These notebooks are heavily inspired by the theories and notations presented in Christopher M. Bishop's classic textbook *[Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/)*. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Imports and Helper Function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "import statsmodels.api as sm\n", + "from IPython.display import Math, display\n", + "from sklearn.model_selection import train_test_split\n", + "from utils import style_data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HAxBG7mR2Ar1" + }, + "source": [ + "# Theory of Linear Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Definition" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Linear regression is a widely used model due to its simplicity and interpretability. \n", + "\n", + "\n", + "In its simplest form, it predicts a single target $t$ as the deterministic output of the function $y$, which in turn is a linear combination of input variables $\\mathbf{x} = (x_1, \\dots, x_D)^\\top$ and parameters $\\mathbf{w} = (w_0, w_1, \\dots, w_D)^\\top$:\n", + "\n", + "$$\n", + "t = y(\\mathbf{x}, \\mathbf{w}) = w_0 + \\sum_{j=1}^D w_j x_j \\tag{1}\n", + "$$\n", + "\n", + "\n", + "The parameter $w_0$, which is often called \"bias\" or \"intercept\", allows for any fixed offset in the data. It is often convenient to define an additional dummy feature $x_0 = 1$ so that we can simplify the above equation as:\n", + "\n", + "$$\n", + "t = y(\\mathbf{x}, \\mathbf{w}) = \\mathbf{w}^\\top \\mathbf{x} \\tag{2}\n", + "$$\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Likelihood" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Likelihood of a single target data point $t$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Likelihood measures how well a statistical model explains the observed data, given its parameters.** Put another way, it expresses how the observed data could have been generated by the model’s assumed data-generating process.\n", + "\n", + "Let’s revisit our earlier example. Initially, we assumed that the target $t$ is a deterministic output of $y(\\mathbf{x}, \\mathbf{w})$, the model's prediction based on the input features $\\mathbf{x}$ and the regression coefficients $\\mathbf{w}$. \n", + "\n", + "While this works in theory, it doesn’t account for the inherent uncertainty in real-world data.\n", + "To address this, we shall introduce uncertainty by assuming that $t$ is generated as the deterministic output $y(\\mathbf{x}, \\mathbf{w})$ combined with additive Gaussian noise $\\epsilon$:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "t &= y(\\mathbf{x}, \\mathbf{w}) + \\epsilon \\\\\n", + "&= \\mathbf{w}^\\top \\mathbf{x} \\tag{3}\n", + "\\end{aligned}\n", + "$$\n", + "\n", + "where $\\epsilon \\sim \\mathcal{N}(0, \\beta^{-1})$ represents noise with zero mean and variance $\\beta^{-1}$ (i.e. 
in $\\beta$ is the *precision* of the Gaussian).\n", + "\n", + "\n", + "Alternatively, we can express $t$ probabilistically by saying that it follows a Gaussian distribution, centered at $y(\\mathbf{x}, \\mathbf{w})$, with variance $\\beta^{-1}$:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "p(t | \\mathbf{x}, \\mathbf{w}, \\beta) &= \\mathcal{N}(t | y(\\mathbf{x}, \\mathbf{w}), \\beta^{-1}) \\\\\n", + "&= \\mathcal{N}(t | \\mathbf{w}^\\top \\mathbf{x}, \\beta^{-1}) \\tag{4}\n", + "\\end{aligned}\n", + "$$\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Likelihood of a set of data points $\\mathbf{t}$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Now let's consider a data set of inputs $ \\mathbf{X} = \\{ \\mathbf{x}_1, \\ldots, \\mathbf{x}_N \\} $ with corresponding target values $ t_1, \\ldots, t_N $. \n", + "\n", + "First, we group the scalar target variables $\\{ t_n \\}$ into a column vector that we denote by $\\mathbf{t}$. \n", + "\n", + "Afterards, we assume that the data points in $\\mathbf{t}$ are drawn independently from the above single-sample likelihood distribution. With this assumption, we proceed to construct the following expression for the likelihood function of the *entire dataset* $\\mathbf{t}$:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "p(\\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\beta) &= \\prod_{n=1}^N \\mathcal{N}(t_n | \\mathbf{w}^T \\mathbf{x}_n, \\beta^{-1}) \\\\\n", + "&\\propto \\exp \\left( -\\frac{\\beta}{2} \\sum_{n=1}^N (t_n - \\mathbf{w}^T \\mathbf{x}_n)^2 \\right)\n", + "\\\\\n", + "&\\propto \\exp \\left( -\\frac{\\beta}{2} \\|\\mathbf{t} - \\mathbf{X} \\mathbf{w}\\|^2 \\right)\n", + "\\end{aligned} \\tag{5}\n", + "$$\n", + "\n", + "\n", + "Note: since the data matrix $\\mathbf{X} $ always appears in the set of conditioning variables, from this point onwards, we will drop the explicit \"$\\mathbf{X}$\" from expressions such as $ p(\\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\beta) $ to keep the notation uncluttered.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Log-likelihood" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "To better analyze the likelihood function, we often work with its *logarithm*. Taking the logarithm simplifies the product of probabilities into a sum, thus simplifying calculations and avoiding numerical underflow.\n", + "\n", + "The **log-likelihood** of our model is given by:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "\\ln p(\\mathbf{t} | \\mathbf{w}, \\beta) \n", + "&= \\sum_{n=1}^N \\ln \\mathcal{N}(t_n | \\mathbf{w}^\\top \\mathbf{x}_n, \\beta^{-1}) \\\\\n", + "&= \\frac{N}{2} \\ln \\beta - \\frac{N}{2} \\ln(2\\pi) - \\beta E_D(\\mathbf{w}) \n", + "\\end{aligned}\\tag{6}\n", + "$$\n", + "\n", + "The last component of the equation $E_D(\\mathbf{w})$ is an important one - it is the **sum of squares error function** and is defined as:\n", + "\n", + "$$\n", + "E_D(\\mathbf{w}) = \\frac{1}{2} \\sum_{n=1}^N \\big(t_n - \\mathbf{w}^\\top \\mathbf{x}_n\\big)^2. \\tag{7}\n", + "$$\n", + "\n", + "The log-likelihood above measures how well the model parameters $\\mathbf{w}$ explain the observed data. \n", + "\n", + "To optimize a model, we should aim to **maximize the log-likelihood**. From **(6)**, we observe that the only component of the log-likelihood that depends on the parameters $\\mathbf{w}$ is the sum of squared errors $E_D(\\mathbf{w})$. 
Consequently, this **likelihood maximization** is mathematically equivalent to minimizing $E_D(\\mathbf{w})$—that is, reducing the discrepancy between the observed target values $t_n$ and the predicted values $\\mathbf{w}^\\top \\mathbf{x}_n$ as much as possible.\n", + "\n", + "This connection explains why **Maximum Likelihood Estimation (MLE)** is also referred to as **Ordinary Least Squares (OLS)** in the context of linear regression with Gaussian noise. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Frequentist Approach of Obtaining $w$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "To maximize the log-likelihood in the **frequentist framework**, we compute its gradient with respect to $\\mathbf{w}$:\n", + "\n", + "$$\n", + "\\nabla \\ln p(\\mathbf{t} | \\mathbf{w}, \\beta) = \\beta \\sum_{n=1}^N \\big(t_n - \\mathbf{w}^\\top \\mathbf{x}_n \\big) \\mathbf{x}_n^\\top. \\tag{8}\n", + "$$\n", + "\n", + "Setting this gradient to zero gives the optimal weights $\\mathbf{w}$ that minimize the sum-of-squares error:\n", + "\n", + "$$\n", + "0 = \\sum_{n=1}^N t_n \\mathbf{x}_n^\\top - \\mathbf{w}^\\top \\sum_{n=1}^N \\mathbf{x}_n \\mathbf{x}_n^\\top. \\tag{9}\n", + "$$\n", + "\n", + "Solving for $\\mathbf{w}$, we obtain the famous **normal equation**:\n", + "\n", + "$$\n", + "\\mathbf{w}_{\\text{MLE}} = (\\mathbf{X}^\\top \\mathbf{X})^{-1} \\mathbf{X}^\\top \\mathbf{t}. \\tag{10}\n", + "$$\n", + "\n", + "The term $(\\mathbf{X}^\\top \\mathbf{X})^{-1} \\mathbf{X}^\\top$ is the **Moore-Penrose pseudoinverse** of $\\mathbf{X}$; it serves as a generalization of the standard matrix inverse for cases where $\\mathbf{X}$ may not be square or invertible. This pseudoinverse plays a key role in solving linear systems when $\\mathbf{X}$ is over- or under-determined. It can be thought of as the matrix that provides the best possible approximation to an inverse in such scenarios, enabling us to project the observed data into the parameter space effectively.\n", + "\n", + "This *frequentist* approach provides a closed-form solution for the maximum likelihood estimate of the regression weights, $\\mathbf{w}_{\\text{MLE}}$. While straightforward, it has key limitations:\n", + "- The estimate $\\mathbf{w}_{\\text{MLE}}$ is fixed and does not account for uncertainty in the parameters.\n", + "- It cannot incorporate prior knowledge, which becomes particularly problematic when data is scarce." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Synthetic Data Generation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HiSOh5L6AFuh" + }, + "source": [ + "We will now demonstrate the above theory by fitting a linear regression on synthetic data. This synthetic data $\\mathbf{x}$ has just two features ($x_1$ and $x_2$) and 25 data points.
The true relationship between the data $\\mathbf{x}$ and the target variable $y_{\\text{true}}$ is given by the equation:\n", + "\n", + "\n", + "$$\n", + "\\begin{equation}\n", + "y_{\\text{true}} = w_0^{\\text{true}} + x_1 \\cdot w_1^{\\text{true}} + x_2 \\cdot w_2^{\\text{true}} \\tag{11}\n", + "\\end{equation}\n", + "$$\n", + "\n", + "\n", + "where $w_0^{\\text{true}} = 1$, $w_1^{\\text{true}} = 2$, and $ w_2^{\\text{true}} = 3$.\n", + "\n", + "The observed target values ($y_{\\text{train}}$) are generated by adding Gaussian noise to the true target values:\n", + "\n", + "$$\n", + "\\begin{equation}\n", + "y = y_{\\text{true}} + \\mathcal{N}(0, \\sigma_{\\text{true}}) \\tag{12}\n", + "\\end{equation}\n", + "$$\n", + "\n", + "Here, $\\sigma_{\\text{true}} = 0.5$.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "xo4qpkVhisFX", + "outputId": "44218333-4b7d-42f7-9b02-b8f9a2caf30d" + }, + "outputs": [], + "source": [ + "# Creating 25 random data points containing 2 features\n", + "N = 25\n", + "np.random.seed(42)\n", + "x1 = np.random.uniform(0, 1, N)\n", + "x2 = np.random.uniform(0, 1, N)\n", + "X = pd.DataFrame({\"x1\": x1, \"x2\": x2})\n", + "\n", + "# Set the true relationship between the features and the target variable\n", + "TRUE_INTERCEPT = 1\n", + "TRUE_SLOPES = np.array([2, 3])\n", + "TRUE_SIGMA = 0.5\n", + "\n", + "# Calculating the target variables:\n", + "# y_true (deterministic) and y_train (includes Gaussian noise)\n", + "y_true = TRUE_INTERCEPT + np.dot(X, TRUE_SLOPES)\n", + "y_train = y_true + np.random.normal(0, TRUE_SIGMA, size=len(X))\n", + "\n", + "# Combine everything into a single DataFrame\n", + "data = pd.concat(\n", + " [X, pd.Series(y_true, name=\"y_true\"), pd.Series(y_train, name=\"y_train\")],\n", + " axis=1,\n", + ")\n", + "data = data.reset_index(drop=True)\n", + "\n", + "# train test split and saving\n", + "train_data, test_data = train_test_split(data, test_size=5)\n", + "train_data.to_csv(\"train_data.csv\")\n", + "test_data.to_csv(\"test_data.csv\")\n", + "\n", + "# Display the train_data DataFrame\n", + "style_data(test_data.head())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zUygtX7Ad698" + }, + "source": [ + "The line chart below plots the relationship between $x_1$, $x_2$ and the targets - both the theoretical $y_{\\text{true}}$, represented by the red line, and the observed $y_{\\text{train}}$ that contains Gaussian noise, represented by the blue dots." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Fix feature1 and feature2 constants\n", + "x1_constant = train_data[\"x1\"].mean()\n", + "x2_constant = train_data[\"x2\"].mean()\n", + "\n", + "# Recalculate the true target `y_true` for a constant x1\n", + "y_true_fixed_x1 = (\n", + " TRUE_INTERCEPT + TRUE_SLOPES[0] * x1_constant + TRUE_SLOPES[1] * train_data[\"x2\"]\n", + ")\n", + "\n", + "# Recalculate the true target `y_true` for a constant x2\n", + "y_true_fixed_x2 = (\n", + " TRUE_INTERCEPT + TRUE_SLOPES[0] * train_data[\"x1\"] + TRUE_SLOPES[1] * x2_constant\n", + ")\n", + "\n", + "# Set up the plot\n", + "fig, axes = plt.subplots(1, 2, figsize=(15, 6))\n", + "\n", + "# Plot feature1 vs. 
y_train with x2 constant\n", + "axes[0].scatter(\n", + " train_data[\"x1\"],\n", + " train_data[\"y_train\"],\n", + " label=\"Observed `y_train` (containing noise)\",\n", + " alpha=0.6,\n", + ")\n", + "axes[0].plot(\n", + " train_data[\"x1\"],\n", + " y_true_fixed_x2,\n", + " color=\"red\",\n", + " label=\"Theoretical `y_true`\",\n", + " linewidth=2,\n", + ")\n", + "axes[0].set_xlabel(\"x1\")\n", + "axes[0].set_ylabel(\"target\")\n", + "axes[0].set_title(f\"x1 vs target\\n(x2 fixed at {x2_constant:.2f})\")\n", + "axes[0].legend()\n", + "\n", + "# Plot feature2 vs. y_train with x1 constant\n", + "axes[1].scatter(\n", + " train_data[\"x2\"],\n", + " train_data[\"y_train\"],\n", + " label=\"Observed `y_train` (containing noise)\",\n", + " alpha=0.6,\n", + ")\n", + "axes[1].plot(\n", + " train_data[\"x2\"],\n", + " y_true_fixed_x1,\n", + " color=\"blue\",\n", + " label=\"Theoretical `y_true`\",\n", + " linewidth=2,\n", + ")\n", + "axes[1].set_xlabel(\"x2\")\n", + "axes[1].set_title(f\"x2 vs target\\n(x1 fixed at {x1_constant:.2f})\")\n", + "axes[1].legend()\n", + "\n", + "# Improve spacing and show plot\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uB-IoGsjd3vA" + }, + "source": [ + "We will also create synthetic **testing** data to evaluate the models' performance. The following code generates 10 new testing data points." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + }, + "id": "X0ddXs1Ii0Yb", + "outputId": "5f04ae6d-1dd8-475d-ac91-cc77564c3a24" + }, + "outputs": [], + "source": [ + "X_test = test_data[[\"x1\", \"x2\"]]\n", + "style_data(X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Coding OLS Linear Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## From Scratch (Normal Equation)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Having generated the synthetic data, we are now ready to solve the problem. \n", + "\n", + "To begin, we will solve the linear regression problem \"from scratch\" by applying the normal equation to compute $\\mathbf{w}_{\\text{MLE}}$, the maximum likelihood estimate of the weights." 
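A brief aside added here for completeness: the next cell forms $(\mathbf{X}^\top \mathbf{X})^{-1}$ explicitly, which is fine for this small, well-conditioned example but can be numerically fragile when features are nearly collinear. A sketch of a more robust way to obtain the same $\mathbf{w}_{\text{MLE}}$, using the `X` and `y` arrays constructed in that cell, is:

```python
# Illustrative alternative to explicitly inverting X^T X (assumes the design
# matrix `X` with a column of ones and the column vector `y` from the next cell).

# Solve the least-squares problem directly via an SVD-based routine
weights_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalently, use the Moore-Penrose pseudoinverse discussed above
weights_pinv = np.linalg.pinv(X) @ y

# Both agree with the normal-equation solution up to floating-point error
```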
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "X_train = train_data[[\"x1\", \"x2\"]]\n", + "y_train = train_data[\"y_train\"]\n", + "\n", + "# For simplicity, we'll use `X` and `y` to represent the final forms X_train and y_train\n", + "X = np.c_[np.ones(len(X_train)), X_train] # Add a column of ones to the features\n", + "y = y_train.values.reshape(-1, 1) # Reshape y to a column vector\n", + "\n", + "# Applying the normal equation\n", + "X_pseudo_inverse = np.linalg.inv(X.T @ X) @ X.T\n", + "weights = X_pseudo_inverse @ y\n", + "\n", + "# Extracting the intercept and slopes\n", + "intercept = weights[0, 0]\n", + "slopes = weights[1:].flatten()\n", + "\n", + "# Calculating residuals and their standard deviation\n", + "y_pred = X @ weights\n", + "residuals = y - y_pred\n", + "estimated_sigma = residuals.std()\n", + "\n", + "\n", + "# ================== Reporting ==================\n", + "\n", + "true_model_latex = rf\"\"\"\n", + "\\text{{True data generating model:}} \\\\\n", + "y_{{\\text{{true}}}} = {TRUE_SLOPES[0]:.2f} \\cdot x_1 +\n", + "{TRUE_SLOPES[1]:.2f} \\cdot x_2 + {TRUE_INTERCEPT:.2f} \\\\\n", + "\\text{{True standard deviation: }} \\sigma_{{\\text{{true}}}} = {TRUE_SIGMA:.2f}\\\\\n", + "\"\"\"\n", + "\n", + "\n", + "estimated_model_latex_normal = rf\"\"\"\n", + "\\text{{Estimated MLE model (using Normal equation):}} \\\\\n", + "\\hat{{y}} = {slopes[0]:.2f} \\cdot x_1 + {slopes[1]:.2f} \\cdot x_2 + {intercept:.2f} \\\\\n", + "\\text{{Standard deviation of residuals: }} \\hat{{\\sigma}} = {estimated_sigma:.2f}\n", + "\"\"\"\n", + "\n", + "display(Math(true_model_latex))\n", + "display(Math(estimated_model_latex_normal))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that the estimated $\\mathbf{w}_{\\text{MLE}}$ is not too far of from the true data generating model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using `statsmodels`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Manually coding the Normal Equation every time we solve a linear regression problem can quickly become tedious and error-prone. \n", + "\n", + "Thankfully, the **`statsmodels`** library simplifies this process, allowing us to fit an Ordinary Least Squares (OLS) model in just two lines of code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "P7Af9sHKKdx8" + }, + "outputs": [], + "source": [ + "X_train_with_ones = sm.add_constant(X_train)\n", + "ols_model = sm.OLS(y_train, X_train_with_ones).fit()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T65gKYllh18N" + }, + "source": [ + "The code below extracts the parameter estimates from the `ols_model`.\n", + "\n", + "Notice that the estimated slopes, intercept, and standard deviation closely match those derived from the Normal equation. This is expected since **`statsmodels`** leverages a variation of the Normal equation behind the scenes to compute these estimates." 
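As a small addition to the original walkthrough: the fitted `ols_model` also exposes the standard frequentist diagnostics, including standard errors, t-statistics, and 95% confidence intervals for each coefficient, which provide useful context for the confidence-interval discussion further below.

```python
# Full OLS report: coefficients, standard errors, t-statistics, p-values,
# and 95% confidence intervals for the intercept and slopes
print(ols_model.summary())

# Just the 95% confidence intervals, one row per coefficient
print(ols_model.conf_int(alpha=0.05))
```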
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Predicted values for y_train\n", + "y_train_pred = ols_model.predict(X_train_with_ones)\n", + "residuals = y_train_pred - y_train\n", + "\n", + "# ================== Reporting ==================\n", + "\n", + "true_model_latex = rf\"\"\"\n", + "\\text{{True data generating model:}} \\\\\n", + "y_{{\\text{{true}}}} = {TRUE_SLOPES[0]:.2f} \\cdot x_1 +\n", + "{TRUE_SLOPES[1]:.2f} \\cdot x_2 + {TRUE_INTERCEPT:.2f} \\\\\n", + "\\text{{True standard deviation: }} \\sigma = {TRUE_SIGMA:.2f}\\\\\n", + "\"\"\"\n", + "\n", + "estimated_model_latex = rf\"\"\"\n", + "\\text{{Estimated MLE model:}} \\\\\n", + "\\hat{{y}} = {ols_model.params.iloc[1]:.2f} \\cdot x_1 +\n", + "{ols_model.params.iloc[2]:.2f} \\cdot x_2 + {ols_model.params.iloc[0]:.2f} \\\\\n", + "\\text{{Standard deviation of residuals: }} \\hat{{\\sigma}} = {residuals.std():.2f}\\\\\n", + "\"\"\"\n", + "\n", + "# Displaying the results using LaTeX\n", + "display(Math(true_model_latex))\n", + "display(Math(estimated_model_latex))\n", + "display(Math(estimated_model_latex_normal))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test Prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "evk3LGh6nEkU" + }, + "source": [ + "\n", + "Using the trained `ols_model`, we can also create point predictions on the unseen `X_test` along with the corresponding confidence interval.\n", + "\n", + "The latter provides a range within which we expect the true parameter to lie with a certain level of confidence (e.g., 95%).\n", + "\n", + "In this code, we fix **`feature2`** at its mean value to isolate and visualize the influence of **`feature1`** on the target variable. 
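One clarification worth adding (not in the original text): `summary_frame` returns two different intervals. The `mean_ci_*` columns bound the *mean* prediction $\mathbf{w}^\top \mathbf{x}$, while the wider `obs_ci_*` columns, used in the plot below, also account for the noise and are therefore prediction intervals for a new observation. A minimal sketch of the difference, assuming the usual `statsmodels` column names:

```python
# Compare the interval for the mean prediction with the (wider) interval
# for a new observation on the unseen test inputs
pred = ols_model.get_prediction(sm.add_constant(X_test, has_constant="add"))
frame = pred.summary_frame(alpha=0.05)

print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]].head())  # CI for the mean
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]].head())    # prediction interval
```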
\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Fix feature2 at its mean\n", + "x2_constant = X_test[\"x2\"].mean()\n", + "\n", + "# Create a new test dataset with feature2 fixed at the constant value\n", + "X_test_fixed_x2 = X_test.copy()\n", + "X_test_fixed_x2[\"x2\"] = x2_constant\n", + "\n", + "# Predict y_test using the linear model after adding constant\n", + "predictions_fixed_x2 = ols_model.get_prediction(\n", + " sm.add_constant(X_test_fixed_x2, has_constant=\"add\")\n", + ")\n", + "pred_summary_fixed_x2 = predictions_fixed_x2.summary_frame(alpha=0.05)\n", + "\n", + "# Extract predicted values and confidence intervals\n", + "y_test_pred_fixed_x2 = pred_summary_fixed_x2[\"mean\"]\n", + "conf_int_lower_fixed_x2 = pred_summary_fixed_x2[\"obs_ci_lower\"]\n", + "conf_int_upper_fixed_x2 = pred_summary_fixed_x2[\"obs_ci_upper\"]\n", + "\n", + "sorted_indices = np.argsort(X_test[\"x1\"])\n", + "X_test_sorted = X_test[\"x1\"].iloc[sorted_indices]\n", + "y_test_pred_sorted = y_test_pred_fixed_x2.iloc[sorted_indices]\n", + "conf_int_lower_sorted = conf_int_lower_fixed_x2.iloc[sorted_indices]\n", + "conf_int_upper_sorted = conf_int_upper_fixed_x2.iloc[sorted_indices]\n", + "\n", + "# Plot the predictions with the confidence intervals\n", + "plt.figure(figsize=(10, 6))\n", + "plt.scatter(X_test_sorted, y_test_pred_sorted, color=\"blue\", label=\"Predicted values\")\n", + "plt.fill_between(\n", + " X_test_sorted,\n", + " conf_int_lower_sorted,\n", + " conf_int_upper_sorted,\n", + " color=\"lightblue\",\n", + " alpha=0.4,\n", + " label=\"95% Confidence Interval\",\n", + ")\n", + "plt.xlabel(\"x1\")\n", + "plt.ylabel(\"Predicted Target\")\n", + "plt.title(f\"Predictions for X_test \\n(x2 fixed at {x2_constant:.2f})\")\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Confidence Interval" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The prediction above illustrates the frequentist concept of a **confidence interval (CI)**. A confidence interval provides a range of plausible values for an unknown parameter (such the regression coefficient) based on the observed data. For example, a 95% CI means that if we were to repeat the same experiment or sampling process many times, approximately 95% of the intervals constructed from those experiments would contain the true value of the parameter.\n", + "\n", + "It’s important to note that a CI does not indicate the probability that the true parameter lies within the interval for a single sample. This distinction often leads to confusion, as people sometimes interpret CIs in a probabilistic way that is closer to Bayesian reasoning.\n", + "\n", + "We will see later the Bayesian counterpart of CI, termed the **Bayesian credible intervals**, in contrast, do provide a probabilistic statement about the parameter itself. For instance, a 95% Bayesian credible interval directly states that there is a 95% probability that the true parameter value lies within the interval, conditioned on the data and the prior. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HAxBG7mR2Ar1" + }, + "source": [ + "# Bayesian Linear Regression" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have a solid foundation in linear regression and the frequentist approach using Maximum Likelihood Estimation (MLE), let us shift our focus to Bayesian linear regression. 
Bayesian linear regression estimates parameters using two sources of information: prior beliefs on those parameters and observed data.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "We will see that compared to OLS regression, Bayesian Linear Regression offers several key advantages:\n", + "\n", + "1. **Incorporation of Prior Knowledge**: Bayesian regression allows you to incorporate prior beliefs about parameters, which can improve estimates, especially in cases where data is sparse.\n", + "\n", + "2. **Uncertainty Quantification**: Unlike the frequentist approach which provides only single point estimates for model parameters, Bayesian linear regression outputs entire probability distributions. This allows for a richer understanding of parameter uncertainty and facilitates probabilistic predictions.\n", + "\n", + "\n", + "In this section, we will explore the theoretical framework used in Bayesian linear regression.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Theory" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NVVm_Idgo0Ac" + }, + "source": [ + "Bayesian linear regression directly applies Bayes' Theorem to estimate the $\\color{green}{\\textbf{posterior distributions}}$ of the model parameters. \n", + "\n", + "As a reminder, here is Bayes' Theorem:\n", + "\n", + "$$\n", + "\\begin{align}\n", + "\\color{green}{P(\\mathbf{w} \\mid \\mathbf{t})} &= \\frac{\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})} \\times \\color{orange}{P(\\mathbf{w})}}{\\color{purple}{P(\\mathbf{t})}} \\\\\n", + "\\color{green}{\\text{posterior}} &= \\frac{\\color{blue}{\\text{likelihood}} \\times \\color{orange}{\\text{prior}}}{\\color{purple}{\\text{marginal likelihood}}}\n", + "\\end{align}\n", + "\\tag{13}\n", + "$$ \n", + "\n", + "Where:\n", + "\n", + "- $\\color{orange}{P(\\mathbf{w})}$ is the $\\color{orange}{\\textbf{prior}}$ for parameters $\\mathbf{w}$, reflecting our beliefs about $\\mathbf{w}$ before observing the data $(\\mathbf{t}, \\mathbf{X})$. For instance, if we assume most predictors should have little influence, we can set $\\color{orange}{P(\\mathbf{w})}$ to be a Gaussian centered at zero with small standard deviation, which represents our high certainty that their values are close to zero. Technically, we can also have $\\color{orange}{P(\\beta)}$ - the prior for the precision parameter $\\beta$ (inverse variance of the noise). However, to simplify our discussion, we'll assume $\\beta$ is known, so this term is constant and can be ignored. \n", + "\n", + "\n", + "- $\\color{green}{P(\\mathbf{w}\\mid\\mathbf{t}})$ represents the $\\color{green}{\\textbf{posterior}}$ for $\\mathbf{w}$ given the observed data $(\\mathbf{t}, \\mathbf{X})$. It combines the prior and the likelihood and quantifies our belief about $\\mathbf{w}$ *after* observing the data. For example, if we initially believe a slope $w_1$ should be small due to prior domain knowledge but the data $\\mathbf{X}$ strongly suggests otherwise, the posterior would balance these two sources of information.\n", + "\n", + "\n", + "\n", + "- $\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})}$ is the $\\color{blue}{\\textbf{likelihood}}$ of the target data $\\mathbf{t}$ given the input data $\\mathbf{X}$, parameters $\\mathbf{w}$, and noise precision $\\beta$. It measures how well a particular set of parameters $\\mathbf{w}$ explains the observed target values $\\mathbf{t}$. 
As mentioned above, we'll assume that each data point $(\\mathbf{x}_n, t_n)$ is drawn independently from a Gaussian.\n", + "\n", + "\n", + "- $\\color{purple}{P(\\mathbf{t})}$ is the $\\color{purple}{\\textbf{marginal likelihood}}$, which ensures that the posterior sums to one. It integrates out all possible parameter values $\\mathbf{w}$ from the joint probability. The marginal likelihood is particularly important for Bayesian model comparison because it quantifies how well a model as a whole explains the data independent of any specific parameter settings.\n", + "\n", + "A notational reminder: As the data matrix $\\mathbf{X}$ and the known precision parameter $\\beta$ always appear in the set of conditioning variables, we have dropped the explicit $\\mathbf{X}$ and $\\beta$ from expressions such as the likelihood $\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{X}, \\mathbf{w}, \\beta)}$ to keep the notation uncluttered.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Challenge of Bayesian Inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As demonstrated above, Bayesian inference aims to determine the $\\color{green}{\\textbf{posterior}}$ $\\color{green}{P(\\mathbf{w} \\mid \\mathbf{t})}$ by applying Bayes' theorem. This posterior encapsulates our updated belief about the parameters $\\mathbf{w}$ after observing the data $\\mathbf{t}$ and $\\mathbf{X}$.\n", + "\n", + "However, in many real-world applications, computing the posterior directly is computationally challenging. The difficulty arises from the denominator in Bayes' formula—the $\\color{purple}{\\textbf{marginal likelihood}}$ $\\color{purple}{P(\\mathbf{t})}$—whose calculation requires integrating over the parameter space to marginalize $\\mathbf{w}$ from the $\\color{blue}{\\textbf{likelihood}}$ $\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})}$:\n", + "\n", + "$$\n", + "\\color{purple}{P(\\mathbf{t})} = \\int \\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})} \\cdot \\color{orange}{P(\\mathbf{w})} \\, d\\mathbf{w}.\n", + "\\tag{14}\n", + "$$\n", + "\n", + "For high-dimensional parameter spaces, this integral is often analytically intractable due to its complexity. In such cases, we must rely on numerical strategies to approximate the posterior. \n", + "\n", + "\n", + "Below, we'll introduce some strategies for obtaining/approximating the posterior. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Strategies for Bayesian Inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To obtain or approximate the posterior distribution, we can employ several strategies. The most common approaches include **conjugate priors**, **Markov Chain Monte Carlo (MCMC)**, and **Variational Inference (VI)**, which are briefly introduced below.\n", + "\n", + "\n", + "\n", + "### Conjugate Priors\n", + "\n", + "A conjugate prior is a type of prior distribution that makes Bayesian inference much easier. When paired with a specific likelihood, this prior ensures the posterior distribution stays in the same family as the prior (hence the name \"conjugate\"). This property of 'staying in the same family' means we can compute the posterior analytically without needing any numerical methods!\n", + "\n", + "For example, in Bayesian linear regression with a Gaussian likelihood, if we were to use a Gaussian prior, the resulting posterior is also Gaussian. 
This property of \"staying in the same family\" greatly simplifies Bayesian computations and allows us to obtain an analytical form of the posterior. \n", + "\n", + "The second notebook in this series delves into conjugacy in more details.\n", + "\n", + "### Markov Chain Monte Carlo (MCMC)\n", + "\n", + "Markov Chain Monte Carlo (MCMC) is a group of algorithms designed to sample from the posterior. It works by building a Markov chain where the posterior serves as the equilibrium distribution. Once enough samples are generated, they can be used to numerically approximate the posterior and derive insights about the parameters.\n", + "\n", + "We'll dive deeper into how MCMC works and its practical applications in the third notebook of this series.\n", + "\n", + "### Variational Inference (VI)\n", + "\n", + "\n", + "Variational Inference (VI) offers a clever way to approximate the $\\color{green}{\\textbf{posterior }P(\\mathbf{w} \\mid \\mathbf{t})}$. Instead of using sampling, VI seeks $\\color{salmon}\\textbf{a simpler distribution }{q(\\mathbf{w})}$ from a predefined family that closely approximates posterior. This approximation is done by minimizing the **Kullback-Leibler (KL) divergence** (i.e. the difference) between $\\color{salmon}{q(\\mathbf{w})}$ and the posterior:\n", + "\n", + "$$\n", + "\\color{salmon}{q^*(\\mathbf{w})} \\color{black}{= \\arg \\min_{\\color{black}{q}} \\text{KL}(\\color{salmon}{q(\\mathbf{w})} \\, \\color{black}{\\parallel} \\, \\color{green}{P(\\mathbf{w} \\mid \\mathbf{t})}}\\color{black}{)}.\n", + "$$\n", + "\n", + "In short, VI turns Bayesian inference into a neat optimization problem, making it faster and more scalable.\n", + "\n", + "We'll explore this approach in more depth in the fourth notebook of this series.\n", + "\n", + "### Comparison of Strategies\n", + "\n", + "| **Method** | **Advantages** | **Disadvantages** |\n", + "|---------------------------|----------------------------------------------------|--------------------------------------------------------|\n", + "| **Conjugate Priors** | Analytically tractable; computationally efficient | Limited flexibility in prior/likelihood combinations |\n", + "| **MCMC** | Flexible and can theoretically approximate any posterior | Can be computationally expensive |\n", + "| **Variational Inference** | Computationally efficient as it uses deterministic optimization | May underestimate uncertainty;
Result depends on the choice of approximant $q(\\mathbf{w})$ |\n", + "\n", + "Each method has trade-offs, and the choice of approach depends on the complexity of the model, the size of the data, and the computational resources available.\n", + "A more detailed discussion on these methods will be provided in subsequent notebooks.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cNKM7EPXJXij" + }, + "source": [ + "# References" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KzuX_hc8vsD7" + }, + "source": [ + "- [Bishop - Pattern Recognition and Machine Learning (2006)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) - A comprehensive reference on machine learning theory and Bayesian methods by Christopher M. Bishop." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Credits\n", + "\n", + "Notebook creation: `meraldoantonio`\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "hide_input": false, + "kernelspec": { + "display_name": "pymc_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/examples/bayesian/2. Gaussian Conjugate Prior.ipynb b/examples/bayesian/2. Gaussian Conjugate Prior.ipynb new file mode 100644 index 000000000..0f4da040a --- /dev/null +++ b/examples/bayesian/2. Gaussian Conjugate Prior.ipynb @@ -0,0 +1,1012 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0", + "metadata": {}, + "source": [ + "# Objective" + ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "In this second notebook, we'll dive deeper into the concept of conjugate priors in Bayesian linear regression (BLR).\n", + "\n", + "We'll see how by using conjugate distributions, we can derive posterior distributions analytically and thus simplifying Bayesian inference. \n", + "\n", + "After providing the theory, we will see how this inference can be done using `skpro`'s `BayesianConjugateLinearRegressor` class." 
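Before the multivariate derivation below, a tiny warm-up may help. The following sketch (an added illustration, not part of the original notebook) shows the simplest conjugate pair: a Gaussian prior on a scalar mean combined with Gaussian observations of known variance yields a Gaussian posterior with closed-form parameters.

```python
# Minimal 1-D illustration of Gaussian-Gaussian conjugacy:
# prior mu ~ N(m0, s0^2), observations x_i ~ N(mu, sigma^2) with sigma known.
import numpy as np

rng = np.random.default_rng(0)
m0, s0 = 0.0, 2.0        # prior mean and prior standard deviation
sigma, true_mu = 0.5, 1.3
x = rng.normal(true_mu, sigma, size=20)

# Precisions add; the posterior mean is a precision-weighted average
post_precision = 1 / s0**2 + len(x) / sigma**2
post_var = 1 / post_precision
post_mean = post_var * (m0 / s0**2 + x.sum() / sigma**2)

print(f"posterior: N({post_mean:.3f}, {post_var:.4f})")  # Gaussian in, Gaussian out
```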
+ ] + }, + { + "cell_type": "markdown", + "id": "2", + "metadata": {}, + "source": [ + "# Imports and Helper Functions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "import statsmodels.api as sm\n", + "from IPython.display import Math, display\n", + "from utils import style_data\n", + "\n", + "from skpro.regression.bayesian import BayesianConjugateLinearRegressor" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "markdown", + "id": "5", + "metadata": {}, + "source": [ + "# Theory" + ] + }, + { + "cell_type": "markdown", + "id": "6", + "metadata": {}, + "source": [ + "## Bayesian Linear Regression - Quick Recap " + ] + }, + { + "cell_type": "markdown", + "id": "7", + "metadata": {}, + "source": [ + "As mentioned in the first notebook, **Bayesian linear regression (BLR)** is a probabilistic approach to linear regression where prior beliefs about the model parameters are combined with observed data to compute posterior distributions. Unlike traditional linear regression, which only provides point estimates for parameters and predictions, BLR offers a complete distribution for them, providing an intuitive way to reason about uncertainty in both parameters and predictions.\n", + "\n", + "As a reminder, BLR builds on Bayes' theorem:\n", + "\n", + "$$\n", + "\\begin{align*}\n", + "\\color{green}{P(\\mathbf{w} \\mid \\mathbf{t})} &= \\frac{\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})} \\times \\color{orange}{P(\\mathbf{w})}}{\\color{purple}{P(\\mathbf{t})}} \\\\\n", + "\\color{green}{\\text{posterior}} &= \\frac{\\color{blue}{\\text{likelihood}} \\times \\color{orange}{\\text{prior}}}{\\color{purple}{\\text{marginal likelihood}}}\n", + "\\end{align*}\n", + "\\tag{1}\n", + "$$\n", + "\n", + "Where:\n", + "\n", + "- $\\color{orange}{P(\\mathbf{w})}$ is the $\\color{orange}{\\textbf{prior}}$ for parameters $\\mathbf{w}$, reflecting our beliefs about $\\mathbf{w}$ before observing the data. \n", + "\n", + "\n", + "- $\\color{green}{P(\\mathbf{w} \\mid \\mathbf{t})}$ represents the $\\color{green}{\\textbf{posterior}}$ for $\\mathbf{w}$ given the observed data $(\\mathbf{t}, \\mathbf{X})$. It combines the prior and the likelihood and quantifies our belief about $\\mathbf{w}$ *after* observing the data. \n", + "\n", + "\n", + "- $\\color{purple}{P(\\mathbf{t})}$ is the $\\color{purple}{\\textbf{marginal likelihood}}$, which ensures that the posterior sums to one. \n", + "\n", + "\n", + "- $\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})}$ is the $\\color{blue}{\\textbf{likelihood}}$ of the target data $\\mathbf{t}$ given the input data $\\mathbf{X}$, parameters $\\mathbf{w}$, and noise precision $\\beta$. It measures how well a particular set of parameters $\\mathbf{w}$ explains the observed target values $\\mathbf{t}$. 
Assuming each data point $(\\mathbf{x}_n, t_n)$ is drawn independently, the likelihood is:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})} &= \\prod_{n=1}^N \\mathcal{N}(t_n \\mid \\mathbf{w}^T \\mathbf{x}_n, \\beta^{-1}) \\\\\n", + "&\\propto \\exp \\left( -\\frac{\\beta}{2} \\sum_{n=1}^N (t_n - \\mathbf{w}^T \\mathbf{x}_n)^2 \\right) \\\\\n", + "&\\propto \\exp \\left( -\\frac{\\beta}{2} \\|\\mathbf{t} - \\mathbf{X} \\mathbf{w}\\|^2 \\right) \\tag{2}\n", + "\\end{aligned}\n", + "$$\n", + "\n", + "A notational reminder: As the data matrix $\\mathbf{X}$ and the known precision parameter $\\beta$ always appear in the set of conditioning variables, we have dropped the explicit $\\mathbf{X}$ and $\\beta$ from expressions such as the likelihood $\\color{blue}{P(\\mathbf{t} \\mid \\mathbf{X}, \\mathbf{w}, \\beta)}$ to keep the notation uncluttered.\n" + ] + }, + { + "cell_type": "markdown", + "id": "8", + "metadata": {}, + "source": [ + "## Conjugacy and Conjugate Prior" + ] + }, + { + "cell_type": "markdown", + "id": "9", + "metadata": {}, + "source": [ + "A **conjugate prior** is a prior distribution that, when combined with a specific likelihood, ensures that the posterior distribution belongs to the same family as the prior. We'll soon see that this property of \"staying in the same family\" greatly simplifies Bayesian computations!\n" + ] + }, + { + "cell_type": "markdown", + "id": "10", + "metadata": {}, + "source": [ + "*So, what would be the conjugate prior in our case?*\n", + "\n", + "To answer this, let’s revisit the likelihood. As mentioned earlier, the likelihood $ \\color{blue}{P(\\mathbf{t} \\mid \\mathbf{w})} $ follows a Gaussian distribution.\n", + "\n", + "To ensure conjugacy and simplify computation, we choose a multivariate Gaussian distribution as the prior for $ \\mathbf{w} $. This choice leverages a key property of Gaussian distributions: *the product of two Gaussian distributions is also a Gaussian*.\n", + "\n", + "Let’s now define the prior to set the stage for the analytical derivation that follows. \n", + "\n", + "The multivariate Gaussian prior is given by:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "\\color{orange}{P(\\mathbf{w})} &= \\mathcal{N}(\\mathbf{w} \\mid \\mathbf{m}_0, \\mathbf{S}_0) \\\\\n", + "&\\propto \\exp \\left( -\\frac{1}{2} (\\mathbf{w} - \\mathbf{m}_0)^\\top \\mathbf{S}_0^{-1} (\\mathbf{w} - \\mathbf{m}_0) \\right) \\tag{3}\n", + "\\end{aligned}\n", + "$$\n", + "\n", + "Here:\n", + "- $ \\mathbf{m}_0 $ is the **mean vector** of the multivariate Gaussian prior of the the regression coefficients $\\mathbf{w}$.
\n", + "$ \\mathbf{m}_0 $ has a shape of $(D, 1)$, where $D$ is the number of features.\n", + "- $ \\mathbf{S}_0 $ is the **covariance matrix** of the prior, with shape $ (D, D) $.\n", + "\n", + "Both $ \\mathbf{m}_0 $ and $ \\mathbf{S}_0 $ should be selected based on prior knowledge or assumptions about the regression coefficients.\n", + "\n", + "> **Note:** Do not confuse $\\mathbf{S}_0$, the prior covariance matrix of the parameter $\\mathbf{w}$, with $\\beta$, which represents the precision (inverse variance) of the noise term in regression. In this framework, for simplicity, we assume $\\beta$ is known. If it is unknown, an alternative conjugate framework, the Multivariate Normal-Wishart distribution, must be used.\n" + ] + }, + { + "cell_type": "markdown", + "id": "11", + "metadata": {}, + "source": [ + "## Conjugate Posterior" + ] + }, + { + "cell_type": "markdown", + "id": "12", + "metadata": {}, + "source": [ + "\n", + "\n", + "According to the Bayes formula, the $\\color{green}{\\textbf{posterior}}$ is proportional to the product of the $\\color{blue}{\\textbf{likelihood}}$ and $\\color{orange}{\\textbf{prior}}$:\n", + "
\n", + "\n", + "$$\n", + "\\color{green}{P(\\mathbf{w} | \\mathbf{t})} \\propto \\color{blue}{\\exp \\left( -\\frac{\\beta}{2} \\|\\mathbf{t} - \\mathbf{X} \\mathbf{w}\\|^2 \\right)} \\color{orange}{\\exp \\left( -\\frac{1}{2} (\\mathbf{w} - \\mathbf{m}_0)^T \\mathbf{S}_0^{-1} (\\mathbf{w} - \\mathbf{m}_0) \\right)} \\tag{4}\n", + "$$\n", + "\n", + "After expanding the terms in the exponents and completing the square, we obtain a posterior that's also a multivariate Gaussian:\n", + "\n", + "$$\n", + "\\color{green}{P(\\mathbf{w} | \\mathbf{t})} = \\mathcal{N}(\\mathbf{w} | \\mathbf{m}_N, \\mathbf{S}_N) \\tag{5}\n", + "$$\n", + "\n", + "In this formula,\n", + "$\\mathbf{S}_N$ is the *covariance* of our posterior Gaussian distribution.
Its inverse (i.e. the *precision* of the posterior) can be conveniently calculated from the prior and the data by simply using the formula below:\n", + "\n", + " $$\n", + " \\mathbf{S}_N^{-1} = \\mathbf{S}_0^{-1} + \\beta \\mathbf{X}^T \\mathbf{X} \\tag{6}\n", + " $$\n", + " \n", + "We see that the formula lends itself to the following intuition:\n", + "\n", + "- The posterior precision $\\mathbf{S}_N^{-1}$ is the sum of the prior-derived precision and data-derived precision.\n", + "- The prior precision $\\mathbf{S}_0^{-1}$ reflects initial uncertainty in the weights $\\mathbf{w}$.\n", + "- On the other hand, the data-derived precision $\\beta \\mathbf{X}^T \\mathbf{X}$ reflects the improvement in precision coming from observing data $\\mathbf{X}$, adjusted by the noise precision $\\beta$.\n", + "\n", + "\n", + "Meanwhile, $\\mathbf{m}_N$ is the mean of our posterior. It is calculated using the following formula:\n", + "\n", + " $$\n", + " \\mathbf{m}_N = \\mathbf{S}_N \\left( \\mathbf{S}_0^{-1} \\mathbf{m}_0 + \\beta \\mathbf{X}^T \\mathbf{t} \\right) \\tag{7}\n", + " $$\n", + "\n", + " We note that $\\mathbf{m}_N$ is essentially a **weighted average** of the prior mean vector, $\\mathbf{m}_0$, and the observed data (represented by $\\mathbf{X}^T \\mathbf{t}$). \n", + "\n", + "Another observation: if we set an infinitely broad (i.e. completely uninformative) prior with zero precision $\\mathbf{S}_0^{-1} = \\mathbf{0}$, we see that the Bayesian posterior estimate reduces to the frequentist $\\mathbf{w}_{\\text{MLE}}$ obtained through the normal equation:\n", + "\n", + "\n", + "\n", + "$$\n", + "\\mathbf{S}_N^{-1} = \\mathbf{0} + \\beta \\mathbf{X}^T \\mathbf{X} \\tag{Assuming $\\mathbf{S}_0^{-1} = \\mathbf{0}$}\n", + "$$\n", + "\n", + "$$\n", + "\\mathbf{S}_N = \\left( \\beta \\mathbf{X}^T \\mathbf{X} \\right)^{-1} \\tag{Inverting $\\mathbf{S}_N^{-1}$}\n", + "$$\n", + "\n", + "$$\n", + "\\begin{align}\n", + "\\mathbf{m}_N &= \\left( \\beta \\mathbf{X}^T \\mathbf{X} \\right)^{-1} \\beta \\mathbf{X}^T \\mathbf{t} \\tag{Substituting $\\mathbf{S}_N^{-1}$ into $\\mathbf{m}_N$} \\\\\n", + "&= \\left( \\mathbf{X}^T \\mathbf{X} \\right)^{-1} \\mathbf{X}^T \\mathbf{t} \\tag{Normal Equation recovered!} \\\\\n", + "&= \\mathbf{w}_{\\text{MLE}} \\notag\n", + "\\end{align}\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "13", + "metadata": {}, + "source": [ + "## Posterior Predictive" + ] + }, + { + "cell_type": "markdown", + "id": "14", + "metadata": {}, + "source": [ + "Our ultimate goal is to get the posterior predictive distribution of a new target $t$ given a new input $\\mathbf{x}$:\n", + "\n", + "$$\n", + "p(t | \\mathbf{x}, \\mathbf{X}, \\mathbf{t}) = \\mathcal{N}(t | m(\\mathbf{x}), s^2(\\mathbf{x}))\n", + "$$\n", + "\n", + "As the notation suggests, this distribution depends on the training data ($\\mathbf{X}$ and $\\mathbf{t}$) used to fit the model.\n", + "\n", + "This posterior predictive distribution is a univariate Gaussian with mean $m(\\mathbf{x})$ and variance $s^2(\\mathbf{x})$, both of which depend on the given input $\\mathbf{x}$.\n", + "\n", + "\n", + "### Predictive Mean\n", + "\n", + "The predictive mean $m(\\mathbf{x})$ is given by:\n", + "$$\n", + "\\begin{aligned}\n", + "m(\\mathbf{x}) &= \\mathbf{x}^T \\beta S_N \\mathbf{X}^T \\mathbf{t} \\\\\n", + "&= \\mathbf{x}^T \\mathbf{m}_N \\tag{8}\n", + "\\end{aligned}\n", + "$$\n", + "\n", + "We note that this predictive mean is very simple: it is simply a projection of the incoming data point $\\mathbf{x}$ onto the posterior mean 
$\\mathbf{m}_N$.\n", + "\n", + "\n", + "### Predictive Variance\n", + "\n", + "The predictive variance $s^2(\\mathbf{x})$ is given by:\n", + "$$\n", + "s^2(\\mathbf{x}) = \\beta^{-1} + \\mathbf{x}^T S_N \\mathbf{x} \\tag{9}\n", + "$$\n", + "\n", + "\n", + "Predictive variance quantifies model confidence, increasing in regions far from the training data or in uncertain directions. From the formula, we note that this uncertainty depends on the **position** and **direction** of the incoming data point $\\mathbf{x}$:\n", + "\n", + "- If it's close to fitted training data: $\\mathbf{x}^T S_N \\mathbf{x}$ is small near training data, where the model is confident.\n", + "- On the other hand, if it is far from fitted training data: $\\mathbf{x}^T S_N \\mathbf{x}$ grows as $\\mathbf{x}$ moves away, reflecting increased uncertainty.\n", + "- Lastly, larger $\\|\\mathbf{x}\\|$ increases $\\mathbf{x}^T S_N \\mathbf{x}$, leading to higher variance.\n" + ] + }, + { + "cell_type": "markdown", + "id": "15", + "metadata": {}, + "source": [ + "# Application" + ] + }, + { + "cell_type": "markdown", + "id": "16", + "metadata": {}, + "source": [ + "The above framework is implemented by the `BayesianConjugateLinearRegressor` class from `skpro`. In this section, we'll take a look at its usage." + ] + }, + { + "cell_type": "markdown", + "id": "17", + "metadata": {}, + "source": [ + "## Data" + ] + }, + { + "cell_type": "markdown", + "id": "18", + "metadata": {}, + "source": [ + "We will use the same synthetic dataset we generated in the first notebook.\n", + "\n", + "As a reminder, the true parameter values are:\n", + "- Intercept $w_0$ = 1\n", + "- First regression coefficient $w_1$ = 2\n", + "- Second regression coefficient $w_2$ = 3\n", + "- Noise variance $\\sigma$ = 0.5; in other words, noise *precision* $\\beta$ is 4. These values are assumed to be known." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19", + "metadata": {}, + "outputs": [], + "source": [ + "train_data = pd.read_csv(\"train_data.csv\", index_col=0)\n", + "X_train = train_data[[\"x1\", \"x2\"]]\n", + "y_train = train_data[\"y_train\"]\n", + "\n", + "test_data = pd.read_csv(\"test_data.csv\", index_col=0)\n", + "X_test = test_data[[\"x1\", \"x2\"]]\n", + "\n", + "style_data(train_data)" + ] + }, + { + "cell_type": "markdown", + "id": "20", + "metadata": {}, + "source": [ + "## Instantiation" + ] + }, + { + "cell_type": "markdown", + "id": "21", + "metadata": {}, + "source": [ + "To instantiate a `BayesianConjugateLinearRegressor` model, we need to define a multivariate Gaussian prior for the regression coefficients using the following parameters:\n", + "\n", + "1. **`coefs_prior_mu`**: The prior mean vector ($\\mathbf{m}_0$).
\n", + "Suppose that based on prior knowledge, we estimate that the regression coefficients are likely around 4 and 5. (Note: as mentioned earlier, the true values of $w_1$ and $w_2$ are actually $2$ and $3$, respectively, so this assumption is slightly off). For the intercept, we'll (correctly) assume it to be close to 1. Hence, we select the prior mean vector as follows; note that the first element of the vector is the intercept:\n", + " $$\n", + " \\mathbf{m}_0 = \n", + " \\begin{bmatrix}\n", + " 1 \\\\\\\\ \n", + " 4 \\\\\\\\ \n", + " 5\n", + " \\end{bmatrix}\n", + " $$\n", + "\n", + "2. **`coefs_prior_cov`**
\n", + " The prior covariance matrix ($\\mathbf{S}_0$). To keep the model simple, we'll assume that the intercept and coefficients are independent and have equal variance. This leads us to select an identity matrix for **`coefs_prior_cov`**, expressed as:\n", + "\n", + "$$\n", + "\\mathbf{S}_0 = \n", + "\\begin{bmatrix}\n", + "1 & 0 & 0 \\\\\\\\ \n", + "0 & 1 & 0 \\\\\\\\\n", + "0 & 0 & 1\n", + "\\end{bmatrix}\n", + "$$\n", + "\n", + "Additionally, we need to specify **`noise_precision`**, the known precision ($\\beta$) of the Gaussian noise in the data. For this example, we'll assume we know the true noise precision value which is $4$. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22", + "metadata": {}, + "outputs": [], + "source": [ + "COEFS_PRIOR_MU = np.array(\n", + " [\n", + " [1.0], # Prior for intercept and coefficients;\n", + " [4.0], # the 1st value is the intercept\n", + " [5.0],\n", + " ]\n", + ") # the 2nd and 3rd are the mean priors for w1 and w2\n", + "COEFS_PRIOR_COV = np.eye(3) # Covariance matrix which is Identity\n", + "NOISE_PRECISION = 4\n", + "\n", + "model = BayesianConjugateLinearRegressor(\n", + " coefs_prior_mu=COEFS_PRIOR_MU,\n", + " coefs_prior_cov=COEFS_PRIOR_COV,\n", + " noise_precision=NOISE_PRECISION,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "23", + "metadata": {}, + "source": [ + "Before fitting the model to our data, let's visualize the shape of our multivariate Gaussian prior for the coefficients $w_1$ and $w_2$." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24", + "metadata": {}, + "outputs": [], + "source": [ + "COL_NAMES = [\"Intercept\", \"w1\", \"w2\"]\n", + "display(Math(r\"\\text{Mean of the prior for regression coefficents } (\\mathbf{m}_0):\"))\n", + "display(style_data(pd.DataFrame(COEFS_PRIOR_MU, columns=[\"Value\"], index=COL_NAMES)))\n", + "\n", + "display(\n", + " Math(r\"\\text{Covariance of the prior for regression coefficents } (\\mathbf{S}_0):\")\n", + ")\n", + "display(style_data(pd.DataFrame(COEFS_PRIOR_COV, columns=COL_NAMES, index=COL_NAMES)))\n", + "\n", + "TRUE_W1 = 2\n", + "TRUE_W2 = 3\n", + "\n", + "# Plot the Gaussian distribution: Contour and 3D\n", + "fig = plt.figure(figsize=(14, 6))\n", + "\n", + "# Generate a grid of points\n", + "x1, x2 = np.linspace(0, 8, 100), np.linspace(0, 8, 100)\n", + "X1, X2 = np.meshgrid(x1, x2)\n", + "pos = np.dstack((X1, X2))\n", + "\n", + "\n", + "def multivariate_gaussian(pos, mu, cov):\n", + " # Helper function to compute the multivariate Gaussian PDF\n", + " n = mu.shape[0]\n", + " diff = pos - mu.flatten()\n", + " inv_cov = np.linalg.inv(cov)\n", + " exponent = -0.5 * np.einsum(\"...i,ij,...j->...\", diff, inv_cov, diff)\n", + " return (1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))) * np.exp(exponent)\n", + "\n", + "\n", + "# Extract the w1 and w2\n", + "Z = multivariate_gaussian(pos, COEFS_PRIOR_MU[1:], COEFS_PRIOR_COV[1:, 1:])\n", + "\n", + "# Contour plot\n", + "ax1 = fig.add_subplot(121)\n", + "ax1.contour(X1, X2, Z, levels=10, cmap=\"viridis\")\n", + "ax1.set_title(\"Contour Plot of the Bivariate Gaussian Prior\")\n", + "ax1.set_xlabel(\"$w_1$\")\n", + "ax1.set_ylabel(\"$w_2$\")\n", + "ax1.plot(TRUE_W1, TRUE_W2, marker=\"*\", color=\"red\", markersize=10, label=\"True Value\")\n", + "ax1.annotate(\n", + " \"True Value (2,3)\",\n", + " (TRUE_W1, TRUE_W2),\n", + " textcoords=\"offset points\",\n", + " xytext=(5, -18),\n", + " ha=\"center\",\n", + " color=\"red\",\n", + ")\n", + "ax1.grid(True)\n", + "\n", + "# 
3D plot\n", + "ax2 = fig.add_subplot(122, projection=\"3d\", box_aspect=(1, 1, 0.5))\n", + "ax2.plot_surface(X1, X2, Z, cmap=\"viridis\", edgecolor=\"none\", alpha=0.7)\n", + "ax2.scatter(TRUE_W1, TRUE_W2, 0, marker=\"*\", color=\"red\", s=100, label=\"True Value\")\n", + "ax2.set_title(\"3D Plot of the Bivariate Gaussian Prior\")\n", + "ax2.set_xlabel(\"$w_1$\")\n", + "ax2.set_ylabel(\"$w_2$\")\n", + "ax2.set_zticklabels([])\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "25", + "metadata": {}, + "source": [ + "## Fitting" + ] + }, + { + "cell_type": "markdown", + "id": "26", + "metadata": {}, + "source": [ + "Like other estimators in the `skpro` family, the model is fitted easily with a single `.fit` call, which requires `X_train` and `y_train` as inputs.\n", + "\n", + "If an intercept is required, a column of ones should be added to the feature matrix. This can be easily achieved using the `add_constant` function from the `statsmodels` library.\n", + "\n", + "The `fit` method calculates the posterior using the conjugate formula elaborated above.\n", + "\n", + "After fitting, the posterior becomes available through its parameters: `_coefs_posterior_mu` (mean) and `_coefs_posterior_cov` (covariance)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27", + "metadata": {}, + "outputs": [], + "source": [ + "X_train_with_ones = sm.add_constant(\n", + " X_train, prepend=True\n", + ") # prepending a constant of ones\n", + "style_data(X_train_with_ones.head())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28", + "metadata": {}, + "outputs": [], + "source": [ + "model.fit(X_train_with_ones, y_train)" + ] + }, + { + "cell_type": "markdown", + "id": "29", + "metadata": {}, + "source": [ + "As shown below, the posterior retains the same shape as the prior." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30", + "metadata": {}, + "outputs": [], + "source": [ + "model._coefs_posterior_mu.shape == model._coefs_prior_mu.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31", + "metadata": {}, + "outputs": [], + "source": [ + "# Access posterior mean and covariance\n", + "posterior_mu = model._coefs_posterior_mu.ravel() # Flatten to match dimensions\n", + "posterior_cov = model._coefs_posterior_cov\n", + "\n", + "display(Math(r\"\\text{Mean of the regression coefficients posterior } (\\mathbf{m}_N):\"))\n", + "display(\n", + " style_data(\n", + " pd.DataFrame(posterior_mu, columns=[\"Value\"], index=[\"Intercept\", \"w1\", \"w2\"])\n", + " )\n", + ")\n", + "\n", + "display(\n", + " Math(r\"\\text{Covariance of the regression coefficents posterior } (\\mathbf{S}_N):\")\n", + ")\n", + "display(\n", + " style_data(\n", + " pd.DataFrame(\n", + " posterior_cov,\n", + " columns=[\"Intercept\", \"w1\", \"w2\"],\n", + " index=[\"Intercept\", \"w1\", \"w2\"],\n", + " )\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "32", + "metadata": {}, + "source": [ + "The visualization reveals that the posterior is narrower than the prior (indicating reduced uncertainty) and it is centered closer to the true value." 
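For readers who want to see the conjugate update spelled out, the snippet below is a minimal, self-contained sketch of the closed-form posterior that a fit like the one above is assumed to compute, following Bishop's expressions $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ and $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \boldsymbol{\Phi}^\top \mathbf{t})$. The function name, variable names, and toy data are illustrative only and are not taken from the `BayesianConjugateLinearRegressor` implementation.

```python
import numpy as np


def conjugate_posterior(Phi, t, m0, S0, beta):
    """Closed-form Gaussian posterior N(m_N, S_N) over the regression weights.

    Phi  : (n, d) design matrix (first column is all ones for the intercept)
    t    : (n,) targets
    m0   : (d,) prior mean, S0 : (d, d) prior covariance
    beta : known noise precision
    """
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi          # posterior precision S_N^{-1}
    SN = np.linalg.inv(SN_inv)                    # posterior covariance S_N
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)    # posterior mean m_N
    return mN, SN


# Toy data with the same prior settings as above: m0 = [1, 4, 5], S0 = I, beta = 4
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(20), rng.uniform(0, 1, size=(20, 2))])
t = Phi @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, size=20)

mN, SN = conjugate_posterior(
    Phi, t, m0=np.array([1.0, 4.0, 5.0]), S0=np.eye(3), beta=4.0
)
print(mN)  # pulled from the prior mean toward the true weights (1, 2, 3)
```

With more data, the posterior mean drifts further from the prior mean toward the true weights, mirroring the narrowing and recentering seen in the plots.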
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33", + "metadata": {}, + "outputs": [], + "source": [ + "fig = plt.figure(figsize=(14, 6))\n", + "\n", + "# Compute the posterior density of w1 and w2\n", + "Z = multivariate_gaussian(pos, posterior_mu[1:], posterior_cov[1:, 1:])\n", + "\n", + "# Contour plot\n", + "ax1 = fig.add_subplot(121)\n", + "ax1.contour(X1, X2, Z, levels=10, cmap=\"viridis\")\n", + "ax1.set_title(\"Contour Plot of the Bivariate Gaussian Posterior\")\n", + "ax1.set_xlabel(\"$w_1$\")\n", + "ax1.set_ylabel(\"$w_2$\")\n", + "ax1.plot(TRUE_W1, TRUE_W2, marker=\"*\", color=\"red\", markersize=10, label=\"True Value\")\n", + "ax1.annotate(\n", + " \"True Value (2, 3)\",\n", + " xy=(TRUE_W1, TRUE_W2),\n", + " xytext=(50, 50),\n", + " textcoords=\"offset points\",\n", + " ha=\"center\",\n", + " color=\"red\",\n", + " arrowprops=dict(arrowstyle=\"->\", color=\"red\", lw=1.5, alpha=0.5),\n", + ")\n", + "ax1.grid(True)\n", + "\n", + "# 3D plot\n", + "ax2 = fig.add_subplot(122, projection=\"3d\", box_aspect=(1, 1, 0.5))\n", + "ax2.plot_surface(X1, X2, Z, cmap=\"viridis\", edgecolor=\"none\", alpha=0.7)\n", + "ax2.scatter(TRUE_W1, TRUE_W2, 0, marker=\"*\", color=\"red\", s=100, label=\"True Value\")\n", + "ax2.set_title(\"3D Plot of the Bivariate Gaussian Posterior\")\n", + "ax2.set_xlabel(\"$w_1$\")\n", + "ax2.set_ylabel(\"$w_2$\")\n", + "ax2.set_zticklabels([])\n", + "\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "34", + "metadata": {}, + "source": [ + "## Update" + ] + }, + { + "cell_type": "markdown", + "id": "35", + "metadata": {}, + "source": [ + "One significant advantage of the Bayesian approach is its simplicity in handling updates, i.e., retraining the model when new data becomes available.\n", + "\n", + "First, we'll generate additional synthetic training data.\n", + "\n", + "As before, a column of ones must be prepended to the feature matrix to include the intercept." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36", + "metadata": {}, + "outputs": [], + "source": [ + "# Generate new random data points for X_train_update\n", + "np.random.seed(43)\n", + "N = 40\n", + "x1_new = np.random.uniform(0, 1, N)\n", + "x2_new = np.random.uniform(0, 1, N)\n", + "\n", + "X_train_update = pd.DataFrame({\"x1\": x1_new, \"x2\": x2_new})\n", + "X_train_update_with_ones = sm.add_constant(\n", + " X_train_update, prepend=True\n", + ") # prepending a constant of ones\n", + "\n", + "# Set the true relationship between the features and the target variable\n", + "TRUE_INTERCEPT = 1\n", + "TRUE_SLOPES = np.array([2, 3])\n", + "TRUE_SIGMA = 0.5\n", + "\n", + "# Calculate y_true and y_train for the new data points\n", + "y_true_new = TRUE_INTERCEPT + np.dot(X_train_update, TRUE_SLOPES)\n", + "y_train_update = pd.Series(y_true_new + np.random.normal(0, TRUE_SIGMA, size=N))" + ] + }, + { + "cell_type": "markdown", + "id": "37", + "metadata": {}, + "source": [ + "We'll then perform the update with a single call to the `update` method.\n", + "\n", + "The `update` method applies the same conjugacy framework, treating the posterior from the previous training as the new prior and updating it with the additional data."
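To make the "posterior becomes the new prior" idea concrete, here is a small, self-contained sketch (with made-up names and toy data, not the estimator's actual internals) showing that two sequential conjugate updates give exactly the same posterior as a single batch fit on the pooled data; this exact equivalence is what makes `update` both cheap and principled.

```python
import numpy as np


def posterior_update(Phi, t, m_prior, S_prior, beta):
    # One conjugate update step: Gaussian prior in, Gaussian posterior out
    S_prior_inv = np.linalg.inv(S_prior)
    S_post = np.linalg.inv(S_prior_inv + beta * Phi.T @ Phi)
    m_post = S_post @ (S_prior_inv @ m_prior + beta * Phi.T @ t)
    return m_post, S_post


rng = np.random.default_rng(1)
true_w = np.array([1.0, 2.0, 3.0])
beta, m0, S0 = 4.0, np.zeros(3), np.eye(3)

# Two batches of toy data
Phi1 = np.column_stack([np.ones(20), rng.uniform(0, 1, size=(20, 2))])
Phi2 = np.column_stack([np.ones(40), rng.uniform(0, 1, size=(40, 2))])
t1 = Phi1 @ true_w + rng.normal(0, 0.5, size=20)
t2 = Phi2 @ true_w + rng.normal(0, 0.5, size=40)

# Sequential: fit on batch 1, then treat that posterior as the prior for batch 2
m1, S1 = posterior_update(Phi1, t1, m0, S0, beta)
m_seq, S_seq = posterior_update(Phi2, t2, m1, S1, beta)

# Batch: a single fit on the pooled data
m_all, S_all = posterior_update(
    np.vstack([Phi1, Phi2]), np.concatenate([t1, t2]), m0, S0, beta
)

print(np.allclose(m_seq, m_all), np.allclose(S_seq, S_all))  # True True
```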
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38", + "metadata": {}, + "outputs": [], + "source": [ + "model.update(X_train_update_with_ones, y_train_update)" + ] + }, + { + "cell_type": "markdown", + "id": "39", + "metadata": {}, + "source": [ + "The visualization below shows that the posterior after the update becomes even narrower and moves even closer to the true value." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40", + "metadata": {}, + "outputs": [], + "source": [ + "# Access posterior mean and covariance\n", + "posterior_mu = model._coefs_posterior_mu.ravel() # Flatten to match dimensions\n", + "posterior_cov = model._coefs_posterior_cov\n", + "\n", + "display(Math(r\"\\text{Mean of the regression coefficients posterior } (\\mathbf{m}_N):\"))\n", + "display(\n", + " style_data(\n", + " pd.DataFrame(posterior_mu, columns=[\"Value\"], index=[\"Intercept\", \"w1\", \"w2\"])\n", + " )\n", + ")\n", + "\n", + "display(\n", + " Math(r\"\\text{Covariance of the regression coefficients posterior } (\\mathbf{S}_N):\")\n", + ")\n", + "display(\n", + " style_data(\n", + " pd.DataFrame(\n", + " posterior_cov,\n", + " columns=[\"Intercept\", \"w1\", \"w2\"],\n", + " index=[\"Intercept\", \"w1\", \"w2\"],\n", + " )\n", + " )\n", + ")\n", + "\n", + "\n", + "# Plot the posterior Gaussian distribution: Contour and 3D\n", + "fig = plt.figure(figsize=(14, 6))\n", + "\n", + "# Compute the posterior density of w1 and w2\n", + "Z = multivariate_gaussian(pos, posterior_mu[1:], posterior_cov[1:, 1:])\n", + "\n", + "# Contour plot\n", + "ax1 = fig.add_subplot(121)\n", + "ax1.contour(X1, X2, Z, levels=10, cmap=\"viridis\")\n", + "ax1.set_title(\"Contour Plot of the Bivariate Gaussian Posterior after Update\")\n", + "ax1.set_xlabel(\"$w_1$\")\n", + "ax1.set_ylabel(\"$w_2$\")\n", + "ax1.plot(TRUE_W1, TRUE_W2, marker=\"*\", color=\"red\", markersize=10, label=\"True Value\")\n", + "ax1.annotate(\n", + " \"True Value (2,3)\",\n", + " (TRUE_W1, TRUE_W2),\n", + " textcoords=\"offset points\",\n", + " xytext=(3, -28),\n", + " ha=\"center\",\n", + " color=\"red\",\n", + ")\n", + "ax1.grid(True)\n", + "\n", + "# 3D plot\n", + "ax2 = fig.add_subplot(122, projection=\"3d\", box_aspect=(1, 1, 0.5))\n", + "ax2.plot_surface(X1, X2, Z, cmap=\"viridis\", edgecolor=\"none\", alpha=0.7)\n", + "ax2.scatter(TRUE_W1, TRUE_W2, 0, marker=\"*\", color=\"red\", s=100, label=\"True Value\")\n", + "ax2.set_title(\"3D Plot of the Bivariate Gaussian Posterior after Update\")\n", + "ax2.set_xlabel(\"$w_1$\")\n", + "ax2.set_ylabel(\"$w_2$\")\n", + "ax2.set_zticklabels([])\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "41", + "metadata": {}, + "source": [ + "## (Probabilistic) Prediction" + ] + }, + { + "cell_type": "markdown", + "id": "42", + "metadata": {}, + "source": [ + "With the fitted model, we can now make predictions. Below is the test dataset, `X_test`, which will be used for generating predictions:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43", + "metadata": {}, + "outputs": [], + "source": [ + "style_data(X_test)" + ] + }, + { + "cell_type": "markdown", + "id": "44", + "metadata": {}, + "source": [ + "Probabilistic predictions are made using the `predict_proba` method, which requires the test features as input. As before, a column of ones must be prepended to `X_test`."
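Before running `predict_proba`, it may help to see where the Normal predictive distribution comes from. For this model, the posterior predictive for a new input $\mathbf{x}_*$ is $p(t_* \mid \mathbf{x}_*) = \mathcal{N}\big(\mathbf{m}_N^\top \mathbf{x}_*, \; \beta^{-1} + \mathbf{x}_*^\top \mathbf{S}_N \mathbf{x}_*\big)$. The sketch below assembles that mean and variance by hand and reads off an 80% central interval from the resulting Normal, anticipating the `predict_interval` call shown later. The numbers here are illustrative placeholders rather than the notebook's fitted values, and the helper name is made up.

```python
import numpy as np
from scipy.stats import norm


def predictive_normal(x_star, m_N, S_N, beta):
    """Posterior predictive mean and variance for one row x_star (leading 1 included)."""
    mean = x_star @ m_N
    var = 1.0 / beta + x_star @ S_N @ x_star  # noise variance + weight uncertainty
    return mean, var


# Illustrative placeholder values (not the notebook's fitted posterior)
m_N = np.array([1.0, 2.0, 3.0])
S_N = 0.01 * np.eye(3)
beta = 4.0
x_star = np.array([1.0, 0.5, 0.2])  # [1, x1, x2]

mean, var = predictive_normal(x_star, m_N, S_N, beta)
lower, upper = norm.interval(0.80, loc=mean, scale=np.sqrt(var))  # 80% credible interval
print(mean, var, (lower, upper))
```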
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45", + "metadata": {}, + "outputs": [], + "source": [ + "X_test_with_ones = sm.add_constant(\n", + " X_test, prepend=True\n", + ") # prepending a constant of ones\n", + "y_test_pred_proba = model.predict_proba(X_test_with_ones)\n", + "y_test_pred_proba" + ] + }, + { + "cell_type": "markdown", + "id": "46", + "metadata": {}, + "source": [ + "### Plotting posterior predictive PDF" + ] + }, + { + "cell_type": "markdown", + "id": "47", + "metadata": {}, + "source": [ + "The prediction output, `y_test_pred_proba`, is an instance of `skpro`'s `Normal` distribution, containing the same number of data points as `X_test`. \n", + "\n", + "A key advantage of this Bayesian estimator is that each prediction is a Normal distribution, making it straightforward to work with. \n", + "\n", + "For example, we can easily plot the probability density function of the predictions using `plot` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "48", + "metadata": {}, + "outputs": [], + "source": [ + "_ = y_test_pred_proba.plot(\"pdf\")" + ] + }, + { + "cell_type": "markdown", + "id": "49", + "metadata": {}, + "source": [ + "### Predictive Credible Intervals" + ] + }, + { + "cell_type": "markdown", + "id": "50", + "metadata": {}, + "source": [ + "One great thing about using probabilities in predictions is how easily we can calculate **predictive credible intervals**. These are the Bayesian version of confidence intervals in frequentist statistics.\n", + "\n", + "A credible interval gives you a range where the true prediction is likely to land, based on a specific level of certainty. For instance, a 95% predictive credible interval means there’s a 95% chance that the prediction will fall within that range.\n", + "\n", + "The key difference from confidence intervals is how they’re interpreted. Confidence intervals are about long-term averages—how often the interval would contain the true value if you repeated the experiment a bunch of times. Credible intervals, on the other hand, are much more straightforward and interpretable: they tell you the probability of the prediction being in that range for the case you’re looking at right now.\n", + "\n", + "The credible intervals are obtained using the model's `predict_interval` method, as shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51", + "metadata": {}, + "outputs": [], + "source": [ + "predictive_credible_interval = model.predict_interval(X_test_with_ones, coverage=0.8)\n", + "style_data(predictive_credible_interval)" + ] + }, + { + "cell_type": "markdown", + "id": "52", + "metadata": {}, + "source": [ + "### Point predictions" + ] + }, + { + "cell_type": "markdown", + "id": "53", + "metadata": {}, + "source": [ + "To obtain point predictions, we can use the `predict` method instead. This will return the median of the posterior predictive distributions described above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54", + "metadata": {}, + "outputs": [], + "source": [ + "y_test_pred = model.predict(X_test_with_ones)\n", + "style_data(y_test_pred)" + ] + }, + { + "cell_type": "markdown", + "id": "55", + "metadata": {}, + "source": [ + "## Advantages \n", + "\n", + "The main advantage of the conjugate prior frameworks such as the one we saw above is their **analytical simplicity**. When a conjugate prior is used, the posterior distribution can be found analytically. 
This eliminates the need for computationally intensive methods such as MCMC. This simplicity and the presence of a closed-form solution also extend to the derivation of posterior predictive distributions.\n", + "\n", + "Another key benefit is **computational efficiency**. The availability of a closed-form solution for the posterior distribution makes both the initial fit and subsequent updates fast.\n", + "\n", + "Finally, conjugate priors enhance **interpretability**. Since the prior and posterior share the same functional form, it is easier to see how prior beliefs combine with observed data to produce the posterior distribution. This transparency allows for deeper insights into how the data and prior assumptions influence the final results; a concrete illustration of this point is sketched right after the references below.\n", + "\n", + "## Disadvantages\n", + "\n", + "Conjugate priors suffer from **limited flexibility**. They impose constraints on the form of the prior distribution, and the chosen prior may not accurately reflect true prior knowledge about the parameters. To reiterate: a conjugate prior exists only for particular likelihood-prior pairings, namely those for which the posterior stays in the same distribution family as the prior. Thus, if the likelihood at hand does not admit such a prior, the conjugate strategy cannot be applied.\n", + "\n", + "Additionally, conjugate priors are **inadequate for complex models**. Modern Bayesian models, such as neural networks or probabilistic graphical models, frequently require non-conjugate priors to capture intricate relationships between variables. In such cases, other inference techniques like MCMC or Variational Inference (VI) are necessary to approximate the posterior.\n", + "\n", + "## Conclusion\n", + "\n", + "While conjugate priors excel in providing analytical tractability and computational efficiency, they are inherently rigid. They are most effective for simple, structured models or situations where interpretability and fast computations are prioritized." + ] + }, + { + "cell_type": "markdown", + "id": "56", + "metadata": { + "id": "cNKM7EPXJXij" + }, + "source": [ + "# References" + ] + }, + { + "cell_type": "markdown", + "id": "57", + "metadata": { + "id": "KzuX_hc8vsD7" + }, + "source": [ + "- [Bishop - Pattern Recognition and Machine Learning (2006)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) - A comprehensive reference on machine learning theory and Bayesian methods by Christopher M. Bishop."
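As promised above, here is a concrete illustration of how prior assumptions surface in the final estimates. For the special case of a zero-mean isotropic prior $\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ (which differs from the prior actually used in this notebook), the posterior mean coincides with the ridge-regression estimate with penalty $\lambda = \alpha/\beta$, a standard identity discussed in Bishop. The sketch below, with made-up toy data, checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = np.column_stack([np.ones(30), rng.uniform(0, 1, size=(30, 2))])
t = Phi @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, size=30)

alpha, beta = 1.0, 4.0  # prior precision and noise precision
d = Phi.shape[1]

# Bayesian posterior mean under the zero-mean isotropic prior N(0, alpha^-1 I)
S_N = np.linalg.inv(alpha * np.eye(d) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Ridge estimate with lambda = alpha / beta (penalty applied to all weights, intercept included)
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

print(np.allclose(m_N, w_ridge))  # True
```

In other words, the strength of the prior relative to the noise plays the role of the regularization parameter, which is one way the Bayesian and frequentist views of linear regression connect.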
+ ] + }, + { + "cell_type": "markdown", + "id": "58", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Credits\n", + "\n", + "Notebook creation: `meraldoantonio`\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "59", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "hide_input": false, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.20" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/bayesian/test_data.csv b/examples/bayesian/test_data.csv new file mode 100644 index 000000000..fbb26ca98 --- /dev/null +++ b/examples/bayesian/test_data.csv @@ -0,0 +1,6 @@ +,x1,x2,y_true,y_train +11,0.9699098521619943,0.3046137691733707,3.8536610118441006,3.6611198716359423 +16,0.3042422429595377,0.4951769101112702,3.094015216252886,2.674406454641567 +1,0.9507143064099162,0.19967378215835974,3.5004499592949117,3.5861340998898967 +5,0.15599452033620265,0.6075448519014384,3.1346235963767204,2.774701492179366 +3,0.5986584841970366,0.5924145688620425,3.9745606749802005,3.824008827185556 diff --git a/examples/bayesian/train_data.csv b/examples/bayesian/train_data.csv new file mode 100644 index 000000000..ba40d6e29 --- /dev/null +++ b/examples/bayesian/train_data.csv @@ -0,0 +1,21 @@ +,x1,x2,y_true,y_train +17,0.5247564316322378,0.034388521115218396,2.152678426610131,1.9980722386845238 +7,0.8661761457749352,0.06505159298527952,2.927507070505709,3.4560681836151668 +12,0.8324426408004217,0.09767211400638387,2.9579016236199953,2.619440623467016 +18,0.43194501864211576,0.9093204020787821,4.591851243520578,4.75748295922236 +15,0.18340450985343382,0.12203823484477883,1.732923724241204,2.1985637837993033 +8,0.6011150117432088,0.9488855372533332,5.048886635246418,5.220695780030648 +10,0.020584494295802447,0.8083973481164611,3.466361032940988,3.6284030176383855 +20,0.6118528947223795,0.662522284353982,4.211272642506705,3.97168552358406 +19,0.2912291401980419,0.2587799816000169,2.358798225196135,2.8465707887573144 +6,0.05808361216819946,0.17052412368729153,1.6277395953982734,1.3974202099183797 +9,0.7080725777960455,0.9656320330745594,5.313041254815769,4.431521177134401 +21,0.13949386065204183,0.31171107608941095,2.2141209495723166,2.121291461240408 +23,0.3663618432936917,0.5467102793432796,3.3728545246172223,2.774751212576887 +22,0.29214464853521815,0.5200680211778108,3.1444933606038687,2.5913258736008546 +13,0.21233911067827616,0.6842330265121569,3.477377300893023,3.783215445313457 +14,0.18182496720710062,0.4401524937396013,2.6841074156330054,3.199607176880981 +4,0.15601864044243652,0.046450412719997725,1.4513885190448663,0.7121275238611526 +24,0.45606998421703593,0.18485445552552704,2.466703335010653,2.872966246207752 +0,0.3745401188473625,0.7851759613930136,4.104608121873766,4.473841411871471 +2,0.7319939418114051,0.5142344384136116,4.0066911988636456,3.9488670576695255 diff --git a/examples/bayesian/utils.py b/examples/bayesian/utils.py new file mode 100644 index 
000000000..85b7b1ec3 --- /dev/null +++ b/examples/bayesian/utils.py @@ -0,0 +1,76 @@ +"""Utility function for Bayesian example notebooks.""" + +import pandas as pd + + +def style_data(data, vmax=None, subset=None, cmap="coolwarm", hide_index=False): + """ + Apply styling to a Series or DataFrame. + + Parameters + ---------- + data : pd.DataFrame or pd.Series + The data to style. + vmax : float, optional + The maximum numeric value for the color spectrum. + Defaults to the max value in the data. + subset : list, optional + List of columns to which the formatting is to be applied (for DataFrames). + cmap : str, optional + The color map to apply for the gradient. Defaults to 'coolwarm'. + hide_index : bool, optional + If True, hide the index in the output. Defaults to False. + + Returns + ------- + pd.io.formats.style.Styler + The styled Series or DataFrame. + """ + # Check if input is a Series or DataFrame + if isinstance(data, pd.Series): + # For Series, directly apply background gradient and format + if vmax is None: + vmax = data.max() + styled_data = ( + data.to_frame() + .style.background_gradient(cmap=cmap, axis=0, vmin=-vmax, vmax=vmax) + .format("{:.3f}") + ) + + # Hide the index if requested + if hide_index: + styled_data = styled_data.hide(axis="index") + + elif isinstance(data, pd.DataFrame): + # Determine the max value for the gradient if not provided + if vmax is None: + vmax = data.select_dtypes(include=["number"]).max().max() + + # If no subset provided, apply to all float columns by default + if subset is None: + subset = pd.IndexSlice[:, data.select_dtypes(include=["float64"]).columns] + + # Apply background gradient to numeric columns and format to 3 decimal points + styled_data = data.style.background_gradient( + cmap=cmap, axis=None, vmin=-vmax, vmax=vmax, subset=subset + ).format("{:.3f}", subset=subset) + + # Color boolean columns (pink for False, lightblue for True) + bool_columns = data.select_dtypes(include=["bool"]).columns + + def color_boolean(val): + color = "lightblue" if val else "pink" + return f"background-color: {color}" + + # Apply the boolean-specific styling if any boolean columns exist + if not bool_columns.empty: + styled_data = styled_data.applymap(color_boolean, subset=bool_columns) + + # Hide the index if requested + if hide_index: + styled_data = styled_data.hide(axis="index") + + else: + raise TypeError("Input must be a pandas DataFrame or Series.") + + return styled_data From ce535b60fb97c4fa3b7bc441eb40c7a866479ced Mon Sep 17 00:00:00 2001 From: Meraldo Antonio Date: Sun, 19 Jan 2025 23:33:38 +0800 Subject: [PATCH 2/2] Added contribution --- .all-contributorsrc | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.all-contributorsrc b/.all-contributorsrc index 7d9c22ed8..010ce4529 100644 --- a/.all-contributorsrc +++ b/.all-contributorsrc @@ -184,7 +184,9 @@ "contributions": [ "bug", "code", - "doc" + "doc", + "example", + "tutorial" ] }, {