Fixed minibatching in NN notebook (#773)
* Fixed minibatching in NN notebook

* Updated text on prediction

fonnesbeck authored Feb 12, 2025
1 parent 18cb11d commit 321c57e
Showing 4 changed files with 120 additions and 6,058 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -8,3 +8,5 @@ build
jupyter_execute
_thumbnails
examples/gallery.rst

pixi.lock
289 changes: 74 additions & 215 deletions examples/variational_inference/bayesian_neural_network_advi.ipynb

Large diffs are not rendered by default.

94 changes: 44 additions & 50 deletions examples/variational_inference/bayesian_neural_network_advi.myst.md
@@ -5,7 +5,7 @@ jupytext:
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3 (ipykernel)
  display_name: default
  language: python
  name: python3
---
@@ -114,7 +114,7 @@ A neural network is quite simple. The basic unit is a [perceptron](https://en.wi
jupyter:
  outputs_hidden: true
---
def construct_nn():
def construct_nn(batch_size=50):
    n_hidden = 5
    # Initialize random weights between each layer
@@ -130,12 +130,13 @@ def construct_nn():
    }
    with pm.Model(coords=coords) as neural_network:
        # Define minibatch variables
        minibatch_x, minibatch_y = pm.Minibatch(X_train, Y_train, batch_size=50)
        # Define data variables using minibatches
        ann_input = pm.Data("ann_input", minibatch_x, mutable=True, dims=("obs_id", "train_cols"))
        ann_output = pm.Data("ann_output", minibatch_y, mutable=True, dims="obs_id")
        X_data = pm.Data("X_data", X_train, dims=("obs_id", "train_cols"))
        Y_data = pm.Data("Y_data", Y_train, dims="obs_id")
        # Define minibatch variables
        ann_input, ann_output = pm.Minibatch(X_data, Y_data, batch_size=batch_size)
        # Weights from input to hidden layer
        weights_in_1 = pm.Normal(
@@ -161,7 +162,6 @@ def construct_nn():
            act_out,
            observed=ann_output,
            total_size=X_train.shape[0],  # IMPORTANT for minibatches
            dims="obs_id",
        )
    return neural_network
@@ -174,12 +174,16 @@ That's not so bad. The `Normal` priors help regularize the weights. Usually we w

+++

### Variational Inference: Scaling model complexity
## Variational Inference: Scaling model complexity

We could now just run an MCMC sampler like {class}`pymc.NUTS`, which works pretty well in this case, but as was already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.

Instead, we will use the {class}`pymc.ADVI` variational inference algorithm. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior.

### Mini-batch ADVI

While this simulated dataset is small enough to fit all at once, it would not scale to something big like ImageNet. In the model above, we have set up minibatches that will allow for scaling to larger data sets. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence.
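The fitting cell below is partially collapsed in this diff; a minimal sketch of the fit itself, assuming the `construct_nn` function and the `X_train`/`Y_train` data prepared earlier (the iteration count is illustrative, not the notebook's exact setting):

```python
# Sketch only: build the mini-batched model and fit it with ADVI.
# The iteration count here is an illustrative choice.
neural_network = construct_nn(batch_size=50)

with neural_network:
    approx = pm.fit(n=30_000, method=pm.ADVI())
```

Because the likelihood was given `total_size=X_train.shape[0]`, ADVI rescales the minibatch log-likelihood to the full dataset during this fit.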

```{code-cell} ipython3
%%time
@@ -199,17 +203,38 @@ plt.xlabel("iteration");
trace = approx.sample(draws=5000)
```

Now that we've trained our model, let's predict on the hold-out set using a posterior predictive check (PPC). We can use {func}`~pymc.sample_posterior_predictive` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).
Now that we've trained our model, let's predict on the hold-out set using a posterior predictive check (PPC). We can use {func}`pymc.sample_posterior_predictive` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).

To predict on the entire test set (and not just the minibatches), we need to create a new model object that removes the minibatches. Notice that we are using our fitted `trace` to sample from the posterior predictive distribution, so the posterior estimates come from the original model. There is no new inference here; we are just using the same model structure and the same posterior estimates to generate predictions. The {class}`Flat` distribution is just a placeholder to make the model work; the actual values are sampled from the posterior.

```{code-cell} ipython3
---
jupyter:
  outputs_hidden: true
---
with neural_network:
    pm.set_data(new_data={"ann_input": X_test})
    ppc = pm.sample_posterior_predictive(trace)
    trace.extend(ppc)
def sample_posterior_predictive(X_test, Y_test, trace, n_hidden=5):
    coords = {
        "hidden_layer_1": np.arange(n_hidden),
        "hidden_layer_2": np.arange(n_hidden),
        "train_cols": np.arange(X_test.shape[1]),
        "obs_id": np.arange(X_test.shape[0]),
    }
    with pm.Model(coords=coords):
        ann_input = X_test
        ann_output = Y_test
        weights_in_1 = pm.Flat("w_in_1", dims=("train_cols", "hidden_layer_1"))
        weights_1_2 = pm.Flat("w_1_2", dims=("hidden_layer_1", "hidden_layer_2"))
        weights_2_out = pm.Flat("w_2_out", dims="hidden_layer_2")
        # Build neural-network using tanh activation function
        act_1 = pm.math.tanh(pm.math.dot(ann_input, weights_in_1))
        act_2 = pm.math.tanh(pm.math.dot(act_1, weights_1_2))
        act_out = pm.math.sigmoid(pm.math.dot(act_2, weights_2_out))
        # Binary classification -> Bernoulli likelihood
        out = pm.Bernoulli("out", act_out, observed=ann_output)
        return pm.sample_posterior_predictive(trace)


ppc = sample_posterior_predictive(X_test, Y_test, trace)
```

We can average the predictions for each observation to estimate the underlying probability of class 1.
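
A minimal sketch of that averaging step, assuming `ppc` holds the posterior predictive samples returned above and `"out"` is the observed variable:

```python
# Sketch: average posterior predictive draws of "out" over chains and draws to
# estimate p(class 1) per observation, then threshold at 0.5 for a hard label.
pred = ppc.posterior_predictive["out"].mean(("chain", "draw")) > 0.5

# Illustrative accuracy check against the held-out labels
print(f"Accuracy = {(Y_test == pred.values).mean() * 100:.1f}%")
```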
@@ -250,18 +275,7 @@ dummy_out = np.ones(grid_2d.shape[0], dtype=np.int8)
```

```{code-cell} ipython3
---
jupyter:
  outputs_hidden: true
---
coords_eval = {
    "train_cols": np.arange(grid_2d.shape[1]),
    "obs_id": np.arange(grid_2d.shape[0]),
}
with neural_network:
    pm.set_data(new_data={"ann_input": grid_2d, "ann_output": dummy_out}, coords=coords_eval)
    ppc = pm.sample_posterior_predictive(trace)
ppc = sample_posterior_predictive(grid_2d, dummy_out, trace)
```
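
The plotting cell that follows is collapsed; a sketch of the kind of probability surface it draws, assuming the evaluation grid `grid_2d` built earlier is a flattened 100×100 mesh:

```python
# Sketch: mean posterior predictive probability over the evaluation grid.
side = 100  # assumed grid resolution
xx = grid_2d[:, 0].reshape(side, side)
yy = grid_2d[:, 1].reshape(side, side)
y_pred = ppc.posterior_predictive["out"].mean(("chain", "draw")).values.reshape(side, side)

fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(xx, yy, y_pred, cmap="RdBu_r")
ax.scatter(X_test[Y_test == 0, 0], X_test[Y_test == 0, 1], color="C0")
ax.scatter(X_test[Y_test == 1, 0], X_test[Y_test == 1, 1], color="C1")
fig.colorbar(contour, ax=ax)
ax.set(xlabel="X1", ylabel="X2", title="Mean posterior predictive probability of class 1");
```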

```{code-cell} ipython3
@@ -304,27 +318,6 @@ We can see that very close to the decision boundary, our uncertainty as to which

+++
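
A sketch of one way to visualize that boundary uncertainty, under the same grid assumptions as the surface plot above, using the posterior predictive standard deviation instead of the mean:

```python
# Sketch: posterior predictive standard deviation over the grid as an uncertainty map;
# values are largest near the decision boundary. Assumes a 100x100 grid as above.
side = 100  # assumed grid resolution
xx = grid_2d[:, 0].reshape(side, side)
yy = grid_2d[:, 1].reshape(side, side)
y_uncertainty = ppc.posterior_predictive["out"].std(("chain", "draw")).values.reshape(side, side)

fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(xx, yy, y_uncertainty, cmap="Greys")
fig.colorbar(contour, ax=ax)
ax.set(xlabel="X1", ylabel="X2", title="Uncertainty (posterior predictive standard deviation)");
```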

## Mini-batch ADVI

So far, we have trained our model on all data at once. Obviously this won't scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence.

Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up:

```{code-cell} ipython3
minibatch_x, minibatch_y = pm.Minibatch(X_train, Y_train, batch_size=50)
neural_network_minibatch = construct_nn(minibatch_x, minibatch_y)
with neural_network_minibatch:
    approx = pm.fit(40000, method=pm.ADVI())
```

```{code-cell} ipython3
plt.plot(approx.hist)
plt.ylabel("ELBO")
plt.xlabel("iteration");
```

As you can see, mini-batch ADVI's running time is much lower. It also seems to converge faster.

For fun, we can also look at the trace. The point is that we also get uncertainty estimates for our neural network weights.
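
A sketch of one way to do that, assuming ArviZ is imported as `az` earlier in the notebook and using the first-layer weight variable `w_in_1` defined above:

```python
# Sketch: trace/density plot of the first-layer weights to inspect their posterior spread.
az.plot_trace(trace, var_names=["w_in_1"]);
```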

```{code-cell} ipython3
@@ -352,6 +345,7 @@ You might argue that the above network isn't really deep, but note that we could
- This notebook was originally authored as a [blog post](https://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/) by Thomas Wiecki in 2016
- Updated by Chris Fonnesbeck for PyMC v4 in 2022
- Updated by Oriol Abril-Pla and Earl Bellinger in 2023
- Updated by Chris Fonnesbeck in 2024

## Watermark
