Original: https://tensorflow.google.cn/tutorials/keras/overfit_and_underfit
As always, the code in this example will use the tf.keras API, which you can learn more about in the TensorFlow Keras guide.
In both of the previous examples (classifying text and predicting fuel efficiency), we saw that the accuracy of our model on the validation data would peak after training for a number of epochs, and would then stagnate or start decreasing.
In other words, our model would overfit to the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the training set, what we really want is to develop models that generalize well to a testing set (or data they haven't seen before).
The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the training data. This can happen for a number of reasons: the model is not powerful enough, is over-regularized, or has simply not been trained long enough. In each case, the network has not learned the relevant patterns in the training data.
If you train for too long, though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs, as we'll explore below, is a useful skill.
To prevent overfitting, the best solution is to use more complete training data. The dataset should cover the full range of inputs that the model is expected to handle. Additional data may only be useful if it covers new and interesting cases.
A model trained on more complete data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.
In this notebook, we'll explore several common regularization techniques, and use them to improve on a classification model.
Before getting started, import the necessary packages:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import regularizers
print(tf.__version__)
2.3.1
!pip install -q git+https://github.com/tensorflow/docs
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import pathlib
import shutil
import tempfile
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)
The goal of this tutorial is not to do particle physics, so don't dwell on the details of the Higgs dataset used here. It contains 11,000,000 examples, each with 28 features and a binary class label.
gz = tf.keras.utils.get_file('HIGGS.csv.gz', 'http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz')
Downloading data from http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz
2816409600/2816407858 [==============================] - 230s 0us/step
FEATURES = 28
The tf.data.experimental.CsvDataset class can be used to read CSV records directly from a gzip file with no intermediate decompression step.
ds = tf.data.experimental.CsvDataset(gz,[float(),]*(FEATURES+1), compression_type="GZIP")
That CSV reader class returns a list of scalars for each record. The following function repacks that list of scalars into a (feature_vector, label) pair.
def pack_row(*row):
    # The first column is the label; the remaining columns are the features.
    label = row[0]
    features = tf.stack(row[1:], 1)
    return features, label
TensorFlow is most efficient when operating on large batches of data.
So instead of repacking each row individually, make a new Dataset that takes batches of 10,000 examples, applies the pack_row function to each batch, and then splits the batches back up into individual records:
packed_ds = ds.batch(10000).map(pack_row).unbatch()
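The repacking step can be pictured without TensorFlow. The sketch below assumes a batch arrives as one sequence per CSV column, with the label column first; the helper name pack_columns and the toy values are made up for illustration. Stacking the feature columns along axis 1 then amounts to a transpose, giving one feature row per example.

```python
def pack_columns(*columns):
    """Plain-Python analogue of pack_row applied to one batch.

    Each argument is one CSV column for the whole batch. The first
    column holds the labels, the remaining columns hold the features.
    """
    labels = list(columns[0])
    # Similar in spirit to tf.stack(columns[1:], axis=1):
    # turn per-column sequences into one feature row per example.
    features = [list(row) for row in zip(*columns[1:])]
    return features, labels


# Toy batch of 3 examples with a label and 2 features (hypothetical values).
label_col = [1.0, 0.0, 1.0]
feat_a = [0.1, 0.2, 0.3]
feat_b = [10.0, 20.0, 30.0]

features, labels = pack_columns(label_col, feat_a, feat_b)
print(features)
print(labels)
```

Because the function works on whole columns, it is called once per batch rather than once per record, which is the point of batching before the map.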
Have a look at some of the records from this new packed_ds.
The features are not perfectly normalized, but this is sufficient for this tutorial.
for features, label in packed_ds.batch(1000).take(1):
    print(features[0])
    plt.hist(features.numpy().flatten(), bins=101)
tf.Tensor(
[ 0.8692932 -0.6350818 0.22569026 0.32747006 -0.6899932 0.75420225
-0.24857314 -1.0920639 0. 1.3749921 -0.6536742 0.9303491
1.1074361 1.1389043 -1.5781983 -1.0469854 0. 0.65792954
-0.01045457 -0.04576717 3.1019614 1.35376 0.9795631 0.97807616
0.92000484 0.72165745 0.98875093 0.87667835], shape=(28,), dtype=float32)
To keep this tutorial relatively short, use just the first 1,000 samples for validation, and the next 10,000 for training:
N_VALIDATION = int(1e3)
N_TRAIN = int(1e4)
BUFFER_SIZE = int(1e4)
BATCH_SIZE = 500
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE
The Dataset.skip and Dataset.take methods make this easy.
At the same time, use the Dataset.cache method to ensure that the loader doesn't need to re-read the data from the file on each epoch:
validate_ds = packed_ds.take(N_VALIDATION).cache()
train_ds = packed_ds.skip(N_VALIDATION).take(N_TRAIN).cache()
train_ds
<CacheDataset shapes: ((28,), ()), types: (tf.float32, tf.float32)>
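As a rough analogy, take and skip behave like slicing a stream of records. The sketch below uses itertools.islice on a plain range as a stand-in for packed_ds; the helper names and the small split sizes are assumptions for illustration only.

```python
from itertools import islice


def take(iterable, n):
    """Yield only the first n items."""
    return islice(iterable, n)


def skip(iterable, n):
    """Yield everything after the first n items."""
    return islice(iterable, n, None)


records = range(20)           # stand-in for packed_ds
n_validation, n_train = 3, 5  # toy sizes, not the tutorial's values

validation = list(take(records, n_validation))
train = list(take(skip(records, n_validation), n_train))

print(validation)  # the first 3 records
print(train)       # the next 5 records
```

Skipping the validation records before taking the training records keeps the two splits disjoint.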
These datasets return individual examples. Use the .batch method to create batches of an appropriate size for training. Before batching, also remember to .shuffle and .repeat the training set.
validate_ds = validate_ds.batch(BATCH_SIZE)
train_ds = train_ds.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)
The simplest way to prevent overfitting is to start with a small model: A model with a small number of learnable parameters (which is determined by the number of layers and the number of units per layer). In deep learning, the number of learnable parameters in a model is often referred to as the model's "capacity".
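To make "number of learnable parameters" concrete, here is a small sketch that counts the weights and biases of a stack of fully connected layers. The helper name dense_param_count and the example layer sizes are illustrative choices; the input width of 28 matches FEATURES above.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        # One weight per (input, unit) pair plus one bias per unit.
        total += fan_in * units + units
        fan_in = units
    return total


FEATURES = 28

# One hidden layer of 16 units followed by a single output unit.
print(dense_param_count(FEATURES, [16, 1]))       # 481
# Two hidden layers of 16 units followed by a single output unit.
print(dense_param_count(FEATURES, [16, 16, 1]))   # 753
```

The count grows roughly with the product of adjacent layer widths, so widening layers increases capacity much faster than adding narrow ones.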
Intuitively, a model with more parameters will have more "memorization capacity" and will therefore be able to easily learn a perfect dictionary-like mapping between training samples and their targets. Such a mapping has no generalization power, so it would be useless when making predictions on previously unseen data.
Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.
On the other hand, if the network has limited memorization resources, it will not be able to learn the mapping as easily. To minimize its loss, it will have to learn compressed representations that have more predictive power. At the same time, if you make your model too small, it will have difficulty fitting to the training data. There is a balance between "too much capacity" and "not enough capacity".
Unfortunately, there is no magical formula to determine the right size or architecture of your model (in terms of the number of layers, or the right size for each layer). You will have to experiment using a series of different architectures.
To find an appropriate model size, it's best to start with relatively few layers and parameters, then begin increasing the size of the layers or adding new layers until you see diminishing returns on the validation loss.
Start with a simple model using only layers.Dense as a baseline, then create larger versions, and compare them.
Many models train better if you gradually reduce the learning rate during training. Use optimizers.schedules to reduce the learning rate over time:
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=STEPS_PER_EPOCH*1000,
    decay_rate=1,
    staircase=False)
def get_optimizer():
    return tf.keras.optimizers.Adam(lr_schedule)
The code above sets a schedules.InverseTimeDecay to hyperbolically decrease the learning rate to 1/2 of the base rate at 1000 epochs, 1/3 at 2000 epochs and so on.
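The arithmetic behind those fractions can be checked by hand. The sketch below re-implements the non-staircase inverse time decay formula, initial_lr / (1 + decay_rate * step / decay_steps), in plain Python; the function name is an assumption, and the constants are copied from the settings above (20 steps per epoch, so decay_steps is 20,000).

```python
def inverse_time_decay(step, initial_lr=0.001, decay_steps=20000, decay_rate=1.0):
    """Learning rate after `step` optimizer steps (non-staircase form)."""
    return initial_lr / (1.0 + decay_rate * step / decay_steps)


STEPS_PER_EPOCH = 10000 // 500  # N_TRAIN // BATCH_SIZE = 20

for epoch in (0, 1000, 2000):
    step = epoch * STEPS_PER_EPOCH
    print(epoch, inverse_time_decay(step))
```

At epoch 1000 the denominator is 2 and at epoch 2000 it is 3, which gives the 1/2 and 1/3 of the base rate described above.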
step = np.linspace(0,100000)
lr = lr_schedule(step)
plt.figure(figsize = (8,6))
plt.plot(step/STEPS_PER_EPOCH, lr)
plt.ylim([0,max(plt.ylim())])
plt.xlabel('Epoch')
_ = plt.ylabel('Learning Rate')
Each model in this tutorial will use the same training configuration. So set these up in a reusable way, starting with the list of callbacks.
The training for this tutorial runs for many short epochs. To reduce the logging noise, use the tfdocs.EpochDots callback, which simply prints a . for each epoch, and a full set of metrics every 100 epochs.
Next include callbacks.EarlyStopping to avoid long and unnecessary training times. Note that this callback is set to monitor the val_binary_crossentropy, not the val_loss. This difference will be important later.
Use callbacks.TensorBoard to generate TensorBoard logs for the training.
def get_callbacks(name):
    return [
        tfdocs.modeling.EpochDots(),
        tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy', patience=200),
        tf.keras.callbacks.TensorBoard(logdir/name),
    ]
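The patience setting is easier to reason about with a toy version of the stopping rule. The sketch below is a simplified stand-in, not the Keras implementation: it ignores options such as min_delta and restoring the best weights, and the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(values, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once the monitored value has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, value in enumerate(values):
        if value < best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
val_losses = [0.70, 0.65, 0.62, 0.63, 0.64, 0.66, 0.61]

print(early_stopping_epoch(val_losses, patience=3))  # 5
print(early_stopping_epoch(val_losses, patience=5))  # None
```

With a patience of 3 the run stops before the late improvement at the last epoch, which is why a generous patience such as 200 is used when the validation metric is noisy.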
Similarly, each model will use the same Model.compile and Model.fit settings:
def compile_and_fit(model, name, optimizer=None, max_epochs=10000):
    if optimizer is None:
        optimizer = get_optimizer()
    # The models output raw logits, so both the loss and the metric use from_logits=True.
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=[
                      tf.keras.losses.BinaryCrossentropy(
                          from_logits=True, name='binary_crossentropy'),
                      'accuracy'])
    model.summary()
    history = model.fit(
        train_ds,
        steps_per_epoch=STEPS_PER_EPOCH,
        epochs=max_epochs,
        validation_data=validate_ds,
        callbacks=get_callbacks(name),
        verbose=0)
    return history
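The from_logits=True setting means the loss is computed from the raw output of the final Dense(1) layer, with no sigmoid applied by the model. As a minimal sketch of what that involves, the function below evaluates binary cross-entropy for a single logit using the usual numerically stable rearrangement; the function name is an assumption, and this is an illustration rather than the library's code.

```python
import math


def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed from a raw logit.

    Algebraically equal to -label*log(p) - (1-label)*log(1-p) with
    p = sigmoid(logit), but avoids overflow for large |logit|.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A logit of 0 means "no idea" (p = 0.5), so the loss is log(2).
print(bce_from_logit(0.0, 1.0))
# A confident, correct prediction has a small loss ...
print(bce_from_logit(5.0, 1.0))
# ... and a confident, wrong prediction has a large one.
print(bce_from_logit(5.0, 0.0))
```

A model that has learned nothing yet tends to sit near log(2), about 0.693, which is a useful reference point when reading the first few epochs of a training log.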
Start by training a model:
tiny_model = tf.keras.Sequential([
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(1)
])
size_histories = {}
size_histories['Tiny'] = compile_and_fit(tiny_model, 'sizes/Tiny')
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 464
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 481
Trainable params: 481
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/ops/summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0032s vs `on_train_batch_end` time: 0.0255s). Check your callbacks.
Epoch: 0, accuracy:0.5092, binary_crossentropy:0.7752, loss:0.7752, val_accuracy:0.5110, val_binary_crossentropy:0.7376, val_loss:0.7376,
....................................................................................................
Epoch: 100, accuracy:0.6028, binary_crossentropy:0.6251, loss:0.6251, val_accuracy:0.5680, val_binary_crossentropy:0.6271, val_loss:0.6271,
....................................................................................................
Epoch: 200, accuracy:0.6231, binary_crossentropy:0.6137, loss:0.6137, val_accuracy:0.5920, val_binary_crossentropy:0.6146, val_loss:0.6146,
....................................................................................................
Epoch: 300, accuracy:0.6356, binary_crossentropy:0.6038, loss:0.6038, val_accuracy:0.6190, val_binary_crossentropy:0.6051, val_loss:0.6051,
....................................................................................................
Epoch: 400, accuracy:0.6470, binary_crossentropy:0.5963, loss:0.5963, val_accuracy:0.6330, val_binary_crossentropy:0.5968, val_loss:0.5968,
....................................................................................................
Epoch: 500, accuracy:0.6619, binary_crossentropy:0.5909, loss:0.5909, val_accuracy:0.6280, val_binary_crossentropy:0.5939, val_loss:0.5939,
....................................................................................................
Epoch: 600, accuracy:0.6618, binary_crossentropy:0.5872, loss:0.5872, val_accuracy:0.6630, val_binary_crossentropy:0.5910, val_loss:0.5910,
....................................................................................................
Epoch: 700, accuracy:0.6655, binary_crossentropy:0.5847, loss:0.5847, val_accuracy:0.6290, val_binary_crossentropy:0.5940, val_loss:0.5940,
....................................................................................................
Epoch: 800, accuracy:0.6683, binary_crossentropy:0.5819, loss:0.5819, val_accuracy:0.6510, val_binary_crossentropy:0.5908, val_loss:0.5908,
....................................................................................................
Epoch: 900, accuracy:0.6722, binary_crossentropy:0.5797, loss:0.5797, val_accuracy:0.6620, val_binary_crossentropy:0.5907, val_loss:0.5907,
....................................................................................................
Epoch: 1000, accuracy:0.6761, binary_crossentropy:0.5779, loss:0.5779, val_accuracy:0.6470, val_binary_crossentropy:0.5910, val_loss:0.5910,
...............................
Now check how the model did:
plotter = tfdocs.plots.HistoryPlotter(metric = 'binary_crossentropy', smoothing_std=10)
plotter.plot(size_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
To see if you can beat the performance of the small model, progressively train some larger models.
Try two hidden layers with 16 units each:
small_model = tf.keras.Sequential([
    # `input_shape` is only required here so that `.summary` works.
    layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(16, activation='elu'),
    layers.Dense(1)
])
size_histories['Small'] = compile_and_fit(small_model, 'sizes/Small')
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 16) 464
_________________________________________________________________
dense_3 (Dense) (None, 16) 272
_________________________________________________________________
dense_4 (Dense) (None, 1) 17
=================================================================
Total params: 753
Trainable params: 753
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0037s vs `on_train_batch_end` time: 0.0530s). Check your callbacks.
Epoch: 0, accuracy:0.5029, binary_crossentropy:0.7257, loss:0.7257, val_accuracy:0.4720, val_binary_crossentropy:0.6927, val_loss:0.6927,
....................................................................................................
Epoch: 100, accuracy:0.6153, binary_crossentropy:0.6185, loss:0.6185, val_accuracy:0.6290, val_binary_crossentropy:0.6112, val_loss:0.6112,
....................................................................................................
Epoch: 200, accuracy:0.6551, binary_crossentropy:0.5940, loss:0.5940, val_accuracy:0.6540, val_binary_crossentropy:0.5941, val_loss:0.5941,
....................................................................................................
Epoch: 300, accuracy:0.6678, binary_crossentropy:0.5824, loss:0.5824, val_accuracy:0.6680, val_binary_crossentropy:0.5904, val_loss:0.5904,
....................................................................................................
Epoch: 400, accuracy:0.6731, binary_crossentropy:0.5754, loss:0.5754, val_accuracy:0.6630, val_binary_crossentropy:0.5872, val_loss:0.5872,
....................................................................................................
Epoch: 500, accuracy:0.6836, binary_crossentropy:0.5679, loss:0.5679, val_accuracy:0.6740, val_binary_crossentropy:0.5834, val_loss:0.5834,
....................................................................................................
Epoch: 600, accuracy:0.6839, binary_crossentropy:0.5617, loss:0.5617, val_accuracy:0.6760, val_binary_crossentropy:0.5849, val_loss:0.5849,
....................................................................................................
Now try 3 hidden layers with 64 units each:
medium_model = tf.keras.Sequential([
    layers.Dense(64, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(64, activation='elu'),
    layers.Dense(64, activation='elu'),
    layers.Dense(1)
])
And train the model using the same data:
size_histories['Medium'] = compile_and_fit(medium_model, "sizes/Medium")
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 64) 1856
_________________________________________________________________
dense_6 (Dense) (None, 64) 4160
_________________________________________________________________
dense_7 (Dense) (None, 64) 4160
_________________________________________________________________
dense_8 (Dense) (None, 1) 65
=================================================================
Total params: 10,241
Trainable params: 10,241
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0039s vs `on_train_batch_end` time: 0.0548s). Check your callbacks.
Epoch: 0, accuracy:0.5027, binary_crossentropy:0.6936, loss:0.6936, val_accuracy:0.5150, val_binary_crossentropy:0.6758, val_loss:0.6758,
....................................................................................................
Epoch: 100, accuracy:0.7075, binary_crossentropy:0.5382, loss:0.5382, val_accuracy:0.6670, val_binary_crossentropy:0.6027, val_loss:0.6027,
....................................................................................................
Epoch: 200, accuracy:0.7705, binary_crossentropy:0.4498, loss:0.4498, val_accuracy:0.6200, val_binary_crossentropy:0.6833, val_loss:0.6833,
...................................................................
As an exercise, you can create an even larger model, and see how quickly it begins overfitting. Next, let's add to this benchmark a network that has much more capacity, far more than the problem would warrant:
large_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(512, activation='elu'),
    layers.Dense(1)
])
And, again, train the model using the same data:
size_histories['large'] = compile_and_fit(large_model, "sizes/large")
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_9 (Dense) (None, 512) 14848
_________________________________________________________________
dense_10 (Dense) (None, 512) 262656
_________________________________________________________________
dense_11 (Dense) (None, 512) 262656
_________________________________________________________________
dense_12 (Dense) (None, 512) 262656
_________________________________________________________________
dense_13 (Dense) (None, 1) 513
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0041s vs `on_train_batch_end` time: 0.0613s). Check your callbacks.
Epoch: 0, accuracy:0.5072, binary_crossentropy:0.8249, loss:0.8249, val_accuracy:0.4810, val_binary_crossentropy:0.6884, val_loss:0.6884,
....................................................................................................
Epoch: 100, accuracy:1.0000, binary_crossentropy:0.0025, loss:0.0025, val_accuracy:0.6590, val_binary_crossentropy:1.8242, val_loss:1.8242,
....................................................................................................
Epoch: 200, accuracy:1.0000, binary_crossentropy:0.0001, loss:0.0001, val_accuracy:0.6590, val_binary_crossentropy:2.5014, val_loss:2.5014,
......................
The solid lines show the training loss, and the dashed lines show the validation loss (remember: a lower validation loss indicates a better model).
While building a larger model gives it more power, if this power is not constrained somehow it can easily overfit to the training set.
In this example, typically, only the "Tiny" model manages to avoid overfitting altogether, and each of the larger models overfits the data more quickly. This becomes so severe for the "large" model that you need to switch the plot to a log-scale to really see what's happening.
This is apparent if you plot and compare the validation metrics to the training metrics.
- It's normal for there to be a small difference.
- If both metrics are moving in the same direction, everything is fine.
- If the validation metric begins to stagnate while the training metric continues to improve, you are probably close to overfitting.
- If the validation metric is going in the wrong direction, the model is clearly overfitting.
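These rules of thumb can be turned into a rough check on two consecutive log entries. The sketch below is only a heuristic: the function name and the tolerance are arbitrary assumptions, and the sample numbers are binary_crossentropy values copied from the training logs above ("Medium" at epochs 100 and 200, "Tiny" at epochs 0 and 100).

```python
def diagnose(train_losses, val_losses, tolerance=1e-3):
    """Classify the most recent trend of training vs. validation loss."""
    train_delta = train_losses[-1] - train_losses[-2]
    val_delta = val_losses[-1] - val_losses[-2]
    if val_delta > tolerance and train_delta < 0:
        # Validation is getting worse while training keeps improving.
        return "overfitting"
    if abs(val_delta) <= tolerance and train_delta < -tolerance:
        # Validation has stalled while training keeps improving.
        return "close to overfitting"
    return "ok"


# "Medium" model, epochs 100 -> 200: training improves, validation worsens.
print(diagnose([0.5382, 0.4498], [0.6027, 0.6833]))  # overfitting
# "Tiny" model, epochs 0 -> 100: both losses fall together.
print(diagnose([0.7752, 0.6251], [0.7376, 0.6271]))  # ok
```

Real curves are noisy, so in practice the comparison is better made on smoothed values, as the plots in this tutorial do.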
plotter.plot(size_histories)
a = plt.xscale('log')
plt.xlim([5, max(plt.xlim())])
plt.ylim([0.5, 0.7])
plt.xlabel("Epochs [Log Scale]")
Text(0.5, 0, 'Epochs [Log Scale]')
Note: All the above training runs used the callbacks.EarlyStopping to end the training once it was clear the model was not making progress.
These models all wrote TensorBoard logs during training.
Open an embedded TensorBoard viewer inside a notebook:
# Load the TensorBoard notebook extension
%load_ext tensorboard
# Open an embedded TensorBoard viewer
%tensorboard --logdir {logdir}/sizes
You can view the results of a previous run of this notebook on TensorBoard.dev.
TensorBoard.dev is a managed experience for hosting, tracking, and sharing ML experiments with everyone.
It's also included in an <iframe> for convenience:
display.IFrame(
    src="https://tensorboard.dev/experiment/vW7jmmF9TmKmy3rbheMQpw/#scalars&_smoothingWeight=0.97",
    width="100%", height="800px")
If you want to share TensorBoard results you can upload the logs to TensorBoard.dev by copying the following into a code-cell.
Note: This step requires a Google account.
tensorboard dev upload --logdir {logdir}/sizes
Caution: This command does not terminate. It's designed to continuously upload the results of long-running experiments. Once your data is uploaded you need to stop it using the "interrupt execution" option in your notebook tool.
Before getting into the content of this section, copy the training logs from the "Tiny" model above, to use as a baseline for comparison.
shutil.rmtree(logdir/'regularizers/Tiny', ignore_errors=True)
shutil.copytree(logdir/'sizes/Tiny', logdir/'regularizers/Tiny')
PosixPath('/tmp/tmp9n203dpq/tensorboard_logs/regularizers/Tiny')
regularizer_histories = {}
regularizer_histories['Tiny'] = size_histories['Tiny']
You may be familiar with the Occam's Razor principle: given two explanations for something, the explanation most likely to be correct is the "simplest" one, the one that makes the fewest assumptions. This also applies to the models learned by neural networks: given some training data and a network architecture, there are multiple sets of weight values (multiple models) that could explain the data, and simpler models are less likely to overfit than complex ones.
A "simple model" in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters altogether, as we saw in the section above). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights only to take small values, which makes the distribution of weight values more "regular". This is called "weight regularization", and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:
- L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the "L1 norm" of the weights).
- L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: for plain gradient descent, weight decay is mathematically equivalent to L2 regularization.
L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization penalizes the weight parameters without making them sparse, since the penalty goes to zero for small weights. This is one reason why L2 is more common.
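The difference between the two penalties shows up most clearly in their gradients. The sketch below computes both penalties and their per-weight gradients in plain Python; the function names, the factor of 0.001, and the sample weights are illustrative assumptions.

```python
def l1_penalty(weights, factor):
    """factor * sum of |w| (the L1 norm of the weights)."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """factor * sum of w**2 (the squared L2 norm of the weights)."""
    return factor * sum(w * w for w in weights)


def l1_grad(w, factor):
    # Constant magnitude whatever the size of w (taken as 0 at w == 0).
    return factor * ((w > 0) - (w < 0))


def l2_grad(w, factor):
    # Shrinks in proportion to w, so small weights are barely pushed.
    return 2.0 * factor * w


weights = [2.0, -0.5, 0.01]  # hypothetical weight values
factor = 0.001

print(l1_penalty(weights, factor), l2_penalty(weights, factor))
for w in weights:
    print(w, l1_grad(w, factor), l2_grad(w, factor))
```

For the smallest weight the L1 gradient is still the full factor while the L2 gradient has almost vanished, which is why L1 tends to drive weights all the way to zero and L2 does not.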
In tf.keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. Let's add L2 weight regularization now.
l2_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(FEATURES,)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(512, activation='elu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1)
])
regularizer_histories['l2'] = compile_and_fit(l2_model, "regularizers/l2")
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_14 (Dense) (None, 512) 14848
_________________________________________________________________
dense_15 (Dense) (None, 512) 262656
_________________________________________________________________
dense_16 (Dense) (None, 512) 262656
_________________________________________________________________
dense_17 (Dense) (None, 512) 262656
_________________________________________________________________
dense_18 (Dense) (None, 1) 513
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0040s vs `on_train_batch_end` time: 0.0613s). Check your callbacks.
Epoch: 0, accuracy:0.5087, binary_crossentropy:0.8160, loss:2.3363, val_accuracy:0.4770, val_binary_crossentropy:0.6979, val_loss:2.1441,
....................................................................................................
Epoch: 100, accuracy:0.6607, binary_crossentropy:0.5920, loss:0.6163, val_accuracy:0.6530, val_binary_crossentropy:0.5831, val_loss:0.6076,
....................................................................................................
Epoch: 200, accuracy:0.6820, binary_crossentropy:0.5789, loss:0.6033, val_accuracy:0.6690, val_binary_crossentropy:0.5799, val_loss:0.6044,
....................................................................................................
Epoch: 300, accuracy:0.6865, binary_crossentropy:0.5696, loss:0.5947, val_accuracy:0.6360, val_binary_crossentropy:0.5839, val_loss:0.6088,
....................................................................................................
Epoch: 400, accuracy:0.6908, binary_crossentropy:0.5639, loss:0.5908, val_accuracy:0.6840, val_binary_crossentropy:0.5898, val_loss:0.6167,
..........................................
l2(0.001) means that every coefficient in the weight matrix of the layer will add 0.001 * weight_coefficient_value**2 to the total loss of the network.
That is why we're monitoring the binary_crossentropy directly: it doesn't have this regularization component mixed in.
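The size of the regularization component can be read off the training log by subtracting binary_crossentropy from loss. The sketch below does this for the training values logged above; the function name is an assumption, and because the logged numbers are rounded epoch averages, the results are approximate.

```python
def regularization_penalty(total_loss, data_loss):
    """Part of the total loss contributed by the weight penalty."""
    return total_loss - data_loss


# epoch -> (loss, binary_crossentropy), copied from the L2 model's training log.
logged = {
    0: (2.3363, 0.8160),
    100: (0.6163, 0.5920),
    200: (0.6033, 0.5789),
    300: (0.5947, 0.5696),
    400: (0.5908, 0.5639),
}

for epoch, (loss, bce) in logged.items():
    print(epoch, round(regularization_penalty(loss, bce), 4))
```

The penalty falls from about 1.52 at the first epoch to about 0.02 later on, so early changes in loss mostly reflect shrinking weights rather than better predictions, which is what makes loss a misleading quantity to monitor here.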
So, that same "Large" model with an L2 regularization penalty performs much better:
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
As you can see, the "L2" regularized model is now much more competitive with the "Tiny" model. This "L2" model is also much more resistant to overfitting than the "Large" model it was based on, despite having the same number of parameters.
There are two important things to note about this sort of regularization.
First: if you are writing your own training loop, then you need to be sure to ask the model for its regularization losses.
result = l2_model(features)
regularization_loss=tf.add_n(l2_model.losses)
Second: This implementation works by adding the weight penalties to the model's loss, and then applying a standard optimization procedure after that.
There is a second approach that instead only runs the optimizer on the raw loss, and then while applying the calculated step the optimizer also applies some weight decay. This "Decoupled Weight Decay" is seen in optimizers like optimizers.FTRL and optimizers.AdamW.
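The two approaches can be compared on a single scalar weight. The sketch below is a toy illustration under stated assumptions: the function names are made up, and dividing the step by the gradient's magnitude is only a crude stand-in for the per-parameter rescaling that adaptive optimizers perform, not an implementation of Adam.

```python
def sgd_step_l2(w, grad, lr, l2_factor):
    # The penalty l2_factor * w**2 adds 2 * l2_factor * w to the gradient.
    return w - lr * (grad + 2.0 * l2_factor * w)


def sgd_step_decoupled(w, grad, lr, weight_decay):
    # Step on the raw gradient, then shrink the weight separately.
    return w - lr * grad - lr * weight_decay * w


def normalized_step_l2(w, grad, lr, l2_factor):
    g = grad + 2.0 * l2_factor * w
    return w - lr * g / abs(g)  # the penalty gets rescaled along with the gradient


def normalized_step_decoupled(w, grad, lr, weight_decay):
    return w - lr * grad / abs(grad) - lr * weight_decay * w


w, grad, lr, l2_factor = 1.0, 0.5, 0.1, 0.01
weight_decay = 2.0 * l2_factor

# With plain gradient descent the two updates coincide.
print(sgd_step_l2(w, grad, lr, l2_factor), sgd_step_decoupled(w, grad, lr, weight_decay))
# Once the step is rescaled, they no longer do.
print(normalized_step_l2(w, grad, lr, l2_factor),
      normalized_step_decoupled(w, grad, lr, weight_decay))
```

This is the sense in which the two approaches are interchangeable for plain gradient descent but behave differently once the optimizer rescales its steps.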
Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto.
The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.
Dropout, applied to a layer, consists of randomly "dropping out" (i.e. set to zero) a number of output features of the layer during training. Let's say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1].
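The "dropout rate" is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out. To balance for the fact that more units are active than at training time, the original formulation scales the layer's output values down at test time by the keep probability (1 minus the dropout rate). The tf.keras Dropout layer achieves the same balance the other way around: it scales the kept units up by 1 / (1 - rate) during training and leaves the outputs unchanged at test time.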
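Here is a minimal plain-Python sketch of dropout on a list of activations. It uses the "inverted" variant, in which the compensation for dropped units is applied during training (kept values are scaled up by 1 / (1 - rate)) so that nothing needs rescaling at test time; the function name and the seeded generator are assumptions for illustration.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of values while training.

    Kept values are scaled by 1 / (1 - rate) so the expected output
    matches the input. At test time the values pass through unchanged.
    """
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # seeded so the run is repeatable
activations = [0.2, 0.5, 1.3, 0.8, 1.1]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)  # some entries zeroed, the rest doubled
print(test_out)   # identical to the input
```

Scaling at training time rather than at test time gives the same expected activations either way, and it keeps the inference path free of any dropout-specific arithmetic.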
In tf.keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of the layer right before it.
Let's add a Dropout layer after each hidden layer in our network to see how well they do at reducing overfitting:
dropout_model = tf.keras.Sequential([
    layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])
regularizer_histories['dropout'] = compile_and_fit(dropout_model, "regularizers/dropout")
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_19 (Dense) (None, 512) 14848
_________________________________________________________________
dropout (Dropout) (None, 512) 0
_________________________________________________________________
dense_20 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_21 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_22 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_3 (Dropout) (None, 512) 0
_________________________________________________________________
dense_23 (Dense) (None, 1) 513
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0040s vs `on_train_batch_end` time: 0.0632s). Check your callbacks.
Epoch: 0, accuracy:0.5073, binary_crossentropy:0.7984, loss:0.7984, val_accuracy:0.5200, val_binary_crossentropy:0.6761, val_loss:0.6761,
....................................................................................................
Epoch: 100, accuracy:0.6576, binary_crossentropy:0.5965, loss:0.5965, val_accuracy:0.6730, val_binary_crossentropy:0.5833, val_loss:0.5833,
....................................................................................................
Epoch: 200, accuracy:0.6861, binary_crossentropy:0.5554, loss:0.5554, val_accuracy:0.6790, val_binary_crossentropy:0.5830, val_loss:0.5830,
....................................................................................................
Epoch: 300, accuracy:0.7280, binary_crossentropy:0.5102, loss:0.5102, val_accuracy:0.6860, val_binary_crossentropy:0.6088, val_loss:0.6088,
................
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
It's clear from this plot that both of these regularization approaches improve the behavior of the "Large" model. But neither of them beats even the "Tiny" baseline.
Next, try them both together and see if that does better.
combined_model = tf.keras.Sequential([
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu', input_shape=(FEATURES,)),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(512, kernel_regularizer=regularizers.l2(0.0001),
                 activation='elu'),
    layers.Dropout(0.5),
    layers.Dense(1)
])
regularizer_histories['combined'] = compile_and_fit(combined_model, "regularizers/combined")
Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_24 (Dense) (None, 512) 14848
_________________________________________________________________
dropout_4 (Dropout) (None, 512) 0
_________________________________________________________________
dense_25 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_5 (Dropout) (None, 512) 0
_________________________________________________________________
dense_26 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_6 (Dropout) (None, 512) 0
_________________________________________________________________
dense_27 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_7 (Dropout) (None, 512) 0
_________________________________________________________________
dense_28 (Dense) (None, 1) 513
=================================================================
Total params: 803,329
Trainable params: 803,329
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0046s vs `on_train_batch_end` time: 0.0686s). Check your callbacks.
Epoch: 0, accuracy:0.5034, binary_crossentropy:0.8003, loss:0.9588, val_accuracy:0.5040, val_binary_crossentropy:0.6752, val_loss:0.8330,
....................................................................................................
Epoch: 100, accuracy:0.6514, binary_crossentropy:0.6067, loss:0.6373, val_accuracy:0.6470, val_binary_crossentropy:0.5868, val_loss:0.6173,
....................................................................................................
Epoch: 200, accuracy:0.6664, binary_crossentropy:0.5900, loss:0.6158, val_accuracy:0.6510, val_binary_crossentropy:0.5795, val_loss:0.6053,
....................................................................................................
Epoch: 300, accuracy:0.6690, binary_crossentropy:0.5822, loss:0.6104, val_accuracy:0.6940, val_binary_crossentropy:0.5611, val_loss:0.5892,
....................................................................................................
Epoch: 400, accuracy:0.6773, binary_crossentropy:0.5764, loss:0.6063, val_accuracy:0.6820, val_binary_crossentropy:0.5539, val_loss:0.5839,
....................................................................................................
Epoch: 500, accuracy:0.6840, binary_crossentropy:0.5695, loss:0.6012, val_accuracy:0.6870, val_binary_crossentropy:0.5500, val_loss:0.5818,
....................................................................................................
Epoch: 600, accuracy:0.6821, binary_crossentropy:0.5692, loss:0.6023, val_accuracy:0.6850, val_binary_crossentropy:0.5456, val_loss:0.5787,
....................................................................................................
Epoch: 700, accuracy:0.6836, binary_crossentropy:0.5678, loss:0.6021, val_accuracy:0.6870, val_binary_crossentropy:0.5502, val_loss:0.5846,
....................................................................................................
Epoch: 800, accuracy:0.6908, binary_crossentropy:0.5585, loss:0.5940, val_accuracy:0.7000, val_binary_crossentropy:0.5424, val_loss:0.5780,
....................................................................................................
Epoch: 900, accuracy:0.6931, binary_crossentropy:0.5583, loss:0.5948, val_accuracy:0.6860, val_binary_crossentropy:0.5447, val_loss:0.5813,
....................................................................................................
Epoch: 1000, accuracy:0.6919, binary_crossentropy:0.5563, loss:0.5940, val_accuracy:0.7100, val_binary_crossentropy:0.5422, val_loss:0.5799,
....................................................................................................
Epoch: 1100, accuracy:0.6914, binary_crossentropy:0.5545, loss:0.5935, val_accuracy:0.6940, val_binary_crossentropy:0.5375, val_loss:0.5765,
....................................................................................................
Epoch: 1200, accuracy:0.7012, binary_crossentropy:0.5466, loss:0.5867, val_accuracy:0.6970, val_binary_crossentropy:0.5429, val_loss:0.5831,
....................................................................................................
Epoch: 1300, accuracy:0.6939, binary_crossentropy:0.5491, loss:0.5903, val_accuracy:0.6950, val_binary_crossentropy:0.5477, val_loss:0.5890,
..
plotter.plot(regularizer_histories)
plt.ylim([0.5, 0.7])
(0.5, 0.7)
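The model with "Combined" regularization is clearly the best one so far: in the logs above, its validation cross-entropy falls to about 0.54, compared with about 0.58 for the dropout-only model.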
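In the log for the combined model, `loss` is always larger than `binary_crossentropy`, while the two are identical for the dropout-only model. The gap is the L2 penalty contributed by each `kernel_regularizer`: `regularizers.l2(0.0001)` adds 0.0001 times the sum of the squared kernel weights to the training loss. A small pure-Python sketch of that bookkeeping follows; the `l2_penalty` helper, the example weights, and the example cross-entropy value are made up for illustration, and only the logged numbers at the end come from the run above.

```python
def l2_penalty(weights, l2):
    """L2 regularization term for one kernel: l2 * sum of squared weights."""
    return l2 * sum(w * w for w in weights)


L2 = 0.0001  # same coefficient as regularizers.l2(0.0001) in the model above

# Hypothetical, tiny "kernels" standing in for the real 512-unit layers.
example_kernels = [
    [0.5, -1.0, 2.0],
    [0.1, 0.3],
]
example_cross_entropy = 0.6  # made-up value for illustration

# The reported loss is the data loss plus the penalties from every
# regularized layer.
penalty = sum(l2_penalty(kernel, L2) for kernel in example_kernels)
total_loss = example_cross_entropy + penalty
print(f"penalty={penalty:.6f}  total_loss={total_loss:.6f}")

# Working backwards from the log above: loss - binary_crossentropy is the
# total L2 penalty of the real model at that epoch.
gap_epoch_0 = 0.9588 - 0.8003
gap_epoch_1300 = 0.5903 - 0.5491
print(f"epoch 0 penalty ~ {gap_epoch_0:.4f}, epoch 1300 penalty ~ {gap_epoch_1300:.4f}")
```

The logged penalty is smaller at epoch 1300 than at epoch 0, meaning the sum of squared kernel weights decreased over training. When comparing models, `binary_crossentropy` is the fairer metric because it excludes the penalty term.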
These models also recorded TensorBoard logs.
To open an embedded TensorBoard viewer inside a notebook, copy the following into a code-cell:
%tensorboard --logdir {logdir}/regularizers
You can view the results of a previous run of this notebook on TensorBoard.dev.
It's also included in an <iframe> for convenience:
display.IFrame(
src="https://tensorboard.dev/experiment/fGInKDo8TXes1z7HQku9mw/#scalars&_smoothingWeight=0.97",
width = "100%",
height="800px")
This was uploaded with:
tensorboard dev upload --logdir {logdir}/regularizers
To recap, here are the most common ways to prevent overfitting in neural networks:
- Get more training data.
- Reduce the capacity of the network.
- Add weight regularization.
- Add dropout.
Two important approaches not covered in this guide are:
- Data augmentation.
- Batch normalization.
Remember that each method can help on its own, but often combining them can be even more effective.
# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.