diff --git a/episodes/1-introduction.Rmd b/episodes/1-introduction.Rmd index a6740dc2..f8489275 100644 --- a/episodes/1-introduction.Rmd +++ b/episodes/1-introduction.Rmd @@ -38,9 +38,12 @@ Deep Learning (DL) is just one of many techniques collectively known as machine The image below shows some differences between artificial intelligence, Machine Learning and Deep Learning. -![](../fig/01_AI_ML_DL_differences.png){alt='An infographics showing the relation of AI, ML, NN and DL. NN are methods in DL which is a subset of ML algorithms that falls within the umbrella of AI'} - -The image above is by Tukijaaliwa, CC BY-SA 4.0, via Wikimedia Commons, [original source]( https://en.wikipedia.org/wiki/File:AI-ML-DL.svg) +![ +Image credit: Tukijaaliwa, CC BY-SA 4.0, via Wikimedia Commons, +[original source]( https://en.wikipedia.org/wiki/File:AI-ML-DL.svg) +](fig/01_AI_ML_DL_differences.png){ +alt='An infographic showing the relation of AI, ML, NN and DL. NN are methods in DL which is a subset of ML algorithms that falls within the umbrella of AI' +} #### Neural Networks @@ -59,14 +62,18 @@ A neural network consists of connected computational units called **neurons**. E - one example equation to calculate the output for a neuron is: $output = ReLU(\sum_{i} (x_i*w_i) + bias)$ -![](../fig/01_neuron.png){alt='A diagram of a single artificial neuron combining inputs and weights using an activation function.' width='600px'} +![](fig/01_neuron.png){alt='A diagram of a single artificial neuron combining inputs and weights using an activation function.' width='600px'} ##### Combining multiple neurons into a network Multiple neurons can be joined together by connecting the output of one to the input of another. These connections are associated with weights that determine the 'strength' of the connection, the weights are adjusted during training. In this way, the combination of neurons and connections describe a computational graph, an example can be seen in the image below. In most neural networks neurons are aggregated into layers. Signals travel from the input layer to the output layer, possibly through one or more intermediate layers called hidden layers. The image below shows an example of a neural network with three layers, each circle is a neuron, each line is an edge and the arrows indicate the direction data moves in. -![](../fig/01_neural_net.png){alt='A diagram of a three layer neural network with an input layer, one hidden layer, and an output layer.'} -The image above is by Glosser.ca, CC BY-SA 3.0 , via Wikimedia Commons, [original source](https://commons.wikimedia.org/wiki/File:Colored_neural_network.svg) +![ +Image credit: Glosser.ca, CC BY-SA 3.0 , via Wikimedia Commons, +[original source](https://commons.wikimedia.org/wiki/File:Colored_neural_network.svg) +](fig/01_neural_net.png){ +alt='A diagram of a three layer neural network with an input layer, one hidden layer, and an output layer.' +} ::: challenge ## Neural network calculations @@ -88,7 +95,7 @@ _Note: You can use whatever you like: brain only, pen&paper, Python, Excel..._ Have a look at the following network: -![](../fig/01_xor_exercise.png){alt='A diagram of a neural network with 2 inputs, 2 hidden layer neurons, and 1 output.' width='400px'} +![](fig/01_xor_exercise.png){alt='A diagram of a neural network with 2 inputs, 2 hidden layer neurons, and 1 output.' width='400px'} a. Calculate the output of the network for the following combinations of inputs: @@ -131,14 +138,13 @@ b. 
This solves the XOR logical problem, the output is 1 if only one of the two i ## Activation functions Look at the following activation functions: -![](../fig/01_sigmoid.svg){alt='Plot of the sigmoid function' width='200px'} -A. Sigmoid activation function +![A. Sigmoid activation function](fig/01_sigmoid.svg){alt='Plot of the sigmoid function' width='200px'} + + +![B. ReLU activation function](fig/01_relu.svg){alt='Plot of the ReLU function' width='200px'} -![](../fig/01_relu.svg){alt='Plot of the ReLU function' width='200px'} -B. ReLU activation function -![](../fig/01_identity_function.svg){alt='Plot of the Identity function' width='200px'} -C. Identity (or linear) activation function +![C. Identity (or linear) activation function](fig/01_identity_function.svg){alt='Plot of the Identity function' width='200px'} Combine the following statements to the correct activation function: @@ -176,7 +182,7 @@ The image below shows a diagram of all the layers (there are too many neurons to The input (left most) layer of the network is an image and the final (right most) layer of the network outputs a zero or one to determine if the input data belongs to the class of data we are interested in. This image is from the paper ["An Efficient Pedestrian Detection Method Based on YOLOv2" by Zhongmin Liu, Zhicai Chen, Zhanming Li, and Wenjin Hu published in Mathematical Problems in Engineering, Volume 2018](https://doi.org/10.1155/2018/3518959) -![](../fig/01_deep_network.png){alt='An example of a deep neural network'} +![](fig/01_deep_network.png){alt='An example of a deep neural network'} ### How do neural networks learn? What happens in a neural network during the training process? @@ -211,7 +217,7 @@ A more complicated and less used loss function for regression is the [Huber loss Below you see the Huber loss (green, delta = 1) and Squared error loss (blue) as a function of `y_true - y_pred`. -![](../fig/01_huber_loss.png){alt='Huber loss (green, delta = 1) and squared error loss (blue) +![](fig/01_huber_loss.png){alt='Huber loss (green, delta = 1) and squared error loss (blue) as a function of y_true - y_pred' width='400px'} Which loss function is more sensitive to outliers? @@ -352,7 +358,7 @@ The optimizer is responsible for taking the output of the loss function and then We can now go ahead and start training our neural network. We will probably keep doing this for a given number of iterations through our training dataset (referred to as _epochs_) or until the loss function gives a value under a certain threshold. The graph below show the loss against the number of _epochs_, generally the loss will go down with each _epoch_, but occasionally it will see a small rise. -![](../fig/training-0_to_1500.svg){alt='A graph showing an exponentially decreasing loss over the first 1500 epochs of training an example network.'} +![](fig/training-0_to_1500.svg){alt='A graph showing an exponentially decreasing loss over the first 1500 epochs of training an example network.'} ### 7. Perform a Prediction/Classification diff --git a/episodes/2-keras.Rmd b/episodes/2-keras.Rmd index 4d761d5d..4f064665 100644 --- a/episodes/2-keras.Rmd +++ b/episodes/2-keras.Rmd @@ -76,11 +76,11 @@ The goal is to predict a penguins' species using the attributes available in thi The `palmerpenguins` data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. The physical attributes measured are flipper length, beak length, beak width, body mass, and sex. 
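To get a first feel for what these measurements look like as a table, the sketch below loads the copy of the Palmer penguins data that ships with seaborn. This is only an illustration; the episode loads and prepares the dataset in its own way further down.

```python
import seaborn as sns

# Sketch: a quick first look at the penguin measurements, using the copy of the
# Palmer penguins data bundled with seaborn (downloaded and cached on first use).
penguins = sns.load_dataset('penguins')
print(penguins.head())                     # one row per penguin, one column per attribute
print(penguins['species'].value_counts())  # number of penguins per species
```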
-![][palmer-penguins] -*Artwork by @allison_horst* +![*Artwork by @allison_horst*][palmer-penguins] + + +![*Artwork by @allison_horst*][penguin-beaks] -![][penguin-beaks] -*Artwork by @allison_horst* These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the [Palmer Station Long Term Ecological Research Program](https://pal.lternet.edu/), part of the [US Long Term Ecological Research Network](https://lternet.edu/). The data were imported directly from the [Environmental Data Initiative](https://environmentaldatainitiative.org/) (EDI) Data Portal, and are available for use by CC0 license ("No Rights Reserved") in accordance with the [Palmer Station Data Policy](https://pal.lternet.edu/data/policies). @@ -752,22 +752,22 @@ Length: 69, dtype: object ``` -[palmer-penguins]: ../fig/palmer_penguins.png "Palmer Penguins" +[palmer-penguins]: fig/palmer_penguins.png "Palmer Penguins" {alt='Illustration of the three species of penguins found in the Palmer Archipelago, Antarctica: Chinstrap, Gentoo and Adele'} -[penguin-beaks]: ../fig/culmen_depth.png "Culmen Depth" +[penguin-beaks]: fig/culmen_depth.png "Culmen Depth" {alt='Illustration of the beak dimensions called culmen length and culmen depth in the dataset'} -[pairplot]: ../fig/pairplot.png "Pair Plot" +[pairplot]: fig/pairplot.png "Pair Plot" {alt='Pair plot showing the separability of the three species of penguin for combinations of dataset attributes'} -[sex_pairplot]: ../fig/02_sex_pairplot.png "Pair plot grouped by sex" +[sex_pairplot]: fig/02_sex_pairplot.png "Pair plot grouped by sex" {alt='Pair plot showing the separability of the two sexes of penguin for combinations of dataset attributes'} -[training_curve]: ../fig/02_training_curve.png "Training Curve" +[training_curve]: fig/02_training_curve.png "Training Curve" {alt='Training loss curve of the neural network training which depicts exponential decrease in loss before a plateau from ~10 epochs'} -[confusion_matrix]: ../fig/confusion_matrix.png "Confusion Matrix" +[confusion_matrix]: fig/confusion_matrix.png "Confusion Matrix" {alt='Confusion matrix of the test set with high accuracy for Adelie and Gentoo classification and no correctly predicted Chinstrap'} diff --git a/episodes/3-monitor-the-model.Rmd b/episodes/3-monitor-the-model.Rmd index 75b2d3b5..be1376db 100644 --- a/episodes/3-monitor-the-model.Rmd +++ b/episodes/3-monitor-the-model.Rmd @@ -46,7 +46,7 @@ Here we want to work with the *weather prediction dataset* (the light version) w It contains daily weather observations from 11 different European cities or places through the years 2000 to 2010. For all locations the data contains the variables ‘mean temperature’, ‘max temperature’, and ‘min temperature’. In addition, for multiple locations, the following variables are provided: 'cloud_cover', 'wind_speed', 'wind_gust', 'humidity', 'pressure', 'global_radiation', 'precipitation', 'sunshine', but not all of them are provided for every location. A more extensive description of the dataset including the different physical units is given in accompanying metadata file. The full dataset comprises of 10 years (3654 days) of collected weather data across Europe. 
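As a sketch of what a first inspection of this dataset could look like, you can read the CSV file into a pandas dataframe. The file name below is a placeholder for wherever you stored the downloaded file.

```python
import pandas as pd

# Sketch: load and inspect the weather prediction dataset.
# 'weather_prediction_dataset_light.csv' is a placeholder name; adjust it to
# the path of the CSV file you downloaded.
data = pd.read_csv('weather_prediction_dataset_light.csv')
print(data.shape)         # expect 3654 rows, one per day
print(data.columns[:10])  # column names combine a location and a variable, e.g. BASEL_sunshine
```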
-![European locations in the weather prediction dataset](../fig/03_weather_prediction_dataset_map.png){alt='18 European locations in the weather prediction dataset'} +![European locations in the weather prediction dataset](fig/03_weather_prediction_dataset_map.png){alt='18 European locations in the weather prediction dataset'} A very common task with weather data is to make a prediction about the weather sometime in the future, say the next day. In this episode, we will try to predict tomorrow's sunshine hours, a challenging-to-predict feature, using a neural network with the available weather data for one location: BASEL. @@ -249,7 +249,7 @@ Then, we update the weight by taking a small step in the direction of the negati This will slightly decrease the loss. This process is repeated until the loss function reaches a minimum. The size of the step that is taken in each iteration is called the 'learning rate'. -![](../fig/03_gradient_descent.png){alt='Plot of the loss as a function of the weights. Through gradient descent the global loss minimum is found'} +![](fig/03_gradient_descent.png){alt='Plot of the loss as a function of the weights. Through gradient descent the global loss minimum is found'} ### Batch gradient descent You could use the entire training dataset to perform one learning step in gradient descent, @@ -388,7 +388,7 @@ def plot_history(history, metrics): plot_history(history, 'root_mean_squared_error') ``` -![](../fig/03_training_history_1_rmse.png){alt='Plot of the RMSE over epochs for the trained model that shows a decreasing error metric'} +![](fig/03_training_history_1_rmse.png){alt='Plot of the RMSE over epochs for the trained model that shows a decreasing error metric'} This looks very promising! Our metric ("RMSE") is dropping nicely and while it maybe keeps fluctuating a bit it does end up at fairly low *RMSE* values. But the *RMSE* is just the root *mean* squared error, so we might want to look a bit more in detail how well our just trained model does in predicting the sunshine hours. 
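To make the connection between this single metric and the raw predictions explicit, here is a minimal sketch that recomputes the RMSE by hand. It assumes the `model`, `X_train` and `y_train` defined earlier in this episode.

```python
import numpy as np

# Minimal sketch: the RMSE is just the square root of the mean squared
# difference between predicted and true sunshine hours.
y_train_predicted = model.predict(X_train)
rmse_train = np.sqrt(np.mean((y_train_predicted.flatten() - y_train) ** 2))
print(f'RMSE on the training set: {rmse_train:.2f} hours of sunshine')
```

A single number like this is convenient for tracking progress, but it does not tell us how the errors are distributed, which is what the scatter plots below are for.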
@@ -421,12 +421,12 @@ def plot_predictions(y_pred, y_true, title): plot_predictions(y_train_predicted, y_train, title='Predictions on the training set') ``` -![](../fig/03_regression_predictions_trainset.png){alt='Scatter plot between predictions and true sunshine hours in Basel on the train set showing a concise spread'} +![](fig/03_regression_predictions_trainset.png){alt='Scatter plot between predictions and true sunshine hours in Basel on the train set showing a concise spread'} ```python plot_predictions(y_test_predicted, y_test, title='Predictions on the test set') ``` -![](../fig/03_regression_predictions_testset.png){alt='Scatter plot between predictions and true sunshine hours in Basel on the test set showing a wide spread'} +![](fig/03_regression_predictions_testset.png){alt='Scatter plot between predictions and true sunshine hours in Basel on the test set showing a wide spread'} ::: challenge ## Exercise: Reflecting on our results @@ -489,7 +489,7 @@ y_baseline_prediction = X_test['BASEL_sunshine'] plot_predictions(y_baseline_prediction, y_test, title='Baseline predictions on the test set') ``` -![](../fig/03_regression_test_5_naive_baseline.png){alt="Scatter plot of predicted vs true sunshine hours in Basel for the test set where today's sunshine hours is considered as the true sunshine hours for tomorrow"} +![](fig/03_regression_test_5_naive_baseline.png){alt="Scatter plot of predicted vs true sunshine hours in Basel for the test set where today's sunshine hours is considered as the true sunshine hours for tomorrow"} It is difficult to interpret from this plot whether our model is doing better than the baseline. We can also have a look at the RMSE: @@ -557,7 +557,7 @@ With this we can plot both the performance on the training data and on the valid plot_history(history, ['root_mean_squared_error', 'val_root_mean_squared_error']) ``` -![](../fig/03_training_history_2_rmse.png){alt='Plot of RMSE vs epochs for the training set and the validation set which depicts a divergence between the two around 10 epochs.'} +![](fig/03_training_history_2_rmse.png){alt='Plot of RMSE vs epochs for the training set and the validation set which depicts a divergence between the two around 10 epochs.'} ::: challenge ## Exercise: plot the training progress. @@ -646,7 +646,7 @@ history = model.fit(X_train, y_train, plot_history(history, ['root_mean_squared_error', 'val_root_mean_squared_error']) ``` -![](../fig/03_training_history_3_rmse_smaller_model.png){alt='Plot of RMSE vs epochs for the training set and the validation set with similar performance across the two sets.'} +![](fig/03_training_history_3_rmse_smaller_model.png){alt='Plot of RMSE vs epochs for the training set and the validation set with similar performance across the two sets.'} 1. With this smaller model we have reduced overfitting a bit, since the training and validation loss are now closer to each other, and the validation loss does now reach a plateau and does not further increase. We have not completely avoided overfitting though. 
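A common next step, and the one this episode turns to, is early stopping: let Keras interrupt training once the validation loss stops improving. Below is a minimal sketch; it assumes `model`, `X_train`, `y_train` and a validation split (`X_val`, `y_val`) as used above, and the `patience` and `epochs` values are illustrative rather than the episode's reference settings.

```python
from tensorflow import keras

# Sketch: stop training once the validation loss has not improved for
# `patience` consecutive epochs. The numbers here are illustrative.
earlystopper = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

history = model.fit(X_train, y_train,
                    batch_size=32,
                    epochs=200,
                    validation_data=(X_val, y_val),
                    callbacks=[earlystopper])
```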
@@ -695,7 +695,7 @@ As before, we can plot the losses during training: plot_history(history, ['root_mean_squared_error', 'val_root_mean_squared_error']) ``` -![](../fig/03_training_history_3_rmse_early_stopping.png){alt='Plot of RMSE vs epochs for the training set and the validation set displaying similar performance across the two sets.'} +![](fig/03_training_history_3_rmse_early_stopping.png){alt='Plot of RMSE vs epochs for the training set and the validation set displaying similar performance across the two sets.'} This still seems to reveal the onset of overfitting, but the training stops before the discrepancy between training and validation loss can grow further. Despite avoiding severe cases of overfitting, early stopping has the additional advantage that the number of training epochs will be regulated automatically. @@ -772,7 +772,7 @@ history = model.fit(X_train, y_train, plot_history(history, ['root_mean_squared_error', 'val_root_mean_squared_error']) ``` -![](../fig/03_training_history_5_rmse_batchnorm.png){alt='Output of plotting sample'} +![](fig/03_training_history_5_rmse_batchnorm.png){alt='Output of plotting sample'} ::: callout ## Batchnorm parameters @@ -794,7 +794,7 @@ y_test_predicted = model.predict(X_test) plot_predictions(y_test_predicted, y_test, title='Predictions on the test set') ``` -![](../fig/03_regression_test_5_dropout_batchnorm.png){alt='Scatter plot between predictions and true sunshine hours for Basel on the test set'} +![](fig/03_regression_test_5_dropout_batchnorm.png){alt='Scatter plot between predictions and true sunshine hours for Basel on the test set'} Well, the above is certainly not perfect. But how good or bad is this? Maybe not good enough to plan your picnic for tomorrow. But let's better compare it to the naive baseline we created in the beginning. What would you say, did we improve on that? @@ -876,7 +876,7 @@ Create a scatter plot to compare with true observations: y_test_predicted = model.predict(X_test) plot_predictions(y_test_predicted, y_test, title='Predictions on the test set') ``` -![](../fig/03_scatter_plot_basel_model.png){alt='Scatterplot of predictions and true number of sunshine hours'} +![](fig/03_scatter_plot_basel_model.png){alt='Scatterplot of predictions and true number of sunshine hours'} Compute the RMSE on the test set: @@ -939,7 +939,7 @@ You can launch the tensorboard interface from a Jupyter notebook, showing all tr %tensorboard --logdir logs/fit ``` Which will show an interface that looks something like this: -![](../fig/03_tensorboard.png){alt='Screenshot of tensorboard'} +![](fig/03_tensorboard.png){alt='Screenshot of tensorboard'} ::: ## 10. Save model diff --git a/episodes/4-advanced-layer-types.Rmd b/episodes/4-advanced-layer-types.Rmd index ee3be84f..ac2972d0 100644 --- a/episodes/4-advanced-layer-types.Rmd +++ b/episodes/4-advanced-layer-types.Rmd @@ -56,7 +56,7 @@ For more information about this dataset and how it was collected you can check o [Learning Multiple Layers of Features from Tiny Images by Alex Krizhevsky, 2009](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf). -![Sample images from the CIFAR-10 data-set. Each image is labelled with a category, for example: 'frog' or 'horse'](../fig/04_cifar10.png){alt="A 5 by 5 grid of 25 sample images from the CIFAR-10 data-set. Each image is labelled with a category, for example: 'frog' or 'horse'."} +![Sample images from the CIFAR-10 data-set. 
Each image is labelled with a category, for example: 'frog' or 'horse'](fig/04_cifar10.png){alt="A 5 by 5 grid of 25 sample images from the CIFAR-10 data-set. Each image is labelled with a category, for example: 'frog' or 'horse'."} We take a small sample of the data as training set for demonstration purposes. ```python @@ -209,9 +209,9 @@ Note that for RGB images, the kernel should also have a depth of 3. In the following image, we see the effect of such a kernel on the values of a single-channel image. The red cell in the output matrix is the result of multiplying and summing the values of the red square in the input, and the kernel. Applying this kernel to a real image shows that it indeed detects horizontal edges. -![](../fig/04_conv_matrix.png){alt='Example of a convolution matrix calculation' style='width:90%'} +![](fig/04_conv_matrix.png){alt='Example of a convolution matrix calculation' style='width:90%'} -![](../fig/04_conv_image.png){alt='Convolution example on an image of a cat to extract features' style='width:100%'} +![](fig/04_conv_image.png){alt='Convolution example on an image of a cat to extract features' style='width:100%'} In our **convolutional layer** our hidden units are a number of convolutional matrices (or kernels), where the values of the matrices are the weights that we learn in the training process. The output of a convolutional layer is an 'image' for each of the kernels, that gives the output of the kernel applied to each pixel. @@ -427,13 +427,13 @@ def plot_history(history, metrics): plt.ylabel("metric") plot_history(history, ['accuracy', 'val_accuracy']) ``` -![](../fig/04_training_history_1.png){alt='Plot of training accuracy and validation accuracy vs epochs for the trained model'} +![](fig/04_training_history_1.png){alt='Plot of training accuracy and validation accuracy vs epochs for the trained model'} ```python plot_history(history, ['loss', 'val_loss']) ``` -![](../fig/04_training_history_loss_1.png){alt='Plot of training loss and validation loss vs epochs for the trained model'} +![](fig/04_training_history_loss_1.png){alt='Plot of training loss and validation loss vs epochs for the trained model'} It seems that the model is overfitting somewhat, because the validation accuracy and loss stagnates. @@ -497,7 +497,7 @@ history = dense_model.fit(train_images, train_labels, epochs=30, validation_data=(test_images, test_labels)) plot_history(['accuracy', 'val_accuracy']) ``` -![](../fig/04_dense_model_training_history.png){alt="Plot of training accuracy and validation accuracy vs epochs for a model with only dense layers"} +![](fig/04_dense_model_training_history.png){alt="Plot of training accuracy and validation accuracy vs epochs for a model with only dense layers"} As you can see the validation accuracy only reaches about 35%, whereas the CNN reached about 55% accuracy. 
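To make the kernel arithmetic from earlier in this episode more tangible, here is a small sketch with made-up values that applies a horizontal-edge kernel to a tiny single-channel image, using scipy instead of Keras.

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny 'image': two dark rows on top of three bright rows.
image = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0],
                  [9, 9, 9, 9, 9],
                  [9, 9, 9, 9, 9],
                  [9, 9, 9, 9, 9]], dtype=float)

# A horizontal-edge kernel: it compares the rows above and below each position.
kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=float)

# Note: convolve2d flips the kernel (true convolution); for spotting edges the
# effect is the same up to a sign.
edges = convolve2d(image, kernel, mode='valid')
print(edges)  # large magnitude where the window straddles the dark-to-bright edge, 0 elsewhere
```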
@@ -583,12 +583,12 @@ history = model.fit(train_images, train_labels, epochs=20,
                    validation_data=(val_images, val_labels))
plot_history(history, ['accuracy', 'val_accuracy'])
```
-![](../fig/04_training_history_2.png){alt="Plot of training accuracy and validation accuracy vs epochs for the trained model"}
+![](fig/04_training_history_2.png){alt="Plot of training accuracy and validation accuracy vs epochs for the trained model"}
```python
plot_history(history, ['loss', 'val_loss'])
```
-![](../fig/04_training_history_loss_2.png){alt: "Plot of training loss and validation loss vs epochs for the trained model"}
+![](fig/04_training_history_loss_2.png){alt="Plot of training loss and validation loss vs epochs for the trained model"}
::::
:::
@@ -636,7 +636,7 @@ One of the most versatile regularization technique is **dropout** ([Srivastava e
Dropout means that during each training cycle (one forward pass of the data through the model) a random fraction of neurons in a dense layer are turned off. This is described with the dropout rate between 0 and 1 which determines the fraction of nodes to silence at a time.
-![](../fig/neural_network_sketch_dropout.png){alt='A sketch of a neural network with and without dropout'}
+![](fig/neural_network_sketch_dropout.png){alt='A sketch of a neural network with and without dropout'}
The intuition behind dropout is that it enforces redundancies in the network by constantly removing different elements of a network. The model can no longer rely on individual nodes and instead must create multiple "paths". In addition, the model has to make predictions with much fewer nodes and weights (connections between the nodes). As a result, it becomes much harder for a network to memorize particular features. At first this might appear a quite drastic approach which affects the network architecture strongly.
@@ -718,13 +718,13 @@ val_loss, val_acc = model_dropout.evaluate(val_images, val_labels, verbose=2)
```
313/313 - 2s - loss: 1.4683 - accuracy: 0.5307
```
-![](../fig/04_training_history_3.png){alt="Plot of training accuracy and validation accuracy vs epochs for the trained model"}
+![](fig/04_training_history_3.png){alt="Plot of training accuracy and validation accuracy vs epochs for the trained model"}
```python
plot_history(history, ['loss', 'val_loss'])
```
-![](../fig/04_training_history_loss_3.png){alt="Plot of training loss and validation loss vs epochs for the trained model"}
+![](fig/04_training_history_loss_3.png){alt="Plot of training loss and validation loss vs epochs for the trained model"}
Now we see that the gap between the training accuracy and validation accuracy is much smaller, and that the final accuracy on the validation set is higher than without dropout.
@@ -777,7 +777,8 @@ loss_df = pd.DataFrame({'dropout_rate': dropout_rates, 'val_loss': val_losses})
sns.lineplot(data=loss_df, x='dropout_rate', y='val_loss')
```
-![](../fig/04_vary_dropout_rate.png){alt="Plot of vall loss vs dropout rate used in the model. The val loss varies between 1.26 and 1.40 and is lowest with a dropout_rate around 0.45."}
+![](fig/04_vary_dropout_rate.png){alt="Plot of val loss vs dropout rate used in the model. The val loss varies between 1.26 and 1.40 and is lowest with a dropout_rate around 0.45."}
+
### 2. Term associated to this procedure
This is called hyperparameter tuning.
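Because trying hyperparameter values one at a time quickly becomes tedious, a small grid search is a natural extension of the dropout-rate experiment above. The sketch below varies the dropout rate together with the size of the dense layer; the architecture, value grids and epoch count are illustrative choices rather than the episode's reference model, and `train_images`, `train_labels`, `val_images` and `val_labels` are assumed from earlier in this episode.

```python
import itertools
from tensorflow import keras

def build_model(dropout_rate, n_units):
    # Illustrative CIFAR-10 style model: one conv block, dropout, then a dense layer.
    inputs = keras.Input(shape=train_images.shape[1:])
    x = keras.layers.Conv2D(32, (3, 3), activation='relu')(inputs)
    x = keras.layers.MaxPooling2D((2, 2))(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dropout(dropout_rate)(x)
    x = keras.layers.Dense(n_units, activation='relu')(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam',
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    return model

results = []
for dropout_rate, n_units in itertools.product([0.2, 0.4, 0.6], [32, 64]):
    model = build_model(dropout_rate, n_units)
    history = model.fit(train_images, train_labels, epochs=10, verbose=0,
                        validation_data=(val_images, val_labels))
    results.append((dropout_rate, n_units, min(history.history['val_loss'])))

# Print the combinations from best (lowest validation loss) to worst.
for dropout_rate, n_units, val_loss in sorted(results, key=lambda r: r[-1]):
    print(f'dropout={dropout_rate}, units={n_units}: val_loss={val_loss:.3f}')
```

Keep in mind that every extra hyperparameter multiplies the number of models to train, so grids like this are only practical for a handful of values.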
diff --git a/instructors/bonus-material.md b/instructors/bonus-material.md index 80b1740e..e28faf57 100644 --- a/instructors/bonus-material.md +++ b/instructors/bonus-material.md @@ -6,9 +6,9 @@ title: Bonus material To apply Deep Learning to a problem there are several steps we need to go through: -![A visualisation of the Machine Learning Pipeline](../fig/graphviz/pipeline.png) +![A visualisation of the Machine Learning Pipeline](../episodes/fig/graphviz/pipeline.png) -Feel free to use this figure as [png](../fig/graphviz/pipeline.png). The figure is contained in `fig/graphviz/` of this repository. Use the `Makefile` there in order to reproduce it in different output formats. +Feel free to use this figure as [png](../episodes/fig/graphviz/pipeline.png). The figure is contained in `fig/graphviz/` of this repository. Use the `Makefile` there in order to reproduce it in different output formats. ## Optional part - prediction uncertainty using Monte-Carlo Dropout Depending on the data and the question asked, model predictions can be highly accuracte. Or, as in the present case, show a high degree of error. @@ -22,7 +22,7 @@ The name of the technique refers to a very common regularization technique: **Dr One of the most versatile regularization technique is **dropout**. Dropout essentially means that during each training cycle a random fraction of the dense layer nodes are turned off. This is described with the dropout rate between 0 and 1 which determines the fraction of nodes to silence at a time. -![Dropout sketch](../fig/neural_network_sketch_dropout.png) +![Dropout sketch](../episodes/fig/neural_network_sketch_dropout.png) The intuition behind dropout is that it enforces redundancies in the network by constantly removing different elements of a network. The model can no longer rely on individual nodes and instead must create multiple "paths". In addition, the model has to make predictions with much fewer nodes and weights (connections between the nodes). As a result, it becomes much harder for a network to memorize particular features. At first this might appear a quiet drastic approach which affects the network architecture strongly. In practice, however, dropout is computationally a very elegant solution which does not affet training speed. And it frequently works very well. @@ -95,7 +95,7 @@ plt.xlabel("epochs") plt.ylabel("RMSE") ``` -![Output of plotting sample](../fig/03_training_history_4_rmse_dropout.png) +![Output of plotting sample](../episodes/fig/03_training_history_4_rmse_dropout.png) In this setting overfitting seems to be pervented succesfully. The overall results though have not improved (at least not by much). Above we have used dropout to randomly turn off network nodes during training. @@ -173,7 +173,7 @@ plt.hist(y_test_predicted_ensemble[0,:], rwidth=0.9) plt.xlabel("predicted sunshine hours") ``` -![Output of plotting sample](../fig/03_monte_carlo_dropout_distribution_example.png) +![Output of plotting sample](../episodes/fig/03_monte_carlo_dropout_distribution_example.png) Instead of full distributions for every datapoint we might also just want to extract the mean and standard deviation. ``` @@ -189,4 +189,4 @@ plt.scatter(y_test_predicted_mean, y_test, s=40*y_test_predicted_std, plt.xlabel("predicted") plt.ylabel("true values") ``` -![Output of plotting sample](../fig/03_scatter_plot_model_uncertainty.png) +![Output of plotting sample](../episodes/fig/03_scatter_plot_model_uncertainty.png)
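For completeness, here is a sketch of one way the Monte-Carlo dropout ensemble used above could be generated. The names `model` and `X_test` stand in for the trained dropout network and the test features from earlier in this bonus material, and the ensemble size is an illustrative choice.

```python
import numpy as np

# Calling the model (rather than using model.predict) with training=True keeps
# the dropout layers active at prediction time, so every call gives a slightly
# different prediction.
X = np.asarray(X_test, dtype='float32')
n_ensemble = 100
y_test_predicted_ensemble = np.stack(
    [model(X, training=True).numpy().flatten() for _ in range(n_ensemble)],
    axis=1)  # shape: (number of datapoints, n_ensemble)

y_test_predicted_mean = y_test_predicted_ensemble.mean(axis=1)
y_test_predicted_std = y_test_predicted_ensemble.std(axis=1)
```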