Merge branch 'main' into image-paths-and-captions
svenvanderburg authored Nov 8, 2023
2 parents 1752301 + f1e061a commit 56eff9f
Showing 13 changed files with 302 additions and 170 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -6,7 +6,7 @@
This lesson gives an introduction to deep learning.

## Lesson Design
The design of this lesson can be found in the [lesson design](_extras/design.md)
The design of this lesson can be found in the [lesson design](https://carpentries-incubator.github.io/deep-learning-intro/design.html)

## Target Audience
The main audience of this carpentry lesson is PhD students that have little to no experience with
@@ -30,7 +30,7 @@ Please see the current list of
[issues](https://github.com/carpentries-incubator/deep-learning_intro/issues)
for ideas for contributing to this repository.

Please also familiarize yourself with the [lesson design](_extras/design.md)
Please also familiarize yourself with the [lesson design](https://carpentries-incubator.github.io/deep-learning-intro/design.html)

For making your contribution, we use the GitHub flow, which is nicely explained in the
chapter [Contributing to a Project](http://git-scm.com/book/en/v2/GitHub-Contributing-to-a-Project)
4 changes: 4 additions & 0 deletions episodes/1-introduction.Rmd
@@ -110,7 +110,11 @@ b. What logical problem does this network solve?

:::: solution
## Solution

#### 1: calculate the output for one neuron

You can calculate the output as follows:

* Weighted sum of input: `0 * (-1) + 0.5 * (-0.5) + 1 * 0.5 = 0.25`
* Add the bias: `0.25 + 1 = 1.25`
* Apply activation function: `max(1.25, 0) = 1.25`
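
The same calculation as a short Python sketch (the inputs, weights, and bias are taken from the exercise above; the variable names are just for illustration):

```python
# Inputs, weights, and bias from the exercise
inputs = [0, 0.5, 1]
weights = [-1, -0.5, 0.5]
bias = 1

weighted_sum = sum(x * w for x, w in zip(inputs, weights))  # 0.25
output = max(weighted_sum + bias, 0)  # ReLU activation: max(1.25, 0)
print(output)  # 1.25
```
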
52 changes: 40 additions & 12 deletions episodes/2-keras.Rmd
@@ -45,6 +45,14 @@ As a reminder below are the steps of the deep learning workflow:

In this episode we will focus on a minimal example for each of these steps; later episodes will build on this knowledge to go into greater depth for some or all of these steps.

::: instructor
This episode really aims to go through the whole process once, as quickly as possible.
In episode 3 we will expand on all the concepts that are lightly introduced in episode 2. Some concepts, like monitoring the training progress, optimization, and the learning rate, are explained in detail in episode 3.
It is good to stress this a few times, because learners will usually have a lot of questions like:
'Why don't we normalize our features?' or 'Why do we choose the Adam optimizer?'.
It can be a good idea to park some of these questions for discussion in episodes 3 and 4.
:::

::: callout
## GPU usage
For this lesson having a GPU (graphics card) available is not needed.
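
If you are curious whether TensorFlow can see a GPU on your machine, a quick check looks roughly like this (a sketch, not part of the lesson):

```python
import tensorflow as tf

# An empty list means TensorFlow will run on the CPU, which is fine for this lesson
print(tf.config.list_physical_devices('GPU'))
```
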
@@ -202,7 +210,7 @@ penguins_filtered = penguins_filtered.dropna()
Finally, we select only the features
```python
# Extract columns corresponding to features
penguins_features = penguins_filtered.drop(columns=['species'])
features = penguins_filtered.drop(columns=['species'])
```

### Prepare target data for training
@@ -236,7 +244,7 @@ How many output neurons will our network have now that we one-hot encoded the ta

:::: solution
## Solution
3, one for each output variable class
C: 3, one for each output variable class

::::
:::
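
One way the one-hot encoding itself could be done with pandas (a sketch; the exact call used in the lesson may differ):

```python
import pandas as pd

# One binary column per penguin species; three classes -> three output neurons
target = pd.get_dummies(penguins_filtered['species'])
print(target.head())
```
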
@@ -254,7 +262,7 @@ For this episode we will keep it at just a training and test set however.
To split the cleaned dataset into a training and test set we will use a very convenient
function from sklearn called `train_test_split`.
This function takes a number of parameters which are extensively explained [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) :

- The first two parameters are the dataset (in our case penguins_features) and the corresponding targets (i.e. defined as target).
- The first two parameters are the dataset (in our case features) and the corresponding targets (i.e. defined as target).
- Next is the named parameter `test_size` this is the fraction of the dataset that is
used for testing, in this case `0.2` means 20% of the data will be used for testing.
- `random_state` controls the shuffling of the dataset, setting this value will reproduce
@@ -265,7 +273,7 @@ the same results (assuming you give the same integer) every time it is called.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(penguins_features, target,test_size=0.2, random_state=0, shuffle=True, stratify=target)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0, shuffle=True, stratify=target)
```
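
A quick sanity check of the split (an optional sketch; it assumes `target` holds the one-hot encoded labels as above):

```python
print(X_train.shape, X_test.shape)  # roughly an 80/20 split of the rows

# Because of stratify=target, the class proportions should be
# nearly identical in the train and test sets
print(y_train.mean())
print(y_test.mean())
```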

## 4. Build an architecture from scratch or choose a pretrained model
@@ -389,11 +397,8 @@ where each layer has **exactly one input tensor and one output tensor**.

:::: solution
## Solution
Have a look at the output of `model.summary()`:
```python
inputs = keras.Input(shape=X_train.shape[1])
hidden_layer = keras.layers.Dense(10, activation="relu")(inputs)
output_layer = keras.layers.Dense(3, activation="softmax")(hidden_layer)
model = keras.Model(inputs=inputs, outputs=output_layer)
model.summary()
```

@@ -414,10 +419,14 @@ Non-trainable params: 0
_________________________________________________________________
```
The model has 83 trainable parameters.

If you increase the number of neurons in the hidden layer the number of
trainable parameters in both the hidden and output layer increases or
decreases accordingly of neurons.
The name in quotes within the string `Model: "model_1"` may be different in your view; this detail is not important.
decreases in accordance with the number of neurons added.
Each extra neuron has 4 weights connected to the input layer, 1 bias term, and 3 weights connected to the output layer.
So in total 8 extra parameters.

*The name in quotes within the string `Model: "model_1"` may be different in your view; this detail is not important.*
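
As a quick cross-check, the parameter count can also be computed by hand (a small sketch, not part of the original solution):

```python
n_inputs, n_hidden, n_outputs = 4, 10, 3

hidden_params = (n_inputs + 1) * n_hidden   # 4 weights + 1 bias per hidden neuron: 50
output_params = (n_hidden + 1) * n_outputs  # 10 weights + 1 bias per output neuron: 33
print(hidden_params + output_params)        # 83 trainable parameters
```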

#### (optional) Keras Sequential vs Functional API
3. This implements the same model using the Sequential API:
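
The code itself is collapsed in this diff view; a sketch of what such a Sequential definition could look like, assuming the same layer sizes as above:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.summary()
```
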
@@ -524,11 +533,30 @@ Looking at the training curve we have just made.
* Does the graph look very jittery?
2. Do you think the resulting trained network will work well on the test set?

When the training process does not go well:

3. (optional) Something went wrong here during training. What could be the problem, and how do you see that in the training curve?
Also compare the range on the y-axis with the previous training curve.
![](../fig/02_bad_training_history_1.png){alt='Very jittery training curve with the loss value jumping back and forth between 2 and 4. The range of the y-axis is from 2 to 4, whereas in the previous training curve it was from 0 to 2. The loss seems to decrease a little bit, but not as much as in the previous plot, where it dropped to almost 0. The minimum loss in the end is somewhere around 2.'}


:::: solution
## Solution
1. The loss curve should drop quite quickly in a smooth line with little jitter
1. The training loss decreases quickly. It drops in a smooth line with little jitter.
This is ideal for a training curve.
2. The results of the training give very little information on its performance on a test set.
You should be careful not to use it as an indication of a well trained network.
3. (optional) The loss does not go down at all, or only very slightly. This means that the model is not learning anything.
It could be that something went wrong in the data preparation (for example the labels are not attached to the right features).
In addition, the graph is very jittery. This means that for every update step,
the weights in the network are updated in such a way that the loss sometimes increases a lot and sometimes decreases a lot.
This could indicate that the weights are updated too much at every learning step and you need a smaller learning rate
(we will go into more details on this in the next episode).
Or there is a high variation in the data, leading the optimizer to change the weights in different directions at every learning step.
This could be addressed by presenting more data at every learning step (or in other words increasing the batch size).
In this case the graph was created by training on nonsense data, so this is a training curve for a problem where nothing can really be learned.

We will take a closer look at training curves in the next episode. Some of the concepts touched upon here will also be further explained there.

::::
:::
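
The two remedies mentioned in the solution, a smaller learning rate and a larger batch size, would look roughly like this in Keras (a sketch; the values are arbitrary and not from the lesson):

```python
from tensorflow import keras

# Smaller learning rate: weight updates become less aggressive
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss=keras.losses.CategoricalCrossentropy())

# Larger batch size: each update averages the loss over more samples
history = model.fit(X_train, y_train, epochs=100, batch_size=64)
```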

@@ -736,7 +764,7 @@ Length: 69, dtype: object
[sex_pairplot]: fig/02_sex_pairplot.png "Pair plot grouped by sex"
{alt='Pair plot showing the separability of the two sexes of penguin for combinations of dataset attributes'}

[training_curve]: fig/training_curve.png "Training Curve"
[training_curve]: fig/02_training_curve.png "Training Curve"
{alt='Training loss curve of the neural network training which depicts exponential decrease in loss before a plateau from ~10 epochs'}

[confusion_matrix]: fig/confusion_matrix.png "Confusion Matrix"
30 changes: 20 additions & 10 deletions episodes/3-monitor-the-model.Rmd
@@ -23,6 +23,19 @@ exercises: 80
- "Implement basic strategies to prevent overfitting"
:::

::: instructor
## Copy-pasting code
In this episode we first introduce a simple approach to the problem,
then we iterate on it a few times, step by step,
working towards a more complex solution.
Unfortunately this involves reusing the same code over and over again,
only slightly adapting it.

To avoid too much typing, it can help to copy-paste code from higher up in the notebook.
Be sure to make it clear where you are copying from
and what you are actually changing in the copied code.
It can for example help to add a comment to the lines that you added.
:::

In this episode we will explore how to monitor the training progress, evaluate the model predictions, and fine-tune the model to avoid over-fitting. For that we will use a more complicated weather dataset.

@@ -281,10 +294,10 @@ Answer the following questions:
We want to move towards the global minimum, so in the opposite direction of the gradient.

3. Correct answer: B & D
- A. The number of samples in an epoch also increases (incorrect, an epoch is always defined as passing through the training data for one cycle)
- B. The number of batches in an epoch goes down (correct, the number of batches is the samples in an epoch divided by the batch size)
- C. The training progress is more jumpy, because more samples are consulted in each update step (one batch). (incorrect, more samples are consulted in each update step, but this makes the progress less jumpy since you get a more accurate estimate of the loss in the entire dataset)
- D. The memory load (memory as in computer hardware) of the training process is increased (correct, the data is being loaded one batch at a time, so more samples per batch means more memory usage)
- A. The number of samples in an epoch also increases (**incorrect**, an epoch is always defined as passing through the training data for one cycle)
- B. The number of batches in an epoch goes down (**correct**, the number of batches is the samples in an epoch divided by the batch size)
- C. The training progress is more jumpy, because more samples are consulted in each update step (one batch). (**incorrect**, more samples are consulted in each update step, but this makes the progress less jumpy since you get a more accurate estimate of the loss in the entire dataset)
- D. The memory load (memory as in computer hardware) of the training process is increased (**correct**, the data is being loaded one batch at a time, so more samples per batch means more memory usage)

::::
:::
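
To make the relation between batch size and the number of batches per epoch concrete, here is a tiny illustrative calculation (the numbers are made up, not from the lesson):

```python
import math

n_samples = 1000  # hypothetical size of the training set
for batch_size in (16, 32, 64):
    n_batches = math.ceil(n_samples / batch_size)  # batches needed for one epoch
    print(batch_size, n_batches)  # larger batches -> fewer update steps per epoch
```
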
@@ -428,7 +441,8 @@ plot_predictions(y_test_predicted, y_test, title='Predictions on the test set')
## Solution
While the performance on the train set seems reasonable, the performance on the test set is much worse.
This is a common problem called **overfitting**, which we will discuss in more detail later.
Optional exercise:

#### Optional exercise:
The metric that we are using (RMSE) would be a good one. You could also consider Mean Squared Error, which punishes large errors more (because large errors create even larger squared errors).
It is important that if the model improves in performance on the basis of this metric then that should also lead you a step closer to reaching your goal: to predict tomorrow's sunshine hours.
If you feel that improving the metric does not lead you closer to your goal, then it would be better to choose a different metric
@@ -702,10 +716,6 @@ An alternative, more common approach, is to add **BatchNormalization** layers ([
Similar to dropout, batch normalization is available as a network layer in Keras and can be added to the network in a similar way.
It does not require any additional parameter setting.

```python
from tensorflow.keras.layers import BatchNormalization
```

The `BatchNormalization` can be inserted as yet another layer into the architecture.

```python
@@ -714,7 +724,7 @@
inputs = keras.layers.Input(shape=(X_data.shape[1],), name='input')

# Dense layers
layers_dense = keras.layers.BatchNormalization()(inputs)
layers_dense = keras.layers.BatchNormalization()(inputs) # This is new!
layers_dense = keras.layers.Dense(100, 'relu')(layers_dense)
layers_dense = keras.layers.Dense(50, 'relu')(layers_dense)
