In Part I, the convolutional features of deep learning models were investigated by constructing, starting from random noise, an input that maximized the total activation of a given feature. But the training dataset for the models used (ImageNet) does not contain any images that resemble pure noise, making the initial input
To recap, an input
On the page referenced above, we optimized the output activation of interest
Now we will investigate a similar update, but using octaves without jitter (denoted
A three-octave approach was found to result in clear images.
# original image size: 299x299
single_input = directed_octave(single_input, target_output, 100, [0.5, 0.4], [2.4, 0.8], 0, index, crop=False) # first octave at 299x299
single_input = torchvision.transforms.Resize([380, 380])(single_input) # upscale before the second octave
single_input = directed_octave(single_input, target_output, 100, [0.4, 0.3], [1.5, 0.4], 330, index, crop=False) # second octave at 380x380
single_input = torchvision.transforms.Resize([460, 460])(single_input) # upscale before the third octave
single_input = directed_octave(single_input, target_output, 100, [0.3, 0.2], [1.1, 0.3], 375, index, crop=False) # third octave at 460x460
We will start by optimizing one or two features at a time. To do this, we want to increase the activation of each neuron in that feature using gradient descent. One way to accomplish this is to assign the loss function
As was done for feature visualization, computation of the gradient can be implemented with
def layer_gradient(model, input_tensor, desired_output, index):
    ...
    input_tensor.requires_grad = True
    output = model(input_tensor).to(device)
    focus = output[0][index] # optimize the feature of interest
    target = torch.ones(focus.shape).to(device)*200
    loss = torch.sum(target - focus)
    loss.backward()
    gradient = input_tensor.grad
    return gradient
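Here `model` is assumed to be a version of InceptionV3 truncated at the layer of interest (referred to as `newmodel` below), so that calling it returns the activations of that layer directly. As a minimal sketch of one alternative way to obtain those activations without truncating the network by hand (assuming a recent version of torchvision), a forward hook may be registered on the layer; the names `inception`, `activations`, and `feature_gradient` below are illustrative rather than part of the original code.

import torch
import torchvision.models as models

device = 'cuda' if torch.cuda.is_available() else 'cpu'
inception = models.inception_v3(weights='DEFAULT').to(device).eval()

activations = {}
inception.Mixed_6e.register_forward_hook(
    lambda module, inp, out: activations.update({'Mixed_6e': out})
)

def feature_gradient(input_tensor, index, target_value=200.):
    # hypothetical variant of layer_gradient that reads the hooked activations
    input_tensor.requires_grad = True
    inception(input_tensor) # forward pass fills the activations dictionary
    focus = activations['Mixed_6e'][0][index]
    target = torch.ones(focus.shape).to(device)*target_value
    loss = torch.sum(target - focus)
    loss.backward()
    return input_tensor.grad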
For this procedure applied to 8 features from layer Mixed 6e in InceptionV3 where the original input
It may be wondered if there is any meaningful difference between this procedure and feature visualization in which
single_input = (torch.rand(1, 3, 299, 299))/10 + 0.5 # scaled distribution initialization
and it is apparent that the input does influence the optimized image noticeably for a number of features in layer Mixed 6e.
One can also modify the input
What happens if the input image is optimized without enforcing smoothness, in other words if the Gaussian convolution step is omitted during gradient descent? We can implement gradient descent using octaves but without Gaussian convolution, as follows:
def octave(single_input, target_output, iterations, learning_rates, sigmas, index):
    ...
    start_lr, end_lr = learning_rates
    start_sigma, end_sigma = sigmas # kept for signature compatibility; no Gaussian convolution is applied here
    for i in range(iterations):
        single_input = single_input.detach() # remove the gradient for the input (if present)
        input_grad = layer_gradient(newmodel, single_input, target_output, index) # compute input gradient
        single_input = single_input - (start_lr*(iterations-i)/iterations + end_lr*i/iterations)*input_grad # gradient descent step
    return single_input
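As with the directed version above, this octave function may be applied in a three-octave schedule. The sketch below simply mirrors the earlier octave parameters and assumes that `single_input` holds a natural image (such as the flower photograph referred to below) rather than noise.

single_input = octave(single_input, target_output, 100, [0.5, 0.4], [2.4, 0.8], index) # first octave at the original size
single_input = torchvision.transforms.Resize([380, 380])(single_input)
single_input = octave(single_input, target_output, 100, [0.4, 0.3], [1.5, 0.4], index) # second octave
single_input = torchvision.transforms.Resize([460, 460])(single_input)
single_input = octave(single_input, target_output, 100, [0.3, 0.2], [1.1, 0.3], index) # third octave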
The other change we will make is to optimize the activations of an entire layer rather than a single feature, which is denoted as the sum of the tensor [:, :, :] of that layer's output:
def layer_gradient(model, input_tensor, desired_output, index):
    ...
    input_tensor.requires_grad = True
    output = model(input_tensor).to(device)
    focus = output[0] # optimize all feature maps of the layer
    target = torch.ones(focus.shape).to(device)*200
    loss = torch.sum(target - focus)
    loss.backward()
    gradient = input_tensor.grad
    return gradient
Using InceptionV3 as our model and applying gradient descent in 3 octaves to an input of flowers, when optimizing layer Mixed 6b we have
Other models yield similar results, for example optimizing layer 3 of ResNet gives
One interesting attribute of convolutional layers is that they may be applied to inputs of arbitrary dimension.
As with feature maps, we can apply the gradient descent procedure to inputs of arbitrary resolution because each layer is convolutional: only the weights and biases of the kernels are specified, so an image of any size may be used as input.
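To see this concretely, a convolutional layer specifies only its kernel weights and biases, so the same layer accepts inputs of any spatial size; the standalone layer below (not part of the model) simply illustrates the point.

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) # kernel weights are independent of input size
print(conv(torch.rand(1, 3, 299, 299)).shape) # torch.Size([1, 16, 299, 299])
print(conv(torch.rand(1, 3, 460, 460)).shape) # torch.Size([1, 16, 460, 460])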
Note, however, that there is a clear limit to this procedure: although the input image resolution may be increased without bound, the convolutional kernels of the model in question (here InceptionV3) were learned only at the resolution of the training inputs. This is why an increase in resolution yields smaller details relative to the whole image: the changes during gradient descent are made by a model that learned to expect images of a lower resolution.
It may be wondered how deep dream makes any kind of recognizable image at all. For input image generation, we saw that gradient descent on an initial input of random noise did not yield any recognizable images, and this was for only one neuron or feature at a time rather than for many features as we have here. It was only after the addition of a Bayesian prior enforcing smoothness that gradient descent began to produce recognizable images, but smoothness is typically not added during deep dream.
Furthermore, when one considers how a convolutional layer works for image classification, it is not immediately clear how optimizing the activations of many feature maps together would give anything other than an incoherent jumble. This is because during feed-forward operation each feature map in a convolutional layer is expected to have a different activation corresponding to which features are found in the image one wants to classify. Activating all the feature maps is equivalent to asking for an image that has all possible features, and for deeper layers with many (>= 256) features one may reasonably expect that this image will not look like much of anything at all, especially when no smoothness constraint is enforced.
After some consideration, the situation may not seem as bad as it does at first glance. The deep dream process begins with a natural image rather than noise, and therefore although we don't enforce the statistical characteristics of natural images during gradient descent, we have enforced them at the beginning of gradient descent.
To see why optimizing the activation of many features at once does not necessarily lead to a nonsensical image, consider the following example in which first one, then five, and then all 768 features of InceptionV3's layer Mixed 6c are optimized. As we saw in Part I, the following maps were found to maximally activate each of the first five features of layer Mixed 6c:
These inputs were generated from noise, but if one particular feature is activated by a certain pattern, one would expect the same pattern to appear when the starting point is not noise but an image of some kind. Indeed we see that this is the case: observe how feature 4 gives a very similar yarn-like appearance when the starting point is a picture of a dalmatian (top right below) as when it is noise (above).
On the bottom row, multiple features are optimized simultaneously. It is necessary to scale back the gradient to avoid producing very high-frequency or saturated final inputs, and we can do this by simply weighting the gradient of the entire layer by the inverse of the number of features in that layer: i.e. if there are around 1000 features in a given layer, we can divide the gradient of the layer by 1000. This is because the gradient is a linear operator and is therefore additive, meaning that the gradient of an entire layer
Therefore the gradient descent update performed may be scaled by the constant
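As a concrete sketch, assuming the whole-layer layer_gradient above, each update may simply divide the layer gradient by the number of feature maps (768 for Mixed 6c); the learning rate and variable names here are illustrative.

n_features = 768 # number of feature maps in Mixed 6c
lr = 0.5 # illustrative learning rate
single_input = single_input.detach()
input_grad = layer_gradient(newmodel, single_input, target_output, index)
single_input = single_input - (lr / n_features) * input_grad # scaled gradient descent step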
How then can we hope to make a coherent image if we are adding small gradients from nearly 1000 features? The important thing to remember is that a single image cannot possibly optimize all features at once. Consider the following simple example where we want to optimize the output of two features, where the gradient of the first feature
whereas another feature's gradient
now as gradients are additive, the total gradient
which when applied to the original input
What this means is that the deep dream procedure gives what can be thought of as a kind of 'average' across all features in the layers of interest. As each layer must pass all of the information necessary for accurate classification from the input to the output, one can expect the dream procedure to produce shapes that are most common in the training dataset used, provided that the features recognizing these shapes can be optimized from the input. This is why deep dream performed on sufficiently deep layers generally introduces animal objects, whereas a model trained on a dataset of landscapes generates images of landscapes and buildings during deep dream (reference).
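As a toy illustration of this additivity, consider two hypothetical per-pixel gradients: where they agree the summed gradient moves the input, but where they conflict the contributions cancel, so a single update cannot fully optimize both features.

import torch

grad_a = torch.tensor([1.0, 1.0])  # gradient contributed by one feature
grad_b = torch.tensor([1.0, -1.0]) # gradient contributed by another feature
total_grad = grad_a + grad_b # gradients are additive
print(total_grad) # tensor([2., 0.]): the conflicting component cancels out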
We can observe this inability to optimize all features during the dream process: note how the average feature activation does not change for many features even after 3 octaves of deep dream:
{% include youtube.html id='l__HrW5spn0' %}
Omitting smoothness has a notable justification: the starting input is a natural image that already contains the smoothness and other statistical characteristics of natural images, so as long as we do not modify the image too much, we would expect to end with something that resembles a natural image. But if we try to optimize the activation of some output category without smoothness, say to make an image of a 'Junco, Snowbird', we find that only noise has been added to the original image.
This suggests that simply starting with a natural image is not the reason why smoothness can be omitted from gradient descent during deep dream. Instead, observe how the average feature activation increased smoothly in the last video, and compare that to the discontinuous increases seen in the first video on this page (in which gradient descent without smoothness is used to optimize an output category). As the loss function in both cases is a measure of the values displayed at each time step of those videos, it is apparent that the output of a hidden layer is continuous with respect to the input, but the output of the final layer is not.
Discontinuities in the output of the final layer with respect to the input are to be expected for most classification models of sufficient power to approximate a one-to-one function (see this link for the reason why), but it appears that internal layers are not discontinuous with respect to the input. Exactly why this is the case is currently unclear.
In the previous section it was observed that not all features of the layer of interest were capable of increasing their mean activation during deep dream. Do certain features tend to exhibit an increase in activation, leading to a similar change in the input image for all starting images, or does the image generated (and therefore the features that are activated) depend on the original input? In other words, if the dream procedure tends to generate an image that is representative of the more common examples in the dataset used to train a network, what influence does the original input image have? In the original deep dream article, it was observed that the dream procedure tends to amplify objects that the model's layer of interest 'sees' in original images. It remains to be seen what influence the input has on a dream, or whether the same objects tend to be amplified no matter which input is given.
We will use GoogleNet as the model for this section. For reference, this model's architecture is as follows:
On this page it was observed that gradient descent on the input was capable of producing images that were confidently predicted to be of one output category. To test whether deep dream enhances characteristics of those categories or chooses new ones, these generated representative images will be used as initial inputs
Optimizing layer 4c leads to an increase in resolution for most representative ImageNet category images, for example this image for 'Stoplight'
For some layers it is clear that certain characteristics are introduced irrespective of the original image
For another non-animal
although other layers yield similar additions as for 'Stoplight'
and indeed we can see that the same patterns exist for most other input images, as for example
It is apparent that the final image after deep dream is of higher resolution than the initial image, partly but not entirely (see the next section) because the input generation procedure uses Gaussian blurring whereas deep dream does not. This increased resolution comes at the price of some precision, however, as the resulting image may not match the target category quite as well. Observe how optimizing layer 4c for an input representing 'Lion'
and optimizing layer 4c for an input representing 'Guinea Pig'
both lead to a substantial increase in resolution without high-frequency noise but do introduce certain characteristics of animals that are not the target class.
It is interesting that optimizing the activations of a hidden layer of our feed-forward network is capable of increasing the resolution of an input image. Performing gradient descent using only the output as a target nearly inevitably leads to the presence of high-frequency, incoherent structures unless some form of smoothness or local correlation is applied to the input or to the gradient.
Upon closer investigation of the representations that layer outputs are capable of forming of the input, there appears to be an explanation for why maximizing these early- and middle-layer outputs would lead to an increase in resolution. For ResNet in particular, it was found that training leads to early-layer kernels adopting weight configurations similar to those used to sharpen images, such that these layers learn to 'see' the input at higher resolution than at the start of training. If these layers are activated most strongly by sharp edges, it is natural that maximizing their activations would lead to a higher-resolution image. See here for research on layer representations.
As we observed in the last section, the dream procedure tends to be somewhat arbitrary in the modifications it makes to a given input: for some inputs it introduces many new objects (usually animal faces) and for others it does not. Can the dream procedure be directed in order to introduce features that we want?
Perhaps the simplest way to direct a dream towards one object is to perform layer optimization as well as a target optimization. This can be done using gradient descent on both the layer of interest, whose gradient is
A bit of experimentation is enough to convince one that this alone does not yield any kind of coherent image, as the same problems experienced on this page when optimizing the gradient of the target output arise here as well. Namely, optimization of an output without smoothness leads to the presence of an adversarial negative with near-certainty, meaning that we cannot direct our dream by simply adding the gradient of the output class.
Instead we can enforce a small amount of smoothness using Gaussian convolution
where $\mathcal{N}$ is applied only every 5 or 10 steps. Now the dream image usually contains recognizable images of the target class along with additional features that the dream might introduce. For the target class 'Bubble' and optimizing the activation of Layer 4c, observe how bubbles are present along with a house and some animals
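One possible form of such a directed step is sketched below: it combines the scaled layer gradient with the gradient of the target class and applies a Gaussian blur every fifth iteration. The helper class_gradient, the blur parameters, and the equal weighting of the two gradients are illustrative assumptions rather than the exact procedure used to generate these images.

import torch
import torchvision.transforms as transforms

def class_gradient(model, input_tensor, class_index, target_value=100.):
    # hypothetical analogue of layer_gradient applied to the final (logit) output
    input_tensor.requires_grad = True
    output = model(input_tensor)
    loss = target_value - output[0][class_index]
    loss.backward()
    return input_tensor.grad

def directed_dream_step(single_input, lr, index, class_index, step, n_features):
    single_input = single_input.detach()
    layer_grad = layer_gradient(newmodel, single_input, None, index).clone() / n_features
    single_input.grad = None # reset before accumulating the class gradient
    output_grad = class_gradient(model, single_input, class_index)
    single_input = single_input - lr * (layer_grad + output_grad) # combined descent step
    if step % 5 == 0:
        single_input = transforms.GaussianBlur(3, sigma=0.5)(single_input) # occasional smoothing
    return single_input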
Besides introducing some desired element into the dream, it is also apparent that this procedure results in higher resolution than occurs for image transfiguration where only the target class is optimized. If we apply the same smoothing procedure but with a newly scaled gradient
we find that the resulting bubbles are of somewhat lower resolution, mirroring what was observed in the last section of this page.
The increase in resolution for directed dreams relative to image transfigurations (using only the gradient of the target class) is most apparent when the target is an animal. For the target class 'Lion', observe how the directed dream with Layer 4c, which is as follows
introduces lions with much higher resolution than transfiguration of the same target.
Directed dreams can introduce practically any of the 1000 ImageNet target categories, or a combination of these categories. Observe the result of the dream on Layer 4c with a target class of 'Torch':
or 'Junco, Snowbird' (note the birds on the left- and right-hand sides of the image)