
Ladder nets #75 (Open)

wants to merge 10 commits into master

Conversation

@AdrianLsk (Author)

My ladder net implementation. Comments are more than welcome.

@f0k (Member) commented Jun 22, 2016

Cool! Had a quick look only, some minor comments:

  • you may want to add your name somewhere
  • in the notebook, you could link to the papers in the beginning, as part of a short introduction (when reading the notebook it wasn't obvious that there's a "References" section at the end)
  • !wget -N ... avoids redownloading if the file already exists (in your notebook it downloaded a second copy)
  • the batch normalization behaviour seems funny: why do you set alpha=1? This only influences the exponential moving average used for testing (when deterministic=True); for training (deterministic=False) it is limited to the current mini-batch statistics anyway (see the sketch after this list). This doesn't explain why mean and inv_std stick to 0 and 1, though; it seems they're not updated at all.
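For reference, a minimal sketch (plain NumPy, not Lasagne source) of the exponential-moving-average update that BatchNormLayer applies to its stored statistics; the stored mean/inv_std are only ever used in the deterministic=True pass:

import numpy as np

alpha = 0.1                        # Lasagne's default
running_mean = np.zeros(10)        # stored statistic, starts at Constant(0)
batch_mean = np.random.randn(10)   # mean of the current mini-batch
# moving-average update applied whenever the averages are being updated:
running_mean = (1 - alpha) * running_mean + alpha * batch_mean
# with alpha=1 the stored mean simply becomes the last batch's mean;
# training (deterministic=False) normalizes by the current batch either way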

@AdrianLsk (Author) commented Jun 22, 2016

Thanks! The idea behind the double batch normalization comes from the noise injection in the dirty part of the encoder, i.e. first normalize the batch by its mean and std, then add noise, and shift and scale the output by trainable parameters afterwards. My idea was not to use deterministic=True, because it would turn off the noise injection (as in dropout) and I still need to calculate the running stats for the clean encoder. So for the second (learnable) batch normalization I hardcoded the mean (0) and inv_std (1) using constant variables and set alpha=1 so they are not updated with the previous stats values. It worked, so I didn't explore anything else. I might have misunderstood the behavior intended by the alpha setting, though, so it might not even be required since I am using the constant variables. I will check what it does and update the notebook later.
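For concreteness, a rough sketch of the construction described above, with hypothetical layer names and assuming standard Lasagne layers (this second batch normalization is exactly what is questioned below):

from lasagne import init
from lasagne.layers import BatchNormLayer, GaussianNoiseLayer

# 1) normalize by the mini-batch statistics only, no learnable scale/shift
l_norm = BatchNormLayer(l_dense, beta=None, gamma=None)
# 2) inject Gaussian noise (the "dirty" part of the encoder)
l_noisy = GaussianNoiseLayer(l_norm, sigma=0.3)
# 3) second batch normalization used for the trainable shift/scale, with the
#    stored mean/inv_std hard-coded to 0/1 and alpha=1 so they are not blended
#    with previous values
l_out = BatchNormLayer(l_noisy, mean=init.Constant(0),
                       inv_std=init.Constant(1), alpha=1)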

@f0k (Member) commented Jun 22, 2016

You can pass batch_norm_use_averages and batch_norm_update_averages to get_output() to override the default behaviour from deterministic=True and deterministic=False. This allows you to control batch normalization independently of other indeterministic layers.
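For instance (with `network` a placeholder for the output layer), the following keeps noise/dropout layers stochastic but makes all BatchNormLayers use their stored averages and leave them untouched:

import lasagne

output = lasagne.layers.get_output(network, deterministic=False,
                                   batch_norm_use_averages=True,
                                   batch_norm_update_averages=False)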

Wait, reading again, do you mean to use the second batch normalization merely for scaling and shifting? It will still do batch normalization during training even if mean and inv_std are set to constant values (those are used in the deterministic=False pass only). You should use a ScaleLayer and BiasLayer if you only want a learnable scale and shift.
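A purely learnable scale and shift (matching BatchNormLayer's gamma/beta, but without any normalization) would then look roughly like this, with hypothetical layer names:

from lasagne.layers import BiasLayer, ScaleLayer

l = ScaleLayer(l_noisy)   # trainable per-feature scale (gamma)
l = BiasLayer(l)          # trainable per-feature shift (beta)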

@AdrianLsk (Author) commented Jun 23, 2016

So it turns out that something goes wrong after using a scale and bias layer instead of the second batch normalization (which smoothed everything out by normalizing the batch again). The reconstruction costs of the latent and classification layers become huge, so I have to figure out why. My guess is that I am not correctly using the batch mean and inv_std from the first batch normalization, even after keeping only the mini-batch stats by setting alpha=1 instead of using running stats. I need to use the dirty encoder's mini-batch stats in the dirty decoder for normalizing the denoised output of the combinator layers, so there must be big differences among those values, which in turn give rise to huge reconstruction costs. Do you have any idea what could cause the Theano function output to raise a floating point exception?

@f0k (Member) commented Jun 23, 2016

> My guess is that I am not correctly using the batch mean and inv_std from the first batch normalization, even after keeping only the mini-batch stats by setting alpha=1 instead of using running stats.

Let me point out that I was just guessing on what you were trying to achieve, so please take my advice with a grain of salt. To get things straight, I can see that you've got three networks: The dirty encoder, the decoder, and the clean encoder. All of them share their weight matrices. A given minibatch x is passed through the dirty encoder + decoder, and through the clean encoder.

Now where does batch normalization come into play? Which batch normalization parts do you want to share between networks, and between which ones exactly? Where does the noise injection take place?

> Do you have any idea what could cause the Theano function output to raise a floating point exception?

I think ideally they're meant to be caught by Theano... did your process get terminated with SIGFPE? Some possible causes are listed in this slightly dubious ("there is no way to represent complex numbers in computers") source: https://www.quora.com/What-might-be-the-possible-causes-for-floating-point-exception-error-in-C++

@AdrianLsk (Author) commented Jun 23, 2016

> Let me point out that I was just guessing on what you were trying to achieve, so please take my advice with a grain of salt. To get things straight, I can see that you've got three networks: The dirty encoder, the decoder, and the clean encoder. All of them share their weight matrices. A given minibatch x is passed through the dirty encoder + decoder, and through the clean encoder.

I was trying to say that your point was right: there should not be a second mini-batch normalization. The dirty and clean encoders share the weights and the batch normalization parameters beta and gamma, while the dirty decoder shares the batch normalization means and standard deviations of the clean encoder.

> Now where does batch normalization come into play? Which batch normalization parts do you want to share between networks, and between which ones exactly? Where does the noise injection take place?

The normalization part of batch normalization (using the mini-batch mean and std) follows right after the affine transformation (i.e. the dense layer), then the noise is injected, and afterwards comes the scaling (using the learnable beta and gamma). The beta and gamma parameters are shared between the dirty and clean encoders, while the means and stds of the clean encoder are shared with the dirty decoder.
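As a sketch of the sharing mechanism (hypothetical layer names; the clean encoder reuses the dirty encoder's shared variables by passing them to the constructors):

from lasagne.layers import DenseLayer, BatchNormLayer

# the clean encoder reuses the dirty encoder's weight matrix ...
clean_dense = DenseLayer(clean_in, num_units=500, b=None, nonlinearity=None,
                         W=dirty_dense.W)
# ... and its learnable beta/gamma for the scaling step
clean_bn = BatchNormLayer(clean_dense,
                          beta=dirty_bn.beta, gamma=dirty_bn.gamma)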

Now, sharing weights and beta/gamma parameters between encoders is straightforward, but the question is how to share those means and stds in one direction, clean encoder -> dirty decoder, or whether it's possible at all without an additional layer (like a custom normalization layer).

I will try the additional layer approach to see if anything changes.

@f0k (Member) commented Jun 24, 2016

> Now, sharing weights and beta/gamma parameters between encoders is straightforward,

We did our best!

> but the question is how to share those means and stds in one direction, clean encoder -> dirty decoder, or whether it's possible at all without an additional layer (like a custom normalization layer).

There's no direct way to access the mean and std used by the encoder. As I said, the ones you see as parameters are only used for inference (i.e., when deterministic=True). Setting alpha=1 will make them depend on the last batch only, but then they will indeed depend on the last batch seen, and not on the current batch.
Luckily, you don't have to access the mean and std used by the encoder, you can just compute them yourself. Theano will see that the expressions are equivalent and reuse them. You would need a custom UndoBatchNormLayer or something like that which gets two inputs: The input to the corresponding BatchNormLayer so you can recompute mean and std, and the input you want to transform.
Something like:

import lasagne
import theano.tensor as T

class UndoBatchNormLayer(lasagne.layers.MergeLayer):
    """Undoes a BatchNormLayer's normalization by recomputing its batch statistics."""
    def __init__(self, incoming, bn_layer, **kwargs):
        # second input: the tensor the BatchNormLayer normalized, so we can
        # recompute the same mean/std (Theano will reuse the expressions)
        super(UndoBatchNormLayer, self).__init__(
                [incoming, bn_layer.input_layer], **kwargs)
        self.axes = bn_layer.axes
        self.epsilon = bn_layer.epsilon

    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]

    def get_output_for(self, inputs, **kwargs):
        input, bn_input = inputs
        mean = bn_input.mean(self.axes)
        var = bn_input.var(self.axes)
        std = T.sqrt(var + self.epsilon)
        return input * std + mean

I'm assuming that you want the decoder to undo the transformations of the encoder here.
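Usage would then be along these lines (hypothetical layer names):

# l_enc_bn   : the encoder's BatchNormLayer whose normalization we undo
# l_denoised : the decoder output to be mapped back to that layer's scale
l_rec = UndoBatchNormLayer(l_denoised, bn_layer=l_enc_bn)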

@AdrianLsk (Author)

> Luckily, you don't have to access the mean and std used by the encoder, you can just compute them yourself. Theano will see that the expressions are equivalent and reuse them. You would need a custom UndoBatchNormLayer or something like that which gets two inputs: The input to the corresponding BatchNormLayer so you can recompute mean and std, and the input you want to transform.

Yes, that's exactly what I did yesterday and it worked! I will finish some extra adjustments and push it later today. Thanks for the feedback! It helped a lot.


to_stats_l = clean_net[enc_bname]
to_norm_l = dirty_net[comb_name]
dirty_net[bname] = SharedNormLayer(to_stats_l, to_norm_l)
Member commented on the diff above

Now this removes the mean and divides by the standard deviation that was also used in the encoding step -- does this make sense? Shouldn't the decoder be doing the reverse? You also have standard BatchNormLayers in the decoder, maybe you don't need the SharedNormLayers at all? (Disclaimer: I haven't looked at your code or the paper in detail, I'm just wondering. I'm happy to learn why it is implemented the way it is.)

@AdrianLsk (Author) commented Jun 27, 2016

Yes, it does. The output of the denoising layer should be comparable with the output of the corresponding encoder layer in order to calculate the reconstruction cost. If you skim through the algorithm on page 5 in http://arxiv.org/pdf/1507.02672v2.pdf, you will find that the decoder first calculates the affine transformation with batch normalization (i.e. my dense layer and batch norm layer without learnable beta and gamma), then feeds the output to the denoising function, and subsequently normalizes it with the stats from the clean encoder. I need those standard batch norm layers to learn the beta and gamma parameters in the dirty encoder and share them afterwards with the clean encoder.
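Roughly, one decoder step as described above might look like this (hypothetical names; CombinatorLayer stands in for the denoising function, and SharedNormLayer is the layer from the diff above):

from lasagne.layers import DenseLayer, BatchNormLayer

# affine transformation plus batch normalization without learnable beta/gamma
l_dec = DenseLayer(l_prev, num_units=500, b=None, nonlinearity=None)
l_dec = BatchNormLayer(l_dec, beta=None, gamma=None)
# denoising function combining the decoder and dirty-encoder paths
l_comb = CombinatorLayer([l_dec, l_dirty_enc])
# normalize the denoised output with the clean encoder's batch statistics
l_rec = SharedNormLayer(clean_net[enc_bname], l_comb)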

@benanne (Member) commented Jun 29, 2016

Just wanted to say this looks great, and thanks for contributing! :)

@f0k (Member) commented Aug 31, 2016

Sorry for the delay, github doesn't notify about changes, only about comments. Is this ready to merge from your side, @AdrianLsk?

@AdrianLsk (Author)

Hi @f0k, not yet. Although this version is working, I still need to push my latest changes. I refactored the code and fixed some pooling-layer inconsistencies with the original ladder nets code. I will do it this weekend and let you know when it's ready.

@AdrianLsk (Author)

@f0k I think it's ready to merge.
