
Ladder nets #75 (Open)

wants to merge 10 commits into master

Conversation

@AdrianLsk (Author)

My ladder net implementation. Comments are more than welcome.

@f0k (Member) commented Jun 22, 2016

Cool! Had a quick look only, some minor comments:

  • you may want to add your name somewhere
  • in the notebook, you could link to the papers in the beginning, as part of a short introduction (when reading the notebook it wasn't obvious that there's a "References" section at the end)
  • !wget -N ... avoids redownloading if the file already exists (in your notebook it downloaded a second copy)
  • the batch normalization behaviour seems funny: why do you set alpha=1? This only influences the exponential moving average used for testing (when deterministic=True); for training (deterministic=False) it is limited to the current mini-batch statistics anyway (see the sketch after this list). This doesn't explain why mean and inv_std stick to 0 and 1, though; it seems they're not updated at all.
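For reference, a minimal sketch (plain NumPy, not Lasagne source) of the exponential-moving-average update that BatchNormLayer applies to its stored statistics; the stored mean/inv_std are only ever used in the deterministic=True pass:

import numpy as np

alpha = 0.1                        # Lasagne's default
running_mean = np.zeros(10)        # stored statistic, starts at Constant(0)
batch_mean = np.random.randn(10)   # mean of the current mini-batch
# moving-average update applied whenever the averages are being updated:
running_mean = (1 - alpha) * running_mean + alpha * batch_mean
# with alpha=1 the stored mean simply becomes the last batch's mean;
# training (deterministic=False) normalizes by the current batch either way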

@AdrianLsk (Author) commented Jun 22, 2016

Thanks! The idea behind the double batch normalization comes from the noise injection in the dirty part of the encoder, i.e. first normalize the batch by its mean and std, then add noise, and shift and scale the output by trainable parameters afterwards. My idea was not to use deterministic=True, because it would turn off the noise injection (as in dropout) and I still need to calculate the running stats for the clean encoder. So for the second (learnable) batch normalization I hardcoded the mean (0) and inv_std (1) using constant variables and set alpha=1 so they are not updated with the previous stats values. It worked, so I didn't explore anything else. I might have misunderstood the behavior intended by the alpha setting, though, so it might not even be required since I am using the constant variables. I will check what it does and update the notebook later.
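For concreteness, a rough sketch of the construction described above, with hypothetical layer names and assuming standard Lasagne layers (this second batch normalization is exactly what is questioned below):

from lasagne import init
from lasagne.layers import BatchNormLayer, GaussianNoiseLayer

# 1) normalize by the mini-batch statistics only, no learnable scale/shift
l_norm = BatchNormLayer(l_dense, beta=None, gamma=None)
# 2) inject Gaussian noise (the "dirty" part of the encoder)
l_noisy = GaussianNoiseLayer(l_norm, sigma=0.3)
# 3) second batch normalization used for the trainable shift/scale, with the
#    stored mean/inv_std hard-coded to 0/1 and alpha=1 so they are not blended
#    with previous values
l_out = BatchNormLayer(l_noisy, mean=init.Constant(0),
                       inv_std=init.Constant(1), alpha=1)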

@f0k (Member) commented Jun 22, 2016

You can pass batch_norm_use_averages and batch_norm_update_averages to get_output() to override the default behaviour from deterministic=True and deterministic=False. This allows you to control batch normalization independently of other indeterministic layers.
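For instance (with `network` a placeholder for the output layer), the following keeps noise/dropout layers stochastic but makes all BatchNormLayers use their stored averages and leave them untouched:

import lasagne

output = lasagne.layers.get_output(network, deterministic=False,
                                   batch_norm_use_averages=True,
                                   batch_norm_update_averages=False)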

Wait, reading again, do you mean to use the second batch normalization merely for scaling and shifting? It will still do batch normalization during training even if mean and inv_std are set to constant values (those are used in the deterministic=False pass only). You should use a ScaleLayer and BiasLayer if you only want a learnable scale and shift.
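A purely learnable scale and shift (matching BatchNormLayer's gamma/beta, but without any normalization) would then look roughly like this, with hypothetical layer names:

from lasagne.layers import BiasLayer, ScaleLayer

l = ScaleLayer(l_noisy)   # trainable per-feature scale (gamma)
l = BiasLayer(l)          # trainable per-feature shift (beta)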

@AdrianLsk (Author) commented Jun 23, 2016

So it turns out that something goes wrong after using a scale and bias layer instead of the second batch normalization (which smoothed everything out by normalizing the batch again). The reconstruction costs of the latent and classification layers become huge, so I have to figure out why. My guess is that I am not correctly using the batch mean and inv_std from the first batch normalization, even after keeping only the mini-batch stats by setting alpha=1 instead of using running stats. I need to use the dirty encoder's mini-batch stats in the dirty decoder for normalizing the denoised output of the combinator layers, so there must be big differences among those values, which in turn give rise to huge reconstruction costs. Do you have any idea what could cause the Theano function output to raise a floating point exception?

@f0k (Member) commented Jun 23, 2016

> My guess is that I am not correctly using the batch mean and inv_std from the first batch normalization, even after keeping only the mini-batch stats by setting alpha=1 instead of using running stats.

Let me point out that I was just guessing on what you were trying to achieve, so please take my advice with a grain of salt. To get things straight, I can see that you've got three networks: The dirty encoder, the decoder, and the clean encoder. All of them share their weight matrices. A given minibatch x is passed through the dirty encoder + decoder, and through the clean encoder.

Now where does batch normalization come into play? Which batch normalization parts do you want to share between networks, and between which ones exactly? Where does the noise injection take place?

> Do you have any idea what could cause the Theano function output to raise a floating point exception?

I think ideally they're meant to be caught by Theano... did your process get terminated with SIGFPE? Some possible causes are listed in this slightly dubious ("there is no way to represent complex numbers in computers") source: https://www.quora.com/What-might-be-the-possible-causes-for-floating-point-exception-error-in-C++

@AdrianLsk (Author) commented Jun 23, 2016

> Let me point out that I was just guessing on what you were trying to achieve, so please take my advice with a grain of salt. To get things straight, I can see that you've got three networks: The dirty encoder, the decoder, and the clean encoder. All of them share their weight matrices. A given minibatch x is passed through the dirty encoder + decoder, and through the clean encoder.

I was trying to say that your point was right: there should not be a second mini-batch normalization. The dirty and clean encoders share the weights and the batch normalization parameters beta and gamma, while the dirty decoder shares the batch normalization means and standard deviations of the clean encoder.

> Now where does batch normalization come into play? Which batch normalization parts do you want to share between networks, and between which ones exactly? Where does the noise injection take place?

The normalization part of batch normalization (using the mini-batch mean and std) follows right after the affine transformation (i.e. the dense layer), then the noise is injected, and afterwards comes the scaling (using the learnable beta and gamma). The beta and gamma parameters are shared between the dirty and clean encoders, while the means and stds of the clean encoder are shared with the dirty decoder.
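As a sketch of the sharing mechanism (hypothetical layer names; the clean encoder reuses the dirty encoder's shared variables by passing them to the constructors):

from lasagne.layers import DenseLayer, BatchNormLayer

# the clean encoder reuses the dirty encoder's weight matrix ...
clean_dense = DenseLayer(clean_in, num_units=500, b=None, nonlinearity=None,
                         W=dirty_dense.W)
# ... and its learnable beta/gamma for the scaling step
clean_bn = BatchNormLayer(clean_dense,
                          beta=dirty_bn.beta, gamma=dirty_bn.gamma)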

Now, sharing weights and beta/gamma parameters between encoders is straightforward, but the question is how to share those means and stds in one direction, clean encoder -> dirty decoder, or whether it's possible at all without an additional layer (like a custom normalization layer).

I will try the additional layer approach to see if anything changes.

@f0k (Member) commented Jun 24, 2016

> Now, sharing weights and beta/gamma parameters between encoders is straightforward,

We did our best!

> but the question is how to share those means and stds in one direction, clean encoder -> dirty decoder, or whether it's possible at all without an additional layer (like a custom normalization layer).

There's no direct way to access the mean and std used by the encoder. As I said, the ones you see as parameters are only used for inference (i.e., when deterministic=True). Setting alpha=1 will make them depend on the last batch only, but then they will indeed depend on the last batch seen, and not on the current batch.
Luckily, you don't have to access the mean and std used by the encoder, you can just compute them yourself. Theano will see that the expressions are equivalent and reuse them. You would need a custom UndoBatchNormLayer or something like that which gets two inputs: The input to the corresponding BatchNormLayer so you can recompute mean and std, and the input you want to transform.
Something like:

import lasagne
import theano.tensor as T

class UndoBatchNormLayer(lasagne.layers.MergeLayer):
    """Undoes a BatchNormLayer's normalization by recomputing its batch statistics."""
    def __init__(self, incoming, bn_layer, **kwargs):
        # second input: the tensor the BatchNormLayer normalized, so we can
        # recompute the same mean/std (Theano will reuse the expressions)
        super(UndoBatchNormLayer, self).__init__(
                [incoming, bn_layer.input_layer], **kwargs)
        self.axes = bn_layer.axes
        self.epsilon = bn_layer.epsilon

    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]

    def get_output_for(self, inputs, **kwargs):
        input, bn_input = inputs
        mean = bn_input.mean(self.axes)
        var = bn_input.var(self.axes)
        std = T.sqrt(var + self.epsilon)
        return input * std + mean

I'm assuming that you want the decoder to undo the transformations of the encoder here.
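Usage would then be along these lines (hypothetical layer names):

# l_enc_bn   : the encoder's BatchNormLayer whose normalization we undo
# l_denoised : the decoder output to be mapped back to that layer's scale
l_rec = UndoBatchNormLayer(l_denoised, bn_layer=l_enc_bn)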

@AdrianLsk (Author)

> Luckily, you don't have to access the mean and std used by the encoder, you can just compute them yourself. Theano will see that the expressions are equivalent and reuse them. You would need a custom UndoBatchNormLayer or something like that which gets two inputs: The input to the corresponding BatchNormLayer so you can recompute mean and std, and the input you want to transform.

Yes, that's exactly what I did yesterday and it worked! I will finish some extra adjustments and push it later today. Thanks for the feedback! It helped a lot.


to_stats_l = clean_net[enc_bname]
to_norm_l = dirty_net[comb_name]
dirty_net[bname] = SharedNormLayer(to_stats_l, to_norm_l)
Member commented on the diff above

Now this removes the mean and divides by the standard deviation that was also used in the encoding step -- does this make sense? Shouldn't the decoder be doing the reverse? You also have standard BatchNormLayers in the decoder, maybe you don't need the SharedNormLayers at all? (Disclaimer: I haven't looked at your code or the paper in detail, I'm just wondering. I'm happy to learn why it is implemented the way it is.)

@AdrianLsk (Author) commented Jun 27, 2016

Yes, it does. The output of the denoising layer should be comparable with the output of the corresponding encoder layer in order to calculate the reconstruction cost. If you skim through the algorithm on page 5 in http://arxiv.org/pdf/1507.02672v2.pdf, you will find that the decoder first calculates the affine transformation with batch normalization (i.e. my dense layer and batch norm layer without learnable beta and gamma), then feeds the output to the denoising function, and subsequently normalizes it with the stats from the clean encoder. I need those standard batch norm layers to learn the beta and gamma parameters in the dirty encoder and share them afterwards with the clean encoder.
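Roughly, one decoder step as described above might look like this (hypothetical names; CombinatorLayer stands in for the denoising function, and SharedNormLayer is the layer from the diff above):

from lasagne.layers import DenseLayer, BatchNormLayer

# affine transformation plus batch normalization without learnable beta/gamma
l_dec = DenseLayer(l_prev, num_units=500, b=None, nonlinearity=None)
l_dec = BatchNormLayer(l_dec, beta=None, gamma=None)
# denoising function combining the decoder and dirty-encoder paths
l_comb = CombinatorLayer([l_dec, l_dirty_enc])
# normalize the denoised output with the clean encoder's batch statistics
l_rec = SharedNormLayer(clean_net[enc_bname], l_comb)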

@benanne (Member) commented Jun 29, 2016

Just wanted to say this looks great, and thanks for contributing! :)

@f0k (Member) commented Aug 31, 2016

Sorry for the delay, github doesn't notify about changes, only about comments. Is this ready to merge from your side, @AdrianLsk?

@AdrianLsk (Author)

Hi @f0k, not yet. Although this version is working, I still need to push my latest changes. I refactored the code and fixed some pooling-layer inconsistencies with the original ladder nets code. I will do it this weekend and let you know when it's ready.

@AdrianLsk (Author)

@f0k I think it's ready to merge.
