Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

L2 Regularization for all layers #131

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

L2 Regularization for all layers #131

wants to merge 4 commits into from

Conversation

aayux
Copy link

@aayux aayux commented Jan 4, 2018

It is better practice to apply weight decay methods in each layer. This commit proposes to change the current implementation of L2 Regularization from just the output layer to the hidden (convolution) layers and also omits regularization of the bias at the output as an unnecessary overhead.

Seeking review and testing.

aayux added 2 commits January 4, 2018 16:42
It is better practice to apply weight decay methods in each layer. This commit proposes to change the current implementation of L2 Regularization from just the output layer to the hidden (convolution) layers and also omits regularization of the bias at the output as an unnecessary overhead.

Seeking review and testing.
text_cnn.py Outdated
self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
self.predictions = tf.argmax(self.scores, 1, name="predictions")

# Calculate L2 Regularization
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if "b" not in v.name])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also add embedding weights into l2_loss, is it intended?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't! Thanks for pointing this out.

Following change should fix it I think:

l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()[1:] if "W" in v.name])

@dennybritz
Copy link
Owner

Thanks. This looks good in principle, have you tested it?

@aayux
Copy link
Author

aayux commented Mar 10, 2018

@dennybritz: I have only tested it on a small subset of the data. The program itself runs smoothly. However, I have not been able to do any meaningful performance/accuracy evaluations against your implementation because of a resource constraint. I can make a few citations to justify that this method works better in theory.

This StackOverflow answer verifies that the implementation is indeed correct and something of a standard, as similar usage can be found elsewhere.

Discussion on (not) regularizing bias can be found here and in this chapter of the Deep Learning book (p. 226).

@chiragnagpal
Copy link

Weight Decay with Adam is not a good idea. You'll end up adding another hyperparameter with little performance gain.

@aayux
Copy link
Author

aayux commented Mar 15, 2018

@chiragnagpal yes, weight decay is largely ineffective when used with Adam. I'm only suggesting a modification on what is currently implemented.

We could possibly rethink the entire implementation as proposed in Fixing Weight Decay Regularization in Adam (Loshchilov, Hutter; 2018).

@chiragnagpal
Copy link

That paper is too new with hardly any citations.

  1. It would be unwise to change from a standard implementation to a more esoteric one, especially given that we do not have any numbers on the performance improvement.

  2. Tensorflow itself may add weight decay to the optimizers, like pytorch and dynet do. Why bother with doing it explicitly just now, given that it might break again in the future.

@aayux
Copy link
Author

aayux commented Mar 15, 2018

Sure. It's only something to think about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants