L2 Regularization for all layers #131
base: master
Conversation
It is better practice to apply weight decay to the weights of every layer, not just the last one. This commit changes the current L2 regularization so that it covers the hidden (convolution) layers as well as the output layer, and it drops regularization of the output bias as unnecessary overhead. Seeking review and testing.
text_cnn.py (outdated diff)

self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
self.predictions = tf.argmax(self.scores, 1, name="predictions")

# Calculate L2 Regularization
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if "b" not in v.name])
This will also add the embedding weights into l2_loss. Is that intended?
It isn't! Thanks for pointing this out.
The following change should fix it, I think:
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()[1:] if "W" in v.name])
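(For anyone who wants to sanity-check that filter: here is a small standalone sketch, not code from this repo, that creates toy variables mimicking the layout and creation order of text_cnn.py, an embedding matrix first, then a conv layer, then the output layer, and prints which of them the proposed selection keeps. The scope and variable names are stand-ins for illustration only.)

import tensorflow as tf

# Toy variables only, mimicking the creation order assumed above:
# embedding matrix first, then conv weights/biases, then the output layer.
with tf.name_scope("embedding"):
    emb_W = tf.Variable(tf.random_uniform([100, 128], -1.0, 1.0), name="W")
with tf.name_scope("conv-maxpool-3"):
    conv_W = tf.Variable(tf.truncated_normal([3, 128, 1, 64], stddev=0.1), name="W")
    conv_b = tf.Variable(tf.constant(0.1, shape=[64]), name="b")
with tf.name_scope("output"):
    out_W = tf.Variable(tf.truncated_normal([64, 2], stddev=0.1), name="W")
    out_b = tf.Variable(tf.constant(0.1, shape=[2]), name="b")

# Proposed selection: skip the first trainable variable (the embedding)
# and keep only weight matrices, i.e. variables whose names contain "W".
penalized = [v for v in tf.trainable_variables()[1:] if "W" in v.name]
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in penalized])

for v in penalized:
    print(v.name)  # expected: conv-maxpool-3/W:0 and output/W:0, not embedding/W:0

Note that the [1:] slice relies on the embedding variable being created first; explicitly filtering out names under the embedding scope would be a bit more robust if the graph construction order ever changes.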
Thanks. This looks good in principle. Have you tested it?
@dennybritz: I have only tested it on a small subset of the data. The program itself runs smoothly, but I have not been able to do any meaningful performance/accuracy evaluation against your implementation because of resource constraints. I can cite a few sources to justify that this method works better in theory: this StackOverflow answer confirms that the implementation is correct and fairly standard, and similar usage can be found elsewhere. Discussion of (not) regularizing the bias can be found here and in this chapter of the Deep Learning book (p. 226).
Weight decay with Adam is not a good idea. You'll end up adding another hyperparameter with little performance gain.
@chiragnagpal yes, weight decay is largely ineffective when used with Adam. I'm only suggesting a modification of what is currently implemented. We could possibly rethink the entire implementation as proposed in Fixing Weight Decay Regularization in Adam (Loshchilov, Hutter; 2018).
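For reference, the idea in that paper (decoupled weight decay, often called AdamW) is to take the Adam step on the plain, unregularized gradient and then shrink the weights directly by lr * weight_decay * w, instead of adding an L2 term to the loss where Adam's per-parameter adaptive scaling distorts it. A rough single-step NumPy sketch of the update, illustrative only and not tied to this repo's TensorFlow code:

import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One decoupled-weight-decay (AdamW-style) update; grad excludes any L2 term."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)  # bias correction for the second moment
    # Adam step on the unregularized gradient, plus a direct weight shrinkage
    # that stays decoupled from the adaptive denominator.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

With the L2-in-the-loss formulation discussed above, the decay term would instead flow through the m and v moment estimates and get rescaled per parameter, which is the coupling the paper argues weakens weight decay under Adam.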
That paper is too new with hardly any citations.
Sure. It's only something to think about.