Replies: 1 comment 1 reply
These are exposed via |
In addition to tuning the learning rate for the optimizer, it can sometimes be helpful to adjust other parameters, such as the weight decay, to improve generalization, reduce overfitting, or to allow a more aggressive learning rate that speeds up training. Both the torch and bitsandbytes AdamW optimizers support custom beta1, beta2, and weight_decay parameters.
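For reference, here is a minimal sketch of what passing these hyperparameters looks like when constructing the optimizers directly; the values shown are simply the defaults, not a recommendation, and the bitsandbytes variant is commented out since it requires the bitsandbytes package and a CUDA-capable setup.

```python
import torch

# A tiny model purely for illustration.
model = torch.nn.Linear(128, 64)

# torch's AdamW: beta1/beta2 and weight_decay are ordinary keyword arguments.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.999),   # (beta1, beta2)
    weight_decay=0.01,    # torch's default
)

# bitsandbytes' 8-bit AdamW exposes the same hyperparameters:
# import bitsandbytes as bnb
# optimizer = bnb.optim.AdamW8bit(
#     model.parameters(),
#     lr=2e-4,
#     betas=(0.9, 0.999),
#     weight_decay=0.01,
# )
```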
The short description of weight decay is that during training, the larger the magnitude of a weight, the more strongly it is penalized.
For example, in the LoRA paper (https://arxiv.org/abs/2106.09685), weight decay is varied between 0.01 (the optimizer default) and 0.1.
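To make that penalty concrete, AdamW uses decoupled weight decay (https://arxiv.org/abs/1711.05101): each update subtracts a term proportional to the weight itself, so larger weights shrink more per step:

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right)
$$

where $\lambda$ is the weight_decay parameter, $\eta$ is the learning rate, and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected first- and second-moment estimates whose averaging is controlled by beta1 and beta2.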
While the default parameters should perform adequately for most situations, there may be datasets that benefit from tuning these other optimizer parameters.
The trade-off of exposing more configuration parameters is additional complexity for the end user. One possible option would be to process them at their default values unless specified otherwise (a rough sketch of this is below). Including these parameters would be simple to implement, but at the same time I admit that there may not be many users interested in having access to them, so I wanted to open this topic up as a discussion.
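As a minimal sketch of the "defaults unless specified otherwise" option: the OptimizerConfig dataclass and the field names adamw_beta1, adamw_beta2, and weight_decay below are hypothetical, not existing config keys, and only illustrate that configs which never mention these fields would keep their current behaviour.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class OptimizerConfig:
    # None means "defer to the optimizer's own default", so configs that
    # never set these fields behave exactly as they do today.
    lr: float = 2e-4
    adamw_beta1: Optional[float] = None
    adamw_beta2: Optional[float] = None
    weight_decay: Optional[float] = None


def build_adamw(params, cfg: OptimizerConfig) -> torch.optim.AdamW:
    kwargs = {"lr": cfg.lr}
    if cfg.adamw_beta1 is not None or cfg.adamw_beta2 is not None:
        beta1 = cfg.adamw_beta1 if cfg.adamw_beta1 is not None else 0.9
        beta2 = cfg.adamw_beta2 if cfg.adamw_beta2 is not None else 0.999
        kwargs["betas"] = (beta1, beta2)
    if cfg.weight_decay is not None:
        kwargs["weight_decay"] = cfg.weight_decay
    return torch.optim.AdamW(params, **kwargs)


# Only users who care about these knobs ever have to set them:
model = torch.nn.Linear(128, 64)
opt_default = build_adamw(model.parameters(), OptimizerConfig())
opt_tuned = build_adamw(model.parameters(), OptimizerConfig(weight_decay=0.1))
```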