Hugging Face Trainer callback integration #33
Conversation
Thanks for your work @parthraut! I did a quick review.
zeus/optimizer/__init__.py
Outdated
```diff
@@ -15,3 +15,4 @@
 """A collection of optimizers for various knobs."""

 from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
+from zeus.optimizer.power_limit import HFGPLO
```
Please run `bash scripts/lint.sh` to automatically format code and fix any lint errors. That's why CI is failing.
zeus/optimizer/power_limit.py
Outdated
```python
def __init__(self, *args, **kwargs) -> None:
    self.plo = cls(*args, **kwargs)
```
Constructor signature binding is missing — the wrapper's `__init__` should expose the wrapped class's constructor signature.
Also, `self.plo` is not a good name, since we want the code here to be generic over optimizers. Maybe use `self.optimizer`.
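As a rough sketch of what these two points could look like together — assuming the factory is still the `make_hf` function from this diff, and that `functools.wraps` plus an explicit `__signature__` is an acceptable way to bind the constructor signature (not necessarily the final implementation):

```python
import functools
import inspect

from transformers import TrainerCallback


def make_hf(cls: type) -> type:
    """Wrap a Zeus power limit optimizer class as a Hugging Face TrainerCallback."""

    class Wrapper(TrainerCallback):
        @functools.wraps(cls.__init__)
        def __init__(self, *args, **kwargs) -> None:
            # Generic over optimizers, hence `self.optimizer` rather than `self.plo`.
            self.optimizer = cls(*args, **kwargs)

    # Bind the wrapped class's constructor signature so that help(), IDEs,
    # and signature-equivalence tests see the real parameters.
    Wrapper.__init__.__signature__ = inspect.signature(cls.__init__)
    return Wrapper
```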
zeus/optimizer/power_limit.py
Outdated
```python
    model: PreTrainedModel,
    **kwargs,
) -> None:
    self.plo.on_evaluate()  # what to set metric to?
```
Another TODO was to check whether `TrainerState` or something else holds the current epoch's validation metric.
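For what it's worth, the HF `Trainer` passes the evaluation results to `on_evaluate` as a `metrics` keyword argument (and `TrainerState.log_history` keeps past logged entries), so one hedged sketch of the delegation could look like the following — the metric key (`"eval_loss"` here) and the wrapped optimizer's `on_evaluate(metric)` signature are assumptions:

```python
from transformers import TrainerControl, TrainerState, TrainingArguments


class _SketchCallbackMixin:
    """Illustrative fragment of the wrapper; assumes `self.optimizer` holds the wrapped optimizer."""

    def on_evaluate(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ) -> None:
        # The Trainer forwards the evaluation metrics dict via kwargs["metrics"].
        metrics = kwargs.get("metrics", {})
        # "eval_loss" is only an example key; the right metric depends on the task.
        self.optimizer.on_evaluate(metrics.get("eval_loss"))
```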
zeus/optimizer/power_limit.py
Outdated
```python
    return Wrapper


HFGPLO = make_hf(GlobalPowerLimitOptimizer)
```
Let's fix the naming policy to `"HF" + cls.__name__`. The name you set in line 612 should be the same as the name you assign in this line.
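Concretely, that naming policy could be applied inside the factory along these lines (a sketch only; the constructor and callback delegation are elided and shown in the earlier sketch):

```python
from transformers import TrainerCallback

from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer


def make_hf(cls: type) -> type:
    class Wrapper(TrainerCallback):
        ...  # constructor and callback delegation elided

    # Naming policy: "HF" + the wrapped class's name, so the generated class
    # reports itself as, e.g., "HFGlobalPowerLimitOptimizer".
    Wrapper.__name__ = "HF" + cls.__name__
    Wrapper.__qualname__ = Wrapper.__name__
    return Wrapper


# The module-level binding matches the name generated inside make_hf.
HFGlobalPowerLimitOptimizer = make_hf(GlobalPowerLimitOptimizer)
```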
Thanks for your work! Apart from the specific comments, we need to document this change. Please take a look at the overall documentation at https://ml.energy and the doc source under `docs/`, and mention this in appropriate places. We want first-time users who have no idea how Zeus is organized to know that this exists.
test.py
Outdated
You should not put random files in the root of the repository. We are using `pytest` for testing, and tests are kept under the `tests` directory. Also, tests should run without a GPU to enable local-machine development. I should have clarified these early on; I'll note these testing policies in our CONTRIBUTING.md.
Requested changes:

- Separate out the constructor signature tests and incorporate them into `tests/optimizer/test_power_limit_optimizer.py` (a sketch of such a test follows this list).
- Refactor the remaining part of `test.py` into `examples/huggingface`: one example training script that supports both single- and multi-GPU training, and a good README (probably copy and modify `examples/imagenet/README.md`).
- Please order imports according to PEP 8: https://peps.python.org/pep-0008/#imports
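A minimal sketch of the kind of signature test meant in the first item, assuming the wrapper ends up importable from `zeus.optimizer.power_limit` and binds the wrapped constructor's signature; no GPU should be needed, assuming importing the module does not touch NVML:

```python
"""Sketch for tests/optimizer/test_power_limit_optimizer.py."""

import inspect

from zeus.optimizer.power_limit import (
    GlobalPowerLimitOptimizer,
    HFGlobalPowerLimitOptimizer,
)


def test_hf_wrapper_constructor_signature() -> None:
    """The HF wrapper's constructor should expose the original's signature."""
    original = inspect.signature(GlobalPowerLimitOptimizer.__init__)
    wrapped = inspect.signature(HFGlobalPowerLimitOptimizer.__init__)
    assert original == wrapped
```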
Should there also be a test that runs `HFGlobalPowerLimitOptimizer` on single/multi GPU with HuggingFace Transformers, or just the signature equivalence test? I'll add the example training script, but should there be a test along with it that exercises the functionality of `HFGlobalPowerLimitOptimizer`?
If there's a clean way to test whether the HF Trainer + global power limit optimizer actually calls all the required optimizer callbacks at the right moments, that would be very nice. But if that's kinda tricky, I think we can just have the example training script and run it on our actual GPUs to check ourselves.
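One hedged way to check the delegation without a GPU is to bypass the constructor and drive the callback methods directly with a mocked-out wrapped optimizer; the internal attribute name (`optimizer`) and the forwarded method names below are assumptions about the final implementation:

```python
from unittest import mock

from transformers import TrainerControl, TrainerState, TrainingArguments

from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer


def test_callbacks_delegate_to_wrapped_optimizer(tmp_path) -> None:
    # Skip __init__ so no real ZeusMonitor / GPU is needed, then inject a mock.
    callback = HFGlobalPowerLimitOptimizer.__new__(HFGlobalPowerLimitOptimizer)
    callback.optimizer = mock.MagicMock()

    args = TrainingArguments(output_dir=str(tmp_path))
    state, control = TrainerState(), TrainerControl()

    callback.on_epoch_begin(args, state, control)
    callback.on_step_begin(args, state, control)

    callback.optimizer.on_epoch_begin.assert_called_once()
    callback.optimizer.on_step_begin.assert_called_once()
```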
test.py
Outdated
```python
    logging_steps=10,
)

trainer = Trainer(
```
How does `Trainer` determine the number of GPUs to use? I'm asking because this and line 79 are identical, but this one is supposed to use one GPU whereas line 79 is supposed to use four. This should be taken care of in the example training script you will put under `examples/huggingface`.
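For context, and as a general CUDA/PyTorch convention rather than anything this PR defines: the `Trainer` uses however many GPUs are visible to the process, so the example script could pin the count by restricting visibility before CUDA is initialized, roughly like this:

```python
import os

# Expose only GPU 0 to this process ("0,1,2,3" would expose four).
# Set this before CUDA is initialized, i.e., before the first torch import.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch  # noqa: E402

print(torch.cuda.device_count())  # 1 if the variable took effect
```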
zeus/optimizer/__init__.py
Outdated
```diff
@@ -15,3 +15,4 @@
 """A collection of optimizers for various knobs."""

 from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
+from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer
```
Merge this import with the one above.
zeus/optimizer/power_limit.py
Outdated
```python
    pl_step: int = 25,
    profile_path: str | Path | None = None,
) -> None:
    r"""[Wrapped for Hugging Face Trainer Callback] Initialize the optimizer.
```
I think you can have the bracketed note just in the class docstring. Please remove it from the other methods.
Made all the requested changes. I refactored the example, borrowing it from an example script from HuggingFace. Also updated the docs, but I was not sure whether to also add it to docs/index.md.
Thanks for your work! I think the example, its README, and docs are nice and appropriate.
Most of the comments are nitpicks, which I am only making because this is your first PR and I would like us to align on coding standards. Please address them, and make those your standards for future PRs as well.
examples/huggingface/README.md
Outdated
```markdown
- [`HFGlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer): Online-profiles each power limit with `ZeusMonitor` and finds the cost-optimal power limit. Calls GlobalPowerLimitOptimizer under the hood.

## Integration with HuggingFace 🤗 Trainer

For easy use with [HuggingFace 🤗 Transformers](https://huggingface.co/docs/transformers/en/index), [`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer) is a drop-in compatible [HuggingFace 🤗 Trainer Callback](https://huggingface.co/docs/transformers/en/main_classes/callback). When initializing a [HuggingFace 🤗 Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), initialize and pass in [`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer) as shown below:
```
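The usage that README passage points to is roughly the following sketch; the `model` and `train_dataset` placeholders and the exact `ZeusMonitor` / `HFGlobalPowerLimitOptimizer` arguments are assumptions, not the example's verbatim code:

```python
from transformers import Trainer, TrainingArguments

from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer

monitor = ZeusMonitor()  # profiles energy/power on the visible GPUs
callback = HFGlobalPowerLimitOptimizer(monitor)

trainer = Trainer(
    model=model,                    # placeholder: your PreTrainedModel
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,    # placeholder: your Dataset
    callbacks=[callback],           # drop-in HuggingFace Trainer callback
)
trainer.train()
```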
I don't think these Zeus links will work.
okay, should I remove them?
I meant this:

```diff
-[`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer)
+[`HFGlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer)
```
The `[zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer]` trick only works inside docstrings; it is enabled by mkdocstrings automatically parsing them and replacing them with links into the documentation site.
Co-authored-by: Jae-Won Chung <[email protected]>
Updated the code with all suggestions. For the README.md in `examples/huggingface`, I was unsure what to do about the links to `HFGlobalPowerLimitOptimizer`, so I just removed them.
LGTM! Thanks again for your great work!
Pull request for issue #24: Draft for the HuggingFace Global Power Limit Optimizer. Imports correctly locally, but needs to be tested further on a machine with GPUs.