
Hugging face trainer callback integration #33

Merged: 19 commits, merged into master on Feb 16, 2024

Conversation

parthraut (Collaborator):

Pull request for issue #24: a draft of the HuggingFace Global Power Limit Optimizer. It imports correctly locally, but needs further testing on a machine with GPUs.

parthraut requested a review from jaywonchung on February 9, 2024, 15:13.
jaywonchung linked an issue on Feb 9, 2024 that may be closed by this pull request.
jaywonchung changed the title from "24 hugging face trainer callback integration" to "Hugging face trainer callback integration" on Feb 9, 2024.
jaywonchung (Member) left a comment:

Thanks for your work @parthraut! I did a quick review.

@@ -15,3 +15,4 @@
"""A collection of optimizers for various knobs."""

from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
from zeus.optimizer.power_limit import HFGPLO
jaywonchung (Member):

Please run bash scripts/lint.sh to automatically format code and fix any lint errors. That's why CI is failing.

Comment on lines 539 to 540:

    def __init__(self, *args, **kwargs) -> None:
        self.plo = cls(*args, **kwargs)
jaywonchung (Member):

Constructor signature binding missing

jaywonchung (Member):

Also, self.plo is not a good name, since we want the code here to be generic over optimizers. Maybe use self.optimizer.

    model: PreTrainedModel,
    **kwargs,
) -> None:
    self.plo.on_evaluate()  # what to set metric to?
jaywonchung (Member):

Another TODO was to check whether the TrainerState or something else holds the current epoch's validation metric.

return Wrapper


HFGPLO = make_hf(GlobalPowerLimitOptimizer)
jaywonchung (Member):

Let's fix the naming policy to "HF" + cls.__name__. The name you set on line 612 should be the same as the name you assign to on this line.

zeus/run/dataloader.py: outdated comment (resolved)
parthraut (Collaborator, author):

I simplified the class to include only the methods necessary for HFGlobalPowerLimitOptimizer.

Testing of HFGlobalPowerLimitOptimizer is in test.py (I can remove it before merging). It tests four things:

  1. The constructor signatures of GPLO and HFGPLO are exactly the same.
  2. HFGPLO inherits from TrainerCallback.
  3. HFGPLO can be used as a HuggingFace TrainerCallback (single-GPU test).
  4. HFGPLO can be used as a HuggingFace TrainerCallback (multi-GPU test).

The output of these tests seems to show that the GlobalPowerLimitOptimizer underneath is being called correctly. A snapshot is provided below:
[screenshot of test output]

Please let me know if there is anything else I should test or add to the code. Thanks!

jaywonchung (Member) left a comment:

Thanks for your work! Apart from the specific comments, we need to document this change. Please take a look at the overall documentation at https://ml.energy and the doc source under docs/, and mention this in the appropriate places. We want first-time users who have no idea how Zeus is organized to know that this exists.

Comment on test.py (outdated):
jaywonchung (Member):

You should not put random files in the root of the repository. We use pytest for testing, and tests are kept under the tests directory. Also, tests should run without a GPU to enable local machine development. I should have clarified these early on; I'll note these testing policies in our CONTRIBUTING.md.

Requested changes:

  1. Separate out the constructor signature tests and incorporate them into tests/optimizer/test_power_limit_optimizer.py (a sketch of such a test follows this list).
  2. Refactor the remaining part of test.py into examples/huggingface: one example training script that supports both single- and multi-GPU training, plus a good README (probably copy and modify examples/imagenet/README.md).
  3. Please order imports according to PEP 8: https://peps.python.org/pep-0008/#imports

parthraut (Collaborator, author):


Should there also be a test that exercises HFGlobalPowerLimitOptimizer on single/multi-GPU training with HuggingFace Transformers, or just the signature equivalence test? I'll add the example training script, but should there be a test along with it that tests the functionality of HFGlobalPowerLimitOptimizer?

jaywonchung (Member), Feb 13, 2024:

If there's a nice way to test whether HF Trainer + GlobalPowerLimitOptimizer actually calls all the required optimizer callbacks at the right moment, that would be very nice. But if that's kinda tricky, I think we can just have the example training script and run it on our actual GPUs to check ourselves.

Comment on test.py (outdated):

    logging_steps=10,
)

trainer = Trainer(
jaywonchung (Member):

How does Trainer determine the number of GPUs to use? I'm asking because this and line 79 are identical, but this one is supposed to use one GPU whereas line 79 is supposed to use four. This should be taken care of in the example training script you will put under examples/huggingface.

@@ -15,3 +15,4 @@
"""A collection of optimizers for various knobs."""

from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer
jaywonchung (Member):

Merge import with above.

    pl_step: int = 25,
    profile_path: str | Path | None = None,
) -> None:
    r"""[Wrapped for Hugging Face Trainer Callback] Initialize the optimizer.
jaywonchung (Member):

I think you can have the bracket just in the class docstring. Please remove it from the other methods.

zeus/optimizer/power_limit.py: comment (resolved)
parthraut (Collaborator, author):

Made all the requested changes. I refactored the example, borrowing it from an example script from Hugging Face. I also updated the docs, but I was not sure whether to also add it to docs/index.md.

jaywonchung (Member) left a comment:

Thanks for your work! I think the example, its README, and docs are nice and appropriate.

Most of the comments are nitpicks, which I am only making because this is your first PR and I would like us to align on coding standards. Please address them, and make them your standards for future PRs as well.

examples/huggingface/README.md: two outdated comments (resolved)
- [`HFGlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer): Online-profiles each power limit with `ZeusMonitor` and finds the cost-optimal power limit. Calls GlobalPowerLimitOptimizer under the hood.

## Integration with HuggingFace 🤗 Trainer
For easy use with [HuggingFace 🤗 Transformers](https://huggingface.co/docs/transformers/en/index), [`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer) is a drop-in compatible [HuggingFace 🤗 Trainer Callback](https://huggingface.co/docs/transformers/en/main_classes/callback). When initializing a [HuggingFace 🤗 Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), initialize and pass in [`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer) as shown below:
jaywonchung (Member):

I don't think these Zeus links will work

parthraut (Collaborator, author):

Okay, should I remove them?

jaywonchung (Member):

I meant this

-[`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer)
+[`HFGlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer)

jaywonchung (Member):

The [zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer] trick only works inside docstrings; mkdocstrings automatically parses them and replaces such references with links into the documentation site.

examples/huggingface/run_clm.py: two outdated comments (resolved)
tests/optimizer/test_power_limit_optimizer.py: two outdated comments (resolved)
examples/huggingface/README.md: outdated comment (resolved)
examples/huggingface/requirements.txt: outdated comment (resolved)
parthraut (Collaborator, author):

Updated the code with all suggestions. For the README.md in examples/huggingface, I was unsure what to do for the links to HFGlobalPowerLimitOptimizer, so I just removed them.

jaywonchung (Member) left a comment:

LGTM! Thanks again for your great work!

jaywonchung merged commit 5c41ab4 into master on Feb 16, 2024. 2 checks passed.
jaywonchung deleted the 24-hugging-face-trainer-callback-integration branch on February 16, 2024, 02:42.