
Hugging face trainer callback integration #33

Merged: 19 commits, merged into master on Feb 16, 2024

Conversation

parthraut (Collaborator):

Pull request for issue #24: a draft of the HuggingFace Global Power Limit Optimizer. It imports correctly locally, but needs further testing on a machine with GPUs.

parthraut requested a review from jaywonchung on February 9, 2024, 15:13.
jaywonchung linked an issue on Feb 9, 2024 that may be closed by this pull request.
jaywonchung changed the title from "24 hugging face trainer callback integration" to "Hugging face trainer callback integration" on Feb 9, 2024.
jaywonchung (Member) left a comment:

Thanks for your work @parthraut! I did a quick review.

@@ -15,3 +15,4 @@
"""A collection of optimizers for various knobs."""

from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
from zeus.optimizer.power_limit import HFGPLO
jaywonchung (Member):

Please run bash scripts/lint.sh to automatically format code and fix any lint errors. That's why CI is failing.

Comment on lines 539 to 540:

    def __init__(self, *args, **kwargs) -> None:
        self.plo = cls(*args, **kwargs)
jaywonchung (Member):

Constructor signature binding missing

jaywonchung (Member):

Also, self.plo is not a good name, since we want the code here to be generic over optimizers. Maybe use self.optimizer.

    model: PreTrainedModel,
    **kwargs,
) -> None:
    self.plo.on_evaluate()  # what to set metric to?
jaywonchung (Member):

Another TODO was to check whether the TrainerState or something else holds the current epoch's validation metric.

return Wrapper


HFGPLO = make_hf(GlobalPowerLimitOptimizer)
jaywonchung (Member):

Let's fix the naming policy to "HF" + cls.__name__. The name you set on line 612 should be the same as the name you assign to on this line.

zeus/run/dataloader.py: outdated comment (resolved)
parthraut (Collaborator, author):

I simplified the class to include only the methods necessary for HFGlobalPowerLimitOptimizer.

Testing of HFGlobalPowerLimitOptimizer is in test.py (I can remove it before merging). It tests four things:

  1. The constructor signatures of GPLO and HFGPLO are exactly the same.
  2. HFGPLO inherits from TrainerCallback.
  3. HFGPLO can be used as a HuggingFace TrainerCallback (single-GPU test).
  4. HFGPLO can be used as a HuggingFace TrainerCallback (multi-GPU test).

The output of these tests seems to show that the GlobalPowerLimitOptimizer underneath is being called correctly. A snapshot is provided below:
[screenshot of test output]

Please let me know if there is anything else I should test or add to the code. Thanks!

jaywonchung (Member) left a comment:

Thanks for your work! Apart from the specific comments, we need to document this change. Please take a look at the overall documentation at https://ml.energy and the doc source under docs/, and mention this in the appropriate places. We want first-time users who have no idea how Zeus is organized to know that this exists.

Comment on test.py (outdated):
jaywonchung (Member):

You should not put random files in the root of the repository. We use pytest for testing, and tests are kept under the tests directory. Also, tests should run without a GPU to enable local machine development. I should have clarified these early on; I'll note these testing policies in our CONTRIBUTING.md.

Requested changes:

  1. Separate out the constructor signature tests and incorporate them into tests/optimizer/test_power_limit_optimizer.py (a sketch of such a test follows this list).
  2. Refactor the remaining part of test.py into examples/huggingface: one example training script that supports both single- and multi-GPU training, plus a good README (probably copy and modify examples/imagenet/README.md).
  3. Please order imports according to PEP 8: https://peps.python.org/pep-0008/#imports

parthraut (Collaborator, author):


Should there also be a test that exercises HFGlobalPowerLimitOptimizer on single/multi-GPU training with HuggingFace Transformers, or just the signature equivalence test? I'll add the example training script, but should there be a test along with it that tests the functionality of HFGlobalPowerLimitOptimizer?

jaywonchung (Member), Feb 13, 2024:

If there's a nice way to test whether HF Trainer + GlobalPowerLimitOptimizer actually calls all the required optimizer callbacks at the right moment, that would be very nice. But if that's kinda tricky, I think we can just have the example training script and run it on our actual GPUs to check ourselves.

Comment on test.py (outdated):

    logging_steps=10,
)

trainer = Trainer(
jaywonchung (Member):

How does Trainer determine the number of GPUs to use? I'm asking because this and line 79 are identical, but this one is supposed to use one GPU whereas line 79 is supposed to use four. This should be taken care of in the example training script you will put under examples/huggingface.

@@ -15,3 +15,4 @@
"""A collection of optimizers for various knobs."""

from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer
from zeus.optimizer.power_limit import HFGlobalPowerLimitOptimizer
jaywonchung (Member):

Merge import with above.

    pl_step: int = 25,
    profile_path: str | Path | None = None,
) -> None:
    r"""[Wrapped for Hugging Face Trainer Callback] Initialize the optimizer.
jaywonchung (Member):

I think you can have the bracket just in the class docstring. Please remove it from the other methods.

zeus/optimizer/power_limit.py: comment (resolved)
parthraut (Collaborator, author):

Made all the requested changes. I refactored the example, borrowing it from an example script from Hugging Face. I also updated the docs, but I was not sure whether to also add it to docs/index.md.

jaywonchung (Member) left a comment:

Thanks for your work! I think the example, its README, and docs are nice and appropriate.

Most of the comments are nitpicks, which I am only making because this is your first PR and I would like us to align on coding standards. Please address them, and make them your standards for future PRs as well.

examples/huggingface/README.md: two outdated comments (resolved)
- [`HFGlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer): Online-profiles each power limit with `ZeusMonitor` and finds the cost-optimal power limit. Calls GlobalPowerLimitOptimizer under the hood.

## Integration with HuggingFace 🤗 Trainer
For easy use with [HuggingFace 🤗 Transformers](https://huggingface.co/docs/transformers/en/index), [`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer) is a drop-in compatible [HuggingFace 🤗 Trainer Callback](https://huggingface.co/docs/transformers/en/main_classes/callback). When initializing a [HuggingFace 🤗 Trainer](https://huggingface.co/docs/transformers/main_classes/trainer), initialize and pass in [`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer) as shown below:
jaywonchung (Member):

I don't think these Zeus links will work

parthraut (Collaborator, author):

Okay, should I remove them?

jaywonchung (Member):

I meant this

-[`HFGlobalPowerLimitOptimizer`](zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer)
+[`HFGlobalPowerLimitOptimizer`](https://ml.energy/zeus/reference/optimizer/power_limit/#zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer)

jaywonchung (Member):

The [zeus.optimizer.power_limit.HFGlobalPowerLimitOptimizer] trick only works inside docstrings; mkdocstrings automatically parses them and replaces such references with links into the documentation site.

examples/huggingface/run_clm.py: two outdated comments (resolved)
tests/optimizer/test_power_limit_optimizer.py: two outdated comments (resolved)
examples/huggingface/README.md: outdated comment (resolved)
examples/huggingface/requirements.txt: outdated comment (resolved)
parthraut (Collaborator, author):

Updated the code with all suggestions. For the README.md in examples/huggingface, I was unsure what to do for the links to HFGlobalPowerLimitOptimizer, so I just removed them.

jaywonchung (Member) left a comment:

LGTM! Thanks again for your great work!

jaywonchung merged commit 5c41ab4 into master on Feb 16, 2024. 2 checks passed.
jaywonchung deleted the 24-hugging-face-trainer-callback-integration branch on February 16, 2024, 02:42.