
tests: revert change of torch_require_multi_gpu to be device agnostic #35721


Merged: 3 commits merged into huggingface:main on Feb 25, 2025

Conversation

@dvrogozh (Contributor)

Commit 11c27dd modified torch_require_multi_gpu() to be device agnostic instead of CUDA specific. This broke some tests that are rightfully CUDA specific, such as:

  • tests/trainer/test_trainer_distributed.py::TestTrainerDistributed

In the current Transformers test architecture, require_torch_multi_accelerator() should be used to mark multi-GPU tests that are device agnostic (see the illustrative sketch below).
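For context, here is a minimal sketch of the intended difference between the two markers, assuming only what is described above; it is illustrative and not the actual transformers.testing_utils implementation:

```python
# Illustrative sketch only: the real decorators live in transformers.testing_utils
# and are implemented differently; this just captures the semantics discussed here.
import unittest

import torch


def require_torch_multi_gpu(test_case):
    # CUDA specific: skip unless more than one CUDA GPU is visible.
    return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple CUDA GPUs")(test_case)


def require_torch_multi_accelerator(test_case):
    # Device agnostic: skip unless more than one accelerator of any supported backend is visible.
    if torch.cuda.is_available():
        count = torch.cuda.device_count()
    elif hasattr(torch, "xpu") and torch.xpu.is_available():  # XPU path assumed for this sketch
        count = torch.xpu.device_count()
    else:
        count = 0
    return unittest.skipUnless(count > 1, "test requires multiple accelerators")(test_case)
```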

This change addresses the issue introduced by 11c27dd and reverts modification of torch_require_multi_gpu().

Fixes: 11c27dd ("Enable BNB multi-backend support (#31098)")

CC: @jiqing-feng @ydshieh

@dvrogozh (Contributor, Author)

@ydshieh: we've discussed the issue around the recent modification of torch_require_multi_gpu() in #35269 (comment). Today I found a case that broke after this modification: a rightfully CUDA-specific test is now being triggered in non-CUDA multi-GPU scenarios. Here is the test:

class TestTrainerDistributed(TestCasePlus):
    @require_torch_multi_gpu
    def test_trainer(self):
        distributed_args = f"""--nproc_per_node={torch.cuda.device_count()}
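The launcher above sizes the run with torch.cuda.device_count(), which is exactly why this test is rightfully CUDA specific. For contrast, a device-agnostic variant would count devices through a backend-neutral helper, roughly as sketched below; this assumes backend_device_count and torch_device from transformers.testing_utils, and the class name is hypothetical:

```python
# Sketch of a device-agnostic counterpart; not code from this PR.
from transformers.testing_utils import (
    TestCasePlus,
    backend_device_count,
    require_torch_multi_accelerator,
    torch_device,
)


class TestTrainerDistributedAgnostic(TestCasePlus):  # hypothetical name
    @require_torch_multi_accelerator
    def test_trainer(self):
        # Counts devices for whichever backend the test run targets (CUDA, XPU, ...).
        n_proc = backend_device_count(torch_device)
        distributed_args = f"--nproc_per_node={n_proc}"
        # Building the rest of the launch command and running it is omitted in this sketch.
```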

I suggest reverting the modification to torch_require_multi_gpu() introduced in #31098. The downside of this step is that device-agnostic tests which were marked @torch_require_multi_gpu after the merge of #31098 will start to be skipped. Unfortunately, I do not know the list of such tests, so I cannot properly mark them with @require_torch_multi_accelerator.

@jiqing-feng: I believe you made the change in #31098 for some specific tests. Can you share this list with me?

@jiqing-feng (Contributor)

Hi @dvrogozh, thanks for your fix. To answer your question, I only use small models like gpt2, opt-125m, and llama-68m in the tests.

cc @Titus-von-Koeller

@dvrogozh (Contributor, Author)

> Hi @dvrogozh, thanks for your fix. To answer your question, I only use small models like gpt2, opt-125m, and llama-68m in the tests.

@jiqing-feng: did you run any of the tests from the Transformers tests/ suite in a multi-XPU scenario (using ipex)?

@jiqing-feng (Contributor)

> @jiqing-feng: did you run any of the tests from the Transformers tests/ suite in a multi-XPU scenario (using ipex)?

Yes, but I only ran the BNB tests that you referred to in this PR: #31098. For other XPU tests, please ask @faaany.

@dvrogozh (Contributor, Author) commented Jan 16, 2025

Thank you @jiqing-feng. It seems I can assume that the tests in the following files are designed for multiple GPUs regardless of GPU type, and they need the change s/torch_require_multi_gpu/require_torch_multi_accelerator/:

  • tests/quantization/bnb/test_4bit.py
  • tests/quantization/bnb/test_mixed_int8.py

I have modified them accordingly.
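Concretely, the change in those two files is just the decorator swap, roughly of the following shape; the class and test names here are hypothetical, not the exact bnb test code:

```python
# Hypothetical illustration of the swap applied in the bnb quantization tests.
from transformers.testing_utils import require_torch_multi_accelerator  # was: require_torch_multi_gpu


class Bnb4BitMultiDeviceTest:  # hypothetical test class
    @require_torch_multi_accelerator  # was: @require_torch_multi_gpu
    def test_multi_device_loading(self):
        ...  # test body unchanged
```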

@ydshieh (Collaborator) left a review comment


Yes, that is what I think is better, thanks a lot.

Could you remove the get_available_devices added in that previous PR? It is not used anymore.

@dvrogozh (Contributor, Author)

> Could you remove the get_available_devices added in that previous PR? It is not used anymore.

It is still used here in the code to check the availability of the bnb multi-device backend:

available_devices = get_available_devices()
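For reference, a gate built around get_available_devices() can look roughly like the sketch below; only that helper comes from the linked code, while the import path, function name, and supported backend set are assumptions:

```python
# Hypothetical sketch; only get_available_devices() is taken from the code quoted above
# (assumed importable from transformers.utils).
from transformers.utils import get_available_devices


def bnb_multi_backend_is_usable() -> bool:
    # e.g. frozenset({"cpu", "cuda"}) or frozenset({"cpu", "xpu"}) depending on the machine
    available_devices = get_available_devices()
    supported = {"cuda", "xpu", "cpu"}  # assumed backend set for this sketch
    return bool(available_devices & supported)
```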

@ydshieh (Collaborator) commented Jan 16, 2025

Ahhh, sorry, I meant get_device_count, copy-paste error 😢

@dvrogozh (Contributor, Author)

> I meant get_device_count

Indeed, it is not used. Thank you for spotting it. Removed.

@ydshieh (Collaborator) commented Jan 16, 2025

@Titus-von-Koeller

It would be nice if you could take another look.

@dvrogozh (Contributor, Author)

@Titus-von-Koeller, @ydshieh: will you have time to take another look and merge if there are no concerns?

@ydshieh (Collaborator) commented Feb 12, 2025

I have pinged @Titus-von-Koeller internally; let's see. I will merge at the end of this Friday in any case.

@Titus-von-Koeller (Contributor)

@dvrogozh Sorry for the delay. Due to holidays, a crunch on high-impact work, and sick leave, this slipped off the radar. We introduced this change to enable the multi-backend refactor for bitsandbytes, so we only need the decorator in its changed form for tests that require_bnb, foremost those in the quantization test folder.

I need to spin up the code on the respective VMs and see if I can cobble together a solution that fits both of our needs. I'm somewhat surprised that this popped up so late; the change happened last summer.

Of course we'll help remediate asap. cc @matthewdouglas (for visibility, I'll take the lead on this one)

@Titus-von-Koeller (Contributor)

Hey @dvrogozh, I'm in the process of testing your branch on the various backends (CUDA, AMD, Intel CPU + XPU). It's a bit of a pain and time-drain, but I'm on it this week. Right now I'm still untangling various failures. I'll keep you posted.

cc @ydshieh

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dvrogozh (Contributor, Author)

@Titus-von-Koeller no worries, take your time.

@ydshieh (Collaborator) commented Feb 19, 2025

@Titus-von-Koeller

Thank you for taking another look at this PR.

If the failures you encountered with this PR are already failing on main, we could fix them in a separate PR. Only those tests passing on main but failing on this PR should be fixed within this PR itself.

@dvrogozh (Contributor, Author)

There were no CI test failures on the initial submission, and I don't think the change added by @Titus-von-Koeller introduced failures either. This PR was done on an older code base and there is a merge conflict. I will rebase and we'll see how CI goes.

dvrogozh and others added 2 commits February 19, 2025 20:20
The 11c27dd modified `torch_require_multi_gpu()` to be device agnostic
instead of being CUDA specific. This broke some tests which are rightfully
CUDA specific, such as:

* `tests/trainer/test_trainer_distributed.py::TestTrainerDistributed`

In the current Transformers tests architecture `require_torch_multi_accelerator()`
should be used to mark multi-GPU tests agnostic to device.

This change addresses the issue introduced by 11c27dd and reverts
modification of `torch_require_multi_gpu()`.

Fixes: 11c27dd ("Enable BNB multi-backend support (huggingface#31098)")
Signed-off-by: Dmitry Rogozhkin <[email protected]>
@dvrogozh (Contributor, Author)

@Titus-von-Koeller: FYI, I rebased on top of the latest main and resolved a trivial conflict with other changes.

@Titus-von-Koeller (Contributor) left a review comment


Ok, so the failures were due to issues in our testing environments for the different backends. All this multi-backend work is causing an extreme amount of extra effort. We'll have to improve our overall setup to make it more straightforward to test things and give instant feedback through CI. We're working on that.

Thanks @dvrogozh for your patience with us and for the initiative in providing a fully functioning fix for this ❤️ 🤗 Excellent work, much appreciated!

We can merge as is, green light also from the bitsandbytes team!

@ydshieh merged commit b4b9da6 into huggingface:main on Feb 25, 2025 (19 of 21 checks passed).