
tests: revert change of torch_require_multi_gpu to be device agnostic #35721


Merged: 3 commits merged into huggingface:main on Feb 25, 2025

Conversation

@dvrogozh (Contributor)

Commit 11c27dd modified torch_require_multi_gpu() to be device agnostic instead of CUDA specific. This broke some tests that are rightfully CUDA specific, such as:

  • tests/trainer/test_trainer_distributed.py::TestTrainerDistributed

In the current Transformers test architecture, require_torch_multi_accelerator() should be used to mark multi-GPU tests that are device agnostic (see the illustrative sketch below).
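For context, here is a minimal sketch of the intended difference between the two markers, assuming only what is described above; it is illustrative and not the actual transformers.testing_utils implementation:

```python
# Illustrative sketch only: the real decorators live in transformers.testing_utils
# and are implemented differently; this just captures the semantics discussed here.
import unittest

import torch


def require_torch_multi_gpu(test_case):
    # CUDA specific: skip unless more than one CUDA GPU is visible.
    return unittest.skipUnless(torch.cuda.device_count() > 1, "test requires multiple CUDA GPUs")(test_case)


def require_torch_multi_accelerator(test_case):
    # Device agnostic: skip unless more than one accelerator of any supported backend is visible.
    if torch.cuda.is_available():
        count = torch.cuda.device_count()
    elif hasattr(torch, "xpu") and torch.xpu.is_available():  # XPU path assumed for this sketch
        count = torch.xpu.device_count()
    else:
        count = 0
    return unittest.skipUnless(count > 1, "test requires multiple accelerators")(test_case)
```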

This change addresses the issue introduced by 11c27dd and reverts modification of torch_require_multi_gpu().

Fixes: 11c27dd ("Enable BNB multi-backend support (#31098)")

CC: @jiqing-feng @ydshieh

@dvrogozh (Contributor, Author)

@ydshieh: we've discussed the issue around the recent modification of torch_require_multi_gpu() in #35269 (comment). Today I found a case that broke after this modification: a rightfully CUDA-specific test is now being triggered in non-CUDA multi-GPU scenarios. Here is the test:

class TestTrainerDistributed(TestCasePlus):
    @require_torch_multi_gpu
    def test_trainer(self):
        distributed_args = f"""--nproc_per_node={torch.cuda.device_count()}
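The launcher above sizes the run with torch.cuda.device_count(), which is exactly why this test is rightfully CUDA specific. For contrast, a device-agnostic variant would count devices through a backend-neutral helper, roughly as sketched below; this assumes backend_device_count and torch_device from transformers.testing_utils, and the class name is hypothetical:

```python
# Sketch of a device-agnostic counterpart; not code from this PR.
from transformers.testing_utils import (
    TestCasePlus,
    backend_device_count,
    require_torch_multi_accelerator,
    torch_device,
)


class TestTrainerDistributedAgnostic(TestCasePlus):  # hypothetical name
    @require_torch_multi_accelerator
    def test_trainer(self):
        # Counts devices for whichever backend the test run targets (CUDA, XPU, ...).
        n_proc = backend_device_count(torch_device)
        distributed_args = f"--nproc_per_node={n_proc}"
        # Building the rest of the launch command and running it is omitted in this sketch.
```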

I suggest reverting the modification to torch_require_multi_gpu() introduced in #31098. The downside of this step is that device-agnostic tests which were marked @torch_require_multi_gpu after the merge of #31098 will start to be skipped. Unfortunately, I do not know the list of such tests, so I cannot properly mark them with @require_torch_multi_accelerator.

@jiqing-feng: I believe you made the change in #31098 for some specific tests. Can you share this list with me?

@jiqing-feng (Contributor)

Hi @dvrogozh, thanks for your fix. To answer your question, I only use small models like gpt2, opt-125m, and llama-68m in the tests.

cc @Titus-von-Koeller

@dvrogozh (Contributor, Author)

> Hi @dvrogozh, thanks for your fix. To answer your question, I only use small models like gpt2, opt-125m, and llama-68m in the tests.

@jiqing-feng: did you run any of the tests from the Transformers tests/ suite in a multi-XPU scenario (using ipex)?

@jiqing-feng (Contributor)

> @jiqing-feng: did you run any of the tests from the Transformers tests/ suite in a multi-XPU scenario (using ipex)?

Yes, but I only ran the BNB tests that you referred to in this PR: #31098. For other XPU tests, please ask @faaany.

@dvrogozh (Contributor, Author) commented Jan 16, 2025

Thank you @jiqing-feng. It seems I can assume that the tests in the following files are designed for multiple GPUs regardless of GPU type, and they need the change s/torch_require_multi_gpu/require_torch_multi_accelerator/:

  • tests/quantization/bnb/test_4bit.py
  • tests/quantization/bnb/test_mixed_int8.py

I have modified them accordingly.
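Concretely, the change in those two files is just the decorator swap, roughly of the following shape; the class and test names here are hypothetical, not the exact bnb test code:

```python
# Hypothetical illustration of the swap applied in the bnb quantization tests.
from transformers.testing_utils import require_torch_multi_accelerator  # was: require_torch_multi_gpu


class Bnb4BitMultiDeviceTest:  # hypothetical test class
    @require_torch_multi_accelerator  # was: @require_torch_multi_gpu
    def test_multi_device_loading(self):
        ...  # test body unchanged
```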

@ydshieh (Collaborator) left a review comment


Yes, that is what I think is better, thanks a lot.

Could you remove the get_available_devices added in that previous PR? It is not used anymore.

@dvrogozh (Contributor, Author)

> Could you remove the get_available_devices added in that previous PR? It is not used anymore.

It is still used here in the code to check the availability of the bnb multi-device backend:

available_devices = get_available_devices()
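For reference, a gate built around get_available_devices() can look roughly like the sketch below; only that helper comes from the linked code, while the import path, function name, and supported backend set are assumptions:

```python
# Hypothetical sketch; only get_available_devices() is taken from the code quoted above
# (assumed importable from transformers.utils).
from transformers.utils import get_available_devices


def bnb_multi_backend_is_usable() -> bool:
    # e.g. frozenset({"cpu", "cuda"}) or frozenset({"cpu", "xpu"}) depending on the machine
    available_devices = get_available_devices()
    supported = {"cuda", "xpu", "cpu"}  # assumed backend set for this sketch
    return bool(available_devices & supported)
```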

@ydshieh (Collaborator) commented Jan 16, 2025

Ahhh, sorry, I meant get_device_count, copy-paste error 😢

@dvrogozh (Contributor, Author)

> I meant get_device_count

Indeed, it is not used. Thank you for spotting it. Removed.

@ydshieh (Collaborator) commented Jan 16, 2025

@Titus-von-Koeller

It would be nice if you could take another look.

@dvrogozh (Contributor, Author)

@Titus-von-Koeller, @ydshieh: will you have time to take another look and merge if there are no concerns?

@ydshieh (Collaborator) commented Feb 12, 2025

I have pinged @Titus-von-Koeller internally; let's see. I will merge at the end of this Friday in any case.

@Titus-von-Koeller (Contributor)

@dvrogozh Sorry for the delay. Due to holidays, a crunch on high-impact work, and sick leave, this slipped off the radar. We introduced this change to enable the multi-backend refactor for bitsandbytes, so we only need the decorator in its changed form for tests that require_bnb, foremost those in the quantization test folder.

I need to spin up the code on the respective VMs and see if I can cobble together a solution that fits both of our needs. I'm somewhat surprised that this popped up so late; the change happened last summer.

Of course we'll help remediate asap. cc @matthewdouglas (for visibility, I'll take the lead on this one)

@Titus-von-Koeller (Contributor)

Hey @dvrogozh, I'm in the process of testing your branch on the various backends (CUDA, AMD, Intel CPU + XPU). It's a bit of a pain and time-drain, but I'm on it this week. Right now I'm still untangling various failures. I'll keep you posted.

cc @ydshieh

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dvrogozh (Contributor, Author)

@Titus-von-Koeller no worries, take your time.

@ydshieh (Collaborator) commented Feb 19, 2025

@Titus-von-Koeller

Thank you for taking another look at this PR.

If the failures you encountered with this PR are already failing on main, we could fix them in a separate PR. Only those tests passing on main but failing on this PR should be fixed within this PR itself.

@dvrogozh (Contributor, Author)

There were no CI test failures on the initial submission, and I don't think the change added by @Titus-von-Koeller introduced failures either. This PR was done on an older code base and there is a merge conflict. I will rebase and we'll see how CI goes.

dvrogozh and others added 2 commits February 19, 2025 20:20
The 11c27dd modified `torch_require_multi_gpu()` to be device agnostic
instead of being CUDA specific. This broke some tests which are rightfully
CUDA specific, such as:

* `tests/trainer/test_trainer_distributed.py::TestTrainerDistributed`

In the current Transformers tests architecture `require_torch_multi_accelerator()`
should be used to mark multi-GPU tests agnostic to device.

This change addresses the issue introduced by 11c27dd and reverts
modification of `torch_require_multi_gpu()`.

Fixes: 11c27dd ("Enable BNB multi-backend support (huggingface#31098)")
Signed-off-by: Dmitry Rogozhkin <[email protected]>
@dvrogozh (Contributor, Author)

@Titus-von-Koeller: FYI, I rebased on top of the latest main and resolved a trivial conflict with other changes.

@Titus-von-Koeller (Contributor) left a review comment


Ok, so the failures were due to issues in our testing environments for the different backends. All this multi-backend work is causing an extreme amount of extra effort. We'll have to improve our overall setup to make it more straightforward to test things and give instant feedback through CI. We're working on that.

Thanks @dvrogozh for your patience with us and for the initiative in providing a fully functioning fix for this ❤️ 🤗 Excellent work, much appreciated!

We can merge as is, green light also from the bitsandbytes team!

@ydshieh merged commit b4b9da6 into huggingface:main on Feb 25, 2025 (19 of 21 checks passed).