tests: revert change of torch_require_multi_gpu to be device agnostic #35721
Conversation
@ydshieh: we've discussed the issue around the recent modification of `transformers/tests/trainer/test_trainer_distributed.py` (lines 148 to 151 at 99e0ab6). I suggest reverting that modification.
@jiqing-feng: I believe you made the change in #31098 for some specific tests. Can you share this list with me?
@jiqing-feng: did you run any of the tests from Transformers?
Yes, but I only ran the BNB test which you referred to in this PR (#31098). For other XPU tests, please ask @faaany.
Thank you @jiqing-feng. It seems I can assume that the tests in the following files are designed for multi-GPU agnostic to GPU type and need a change.
I have modified them accordingly.
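For illustration, here is a minimal sketch of the kind of change being discussed: a multi-GPU test that should be agnostic to GPU type switches from the CUDA-specific decorator to `require_torch_multi_accelerator`. The test class and body are hypothetical; the decorator name is the one referenced in this thread and is assumed to come from `transformers.testing_utils`.

```python
# Hypothetical test, for illustration only; the decorator is assumed to be the
# transformers.testing_utils helper referenced in this thread.
import unittest

from transformers.testing_utils import require_torch_multi_accelerator


class ExampleMultiDeviceTest(unittest.TestCase):
    # Previously such a test might have been marked with the (CUDA-specific)
    # multi-GPU decorator; device-agnostic multi-GPU tests use this one instead.
    @require_torch_multi_accelerator
    def test_runs_on_two_accelerators(self):
        # Placeholder body; a real test would launch distributed work here.
        self.assertTrue(True)
```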
Yes, that is what I think is better, thanks a lot.
Could you remove the `get_available_devices` added in that previous PR? It is not used anymore.
It's still in use here in the code to check the availability of the bnb multi-device backend:
ahhh, sorry, I meant …
Indeed, not used. Thank you for spotting that. Removed.
Would be nice if you could take another look.
@Titus-von-Koeller, @ydshieh: will you have time to take another look and merge if there are no concerns?
I pinged @Titus-von-Koeller internally; let's see.
@dvrogozh Sorry for the delay. Due to holidays, a crunch on high-impact work, and sick leave, this slipped off the radar. We introduced this change to enable the multi-backend refactor for bitsandbytes, so we only need the decorator the way it was changed for tests that require bnb, foremost those in the quantization test folder. I need to spin up the code on the respective VMs and see if I can cobble together a solution that fits both of our needs. Somehow I'm a bit surprised that this popped up so late; the change happened last summer. Of course we'll help remediate asap. cc @matthewdouglas (for visibility; I'll take the lead on this one)
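As a hedged sketch of what that could look like for the bitsandbytes tests mentioned here, a quantization test that must run on multiple accelerators regardless of GPU type could stack a bnb requirement with the device-agnostic decorator. The class, test body, and exact decorator names are assumptions based on this thread, not the actual quantization tests.

```python
# Hypothetical sketch only: stacking a bitsandbytes requirement with the
# device-agnostic multi-accelerator decorator; names assumed from testing_utils.
import unittest

from transformers.testing_utils import (
    require_bitsandbytes,
    require_torch_multi_accelerator,
)


class ExampleBnbMultiBackendTest(unittest.TestCase):
    @require_bitsandbytes
    @require_torch_multi_accelerator
    def test_quantized_model_across_accelerators(self):
        # Placeholder body; a real test would shard a quantized model here.
        self.assertTrue(True)
```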
@Titus-von-Koeller no worries, take your time.
Thank you for taking another look at this PR. If the failures you encountered with this PR are already failing on …
There were no CI test failures on the initial submission, and I don't think the change added by @Titus-von-Koeller introduced failures either. This PR is based on an older code base and there is a merge conflict. I will rebase and let's see how CI goes.
Commit 11c27dd modified `torch_require_multi_gpu()` to be device agnostic instead of CUDA specific. This broke some tests which are rightfully CUDA specific, such as:

* `tests/trainer/test_trainer_distributed.py::TestTrainerDistributed`

In the current Transformers tests architecture, `require_torch_multi_accelerator()` should be used to mark multi-GPU tests that are agnostic to device. This change addresses the issue introduced by 11c27dd and reverts the modification of `torch_require_multi_gpu()`.

Fixes: 11c27dd ("Enable BNB multi-backend support (huggingface#31098)")

Signed-off-by: Dmitry Rogozhkin <[email protected]>
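For context, the CUDA-specific behaviour this commit restores boils down to skipping a test unless more than one CUDA GPU is visible. The sketch below is a rough illustration under that assumption, not the exact `transformers.testing_utils` implementation.

```python
# Rough, illustrative sketch of a CUDA-specific multi-GPU requirement; the real
# transformers.testing_utils implementation differs in its details.
import unittest

import torch


def require_torch_multi_gpu_sketch(test_case):
    """Skip the decorated test unless at least two CUDA GPUs are available."""
    if torch.cuda.device_count() < 2:
        return unittest.skip("test requires multiple CUDA GPUs")(test_case)
    return test_case
```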
@Titus-von-Koeller: FYI, I rebased on top of the latest main and resolved a trivial conflict with other changes.
Ok, so the failures were due to issues in our testing environments for the different backends. All this multi-backend stuff is causing quite an extreme amount of extra work. We'll have to improve our overall setup to make it more straightforward to test things and give instant feedback through CI. We're working on that.
Thanks @dvrogozh for your patience with us and the initiative in providing a fully functioning fix for this ❤️ 🤗 Excellent work and very appreciated!
We can merge as is, green light also from the bitsandbytes team!
Commit 11c27dd modified `torch_require_multi_gpu()` to be device agnostic instead of CUDA specific. This broke some tests which are rightfully CUDA specific, such as:

* `tests/trainer/test_trainer_distributed.py::TestTrainerDistributed`

In the current Transformers tests architecture, `require_torch_multi_accelerator()` should be used to mark multi-GPU tests that are agnostic to device. This change addresses the issue introduced by 11c27dd and reverts the modification of `torch_require_multi_gpu()`.

Fixes: 11c27dd ("Enable BNB multi-backend support (#31098)")

CC: @jiqing-feng @ydshieh