Fix for multi gpu setup training with a single GPU. #974

Sehyo · 2024-08-30T19:36:07Z

check_nvidia() originally spawns a new process to run nvidia-smi. This bypasses any GPU limit set through an OS environment variable, because the limit is not reflected in the nvidia-smi output.

Added a check for whether the GPUs are limited by an OS environment variable. If multiple GPUs are still enabled, an error is raised, as in the original behaviour.

If only one GPU is enabled, only the output for that GPU is returned.
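The behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: used_memory_for_visible_gpu is a hypothetical helper, it takes the nvidia-smi text as an argument so it can run without a GPU, and it assumes CUDA_VISIBLE_DEVICES holds a plain integer index rather than a GPU UUID.

```python
import re
from typing import List, Mapping


def used_memory_for_visible_gpu(smi_csv: str, environ: Mapping[str, str]) -> List[float]:
    """Return used memory in GiB, restricted to the GPU named in CUDA_VISIBLE_DEVICES.

    Hypothetical helper. smi_csv is assumed to be the text printed by
    `nvidia-smi --query-gpu=memory.used --format=csv`.
    """
    # Pick up values such as "1024 MiB" and convert MiB to GiB.
    used = [int(value) / 1024 for value in re.findall(r"(\d+)\s+M", smi_csv)]

    visible = environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No limit set: return every reading and leave the decision to the caller.
        return used
    if "," in visible:
        # Several GPUs are still visible, so keep the original error behaviour.
        raise RuntimeError("Multiple GPUs are visible; multi GPU setups are not supported.")
    # Assumption: the value is a single integer index into the nvidia-smi rows.
    return [used[int(visible)]]
```

Passing os.environ as the second argument gives the real process environment; a plain dict works for experiments.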

Add fixed code to the trainer patcher.

Fixed variable misname in trainer patcher.

Sehyo

Ready to merge. I tested this on my local multi GPU setup, and training on a single GPU is now possible when the GPUs are limited through an OS environment variable.

RahulVadisetty91

There are several instances where potential failures (like in convert_to_fast_tokenizer, try_fix_tokenizer, and assert_same_tokenization) are not properly logged or handled. Instead of silently returning the slow tokenizer, there should be logs or warnings for failed conversions or mismatches.

    if not check_vocab or not check_special:
        logger.warning("Vocab or special tokens do not match between slow and fast tokenizer.")
        return slow_tokenizer

Sehyo · 2024-10-13T16:18:49Z

@RahulVadisetty91 Hello, your review does not seem relevant to my PR. Can you elaborate on the relevancy?

Datta0 · 2024-10-25T18:21:48Z

unsloth/tokenizer_utils.py

+    index_for_cuda = -1
+    if "CUDA_VISIBLE_DEVICES" in os.environ:
+        index_for_cuda = os.environ["CUDA_VISIBLE_DEVICES"]


Suggested change

-    index_for_cuda = -1
-    if "CUDA_VISIBLE_DEVICES" in os.environ:
-        index_for_cuda = os.environ["CUDA_VISIBLE_DEVICES"]
+    index_for_cuda = os.environ.get("CUDA_VISIBLE_DEVICES", -1)

hife-ai · 2024-10-31

What if CUDA_VISIBLE_DEVICES="0,1,2"?

Datta0 · 2024-10-31

The next few lines would take care of that I suppose.

hife-ai · 2024-10-31

Right, sorry
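For context on the "0,1,2" question: CUDA_VISIBLE_DEVICES is a comma-separated string, so os.environ.get returns it unchanged and any later code has to split it. A minimal sketch of that step, where parse_visible_devices is a hypothetical helper and not something defined in this PR:

```python
from typing import List, Mapping, Optional


def parse_visible_devices(environ: Mapping[str, str]) -> Optional[List[str]]:
    """Split CUDA_VISIBLE_DEVICES into its entries, or return None when it is unset.

    Hypothetical helper for illustration. Entries stay as strings because the
    variable may hold integer indices or GPU UUIDs.
    """
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Drop surrounding whitespace and skip empty entries such as a trailing comma.
    return [part.strip() for part in raw.split(",") if part.strip()]
```

Returning None for the unset case avoids mixing an integer default such as -1 with string values, which is one thing to watch for with os.environ.get("CUDA_VISIBLE_DEVICES", -1).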

hife-ai · 2024-10-31T14:13:56Z

Hi @Sehyo!
Thank you for your PR. I see that this is not the only place where this happens.
For example in llama.py and throughout this file.

Could you please fix those places as well?

udaygirish · 2024-11-09T11:31:43Z

@Sehyo @hife-ai While development is ongoing, wouldn't it be better for someone to document the use of os.environ in the README? This is a problem for the notebooks as well. I also think we should use a device argument, which can be a list or an int, and use it everywhere rather than checking like this. If someone specifies more than one device, the code can exit right at the start, saying that multi GPU is not supported. What do you say?
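The device argument idea could look something like the sketch below. It is purely illustrative: normalize_devices and its signature are assumptions made for this example and do not exist in the library.

```python
from typing import List, Sequence, Union


def normalize_devices(device: Union[int, Sequence[int]]) -> List[int]:
    """Accept an int or a list of ints and fail early if more than one device is given.

    Hypothetical helper sketching the proposal in the comment above.
    """
    # Wrap a single index in a list so both input forms are handled the same way.
    devices = [device] if isinstance(device, int) else list(device)
    if len(devices) != 1:
        # Exit at the start instead of failing later during training.
        raise ValueError("Multi GPU is not supported; pass exactly one device.")
    return devices
```

Calling it once at startup, for example normalize_devices(0) or normalize_devices([1]), would keep the single GPU check in one place.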

giuliabaldini · 2024-11-13T15:02:01Z

@hife-ai @Datta0 FYI, I am working on another PR for this. From my understanding the only changes that should be needed are the ones to check_nvidia(), because that is also the function that is called on other parts of the code.

Datta0 · 2024-11-13T15:05:17Z

@giuliabaldini in that case can you please link that here?

giuliabaldini · 2024-11-13T15:32:04Z

@Datta0 I am still testing that it works, I will open the PR as soon as I am sure and link it here!

Sehyo · 2024-11-13T15:33:36Z

> @hife-ai @Datta0 FYI, I am working on another PR for this. From my understanding the only changes that should be needed are the ones to check_nvidia(), because that is also the function that is called on other parts of the code.

I am not sure why you would think that is the case. The trainer patcher clearly contains code that will break multi gpu functionality without my PR. This is the appropriate solution.

giuliabaldini · 2024-11-13T15:37:45Z

Hey @Sehyo, do you intend to finish the PR and implement the other changes requested by @hife-ai and @Datta0 ?

Sehyo · 2024-11-13T15:40:17Z

Sure, I could do it tomorrow if necessary.

giuliabaldini · 2024-11-13T15:43:37Z

Let me know if you can't and I will do it!

Sehyo added 3 commits August 31, 2024 03:36

Add fixed code to the trainer patcher.

de8216c

Add fixed code to the trainer patcher.

Update tokenizer_utils.py

72cd790

Fixed variable misname in trainer patcher.

This was referenced Aug 30, 2024

Fix check_nvidia to support running multiple single GPU training / inference at the same time #856

Open

Single GPU training in Multi-GPU system doesn't work. #975

Open

Unable to load unsloth models in just single GPU in a multi GPU system #983

Open

Sehyo closed this Sep 11, 2024

Sehyo reopened this Sep 11, 2024

RahulVadisetty91 reviewed Sep 17, 2024

Datta0 reviewed Oct 25, 2024

giuliabaldini mentioned this pull request Nov 15, 2024

Fix too sensitive "Unsloth currently does not support multi GPU setups" when training with a single GPU in a multi-GPU environment. #1295

Open
