Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci][misc] fix more device count #6055

Closed
wants to merge 2 commits into from

Conversation

youkaichao
Copy link
Member

No description provided.

@DarkLight1337
Copy link
Member

DarkLight1337 commented Jul 2, 2024

Since you got some experience already, could you provide a list of which third-party imports may cause CUDA initialization so we can better avoid importing them too early? It would help a lot when adding new distributed tests.

@youkaichao
Copy link
Member Author

I'm writing a debug guide for this.

@DarkLight1337
Copy link
Member

Are you going to fix other occurrences of torch.cuda.device_count in this PR? I see that it is called in some other tests as well.

@DarkLight1337 DarkLight1337 self-assigned this Jul 2, 2024
@simon-mo simon-mo mentioned this pull request Jul 2, 2024
@youkaichao
Copy link
Member Author

@DarkLight1337 this pr fixes implicit call to torch.cuda.device_count() . Those explicit call can be removed later in a followup PR.

@DarkLight1337
Copy link
Member

DarkLight1337 commented Jul 2, 2024

I found that using lazy import in vllm.transformer_utils.image_processor seems to remove the need for lazy importing the multimodal modules. See the changes to that file in #5276.

@youkaichao
Copy link
Member Author

You can see my stack trace at #6056 (comment) . Yes, image_processor is the part to blame. Lazy import either one would work.

@DarkLight1337
Copy link
Member

DarkLight1337 commented Jul 2, 2024

I found that using lazy import in vllm.transformer_utils.image_processor seems to remove the need for lazy importing the multimodal modules. See the changes to that file in #5276.

Let's do it this way since vllm.transformer_utils.image_processor is the source of the problem and is imported in fewer places.

@youkaichao
Copy link
Member Author

cool, feel free to submit another pr to suppress this one.

@DarkLight1337
Copy link
Member

DarkLight1337 commented Jul 2, 2024

#5276 is getting pretty close to merge so I'm inclined to just use that PR. Is that ok with you?

@youkaichao
Copy link
Member Author

close as #5276 will include the fix

@youkaichao youkaichao closed this Jul 2, 2024
@youkaichao youkaichao deleted the fix_more_init branch July 2, 2024 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants