Bugfix: use device in all Torch models #5026
base: develop
Conversation
Walkthrough: The changes involve modifications to device management in the Torch model implementations.
Still works fine and I can see the difference between CPU and CUDA. Note for the future: this change is not pulled upstream. @harpreetsahota204, can you run this code when you test, to make sure the model is also multi-GPU:
LGTM
```python
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")
session = fo.launch_app(dataset)

# Load the CLIP zoo model on an explicit device and compute embeddings
model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda")
embeddings = dataset.compute_embeddings(model)
```
Worked as expected.
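Since the point of the fix is non-default GPUs, here is a minimal sketch of how that case could be exercised (assuming a machine with at least two CUDA devices; the `cuda:1` index is illustrative, and `dataset` is the quickstart dataset loaded above):

```python
import fiftyone.zoo as foz

# Load the model on a non-default GPU; before this fix, hardcoded
# .cuda() calls would target cuda:0 regardless of this setting
model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda:1")

embeddings = dataset.compute_embeddings(model)
```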
… already loaded not just string
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
fiftyone/utils/super_gradients.py (1)
`98-100`: **Consider adding a docstring note about device flexibility.**

Since this change enables flexible device selection, it would be helpful to document this capability in the class or method docstring. This would help users understand that they can use any available GPU.
Add a note like this to the class docstring:
"""FiftyOne wrapper around YOLO-NAS from https://github.com/Deci-AI/super-gradients. + +The model automatically uses the appropriate device (CPU/GPU) based on availability +and can work with any CUDA device, not just the default one. Args: config: a :class:`TorchYoloNasModelConfig` """
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
- fiftyone/utils/clip/zoo.py (1 hunks)
- fiftyone/utils/open_clip.py (3 hunks)
- fiftyone/utils/super_gradients.py (1 hunks)
🔇 Additional comments (4)
fiftyone/utils/super_gradients.py (1)
`99-99`: **LGTM! Device management improvement.**

The change from `model.cuda()` to `model.to(self.device)` improves GPU device flexibility, allowing the model to work with any available GPU instead of being restricted to the default one.
Let's verify that the model parameters are correctly moved to the specified device:
✅ Verification successful
**Device management change is properly implemented**

The verification confirms that the change from `model.cuda()` to `model.to(self.device)` is correct and consistent with the codebase's device management pattern. Key findings:

- The base `TorchImageModel` class properly initializes `self.device` in its constructor:

  ```python
  device = self.config.device
  if device is None:
      device = "cuda:0" if torch.cuda.is_available() else "cpu"
  ```

- All other model implementations (CLIP, SAM, YOLO-NAS) consistently use `self.device` for device management.
- The change aligns with the existing pattern used across the codebase for moving tensors and models to the target device (see the sketch after this list).
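To illustrate why this matters, a generic PyTorch sketch (not code from this repo): `.cuda()` with no argument always targets the current default CUDA device, while `.to(device)` honors whatever device the caller configured.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
device = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

# model.cuda() would pin the model to the default device (cuda:0);
# model.to(device) moves it to the configured device instead
model = model.to(device)

x = torch.randn(8, 4, device=device)
y = model(x)  # parameters and inputs now live on the same device
```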
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Description: Check if there are any remaining hard-coded device assignments
# that might conflict with the new flexible device management

# Test 1: Look for any remaining .cuda() calls that might bypass device management
rg -g '*.py' '\.cuda\('

# Test 2: Check for proper device initialization in the parent class
ast-grep --pattern 'class TorchImageModel:
  $$$
  def __init__($$$):
    $$$
    self.device = $$$
    $$$
'
```

Length of output: 137
Script:

```bash
#!/bin/bash
# Let's try a different approach to verify device management
# Check for device-related code in TorchImageModel and its subclasses
rg -g '*.py' 'class.*TorchImageModel' -A 10

# Look for device initialization patterns
rg -g '*.py' 'self\.device' -B 2 -A 2

# Check for any remaining cuda-specific code
rg -g '*.py' 'cuda|gpu' -B 2 -A 2
```

Length of output: 26257
fiftyone/utils/open_clip.py (3)
`109-109`: **LGTM: Device handling improvement in text feature extraction.**

The change from `.cuda()` to `.to(self.device)` is correct and aligns with the PR's objective of supporting multiple GPUs.

`121-121`: **LGTM: Device handling improvement in prompt embedding.**

The change from `.cuda()` to `.to(self.device)` is correct and consistent with the previous change.

`109-109`: **Verify device consistency across model components.**

The device handling changes look good, but we should verify that all model components (model, tensors, etc.) are consistently using the same device.

Also applies to: 121-121, 146-146
✅ Verification successful
Based on the shell script results, the device handling patterns are clear across the codebase.

**Device handling is consistent across model components**

The device handling in the OpenCLIP implementation follows the established patterns used throughout the codebase:

- The model and tensors are consistently moved to the same device using `to(self.device)`
- Device initialization follows the standard pattern of defaulting to CUDA if available, otherwise CPU
- All tensor operations (text encoding, image encoding) maintain device consistency (see the standalone sketch below)
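To make the pattern concrete, here is a minimal standalone sketch using the upstream `open_clip` package (a generic illustration of the `.to(device)` pattern, not code from `fiftyone/utils/open_clip.py`; the model and pretrained names are just examples):

```python
import open_clip
import torch

device = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

model = model.to(device)  # honors any configured device, unlike model.cuda()
text = tokenizer(["a photo of a dog"]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)  # runs on the configured device
```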
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Description: Verify device handling consistency across the codebase
# Look for any remaining .cuda() calls or device-related patterns

# Test 1: Check for any remaining .cuda() calls
echo "Checking for remaining .cuda() calls..."
rg '\.cuda\('

# Test 2: Check for device-related patterns in TorchOpenClipModel and related classes
echo "Checking device handling patterns..."
rg -A 3 'to\(.*device|\.device|device\s*='
```

Length of output: 14047
```diff
@@ -143,7 +143,7 @@ def _predict_all(self, imgs):
         frame_size = (width, height)

         if self._using_gpu:
-            imgs = imgs.cuda()
+            imgs = imgs.to(self.device)
```
**Consider adding error handling for device transfers.**

While the device handling change is correct, the subsequent `autocast` context is hardcoded to `"cuda"`. This might cause issues when running on CPU.

Apply this diff to make it device-aware:

```diff
 if self._using_gpu:
     imgs = imgs.to(self.device)

-with torch.no_grad(), torch.amp.autocast("cuda"):
+with torch.no_grad(), torch.amp.autocast(device_type=self.device.type if self._using_gpu else "cpu"):
     image_features = self._model.encode_image(imgs)
     text_features = self._get_text_features()
```
Also applies to: 147-152
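For reference, `torch.amp.autocast` takes the device type as its first argument (`device_type`, either `"cuda"` or `"cpu"`), so a device-agnostic version can derive it from the tensor itself. A generic sketch of that pattern (plain PyTorch, not repo code; `model` stands in for any module applied to the images):

```python
import torch

def encode(model, imgs, device):
    imgs = imgs.to(device)
    # Derive the autocast device type ("cuda" or "cpu") from the tensor
    # instead of hardcoding "cuda", so CPU-only runs don't break
    device_type = imgs.device.type
    with torch.no_grad(), torch.amp.autocast(device_type=device_type):
        return model(imgs)
```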
@jacobsela coderabbit raises an interesting point here. Does `torch.amp.autocast("cuda")` need to be updated?
Yeah, this is valid and will cause problems if not handled. It's on my todo list for this week to review the code more thoroughly before moving further with this PR, because this comment makes me think there are probably more places I haven't noticed that make hardware assumptions.
…pport-for-multi-gpu-setups
What changes are proposed in this pull request?
Make CLIP zoo model work on all GPUs in a system.
How is this patch tested? If it is not, please explain why.
I ran embeddings with the model on GPUs other than 'cuda:0'.
Release Notes
notes for FiftyOne users.
What areas of FiftyOne does this PR affect?
fiftyone
Python library changes