Issue Description
I experience frequent freezes on my workstation during training sessions with OmniIsaacGymEnvs, specifically when training the Crazyflie task with a modified reward function in headless mode using multi-GPU. The freezes necessitate a full reboot. Notably, there is noticeable input lag (mouse and keyboard) prior to these freezes, and the GPUs emit continuous impulse sounds, indicating high activity.
nvidia-bug-report.log.gz
Steps to Reproduce
1. Run the Crazyflie task with the modified reward function in headless mode with a multi-GPU setup (see the command sketch below).
2. Observe continuous GPU activity followed by an eventual system freeze that requires a reboot.
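For reference, a minimal sketch of the launch command under these conditions, assuming the standard OIGE rl_games training script and two GPUs (PYTHON_PATH is the Isaac Sim Python wrapper; the local reward-function modification is not shown):

PYTHON_PATH -m torch.distributed.run --nnodes=1 --nproc_per_node=2 \
    omniisaacgymenvs/scripts/rlgames_train.py task=Crazyflie headless=True multi_gpu=True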
Expected Behavior
The system should handle training without significant performance degradation or freezing, as it does on another workstation with lower specifications (Intel Xeon W-2150B, 128 GB DDR4 RAM, single RTX A6000 GPU), where only minor lag occurs and the system never freezes.
Actual Behavior
The system freezes during training and becomes completely unresponsive; the display cannot be recovered even after reconnecting the HDMI cable. I attempted to reduce system load by closing applications such as the Edge browser, the VPN client, and VS Code, but the issue persists.
Additional Information
Attempting to update Isaac Sim to version 4.0.0 and use the latest OIGE repo resulted in errors about failing to unload CUDA module data (return code 700, i.e. an illegal memory access), indicating potential compatibility or stability issues with the newer versions:
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,314ms] [Warning] [carb] Recursive unloadAllPlugins() detected!
There was an error running python
(simple_eureka) simonwsy@simonwsy-Z790-UD:~/.local/share/ov/pkg/isaac-sim-4.0.0/OmniIsaacGy
Possible Solutions
Hardware incompatibility with Ubuntu 22.04 seems unlikely, since simultaneous CPU and GPU stress tests (cpu-burner and gpu-burner) did not reproduce the freezing or lagging.
There may be a problem with process handling in multi-GPU setups: torch.distributed occasionally reports port-allocation errors, suggesting that worker processes are not terminating correctly (see the cleanup sketch below).
Reinstalling Isaac Sim could be a potential fix, though it does not address the errors becoming more frequent over time.
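As a minimal cleanup sketch between runs, assuming the default torch.distributed rendezvous port 29500 and the rlgames_train.py script name (adjust to the actual setup):

# list leftover training workers still holding the GPUs
nvidia-smi
ps aux | grep rlgames_train.py
# kill stragglers so the next run can bind the rendezvous port again
pkill -9 -f rlgames_train.py
# confirm nothing is still listening on the default master port
ss -ltnp | grep 29500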
I am looking for guidance on whether this is a known issue and whether there are recommended settings or configurations that could mitigate these problems.