Getting "RuntimeError: Cannot find cuda device suitable for rendering cuda:0" when training an RL agent #142

Pigbrainlyh · 2023-11-05T06:03:12Z

System:

OS version: Ubuntu 22.04 LTS
Python version (if applicable): Python 3.8.10

Describe the bug
I use SAPIEN to render camera images as observations in a gym environment. However, the following error occurred during training.

Traceback (most recent call last):
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../utils/gym_env_utils.py", line 44, in _safe_run_worker
    observation = env.reset()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/gym/wrappers/order_enforcing.py", line 16, in reset
    return self.env.reset(**kwargs)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 1292, in reset
    obs = super().reset(specify_offset)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 603, in reset
    reset_ok, sim_offset = self.__initialize__(sim_offset)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 382, in __initialize__
    renderer = sapien.VulkanRenderer(device="cuda:0", offscreen_only=True)
RuntimeError: Cannot find cuda device suitable for rendering cuda:0

The agent had already interacted with the gym environment before the above error occurred, and the corresponding GPU has sufficient free memory. The agent model is on the cuda:2 device. I want to know why the error occurs and how to fix it.

Expected behavior
Successfully obtain a renderer and set it to the sapien engine.

Additional context
Also, at the beginning I ran the code without device="cuda:0", and a different error occurred:

File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../encoded_env/custom_td3.py", line 341, in learn
    return super().learn(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 356, in learn
    rollout = self.collect_rollouts(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 597, in collect_rollouts
    if callback.on_step() is False:
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 100, in on_step
    return self._on_step()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 204, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 100, in on_step
    return self._on_step()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 447, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/evaluation.py", line 86, in evaluate_policy
    actions, states = model.predict(observations, state=states, episode_start=episode_starts, deterministic=deterministic)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 632, in predict
    return self.policy.predict(observation, state, episode_start, deterministic)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 333, in predict
    observation, vectorized_env = self.obs_to_tensor(observation)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 252, in obs_to_tensor
    observation = obs_as_tensor(observation, self.device)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 466, in obs_as_tensor
    return {key: th.as_tensor(_obs).to(device) for (key, _obs) in obs.items()}
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 466, in <dictcomp>
    return {key: th.as_tensor(_obs).to(device) for (key, _obs) in obs.items()}
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I guess this was because the renderer and the agent used the same GPU, which is why I chose to specify the renderer's device.
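The error message above suggests `CUDA_LAUNCH_BLOCKING=1` to get a stack trace that points at the call that actually failed. A minimal sketch of enabling it from Python, assuming the variable has to be in place before CUDA is initialized (the simplest way is to set it before `import torch`). The helper name `enable_blocking_launches` is hypothetical and not part of any library:

```python
import os
import sys


def enable_blocking_launches() -> bool:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Returns True if the variable was set before torch was imported, and False
    if torch was already loaded (in which case the setting may come too late).
    """
    torch_already_loaded = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_loaded


if __name__ == "__main__":
    early_enough = enable_blocking_launches()
    print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
    if not early_enough:
        print("warning: torch was imported first; restart and call this earlier")
```

Setting the variable in the shell when launching the training script should have the same effect. Synchronous launches slow training down, so this is probably only worth keeping on while debugging.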

fbxiang · 2023-11-11T21:43:12Z

Did you set cuda:2 in your torch code, or did you set CUDA_VISIBLE_DEVICES=2? Setting CUDA_VISIBLE_DEVICES=2 makes the GPU with index 2 become cuda:0.
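A minimal sketch of that index remapping, assuming `CUDA_VISIBLE_DEVICES` holds the common comma-separated list of integer indices (UUID entries are not handled). The function name `logical_to_physical` is hypothetical, and whether the SAPIEN renderer numbers devices the same way as torch is not confirmed here:

```python
from typing import Dict, Optional


def logical_to_physical(cuda_visible_devices: Optional[str]) -> Optional[Dict[int, int]]:
    """Map the logical index N in "cuda:N" to the physical GPU index.

    Only handles the comma-separated integer form, e.g. "2" or "2,0".
    Returns None when the variable is unset (no remapping takes place).
    """
    if cuda_visible_devices is None:
        return None
    mapping: Dict[int, int] = {}
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit():
            # UUIDs, negative or malformed entries are outside this sketch; stop here.
            break
        mapping[len(mapping)] = int(token)
    return mapping


if __name__ == "__main__":
    import os

    print(logical_to_physical(os.environ.get("CUDA_VISIBLE_DEVICES")))
    print(logical_to_physical("2"))    # {0: 2}
    print(logical_to_physical("2,0"))  # {0: 2, 1: 0}
```

With `CUDA_VISIBLE_DEVICES=2`, a request for `cuda:0` lands on physical GPU 2 and `cuda:2` no longer exists in that process. With the variable unset, `cuda:2` in torch code refers to GPU 2 directly.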

Pigbrainlyh · 2023-11-13T00:32:33Z

> Did you set cuda:2 in your torch code, or did you set CUDA_VISIBLE_DEVICES=2? Setting CUDA_VISIBLE_DEVICES=2 makes the GPU with index 2 become cuda:0.

I set cuda:2 in my torch code.

fbxiang · 2023-11-22T05:38:06Z

Given the current information, it is very hard to understand what is causing the error.
We are currently developing SAPIEN 3, and it should come with more error detection and workarounds in the renderer. It will have a few API changes and I am not sure if it will greatly affect your code.
Our latest development version will always be posted to the release page (Nightly Release). You can try it out and see if the error still exists.

minghaoguo20 · 2024-09-20T19:33:43Z

Hi, I am hitting the same error. Is there a known solution or a related issue? Thanks.

guangnianyuji · 2024-09-25T06:14:19Z

I am hitting the same error. Any solution would be highly appreciated. 🧎🧎🧎

StoneT2000 · 2024-09-25T08:07:56Z

What version of sapien are you using? And are you testing this via ManiSkill?

eyrs42 · 2024-09-28T18:17:09Z

Hi, I ran into this error too. I am using sapien version 2.2.2, and I tried with both an NVIDIA A40 and an A5000.

I was testing it via ManiSkill, but the error occurs even if I run:

>>> import sapien.core as sapien
>>> sapien.SapienRenderer(offscreen_only=True, device="cuda:0")

Any help would be highly appreciated.

StoneT2000 · 2024-09-28T19:23:35Z

@srye2 can you install sapien==3.0.0b1 and run

sapien info --all

If it doesn't work, please follow https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting and make sure the relevant packages and files are installed.
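A rough sketch of checking for driver-related files before digging further. The paths below are assumptions about where the NVIDIA Vulkan/EGL configuration files typically live on Ubuntu; the linked troubleshooting guide is the authority on which files are actually required. `CANDIDATE_FILES` and `check_files` are hypothetical names:

```python
from pathlib import Path
from typing import Dict, Iterable

# Assumed typical locations on Ubuntu with the NVIDIA driver; adjust for your system.
CANDIDATE_FILES = (
    "/usr/share/vulkan/icd.d/nvidia_icd.json",
    "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
    "/etc/vulkan/implicit_layer.d/nvidia_layers.json",
)


def check_files(paths: Iterable[str] = CANDIDATE_FILES) -> Dict[str, bool]:
    """Report which of the given files exist on this machine."""
    return {p: Path(p).is_file() for p in paths}


if __name__ == "__main__":
    for path, present in check_files().items():
        print("found  " if present else "MISSING", path)
```

A missing file here does not prove the renderer will fail, and a present file does not prove it will work; it only narrows down where to look.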

eyrs42 · 2024-09-28T21:15:36Z

Upgrading worked, thank you so much!
