Getting "RuntimeError: Cannot find cuda device suitable for rendering cuda:0" when training an RL agent #142

Open
Pigbrainlyh opened this issue Nov 5, 2023 · 9 comments

Comments

@Pigbrainlyh

System:

  • OS version: Ubuntu 22.04 LTS
  • Python version (if applicable): Python 3.8.10

Describe the bug
I use SAPIEN to render camera images as observations in a gym environment. However, the following error occurred during training.

Traceback (most recent call last):
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../utils/gym_env_utils.py", line 44, in _safe_run_worker
    observation = env.reset()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/gym/wrappers/order_enforcing.py", line 16, in reset
    return self.env.reset(**kwargs)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 1292, in reset
    obs = super().reset(specify_offset)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 603, in reset
    reset_ok, sim_offset = self.__initialize__(sim_offset)
  File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../envs/continuous_insertion.py", line 382, in __initialize__
    renderer = sapien.VulkanRenderer(device="cuda:0", offscreen_only=True)
RuntimeError: Cannot find cuda device suitable for rendering cuda:0

The agent had already interacted with the gym environment before the error occurred, and the corresponding GPU has enough free memory. The agent model is on the cuda:2 device. Why does this error occur, and how can I fix it?

Expected behavior
Successfully obtain a renderer and set it to the sapien engine.

Additional context
Initially I ran the code without device="cuda:0", and a different error occurred:

File "/home/liuyuhao/workspace/tacktile_vision_fusion/scripts/../encoded_env/custom_td3.py", line 341, in learn
    return super().learn(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 356, in learn
    rollout = self.collect_rollouts(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 597, in collect_rollouts
    if callback.on_step() is False:
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 100, in on_step
    return self._on_step()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 204, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 100, in on_step
    return self._on_step()
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 447, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/evaluation.py", line 86, in evaluate_policy
    actions, states = model.predict(observations, state=states, episode_start=episode_starts, deterministic=deterministic)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 632, in predict
    return self.policy.predict(observation, state, episode_start, deterministic)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 333, in predict
    observation, vectorized_env = self.obs_to_tensor(observation)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/policies.py", line 252, in obs_to_tensor
    observation = obs_as_tensor(observation, self.device)
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 466, in obs_as_tensor
    return {key: th.as_tensor(_obs).to(device) for (key, _obs) in obs.items()}
  File "/home/liuyuhao/anaconda3/envs/tactile_sim/lib/python3.8/site-packages/stable_baselines3/common/utils.py", line 466, in <dictcomp>
    return {key: th.as_tensor(_obs).to(device) for (key, _obs) in obs.items()}
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I guessed this happened because the renderer and the agent were using the same GPU, which is why I chose to specify the renderer's device explicitly.
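The second traceback itself notes that CUDA kernel errors may be reported asynchronously and suggests passing CUDA_LAUNCH_BLOCKING=1. A minimal sketch of doing that from Python follows; the helper name `enable_blocking_cuda_launches` is hypothetical, and the variable should be set before the first CUDA call so that the stack trace points closer to the failing operation. This only helps with diagnosis and is not a fix.

```python
import os


def enable_blocking_cuda_launches() -> None:
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper: call it at the very top of the training script,
    before any CUDA work is done (ideally before importing torch).
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_cuda_launches()
# import torch  # import and use torch only after the variable is set
```

Setting the variable on the command line (`CUDA_LAUNCH_BLOCKING=1 python train.py`, where `train.py` stands in for the actual entry point) has the same effect.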

@fbxiang
Collaborator

fbxiang commented Nov 11, 2023

Did you set cuda:2 in your torch code, or did you set CUDA_VISIBLE_DEVICES=2? Setting CUDA_VISIBLE_DEVICES=2 makes the GPU with physical index 2 appear as cuda:0.
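To illustrate the remapping described above, here is a minimal pure-Python sketch of how logical `cuda:N` indices relate to physical GPU indices under CUDA_VISIBLE_DEVICES. The function names are hypothetical, and the model is simplified: it assumes the variable contains only integer indices (no GPU UUIDs) and, as an approximation of CUDA's behavior, stops at the first invalid entry.

```python
from typing import List, Optional


def visible_physical_gpus(env_value: Optional[str], physical_count: int) -> List[int]:
    """Physical GPU indices in the order a CUDA process would enumerate them.

    env_value is the content of CUDA_VISIBLE_DEVICES (None if unset).
    Simplified model: integer indices only, enumeration stops at the
    first entry that is not a valid physical index.
    """
    if env_value is None:
        return list(range(physical_count))
    visible: List[int] = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


def logical_to_physical(logical_index: int, env_value: Optional[str], physical_count: int) -> int:
    """Map a logical index (the N in cuda:N) to the physical GPU index."""
    visible = visible_physical_gpus(env_value, physical_count)
    if not 0 <= logical_index < len(visible):
        raise IndexError(f"cuda:{logical_index} is not visible to this process")
    return visible[logical_index]


# With CUDA_VISIBLE_DEVICES=2 on a 4-GPU machine, cuda:0 is physical GPU 2.
print(logical_to_physical(0, "2", 4))
```

Under this model, a process started with CUDA_VISIBLE_DEVICES=2 sees only one device, so a request for cuda:2 inside it would fail.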

@Pigbrainlyh
Author

> Did you set cuda:2 in your torch code, or did you set CUDA_VISIBLE_DEVICES=2? Setting CUDA_VISIBLE_DEVICES=2 makes the GPU with physical index 2 appear as cuda:0.

I set cuda:2 in my torch code.

@fbxiang
Collaborator

fbxiang commented Nov 22, 2023

Given the current information, it is very hard to understand what is causing the error.
We are currently developing SAPIEN 3, and it should come with more error detection and workarounds in the renderer. It will have a few API changes and I am not sure if it will greatly affect your code.
Our latest development version will always be posted to the release page (Nightly Release). You can try it out and see if the error still exists.

@minghaoguo20

Hi, I am hitting the same error. Is there a known solution or a related issue? Thanks.

@guangnianyuji

I am hitting the same error. Any solution would be highly appreciated. 🧎🧎🧎

@StoneT2000
Member

What version of SAPIEN are you using? And are you testing this via ManiSkill?

@eyrs42

eyrs42 commented Sep 28, 2024

Hi, I ran into this error too. I am using SAPIEN 2.2.2 and tried with both an NVIDIA A40 and an A5000.

I was testing it via ManiSkill, but the error occurs even if I run:

>>> import sapien.core as sapien
>>> sapien.SapienRenderer(offscreen_only=True, device="cuda:0")

Any help would be highly appreciated.

@StoneT2000
Member

@srye2 Can you install sapien==3.0.0b1 and run

sapien info --all

If it doesn't work, please follow the troubleshooting steps at https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting and make sure the relevant packages and files are installed.
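Since the suggestion above hinges on moving from the 2.x line to sapien==3.0.0b1, a quick way to confirm which major version is installed in the active environment is sketched below. The helper `installed_major_version` is hypothetical; it relies only on the standard-library `importlib.metadata` and assumes the distribution is named `sapien`, as in the install command above.

```python
from importlib import metadata
from typing import Optional


def installed_major_version(dist_name: str = "sapien") -> Optional[int]:
    """Return the major version of an installed distribution, or None if absent.

    Hypothetical helper: assumes a version string such as "2.2.2" or
    "3.0.0b1", where the text before the first dot is an integer.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])


major = installed_major_version()
if major is None:
    print("sapien is not installed in this environment")
elif major < 3:
    print("sapien 2.x or older detected; the thread suggests trying sapien==3.0.0b1")
else:
    print("sapien 3.x or newer detected")
```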

@eyrs42

eyrs42 commented Sep 28, 2024

Upgrading worked, thank you so much!
