[BUG] MultiaSyncDataCollector crashes when passing Replay Buffer to constructor #2614
Comments
That error is probably not the one responsible for the crash but a remnant of the original one. Is there anything else in the error stack you can share?
I only see this log + "Killed".
Does it run on a single process (SyncDataCollector)?
Yes, everything works fine with SyncDataCollector (both with and without passing the replay buffer to the data collector).
Can your policy be serialized? Have you changed the default start method of multiprocessing? (I think "fork" will fail.)
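For reference, the start method can be forced at the top of the entry point (a minimal sketch; "spawn" is usually required when CUDA tensors are shared between processes):

import torch.multiprocessing as mp

if __name__ == "__main__":
    # must run once, before any worker process is created
    mp.set_start_method("spawn", force=True)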
@vmoens I tried setting the start method to fork and to spawn; both crash.
Traceback (most recent call last):
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/_utils.py", line 669, in run
return mp.Process.run(self, *args, **kwargs)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 2960, in _main_async_collector
next_data = next(dc_iter)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 247, in __iter__
yield from self.iterator()
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 1035, in iterator
tensordict_out = self.rollout()
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/_utils.py", line 481, in unpack_rref_and_invoke_function
return func(self, *args, **kwargs)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 1177, in rollout
self.replay_buffer.add(self._shuttle)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/data/replay_buffers/replay_buffers.py", line 1199, in add
index = super()._add(data)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/data/replay_buffers/replay_buffers.py", line 598, in _add
index = self._writer.add(data)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/data/replay_buffers/writers.py", line 276, in add
self._storage.set(index, data)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/_utils.py", line 394, in _lazy_call_fn
result = self._delazify(self.func_name)(*args, **kwargs)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/data/replay_buffers/storages.py", line 724, in set
self._storage[cursor] = data
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/tensordict/_td.py", line 900, in __setitem__
subtd.set(value_key, item, inplace=True, non_blocking=False)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/tensordict/base.py", line 4706, in set
return self._set_tuple(
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/tensordict/_td.py", line 3483, in _set_tuple
return self._set_str(
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/tensordict/_td.py", line 3409, in _set_str
inplace = self._convert_inplace(inplace, key)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/tensordict/_td.py", line 3394, in _convert_inplace
raise RuntimeError(_LOCK_ERROR)
RuntimeError: Cannot modify locked TensorDict. For in-place modification, consider using the `set_()` method and make sure the key is present.
Env Data Collection: 0it [00:10, ?it/s]
Error executing job with overrides: ['env=dmc_cartpole_balance', 'algo=sac_pixels']
Traceback (most recent call last):
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/mila/b/myuser/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
cli.main()
File "/home/mila/b/myuser/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
run()
File "/home/mila/b/myuser/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
runpy.run_path(target, run_name="__main__")
File "/home/mila/b/myuser/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
File "/home/mila/b/myuser/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
_run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
File "/home/mila/b/myuser/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
exec(code, run_globals)
File "scripts/train_rl.py", line 131, in <module>
main()
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "scripts/train_rl.py", line 123, in main
trainer.train()
File "/home/mila/b/myuser/SegDAC/segdac_dev/src/segdac_dev/trainers/rl_trainer.py", line 41, in train
for _ in tqdm(self.train_data_collector, "Env Data Collection"):
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 247, in __iter__
yield from self.iterator()
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 2584, in iterator
idx, j, out = self._get_from_queue(timeout=10.0)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/site-packages/torchrl/collectors/collectors.py", line 2540, in _get_from_queue
new_data, j = self.queue_out.get(timeout=timeout)
File "/home/mila/b/myuser/micromamba/envs/dmc_env/lib/python3.10/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
[W1129 09:45:22.315415632 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
There seem to be two error messages. One is:
RuntimeError: Cannot modify locked TensorDict. For in-place modification, consider using the `set_()` method and make sure the key is present.
and the other is:
multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty
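For what it's worth, the lock error fires when a write targets a key that is not yet present in a locked TensorDict; a minimal sketch outside torchrl (hypothetical keys) that reproduces the message:

import torch
from tensordict import TensorDict

td = TensorDict({"a": torch.zeros(3)}, batch_size=[3])
td.lock_()  # shared storages end up locked like this

# in-place writes to an existing key are allowed on a locked TensorDict
td[0] = TensorDict({"a": torch.ones(())}, batch_size=[])

# a key that is absent cannot be added to a locked TensorDict
try:
    td[0] = TensorDict({"b": torch.ones(())}, batch_size=[])
except RuntimeError as err:
    print(err)  # Cannot modify locked TensorDict. ...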
Interesting! Can you share how you create the replay buffer?
Sure, here is how I create it:

import torch
from omegaconf import DictConfig
from torchrl.data import ReplayBuffer
from torchrl.data import TensorDictReplayBuffer
from torchrl.data import LazyMemmapStorage
from torchrl.data import LazyTensorStorage
from torchrl.envs.transforms import Compose
from torchrl.envs.transforms import ToTensorImage
from torchrl.envs.transforms import CatFrames
from torchrl.envs.transforms import Transform
from torchrl.envs.transforms import ExcludeTransform


class UnscaleActionTransform(Transform):
    """Maps scaled (policy-space) actions back to the environment's native action range."""

    def __init__(self, env_action_scaler):
        super().__init__(in_keys=["action"], out_keys=["action"])
        self.env_action_scaler = env_action_scaler

    def _apply_transform(self, scaled_action):
        return self.env_action_scaler.unscale(scaled_action)


def get_replay_buffer_data_saving_transforms() -> list:
    """
    Transforms executed when saving data to the replay buffer.
    We exclude pixels_transformed because it is float32 (expensive to store);
    the uint8 RGB image is stored instead.
    """
    return [
        ExcludeTransform(
            "pixels_transformed", ("next", "pixels_transformed"), inverse=True
        )
    ]


def get_replay_buffer_sample_transforms(cfg: DictConfig, env_action_scaler) -> list:
    """Transforms executed when sampling data from the replay buffer."""
    transforms = []
    if "pixels" in cfg["algo"]["training_keys"]:
        transforms.append(
            ToTensorImage(
                from_int=True,
                in_keys=["pixels", ("next", "pixels")],
                out_keys=["pixels_transformed", ("next", "pixels_transformed")],
                shape_tolerant=True,
            )
        )
        frame_stack = cfg["algo"].get("frame_stack")
        if frame_stack is not None:
            transforms.append(
                CatFrames(
                    N=int(frame_stack),
                    dim=-3,
                    in_keys=["pixels_transformed", ("next", "pixels_transformed")],
                    out_keys=["pixels_transformed", ("next", "pixels_transformed")],
                )
            )
    transforms.append(UnscaleActionTransform(env_action_scaler))
    return transforms


def create_replay_buffer(cfg: DictConfig, env_action_scaler) -> ReplayBuffer:
    storage_device = torch.device(cfg["storage_device"])
    capacity = cfg["algo"]["replay_buffer"]["capacity"]
    transforms = []
    transforms.extend(get_replay_buffer_data_saving_transforms())
    transforms.extend(get_replay_buffer_sample_transforms(cfg, env_action_scaler))
    transform = Compose(*transforms)
    storage_kwargs = {}
    storage_kwargs["max_size"] = capacity
    storage_kwargs["device"] = storage_device
    if cfg["env"]["num_workers"] > 1:
        storage_kwargs["ndim"] = 2
    if "cpu" in storage_device.type:
        # LazyMemmapStorage is only supported on CPU
        replay_buffer = TensorDictReplayBuffer(
            storage=LazyMemmapStorage(**storage_kwargs),
            transform=transform,
        )
    else:
        replay_buffer = TensorDictReplayBuffer(
            storage=LazyTensorStorage(**storage_kwargs),
            transform=transform,
        )
    return replay_buffer
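As a side note, with this setup the image transforms only run at sampling time; a hypothetical usage sketch (assuming the cfg and env_action_scaler from above):

buffer = create_replay_buffer(cfg, env_action_scaler)
# sample-time transforms (ToTensorImage, CatFrames, UnscaleActionTransform)
# are applied here, not when the collector writes data
batch = buffer.sample(256)
frames = batch["pixels_transformed"]  # float32, frame-stacked pixels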
Describe the bug
Creating an instance of MultiaSyncDataCollector crashes when we pass replay_buffer=my_replay_buffer to its constructor. The following log is observed:
python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '
To Reproduce
Pass a replay buffer to the MultiaSyncDataCollector constructor, as in the sketch below.
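The original reproduction snippet is not preserved in this thread; a minimal sketch of the failing pattern (placeholder env factory, random policy) would look like:

from torchrl.collectors import MultiaSyncDataCollector
from torchrl.envs import GymEnv

def make_env():
    return GymEnv("Pendulum-v1")  # placeholder env

collector = MultiaSyncDataCollector(
    [make_env, make_env],
    policy=None,  # None falls back to a random policy
    frames_per_batch=200,
    total_frames=2_000,
    replay_buffer=replay_buffer,  # buffer built as in the snippet above
)
for _ in collector:  # the crash above occurs while iterating
    pass
collector.shutdown()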
Expected behavior
No crash.
System info
Describe the characteristics of your environment:
Checklist