Vectorized Trajectory Management #2660
-
I've got a band-aid solution:

```python
# Sort buffer so episodes are contiguous. Lets us sample continuous trajectories.
_, sorted_indices = torch.sort(buffer.storage._storage[:len(buffer)]["episode"])
buffer.storage._storage[:len(buffer)] = buffer.storage._storage[:len(buffer)][sorted_indices]
sample = buffer.sample()
```

Although if there's something faster, that'd be awesome.
-
Hey! Essentially what you want to do is store every trajectory coming from each env one after the other, i.e. the trajectories coming in from each env are concatenated along dim=-1 to get something like

```python
traj = tensor([
    [0, 0, 0, 2, 2, 2, 2, 5, 5, 5, 5, 7, 7, 7, 7, 9],
    [1, 1, 1, 1, 3, 3, 4, 4, 4, 6, 6, 6, 8, 8, 8, 8],
])
```

The core thing is to tell your storage that you have 2 dims:

```python
...
storage=LazyTensorStorage(100, ndim=2),
...
```

When you call extend, just pass a tensordict of shape [num_envs, time]. Then you can use a SliceSampler and it'll know how the trajectories are stored and sample things accordingly. LMK if anything's unclear!
-
oh nice okay! I missed that part of the doc. I think this part confused me a bit:

I'm not sure I fully understand what that means still, and dim_extend isn't mentioned elsewhere. Although what you suggested above works!

```python
action = torch.tensor(env.action_space.sample(), device=device)
next_obs, reward, terminated, truncated, info = env.step(action)
tdlist.append(TensorDict(
    batch_size=cfg.training.num_envs,
    episode=episode_idx,
    obs=obs, action=action, reward=reward, next_obs=next_obs,
    truncated=truncated, terminated=terminated,
    steps=info['elapsed_steps'],
))
obs = next_obs
```
I store the tds in a list, where episode looks like [1, 2, 3, 4, 5, 6, 7, 8, 9] (one id per env), and then when there are any done signals I add the accumulated list to the buffer and give the finished envs new episode ids:

```python
dones = torch.logical_or(terminated, truncated)
if dones.any():
    reset_obs, info = env.reset(options={'env_idx': torch.nonzero(dones).squeeze()})
    obs = torch.where(dones.unsqueeze(-1), reset_obs, obs)
    print(f"Resetting {dones.sum()} environments")
    # Stack to [T, num_envs], then transpose to [num_envs, T] for the ndim=2 storage.
    trajectories = torch.stack(tdlist, dim=0).transpose(0, 1)
    buffer.extend(trajectories)
    tdlist.clear()
    # Update episode_idx for done episodes to be a new unique id.
    # squeeze(-1) keeps a 1D index even when only one env is done.
    update_indices = torch.nonzero(dones).squeeze(-1)
    next_values = torch.arange(episode_idx.max() + 1, episode_idx.max() + 1 + len(update_indices), device=device)
    episode_idx[update_indices] = next_values
```

That works! Should I just extend on every step instead of accumulating, for speed?
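In case it helps to picture it, this is roughly what per-step extension would look like instead of accumulating in tdlist. This is an untested sketch reusing the variables from the snippet above, and I haven't checked whether it's actually faster:

```python
# Untested sketch: extend once per step instead of accumulating in tdlist.
# unsqueeze(-1) adds a singleton time dim so the data matches the storage's [env, time] layout.
step_td = TensorDict(
    batch_size=cfg.training.num_envs,
    episode=episode_idx,
    obs=obs, action=action, reward=reward, next_obs=next_obs,
    truncated=truncated, terminated=terminated,
)
buffer.extend(step_td.unsqueeze(-1))  # shape [num_envs, 1]
```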
-
Hi -- I'm unsure how to neatly manage vectorized trajectories with a replay buffer.
I was thinking something like this:
But it seems like the trajectories need to be contiguous. Since my environments might not end at the same timestep, is there a better way to manage that in a vectorized/fast way?