[Feature Request] Examples Suggestion #861
Thanks for this! I really appreciate the feedback.
Also: I was thinking of doing a getting started like this: ## Getting started with torchrl
Each "chapter" would be a separate notebook that can be read through in 10-20 mins. Must be executed on CPU. Questions
cc @NKay5
Perhaps "getting started" should be even simpler for pea-brains like myself. Here's the one from coax, which I really like import os
import gym
import jax
import coax
import haiku as hk
import jax.numpy as jnp
from optax import adam
# the name of this script
name = 'dqn'
# env with preprocessing
env = gym.make('PongNoFrameskip-v4', render_mode='rgb_array') # AtariPreprocessing will do frame skipping
env = gym.wrappers.AtariPreprocessing(env)
env = coax.wrappers.FrameStacking(env, num_frames=3)
env = gym.wrappers.TimeLimit(env, max_episode_steps=108000 // 3)
env = coax.wrappers.TrainMonitor(env, name=name, tensorboard_dir=f"./data/tensorboard/{name}")
def func(S, is_training):
    """ type-2 q-function: s -> q(s,.) """
    seq = hk.Sequential((
        coax.utils.diff_transform,
        hk.Conv2D(16, kernel_shape=8, stride=4), jax.nn.relu,
        hk.Conv2D(32, kernel_shape=4, stride=2), jax.nn.relu,
        hk.Flatten(),
        hk.Linear(256), jax.nn.relu,
        hk.Linear(env.action_space.n, w_init=jnp.zeros),
    ))
    X = jnp.stack(S, axis=-1) / 255.  # stack frames
    return seq(X)
# function approximator
q = coax.Q(func, env)
pi = coax.EpsilonGreedy(q, epsilon=1.)
# target network
q_targ = q.copy()
# updater
qlearning = coax.td_learning.QLearning(q, q_targ=q_targ, optimizer=adam(3e-4))
# reward tracer and replay buffer
tracer = coax.reward_tracing.NStep(n=1, gamma=0.99)
buffer = coax.experience_replay.SimpleReplayBuffer(capacity=1000000)
# DQN exploration schedule (stepwise linear annealing)
epsilon = coax.utils.StepwiseLinearFunction((0, 1), (1000000, 0.1), (2000000, 0.01))
while env.T < 3000000:
    s, info = env.reset()
    pi.epsilon = epsilon(env.T)

    for t in range(env.spec.max_episode_steps):
        a = pi(s)
        s_next, r, done, truncated, info = env.step(a)

        # trace rewards and add transition to replay buffer
        tracer.add(s, a, r, done or truncated)
        while tracer:
            buffer.add(tracer.pop())

        # learn
        if len(buffer) > 50000:  # buffer warm-up
            metrics = qlearning.update(buffer.sample(batch_size=32))
            env.record_metrics(metrics)

        if env.T % 10000 == 0:
            q_targ.soft_update(q, tau=1)

        if done or truncated:
            break

        s = s_next

    # generate an animated GIF to see what's going on
    if env.period(name='generate_gif', T_period=10000) and env.T > 50000:
        T = env.T - env.T % 10000  # round to 10000s
        coax.utils.generate_gif(
            env=env, policy=pi, resize_to=(320, 420),
            filepath=f"./data/gifs/{name}/T{T:08d}.gif")

This "getting started" will get them hooked because it is concise, it is clear what each line is doing, and it leaves room for the user to modify the script as they see fit. I find learning by example works well for me, not sure about others. I think your proposed "getting started" would probably read best as separate chapters after the user has looked at a few examples. They come after the examples, but before you get to reading the autogenerated API docs.
Separate chapters/sections on envs/specs, transforms, rollouts/collectors, replay buffers, etc. These pieces are pretty modular and algorithm-agnostic, so I think it is ok to separate these. Then perhaps a chapter on "advanced sampling" where you put everything together and demonstrate all the bells and whistles in a single script.
These sound good as separate chapters.
Yes, this could be folded in with the env stuff
Yes! Separate chapter.
Yes! Separate chapter.
If things are modular, this shouldn't really be necessary after the user has read through the "getting started".
I think if you make these separate from "getting started" examples, you can go a bit more complex. You can write a paragraph or two about what exactly a collector is. You can have 3-4 separate code blocks that are different types of collectors, or for different situations (e.g. multiagent).
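For instance, the collector section could contrast a single-process and a multi-process collector side by side. A hedged sketch (the class names are TorchRL's, but make_env, policy, and the frame counts are placeholders I'm assuming here):

```python
from torchrl.collectors import SyncDataCollector, MultiSyncDataCollector

# single-process collector: steps one env in the main process
collector = SyncDataCollector(
    make_env(),              # assumed env constructor
    policy,                  # assumed policy module
    frames_per_batch=1000,
    total_frames=1_000_000,
)

# multi-process collector: one env per worker, batches gathered synchronously
multi_collector = MultiSyncDataCollector(
    [make_env, make_env],    # one env-creating callable per worker
    policy,
    frames_per_batch=1000,
    total_frames=1_000_000,
)
```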
My proposed "getting started" from above has a goal: Write the simplest-possible agent being trained on a single specific environment using a for-loop, giving the users the high-level structure of what their code should look like. I defer to the
I think once the user reads the "getting started" and gleans some structure/naming conventions (e.g. collector == rollout worker set), they are better equipped to select and read specific chapters. I imagine a typical user might leave ~80% of the "getting started" example as is, and read whichever chapter they are interested in messing with (e.g. collectors, models, distributions, etc).
For "getting started", maybe have one algorithm use pixels? Otherwise I think state should be sufficient. Let them grasp the main idea, then throw all the extra tools at them. I don't think you want to overload them with everything TorchRL can do yet.
Yeah, you need to set some prerequisites. Perhaps assume users have read Sutton & Barto and know what Q-learning and policy gradient are, as well as knowledge of
Yes, for me it is a +1 to @smorad's comment. I think what we currently miss is a short notebook (one page or so) that a user who has never seen torchrl before could glance at as a first approach and understand the most important parts of the sampling-training pipeline in a few minutes. Something close to the example that Steven proposed, where you have two lines initialising an env, a parallel env and a step-counter wrapper, a few lines to make the actor and critic model parts, and then the loop with the loss (which can be switched out for other losses, e.g. PPO for DDPG). I would also prefer bare classes instead of helper functions, as those make you wonder what logic they are actually wrapping.
We have something like that with DQN and DDPG: https://pytorch.org/rl/tutorials/coding_ddpg.html https://pytorch.org/rl/tutorials/coding_dqn.html Are you saying they're too elaborate?
These two are the first two notebooks that everyone from our lab looked at when we first got interested in the library. There are in-depth sections for each component, but maybe, as the very first approach, something short that can be read in 2 minutes like the coax one might be better. Those two are still helpful to have for further understanding.
Got it! Let's identify a simple problem that can be nailed in a few lines of code then!
Just boiling down something like those two to the bone might be an option, where the simplest env is chosen with a simple transform and all hyperparameter and training choices are already made. Yeah, for example a bare-bones CartPole PPO not using pixels.
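For the env part, that could be as small as the following sketch (the task id and the StepCounter choice are just examples):

```python
from torchrl.envs import GymEnv, StepCounter, TransformedEnv

# simplest possible setup: one env, one transform, no normalization
env = TransformedEnv(GymEnv("CartPole-v1"), StepCounter())
```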
Cool
In this bare-bones version my personal opinion would be to use them, but maybe not as abstract as using the
Good! Will do!
@matteobettini @viktor-ktorvi @smorad @shagunsodhani @NKay5 I have written a PPO tutorial. Wdyt? Is it simple enough?
Looks very good to me! This will help a lot. I wonder if alongside this we could provide a TL;DR version of the same tutorial, e.g.:

# Env
env = TransformedEnv(
    GymEnv(
        "InvertedDoublePendulum-v4",
        device=device,
        frame_skip=frame_skip,
    ),
    Compose(
        # normalize observations
        ObservationNorm(in_keys=["observation"]),
        DoubleToFloat(in_keys=["observation"]),
        StepCounter(),
    ),
)

# Policy
policy_module = TensorDictModule(
    nn.Sequential(
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(2 * env.action_spec.shape[-1], device=device),
        NormalParamExtractor(),
    ),
    in_keys=["observation"],
    out_keys=["loc", "scale"],
)

# Value function
value_module = ValueOperator(
    module=nn.Sequential(
        nn.LazyLinear(num_cells, device=device),
        nn.Tanh(),
        nn.LazyLinear(1, device=device),
    ),
    in_keys=["observation"],
)

# Collector
collector = SyncDataCollector(
    env,
    policy_module,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
    split_trajs=False,
)

# Replay buffer
replay_buffer = ReplayBuffer(
    storage=LazyTensorStorage(frames_per_batch),
    sampler=SamplerWithoutReplacement(),
)

# Loss
advantage_module = GAE(
    gamma=gamma, lmbda=lmbda, value_network=value_module, average_gae=True
)
loss_module = ClipPPOLoss(
    actor=policy_module,
    critic=value_module,
    advantage_key="advantage",
    clip_epsilon=clip_epsilon,
    entropy_bonus=bool(entropy_eps),
    entropy_coef=entropy_eps,
)
optim = torch.optim.Adam(loss_module.parameters(), lr)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optim, total_frames // frames_per_batch, 0.0
)

# Training
logs = defaultdict(list)
for i, data in enumerate(collector):
    for k in range(num_epochs):
        advantage_module(data)
        data_view = data.reshape(-1)
        replay_buffer.extend(data_view)
        for j in range(frames_per_batch // batch_size):
            subdata, *_ = replay_buffer.sample(batch_size)
            loss_vals = loss_module(subdata)
            loss_value = (
                loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"]
            )

            # Optim
            loss_value.backward()
            torch.nn.utils.clip_grad_norm_(
                loss_module.parameters(), max_grad_norm
            )
            optim.step()
            optim.zero_grad()
    collector.update_policy_weights_()
    scheduler.step()
collector.shutdown()
del collector
This would allow users to see a one-page recap of the contents that they can copy and paste in one action. P.S. I think you might have left a comment that belongs to another tutorial: in the epoch iteration you talk about the mask and padding, but we are not splitting trajs.
Huge improvement in understandability! I've written down my thoughts from last night while going through the tutorial, and I think this will give you an insight into how someone whose first contact with the library was two days ago is thinking, which is the tutorial's target audience, I'd say.
Well done.
I think this is because I'm unfamiliar with the terminology that's probably being used in Gym. I'm still not 100% sure about the relationship between
I'm now pretty sure that
Should have gotten this from the algorithm itself but it did make me confused for a couple of minutes there.
Same issues again. Reading through the tutorial again, I think I see where the issue is. I'm having trouble differentiating between "batch" in the context of the data collector and "batch" in the context of sampling from a replay buffer; the second one is clear to me, and I don't quite get what the first concept represents. Maybe it's my lack of terminology knowledge.
I see that it's not too important but it tickles my interest to know more about it.
This part is pretty good as well; there's just the confusion from before. Those were my thoughts from last night. I see there were some changes made afterwards, but the gist of it is probably the same. I hope this gives the perspective of a new user. Overall, a great improvement! I really appreciate it!
This is very useful, thanks! I will clarify it all.
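In short, and reusing the names from the TL;DR snippet above (a rough sketch of the idea; API details may differ by version): the collector's "batch" is the frames_per_batch transitions gathered from the environment on each iteration, while the replay buffer's "batch" is the batch_size minibatch drawn from that data for one gradient step.

```python
# Sketch only: collector, replay_buffer, frames_per_batch and batch_size
# are assumed to be defined as in the TL;DR snippet above.
for data in collector:                        # one "collector batch": frames_per_batch transitions
    replay_buffer.extend(data.reshape(-1))    # flatten and store the whole collector batch
    for _ in range(frames_per_batch // batch_size):
        minibatch = replay_buffer.sample(batch_size)  # one "training batch" for a gradient step
        # ... compute the loss on `minibatch` and step the optimizer ...
```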
I edited the tutorial based on your feedback. It is very verbose right now (perhaps too much?). @matteobettini regarding your question, the doc originates from a
The in and out keys are clear as day. It's very verbose, but I'd rather have that information than not. I imagine that most people will focus on the code at first and then consult the text when they need to. I really like the tutorial now, although I am biased, having read it multiple times over. Thanks for taking the feedback!
Great work! A couple of minor suggestions below (I've only read up to the PPO section; I'll get back to the rest later):
Tutorial is using the inverted pendulum task
Indentation is messed up
Maybe add hyperlinks for each
Maybe say "sophisticated version of REINFORCE"
"Next, we will design"
Add link to tensordict
"execute the policy on GPU" |
Thanks for the feedback everyone!
Do you mean whether the code should be split up or all together? I think the split with text in between is good. The "view on GitHub" button does the trick if you want to copy-paste the entire thing.
By the way, I've been trying to run the ppo example and I get an error:
print("observation_spec:", env.observation_spec) It has something to do with the StepCounter() transform. |
According to my personal taste, yes, while still keeping as much generality as you can. There is always this trade-off between simplicity and generality, right? I personally lean more towards simplicity. I would make the top-level files in the example folder simple and without helper functions. Then you can create a directory tree where you have examples for each extension (e.g. the main PPO could be like the one I proposed, then you would have other files for multi-task PPO, multi-agent PPO, and so on).
I can't reproduce this; does it happen with the nightly?
@matteobettini |
Can't even run the nightly because of the C++ extensions problem, so I can't check.
Yeah. I like looking at the code in a single continuous format, then going through the explanations for each block to clarify what I don't understand.
That would be great!
I love the modular spirit of TorchRL, very much appreciated for RL research! And seeing the list of features growing quickly is also very encouraging for the future of this library. However, coming back to this issue after 1 year and despite all efforts already deployed to improve it, I feel that documentation & examples are still the main weak spots of TorchRL (probably slowing down its adoption):
If this issue becomes a priority again, I would recommend taking some inspiration from the very readable documentation & examples (both agent stubs and specific implementations, even for advanced distributed algorithms like Ape-X) from coax, another modular RL library that is unfortunately not much maintained anymore. In any case, thanks again for all the great work that has gone into TorchRL already; this comment is just about making it more accessible and attractive to newcomers 😃
Yes. Coming back to this after a while, there are still some points that I feel can be improved. One that we talked a lot about, and that I think is worth considering, is changing the name of the
It would be useful then to have an
Agreed! What format should it take in your opinion? A section on the main page with a basic example? A page-by-page tutorial where one learns how to code a basic algo? |
Wrt. the Getting Started, I would personally love a minimal example (like the one from your TorchRL paper in Figure 1). That would be a great start! Of course, reworking/adding examples in general would be more difficult & time-consuming...
Cool, I'll work on that and ping you with it!
I've started working on this here: #1886 Preview: The rendering is terrible and there are countless things to fix, but is this format OK? I was thinking of having one on envs, another on writing a policy, another on losses and optim, one on collecting and storing data, and finally one on logging (wandb and such).
Thank you for taking care of this so quickly. I initially imagined something shorter than 5 new notebooks as a Getting Started - but having read them carefully, I actually liked how you explained every core concept step-by-step and felt like every bit of information here was valuable. Great work, I believe this will be a very nice addition to the documentation! Just a few little things I noticed on the various notebooks:
- [typo] We now need to pass this action tp the environment.
- [formatting] To simplify the incorporation of Actor, # ProbabilisticActor, # ActorValueOperator or # ActorCriticOperator.
- [typo] In TorchRL, we try to tread optimization as it is custom to do in PyTorch
- [formatting] The most basic data collector is the :clas:
- [reformulation] We’ll be making a regular, deterministic version to be used within the loss module and during evaluation, and one augmented by an exploration module for inference.
Thanks again!
I think I fixed them all except for the last one! I also specified that you can start with the last one and quickly see what a training loop looks like before backtracking to the other tutorials; hope that makes sense and is closer to what you had in mind!
Motivation
This is somewhat related to #849. I realize TorchRL is quite new and I expect the documentation and examples will improve over time, but I believe the current examples are a bit difficult to follow for newcomers like myself. It is very tempting to briefly scan the TorchRL examples and think "this is too complex, I'll go use CleanRL/StableBaselines". The current examples seem to fall into two buckets:
For type 1 examples, there seems to be a lot of extra setup/config that clutters the example (example). IMO, while things like observation normalization and reward scaling are useful, I don't think they belong in the simple examples; perhaps as separate entries in some "advanced examples" category. Many of the type 1 examples seem like they were intended as configurable algorithm implementations, making them abstract and harder to follow. I think it might be better to call these "algorithm implementations" instead of "examples".
On the other extreme, we have type 2 examples (example) which are bare metal. I find these more useful than type 1, but they do not demonstrate all the cool things in TorchRL. For example, we explicitly compute the loss instead of drawing from your large set of loss function implementations!
Solution
TL;DR:
- Env normalization and configuration code should be removed from basic examples.
- Rename torchrl/examples to algorithm_implementations.
- Add a third class of examples that utilize TorchRL methods while still giving the user loop control.

I think that some of the researchers using TorchRL would benefit from a third category of examples, somewhere in between the first and second categories. I think these examples should make use of all the useful functions of TorchRL while still leaving the train loop up to the user. I personally would also prefer if these called constructors instead of using helper functions (e.g. TensorDictModule vs make_dqn_model). That way, we can inherit to implement our own custom model/loss/buffer/etc. (e.g. the sketch at the end of this issue).

Checklist
cc @matteobettini as I think he has some thoughts too.
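To make the constructor-vs-helper point concrete, here is a hypothetical sketch (make_dqn_model is an imagined helper used only for contrast; the network sizes and keys are arbitrary, and the TensorDictModule import assumes the tensordict package):

```python
import torch.nn as nn
from tensordict.nn import TensorDictModule

# preferred: call the constructor directly, so the network is visible and easy to subclass or swap
qnet = TensorDictModule(
    nn.Sequential(nn.LazyLinear(64), nn.Tanh(), nn.LazyLinear(2)),
    in_keys=["observation"],
    out_keys=["action_value"],
)

# less preferred: an opaque helper that hides the construction logic
# qnet = make_dqn_model(env_name="CartPole-v1")  # hypothetical helper, shown for contrast only
```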