[1] Mastering Diverse Domains through World Models - 2023
D. Hafner, J. Pasukonis, J. Ba, T. Lillicrap
https://arxiv.org/pdf/2301.04104v1.pdf
[2] Mastering Atari with Discrete World Models - 2021
D. Hafner, T. Lillicrap, M. Norouzi, J. Ba
https://arxiv.org/pdf/2010.02193.pdf
2023/01/29:
- Implemented the symlog function and its inverse. Nice to do the math ourselves:
  # To get the symlog inverse, we solve the symlog equation for x:
  #     y = sign(x) * log(|x| + 1)
  # <=>  y = log( x + 1)    for x >= 0
  #     -y = log(-x + 1)    for x <  0
  # <=> exp(y)  =  x + 1    for x >= 0
  #     exp(-y) = -x + 1    for x <  0
  # <=> exp(y) - 1 = x      for x >= 0   (if x >= 0, then y >= 0 as well)
  #     -(exp(-y) - 1) = x  for x <  0   (if x <  0, then y <  0 as well)
  # <=> sign(y) * (exp(|y|) - 1) = x
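  In code, a minimal TF sketch of the pair (assuming TF 2.x; `inverse_symlog` is our own name):

    import tensorflow as tf

    def symlog(x):
        # y = sign(x) * log(|x| + 1)
        return tf.math.sign(x) * tf.math.log(tf.math.abs(x) + 1.0)

    def inverse_symlog(y):
        # x = sign(y) * (exp(|y|) - 1)
        return tf.math.sign(y) * (tf.math.exp(tf.math.abs(y)) - 1.0)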
- Started working on the Atari CNN encoder:
  Difficulty: Understanding appendix B (table of hyperparams): How do I build the
  size=XS CNN? It says: multiplier=24, strides=(2,2), stop (flatten) once the image
  has been reduced to 4x4; but which kernel should we use? If we use 4x4
  (input=64x64x3, strides=(2,2), filters=24,48,96), the final image size would be 6x6x96.
  Answer: Use kernel=3, stride=2, padding="same" (as indicated at a different location in the paper ;) ).
2023/01/30:
- Completed Atari encoder CNN.
  Found the answer to the above question in appendix C ("Summary of differences"):
  kernel=3 (not 4 as in Ha & Schmidhuber's world models) and padding="same" (not "valid").
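  A sketch of our reading of the XS spec (assuming TF 2.x; filters doubling from the
  multiplier per layer is our interpretation of appendix B):

    import tensorflow as tf
    from tensorflow.keras import layers

    # 64x64x3 -> 32x32x24 -> 16x16x48 -> 8x8x96 -> 4x4x192, then flatten.
    encoder = tf.keras.Sequential(
        [layers.InputLayer(input_shape=(64, 64, 3))]
        + [layers.Conv2D(24 * 2**i, kernel_size=3, strides=(2, 2),
                         padding="same", activation="silu")
           for i in range(4)]
        + [layers.Flatten()]
    )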
- Implemented the z-generating sub-network.
  Figured out the difficulty of sampling from a categorical distribution while
  keeping the result differentiable (DreamerV2 paper: Algorithm 1); see the sketch below.
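  A sketch of the straight-through trick from DreamerV2's Algorithm 1 (the function
  name is ours):

    import tensorflow as tf
    import tensorflow_probability as tfp

    def sample_z_straight_through(logits):
        # Draw a (non-differentiable) one-hot sample, then re-attach the
        # gradient of the probs (straight-through estimator).
        sample = tf.cast(
            tfp.distributions.OneHotCategorical(logits=logits).sample(),
            tf.float32)
        probs = tf.nn.softmax(logits)
        return sample + probs - tf.stop_gradient(probs)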
2023/01/31:
- Completed Atari decoder CNNTranspose.
- Difficulties: How to reverse a CNN stack. Use same padding, same kernel, same strides.
- Start from reshaped (4x4x[final num CNN filters]) output of a dense layer and feed
that into the first conv2dtranspose.
- The final outputs are the means of a 64x64x3 diag-Gaussian with std=1.0, from which
  we sample the (symlog'd) image. Then 1) inverse_symlog, 2) clamp between 0.0 and 255.0, 3) cast to uint8.
- Completed Sequence (GRU) model.
- Completed Dynamics Predictor, using z-generator and MLP.
- Implemented two-hot encoding.
  Question: What's the value range covered by the 255 buckets?
  Answer: -20 to +20 (since the range applies to the symlog'd values, -20 to 20
  covers almost any possible reward range).
  Difficulty: What's the most performant way to implement this?
  Answer: Using floor, ceil, and tf.scatter_nd(); see the sketch below.
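  A sketch of that implementation (function name and exact shapes are ours):

    import tensorflow as tf

    def two_hot(rewards, num_buckets=255, low=-20.0, high=20.0):
        # rewards: (B,) tensor of (symlog'd) scalars.
        B = tf.shape(rewards)[0]
        # Continuous bucket position in [0, num_buckets - 1].
        idx = (rewards - low) / (high - low) * (num_buckets - 1.0)
        idx = tf.clip_by_value(idx, 0.0, num_buckets - 1.0)
        k = tf.math.floor(idx)   # lower bucket index
        kp1 = tf.math.ceil(idx)  # upper bucket index
        w_kp1 = idx - k          # weight on the upper bucket
        w_k = 1.0 - w_kp1        # weight on the lower bucket
        rows = tf.range(B)
        indices = tf.stack([
            tf.stack([rows, tf.cast(k, tf.int32)], axis=-1),
            tf.stack([rows, tf.cast(kp1, tf.int32)], axis=-1),
        ], axis=1)                                # (B, 2, 2)
        updates = tf.stack([w_k, w_kp1], axis=1)  # (B, 2)
        return tf.scatter_nd(indices, updates, tf.stack([B, num_buckets]))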
- Implemented Reward-predictor and continue-predictor.
2023/02/01
- Complete Atari World Model code (roughly, w/o actually running it).
- Complete World Model loss functions (L_pred, L_dyn, L_rep) (roughly, w/o
  actually running it).
- Lift existing EnvRunner from other project ("Attention is all you need") and adjust
such that:
- It can run on vectorized Atari envs using gymnasium.vector.make()
- It produces samples (obs, actions, rewards, trunc., term., valid-masks) of
shape (B=num-sub-envs, T=max-seq-len, ...)
    Note that T does NOT go beyond episode boundaries. If an episode ends within
    a current T-rollout, the remaining max-seq-len - len timesteps are filled with
    zero values. The returned mask indicates which exact B,T positions are valid
    and which aren't, and can thus be used to properly mask the total loss
    computation (see the sketch at the end of this entry).
  - "Next-obs" are NOT provided for the full (B, T) shape, to save space. They are only
    needed when the rollout is truncated (at max-seq-len) or the episode is truncated
    in the middle of the rollout (before T is reached). Thus, next_obs are returned
    as (B, dim), NOT as (B, T, dim).
- Returned samples could be inserted as is (after splitting along the B-axis)
into a Dreamer-style replay buffer with max-seq-len being the "batch len"
described in the paper.
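  For the masking, a minimal sketch of how the valid-mask enters the total loss
  (names and shapes are ours):

    import tensorflow as tf

    def masked_total_loss(per_timestep_loss, mask):
        # per_timestep_loss: (B, T); mask: (B, T), 1.0 = valid, 0.0 = zero-padding.
        # Zero out padded positions, then average over valid positions only.
        return tf.reduce_sum(per_timestep_loss * mask) / tf.reduce_sum(mask)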
2023/02/02
- Finished fixing bugs in EnvRunner; wrote a small test script for collecting
vectorized Pacman samples.
- Wrote a quick non-RLlib training script for an AtariWorldModel using the above simple
  EnvRunner; this does not include a buffer yet, but has an Atari world model and an
  optimizer, calls the loss functions, and updates the model.
- Started debugging World Model training code by stepping through the above example
script and weeding out errors in the code. For example:
- output of decoder is NOT a Normal, but a MultivariateNormalDiag (each pixel
is an independent variable).
- weights of reward-bucket dense layer(s) must be zero-initialized according to the
paper.
- Got the entire sampling + loss + updating loop to run for the world model.
- TODO: Continue debugging and successfully train.
Open questions:
- When symlog'ing Atari image observations, do we simply pass the 0-255 uint8
  values into the symlog function? Or do we - even though we have symlog - still have
  to normalize the observations?
- For the binned reward distributions, which tf distribution should we take?
- Should sampling always return exact bin values (or also values in between,
like in a truly continuous distr.)?
- How to compute log-probs from a given (possibly not-exactly-on-the-bin-value
real reward)?
  Answer: Use FiniteDiscrete with `atol` == bucket_delta / 2.0; see the sketch below.
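  A sketch of that FiniteDiscrete usage (`reward_logits` and `symlog_rewards` are
  placeholder names):

    import numpy as np
    import tensorflow_probability as tfp

    # 255 bucket centers in symlog space, spanning -20 .. +20.
    outcomes = np.linspace(-20.0, 20.0, 255).astype(np.float32)
    bucket_delta = outcomes[1] - outcomes[0]
    # reward_logits: (B, 255) outputs of the reward predictor.
    dist = tfp.distributions.FiniteDiscrete(
        outcomes, logits=reward_logits, atol=bucket_delta / 2.0)
    # log_prob() snaps a real-valued (symlog'd) reward to the nearest bucket.
    logp = dist.log_prob(symlog_rewards)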
2023/02/03
- Answered question on how to compute the KL-divergence for two different
z-distributions. Each z-distribution is a 32xCategorical (over 32 classes each)
distribution. We could use the tfp Independent distribution for that with
args:
`distribution=Categorical(probs=[B, num_categoricals, num_classes])`
`reinterpreted_batch_ndims=1` <- dim 1 above are the independent categoricals
  The KL is computed as the sum(!) (not the mean) of the individual distributions'
  KLs; see the sketch below.
  Note: We do the same in RLlib for our multi-action distribution.
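  A sketch (`post_probs`/`prior_probs` are placeholders of shape (B, 32, 32)):

    import tensorflow_probability as tfp
    tfd = tfp.distributions

    def z_distribution(probs):
        # 32 independent categoricals (dim 1), each over 32 classes (dim 2).
        return tfd.Independent(tfd.Categorical(probs=probs),
                               reinterpreted_batch_ndims=1)

    # KL sums over the 32 independent categoricals -> one value per batch item.
    kl = tfd.kl_divergence(z_distribution(post_probs),
                           z_distribution(prior_probs))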
- Wrote a 30-lines-only simple FIFO replay buffer.
- Got example to run and learn a little (commit 1cf47fecd32a3c2ca1cf7aa757465f3856b26710).
- Have to fix the training ratio (1024) and the model size (S instead of currently XS).
- Implemented `model_dimension` parameter for all components, such that the world-model
can be constructed with a simple scaling parameter: "XS", "S", ... (see [1] appendix B)
- Implemented unimix categoricals for the world model's representation layer and the
  dynamics predictor; see the sketch below.
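  A sketch of the unimix trick (1% uniform mixed into the softmax probs; the
  function name is ours):

    import tensorflow as tf

    def unimix_probs(logits, unimix=0.01):
        probs = tf.nn.softmax(logits)
        num_classes = tf.cast(tf.shape(logits)[-1], tf.float32)
        # Mix a little uniform in so no class ever has exactly 0 probability.
        return (1.0 - unimix) * probs + unimix / num_classes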
- Open questions:
According to the paper: "DreamerV2 used a replay buffer that only replays time
steps from completed episodes. To shorten the feedback loop, DreamerV3 uniformly
samples from all inserted subsequences of size batch length regardless of
  episode boundaries." -> We probably need to fix our env-runner to return (non-masked)
  sequences, which may cross episode boundaries. This would solve the problem of some
  batches having smaller losses due to masking some of the timesteps, but it introduces the
  new problem of having to compute GAE (for the critic/actor loss) within the same sequence.
- According to eq. 5 in [1], the prediction loss uses only neg-log(p)-style loss terms.
  However, Appendix C also says: "Symlog predictions: We symlog encode inputs to
  the world model and use symlog predictions with **squared error** for reconstructing
  inputs ...". This seems contradictory. (Then again, a squared error on symlog values
  equals the neg-log-prob of a unit-variance Gaussian up to scaling and an additive
  constant, so the two statements may well be consistent.)
- How do we store h(0) in the buffer? We need this to start the z-computations (and subsequent
h-computations).
a) Store the initial_h along with the trajectories
(FIXED: we currently store the last(!) h instead and use that as the initial one for the
same sequence, which is completely wrong).
b) Update the initial_h of sequence B inside the buffer, iff we just made a training
pass through sequence A, whose end is the beginning of sequence B.
c) Don't store anything and always start with zero-vector.
2023/02/04-05
- Run different random-actions experiments (w/o Actor/Critic) with MsPacman to
stabilize the world model training run and fix various bugs.
- Try warmup for replay buffer of 10k random action timesteps.
- Make sure inverse_symlog is properly applied to predictions prior to TB'ing them
- Be careful when casting from float32 (model output) to uint8 (RGB image) due to the
  overflow/wrap-around problem (e.g. a predicted 258.0 wraps to uint8=2, a predicted -1.0
  to uint8=255); see the clip-before-cast sketch below.
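  In code, the fix is to clip before casting (a sketch; `predictions` stands in for the
  float32, already inverse-symlog'd model outputs):

    import tensorflow as tf

    img = tf.cast(tf.clip_by_value(predictions, 0.0, 255.0), tf.uint8)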
- Added layernorm after initial dense layer before the decoder stack.
2023/02/06
- Started coding Actor and Critic (same as Reward predictor).
- Adding a LayerNorm after the GRU layer seems to be super important.
  However, changing the default GRU activations from tanh/sigmoid to silu/silu (as hinted
  in the paper for all other layers) seems to deteriorate performance.
  -> We keep the layer-norm but stick to tanh/sigmoid activations, just for the GRU layer
  (all other layers (except the distribution-parameter-producing ones) do use SiLU).
- TODO: Need to fix the reward/return loss to use two-hot instead of plain -log(p).
2023/02/07
- Fix two-hot labels/losses for reward- and value predictors.
2023/02/08
- Finish dream_trajectory method to produce dreamed trajectories given some pre-obs to burn in.
- Noticed that the current world model seems to have trouble predicting the dynamics properly.
  - Possible bug in the h-generating GRU net OR in the "z-dreaming" dynamics net.
- Solutions:
- Need to downsize experiment to CartPole for faster turnaround.
2023/02/09
- Downsized experiment to CartPole for faster experiment turnaround times.
- WIP: Fix buffer to store continuous timesteps, even across episode boundaries as
described in the paper. The `continue` flag in the value function loss should
take care of these "crossings" properly. This will enable us to scrap the
zero-masking that we currently still have in the EnvRunner and loss functions.
- TODO: Fill new buffer with single CartPole episode, then debug world-model
dynamics learning. Should have no problem memorizing the episode and "re-dreaming" it.
  Add a quick euclidean-distance TB summary after each training update to compare
  actually sampled observations vs dreamed ones (do the same for rewards). Note that
  "dreamed" is NOT the same as "predicted", b/c dreamed values are produced using the
  dynamics network's z-states, whereas predictions are made using the encoder-
  produced z-states.
2023/02/10
- Found some new information in the papers [1] and [2]:
- Use sticky actions (action_repeat_prob=0.25), frameskip=4, no framestacking (b/c GRU),
  grayscale(!), no Atari lives information, max-time-limit=108000, and max-pooling of
  pixel values over the last two (skipped) frames.
- [2] talks about an additional MLP right behind the encoder (instead of only a single
z-producing layer). Added this additional MLP to the world-model. So the new cascade is:
image -> encoder -> [some "image embedding"] -> [embedding] cat [h] -> [new MLP] -> [z-producing layer]
- Finished Buffer for storing continuous episodes.
- Enhanced EnvRunner to produce continuous-episode samples, where no masks are needed anymore.
  - This introduces a new problem for truncated episodes (see "OPEN PROBLEMS" below).
    Truncated timesteps should ideally always be sampled from the buffer at the end of a
    returned batch_length_T chunk (as described in [2]).
- Important bug fix: the continue predictor applied sigmoid twice: it first sigmoided the
  net output, then passed the resulting probs as **`logits`** (not `probs`) into the
  Bernoulli, causing yet another sigmoid (see the sketch below).
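  A minimal sketch of the fix (`continue_net_out` is a placeholder for the raw,
  un-activated net output):

    import tensorflow_probability as tfp

    # Correct: pass raw outputs as logits; Bernoulli applies the sigmoid itself.
    continue_dist = tfp.distributions.Bernoulli(logits=continue_net_out)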
- Added an evaluation step after every n training updates that dreams trajectories
  initialized by the first N timesteps of the samples from the buffer, then compares
  the dreamed results with the actual samples (using the same sampled actions
  in the dream, not the actor net).
- Ran simplified experiment with CartPole, using a single 64ts sample in the buffer and
repeatedly sampling that sample and learning (memorizing) it. Seems to learn how to dream the single trajectory well.
- Ran regular CartPole experiment with normal sampling enabled. Seems to work similarly well.
2023/02/11-2023/02/14
- Refactoring example code and utilities (e.g. TB summary utility functions).
- Make sure all tf.summary calls are outside of tf.functions; otherwise, `step`
  is frozen at 0.
- Deduplicate complex computed vs sampled obs/images code so it can be used
by different evaluation procedures (world model's train_forward produces
posterior computed obs that can be compared to sampled obs; critic/actor
require prior (dreamed) computed obs that can be compared to sampled obs).
- Implemented critic and actor networks.
- Wrote critic loss:
Difficulties:
  - Figuring out how to do the EMA-weights network:
    - Create a copy of the ("fast") critic; call it the "EMA-net".
    - Update the EMA-net after each critic update, such that the EMA-net is a running
      EMA of the fast critic's weights.
    - Use the fast critic for the value target computations (not the EMA-net!).
    - Add an additional regularizer term to the critic loss that makes sure that
      value predictions produced by the fast critic stay close to those made by the
      EMA-net.
  - Missing from the paper: How is this extra regularizer term computed (squared error)?
    Does it have a weight coefficient? (A sketch of the EMA update below.)
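  A sketch of the EMA update (the decay value is our assumption, not from the paper):

    import tensorflow as tf

    def update_ema_net(critic, ema_net, decay=0.98):  # decay: our assumption
        # ema_w <- decay * ema_w + (1 - decay) * fast_w, after each critic update.
        for fast_w, ema_w in zip(critic.trainable_variables,
                                 ema_net.trainable_variables):
            ema_w.assign(decay * ema_w + (1.0 - decay) * fast_w)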
- Wrote actor loss:
Difficulties:
  - Figure out whether the "sg(R) / S" in the equation should actually be
    "sg(R / S)" <- I think the latter makes more sense, as it resembles REINFORCE AND
    would NOT require backprop through the dynamics.
  - So for discrete actions, use a simple REINFORCE loss with the R/S term as weights
    (instead of REINFORCE's usual R - V); see the sketch at the end of this entry.
- For continuous actions, use "straight through" gradients through the dynamics
of the dreamed trajectory, including re-parameterization of the action sampling step
such that it can be backprop'd. Then only update the policy's weights, NOT the world-model
ones. DreamerV2 ([2]) describes this in more detail.
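  A sketch of the discrete-action loss under our sg(R / S) reading (all names are ours;
  R = dreamed returns, S = the paper's return-scale):

    import tensorflow as tf

    def discrete_actor_loss(action_logp, returns, scale):
        # REINFORCE with sg(R / S) as weights -> no backprop through dynamics.
        weights = tf.stop_gradient(returns / scale)
        return -tf.reduce_mean(action_logp * weights)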
2023/02/15:
- Switch over to using actor-produced actions, instead of random ones everywhere.
- Ran CartPole experiments. Learns a decent policy after about 30-40 minutes on my
Win GPU.
- This seems plausible as the world model needs to be properly trained first to really
improve the policy.
- Added "unimix" for actor network (for discrete actions Categorical).
- Added a simple evaluator to the example scripts that runs n episodes after each
main iteration.
2023/02/16:
- slides for presentation
2023/02/17:
- Debug tools for CartPole: Display image in TB that shows dreamed CartPole
observations (rendered) + predicted rewards/actions/continues/vf-estimates and vf
targets. Noticed that our critic loss has a flaw.
- Big critic bug fix: Use twohot * logp(.|s) instead of plain probs. Now the critic
  loss scales much better and CartPole actually learns fast (see the sketch below).
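  A sketch of the fixed loss (a cross-entropy between the two-hot targets and the
  critic's log-softmax; names are ours):

    import tensorflow as tf

    def twohot_critic_loss(value_logits, twohot_targets):
        # value_logits: (B, 255); twohot_targets: (B, 255) two-hot'd value targets.
        logp = tf.nn.log_softmax(value_logits)
        return -tf.reduce_mean(tf.reduce_sum(twohot_targets * logp, axis=-1))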
2023/02/18:
- Running Pong experiment again on cluster.
- Bug fix: value targets used to be computed within symlog space (using symlog'd
  rewards and (symlog) critic outputs). However, this is mathematically incorrect,
  b/c symlog(a+b) [<- target] != symlog(a) [<- reward] + symlog(b) [<- value estimate at t+1].
  Now we compute the targets in real space and then symlog the entire targets again
  when computing the critic loss (see the sketch below).
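  A sketch, reusing the symlog/inverse_symlog and two_hot helpers from above
  (`continues` = 1.0 - terminated flags; gamma value is our assumption; shown as a
  1-step bootstrap for brevity, the actual targets are lambda-returns):

    def critic_targets(rewards, continues, symlog_value_preds_tp1, gamma=0.997):
        # Bootstrapped targets, computed in REAL (non-symlog) space first ...
        targets = rewards + gamma * continues * inverse_symlog(symlog_value_preds_tp1)
        # ... then symlog'd (and two-hot'd) only for the critic-loss computation.
        return two_hot(symlog(targets))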
- Because of the above fix, CartPole learns up to 500 episode return now, albeit
still a little slow. (?)
2023/02/20:
- Wrote new replay buffer and env-runner to account for the fact that
we would like to sample B * T style batches that include episode boundaries.
TODO
2023/03/25
- Test running Pong again (after noticing it's not learning on latest changes)
with f5de6bba3d34b8401e71882979c66d88ee189311 commit (after CartPole fix for the
cont. actions release).
- If it works, we need to compare the changes between f5de6bba3d34b8401e71882979c66d88ee189311
  and the latest (not-learning) commit.
OPEN PROBLEMS:
- [SOLVED]
Solution:
  We should look into shifting the env-runner sequence by one (see Slack discussion):
  reset -> r=0.0, c=False, obs=init_obs, a=first action.
  Also: V(t) = r(t) + gamma * V(t+1) (instead of V(t) = r(t+1) + gamma * V(t+1)), i.e.
  reinterpret the state-value as the value for reaching the state (plus future) instead of
  the value for already being in the state.
- [SOLVED]
Solution: Use mode(), not sample().
  For dreaming: Should we sample from the Bernoulli distribution (for `continue` flags) or act greedily?
  We already use the deterministic mean for the dreamed rewards; we should probably do the same for `continues`.
- [SOLVED]
  Solution: Treat the last timestep i as a fully qualified one within the sampled trajectory; r(i) is the final reward, obs(i) is the final obs, continue(i)=False.
  Now that we are using the continuous-episode buffer, if an episode is truncated AND the truncated
  ts is in the middle of some batch_length_T sized chunk, we would NOT have the next_obs information
  that we need to compute a proper value at the truncation. Instead, the obs(t+1) would already
  be the first observation of the next episode. To fix this, our buffer could simply
  auto-shift the chunk leftwards until the truncated timestep is exactly at the very end of the returned
  trajectory. Then the `next_obs` field can be used for vf estimation. This is similarly described
  already in [2], albeit for a different reason: "To observe enough episode ends during training, we sample
  the start index of each training sequence uniformly within the episode and then clip it to not exceed the
  episode length minus the training sequence length."
- [SOLVED]
Solution: Do not layer norm GRU output.
  We are currently NOT LayerNorm'ing the GRU output (the next h-state). Adding this used to provide
  a big L_total reduction (in Atari); however, wouldn't this mess with the GRU learning how
  to manipulate its internal state vector properly?
  Instead: Only layer-norm the input to the next NN (after the GRU), e.g. reward predictor, actor, vf, etc.,
  BUT leave the actual h-states as-is (as the next GRU h-inputs).
  [2] states (w/o specifying where exactly the LayerNorm would go:
  before the GRU? -> makes no sense, as it would normalize the one-hot z- and a-vectors;
  after the GRU? -> either on top of the h-states, OR, as suggested above, h-states untouched,
  but everything that uses h-states for further predictions gets normalized h-states):
  "Figure J.1: Comparison of DreamerV2 to a version without layer norm in the GRU ....
  We find that the benefit of layer norm depends on the task at hand, increasing
  and decreasing performance on a roughly equal number of tasks"
- [SOLVED] Figure out the reason for the weird actor loss in the paper.
Solution: See our implementation, which now matches Danijar's.
- [DONE] Implement unimix categoricals for actor network (for discrete actions?)
- Done already for the z-generator.
- Now also done for the actor network (2023/02/15).
- [DONE] We are NOT using an extra MLP after the encoder stack (just a single dense RepresentationLayer to
translate from flatten(4x4xC) -> (32x32)z OR from [flatten(4x4xC)+h_state] -> (32x32)z).
In [2] it says: "Neural networks: The representation model is implemented as a Convolutional Neural Network
(CNN; LeCun et al., 1989) followed by a Multi-Layer Perceptron (MLP) that receives the image
embedding and the deterministic recurrent state"
- [DONE: information found in [2]] For now, I have been training deterministic Atari only:
no sticky actions (action_repeat_probability=0.0), and frameskip=4
- [DONE] We are not using `two_hot` at all yet, not even for the reward predictions - how come?
  - As soon as I switch on two-hot for reward predictions, the total loss seems to increase
    and not go down that much over time anymore (compared to when we use a simple neg-log-p
    loss for rewards). Must figure out why this is!
Solution: We are using it now everywhere and there is no big difference between two-hot
losses and their simpler counterparts: -logp([single reward]).