[RL-baseline] Model v5, experiment #4 #47

ziritrion · 2021-04-05T08:07:25Z

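For this experiment, a brand-new action set with extreme granularity was chosen: 81 discrete actions, each given as a [steering, throttle, brake] triple (negative steering turns left, positive steering turns right):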
```python
[0.0, 0.0, 0.0], # no action
[0.0, 0.9, 0.0], # throttle high
[0.0, 0.6, 0.0], # throttle medium-high
[0.0, 0.4, 0.0], # throttle medium-low
[0.0, 0.2, 0.0], # throttle low
[0.0, 0.0, 0.9], # brake high
[0.0, 0.0, 0.6], # brake medium-high
[0.0, 0.0, 0.4], # brake medium-low
[0.0, 0.0, 0.2], # brake low
[-0.9, 0.9, 0.0], # left high, throttle high
[-0.9, 0.6, 0.0], # left high, throttle medium-high
[-0.9, 0.4, 0.0], # left high, throttle medium-low
[-0.9, 0.2, 0.0], # left high, throttle low
[-0.9, 0.0, 0.9], # left high, brake high
[-0.9, 0.0, 0.6], # left high, brake medium-high
[-0.9, 0.0, 0.4], # left high, brake medium-low
[-0.9, 0.0, 0.2], # left high, brake low
[-0.9, 0.0, 0.0], # left high, no throttle
[-0.6, 0.9, 0.0], # left medium-high, throttle high
[-0.6, 0.6, 0.0], # left medium-high, throttle medium-high
[-0.6, 0.4, 0.0], # left medium-high, throttle medium-low
[-0.6, 0.2, 0.0], # left medium-high, throttle low
[-0.6, 0.0, 0.9], # left medium-high, brake high
[-0.6, 0.0, 0.6], # left medium-high, brake medium-high
[-0.6, 0.0, 0.4], # left medium-high, brake medium-low
[-0.6, 0.0, 0.2], # left medium-high, brake low
[-0.6, 0.0, 0.0], # left medium-high, no throttle
[-0.4, 0.9, 0.0], # left medium-low, throttle high
[-0.4, 0.6, 0.0], # left medium-low, throttle medium-high
[-0.4, 0.4, 0.0], # left medium-low, throttle medium-low
[-0.4, 0.2, 0.0], # left medium-low, throttle low
[-0.4, 0.0, 0.9], # left medium-low, brake high
[-0.4, 0.0, 0.6], # left medium-low, brake medium-high
[-0.4, 0.0, 0.4], # left medium-low, brake medium-low
[-0.4, 0.0, 0.2], # left medium-low, brake low
[-0.4, 0.0, 0.0], # left medium-low, no throttle
[-0.2, 0.9, 0.0], # left low, throttle high
[-0.2, 0.6, 0.0], # left low, throttle medium-high
[-0.2, 0.4, 0.0], # left low, throttle medium-low
[-0.2, 0.2, 0.0], # left low, throttle low
[-0.2, 0.0, 0.9], # left low, brake high
[-0.2, 0.0, 0.6], # left low, brake medium-high
[-0.2, 0.0, 0.4], # left low, brake medium-low
[-0.2, 0.0, 0.2], # left low, brake low
[-0.2, 0.0, 0.0], # left low, no throttle
[0.9, 0.9, 0.0], # right high, throttle high
[0.9, 0.6, 0.0], # right high, throttle medium-high
[0.9, 0.4, 0.0], # right high, throttle medium-low
[0.9, 0.2, 0.0], # right high, throttle low
[0.9, 0.0, 0.9], # right high, brake high
[0.9, 0.0, 0.6], # right high, brake medium-high
[0.9, 0.0, 0.4], # right high, brake medium-low
[0.9, 0.0, 0.2], # right high, brake low
[0.9, 0.0, 0.0], # right high, no throttle
[0.6, 0.9, 0.0], # right medium-high, throttle high
[0.6, 0.6, 0.0], # right medium-high, throttle medium-high
[0.6, 0.4, 0.0], # right medium-high, throttle medium-low
[0.6, 0.2, 0.0], # right medium-high, throttle low
[0.6, 0.0, 0.9], # right medium-high, brake high
[0.6, 0.0, 0.6], # right medium-high, brake medium-high
[0.6, 0.0, 0.4], # right medium-high, brake medium-low
[0.6, 0.0, 0.2], # right medium-high, brake low
[0.6, 0.0, 0.0], # right medium-high, no throttle
[0.4, 0.9, 0.0], # right medium-low, throttle high
[0.4, 0.6, 0.0], # right medium-low, throttle medium-high
[0.4, 0.4, 0.0], # right medium-low, throttle medium-low
[0.4, 0.2, 0.0], # right medium-low, throttle low
[0.4, 0.0, 0.9], # right medium-low, brake high
[0.4, 0.0, 0.6], # right medium-low, brake medium-high
[0.4, 0.0, 0.4], # right medium-low, brake medium-low
[0.4, 0.0, 0.2], # right medium-low, brake low
[0.4, 0.0, 0.0], # right medium-low, no throttle
[0.2, 0.9, 0.0], # right low, throttle high
[0.2, 0.6, 0.0], # right low, throttle medium-high
[0.2, 0.4, 0.0], # right low, throttle medium-low
[0.2, 0.2, 0.0], # right low, throttle low
[0.2, 0.0, 0.9], # right low, brake high
[0.2, 0.0, 0.6], # right low, brake medium-high
[0.2, 0.0, 0.4], # right low, brake medium-low
[0.2, 0.0, 0.2], # right low, brake low
[0.2, 0.0, 0.0], # right low, no throttle
```
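The list follows a regular pattern (one idle action, four throttle-only and four brake-only levels, then every combination of eight steering levels with those pedal options plus coasting), so it can be generated instead of typed out. The sketch below is a hypothetical helper, `build_action_set`, that is not part of the project code; it assumes the [steering, throttle, brake] layout implied by the comments and reproduces the same 81 triples in the same order.

```python
# Magnitude levels used in the action set: high, medium-high, medium-low, low.
LEVELS = [0.9, 0.6, 0.4, 0.2]


def build_action_set():
    """Hypothetical helper: rebuild the 81 [steering, throttle, brake] triples
    in the same order as the hand-written list."""
    # Pedal options in list order: throttle levels first, then brake levels.
    pedals = [[0.0, t, 0.0] for t in LEVELS] + [[0.0, 0.0, b] for b in LEVELS]
    # Index 0 is "no action"; indices 1-8 are straight-line throttle/brake.
    actions = [[0.0, 0.0, 0.0]] + [list(p) for p in pedals]
    # Left steering (negative) first, then right steering (positive).
    for steer in [-s for s in LEVELS] + LEVELS:
        # Each steering level gets every pedal option, then coasting (no pedal).
        for _, throttle, brake in pedals + [[0.0, 0.0, 0.0]]:
            actions.append([steer, throttle, brake])
    return actions


ACTIONS = build_action_set()
print(len(ACTIONS))  # 81
```

With a discrete action set like this, the policy network would presumably output a categorical distribution over the 81 indices, and the chosen index `i` would be mapped to `ACTIONS[i]` before being passed to the environment.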

The maximum Running Reward was 448, reached around the 3.5k episode mark. For most of the experiment, however, the Running Reward was negative, and the run ended with a very sharp drop to a final value of -13, even though it had been above 200 fewer than 50 episodes earlier.
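The post does not define Running Reward. A common convention in REINFORCE training scripts is an exponential moving average of episode returns; assuming that convention and a smoothing factor of 0.05 (both assumptions, not confirmed by this experiment's code), the sketch below shows why the value can fall from above 200 to a negative number within 50 episodes.

```python
def update_running_reward(running, episode_return, alpha=0.05):
    """Exponential moving average of episode returns.
    alpha=0.05 is an assumed smoothing factor, not taken from the experiment."""
    return alpha * episode_return + (1 - alpha) * running


def weight_of_old_value(n_episodes, alpha=0.05):
    """Fraction of the previous running reward that survives n updates."""
    return (1 - alpha) ** n_episodes


print(round(weight_of_old_value(50), 3))  # 0.077
```

Under these assumptions, only about 8% of a running reward of 200 remains after 50 updates, so a short streak of bad episodes (for example, crashes after a high-throttle action) is enough to drag the average below zero.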

Additional training would likely improve the results further. However, given the success we've had fine-tuning the action set in the REINFORCE experiments, we believe the only way to noticeably improve on what we have achieved so far with REINFORCE with Baseline is to restrict the action set so that the network cannot choose actions with catastrophic consequences, which essentially means driving slowly.
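One simple way to forbid the riskiest actions is to filter the action set before training. The sketch below is hypothetical: `restrict_actions` is an illustrative helper, and the throttle cap of 0.4 is an assumed value, not one used in these experiments. It keeps only actions whose throttle does not exceed the cap, which is one reading of "driving slowly".

```python
def restrict_actions(actions, max_throttle=0.4):
    """Hypothetical filter: keep [steering, throttle, brake] actions whose
    throttle is at most max_throttle (the 0.4 default is an assumption)."""
    return [list(a) for a in actions if a[1] <= max_throttle]


# A few entries from the 81-action set, for illustration.
sample = [
    [0.0, 0.9, 0.0],   # throttle high: removed
    [0.0, 0.2, 0.0],   # throttle low: kept
    [-0.9, 0.4, 0.0],  # left high, throttle medium-low: kept
    [0.6, 0.6, 0.0],   # right medium-high, throttle medium-high: removed
    [0.9, 0.0, 0.9],   # right high, brake high: kept
]
print(restrict_actions(sample))
# [[0.0, 0.2, 0.0], [-0.9, 0.4, 0.0], [0.9, 0.0, 0.9]]
```

Filtering the list (rather than masking logits at sampling time) also shrinks the policy's output layer, which keeps the network's choices within the safe set by construction.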

TensorBoard screenshots below:

Sample video below:
https://user-images.githubusercontent.com/1465235/113552629-aead0480-95f6-11eb-8e44-0c70b081d4c2.mp4

ziritrion added 2 commits April 3, 2021 17:13

Start of experiment 4, with action set #3

169796a

20k episodes, running reward 17

b5b2d52

ziritrion changed the base branch from main to RL-baseline-v5 April 5, 2021 08:09

ziritrion changed the title ~~Rl baseline v5 exp4~~ [RL-baseline] Model v5, experiment #4 Apr 5, 2021
