[RL-baseline] Model v5, experiment #4 #47
For this experiment, we chose a brand-new, very fine-grained move set of 81 actions. Judging by the comments, each action is a [steering, throttle, brake] triple:
[0.0, 0.0, 0.0], # no action
[0.0, 0.9, 0.0], # throttle high
[0.0, 0.6, 0.0], # throttle medium-high
[0.0, 0.4, 0.0], # throttle medium-low
[0.0, 0.2, 0.0], # throttle low
[0.0, 0.0, 0.9], # brake high
[0.0, 0.0, 0.6], # brake medium-high
[0.0, 0.0, 0.4], # brake medium-low
[0.0, 0.0, 0.2], # brake low
[-0.9, 0.9, 0.0], # left high, throttle high
[-0.9, 0.6, 0.0], # left high, throttle medium-high
[-0.9, 0.4, 0.0], # left high, throttle medium-low
[-0.9, 0.2, 0.0], # left high, throttle low
[-0.9, 0.0, 0.9], # left high, brake high
[-0.9, 0.0, 0.6], # left high, brake medium-high
[-0.9, 0.0, 0.4], # left high, brake medium-low
[-0.9, 0.0, 0.2], # left high, brake low
[-0.9, 0.0, 0.0], # left high, no throttle
[-0.6, 0.9, 0.0], # left medium-high, throttle high
[-0.6, 0.6, 0.0], # left medium-high, throttle medium-high
[-0.6, 0.4, 0.0], # left medium-high, throttle medium-low
[-0.6, 0.2, 0.0], # left medium-high, throttle low
[-0.6, 0.0, 0.9], # left medium-high, brake high
[-0.6, 0.0, 0.6], # left medium-high, brake medium-high
[-0.6, 0.0, 0.4], # left medium-high, brake medium-low
[-0.6, 0.0, 0.2], # left medium-high, brake low
[-0.6, 0.0, 0.0], # left medium-high, no throttle
[-0.4, 0.9, 0.0], # left medium-low, throttle high
[-0.4, 0.6, 0.0], # left medium-low, throttle medium-high
[-0.4, 0.4, 0.0], # left medium-low, throttle medium-low
[-0.4, 0.2, 0.0], # left medium-low, throttle low
[-0.4, 0.0, 0.9], # left medium-low, brake high
[-0.4, 0.0, 0.6], # left medium-low, brake medium-high
[-0.4, 0.0, 0.4], # left medium-low, brake medium-low
[-0.4, 0.0, 0.2], # left medium-low, brake low
[-0.4, 0.0, 0.0], # left medium-low, no throttle
[-0.2, 0.9, 0.0], # left low, throttle high
[-0.2, 0.6, 0.0], # left low, throttle medium-high
[-0.2, 0.4, 0.0], # left low, throttle medium-low
[-0.2, 0.2, 0.0], # left low, throttle low
[-0.2, 0.0, 0.9], # left low, brake high
[-0.2, 0.0, 0.6], # left low, brake medium-high
[-0.2, 0.0, 0.4], # left low, brake medium-low
[-0.2, 0.0, 0.2], # left low, brake low
[-0.2, 0.0, 0.0], # left low, no throttle
[0.9, 0.9, 0.0], # right high, throttle high
[0.9, 0.6, 0.0], # right high, throttle medium-high
[0.9, 0.4, 0.0], # right high, throttle medium-low
[0.9, 0.2, 0.0], # right high, throttle low
[0.9, 0.0, 0.9], # right high, brake high
[0.9, 0.0, 0.6], # right high, brake medium-high
[0.9, 0.0, 0.4], # right high, brake medium-low
[0.9, 0.0, 0.2], # right high, brake low
[0.9, 0.0, 0.0], # right high, no throttle
[0.6, 0.9, 0.0], # right medium-high, throttle high
[0.6, 0.6, 0.0], # right medium-high, throttle medium-high
[0.6, 0.4, 0.0], # right medium-high, throttle medium-low
[0.6, 0.2, 0.0], # right medium-high, throttle low
[0.6, 0.0, 0.9], # right medium-high, brake high
[0.6, 0.0, 0.6], # right medium-high, brake medium-high
[0.6, 0.0, 0.4], # right medium-high, brake medium-low
[0.6, 0.0, 0.2], # right medium-high, brake low
[0.6, 0.0, 0.0], # right medium-high, no throttle
[0.4, 0.9, 0.0], # right medium-low, throttle high
[0.4, 0.6, 0.0], # right medium-low, throttle medium-high
[0.4, 0.4, 0.0], # right medium-low, throttle medium-low
[0.4, 0.2, 0.0], # right medium-low, throttle low
[0.4, 0.0, 0.9], # right medium-low, brake high
[0.4, 0.0, 0.6], # right medium-low, brake medium-high
[0.4, 0.0, 0.4], # right medium-low, brake medium-low
[0.4, 0.0, 0.2], # right medium-low, brake low
[0.4, 0.0, 0.0], # right medium-low, no throttle
[0.2, 0.9, 0.0], # right low, throttle high
[0.2, 0.6, 0.0], # right low, throttle medium-high
[0.2, 0.4, 0.0], # right low, throttle medium-low
[0.2, 0.2, 0.0], # right low, throttle low
[0.2, 0.0, 0.9], # right low, brake high
[0.2, 0.0, 0.6], # right low, brake medium-high
[0.2, 0.0, 0.4], # right low, brake medium-low
[0.2, 0.0, 0.2], # right low, brake low
[0.2, 0.0, 0.0], # right low, no throttle
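The list above follows a regular pattern, so it could also be generated rather than written out by hand. The following is a minimal sketch, not the code used in the experiment; the names `STEER_LEVELS`, `PEDAL_LEVELS`, `pedal_actions` and `build_action_set` are hypothetical, and it assumes the same ordering as the list above.

```python
# Hypothetical helper names; the values are taken from the action list above.
STEER_LEVELS = [0.9, 0.6, 0.4, 0.2]  # high, medium-high, medium-low, low
PEDAL_LEVELS = [0.9, 0.6, 0.4, 0.2]  # same four levels for throttle and brake


def pedal_actions(steer):
    """Throttle-only actions, then brake-only actions, for one steering value."""
    throttle = [[steer, level, 0.0] for level in PEDAL_LEVELS]
    brake = [[steer, 0.0, level] for level in PEDAL_LEVELS]
    return throttle + brake


def build_action_set():
    """Rebuild the 81-entry action list in the order shown above."""
    actions = [[0.0, 0.0, 0.0]]       # no action
    actions += pedal_actions(0.0)     # straight ahead: throttle, then brake
    for sign in (-1.0, 1.0):          # all left steering values, then all right
        for level in STEER_LEVELS:
            steer = sign * level
            actions += pedal_actions(steer)
            actions.append([steer, 0.0, 0.0])  # steering only, no throttle
    return actions


actions = build_action_set()
print(len(actions))
```

Generating the list this way makes it easier to change the number of levels between experiments without editing dozens of lines by hand.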
The maximum Running Reward was 448, reached around the 3.5k-episode mark. For most of the experiment, however, the Running Reward was negative, and the run ended on a very sharp drop: the final value was -13, even though it had been above 200 fewer than 50 episodes earlier.
Additional training would likely improve the results further. However, after the success we've had fine-tuning the action set in the REINFORCE experiments, we believe the only way to get noticeably better results from REINFORCE with Baseline is to limit the action set so that the network cannot choose actions with catastrophic consequences, which essentially means driving slowly.
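One possible way to limit the action set is to filter out high-throttle actions before training. This is only a sketch under assumptions: the helper name `limit_throttle` and the 0.4 cutoff are hypothetical and were not tested in this experiment, and the sample list is a small excerpt of the full move set.

```python
def limit_throttle(actions, max_throttle):
    """Keep only actions whose throttle (second component) is at most max_throttle."""
    return [a for a in actions if a[1] <= max_throttle]


# Small excerpt of the move set above, as [steering, throttle, brake].
sample = [
    [0.0, 0.0, 0.0],   # no action
    [0.0, 0.9, 0.0],   # throttle high
    [0.0, 0.4, 0.0],   # throttle medium-low
    [-0.9, 0.9, 0.0],  # left high, throttle high
    [-0.9, 0.2, 0.0],  # left high, throttle low
    [0.6, 0.0, 0.6],   # right medium-high, brake medium-high
]

# Hypothetical cutoff: drop every action with throttle above 0.4.
slow = limit_throttle(sample, max_throttle=0.4)
print(len(slow))
```

With a discrete policy, the network's output layer would need one entry per remaining action, so the filtered list would have to be fixed before the model is built.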
TensorBoard screenshots below:
Sample video below:
https://user-images.githubusercontent.com/1465235/113552629-aead0480-95f6-11eb-8e44-0c70b081d4c2.mp4