reinforce-learningrate #actions1,2,3,4 #45
Experiments manually modifying the actions of a trained REINFORCE agent to solve gym CarRacing.
We start from the following model:
Branch --> reinforce-learningrate
Set of actions:
available_actions = [
[0.0, 0.7, 0.0], # throttle
[0.0, 0.5, 0.0], # throttle
[0.0, 0.2, 0.0], # throttle
[0.0, 0.0, 0.7], # brake
[0.0, 0.0, 0.5], # brake
[0.0, 0.0, 0.2], # brake
[-0.8, 0.1, 0.0], # left
[-0.5, 0.1, 0.0], # left
[-0.2, 0.1, 0.0], # left
[0.8, 0.1, 0.0], # right
[0.5, 0.1, 0.0], # right
[0.2, 0.1, 0.0], # right
]
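For reference, a discrete REINFORCE policy consumes this table by producing a probability distribution over the 12 indices and passing the selected row to the environment. Below is a minimal sketch of that loop; the policy_probs placeholder and the old four-value gym step API are assumptions for illustration, not the repository's actual code.

```python
import gym
import numpy as np

# Same 12 [steering, gas, brake] triples as listed above.
available_actions = [
    [0.0, 0.7, 0.0], [0.0, 0.5, 0.0], [0.0, 0.2, 0.0],     # throttle
    [0.0, 0.0, 0.7], [0.0, 0.0, 0.5], [0.0, 0.0, 0.2],     # brake
    [-0.8, 0.1, 0.0], [-0.5, 0.1, 0.0], [-0.2, 0.1, 0.0],  # left
    [0.8, 0.1, 0.0], [0.5, 0.1, 0.0], [0.2, 0.1, 0.0],     # right
]

def policy_probs(observation):
    # Placeholder for the trained network's forward pass, which should
    # return a softmax distribution over the 12 actions. A uniform
    # distribution is used here only so the sketch runs end to end.
    return np.ones(len(available_actions)) / len(available_actions)

env = gym.make("CarRacing-v0")
obs = env.reset()
done, episode_reward = False, 0.0
while not done:
    idx = np.random.choice(len(available_actions), p=policy_probs(obs))
    obs, reward, done, info = env.step(np.array(available_actions[idx]))
    episode_reward += reward
print("episode reward:", episode_reward)
```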
(Figure: average reward during training.)
The model was stuck between average rewards of roughly 200 and 600. There were some high-average-reward moments, with the average reaching up to 800 between 46k and 55k episodes.
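The averages quoted here are running averages over recent episodes; as a point of reference, a common way to track such a metric is a fixed window over the latest episode returns (the 100-episode window below is an assumption, not necessarily what the training script uses).

```python
from collections import deque

reward_window = deque(maxlen=100)  # assumed window size
average_history = []

def record_episode(episode_reward):
    # Store the finished episode's return and log the running average
    # that would be plotted as the curves referred to in these experiments.
    reward_window.append(episode_reward)
    average_history.append(sum(reward_window) / len(reward_window))
    return average_history[-1]
```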
We analyzed what was happening by watching the actual video. The agent was not able to handle low-speed corners correctly; it was not even braking.
The car kept accelerating until the speed was too high to manage any corner.
openaigym.video.0.632780.video000000.mp4
We started an investigation using different sets of actions, but starting from the already trained network (the one in the plot above).
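In other words, only the action table changes from here on; the network weights are carried over from the previous run. As one possible shape for that, here is a minimal sketch of running a new action table through the frozen network, reusing the policy_probs placeholder from the sketch above (in the experiments below the network keeps learning as well; the greedy evaluation here is only an illustration, not the repository's actual code).

```python
import gym
import numpy as np

def evaluate(policy_probs, actions, episodes=10):
    """Run the trained policy with a given action table and return the
    average episode reward. `policy_probs` wraps the already-trained
    network; only `actions` differs between the -act1 ... -act4 branches."""
    env = gym.make("CarRacing-v0")
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            idx = int(np.argmax(policy_probs(obs)))  # greedy action choice
            obs, reward, done, info = env.step(np.array(actions[idx]))
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```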
Branch --> reinforce-learningrate-act1
We changed the set of actions, trying to reduce speed.
available_actions = [
[0.0, 0.2, 0.0], # throttle – lower acc (from 0.7 to 0.2)
[0.0, 0.1, 0.0], # throttle – lower acc (from 0.5 to 0.1)
[0.0, 0.0, 0.0], # no action
[0.0, 0.0, 0.7], # brake
[0.0, 0.0, 0.5], # brake
[0.0, 0.0, 0.2], # brake
[-1.0, 0.0, 0.0], # left – more steering angle (from -0.8 to -1) / and no throttle when turning
[-0.5, 0.0, 0.0], # left
[-0.2, 0.0, 0.0], # left
[1.0, 0.0, 0.0], # right – more steering angle (from 0.8 to 1) / and no throttle when turning
[0.5, 0.0, 0.0], # right
[0.2, 0.0, 0.0], # right
]
Results improved significantly.
It was now quite clear that the car was not losing the track as easily, since acceleration was reduced.
We had some good examples when we were lucky and the track was easy, with many straights and few sharp corners.
929_reward_act1_good_example.mp4
But there were also bad examples when the track had sharp corners. The car was still driving too fast to turn successfully.
554_reward_act1_bad_example.mp4
Branch --> reinforce-learningrate-act2
From this point, we thought about introducing some braking when turning, and adding a bit more acceleration to compensate for the braking at corners.
available_actions = [
[0.0, 0.3, 0.0], # throttle – higher acc (from 0.2 to 0.3)
[0.0, 0.1, 0.0], # throttle
[0.0, 0.0, 0.0], # no action
[0.0, 0.0, 0.7], # brake
[0.0, 0.0, 0.5], # brake
[0.0, 0.0, 0.2], # brake
[-1.0, 0.0, 0.2], # left – slight braking at corners (from 0.0 to 0.2)
[-0.5, 0.0, 0.2], # left – slight braking at corners (from 0.0 to 0.2)
[-0.2, 0.0, 0.2], # left – slight braking at corners (from 0.0 to 0.2)
[1.0, 0.0, 0.2], # right – slight braking at corners (from 0.0 to 0.2)
[0.5, 0.0, 0.2], # right – slight braking at corners (from 0.0 to 0.2)
[0.2, 0.0, 0.2], # right – slight braking at corners (from 0.0 to 0.2)
]
Here we were obviously too conservative with the braking, but we thought it might be a good working path to prevent the car from accelerating too much and crashing on sharp corners.
openaigym.video.0.610454.video000000.mp4
Branch --> reinforce-learningrate-act3
We adjusted the corner braking, and the results were as follows:
available_actions = [
[0.0, 0.3, 0.0], # throttle
[0.0, 0.1, 0.0], # throttle
[0.0, 0.0, 0.0], # throttle
[0.0, 0.0, 0.7], # brake
[0.0, 0.0, 0.5], # brake
[0.0, 0.0, 0.2], # brake
[-1.0, 0.0, 0.05], # left – slight braking at corners (from 0.2 to 0.05)
[-0.5, 0.0, 0.05], # left – slight braking at corners (from 0.2 to 0.05)
[-0.2, 0.0, 0.05], # left – slight braking at corners (from 0.2 to 0.05)
[1.0, 0.0, 0.05], # right – slight braking at corners (from 0.2 to 0.05)
[0.5, 0.0, 0.05], # right – slight braking at corners (from 0.2 to 0.05)
[0.2, 0.0, 0.05], # right – slight braking at corners (from 0.2 to 0.05)
]
Results were even better (final orange reward curve), with a maximum average reward of 889.
openaigym.video.0.618566.video000000.mp4
It seems we are very close to an average reward of 900, but we think the action setup is critical.
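Since these branches differ only in a couple of scalar values (the main throttle and the corner brake), a small parametric builder makes this kind of fine-tuning explicit. A sketch, with hypothetical parameter names:

```python
def build_actions(throttle=0.3, corner_brake=0.05):
    """Build the 12-entry action table from the two values actually
    tuned in these branches (parameter names are hypothetical)."""
    return [
        [0.0, throttle, 0.0],       # throttle (main)
        [0.0, 0.1, 0.0],            # throttle (light)
        [0.0, 0.0, 0.0],            # no action
        [0.0, 0.0, 0.7],            # brake (hard)
        [0.0, 0.0, 0.5],            # brake (medium)
        [0.0, 0.0, 0.2],            # brake (light)
        [-1.0, 0.0, corner_brake],  # left (full)
        [-0.5, 0.0, corner_brake],  # left (medium)
        [-0.2, 0.0, corner_brake],  # left (slight)
        [1.0, 0.0, corner_brake],   # right (full)
        [0.5, 0.0, corner_brake],   # right (medium)
        [0.2, 0.0, corner_brake],   # right (slight)
    ]

# reinforce-learningrate-act2: build_actions(throttle=0.3, corner_brake=0.2)
# reinforce-learningrate-act3: build_actions(throttle=0.3, corner_brake=0.05)
```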
Branch --> reinforce-learningrate-act4
We tried minimal changes to check whether we could get better results.
available_actions = [
[0.0, 0.25, 0.0], # throttle – lower acc (from 0.3 to 0.25)
[0.0, 0.1, 0.0], # throttle
[0.0, 0.0, 0.0], # throttle
[0.0, 0.0, 0.7], # brake
[0.0, 0.0, 0.5], # brake
[0.0, 0.0, 0.2], # brake
[-1.0, 0.0, 0.04], # left – slight braking at corners (from 0.05 to 0.04)
[-0.5, 0.0, 0.04], # left – slight braking at corners (from 0.05 to 0.04)
[-0.2, 0.0, 0.04], # left – slight braking at corners (from 0.05 to 0.04)
[1.0, 0.0, 0.04], # right – slight braking at corners (from 0.05 to 0.04)
[0.5, 0.0, 0.04], # right – slight braking at corners (from 0.05 to 0.04)
[0.2, 0.0, 0.04], # right – slight braking at corners (from 0.05 to 0.04)
]
Average reward results were now similar to those of the previous action set (final blue curve area).
openaigym.video.0.630905.video000000.mp4
CONCLUSION:
The set of available actions plays a key role in agent performance. It is not possible to solve the CarRacing environment with a poor set of actions using REINFORCE.
Defining a good set of actions seems as important as defining a proper neural network for solving this environment.
To reach an average reward of 900, additional fine-tuning of the actions, or a continuous action space, should be used.
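As a pointer on the continuous-action direction, one common approach with REINFORCE is a Gaussian policy head over [steering, gas, brake] instead of a softmax over a fixed table. A minimal sketch of the idea (PyTorch is used here only for illustration and is not necessarily the framework of this repository):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps policy features to a diagonal Gaussian over the three
    continuous CarRacing controls instead of a fixed action table."""
    def __init__(self, feature_dim):
        super().__init__()
        self.mean = nn.Linear(feature_dim, 3)
        self.log_std = nn.Parameter(torch.zeros(3))  # learned, state-independent

    def forward(self, features):
        dist = torch.distributions.Normal(self.mean(features), self.log_std.exp())
        raw = dist.sample()
        # Log-probability used in the REINFORCE loss (the correction for
        # the squashing below is omitted for brevity).
        log_prob = dist.log_prob(raw).sum(-1)
        # Squash into valid ranges: steering in [-1, 1], gas/brake in [0, 1].
        action = torch.cat([torch.tanh(raw[..., :1]),
                            torch.sigmoid(raw[..., 1:])], dim=-1)
        return action, log_prob
```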