Simple REINFORCE algorithm. In the raycast environment, we have:
- 5 rays providing distance measurements
- Each ray measurement is normalized by the maximum raycast range before being fed into the network.
- Robot has 3 actions to pick from:
- Drive straight with velocity v
- Steer right with velocity v/2
- Steer left with velocity v/2
- Robot is rewarded +1 for each step until collision.
- At each step, the actor samples an action from the probabilities output by the network (one probability per action), given the sensor measurements; see the first sketch after this list.
- After each episode, the policy is updated per the REINFORCE algorithm (see the second sketch after this list):
- Discounted returns are calculated such that the return at each step accumulates the upcoming discounted rewards:
- return_step_t = reward_t + discount*reward_(t+1) + discount^2*reward_(t+2) + ... + discount^(T-1-t)*reward_(T-1)
- Each discounted return is scaled by the log probability of the action taken at that step.
- Hence, both the log_probability and the reward of each step during an episode need to be saved until the end of the episode, when the policy gets updated through back-propagation.
- The training loss is the cumulative negative log-likelihood of the actions, weighted by the discounted returns.
- As we want to maximize log_probability*return, but gradient descent in torch is used for minimization, we negate the objective to turn the maximization into a minimization.
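
A minimal sketch of the policy network and action sampling described above, assuming PyTorch. The names `PolicyNet`, `select_action`, `NUM_RAYS`, `MAX_RANGE`, and the hidden-layer size are illustrative assumptions, not taken from the original code.

```python
import torch
import torch.nn as nn

NUM_RAYS = 5        # 5 ray distance measurements (input size from the description above)
NUM_ACTIONS = 3     # drive straight, steer right, steer left
MAX_RANGE = 10.0    # assumed maximum raycast range used for normalization

class PolicyNet(nn.Module):
    """Maps normalized ray distances to one probability per action."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_RAYS, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_ACTIONS),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

def select_action(policy, ray_distances):
    """Sample an action and return it together with its log probability."""
    obs = torch.tensor(ray_distances, dtype=torch.float32) / MAX_RANGE
    probs = policy(obs)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
```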
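
A corresponding sketch of the end-of-episode update, again assuming PyTorch; the function name, buffer arguments, and discount value are placeholders. It computes the discounted returns backwards so that each step accumulates its future rewards, then minimizes the negated return-weighted log-likelihood.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, discount=0.99):
    """One REINFORCE policy update from a finished episode.

    log_probs: list of per-step log-probability tensors saved during the episode
    rewards:   list of per-step scalar rewards (+1 per surviving step here)
    """
    # Discounted returns, computed from the last step backwards.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + discount * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Negative log-likelihood weighted by the returns (negated for minimization).
    loss = -(torch.stack(log_probs) * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In typical use, `select_action` is called at every step, its log probability and the step reward are appended to the two buffers, and `reinforce_update` is called once per episode with an optimizer such as `torch.optim.Adam(policy.parameters())`.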