Simple REINFORCE algorithm. In the raycast environment, we have:

  • 5 rays providing distance measurements
  • Each ray measurement is normalized by the maximum raycast range before being fed into the network.
  • Robot has 3 actions to pick from:
    • Drive straight with velocity v
    • Steer right with velocity v/2
    • Steer left with velocity v/2
  • Robot is rewarded +1 for each step until collision.
  • At each step, the actor samples an action from the probabilities output by the network (one probability per action index), given the sensor measurements.
  • After each episode, policy is updated per REINFORCE algorithm:
    • Discounted rewards are calculated such that the early steps of an episode account for the upcoming discounted rewards:
      • reward_step_0 = reward_0 + discount*reward_1 + discount^2*reward_2 + ... (until the end of the episode)
    • Each step's discounted reward is multiplied by the log probability of the action taken at that step.
      • Hence, both the log_probability and the reward for each step of an episode need to be saved until the end of the episode, when the policy is updated through back-propagation.
      • The training loss is the cumulative negative log-likelihood of the actions taken, with each term weighted by its discounted reward.
        • We want to maximize log_probability*reward, but optimizers in torch perform minimization, so we negate the objective to turn it into a minimization problem.
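
The per-episode bookkeeping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's actual code: `MAX_RANGE`, the clamping of readings to that range, and all function names are hypothetical, and a fixed probability list stands in for the network output. In a torch implementation the log-probabilities would be tensors, so that the loss can be back-propagated.

```python
import math
import random

MAX_RANGE = 10.0  # assumed maximum raycast range (hypothetical value)


def normalize_rays(distances, max_range=MAX_RANGE):
    """Scale raw ray distances into [0, 1] using the maximum range.

    Clamping readings to max_range is an assumption of this sketch.
    """
    return [min(d, max_range) / max_range for d in distances]


def sample_action(probs, rng=random):
    """Sample an action index from per-action probabilities.

    Returns the index and the log-probability of the sampled action.
    """
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, math.log(probs[idx])


def discounted_returns(rewards, discount):
    """Compute G_t = r_t + discount * G_{t+1}, iterating backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, returns):
    """Negated sum of log_prob * return; minimizing it maximizes the objective."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.25, 0.25]  # stand-in for the network output (3 actions)
    log_probs, rewards = [], []
    for _ in range(3):  # pretend the robot collides after 3 steps
        _, log_prob = sample_action(probs, rng)
        log_probs.append(log_prob)
        rewards.append(1.0)  # +1 for each step survived
    returns = discounted_returns(rewards, discount=0.5)
    print(returns)  # [1.75, 1.5, 1.0]
    print(reinforce_loss(log_probs, returns))
```

Computing the returns backwards keeps the pass linear in episode length, and it makes clear why the rewards and log-probabilities must be stored until the episode ends: the return for step 0 depends on every later reward.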