
Possible bug of the calculation precision #30

Open

yangysc opened this issue May 21, 2019 · 5 comments

@yangysc

yangysc commented May 21, 2019

Hello Lucas,
First, thanks for your code. I'm using your PPO implementation for my project, but I found a possible numerical-precision issue that may cause the training process to diverge.

Here is the problem. My original internal (per-step) reward is small, e.g. 0.02 or 0.001, and the maximum return for the environment is 0.08. Strangely, training does not converge at all, and the problem is solved by multiplying the reward by a factor of ten.

I used OpenAI's baselines when I was working with TensorFlow; precision was not a problem there, and the small rewards worked fine. Now that I have ported to PyTorch, the policy and the agent are basically the same in both versions, so I do not know how to handle this issue.

Here are the logs for the original reward and for the reward magnified by a factor of ten; you can compare the mean return of each.
The first one stays in the range [0.0100, 0.012] and just goes up and down.

[screenshot: reward_origin]

Here is the magnified reward, and the return is increasing over time.

[screenshot: reward_magnified]

I wonder why such small rewards make training fail, so I am opening this issue to discuss it.
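
For concreteness, the workaround is just scaling the reward before the agent sees it. A minimal sketch, assuming a standard Gym environment (the wrapper name and the factor here are only illustrative, not something from this repo):

```python
import gym


class ScaleReward(gym.RewardWrapper):
    """Multiply every reward by a constant factor (illustrative helper)."""

    def __init__(self, env, factor=10.0):
        super().__init__(env)
        self.factor = factor

    def reward(self, reward):
        return reward * self.factor


# Wrap the environment before handing it to the PPO algorithm.
env = ScaleReward(gym.make("CartPole-v1"), factor=10.0)
```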

Thanks in advance.

@lcswillems
Owner

Hi,

Thank you for reporting this problem. I have run into this kind of issue before, but never found the reason...

I don't have the time to investigate it anymore. If you find the origin of the issue, let me know. Thank you :)

@maximecb
Contributor

Typically, other RL frameworks attempt to normalize the reward into the 0-1 range to address this problem, AFAIK.
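
A minimal sketch of that idea, using the baselines-style variant that rescales rewards by a running standard deviation of the discounted return rather than clamping them into [0, 1] (all names here are illustrative, not any particular framework's API):

```python
import numpy as np


class ReturnNormalizer:
    """Rescale rewards by a running std of the discounted return, then clip (illustrative)."""

    def __init__(self, gamma=0.99, clip=10.0, epsilon=1e-8):
        self.gamma = gamma
        self.clip = clip
        self.epsilon = epsilon
        self.ret = 0.0       # running discounted return
        self.returns = []    # history of returns, used for the std estimate

    def __call__(self, reward):
        # Update the discounted return and estimate its spread.
        self.ret = self.gamma * self.ret + reward
        self.returns.append(self.ret)
        std = np.std(self.returns)  # noisy for the first few steps, hence the clip
        scaled = reward / (std + self.epsilon)
        return float(np.clip(scaled, -self.clip, self.clip))
```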

@lcswillems
Owner

lcswillems commented May 21, 2019

Thank you @maximecb for the information, I didn't know that. @yangysc, do you know if OpenAI's baselines does this?

@yangysc
Author

yangysc commented May 22, 2019

Hi @lcswillems, although OpenAI's baselines provides an API (VecNormalize) to normalize observations and rewards, I actually did not use it. I will try some other RL frameworks and hope to figure it out.
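
For reference, this is roughly how VecNormalize is typically applied (I did not use this myself; import paths and argument names may differ between baselines versions):

```python
import gym
from baselines.common.vec_env import DummyVecEnv, VecNormalize

# Build a vectorized env, then wrap it so observations and rewards are normalized.
venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
venv = VecNormalize(venv, ob=True, ret=True)
```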

@lcswillems
Owner

Okay. Let me know. Thank you for reporting this bug.
