
Adds carflag-v0 env #34

Open
wants to merge 5 commits into master

Conversation

@ashok-arora (Author)

Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent's observation is a vector of three floats: its position on the line, its velocity at each timestep, and the goal flag's location when it reaches the oracle flag. The agent's actions alter its velocity: it can accelerate left, perform a no-op (maintain current velocity), or accelerate right.
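In gym terms, the spaces look roughly like this (a sketch only; min_position, max_position, and max_speed are placeholders, and the exact bounds live in the code):

# Sketch of the spaces implied by the description above; names and bounds are placeholders.
self.observation_space = gym.spaces.Box(
    low=np.array([min_position, -max_speed, -1.0], dtype=np.float32),
    high=np.array([max_position, max_speed, 1.0], dtype=np.float32),
    dtype=np.float32,
)  # (position, velocity, goal flag location -- 0.0 until the oracle is reached)
self.action_space = gym.spaces.Discrete(3)  # 0: accelerate left, 1: no-op, 2: accelerate right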

@smorad (Collaborator) left a comment


Great work! Just a few more requests:

  1. Consider adding easy/medium/hard versions if you would like this to be future-proof! You can probably just increase the heaven/hell positions or decrease the velocities to make it harder, but you are the expert (a rough sketch follows below this list).
  2. Please add this to the list of environments so that users can find it. I also have some unit tests that use this list of envs to ensure that nobody introduces code in the future that breaks your env.
  3. Consider writing a unit test, just to ensure the environment behaves exactly as expected; RL algorithms are very good at exploiting even the smallest bugs. You can validate that everything works as expected by running pytest tests/. I think this will already surface one bug (hint: compare the type signature of this reset function to the MineSweeper one).
  4. In the paper, the carflag reward is 1.0 per timestep. In popgym, however, the other envs have a maximum cumulative reward of 1.0 and a minimum of -1.0, so that you can train on 50 episodes of Env1 and 50 episodes of Env2 using the same learning rate. I am not sure what the correct answer is here, but I figured I would let you know just in case.
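For item 1, a purely illustrative sketch of what the variants could look like (class names, the constructor signature, and the concrete numbers are placeholders, not requirements):

class CarFlagEasy(CarFlag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.heaven_position = 1.0  # heaven/hell flags at +/- 1.0

class CarFlagMedium(CarFlag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.heaven_position = 2.0  # flags further out -> longer drive and memory horizon

class CarFlagHard(CarFlag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.heaven_position = 3.0
        self.max_velocity = 0.05  # optionally also slow the car down

The three classes would then just need to be added to the environment list, however that list ends up being structured.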


return self.state, env_reward, done, {"is_success": env_reward > 0}

def reset(self):
@smorad (Collaborator)

The reset function should reset using a seed. This ensures we can reproduce trajectories exactly, which is useful when comparing policies.
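A minimal sketch of the kind of signature I mean (the body and the return value are placeholders; match them to the other popgym envs, e.g. MineSweeper):

def reset(self, seed=None, options=None):
    super().reset(seed=seed)  # seeds self.np_random so trajectories can be reproduced exactly
    # placeholder initial state: position near the origin, zero velocity, oracle not yet seen
    self.state = np.array([self.np_random.uniform(-0.2, 0.2), 0.0, 0.0], dtype=np.float32)
    return self.state, {}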

@ashok-arora (Author)

Oh right, thank you for the pointer. I have updated it now.

The easy level has the range [-1, 1].
The medium level has the range [-3, 3].
The hard level has the range [-5, 5].
@ashok-arora (Author)

Thank you for the detailed comments and support. I have added the easy, medium, and hard levels and updated the list of environments accordingly.

While writing the tests, I noticed a bug in the DTQN codebase: the env currently returns only a single done flag instead of differentiating between truncated and terminated states.

While looking into this, I saw that the minesweeper env uses the timestep together with the board dimensions to compute the maximum episode length, which scales naturally across game levels since the dimensions change. Could you suggest a similarly scalable way to compute the truncation condition across all levels of the carflag env?

@smorad (Collaborator) commented Jul 3, 2024

Yeah this is tricky. The minimum time required would be the time to reach maximum velocity, drive to the oracle, slow down to 0 velocity, accelerate to max velocity in the direction of the goal, then drive to the goal.

Also, we would probably want to be in "heaven" or "hell" for more than a single timestep to help with credit assignment.

Do you know how many timesteps they used in the DTQN paper? If so, maybe we can just scale that. For example, maybe the standard/easy mode takes a minimum of 20 timesteps to reach the oracle and then heaven, and DTQN gives the agent 40 timesteps, leaving 20 timesteps to remain in "heaven".

If the harder mode takes 60 timesteps to reach the oracle and heaven, then we can set the episodes to be 120 timesteps long. These are just made-up numbers.
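For reference, a back-of-the-envelope sketch of that lower bound (parameter names are placeholders, and it ignores the exact discretization the env uses):

import math

def min_episode_steps(start, oracle, flag, max_velocity, accel):
    # Rough lower bound: accelerate/cruise to the oracle, brake to rest,
    # then accelerate/cruise to the goal flag.
    def leg(distance):
        ramp = math.ceil(max_velocity / accel)   # steps to reach max velocity from rest
        ramp_dist = 0.5 * accel * ramp ** 2      # distance covered while ramping up
        cruise = max(0.0, distance - ramp_dist)
        return ramp + math.ceil(cruise / max_velocity)

    brake = math.ceil(max_velocity / accel)      # steps to come back to rest at the oracle
    return leg(abs(oracle - start)) + brake + leg(abs(flag - oracle))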

@ashok-arora (Author) commented Jul 3, 2024

The DTQN paper sets maximum_time_steps to 200 for the easy mode:

https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/envs/__init__.py#L43-L48

@ashok-arora (Author)

Would it be better to set an arbitrary value or one based on the environment conditions?

@smorad (Collaborator) commented Jul 4, 2024

Ideally, I'd say it should be set based on environment conditions. How many timesteps would it take an optimal policy to reach the oracle and then heaven?

@ashok-arora (Author)

Hey, I trained the DTQN policy, and it took 85 timesteps to reach the oracle and then heaven.

@smorad (Collaborator) commented Jul 6, 2024

Oh, if you have a working DTQN implementation, it's even easier! How many timesteps does it take on medium and hard?

@ashok-arora (Author)

85-137 for easy, 125-175 for medium, and 165-210 for hard.

@smorad (Collaborator) commented Jul 6, 2024

How does easy: 200, medium: 300, hard: 400 timesteps sound?
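i.e., something roughly like this (illustrative only; `difficulty` and the attribute names are placeholders):

# in __init__, per difficulty:
self.max_episode_length = {"easy": 200, "medium": 300, "hard": 400}[difficulty]

# in step(), so the time limit is reported separately from reaching a flag:
truncated = self.timestep >= self.max_episode_length
terminated = bool(abs(self.state[0]) >= self.heaven_position)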

@ashok-arora (Author) commented Jul 6, 2024

Sounds good 👍

@ashok-arora (Author)

Hey @smorad, I am getting this error in the test cases but am unable to figure out what's wrong:

=================================================================== short test summary info ===================================================================
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.4428458,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.3615731,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([1.020816, 0.      , 0.      ], dtype=float32), 0...
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [ 0.32249273 -0.0015     -1.        ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [ 0.73226799 -0.0015      0.        ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [-0.7669738  0.0015     0.       ]
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.10640713, 0.        , 0.        ]), 0.5, 1.0, ...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.65530812, -0.0015    ,  0.        ]), 0.5, -1...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.25203142, 0.        , 0.        ]), 0.5, -1.0,...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.50059146, 0.        , 0.        ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-1.0373735,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.86538035, 0.        , 0.        ], dtype=float...
=================================================== 12 failed, 441 passed, 6 skipped, 417 warnings in 5.21s ===================================================

I tried np.clip on the values as well, yet I don't see how the values are out of bounds. Could you please help?

@smorad (Collaborator) commented Aug 26, 2024

Ah, I think this might be related to our Markovian state space requirement. With POMDPs, it is useful to have the full state available for debugging. I think you need to add the state_space member variable to your class; see here for an example. You should also implement a get_state method that returns a Markov state (see here).

I think the issue is that state_space is defined as a Box of shape (3,), while you return a tuple of four elements from get_state.
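Roughly, something like this (the bounds below are just copied from the Box in your test output; the exact values are up to you):

from gymnasium import spaces  # or gym.spaces, depending on the version popgym targets

# in __init__:
self.state_space = spaces.Box(
    low=np.array([-1.1, -0.07, -1.0], dtype=np.float32),
    high=np.array([1.1, 0.07, 1.0], dtype=np.float32),
    dtype=np.float32,
)

def get_state(self):
    # must return something state_space contains: a float32 array of shape (3,)
    return np.asarray(self.state, dtype=np.float32)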

@ashok-arora (Author)

Thank you for the advice. I’ll look into the dimensions of get_state.

@ashok-arora (Author)

Changing get_state to return just the observation (of size 3) also gives the same error:

def get_state(self):
    # Return the position of the car, oracle, and goal
    return self.state

@ashok-arora (Author)

Yay, it passes the tests now!
@smorad, could you review the code quality?
Lastly, should I add new tests, or would the existing ones suffice?

@smorad (Collaborator) commented Aug 29, 2024

Great work! Your code is good. I think I have just one more recommendation:

To be in line with the other popgym environments, I think it would be best if the episodic reward (the undiscounted return) were always between -1.0 and 1.0. Maybe you could rescale the reward along these lines:

# rough upper bound on the achievable cumulative reward: total steps minus the
# (approximate) minimum number of steps needed to drive to the heaven flag
max_reward = self.max_steps - self.heaven_position / self.max_velocity
reward = reward / max_reward

As for tests, I would suggest testing that the episode ends after you reach the boundary, and that the episode return is always bounded in [-1, 1]. But it's up to you.
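E.g., a rough sketch (assuming the Gymnasium-style five-tuple step API, a seeded reset, and the CarFlagEasy name; adapt to the actual classes):

def test_carflag_episode_ends_and_return_is_bounded():
    env = CarFlagEasy()  # import path depends on where the env ends up in popgym
    for seed in range(5):
        obs, info = env.reset(seed=seed)
        episode_return, done, steps = 0.0, False, 0
        while not done and steps < 10_000:  # hard cap so a broken env cannot hang the test
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            episode_return += reward
            done = terminated or truncated
            steps += 1
        assert done, "episode should terminate or be truncated"
        assert -1.0 <= episode_return <= 1.0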
