
Adds carflag-v0 env #34

Open
wants to merge 5 commits into master

Conversation

@ashok-arora (Author)

Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent's observation is a vector of three floats: its position on the line, its velocity at each timestep, and the goal flag's location when it reaches the oracle flag. The agent's actions alter its velocity: it can accelerate left, perform a no-op (maintain current velocity), or accelerate right.
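In gym terms, the spaces look roughly like this (a sketch only; min_position, max_position, and max_speed are placeholders, and the exact bounds live in the code):

# Sketch of the spaces implied by the description above; names and bounds are placeholders.
self.observation_space = gym.spaces.Box(
    low=np.array([min_position, -max_speed, -1.0], dtype=np.float32),
    high=np.array([max_position, max_speed, 1.0], dtype=np.float32),
    dtype=np.float32,
)  # (position, velocity, goal flag location -- 0.0 until the oracle is reached)
self.action_space = gym.spaces.Discrete(3)  # 0: accelerate left, 1: no-op, 2: accelerate right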

@smorad (Collaborator) left a comment


Great work! Just a few more requests:

  1. Consider adding easy/medium/hard versions if you would like this to be future-proof! You can probably just increase the heaven/hell positions or decrease the velocities to make it harder, but you are the expert (a rough sketch follows below this list).
  2. Please add this to the list of environments so that users can find it. I also have some unit tests that use this list of envs to ensure that nobody introduces code in the future that breaks your env.
  3. Consider writing a unit test, just to ensure the environment behaves exactly as expected; RL algorithms are very good at exploiting even the smallest bugs. You can validate that everything works as expected by running pytest tests/. I think this will already surface one bug (hint: compare the type signature of this reset function to the MineSweeper one).
  4. In the paper, the carflag reward is 1.0 per timestep. In popgym, however, the other envs have a maximum cumulative reward of 1.0 and a minimum of -1.0, so that you can train on 50 episodes of Env1 and 50 episodes of Env2 using the same learning rate. I am not sure what the correct answer is here, but I figured I would let you know just in case.
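For item 1, a purely illustrative sketch of what the variants could look like (class names, the constructor signature, and the concrete numbers are placeholders, not requirements):

class CarFlagEasy(CarFlag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.heaven_position = 1.0  # heaven/hell flags at +/- 1.0

class CarFlagMedium(CarFlag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.heaven_position = 2.0  # flags further out -> longer drive and memory horizon

class CarFlagHard(CarFlag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.heaven_position = 3.0
        self.max_velocity = 0.05  # optionally also slow the car down

The three classes would then just need to be added to the environment list, however that list ends up being structured.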


return self.state, env_reward, done, {"is_success": env_reward > 0}

def reset(self):
@smorad (Collaborator)

The reset function should reset using a seed. This ensures we can reproduce trajectories exactly, which is useful when comparing policies.
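A minimal sketch of the kind of signature I mean (the body and the return value are placeholders; match them to the other popgym envs, e.g. MineSweeper):

def reset(self, seed=None, options=None):
    super().reset(seed=seed)  # seeds self.np_random so trajectories can be reproduced exactly
    # placeholder initial state: position near the origin, zero velocity, oracle not yet seen
    self.state = np.array([self.np_random.uniform(-0.2, 0.2), 0.0, 0.0], dtype=np.float32)
    return self.state, {}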

@ashok-arora (Author)

Oh right, thank you for the pointer. I have updated it now.

The easy level has the range [-1, 1].
The medium level has the range [-3, 3].
The hard level has the range [-5, 5].
@ashok-arora (Author)

Thank you for the detailed comments and support. I have added the easy, medium, and hard levels and updated the list of environments accordingly.

While writing the tests, I noticed a bug in the DTQN codebase: the env currently returns only a single done flag instead of differentiating between truncated and terminated states.

While looking into this, I saw that the minesweeper env uses the timestep together with the board dimensions to compute the maximum episode length, which scales naturally across game levels since the dimensions change. Could you suggest a similarly scalable way to compute the truncation condition across all levels of the carflag env?

@smorad (Collaborator) commented Jul 3, 2024

Yeah this is tricky. The minimum time required would be the time to reach maximum velocity, drive to the oracle, slow down to 0 velocity, accelerate to max velocity in the direction of the goal, then drive to the goal.

Also, we would probably want to be in "heaven" or "hell" for more than a single timestep to help with credit assignment.

Do you know how many timesteps they used in the DTQN paper? If so, maybe we can just scale that. For example, maybe the standard/easy mode takes a minimum of 20 timesteps to reach the oracle and then heaven, and DTQN gives the agent 40 timesteps, leaving 20 timesteps to remain in "heaven".

If the harder mode takes 60 timesteps to reach the oracle and heaven, then we can set the episodes to be 120 timesteps long. These are just made-up numbers.
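For reference, a back-of-the-envelope sketch of that lower bound (parameter names are placeholders, and it ignores the exact discretization the env uses):

import math

def min_episode_steps(start, oracle, flag, max_velocity, accel):
    # Rough lower bound: accelerate/cruise to the oracle, brake to rest,
    # then accelerate/cruise to the goal flag.
    def leg(distance):
        ramp = math.ceil(max_velocity / accel)   # steps to reach max velocity from rest
        ramp_dist = 0.5 * accel * ramp ** 2      # distance covered while ramping up
        cruise = max(0.0, distance - ramp_dist)
        return ramp + math.ceil(cruise / max_velocity)

    brake = math.ceil(max_velocity / accel)      # steps to come back to rest at the oracle
    return leg(abs(oracle - start)) + brake + leg(abs(flag - oracle))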

@ashok-arora (Author) commented Jul 3, 2024

The DTQN paper sets maximum_time_steps to 200 for the easy mode:

https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/envs/__init__.py#L43-L48

@ashok-arora (Author)

Would it be better to set an arbitrary value or one based on the environment conditions?

@smorad (Collaborator) commented Jul 4, 2024

Ideally, I'd say it should be set based on environment conditions. How many timesteps would it take an optimal policy to reach the oracle and then heaven?

@ashok-arora (Author)

Hey, I trained the DTQN policy, and it took 85 timesteps to reach the oracle and then heaven.

@smorad (Collaborator) commented Jul 6, 2024

Oh, if you have a working DTQN implementation, it's even easier! How many timesteps does it take on medium and hard?

@ashok-arora (Author)

85-137 for easy, 125-175 for medium, and 165-210 for hard.

@smorad (Collaborator) commented Jul 6, 2024

How does easy: 200, medium: 300, hard: 400 timesteps sound?
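i.e., something roughly like this (illustrative only; `difficulty` and the attribute names are placeholders):

# in __init__, per difficulty:
self.max_episode_length = {"easy": 200, "medium": 300, "hard": 400}[difficulty]

# in step(), so the time limit is reported separately from reaching a flag:
truncated = self.timestep >= self.max_episode_length
terminated = bool(abs(self.state[0]) >= self.heaven_position)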

@ashok-arora (Author) commented Jul 6, 2024

Sounds good 👍

@ashok-arora (Author)

Hey @smorad, I am getting this error in the test cases but am unable to figure out what's wrong:

=================================================================== short test summary info ===================================================================
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.4428458,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.3615731,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_markovian_state_space_full[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([1.020816, 0.      , 0.      ], dtype=float32), 0...
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [ 0.32249273 -0.0015     -1.        ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [ 0.73226799 -0.0015      0.        ]
FAILED tests/test_wrappers.py::test_markovian_state_space_partial[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data [-0.7669738  0.0015     0.       ]
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.10640713, 0.        , 0.        ]), 0.5, 1.0, ...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-0.65530812, -0.0015    ,  0.        ]), 0.5, -1...
FAILED tests/test_wrappers.py::test_markovian_state_space_info_dict[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.25203142, 0.        , 0.        ]), 0.5, -1.0,...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagEasy] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.50059146, 0.        , 0.        ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagMedium] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([-1.0373735,  0.       ,  0.       ], dtype=float...
FAILED tests/test_wrappers.py::test_state_space_full_and_partial[CarFlagHard] - ValueError: space Box([-1.1  -0.07 -1.  ], [1.1  0.07 1.  ], (3,), float32) does not contain data (array([0.86538035, 0.        , 0.        ], dtype=float...
=================================================== 12 failed, 441 passed, 6 skipped, 417 warnings in 5.21s ===================================================

I tried np.clip on the values as well, yet I don't see how the values are out of bounds. Could you please help?

@smorad (Collaborator) commented Aug 26, 2024

Ah, I think this might be related to our Markovian state space requirement. With POMDPs, it is useful to have the full state available for debugging. I think you need to add the state_space member variable to your class; see here for an example. You should also implement a get_state method that returns a Markov state (see here).

I think the issue is that state_space is defined as a Box of shape (3,), while you return a tuple of four elements from get_state.
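Roughly, something like this (the bounds below are just copied from the Box in your test output; the exact values are up to you):

from gymnasium import spaces  # or gym.spaces, depending on the version popgym targets

# in __init__:
self.state_space = spaces.Box(
    low=np.array([-1.1, -0.07, -1.0], dtype=np.float32),
    high=np.array([1.1, 0.07, 1.0], dtype=np.float32),
    dtype=np.float32,
)

def get_state(self):
    # must return something state_space contains: a float32 array of shape (3,)
    return np.asarray(self.state, dtype=np.float32)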

@ashok-arora (Author)

Thank you for the advice. I’ll look into the dimensions of get_state.

@ashok-arora (Author)

Changing get_state to return just the observation (of size 3) also gives the same error:

def get_state(self):
    # Return the position of the car, oracle, and goal
    return self.state

@ashok-arora (Author)

Yay, it passes the tests now!
@smorad, could you review the code quality?
Lastly, should I add new tests, or would the existing ones suffice?

@smorad (Collaborator) commented Aug 29, 2024

Great work! Your code is good. I think I have just one more recommendation:

To be in line with the other popgym environments, I think it would be best if the episodic reward (the undiscounted return) were always between -1.0 and 1.0. Maybe you could rescale the reward along these lines:

# rough upper bound on the achievable cumulative reward: total steps minus the
# (approximate) minimum number of steps needed to drive to the heaven flag
max_reward = self.max_steps - self.heaven_position / self.max_velocity
reward = reward / max_reward

As for tests, I would suggest testing that the episode ends after you reach the boundary, and that the episode return is always bounded in [-1, 1]. But it's up to you.
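E.g., a rough sketch (assuming the Gymnasium-style five-tuple step API, a seeded reset, and the CarFlagEasy name; adapt to the actual classes):

def test_carflag_episode_ends_and_return_is_bounded():
    env = CarFlagEasy()  # import path depends on where the env ends up in popgym
    for seed in range(5):
        obs, info = env.reset(seed=seed)
        episode_return, done, steps = 0.0, False, 0
        while not done and steps < 10_000:  # hard cap so a broken env cannot hang the test
            obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
            episode_return += reward
            done = terminated or truncated
            steps += 1
        assert done, "episode should terminate or be truncated"
        assert -1.0 <= episode_return <= 1.0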
