More Sample Efficient MAL #41

Open
YanickZengaffinen opened this issue Jun 23, 2024 · 3 comments

@YanickZengaffinen

Our improved MAL solutions currently take lots of samples, but implementing model learning as RL with a complex NN describing the agent's "policy" might not be necessary: essentially all we want to do is find a query/context vector that is close to the actual environment and yields a policy that performs well in the real env.

Because our S and A are discrete, we could simply maintain a model of the environment that is continually extended (i.e. SxA->S is just the mean of all transitions we have observed, similar to how it's done in Dyna-Q [see comparison to literature]), which might be more sample efficient.
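For concreteness, here is a minimal sketch of such a tabular model (class and method names are placeholders, not from our codebase); unobserved (s, a) pairs fall back to a uniform guess until we see data:

```python
import numpy as np

class EmpiricalModel:
    """Tabular environment model: the predicted transition distribution for
    (s, a) is the empirical distribution of observed next states; (s, a)
    pairs with no data fall back to a uniform (i.e. random) guess."""

    def __init__(self, n_states: int, n_actions: int):
        self.counts = np.zeros((n_states, n_actions, n_states))

    def update(self, s: int, a: int, s_next: int) -> None:
        self.counts[s, a, s_next] += 1

    def transition_probs(self, s: int, a: int) -> np.ndarray:
        total = self.counts[s, a].sum()
        if total == 0:
            # no data yet for this (s, a): uniform guess until observed
            return np.full(self.counts.shape[2], 1.0 / self.counts.shape[2])
        return self.counts[s, a] / total
```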

Other approaches are also feasible.

It's always worth paying attention to which properties from Gerstgrasser we'd give up (e.g. "the leader should see the query responses but get 0 reward for them" will most likely be violated, so we no longer have guarantees of converging to a Stackelberg equilibrium (SE)).

@Angramme

Yanick's comments:

Yeah, I think it could be more sample efficient (because right now we are learning an NN over SxA->S, just to then get good query answers / learn a good policy in S->A). We actually started off with an implementation similar to what you are proposing (e.g. check the old train file, like here: https://github.com/nilscrm/stackelberg-ml/blob/kldiv/stackelberg_mbrl/train.py). We abandoned it because it wasn't learning, and we attributed that to not showing the leader its query answers properly, but we now know it's because the model can cheat. I do think we can profit in terms of sample efficiency; we probably don't even need an NN as the model but can just return the average over what we have sampled so far, and where we don't have any samples we return random values until something is observed. BUT by doing so we are not following Gerstgrasser anymore, so we lose the guarantee that we converge to a SE.

We could then also take a more principled approach: after updating the model, sample in the parts of the real env where we still have high uncertainty (i.e. where the new best-response policy leads us but no policy has led us before). The cheating issue could be fixed similarly to MAL + Noise (i.e. we sample a few random transitions from the real env).
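A rough sketch of what "sample where we are uncertain" could look like with simple visit counts (the helper names and the min_samples threshold are made up, not from the repo):

```python
import numpy as np

def undersampled_pairs(counts: np.ndarray, policy_visits, min_samples: int = 5):
    """(s, a) pairs the new best-response policy visits but for which we have
    fewer than `min_samples` real-env transitions; these are the ones to probe next.
    `counts` is the (S, A, S) visit-count tensor of the tabular model and
    `policy_visits` is an iterable of (s, a) pairs from an on-model rollout."""
    return [(s, a) for (s, a) in policy_visits if counts[s, a].sum() < min_samples]

def random_probe_pairs(n_states: int, n_actions: int, k: int, rng=None):
    """MAL + Noise analogue: k uniformly random (s, a) pairs to sample in the real env."""
    rng = rng or np.random.default_rng()
    return [(int(rng.integers(n_states)), int(rng.integers(n_actions))) for _ in range(k)]
```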

@Angramme

TODO try:

  • Approach 1: the policy model's parameters are the oracle query-response vector directly. Write a special wrapper class for the policy that exposes the query-response vector as parameters to the optimizer; the model is just the true env. Then feed this into PPO. If needed, add noise to the wrapper policy so that random trajectories are picked sometimes (see the sketch after this list).
  • Approach 2: Do as Yanick said: construct the oracle context from the samples collected so far, potentially adding noise to explore more (although this should be handled by PPO?). Write a special wrapper for the policy that constructs the oracle vector from step observations; again the env is just the true env, and again feed those into PPO.
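A minimal sketch of the Approach 1 wrapper, assuming a PyTorch policy whose only learnable parameters are the query-response logits; the class name is made up, and hooking it into a concrete PPO implementation would still need some glue:

```python
import torch
from torch import nn

class QueryResponsePolicy(nn.Module):
    """Policy whose only trainable parameters are the oracle query-response
    vector: one row of action logits per query state."""

    def __init__(self, n_queries: int, n_actions: int, noise_std: float = 0.0):
        super().__init__()
        # the query-response vector, exposed directly to the optimizer
        self.response_logits = nn.Parameter(torch.zeros(n_queries, n_actions))
        self.noise_std = noise_std

    def forward(self, query_idx: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.response_logits[query_idx]
        if self.noise_std > 0:
            # optional noise so that random trajectories are picked sometimes
            logits = logits + self.noise_std * torch.randn_like(logits)
        return torch.distributions.Categorical(logits=logits)
```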

@YanickZengaffinen commented Jun 24, 2024

For completeness, here's pseudocode of what I'm proposing.

// follower pretraining
pretrain the policy for all possible models (as we do right now)  // 0 samples from the real env up to this point

// leader training
initial_models = draw a few random models
initial_policies = get the best-response policies for the initial_models
samples = rollout the initial_policies
model = mean(samples, axis=0) in R^{(SxA)xS}  // random where no samples are available
old_model = -inf in R^{(SxA)xS}
while |model - old_model| > eps:  // while the model is still changing (maybe find a better criterion)
    samples.extend(rollout the best-response policy to the current model)
    old_model = model
    model = mean(samples, axis=0)

FAQ:

  • Why do we need multiple models in the beginning? Drawing multiple random models, combined with the fact that the model can never "unlearn", should reduce the probability that parts of the world stay hidden. Obviously this is no guarantee.
  • Why do we sample models and then get policies; couldn't we just use random policies? It should work with random policies too, but why not use the best-response policies when we already have them. It's unclear which is best, though.
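A slightly more concrete version of the loop in plain numpy, purely illustrative; `rollout` and `best_response_policy` are hypothetical helpers standing in for whatever we already have:

```python
import numpy as np

def estimated_model(counts: np.ndarray) -> np.ndarray:
    """Turn (S, A, S) transition counts into transition probabilities;
    unobserved (s, a) pairs get a uniform distribution."""
    totals = counts.sum(axis=2, keepdims=True)
    uniform = np.full_like(counts, 1.0 / counts.shape[2])
    return np.where(totals > 0, counts / np.maximum(totals, 1), uniform)

def train_leader(n_states, n_actions, rollout, best_response_policy,
                 n_initial=3, eps=1e-3, max_iters=100):
    """Fixed-point loop from the pseudocode above. `rollout(policy)` returns
    (s, a, s') transitions from the real env and `best_response_policy(model)`
    returns the pretrained follower's best response; both are hypothetical."""
    counts = np.zeros((n_states, n_actions, n_states))

    # seed the model with rollouts of best responses to a few random models
    for _ in range(n_initial):
        random_model = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        for s, a, s_next in rollout(best_response_policy(random_model)):
            counts[s, a, s_next] += 1

    old_model = np.full((n_states, n_actions, n_states), -np.inf)
    for _ in range(max_iters):
        model = estimated_model(counts)
        if np.abs(model - old_model).max() <= eps:  # model stopped changing
            break
        for s, a, s_next in rollout(best_response_policy(model)):
            counts[s, a, s_next] += 1
        old_model = model
    return estimated_model(counts)
```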
