[Discussion] The learning design #88
Some personal thoughts, I am not an expert on this:
Hi @glmcdona, thanks for the feedback. Regarding 1, I agree that it should be OK. For the second point, you raise an interesting point about including more global game observations. It feels too risky though; I wouldn't want to try to make it work. The CNN approach is what I also want to test. Have you just used the built-in CnnPolicy?
@nosound2 Yeah, I think the built-in default CnnPolicy isn't a good fit. You can define your own layers: I just shared an example notebook with you on Kaggle that I've been using. I've tried a few architectures, and the latest one is inspired by that imitation learning notebook's model layout. Note that although it did work, it didn't reach as high a reward as the simple non-CNN example in this repo. I'm personally working on implementing a solution more similar to the OpenAI Five observation setup now.
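In case it helps others reading this thread, wiring custom layers into stable-baselines3 is normally done through a custom feature extractor passed via `policy_kwargs`. A minimal sketch is below; the class name, layer sizes, and observation shape are illustrative assumptions, not the actual notebook:

```python
import gym
import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCNN(BaseFeaturesExtractor):
    """Minimal CNN extractor over map-shaped observation planes (channels, H, W)."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_input_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_input_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size with a dummy forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))


# Hook the extractor into PPO; env is assumed to be the Lux gym environment
# constructed elsewhere in the training script.
policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=128),
)
model = PPO("CnnPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
```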
Ok, very interesting, I am reading your notebook now. Just a small remark: I believe technically this is called "private sharing", which is not allowed. Let's refrain from it in the future (as long as we are not on the same team!).
Oh wow, I didn't know we couldn't share code with each other if we aren't on the same team! Thanks for the heads up. I'll get a proper run of that notebook done and share it publicly.
A few comments on that notebook
Are you on the competition Discord server @nosound2? Regarding the architecture and whether or not to use skip/residual elements: the current "miner state" has ~100 values (order of magnitude), while the output of a CNN feature extractor is likely to be >10k values. Fancy architectures are great, but the training time (and hyperparameter selection) quickly gets out of hand (at least in my attempts). I'm currently trying to inject as much human knowledge as is reasonable into the observation, to reduce what has to be learned from scratch and improve training speed.
This is similar to a basic VGG16 model architecture, though it looks like it should run a ReLU after every 3x3 conv, e.g.:
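The original snippet isn't shown here, but a VGG-style stack in that spirit might look roughly like the sketch below; the channel counts and the 18 input planes are assumptions. The point is simply the ReLU after every 3x3 convolution:

```python
import torch.nn as nn

# VGG-like feature stack: every 3x3 convolution is followed by a ReLU,
# with pooling between blocks. Channel counts are illustrative only.
vgg_like = nn.Sequential(
    nn.Conv2d(18, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
)
```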
Yup, you are describing an earlier version of that notebook! I modified it to incorporate everything into the CNN layers to more closely match the imitation learning setup, in case it helped. The original design had the non-spatial features added at the FC layer instead of adding them as extra planes to the CNN input.
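To spell out the two designs being compared (the names and shapes below are illustrative, not the notebook's actual code): either the flat per-unit features are concatenated after the CNN at the fully connected layer, or each scalar is broadcast to a constant map-sized plane and stacked with the CNN input channels.

```python
import torch as th
import torch.nn as nn


class ConcatAtFC(nn.Module):
    """Original design: the CNN sees only the map planes; the flat per-unit
    features are concatenated at the fully connected layer."""

    def __init__(self, cnn: nn.Module, cnn_out_dim: int, flat_dim: int, features_dim: int = 128):
        super().__init__()
        self.cnn = cnn
        self.fc = nn.Sequential(nn.Linear(cnn_out_dim + flat_dim, features_dim), nn.ReLU())

    def forward(self, map_obs: th.Tensor, flat_obs: th.Tensor) -> th.Tensor:
        return self.fc(th.cat([self.cnn(map_obs), flat_obs], dim=1))


def flat_to_planes(flat_obs: th.Tensor, height: int, width: int) -> th.Tensor:
    """Alternative design: broadcast each flat feature to a constant HxW plane
    so it can be stacked with the map planes as extra CNN input channels."""
    return flat_obs[:, :, None, None].expand(-1, -1, height, width)
```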
Here is the example notebook, shared now: Note that for Kaggle submission, main_lux-ai-2021.py needs to be edited to specify the feature extractor in the model load operation, e.g. something like this:
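The exact snippet isn't preserved here, but with stable-baselines3 the load call would be along these lines; the module name, class name, and features_dim below are assumptions:

```python
from stable_baselines3 import PPO
from agent_policy import CustomCNN  # assumed: the custom extractor class must be importable

model = PPO.load(
    "model.zip",
    custom_objects={
        "policy_kwargs": dict(
            features_extractor_class=CustomCNN,
            features_extractor_kwargs=dict(features_dim=128),
        )
    },
)
```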
The MLP only has 4 layers: 2 layers of 64 for both the actor and the critic. The CnnPolicy only works well on images. The API gives us all of the information without any of the noise. A CNN approach would never be able to determine if there were multiple workers on a city tile, for example.
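For reference, that layout roughly corresponds to the stable-baselines3 MlpPolicy default; spelled out explicitly it would look something like the sketch below (`env` is assumed to be the Lux gym environment constructed elsewhere):

```python
from stable_baselines3 import PPO

# Two hidden layers of 64 units for the policy (pi) network and two for the
# value (vf) network -- roughly the stable-baselines3 MlpPolicy default.
policy_kwargs = dict(net_arch=[dict(pi=[64, 64], vf=[64, 64])])

model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
```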
Geoff, btw, do you have any idea how to get rid of the runtime stacking error? At around 40-50 million steps, too many of the games stop early because the model hasn't quite learned to save fuel during the night, and this causes errors if too many games end early.
Not sure what would cause this. Do you have a copy of the error by any chance? Is it a memory leak or an out-of-memory error?
Process SpawnProcess-32: During handling of the above exception, another exception occurred: Traceback (most recent call last):
Hate to nag, but the replay recording command does not seem to work, and the newly updated files don't compile on the Kaggle server for submissions.
Hi @goforks12, is this a different issue now? If so, can you please open a separate issue per problem? Also, more details on the second problem would be helpful, I think.
It seems to be a problem in your custom code, in this line specifically:
I didn't mess with any of the game engine. I didn't change anything within the LuxAI computations. I was, however, using 16 CPU cores, and the MLP I was training had much larger layers.
I try to run this command in bash, with my Model.zip and my Agent_policy.py in the kaggle_submissions folder: `lux-ai-2021 --seed=100 ./kaggle_submissions/main_lux-ai-2021.py ./kaggle_submissions/main_lux-ai-2021.py --maxtime 100000`. Should lux-ai-2021 be a Python file? Or should it be the folder we cd into to run the evaluation?
If you didn't modify
I was doing an obscenely long training period. Will use shorter times now.
Here is an example training run from an 'okay' RL personal agent I've built. Notes:
Learning curve for a few batch sizes (`n_steps` is set to `batch_size` for each one): [learning curve plot]
Here are a couple of replay files of the trained agent:
I have been thinking about the learning design implemented here, and there are two questions I just can't resolve. The core function for learning is the environment step function. The chain of learning is:

[OBS_UNIT1 -> ACTION1 -> REWARD -> OBS_UNIT2 -> ACTION2 -> OBS_UNIT3 -> ACTION3 ... -> ALL TURN ACTIONS ARE ACTUALLY TAKEN] -> [THE SAME FOR THE NEXT TURN ...]

The questions are:

1. Less important: only the first action gets a reward. Doesn't this create significant problems, especially when the number of units per turn is big? Especially if the discount factor `gamma` is small, but also in general. Even this intermediate reward is delayed for most actions. I wonder how much harder life is for the model because of this. One thing: the ordering in which the units act can matter. I can imagine the model can handle it, but is there an example of multi-unit problems designed like this?
2. More important: algorithms like TD(0) and Q-learning, and more involved ones like PPO, all depend for the model update not only on the current state (or state-action pair) but also on the next one. But the next step belongs to a different unit; its observation is unit-dependent, and its value function is completely different and barely related. The process is basically not Markovian; the states are heavily incomplete information, and each time a different incomplete information. Isn't that a no-go? Or am I misunderstanding something major?
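To make the chain above concrete, here is a rough pseudocode sketch of how I read the per-unit stepping scheme; the class, method, and attribute names are purely illustrative, not the repo's actual code:

```python
class PerUnitEnvSketch:
    """Illustrative only: one gym step per unit; the game itself advances
    once every controllable unit has been assigned an action for the turn."""

    def step(self, action):
        unit = self.units_to_act.pop(0)               # unit whose observation was returned last step
        self.pending_actions.append((unit, action))

        # Only the first action of each turn is credited with the turn-level reward.
        reward = self.turn_reward if self.first_action_of_turn else 0.0
        self.first_action_of_turn = False

        if not self.units_to_act:                     # "ALL TURN ACTIONS ARE ACTUALLY TAKEN"
            self.game.run_turn(self.pending_actions)  # the environment really advances only here
            self.pending_actions = []
            self.units_to_act = list(self.game.my_units())
            self.turn_reward = self.compute_reward()
            self.first_action_of_turn = True

        # The next observation belongs to a *different* unit than the one just acted.
        next_obs = self.observe(self.units_to_act[0]) if self.units_to_act else None
        return next_obs, reward, self.game.is_over(), {}
```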
Please share your thoughts!