Implement Monte Carlo learning for RL agent #34
Conversation
(force-pushed from 178af19 to a1e6df2)
It looks like …
Of course, I will also add some choices for the user:
I will define these as constant macros rather than conditional compilation.
I will rewrite README.md later.
(force-pushed from c7dfc46 to aadc742)
train.c (outdated)
#define MONTE_CARLO 1

// for Markov decision model
#define GAMMA 1
I thought we need to set GAMMA below 1, following the definition used by the fixed point iteration in Lemma 1.7: https://rltheorybook.github.io/rl_monograph_AJK.pdf
When we use Monte Carlo learning in chess-like games, gamma is usually 1; otherwise the value becomes too small to propagate back to the initial state when the board is big.
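(For a rough sense of scale, using a purely illustrative episode length that is not taken from this project: with gamma = 0.9, a terminal reward of 1 at the end of a 50-move episode contributes only about 0.9^49 ≈ 0.006 to the initial state's value, whereas with gamma = 1 the full reward of 1 propagates all the way back.)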
I will read the book later.
I got your point. But if I set …
Besides the convergence guarantee, another reason to set …
Yes, you're right. It's big enough to propagate. I set gamma to 1 because I mistakenly thought that MuZero did that.
init_rl_agent(agent, state_num, player);
for (unsigned int i = 0; i < state_num; i++)
    agent->state_value[i] =
        get_score(hash_to_table(i), player) * INITIAL_MUTIPLIER;
I guess you use INITIAL_MUTIPLIER to bring the initial values into the same numeric range as the episode reward (0 or -1). Would it be more straightforward to directly use the game result of the board (1, 0, -1) as the initial value if we enable EPISODE_REWARD?
If we use check_win to initialize the values, there may be too many 0s in the early game, so the AI cannot tell which move is better. However, if we use get_score, the AI always makes its first move at the center, which is usually the best move. This reduces the number of episodes the AI needs to find the optimal policy. Additionally, we cannot use MCTS for initialization because it consumes too much time.
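To make the trade-off concrete, here is a minimal sketch of the two seeding strategies; the result encoding returned by check_win and the ' ' empty marker are assumptions for illustration, not taken from this PR:

/* Sketch only: contrasts the two initialization ideas discussed above. */
for (unsigned int i = 0; i < state_num; i++) {
#if EPISODE_REWARD
    /* Reviewer's suggestion: seed with the raw game result. */
    char w = check_win(hash_to_table(i));   /* result encoding is assumed */
    if (w == player)
        agent->state_value[i] = 1;          /* this agent already won here */
    else if (w == ' ')
        agent->state_value[i] = 0;          /* game still in progress */
    else
        agent->state_value[i] = -1;         /* opponent won (or other result) */
#else
    /* Current approach: a scaled heuristic score, so that early,
     * undecided positions do not all start with the same value. */
    agent->state_value[i] =
        get_score(hash_to_table(i), player) * INITIAL_MUTIPLIER;
#endif
}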
train.c (outdated)
rl_agent_t *agent)
{
#if EPISODE_REWARD
    float target = -GAMMA * next_target;
The term target is for TD learning; I believe it might lead to confusion if we reuse the same term in Monte Carlo learning (in Monte Carlo, we care about the total return, right?). Perhaps you could split the update functions for these two algorithms into separate functions to make the code more readable.
Yes, it is the "return" in MC, or the "TD target" in TD. Is changing the name to 'curr' and adding a comment a good approach?
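For illustration, a minimal sketch of the split the reviewer suggests; the function names and LEARNING_RATE are assumptions, not the PR's actual code:

/* Sketch only: keep the two algorithms in separate update functions. */
static float update_state_value_td(float *value, float next_value)
{
    float td_target = -GAMMA * next_value;   /* bootstrap from the next state */
    *value += LEARNING_RATE * (td_target - *value);
    return *value;
}

static float update_state_value_mc(float *value, float episode_return)
{
    /* Monte Carlo moves the estimate toward the sampled episode return. */
    *value += LEARNING_RATE * (episode_return - *value);
    return *value;
}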
(force-pushed from 38aea7a to 51bdb01)
@ndsl7109256 Should I remove …
Using rebase to modify previous commits is preferable to adding new commits. Having two commits in the same PR where one modifies the code and the other reverts the changes makes the git history less clean.
That's cool, I didn't know rebase could be used like that. I'll give it a try later.
I print them just to check the correctness of the state values; maybe you could add a debug flag to control whether to print this information.
win = check_win(table);
episode_moves[episode_len] = table_to_hash(table);
reward[episode_len] =
    (1 - REWARD_TRADEOFF) * get_score(table, agent[turn].player) +
You are defining a new reward function here; maybe you could introduce a new macro for it.
In my experience, using the episode result as the reward is usually the best solution, but trying a new approach is worth encouraging. I will keep this reward for now.
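If the blended reward stays, the reviewer's macro suggestion could look roughly like this; BLENDED_REWARD and episode_result are hypothetical names, and the term after the '+' is not visible in the diff above, so it is only represented symbolically:

/* Sketch only: wraps the blended reward behind a macro, as suggested.
 * episode_result() stands in for the elided second term. */
#define BLENDED_REWARD(table, player)                          \
    ((1 - REWARD_TRADEOFF) * get_score((table), (player)) +    \
     REWARD_TRADEOFF * episode_result((table), (player)))

reward[episode_len] = BLENDED_REWARD(table, agent[turn].player);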
table[i] = agent->player;
float new_q = state_value[table_to_hash(table)];
printf("%f ", new_q);
if (new_q == max_q) {
I thought checking for the same Q is redundant here; the Q values would only be exactly the same at the beginning of training, and we have exploration to update other candidates that have max_q.
This affects the AI's performance a lot. I have tested it on 3x3: the AI trained by the original method can't produce the optimal action. By the way, epsilon-greedy doesn't work well in my tests.
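For reference, a common way to pick uniformly among equally valued moves is reservoir-style tie-breaking. The sketch below uses illustrative names (N_GRIDS, the ' ' empty marker) and needs <float.h> for FLT_MAX and <stdlib.h> for rand(); it is not the PR's exact code:

/* Sketch: pick uniformly at random among all moves whose value ties max_q. */
int best_move = -1;
int n_ties = 0;
float max_q = -FLT_MAX;
for (int i = 0; i < N_GRIDS; i++) {
    if (table[i] != ' ')
        continue;                       /* skip occupied grids */
    table[i] = agent->player;
    float q = state_value[table_to_hash(table)];
    table[i] = ' ';                     /* undo the probe move */
    if (q > max_q) {
        max_q = q;
        best_move = i;
        n_ties = 1;
    } else if (q == max_q && rand() % ++n_ties == 0) {
        best_move = i;                  /* replace with probability 1/n_ties */
    }
}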
exit(1);
turn = !turn;  // the player who makes the last move
float next = 0;
for (int i_move = episode_len - 1; i_move >= 0; --i_move) {
I am trying to make the update function more readable. I thought you could introduce

typedef struct {
    int cur_state_hash, after_state_hash;
} transition_t;

and record these transitions in an array. Then use update_state_value(rl_agent_t *agent, transition_t *trajectory, int episode_len) to update the state value, which means we use the trajectory to update the agent. I think it would make the function more straightforward.
This is just a proposal; you could still come up with another way to refine it.
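Fleshing the proposal out a little, it could take a shape like the one below; the final_reward parameter and the use of GAMMA and LEARNING_RATE are assumptions added for illustration, not part of the original suggestion:

/* Sketch: consume the recorded trajectory in a single backward pass. */
typedef struct {
    int cur_state_hash, after_state_hash;
} transition_t;

static void update_state_value(rl_agent_t *agent,
                               transition_t *trajectory,
                               int episode_len,
                               float final_reward)
{
    float next = final_reward;   /* value seen by the player who moved last */
    for (int i = episode_len - 1; i >= 0; --i) {
        float *v = &agent->state_value[trajectory[i].after_state_hash];
        *v += LEARNING_RATE * (next - *v);
        next = -GAMMA * (*v);    /* players alternate, so the sign flips */
    }
}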
I think your method works fine if we implement a replay buffer, but I may not implement it in this PR; I need to do my lab0-c, and its deadline is 3/4.
This approach was not entirely invented by me. It follows https://github.com/moporgic/TDL2048-Demo/blob/master/2048.cpp#L712
Although this approach is not easy to read at first glance, it becomes more and more elegant the longer you look at it.
(force-pushed from 3fe83cb to 0329c0c)
I fixed the branch, and it cost me an arm and a leg...
The files …
Additionally, there were similar issues with ensuring that all files end with a trailing newline character. Some changes were initially introduced without the newline, which was subsequently corrected in later commits.
I see; I think the first version was just a transition, so I removed it.
Use git rebase -i to refine the commits; it makes reviewing easier.
(force-pushed from 67379bd to 17897f9)
The `for_each_empty_grid` macro introduced in PR#26 wasn't listed in `ForEachMacros` within .clang-format, which made clang-format incorrectly collapse `for_each_empty_grid` blocks onto a single line. This commit adds `for_each_empty_grid` to `ForEachMacros` to fix the issue.
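For reference, the relevant entry in .clang-format looks roughly like this (any other macros already listed in the project's file are omitted here):

ForEachMacros:
  - for_each_empty_grid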
Monte Carlo learning is a reinforcement learning approach trained with the Monte Carlo method instead of the TD method. This method updates state values based on the returns of full game episodes, without depending on the next state value. Consequently, the agent exhibits lower error after training but experiences higher variance during training compared to TD learning. To share code between Monte Carlo learning and TD learning, the functions `train` and `update_state_value` in train.c are modified to propagate values backward through each finished episode, and the macro `MONTE_CARLO` is introduced to select which method is used. This commit also introduces the macros `REWARD_TRADEOFF`, `INITIAL_MULTIPLIER`, and `EPSILON_GREEDY` for users to customize their training. Details are in README.md.
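The shared backward pass could look roughly like the following; this is only a sketch, and names such as episode_moves, reward, state_value, and LEARNING_RATE are illustrative rather than the PR's exact identifiers:

/* Sketch: one backward pass over a finished episode serves both methods. */
float next = 0;  /* MC: return of the remaining episode; TD: next state value */
for (int i = episode_len - 1; i >= 0; --i) {
    float *v = &state_value[episode_moves[i]];
    /* Target seen by the player to move; negated because players alternate. */
    float target = reward[i] - GAMMA * next;
    *v += LEARNING_RATE * (target - *v);
#if MONTE_CARLO
    next = target;   /* carry the sampled return itself backward */
#else
    next = *v;       /* carry the updated value estimate: TD-style bootstrap */
#endif
}

The only difference between the two methods is what gets carried backward: the sampled target itself (Monte Carlo) or the updated value estimate (TD-style bootstrapping).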
The memory leak of `get_action_epsilon_greedy` in train.c is fixed.
Instead of returning the first of the best actions when more than one is available, the function `get_action_exploit` now randomly selects one of them, which enhances the agent's performance.
The util.h header was missing the `#pragma once` directive needed to ensure it is included only once per compilation unit. This commit adds the `#pragma once` directive to prevent potential multiple-inclusion issues.
Maybe you could try squashing your commits into one commit and just keep the commit …
Maybe you're right, but there is no relation between these commits. I want others to know what happened just by viewing the commit messages, so I separate them. Here is an example that separates the commits in a PR: …
The changes look good to me now.
Thanks for all the great work!
Let's see if anyone else has any further comments.
Thanks @paul90317 for contributing!
Monte Carlo learning is an RL approach trained using the Monte Carlo method instead of the TD method. This method updates state values based on the gains of full game episodes, without depending on the next state value. Consequently, the agent exhibits lower error after training but experiences higher variance during training compared to TD learning.
ref: https://en.wikipedia.org/wiki/Reinforcement_learning