
Implement Monte Carlo learning for RL agent #34

Merged
5 commits merged into jserv:main from the mc_learning branch on Mar 5, 2024

Conversation

paul90317
Contributor

Monte Carlo learning is an RL approach trained using the Monte Carlo method instead of the TD method. This method updates state values based on the gains of full game episodes, without depending on the next state value. Consequently, the agent exhibits lower error after training but experiences higher variance during training compared to TD learning.

ref: https://en.wikipedia.org/wiki/Reinforcement_learning
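
A minimal, self-contained sketch of the two update rules being contrasted (names such as `td_update`, `mc_update`, and `ALPHA` are illustrative, not the PR's train.c):

#include <stddef.h>

#define ALPHA 0.1f
#define GAMMA 1.0f

/* TD(0): bootstrap from the next state's current estimate. */
static void td_update(float *v, int s, int s_next, float reward)
{
    v[s] += ALPHA * (reward + GAMMA * v[s_next] - v[s]);
}

/* Monte Carlo: after the episode ends, walk it backwards, accumulate the
 * return G, and move each visited state's value toward G. */
static void mc_update(float *v, const int *episode, const float *rewards,
                      size_t len)
{
    float g = 0;
    for (size_t i = len; i-- > 0;) {
        g = rewards[i] + GAMMA * g;
        v[episode[i]] += ALPHA * (g - v[episode[i]]);
    }
}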

@paul90317 paul90317 force-pushed the mc_learning branch 3 times, most recently from 178af19 to a1e6df2 on February 20, 2024 07:18
@ndsl7109256
Contributor

It looks like train_mc.c and train_td.c share a lot of code. Would it be possible to combine them into one file and use conditional compilation to select the training algorithm, similar to how we choose which AI is used in main.c?

@jserv jserv requested a review from visitorckw February 20, 2024 15:50
@paul90317
Contributor Author

Of course, I will also add some choices for the user:

  • MONTE_CARLO: whether to use the Monte Carlo method or the TD method.
  • EPISODE_REWARD: whether to use the episode result or get_score as the reward.
  • INITIAL_MUTIPLIER: the multiplier applied to the initial value from get_score.
  • EPSILON_GREEDY: whether or not to use epsilon-greedy exploration.

I will define these as constant macros rather than conditional compilation; a sketch of what that could look like is below.
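
(The values below are illustrative, not the PR's defaults.)

#define MONTE_CARLO 1            /* 1: Monte Carlo return, 0: TD target */
#define EPISODE_REWARD 1         /* 1: reward from the episode result, 0: from get_score */
#define INITIAL_MUTIPLIER 0.001f /* scales get_score() for the initial state values */
#define EPSILON_GREEDY 0         /* 1: enable epsilon-greedy exploration */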

@paul90317
Contributor Author

I will rewrite README.md later.

@paul90317 paul90317 requested a review from visitorckw February 22, 2024 16:50
train.c Outdated
#define MONTE_CARLO 1

// for Markov decision model
#define GAMMA 1
Contributor

I thought we need to set $\gamma \in [0, 1)$ for the $\gamma$-contraction and fixed-point iteration convergence properties to hold. Maybe you could also enforce the upper bound with a static assert.

definition for $\gamma$ in ch1.1
$\gamma$-contraction in lemma 1.5
fixed point iteration in lemma 1.7
https://rltheorybook.github.io/rl_monograph_AJK.pdf
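
For reference, the contraction property from the monograph takes the form

$$\lVert \mathcal{T} V_1 - \mathcal{T} V_2 \rVert_\infty \le \gamma \lVert V_1 - V_2 \rVert_\infty,$$

so the fixed-point iteration is only guaranteed a unique fixed point and convergence when $\gamma < 1$.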

Contributor Author

When we use Monte Carlo learning in games such as chess, gamma is usually 1; otherwise the value will be too small to propagate back to the initial state when the board is big.

Contributor Author

I will read the book later.

Contributor

I got your point. But if I set $\gamma = 0.999$ and use a 10x10 board, the smallest discount would be about 0.904 ($0.999^{100}$). I think that is still big enough to propagate.

Besides the convergence guarantee, another reason to set $\gamma < 1$ is to compare the number of steps required to win. Intuitively, winning in 5 steps is better than winning in 9 steps. If we set $\gamma = 1$, the total reward in these two situations is the same, and we cannot tell which is better.
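
(Concretely, assuming the reward is given only on the terminal move: with $\gamma = 0.999$ a 5-move win is discounted by $\gamma^4 \approx 0.996$ at the first state, while a 9-move win is discounted by $\gamma^8 \approx 0.992$, so the shorter win gets the larger value.)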

Contributor Author

Yes, you're right. It's big enough to propagate. I set gamma to 1 because I mistakenly thought that MuZero did that.

init_rl_agent(agent, state_num, player);
for (unsigned int i = 0; i < state_num; i++)
    agent->state_value[i] =
        get_score(hash_to_table(i), player) * INITIAL_MUTIPLIER;
Contributor

I guess you use INITIAL_MUTIPLIER to put the initial value in the same numeric range as the episode reward (0 or -1).
Would it be more straightforward to directly use the game result of the board (1, 0, -1) as the initial value when EPISODE_REWARD is enabled?

Contributor Author

If we use check_win to initialize the values, there may be too many 0s in the early game, so the AI cannot tell which move is better. However, if we use get_score, the AI always makes its first move at the center, which is usually the best move. This reduces the number of episodes the AI needs to find the optimal policy. Additionally, we cannot use MCTS for initialization because it consumes too much time.

train.c Outdated
rl_agent_t *agent)
{
#if EPISODE_REWARD
    float target = -GAMMA * next_target;
Contributor

The term "target" comes from TD learning; I believe it might lead to confusion if we reuse the same term for Monte Carlo learning (in Monte Carlo, we care about the total return). Perhaps you could split the update logic for these two algorithms into different functions to make the code more readable.

Contributor Author

@paul90317 paul90317 Feb 23, 2024

Yes, it is the "return" in MC, or the "TD target" in TD. Is changing the name to 'curr' and adding a comment a good approach?
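
Written out, the two quantities under discussion are

$$G_t = r_{t+1} + \gamma\,G_{t+1} \quad\text{(Monte Carlo return)} \qquad\text{vs.}\qquad r_{t+1} + \gamma\,V(s_{t+1}) \quad\text{(TD target)};$$

the negation in the snippet above presumably reflects that consecutive states belong to alternating players, so a position that is good for one player is bad for the other.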

@paul90317 paul90317 requested a review from jserv February 26, 2024 15:08
@paul90317
Contributor Author

@ndsl7109256 Should I remove printf and draw_board in train.c? When we remove them, the training time drops from 1 minute to 1 second. I'm not sure why you added them.

@paul90317 paul90317 requested a review from visitorckw February 27, 2024 17:12
@visitorckw
Collaborator

Using rebase to modify previous commits is preferable to adding new commits. Having two commits in the same PR where one modifies the code and the other reverts the changes can make the git history less clean.

@paul90317
Contributor Author

That's cool, I didn't know rebase could be used like that. I'll give it a try later.

@ndsl7109256
Contributor

@ndsl7109256 Should I remove printf and draw_board in train.c? When we remove them, the training time drops from 1 minute to 1 second. I'm not sure why you added them.

I printed them just to check the correctness of the state values; maybe you could add a debug flag to control whether this information is printed.
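
A possible shape for such a flag, as a sketch; the VERBOSE macro and debug_printf wrapper are illustrative names, not existing code:

#include <stdio.h>

#ifndef VERBOSE
#define VERBOSE 0 /* set to 1 to print state values and boards during training */
#endif

#if VERBOSE
#define debug_printf(...) printf(__VA_ARGS__)
#else
#define debug_printf(...) ((void) 0) /* compiled out: training stays fast */
#endif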

win = check_win(table);
episode_moves[episode_len] = table_to_hash(table);
reward[episode_len] =
    (1 - REWARD_TRADEOFF) * get_score(table, agent[turn].player) +
Contributor

@ndsl7109256 ndsl7109256 Feb 28, 2024

You are defining a new reward function here; maybe you could introduce a new macro for it.

Contributor Author

In my experience, using the episode result as the reward is usually the best solution, but trying a new approach is worth encouraging. I will keep this reward for now.

table[i] = agent->player;
float new_q = state_value[table_to_hash(table)];
printf("%f ", new_q);
if (new_q == max_q) {
Contributor

I thought checking for an equal Q value is redundant here; the Q values would only be exactly the same at the beginning of training, and we have exploration to update other candidates that have max_q.

Contributor Author

This affects the AI's performance a lot. I have tested it on 3x3: the AI trained with the original method can't produce the optimal action. By the way, epsilon-greedy doesn't work well in my tests.
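
One common way to do the uniform tie-break is reservoir sampling with k = 1: each newly seen action whose value ties the current maximum replaces the pick with probability 1/ties, which yields a uniform choice among all tied actions. A sketch (pick_best and its parameters are illustrative, not the repository's code):

#include <stdlib.h>

static int pick_best(const float *q, const int *candidates, int n)
{
    int best = candidates[0];
    float max_q = q[best];
    int ties = 1;

    for (int i = 1; i < n; i++) {
        float v = q[candidates[i]];
        if (v > max_q) {
            max_q = v;
            best = candidates[i];
            ties = 1;
        } else if (v == max_q && rand() % ++ties == 0) {
            best = candidates[i]; /* replace with probability 1/ties */
        }
    }
    return best;
}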

exit(1);
turn = !turn; // the player who makes the last move.
float next = 0;
for (int i_move = episode_len - 1; i_move >= 0; --i_move) {
Contributor

@ndsl7109256 ndsl7109256 Feb 28, 2024

I am trying to make the update function more readable; I thought you could introduce

typedef struct {
    int cur_state_hash, after_state_hash;
} transition_t;

and record these transitions in an array.
Then use update_state_value(rl_agent_t *agent, transition_t *trajectory, int episode_len) to update the state values, which means we use the trajectory to update the agent. I think it would make the function more straightforward.

This is just a proposal; you could still come up with another way to refine it.

Contributor Author

I think your method works fine if we implement a replay buffer, but I may not implement it in this PR; I need to work on my lab0-c, whose deadline is 3/4.

This approach was not entirely invented by me; it follows https://github.com/moporgic/TDL2048-Demo/blob/master/2048.cpp#L712
Although this approach is not easy to read at first glance, it becomes more and more elegant the longer you look at it.

@paul90317 paul90317 force-pushed the mc_learning branch 2 times, most recently from 3fe83cb to 0329c0c on February 28, 2024 15:45
@paul90317
Contributor Author

@visitorckw Using rebase to modify previous commits is preferable to adding new commits. Having two commits in the same PR where one modifies the code and the other reverts the changes can make the git history less clean.

I fixed the branch, and it cost me an arm and a leg...

@visitorckw
Collaborator

The files train_mc.c and train_td.c were created but subsequently deleted. Can we still organize them using rebase? This not only helps keep the git history cleaner but also makes reviewing easier.

@visitorckw
Collaborator

visitorckw commented Mar 1, 2024

Additionally, there were similar issues with ensuring that all files have trailing newline characters. Changes were initially introduced without newline symbols, which were subsequently addressed and corrected in later commits.

@paul90317
Contributor Author

The files train_mc.c and train_td.c were created but subsequently deleted. Can we still organize them using rebase? This not only helps keep the git history cleaner but also makes reviewing easier.

I see, and I think the first version was just a transition, so I removed it.

Owner

@jserv jserv left a comment

Use git rebase -i to refine the commits and make them easier to review.

@paul90317 paul90317 force-pushed the mc_learning branch 2 times, most recently from 67379bd to 17897f9 on March 5, 2024 07:15

The `for_each_empty_grid` macro introduced in PR#26 wasn't included in
`ForEachMacros` within .clang-format, which made clang-format
incorrectly collapse `for_each_empty_grid` blocks onto a single line.

This commit adds `for_each_empty_grid` to `ForEachMacros` to fix the
issue.

Monte Carlo learning is a reinforcement learning approach trained
using the Monte Carlo method instead of the TD method. This method
updates state values based on the gains of full game episodes, without
depending on the next state value. Consequently, the agent exhibits
lower error after training but experiences higher variance during
training compared to TD learning.

To share the code between the Monte Carlo learning and TD learning, the
functions `train` and `update_state_value` in train.c are modified to
use back-propagation to train, and the macro `MONTE_CARLO` is
introduced to decide what method is used.

This commit also introduces the macros `REWARD_TRADEOFF`,
`INITIAL_MULTIPLIER`, and `EPSILON_GREEDY` for users to customize
their training. Details are in README.md.

The memory leak of `get_action_epsilon_greedy` in train.c is fixed.
Instead of returning the first action of the best actions if there is
more than one best action available, the function `get_action_exploit`
will now randomly select one from the best actions, which enhances the
agent's performance.

The util.h header was missing the necessary `#pragma once` directive to
ensure that it's included only once in compilation. This commit adds
the `#pragma once` directive to prevent potential multiple inclusion
issues.
@ndsl7109256
Contributor

Maybe you could try squashing your commits into one, keeping only the commit Implement Monte Carlo learning for RL agent to emphasize your purpose.

@paul90317
Contributor Author

Maybe you're right, but there is no relation between these commits. I want others to know what happened just by viewing the commit messages, so I separated them.

Here is an example that separates the commits in a PR:
https://github.com/jserv/ttt/pull/23/commits

Collaborator

@visitorckw visitorckw left a comment

The changes look good to me now.
Thanks for all the great work!
Let's see if anyone else has any further comments.

@paul90317 paul90317 requested a review from jserv March 5, 2024 11:46
@jserv jserv merged commit 776db3f into jserv:main Mar 5, 2024
2 checks passed
@jserv
Owner

jserv commented Mar 5, 2024

Thank @paul90317 for contributing!

@paul90317 paul90317 deleted the mc_learning branch March 5, 2024 15:45