
Implement Monte Carlo learning for RL agent #34

Merged
5 commits merged into jserv:main from the mc_learning branch on Mar 5, 2024

Conversation

paul90317
Contributor

Monte Carlo learning is an RL approach trained using the Monte Carlo method instead of the TD method. This method updates state values based on the gains of full game episodes, without depending on the next state value. Consequently, the agent exhibits lower error after training but experiences higher variance during training compared to TD learning.

ref: https://en.wikipedia.org/wiki/Reinforcement_learning
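
A minimal, self-contained sketch of the two update rules being contrasted (names such as `td_update`, `mc_update`, and `ALPHA` are illustrative, not the PR's train.c):

#include <stddef.h>

#define ALPHA 0.1f
#define GAMMA 1.0f

/* TD(0): bootstrap from the next state's current estimate. */
static void td_update(float *v, int s, int s_next, float reward)
{
    v[s] += ALPHA * (reward + GAMMA * v[s_next] - v[s]);
}

/* Monte Carlo: after the episode ends, walk it backwards, accumulate the
 * return G, and move each visited state's value toward G. */
static void mc_update(float *v, const int *episode, const float *rewards,
                      size_t len)
{
    float g = 0;
    for (size_t i = len; i-- > 0;) {
        g = rewards[i] + GAMMA * g;
        v[episode[i]] += ALPHA * (g - v[episode[i]]);
    }
}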

@paul90317 paul90317 force-pushed the mc_learning branch 3 times, most recently from 178af19 to a1e6df2 on February 20, 2024 07:18
@ndsl7109256
Contributor

It looks like train_mc.c and train_td.c share a lot of code. Would it be possible to combine them into one file and use conditional compilation to select the training algorithm, similar to how we choose which AI is used in main.c?

@jserv jserv requested a review from visitorckw February 20, 2024 15:50
@paul90317
Contributor Author

Of course, I will also add some choices for the user:

  • MONTE_CARLO: whether to use the Monte Carlo method or the TD method.
  • EPISODE_REWARD: whether to use the episode result or get_score as the reward.
  • INITIAL_MUTIPLIER: the multiplier applied to the initial value from get_score.
  • EPSILON_GREEDY: whether or not to use epsilon-greedy exploration.

I will define these as constant macros rather than conditional compilation; a sketch of what that could look like is below.
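
(The values below are illustrative, not the PR's defaults.)

#define MONTE_CARLO 1            /* 1: Monte Carlo return, 0: TD target */
#define EPISODE_REWARD 1         /* 1: reward from the episode result, 0: from get_score */
#define INITIAL_MUTIPLIER 0.001f /* scales get_score() for the initial state values */
#define EPSILON_GREEDY 0         /* 1: enable epsilon-greedy exploration */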

@paul90317
Contributor Author

I will rewrite README.md later.

@paul90317 paul90317 requested a review from visitorckw February 22, 2024 16:50
train.c Outdated
#define MONTE_CARLO 1

// for Markov decision model
#define GAMMA 1
Contributor

I thought we need to set $\gamma \in [0, 1)$ for the $\gamma$-contraction and fixed-point iteration convergence properties to hold. Maybe you could also enforce the upper bound with a static assert.

definition for $\gamma$ in ch1.1
$\gamma$-contraction in lemma 1.5
fixed point iteration in lemma 1.7
https://rltheorybook.github.io/rl_monograph_AJK.pdf
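
For reference, the contraction property from the monograph takes the form

$$\lVert \mathcal{T} V_1 - \mathcal{T} V_2 \rVert_\infty \le \gamma \lVert V_1 - V_2 \rVert_\infty,$$

so the fixed-point iteration is only guaranteed a unique fixed point and convergence when $\gamma < 1$.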

Contributor Author

When we use Monte Carlo learning in games such as chess, gamma is usually 1; otherwise the value will be too small to propagate back to the initial state when the board is big.

Contributor Author

I will read the book later.

Contributor

I got your point. But if I set $\gamma = 0.999$ and use a 10x10 board, the smallest discount would be about 0.904 ($0.999^{100}$). I think that is still big enough to propagate.

Besides the convergence guarantee, another reason to set $\gamma < 1$ is to compare the number of steps required to win. Intuitively, winning in 5 steps is better than winning in 9 steps. If we set $\gamma = 1$, the total reward in these two situations is the same, and we cannot tell which is better.
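
(Concretely, assuming the reward is given only on the terminal move: with $\gamma = 0.999$ a 5-move win is discounted by $\gamma^4 \approx 0.996$ at the first state, while a 9-move win is discounted by $\gamma^8 \approx 0.992$, so the shorter win gets the larger value.)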

Contributor Author

Yes, you're right. It's big enough to propagate. I set gamma to 1 because I mistakenly thought that MuZero did that.

init_rl_agent(agent, state_num, player);
for (unsigned int i = 0; i < state_num; i++)
    agent->state_value[i] =
        get_score(hash_to_table(i), player) * INITIAL_MUTIPLIER;
Contributor

I guess you use INITIAL_MUTIPLIER to put the initial value in the same numeric range as the episode reward (0 or -1).
Would it be more straightforward to directly use the game result of the board (1, 0, -1) as the initial value when EPISODE_REWARD is enabled?

Contributor Author

If we use check_win to initialize the values, there may be too many 0s in the early game, so the AI cannot tell which move is better. However, if we use get_score, the AI always makes its first move at the center, which is usually the best move. This reduces the number of episodes the AI needs to find the optimal policy. Additionally, we cannot use MCTS for initialization because it consumes too much time.

train.c Outdated
rl_agent_t *agent)
{
#if EPISODE_REWARD
    float target = -GAMMA * next_target;
Contributor

The term "target" comes from TD learning; I believe it might lead to confusion if we reuse the same term for Monte Carlo learning (in Monte Carlo, we care about the total return). Perhaps you could split the update logic for these two algorithms into different functions to make the code more readable.

Contributor Author

@paul90317 paul90317 Feb 23, 2024

Yes, it is the "return" in MC, or the "TD target" in TD. Is changing the name to 'curr' and adding a comment a good approach?
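
Written out, the two quantities under discussion are

$$G_t = r_{t+1} + \gamma\,G_{t+1} \quad\text{(Monte Carlo return)} \qquad\text{vs.}\qquad r_{t+1} + \gamma\,V(s_{t+1}) \quad\text{(TD target)};$$

the negation in the snippet above presumably reflects that consecutive states belong to alternating players, so a position that is good for one player is bad for the other.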

@paul90317 paul90317 requested a review from jserv February 26, 2024 15:08
@paul90317
Contributor Author

@ndsl7109256 Should I remove printf and draw_board in train.c? When we remove them, the training time drops from 1 minute to 1 second. I'm not sure why you added them.

@paul90317 paul90317 requested a review from visitorckw February 27, 2024 17:12
@visitorckw
Collaborator

Using rebase to modify previous commits is preferable to adding new commits. Having two commits in the same PR where one modifies the code and the other reverts the changes can make the git history less clean.

@paul90317
Contributor Author

That's cool, I didn't know rebase could be used like that. I'll give it a try later.

@ndsl7109256
Contributor

@ndsl7109256 Should I remove printf and draw_board in train.c? When we remove them, the training time drops from 1 minute to 1 second. I'm not sure why you added them.

I printed them just to check the correctness of the state values; maybe you could add a debug flag to control whether this information is printed.
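
A possible shape for such a flag, as a sketch; the VERBOSE macro and debug_printf wrapper are illustrative names, not existing code:

#include <stdio.h>

#ifndef VERBOSE
#define VERBOSE 0 /* set to 1 to print state values and boards during training */
#endif

#if VERBOSE
#define debug_printf(...) printf(__VA_ARGS__)
#else
#define debug_printf(...) ((void) 0) /* compiled out: training stays fast */
#endif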

win = check_win(table);
episode_moves[episode_len] = table_to_hash(table);
reward[episode_len] =
    (1 - REWARD_TRADEOFF) * get_score(table, agent[turn].player) +
Contributor

@ndsl7109256 ndsl7109256 Feb 28, 2024

You are defining a new reward function here; maybe you could introduce a new macro for it.

Contributor Author

In my experience, using the episode result as the reward is usually the best solution, but trying a new approach is worth encouraging. I will keep this reward for now.

table[i] = agent->player;
float new_q = state_value[table_to_hash(table)];
printf("%f ", new_q);
if (new_q == max_q) {
Contributor

I thought checking for an equal Q value is redundant here; the Q values would only be exactly the same at the beginning of training, and we have exploration to update other candidates that have max_q.

Contributor Author

This affects the AI's performance a lot. I have tested it on 3x3: the AI trained with the original method can't produce the optimal action. By the way, epsilon-greedy doesn't work well in my tests.
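
One common way to do the uniform tie-break is reservoir sampling with k = 1: each newly seen action whose value ties the current maximum replaces the pick with probability 1/ties, which yields a uniform choice among all tied actions. A sketch (pick_best and its parameters are illustrative, not the repository's code):

#include <stdlib.h>

static int pick_best(const float *q, const int *candidates, int n)
{
    int best = candidates[0];
    float max_q = q[best];
    int ties = 1;

    for (int i = 1; i < n; i++) {
        float v = q[candidates[i]];
        if (v > max_q) {
            max_q = v;
            best = candidates[i];
            ties = 1;
        } else if (v == max_q && rand() % ++ties == 0) {
            best = candidates[i]; /* replace with probability 1/ties */
        }
    }
    return best;
}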

exit(1);
turn = !turn; // the player who makes the last move.
float next = 0;
for (int i_move = episode_len - 1; i_move >= 0; --i_move) {
Contributor

@ndsl7109256 ndsl7109256 Feb 28, 2024

I am trying to make the update function more readable; I thought you could introduce

typedef struct {
    int cur_state_hash, after_state_hash;
} transition_t;

and record these transitions in an array.
Then use update_state_value(rl_agent_t *agent, transition_t *trajectory, int episode_len) to update the state values, which means we use the trajectory to update the agent. I think it would make the function more straightforward.

This is just a proposal; you could still come up with another way to refine it.

Contributor Author

I think your method works fine if we implement a replay buffer, but I may not implement it in this PR; I need to work on my lab0-c, whose deadline is 3/4.

This approach was not entirely invented by me; it follows https://github.com/moporgic/TDL2048-Demo/blob/master/2048.cpp#L712
Although this approach is not easy to read at first glance, it becomes more and more elegant the longer you look at it.

@paul90317 paul90317 force-pushed the mc_learning branch 2 times, most recently from 3fe83cb to 0329c0c on February 28, 2024 15:45
@paul90317
Contributor Author

@visitorckw Using rebase to modify previous commits is preferable to adding new commits. Having two commits in the same PR where one modifies the code and the other reverts the changes can make the git history less clean.

I fixed the branch, and it cost me an arm and a leg...

@visitorckw
Collaborator

The files train_mc.c and train_td.c were created but subsequently deleted. Can we still organize them using rebase? This not only helps keep the git history cleaner but also makes reviewing easier.

@visitorckw
Collaborator

visitorckw commented Mar 1, 2024

Additionally, there were similar issues with ensuring that all files have trailing newline characters. Changes were initially introduced without newline symbols, which were subsequently addressed and corrected in later commits.

@paul90317
Contributor Author

The files train_mc.c and train_td.c were created but subsequently deleted. Can we still organize them using rebase? This not only helps keep the git history cleaner but also makes reviewing easier.

I see, and I think the first version was just a transition, so I removed it.

Owner

@jserv jserv left a comment

Use git rebase -i to refine the commits and make them easier to review.

@paul90317 paul90317 force-pushed the mc_learning branch 2 times, most recently from 67379bd to 17897f9 on March 5, 2024 07:15

The `for_each_empty_grid` macro introduced in PR#26 wasn't included in
`ForEachMacros` within .clang-format, which made clang-format
incorrectly collapse `for_each_empty_grid` blocks onto a single line.

This commit adds `for_each_empty_grid` to `ForEachMacros` to fix the
issue.

Monte Carlo learning is a reinforcement learning approach trained
using the Monte Carlo method instead of the TD method. This method
updates state values based on the gains of full game episodes, without
depending on the next state value. Consequently, the agent exhibits
lower error after training but experiences higher variance during
training compared to TD learning.

To share the code between the Monte Carlo learning and TD learning, the
functions `train` and `update_state_value` in train.c are modified to
use back-propagation to train, and the macro `MONTE_CARLO` is
introduced to decide what method is used.

This commit also introduces the macros `REWARD_TRADEOFF`,
`INITIAL_MULTIPLIER`, and `EPSILON_GREEDY` for users to customize
their training. Details are in README.md.

The memory leak of `get_action_epsilon_greedy` in train.c is fixed.
Instead of returning the first action of the best actions if there is
more than one best action available, the function `get_action_exploit`
will now randomly select one from the best actions, which enhances the
agent's performance.

The util.h header was missing the necessary `#pragma once` directive to
ensure that it's included only once in compilation. This commit adds
the `#pragma once` directive to prevent potential multiple inclusion
issues.
@ndsl7109256
Contributor

Maybe you could try squashing your commits into one, keeping only the commit Implement Monte Carlo learning for RL agent to emphasize your purpose.

@paul90317
Contributor Author

Maybe you're right, but there is no relation between these commits. I want others to know what happened just by viewing the commit messages, so I separated them.

Here is an example that separates the commits in a PR:
https://github.com/jserv/ttt/pull/23/commits

Collaborator

@visitorckw visitorckw left a comment

The changes look good to me now.
Thanks for all the great work!
Let's see if anyone else has any further comments.

@paul90317 paul90317 requested a review from jserv March 5, 2024 11:46
@jserv jserv merged commit 776db3f into jserv:main Mar 5, 2024
2 checks passed
@jserv
Owner

jserv commented Mar 5, 2024

Thank @paul90317 for contributing!

@paul90317 paul90317 deleted the mc_learning branch March 5, 2024 15:45