---
title: "Simulation Models of Cultural Evolution in R"
author: "Alex Mesoudi"
output: pdf_document
---
# Model 17: Reinforcement learning
## Introduction
It's fair to say that the field of cultural evolution focuses far more on social learning (learning from others) than individual learning (learning on one's own). While we have modelled several different types of social learning (direct bias, indirect bias, conformity, vertical transmission etc.), individual learning has taken relatively simple forms, such as the cultural mutation of Model 2, where individuals invent a new cultural trait or switch to a different cultural trait with a certain probability. This is a reasonable starting point, but it can be much improved to give a more psychologically informed model of individual learning. After all, without the new variation introduced by individual learning, cultural evolution cannot occur, so individual learning deserves more attention.
Model 17 will implement reinforcement learning, a popular method of modelling individual learning across several disciplines. Reinforcement learning has its roots in psychology, where early behaviourists like Edward Thorndike, Ivan Pavlov and B.F. Skinner argued that behaviours which get rewarded (or 'reinforced') increase in frequency, while behaviours that are unrewarded are performed less often. More recently, reinforcement learning has become popular in machine learning and artificial intelligence (Sutton & Barto 2018). In these fields reinforcement learning is defined as learning from the environment which actions or choices maximise some numerical reward. This reward might be points in a game, money in a transaction, or even happiness if we are willing to assign it a numerical value. Researchers studying reinforcement learning devise algorithms - models, essentially - that can achieve this maximisation.
In the field of cultural evolution, reinforcement learning models have been used in lab experiments to model participants' choices in decision-making tasks, combined with the social learning biases that we have already come across (e.g. McElreath et al. 2008; Barrett et al. 2017; Toyokawa et al. 2019; Deffner et al. 2020). In Model 17 we will recreate simple versions of these studies' models to see how to simulate individual learning as reinforcement learning and how to combine it with social learning.
Also, in a 'Statistical Appendix' we will see how to retrieve parameters of our reinforcement-plus-social-learning models from simulation data. In other words, we run our agent-based simulation of reinforcement and social learning with various known parameters, such as a probability $s = 0.3$ that an agent engages in social learning rather than reinforcement learning. This gives us our familiar *output* dataframe containing all the traits of all agents over all trials. We then feed this *output* dataframe into the statistical modelling platform Stan, without the original parameter values, to try to retrieve those parameter values. In the example above we would hope to obtain $s = 0.3$ or thereabouts, rather than $s = 0.1$ or $s = 0.9$. This might seem somewhat pointless with simulated data when we already know what the parameters were. But in experiments and analyses of real world data we won't know in advance what real people's tendencies to engage in social learning are, or their propensity for conformity, etc. By testing statistical analyses on agent-based simulation data, we can confirm that our statistical methods adequately estimate such parameters. This is another valuable use for agent-based models.
## Model 17a: Reinforcement learning with a single agent
Before simulating multiple agents who can potentially learn from one another, we start with a single agent. This is all we need for individual learning and will help to better understand how reinforcement learning works.
In models of reinforcement learning agents typically face what is known as a multi-armed or $k$-armed bandit task. A 'one-armed bandit' is a slot machine like you would find in a casino. You pull the lever and hope to get a reward. A $k$-armed bandit is a set of $k$ of these slot machines. On each trial the agent pulls one of the $k$ arms and gets a reward. While all arms give rewards, they differ in the average size of the reward. However, the agent doesn't know these average rewards in advance. The agent's task is therefore to learn which of the $k$ arms gives the highest average reward by pulling the arms and observing the rewards.
To make things difficult, there is random noise in the exact size of the reward received on each trial. The agent can't just try each arm and immediately know which is best, they must learn which is best over successive trials each of which gives noisy data.
We can implement this by assuming that each arm's reward is drawn randomly from a normal distribution with a different mean for each arm and a common standard deviation $\sigma$. We will use a simplified version of the reward environment implemented in an experiment by Deffner et al. (2020). They had $k = 4$ arms. One arm gave a higher mean reward than the other three, which all had the same mean reward. We assume that the optimal arm (arm 4) has a mean reward of 13 points, and the others (arms 1-3) have a mean reward of 10 points. All arms have a standard deviation $\sigma = 1.5$. Let's plot these reward distributions:
```{r}
# define parameters of the 4-armed bandit task
arm_means <- c(10,10,10,13)
arm_sd <- 1.5
# empty plot
plot(NA,
ylab = "probability density",
xlab = "reward",
type = 'n',
bty='l',
xlim = c(4, 20),
ylim = c(0, 0.3))
# plot each arm's normally distributed reward
x <- seq(-10, 100, 0.1)
polygon(x, dnorm(x, arm_means[1], arm_sd), col = adjustcolor("royalblue", alpha.f = 0.2))
text(10, 0.29, "arms 1-3")
polygon(x, dnorm(x, arm_means[4], arm_sd), col = adjustcolor("orange", alpha.f = 0.2))
text(13, 0.29, "arm 4")
```
While arm 4 gives a higher reward on average, there is overlap with the other three arms' reward distributions. This is made more concrete by picking 10 values from each arm's reward distribution:
```{r}
m <- matrix(NA, nrow = 4, ncol = 10,
dimnames = list(c("arm1","arm2","arm3","arm4"), NULL))
for (i in 1:4) m[i,] <- round(rnorm(10, arm_means[i], arm_sd))
m
```
Think of each of the 10 columns as an agent trying all four arms once to see which gives the highest reward. While most of the 10 columns will show that arm 4 gives a higher reward than the other arms, some won't. This is the challenge that the agent faces. Over successive trials, they must maximise rewards by picking from the arm that gives the highest mean reward without knowing which arm does this in advance, and with noisy feedback from the environment. (NB the parameter $\sigma$ controls this uncertainty; try increasing it to $\sigma = 3$ and rerunning the two code chunks above, to see how the task becomes even more difficult.)
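For example, here is a quick re-draw of the sample matrix above with $\sigma = 3$ (the *_wide* object names are new and used only for this comparison), showing how much more the arms' reward distributions overlap:
```{r}
arm_sd_wide <- 3  # a noisier version of the task, for comparison only
m_wide <- matrix(NA, nrow = 4, ncol = 10,
                 dimnames = list(c("arm1","arm2","arm3","arm4"), NULL))
for (i in 1:4) m_wide[i,] <- round(rnorm(10, arm_means[i], arm_sd_wide))
m_wide
```
With the wider reward distributions, typically fewer of the columns will show arm 4 giving the single highest reward.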
A fundamental problem which arises in tasks such as the $k$-armed bandit is an exploitation vs exploration tradeoff. Imagine an agent picks each of the above four arms once on four successive trials. There is a chance that arm 4 gives the highest reward, but it's far from certain (in how many of the 10 columns in the rewards output above does arm 4 give the single highest reward?). Imagine instead that arm 2 gives the highest reward. A highly exploitative agent will proceed to pick ('exploit') arm 2 until the end of the simulation, believing (falsely) that arm 2 maximises its rewards. At the other extreme, a highly exploratory agent will continue to try all of the other arms to gain as much information about the reward distributions as possible. Both agents lose out: the exploitative agent is losing out on the rewards it would have got if it had explored more and discovered that arm 4 is actually the best, while the exploratory agent is losing out because it never settles on arm 4, continuing to pick arms 1-3 until the end.
What is needed is an algorithm that balances exploitation and exploration to explore enough to find the best option, but not too much so that the best option is exploited as much as possible. That's what we'll implement in Model 17a.
First we need to numerically represent the agent's estimate of the value of each arm. These estimates are called Q values (or sometimes 'attractions', denoted $A$). The agent has one Q value for each arm. Q values can take any value - fractions, zeroes, negative numbers, etc. - as long as they numerically specify the relative value of each action.
We create a variable *Q_values* to hold the agent's Q values for each arm on the current trial:
```{r}
# for storing Q values for 4 arms on current trial
Q_values <- rep(0, 4)
names(Q_values) <- paste("arm", 1:4, sep="")
Q_values
```
On the first trial ($t = 1$) we'll set each Q value to zero, as shown above. This indicates that the agent has no initial preference for any of the arms. If we had prior knowledge about the arms we might set different values, but for now equal preferences is a reasonable assumption.
Now we convert these values into probabilities, specifying the probability that the agent chooses each arm on the next trial. This is where the exploitation-exploration trade-off is addressed. We use what is called a 'softmax' function, which converts any set of values into probabilities that sum to one (as probabilities must) and weights the probabilities according to those values. The higher the relative value, the higher the relative probability.
The softmax function contains a parameter $\beta$, often called 'inverse temperature', that determines how exploitative the agent is (sometimes expressed as how 'greedy' the agent is). When $\beta = 0$, arms are chosen entirely at random with no influence of the Q values. This is super exploratory, as the agent continues to choose all arms irrespective of observed rewards. As $\beta$ increases, there is a higher probability of picking the arm with the highest Q value. This is increasingly exploitative (or 'greedy'). When $\beta$ is very large then only the arm that currently has the highest Q value will be chosen, even if other arms might actually be better.
Mathematically, the softmax function looks like this:
$$ p_i = \frac{e^{\beta Q_i}}{\sum_{j=1} ^k {e^{\beta Q_j}}} \hspace{30 mm}(17.1)$$
On the left hand side $p_i$ means the probability of choosing arm $i$. This is equal to the exponential of the product of the parameter $\beta$ and the $Q$ value of arm $i$, divided by the sum of this value for all arms ($Q_j$ for $j = 1$ to $j = k$, where $k = 4$ in our case).
Some code will help to understand this equation. The softmax function can easily be written in R as follows, defining $\beta = 0$ for now:
```{r}
beta <- 0
exp(beta * Q_values) / sum(exp(beta * Q_values))
```
Given that our initial Q values are all zero, each arm has an equal probability of being chosen. $\beta$ is irrelevant here because all the Q values are equal. Let's try a different set of Q values, representing a hypothetical set of four trials, one per arm, like one of the 10 columns in the matrix above:
```{r}
Q_values[1:4] <- c(10,14,9,13)
exp(beta * Q_values) / sum(exp(beta * Q_values))
```
Because $\beta = 0$, the new Q values make no difference. The agent still picks each arm with equal probability. Now we increase $\beta$ to 0.3 (rounding the display to 2 decimal places for ease of reading using the **round** function):
```{r}
beta <- 0.3
round(exp(beta * Q_values) / sum(exp(beta * Q_values)), 2)
```
Now arm 2 is more likely to be chosen, given that it has the highest Q value. But it's still possible that the other arms are chosen, particularly arm 4 which has an only slightly lower Q value than arm 2. This seems like a good value of $\beta$ to balance exploitation and exploration: the currently highest valued arm is most likely to be chosen, but there is a chance of exploring other arms, particularly other relatively high-scoring arms.
To check this, we increase $\beta$ to a much higher value:
```{r}
beta <- 5
round(exp(beta * Q_values) / sum(exp(beta * Q_values)), 2)
```
It's now almost certain that arm 2 is chosen. This $\beta$ value is probably too large. It will greedily exploit arm 2 without ever exploring the other arms, and never discover that arm 4 is better.
For now we'll revert to our initial all-equal Q values and a reasonable $\beta = 0.3$. Our agent stores the softmax probabilities in *probs* then uses this in **sample** to choose one of the four arms. This choice is stored in *choice*.
```{r}
Q_values[1:4] <- rep(0,4)
beta <- 0.3
probs <- exp(beta * Q_values) / sum(exp(beta * Q_values))
choice <- sample(1:4, 1, prob = probs)
choice
```
The arm number above will vary on different simulations, given that it is essentially a random choice. Next the agent pulls the chosen arm and gets a *reward*, using the command **rnorm** to pick a random number from one of the normal distributions specified above. Note that we **round** the reward to the nearest integer to avoid unwieldy decimals.
```{r}
reward <- round(rnorm(1, mean = arm_means[choice], sd = arm_sd))
reward
```
This reward will also vary on different simulations, as it's a random draw from a normal distribution with a randomly selected mean.
Finally, the agent updates its Q values given the reward. The extent to which Q values are updated is controlled by a second parameter, the learning rate $\alpha$, which ranges from 0 to 1. When $\alpha = 0$ there is no updating and the reward has no effect on the Q values. When $\alpha = 1$ the Q value of the chosen arm is set equal to the latest reward, completely overwriting previous experience. When $0 < \alpha < 1$ the Q value of the chosen arm moves a proportion $\alpha$ of the way from its current value towards the reward. There is again a trade-off here. Obviously $\alpha = 0$ is no good because the agent is not learning. However, if $\alpha$ is too large then the agent can prematurely focus on an arm that gives a high reward early on, without exploring other arms.
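Written as an equation, this update is the standard 'delta rule', matching the code in the chunk below:
$$ Q_i \leftarrow Q_i + \alpha (r - Q_i) $$
where $Q_i$ is the Q value of the chosen arm $i$ and $r$ is the reward just received. The bracketed term, the difference between the reward and the current Q value, is often called the prediction error.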
The following code updates the Q values according to $\alpha = 0.7$ and the *reward*:
```{r}
alpha <- 0.7
Q_values[choice] <- Q_values[choice] + alpha * (reward - Q_values[choice])
Q_values
```
You can see that the chosen arm's Q value has increased by $\alpha$ times the *reward* shown above (because its previous Q value was zero, the prediction error is simply the reward). The other arms' Q values remain zero.
We then repeat these steps (get probabilities, make choice, get reward, update Q values) over each of $t_{max} = 100$ timesteps. But let's add a final twist. Every 25 timesteps we will change which arm gives the higher reward. This lets us see whether our reinforcement algorithm can handle environmental change, rather than blindly sticking with arm 4 even after it is no longer optimal.
The following function **RL_single** does all this, using a dataframe *output* to store the Q values, choice and reward for each trial. For simplicity, we hard code the number of arms ($k$), the arm means and standard deviations, and the number of timesteps ($t_{max}$) rather than making them user-defined variables.
```{r}
RL_single <- function(alpha, beta) {
  
  # set up arm rewards
  arm_means <- data.frame(p1 = c(10,10,10,13),
                          p2 = c(10,13,10,10),
                          p3 = c(13,10,10,10),
                          p4 = c(10,10,13,10))
  arm_sd <- 1.5
  
  # for storing Q values for 4 arms on current trial, initially all zero
  Q_values <- rep(0, 4)
  
  # for storing Q values, choices, rewards per trial
  output <- as.data.frame(matrix(NA, 100, 6))
  names(output) <- c(paste("arm", 1:4, sep=""), "choice", "reward")
  
  # t-loop
  for (t in 1:100) {
    
    # get softmax probabilities from Q_values and beta
    probs <- exp(beta * Q_values) / sum(exp(beta * Q_values))
    
    # choose an arm based on probs
    choice <- sample(1:4, 1, prob = probs)
    
    # get reward, given current time period
    if (t <= 25) reward <- round(rnorm(1, mean = arm_means$p1[choice], sd = arm_sd))
    if (t > 25 & t <= 50) reward <- round(rnorm(1, mean = arm_means$p2[choice], sd = arm_sd))
    if (t > 50 & t <= 75) reward <- round(rnorm(1, mean = arm_means$p3[choice], sd = arm_sd))
    if (t > 75) reward <- round(rnorm(1, mean = arm_means$p4[choice], sd = arm_sd))
    
    # update Q_values for choice based on reward and alpha
    Q_values[choice] <- Q_values[choice] + alpha * (reward - Q_values[choice])
    
    # store all in output dataframe
    output[t,1:4] <- Q_values
    output$choice[t] <- choice
    output$reward[t] <- reward
    
  }
  
  # record whether correct
  output$correct <- output$choice == c(rep(4,25),rep(2,25),rep(1,25),rep(3,25))
  
  # export output dataframe
  output
}
```
The core code regarding Q values, probabilities and choices is as it was earlier. However, instead of a single set of *arm_means*, we now create a dataframe with a different set of mean rewards for each of the four time periods (p1 to p4). These are then used to calculate the reward given the current time period. The *output* dataframe records each Q value, choice and reward for each timestep. After the t-loop, a variable called *correct* is created within *output* which records whether the agent picked the optimal arm on that timestep.
Here is one run of **RL_single** with the parameter values from above, showing the first period of 25 timesteps when arm 4 is optimal:
```{r}
data_model17a <- RL_single(alpha = 0.7, beta = 0.3)
data_model17a[1:25,]
```
Each run will be different, but you can see the Q values gradually updating over these 25 timesteps. Perhaps arm 4 ends the period with the highest Q value. Perhaps not. Possibly not all arms got explored. The task is difficult and this learning algorithm is not perfect. But some learning is occurring. What about the final 25 timesteps, where arm 3 is optimal?
```{r}
data_model17a[76:100,]
```
Again, each run will be different, but arm 3 should have the highest Q value or thereabouts. By the final period, the agent will have had a chance to explore all of the arms.
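Rather than eyeballing the Q values, we can also summarise performance over the whole run by computing the proportion of correct (optimal-arm) choices within each 25-trial period, using the *correct* column of the *data_model17a* output from above:
```{r}
# proportion of optimal-arm choices in each 25-trial period
period <- rep(1:4, each = 25)
tapply(data_model17a$correct, period, mean)
```
These proportions will vary from run to run, but they give a more compact picture of learning than the raw Q values.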
## Model 17b: Reinforcement learning with multiple agents
It is relatively straightforward to add multiple agents to **RL_single**. This serves three purposes. First, we are effectively adding multiple independent runs to **RL_single** as we have done with previous models to better understand the stochasticity in our results. Second, this allows us in Model 17c to build in social learning between the multiple agents.
Third, we can introduce agent heterogeneity. In previous models we have assumed that all agents have the same global parameter values. For example, in Model 5 (conformity), all $N$ agents have the same conformity parameter $D$ in any one simulation. Here instead we will introduce heterogeneity (sometimes called 'individual variation' or 'individual differences') amongst the agents in the parameter values $\alpha$ and $\beta$. This simulates a situation where agents differ in their reinforcement learning parameters, which probably better resembles reality, where some people are more exploratory than others, and some more conservative than others.
Each of $N$ agents therefore has an $\alpha$ and $\beta$ drawn from a normal distribution with mean $\alpha_\mu$ and $\beta_\mu$, and standard deviation $\alpha_\sigma$ and $\beta_\sigma$, respectively. When the standard deviations are zero, all agents have identical $\alpha$ and $\beta$, and we are doing independent runs with identical parameters for all agents. When the standard deviations are greater than zero, we are simulating heterogeneity amongst agents. Below is an example of agent heterogeneity. Note that we add some code to bound $\alpha$ between zero and one and to keep $\beta$ non-negative, otherwise our reinforcement learning algorithm won't work properly.
```{r}
N <- 10
alpha_mu <- 0.7
alpha_sd <- 0.1
beta_mu <- 0.3
beta_sd <- 0.1
beta <- rnorm(N, beta_mu, beta_sd) # inverse temperatures
alpha <- rnorm(N, alpha_mu, alpha_sd) # learning rates
alpha[alpha < 0] <- 0 # ensure all alphas are >= 0
alpha[alpha > 1] <- 1 # ensure all alphas are <= 1
beta[beta < 0] <- 0 # ensure all betas are >= 0
data.frame(agent = 1:N, alpha, beta)
```
Now that we have $N$ agents, *Q_values* must store the four Q values for each agent, not just one. We make it an $N$ x 4 matrix, initialised with zeroes as before:
```{r}
# for storing Q values for 4 arms on current trial, initially all zero
Q_values <- matrix(data = 0,
nrow = N,
ncol = 4,
dimnames = list(NULL, paste("arm", 1:4, sep="")))
Q_values
```
Arms are represented along the columns as before, but now each row is a different agent.
Next we calculate the probabilities of each arm using the softmax function, this time for each agent. Note that for this and the next few commands, we use vectorised code to calculate *probs*, *choice*, *reward* and updated *Q_values* for all agents simultaneously, rather than looping over each agent. As always, vectorised code is much quicker than loops.
```{r}
probs <- exp(beta * Q_values) / rowSums(exp(beta * Q_values))
probs
```
As before, with initially all-zero Q values, each arm is equally likely to be chosen by every agent. Note that unlike in Model 17a, *probs* is now divided by **rowSums** rather than **sum**. This divides each agent's exponentiated, $\beta$-weighted Q values by their sum across that agent's row, so that each row of *probs* sums to one.
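We can check this directly:
```{r}
# sanity check: each agent's choice probabilities should sum to one
rowSums(probs)
```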
Next we choose an arm based on *probs*:
```{r}
choice <- apply(probs, 1, function(probs) sample(1:4, 1, prob = probs))
choice
```
This returns $N$ choices, one per agent. Here, *choice* is obtained using **apply**, a command which applies a function to each row (indicated with a '1'; columns would be '2') of the *probs* matrix. The function it applies is **sample** which picks an arm in proportion to that agent's *probs* as in Model 17a.
Next we get a *reward* from each *choice*:
```{r}
reward <- round(rnorm(N, mean = arm_means[choice], sd = arm_sd))
reward
```
Note that *reward* is now $N$ random draws from a normal distribution, one for each agent, rather than one draw for one agent like it was in Model 17a.
Finally we update *Q_values* based on these rewards. We do this one arm at a time: for each arm we identify which agents chose that arm using a variable *chosen*, then update the Q values of that arm for the *chosen* agents as before.
```{r}
for (arm in 1:4) {
chosen <- which(choice==arm)
Q_values[chosen, arm] <- Q_values[chosen, arm] +
alpha[chosen] * (reward[chosen] - Q_values[chosen, arm])
}
Q_values
```
These updated values will all be different, as each agent has a different choice, reward and $\alpha$.
Now we can put these steps into a $t$-loop as before and run over 100 trials, all within a function called **RL_multiple**. Note that we need to modify the *output* dataframe to hold the choices and rewards of all agents. We do this by adding an 'agent' column which indicates the agent id from 1 to $N$. We also add a 'trial' column now that there are multiple rows per trial. For brevity we omit Q values from *output*.
```{r}
RL_multiple <- function(N, alpha_mu, alpha_sd, beta_mu, beta_sd) {
  
  # set up arm rewards
  arm_means <- data.frame(p1 = c(10,10,10,13),
                          p2 = c(10,13,10,10),
                          p3 = c(13,10,10,10),
                          p4 = c(10,10,13,10))
  arm_sd <- 1.5
  
  # draw agent beta and alpha from overall mean and sd
  beta <- rnorm(N, beta_mu, beta_sd) # inverse temperatures
  alpha <- rnorm(N, alpha_mu, alpha_sd) # learning rates
  alpha[alpha < 0] <- 0 # ensure all alphas are >= 0
  alpha[alpha > 1] <- 1 # ensure all alphas are <= 1
  beta[beta < 0] <- 0 # ensure all betas are >= 0
  
  # for storing Q values for k arms on current trial, initially all zero
  Q_values <- matrix(data = 0,
                     nrow = N,
                     ncol = 4)
  
  # for storing choices and rewards per agent per trial
  output <- data.frame(trial = rep(1:100, each = N),
                       agent = rep(1:N, 100),
                       choice = rep(NA, 100*N),
                       reward = rep(NA, 100*N))
  
  # t-loop
  for (t in 1:100) {
    
    # get softmax probabilities from Q_values and beta
    probs <- exp(beta * Q_values) / rowSums(exp(beta * Q_values))
    
    # choose an arm based on probs
    choice <- apply(probs, 1, function(probs) sample(1:4, 1, prob = probs))
    
    # get reward
    if (t <= 25) reward <- round(rnorm(N, mean = arm_means$p1[choice], sd = arm_sd))
    if (t > 25 & t <= 50) reward <- round(rnorm(N, mean = arm_means$p2[choice], sd = arm_sd))
    if (t > 50 & t <= 75) reward <- round(rnorm(N, mean = arm_means$p3[choice], sd = arm_sd))
    if (t > 75) reward <- round(rnorm(N, mean = arm_means$p4[choice], sd = arm_sd))
    
    # update Q values one arm at a time
    for (arm in 1:4) {
      chosen <- which(choice==arm)
      Q_values[chosen,arm] <- Q_values[chosen,arm] +
        alpha[chosen] * (reward[chosen] - Q_values[chosen,arm])
    }
    
    # store choice and reward in output
    output[output$trial == t,]$choice <- choice
    output[output$trial == t,]$reward <- reward
    
  }
  
  # record whether correct
  output$correct <- output$choice == c(rep(4,25*N),rep(2,25*N),rep(1,25*N),rep(3,25*N))
  
  # export output dataframe
  output
}
```
Here are the first 6 rows of *output* using the parameters from above, and $N = 200$:
```{r}
data_model17b <- RL_multiple(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1)
head(data_model17b)
```
As always, it's easier to understand the output by plotting it. We are most interested in whether the agents have learned the optimal arm during each period. The following function **plot_correct** plots the proportion of the $N$ agents who choose the optimal arm on each timestep. This is stored in the variable *plot_data*, which simply takes the mean of *output$correct* for each trial: given that correct choices are TRUE (which R reads as 1) and incorrect choices are FALSE (which R reads as 0), this mean is the proportion correct. We also add dashed vertical lines indicating the timesteps at which the optimal arm changes, and a dotted horizontal line indicating random chance (25%).
```{r}
plot_correct <- function(output) {
  
  plot_data <- rep(NA, 100)
  for (t in 1:100) plot_data[t] <- mean(output$correct[output$trial == t])
  
  plot(x = 1:100,
       y = plot_data,
       type = 'l',
       ylab = "frequency correct",
       xlab = "timestep",
       ylim = c(0,1),
       lwd = 2)
  
  # dashed vertical lines indicate changes in the optimal arm
  abline(v = c(25,50,75),
         lty = 2)
  
  # dotted horizontal line indicates chance success rate
  abline(h = 0.25,
         lty = 3)
}
```
First we can plot $N = 200$ with no agent heterogeneity and the usual values of $\alpha$ and $\beta$. This effectively gives us multiple independent runs of the **RL_single** simulation from above.
```{r}
data_model17b <- RL_multiple(N = 200,
alpha_mu = 0.7,
alpha_sd = 0,
beta_mu = 0.3,
beta_sd = 0)
plot_correct(data_model17b)
```
In each time period, the frequency of correct arm choices increases from chance (25%) to around 40%. While not amazing, some agents at least are successfully learning the optimal arm. And they are doing this despite environmental change and noisy rewards.
We now add heterogeneity via $\alpha_\sigma$ and $\beta_\sigma$:
```{r}
data_model17b <- RL_multiple(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1)
plot_correct(data_model17b)
```
There is not too much difference here with just a small amount of random variation in the parameters. However, adding too much heterogeneity has a negative effect:
```{r}
data_model17b <- RL_multiple(N = 200,
alpha_mu = 0.7,
alpha_sd = 1,
beta_mu = 0.3,
beta_sd = 1)
plot_correct(data_model17b)
```
Now the proportion of optimal choices remains at chance. With too much heterogeneity we lose the exploration-exploitation balance that the values $\alpha = 0.7$ and $\beta = 0.3$ give us. Instead, most agents are either too exploitative or too exploratory and fail to learn.
To illustrate this further, we can remove heterogeneity and increase $\beta_\mu$:
```{r}
data_model17b <- RL_multiple(N = 200,
alpha_mu = 0.7,
alpha_sd = 0,
beta_mu = 5,
beta_sd = 0)
plot_correct(data_model17b)
```
Now agents are too greedy, never exploring other arms and never learning the optimal arm. Indeed, the flat lines indicate that agents never switch arms at all.
Making $\beta_\mu$ too low also eliminates learning, as agents are too exploratory and fail to exploit the optimal:
```{r}
data_model17b <- RL_multiple(N = 200,
alpha_mu = 0.7,
alpha_sd = 0,
beta_mu = 0.1,
beta_sd = 0)
plot_correct(data_model17b)
```
Rather than flat lines, this gives fluctuating lines as agents continually switch arms, never settling on the optimal.
Here then we have more robustly confirmed the results obtained with **RL_single**, demonstrating that this reinforcement learning algorithm with suitable values of $\alpha$ and $\beta$ leads on average to the learning of an optimal choice despite environmental change and noisy rewards.
## Model 17c: Reinforcement learning plus social learning
Now that we have multiple agents all learning alongside each other, we can add social learning. Agents will be able to update their Q values not only using their own personal rewards, but also the choices of the other agents.
To implement social learning we use two new parameters, $s$ and $f$. $s$ is defined as the probability on each trial that an agent engages in social learning rather than the reinforcement learning modelled above. When $s = 0$, social learning is never used, and we revert to Model 17b. As $s$ increases, social learning becomes more likely.
If social learning is used, it takes the form of unbiased cultural transmission (see Model 1) or conformist cultural transmission (see Model 5). However, we implement conformist transmission in a slightly different way to Model 5. This way is less intuitive, but easier to implement in this context.
We assume that the probability that an agent chooses arm $i$, $p_i$, is given by:
$$ p_i = \frac{n_i^f}{\sum_{j=1} ^k {n_j^f}} \hspace{30 mm}(17.2)$$
where $n_i$ is the number of agents in the population who chose arm $i$ on the previous trial and $f$ is a conformity parameter. This $f$ parameter is technically different to the $D$ used in Model 5, but has the same effect of controlling the strength of conformity. The denominator is the sum of $n^f$ for all arms from 1 to $k$, where in our case $k = 4$.
When $f = 1$, transmission is unbiased. Anything to the power of 1 is unchanged, so the equation above simply gives the frequency of arm $i$ in the population. When $f > 1$, there is conformity in the sense defined in Model 5, i.e. more popular arms are disproportionately more likely to be chosen relative to unbiased transmission. To see this, here is a plot for two arms ($k = 2$), comparable to the two traits we simulated in Model 5, with $N = 100$ and $f = 2$:
```{r}
N <- 100 # number of agents
f <- 2 # conformity strength
n <- 1:N # number of N agents who picked arm 1; arm 2 is then N-n
freq <- n / N # frequencies of arm 1
prob <- n^f / (n^f + (N-n)^f) # probabilities according to equation above
plot(freq,
prob,
type = 'l',
xlab = "frequency of arm 1",
ylab = "probability of choosing arm 1")
abline(a = 0, b = 1, lty = 3) # dotted line for unbiased transmission baseline
```
Here we see the familiar S-shaped conformity curve from Model 5. When a majority of agents pick arm 1 (frequency > 0.5), there is a greater than expected probability of picking arm 1 relative to unbiased transmission. When arm 1 is in the minority (frequency < 0.5), there is a less than expected probability. Just like $D$, $f$ acts to magnify the adoption of more-common traits. However, this formulation with $f$ allows multiple traits - here, multiple arms - whereas the $D$ formulation only permits two options.
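As a quick numerical illustration with four arms (the choice counts here are made up purely for illustration), compare unbiased copying ($f = 1$) with conformist copying ($f = 2$):
```{r}
n_example <- c(10, 20, 30, 140)  # made-up counts of agents choosing arms 1-4
f_example <- 2
rbind(unbiased = round(n_example / sum(n_example), 3),
      conformist = round(n_example^f_example / sum(n_example^f_example), 3))
```
The most common arm (arm 4) is copied with a probability well above its simple frequency, while the rarer arms are copied with probabilities below their frequencies.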
Let's start by assuming that all agents have identical $s$ and $f$ values, with $\alpha$ and $\beta$ still potentially varying across agents. The function **RL_social** below is largely identical to **RL_multiple**. However, instead of *probs* being the softmax probabilities derived from Q values, we instead store these softmax probabilities in a new variable *p_RL* (probabilities for reinforcement learning). On trial 1 ($t == 1$), *probs* is just *p_RL*, because there haven't been any choices yet to use for social learning. On subsequent trials ($t > 1$), we also calculate *p_SL* (probabilities for social learning). First we get $n[arm]$, the number of agents who chose each arm on trial $t-1$. Then we apply our equation above to get *p_SL*. After converting *p_SL* into a matrix to make it comparable with *p_RL*, we combine *p_RL* and *p_SL* according to $s$ to get overall *probs*. The rest is the same as before.
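Before looking at the full function, here is a toy illustration of the mixing step for a single agent, using made-up probabilities (the *_example* objects are not used elsewhere). With $s = 0.3$, the agent's final choice probabilities are a 70:30 blend of its reinforcement learning and social learning probabilities:
```{r}
p_RL_example <- c(0.10, 0.60, 0.10, 0.20)  # made-up softmax (reinforcement) probabilities
p_SL_example <- c(0.05, 0.10, 0.15, 0.70)  # made-up conformity (social) probabilities
s_example <- 0.3
(1 - s_example) * p_RL_example + s_example * p_SL_example
```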
```{r}
RL_social <- function(N, alpha_mu, alpha_sd, beta_mu, beta_sd, s, f) {
  
  # set up arm rewards
  arm_means <- data.frame(p1 = c(10,13,10,10),
                          p2 = c(10,10,10,13),
                          p3 = c(13,10,10,10),
                          p4 = c(10,10,13,10))
  arm_sd <- 1.5
  
  # draw agent beta and alpha from overall mean and sd
  beta <- rnorm(N, beta_mu, beta_sd) # inverse temperatures
  alpha <- rnorm(N, alpha_mu, alpha_sd) # learning rates
  
  # avoid impossible values
  alpha[alpha < 0] <- 0 # ensure all alphas are >= 0
  alpha[alpha > 1] <- 1 # ensure all alphas are <= 1
  beta[beta < 0] <- 0 # ensure all betas are >= 0
  
  # for storing Q values for 4 arms on current trial, initially all zero
  Q_values <- matrix(data = 0,
                     nrow = N,
                     ncol = 4)
  
  # for storing choices and rewards per agent per trial
  output <- data.frame(trial = rep(1:100, each = N),
                       agent = rep(1:N, 100),
                       choice = rep(NA, 100*N),
                       reward = rep(NA, 100*N))
  
  # vector to hold frequencies of choices for conformity
  n <- rep(NA, 4)
  
  # t-loop
  for (t in 1:100) {
    
    # get asocial softmax probabilities p_RL from Q_values
    p_RL <- exp(beta * Q_values) / rowSums(exp(beta * Q_values))
    
    # get social learning probabilities p_SL from t=2 onwards
    if (t == 1) {
      probs <- p_RL
    } else {
      # get number of agents who chose each option
      for (arm in 1:4) n[arm] <- sum(output[output$trial == (t-1),]$choice == arm)
      # conformity according to f
      p_SL <- n^f / sum(n^f)
      # convert p_SL to N-row matrix to match p_RL
      p_SL <- matrix(p_SL, nrow = N, ncol = 4, byrow = T)
      # update probs by combining p_RL and p_SL according to s
      probs <- (1-s)*p_RL + s*p_SL
    }
    
    # choose an arm based on probs
    choice <- apply(probs, 1, function(x) sample(1:4, 1, prob = x))
    
    # get reward
    if (t <= 25) reward <- round(rnorm(N, mean = arm_means$p1[choice], sd = arm_sd))
    if (t > 25 & t <= 50) reward <- round(rnorm(N, mean = arm_means$p2[choice], sd = arm_sd))
    if (t > 50 & t <= 75) reward <- round(rnorm(N, mean = arm_means$p3[choice], sd = arm_sd))
    if (t > 75) reward <- round(rnorm(N, mean = arm_means$p4[choice], sd = arm_sd))
    
    # update Q values one arm at a time
    for (arm in 1:4) {
      chosen <- which(choice==arm)
      Q_values[chosen,arm] <- Q_values[chosen,arm] +
        alpha[chosen] * (reward[chosen] - Q_values[chosen,arm])
    }
    
    # store choice and reward in output
    output[output$trial == t,]$choice <- choice
    output[output$trial == t,]$reward <- reward
    
  }
  
  # record whether correct
  output$correct <- output$choice == c(rep(2,25*N),rep(4,25*N),rep(1,25*N),rep(3,25*N))
  
  # export output dataframe
  output
}
```
If we set $s = 0$ - no social learning - we should recapitulate **RL_multiple**. Note that when $s = 0$, $f$ is irrelevant as conformity cannot occur when there is no social learning.
```{r}
data_model17c <- RL_social(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1,
s = 0,
f = 1)
plot_correct(data_model17c)
```
As before, the frequency of optimal arm choices increases in each time period to around 40%.
What happens if we add some social learning, $s = 0.3$?
```{r}
data_model17c <- RL_social(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1,
s = 0.3,
f = 1)
plot_correct(data_model17c)
```
Adding some social learning improves the performance of our agents slightly. The proportion of correct choices now exceeds 40% by the end of most time periods. Social learning is magnifying the effect of reinforcement learning. Those agents who are engaging in reinforcement learning are on average learning and choosing the optimal arm, and agents who are engaging in social learning are therefore more likely to copy this more-frequent arm than the other less-frequent arms.
What about conformity, say $f = 2$?
```{r}
data_model17c <- RL_social(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1,
s = 0.3,
f = 2)
plot_correct(data_model17c)
```
This is even better, with 60-70% of agents now making optimal choices by the end of each time period. Conformity magnifies reinforcement learning even more than unbiased social learning, given that reinforcement learning makes the optimal arm the most-common arm and conformity disproportionately increases this arm's frequency.
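To put a rough number on 'by the end of each time period', rather than reading it off the plot, we can average *correct* over the last five trials of each period in the *data_model17c* object from the chunk above:
```{r}
# mean proportion correct over the last 5 trials of each 25-trial period
period <- cut(data_model17c$trial, breaks = c(0, 25, 50, 75, 100), labels = 1:4)
last5 <- data_model17c$trial %% 25 >= 21 | data_model17c$trial %% 25 == 0
round(tapply(data_model17c$correct[last5], period[last5], mean), 2)
```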
However, too much social learning is not good. Here we increase $s$ to 0.8:
```{r}
data_model17c <- RL_social(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1,
s = 0.8,
f = 2)
plot_correct(data_model17c)
```
The output should show one time period at almost 100% correct choices, and the other three at almost 0%. With too much social learning, agents mostly make choices based on what others are choosing. Too few agents are using reinforcement / individual learning to discover the optimal arm. Instead, one arm goes to fixation due to cultural drift. In one time period it happens to be optimal, in the others it isn't, and which one is purely due to chance.
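We can check this fixation directly by tabulating the arm choices on the final trial of each period in the *data_model17c* object from the chunk above. Typically one arm dominates in every period, but only by chance is it the optimal arm:
```{r}
# distribution of arm choices on the last trial of each period
for (t in c(25, 50, 75, 100))
  print(table(arm = data_model17c$choice[data_model17c$trial == t]))
```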
Finally, we can add agent heterogeneity in both $s$ and $f$, just as we did for $\alpha$ and $\beta$. It seems reasonable that real people vary in both their propensity to copy others and their propensity for conformity, so we should be able to simulate these individual differences.
The following updated **RL_social** function draws $s$ and $f$ from normal distributions with means $s_\mu$ and $f_\mu$ and standard deviations $s_\sigma$ and $f_\sigma$ respectively. Note the new conformity code, which creates a matrix of $n$ for each agent, raises each agent's row to the power of that agent's $f$ value, and divides by the sum of each row's $n^f$ values. This is a bit opaque (a standalone illustration follows the function below), but as before this vectorised code is much faster than looping through all agents.
```{r}
RL_social <- function(N,
                      alpha_mu, alpha_sd,
                      beta_mu, beta_sd,
                      s_mu, s_sd,
                      f_mu, f_sd) {
  
  # set up arm rewards
  arm_means <- data.frame(p1 = c(10,10,10,13),
                          p2 = c(10,13,10,10),
                          p3 = c(13,10,10,10),
                          p4 = c(10,10,13,10))
  arm_sd <- 1.5
  
  # draw agent beta, alpha, s and f from overall mean and sd
  beta <- rnorm(N, beta_mu, beta_sd) # inverse temperatures
  alpha <- rnorm(N, alpha_mu, alpha_sd) # learning rates
  s <- rnorm(N, s_mu, s_sd) # social learning tendency
  f <- rnorm(N, f_mu, f_sd) # conformity parameter
  
  # avoid impossible values
  alpha[alpha < 0] <- 0 # ensure all alphas are >= 0
  alpha[alpha > 1] <- 1 # ensure all alphas are <= 1
  beta[beta < 0] <- 0 # ensure all betas are >= 0
  s[s < 0] <- 0 # keep s between 0 and 1
  s[s > 1] <- 1
  f[f < 0] <- 0 # ensure all fs are >= 0
  
  # for storing Q values for 4 arms on current trial, initially all zero
  Q_values <- matrix(data = 0,
                     nrow = N,
                     ncol = 4)
  
  # for storing choices and rewards per agent per trial
  output <- data.frame(trial = rep(1:100, each = N),
                       agent = rep(1:N, 100),
                       choice = rep(NA, 100*N),
                       reward = rep(NA, 100*N))
  
  # vector to hold frequencies of choices for conformity
  n <- rep(NA, 4)
  
  # t-loop
  for (t in 1:100) {
    
    # get asocial softmax probabilities p_RL from Q_values
    p_RL <- exp(beta * Q_values) / rowSums(exp(beta * Q_values))
    
    # get social learning probabilities p_SL from t=2 onwards
    if (t == 1) {
      probs <- p_RL
    } else {
      # get number of agents who chose each option
      for (arm in 1:4) n[arm] <- sum(output[output$trial == (t-1),]$choice == arm)
      # conformity according to f, allowing for different agent f's if f_sd>0
      p_SL <- matrix(rep(n, times = N)^rep(f, each = 4), nrow = N, ncol = 4, byrow = T)
      p_SL_rowsums <- matrix(rep(rowSums(p_SL), each = 4), nrow = N, ncol = 4, byrow = T)
      p_SL <- p_SL / p_SL_rowsums
      # update probs by combining p_RL and p_SL according to s
      probs <- (1-s)*p_RL + s*p_SL
    }
    
    # choose an arm based on probs
    choice <- apply(probs, 1, function(x) sample(1:4, 1, prob = x))
    
    # get reward
    if (t <= 25) reward <- round(rnorm(N, mean = arm_means$p1[choice], sd = arm_sd))
    if (t > 25 & t <= 50) reward <- round(rnorm(N, mean = arm_means$p2[choice], sd = arm_sd))
    if (t > 50 & t <= 75) reward <- round(rnorm(N, mean = arm_means$p3[choice], sd = arm_sd))
    if (t > 75) reward <- round(rnorm(N, mean = arm_means$p4[choice], sd = arm_sd))
    
    # update Q values one arm at a time
    for (arm in 1:4) {
      chosen <- which(choice==arm)
      Q_values[chosen,arm] <- Q_values[chosen,arm] +
        alpha[chosen] * (reward[chosen] - Q_values[chosen,arm])
    }
    
    # store choice and reward in output
    output[output$trial == t,]$choice <- choice
    output[output$trial == t,]$reward <- reward
    
  }
  
  # record whether correct
  output$correct <- output$choice == c(rep(4,25*N),rep(2,25*N),rep(1,25*N),rep(3,25*N))
  
  # export output dataframe
  output
}
```
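To unpack the vectorised conformity lines, here is a minimal standalone check using made-up values (three agents, four arms; the *_demo* objects are not used elsewhere), comparing the matrix version against an explicit loop over agents:
```{r}
n_demo <- c(5, 20, 60, 15)  # made-up counts of agents choosing each arm
f_demo <- c(1, 2, 3)        # made-up agent-specific conformity parameters
N_demo <- length(f_demo)
# vectorised version, as in RL_social (dividing by rowSums() normalises each row,
# equivalent to the p_SL_rowsums matrix constructed inside the function)
p_SL_demo <- matrix(rep(n_demo, times = N_demo)^rep(f_demo, each = 4),
                    nrow = N_demo, ncol = 4, byrow = TRUE)
p_SL_demo <- p_SL_demo / rowSums(p_SL_demo)
# equivalent explicit loop over agents
p_SL_loop <- matrix(NA, nrow = N_demo, ncol = 4)
for (i in 1:N_demo) p_SL_loop[i, ] <- n_demo^f_demo[i] / sum(n_demo^f_demo[i])
all.equal(p_SL_demo, p_SL_loop)
```
Each row of *p_SL_demo* is that agent's conformity-weighted copying probabilities, with a stronger skew towards the most common arm for agents with larger $f$.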
We can confirm that the last output above is recreated with agent heterogeneity in $\alpha$, $\beta$, $s$ and $f$:
```{r}
data_model17c <- RL_social(N = 200,
alpha_mu = 0.7,
alpha_sd = 0.1,
beta_mu = 0.3,
beta_sd = 0.1,
s_mu = 0.3,
s_sd = 0.1,
f_mu = 2,
f_sd = 0.1)
plot_correct(data_model17c)
```
While it may seem trivial to show that adding a small amount of random variation to parameters gives the same output as not having any variation, it is useful in principle to be able to simulate agent heterogeneity in model parameters. In reality individual heterogeneity in learning and behaviour is the norm. In the Statistical Appendix below we will see how to recover these individually-varying parameters from our simulated data. This serves as a validation check for experiments and real-world data, where we might want to estimate these parameters from actual, individually-varying people.
***
## Summary
In Model 17 we saw how to implement reinforcement learning. Cultural evolution models require some kind of individual learning to introduce variation and track environmental change. Reinforcement learning, extensively used in artificial intelligence and computer science, is a richer form of individual learning than something like cultural mutation seen in Model 2.
Our reinforcement learning algorithm is designed to balance exploitation and exploration in the multi-armed bandit task with hidden optima and noisy reward feedback. Too much exploitation means that the agent might prematurely settle on a non-optimal choice. Too much exploration and the agent will continue to try all options, never settling on the optimal choice. The parameters $\alpha$ (the learning rate) and $\beta$ (inverse temperature) can be used to balance this trade-off.
We also saw how social learning can magnify reinforcement learning, particularly conformist social learning. Conformity boosts the frequency of the most common option. If reinforcement learning has increased the frequency of the optimal option even slightly above chance, then conformity can boost the frequency of this optimal option further. However, too much social learning is bad. If there is not enough reinforcement learning going on, the optimal option is not discovered, and so a random arm is magnified.
Reinforcement learning has commonly been used in model-driven cultural evolution lab experiments such as McElreath et al. (2008), Barrett et al. (2017), Toyokawa et al. (2019) and Deffner et al. (2020). These experiments get participants to play a multi-armed bandit task of the kind simulated above. Rather than using generic statistical models such as linear regression to analyse the participant's data, these studies use the reinforcement and social learning models implemented above. Participant data is fit to these models, and model parameters estimated both for the whole sample and for each participant. Some participants might be particularly exploratory, reflected in a low estimated $\beta$. Some might be particularly prone to conformity, reflected in a large estimated $f$. This is why we simulated agent heterogeneity - to reflect this real life heterogeneity. Overall, these experiments have shown that real participants are often quite effective at solving the multi-armed bandit task, both in balancing the exploitation-exploration trade-off and in using conformist social learning.
One might wonder about the ecological validity of multi-armed bandit tasks. Most of us rarely find ourselves faced with the task of choosing one of $k$ slot machines in a casino. However, the general structure of the task probably resembles many real world problems beyond the windowless walls of a casino. For example, we might want to decide which of several restaurants has the best food, which beach has the fewest people, or which website gives the cheapest travel tickets. In each case, you do not know in advance which option is best; feedback is noisy (e.g. you might visit a restaurant on a day when they've run out of your favourite dish); and it's possible to learn from others' choices (e.g. by seeing how busy the restaurant is). The exploitation-exploration trade-off is inherent in all these real world examples of information search.
In terms of programming techniques, one major innovation is the introduction of agent heterogeneity in model parameters. Rather than assuming that every agent has the same, say, conformity parameter, we assume each agent differs. Formally, agent parameter values are drawn randomly from a normal distribution, with the standard deviation of this normal distribution controlling the amount of heterogeneity. However there is no reason that this has to be a normal distribution. Other distributions might better reflect real world individual variation.
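For example, here is a minimal sketch of two alternatives, with parameter values chosen purely for illustration: a Beta distribution for $\alpha$, which is naturally bounded between 0 and 1, and a log-normal distribution for $\beta$, which is strictly positive and right-skewed, so neither needs the truncation code used above.
```{r}
N <- 10
alpha <- rbeta(N, shape1 = 7, shape2 = 3)           # mean 0.7, bounded in [0,1]
beta <- rlnorm(N, meanlog = log(0.3), sdlog = 0.3)  # median 0.3, always positive
data.frame(agent = 1:N, alpha = round(alpha, 2), beta = round(beta, 2))
```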
We also saw a different way of modelling conformity to the one introduced in Model 5. While less intuitive than the 'mating tables' approach used to derive equation 5.1, equation 17.2 has the advantage of allowing any number of choices rather than just two. That makes it useful for the multi-armed bandit task when $k > 2$.
***
## Acknowledgements
The following resources were invaluable in compiling Model 17: Raphael Geddert's *Reinforcement Learning Resource Guide & Tutorial* (https://github.com/rmgeddert/Reinforcement-Learning-Resource-Guide/blob/master/RLResourceGuide.md), Lei Zhang's *Bayesian Statistics and Hierarchical Bayesian Modeling for Psychological Science* (https://github.com/lei-zhang/BayesCog_Wien) course, and Dominik Deffner's analysis code for his 2020 study (https://github.com/DominikDeffner/Dynamic-Social-Learning). Many thanks also to Dominik Deffner who helped with the Stan code in the Statistical Appendix.
***
## Exercises
1. Use **RL_single** to explore different values of $\alpha$ and $\beta$. Using code from previous models, write a function to generate a heatmap showing the proportion of correct arm choices for a range of values of each parameter.
2. In Model 17 we started the agents with Q values all initially zero. As rewards are obtained, these Q values then increased. Try instead very high starting values, e.g. all 100. These will then decrease. Does this change the dynamics of **RL_multiple** shown in the plots? (see Sutton & Barto 2018 for discussion)
3. Use **RL_social** to explore different values of $s$ and $f$, using the heatmap function from (1) with fixed $\alpha$ and $\beta$. What happens when $s = 1$? What about when $s = 0.3$ and $f = 10$? Explain these dynamics in terms of the findings regarding unbiased transmission in Model 1 and conformity in Model 5.
4. Modify the Model 17 functions to make the number of arms $k$, the standard deviation of the arm reward distributions *arm_sd*, and the number of timesteps $t_{max}$ user-defined parameters. Instead of changing the optimal arm at fixed intervals, add another parameter which specifies the probability on each timestep that a new randomly chosen arm becomes the optimal one. Explore how the reinforcement learning algorithm handles different numbers of arms, different levels of reward noise, and different rates of environmental change.
***
## Statistical Appendix
As discussed above, it is useful to be able to recover the model parameters (in our case $\alpha$, $\beta$, $s$ and $f$) from the output simulation data. This serves as a validity check when running the same analyses on experimental or observational data generated by real people (or other animals), where we don't know the parameter values.
Here we follow studies such as McElreath et al. (2008), Barrett et al. (2017), Toyokawa et al. (2019) and Deffner et al. (2020) and use multi-level Bayesian statistical models run using the platform Stan (https://mc-stan.org/). McElreath (2020) provides an excellent overview of Bayesian statistics in R and Stan. I won't discuss the statistical methods used here but I recommend working through McElreath's book to understand them fully.
We will also use McElreath's R package `rethinking` (McElreath 2021) to interface with Stan. Follow the instructions to install `rethinking` at https://github.com/rmcelreath/rethinking. Then load the package:
```{r, message=FALSE}
library(rethinking)
```
First we try to recover $\alpha$ and $\beta$ from a run of **RL_single**. We run the simulation and remind ourselves of the output:
```{r}
data_model17a <- RL_single(alpha = 0.7, beta = 0.3)
head(data_model17a)
```
Our Stan model will only use *choice*s and *reward*s from this output. These need to be in the form of a list rather than a dataframe, which we call *dat*:
```{r}
# create data list
dat <- list(choice = data_model17a$choice,
reward = data_model17a$reward)
dat
```
Stan requires models written in the Stan programming language. This is slightly different to R. Usually the Stan model is saved as a separate text file with the extension `.stan`. This `.stan` file is then called within R to actually run the model. We will follow this practice. All the `.stan` files we will use are in the folder `"model17_stanfiles"`. The first one is called `model17_single.stan`. I will reproduce chunks of these stan files here but include `eval=FALSE` in the code chunks so that they are not run if you knit the RMarkdown file. If you are following along, don't run these Stan chunks (you'll just get an error).
Stan files require at least three sections. The first, labelled `data`, specifies the data that is required by the model. This should match the *dat* list from above. The `data` section of `model17_single.stan` is reproduced below:
```{stan, eval=FALSE, output.var=""}
data {
array[100] int<lower=1,upper=4> choice; //arm choice
array[100] int reward; // reward
}
```
Here we have our *reward* and *choice* variables. There are a few differences in this Stan code compared to R code. First, all statements must end in a semicolon. Second, comments are marked with `//` rather than `#`. Third, Stan requires the programmer to explicitly declare what type each variable is, i.e. whether it is an integer, real (continuous) number, vector, matrix or whatever (such languages are called 'statically typed', as opposed to R which is 'dynamically typed'). The possible range of each variable can also be specified. The advantage of this explicit type declaration is that Stan throws an error if you try to do a calculation with, say, an integer when a real number is required, or mistakenly enter a value of 100 for a variable that is constrained to be between 1 and 10.
Hence *choice* and *reward* are both declared as arrays of 100 integers (`int`). An array is like a vector, but vectors in Stan must be `real`, while our variables are `int`, so we use arrays instead. *choice* is additionally constrained between 1 and 4 using the code `<lower=1,upper=4>` given that we only have four arms.
The second Stan block is `parameters`, where we declare the parameters of the model that we want Stan to estimate. For **RL_single** these are $\alpha$ and $\beta$:
```{stan, eval=FALSE, output.var=""}
parameters {
real<lower = 0, upper = 1> alpha; // learning rate
real<lower = 0> beta; // inverse temperature
}
```
Both our parameters are `real`, i.e. continuous. We constrain $\alpha$ to be between 0 and 1, and $\beta$ to be positive.
The final block in the Stan file is the `model`. This essentially recreates the simulation contained in **RL_single**, but instead of generating choices and rewards from *probs* and *arm_means* respectively, we use the already-generated *choice* and *reward* data in *dat*. We also omit *output*, because that's already been generated.
```{stan, eval=FALSE, output.var=""}
model {
vector[4] Q_values; // Q values for each arm
vector[4] probs; // probabilities for each arm
// priors
alpha ~ beta(1,1);