Commit

Improve CB tutorial text
Summary: Improve CB tutorial text

Reviewed By: Yonathae

Differential Revision: D57520857

fbshipit-source-id: 7f779f43fb50dad36bc13355a0c9cb32d57b2bf9
rodrigodesalvobraz authored and facebook-github-bot committed May 20, 2024
1 parent ce13690 commit 75324b2
Showing 1 changed file with 21 additions and 38 deletions.
59 changes: 21 additions & 38 deletions tutorials/contextual_bandits/contextual_bandits_tutorial.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"id": "8NNfwWXGvn_o"
},
@@ -24,7 +24,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"id": "1uLHbYlegKX-",
"colab": {
@@ -177,7 +177,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {
"id": "vcb70ZC_h3OA"
},
@@ -218,16 +218,16 @@
},
"source": [
"## Load Environment\n",
"The environment which underlies the experiments to follow is a contextual bandits environment we added to Pearl that allows to use UCI datasets (https://archive.ics.uci.edu/datasets) to built environmens to test contextual bandits algorithms.\n",
"The environment which underlies the experiments to follow is a contextual bandit environment we added to Pearl that allows us to use UCI datasets (https://archive.ics.uci.edu/datasets).\n",
"\n",
"The UCI datasets span a wide variety of prediction tasks. We use these tasks to construct a contexual bandit environment in which an agent recives an expected reward of 1 if it correctly label a data point and 0 otherwise. Pearl library currently support the following datasets: pendigits, letter, satimage, yeast. Additional ones can be readily added.\n",
"The UCI datasets span a wide variety of prediction tasks. We use these tasks to construct a contexual bandit environment in which an agent receives an expected reward of 1 if it correctly labels a data point and 0 otherwise. Pearl currently supports the following datasets: pendigits, letter, satimage, yeast. Additional ones can be readily added.\n",
"\n",
"In the following experiment will test different types of contextual bandits algorithms on the pendigits UCI dataset."
"In the following experiment we will test different types of contextual bandits algorithms on the pendigits UCI dataset."
]
},
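The collapsed code cell that follows constructs this environment. As a rough, non-authoritative illustration of what such a cell can look like, here is a minimal sketch; the class name `SLCBEnvironment`, its import path, and the constructor arguments (`path_filename`, `reward_noise_sigma`, `multi_label`) are assumptions and may differ from the notebook's actual code.

```python
# Hedged sketch: build a contextual bandit environment from the UCI "pendigits"
# dataset. Import path, class name, and constructor arguments are assumptions;
# the collapsed notebook cell and the Pearl repository are the authoritative source.
from pearl.utils.instantiations.environments.contextual_bandit_uci_environment import (
    SLCBEnvironment,
)

env = SLCBEnvironment(
    path_filename="pendigits.tra",  # hypothetical local path to the downloaded UCI data
    reward_noise_sigma=0.0,         # assumed argument: optional noise added to the 0/1 reward
    multi_label=False,              # assumed argument: exactly one correct label per data point
)
print(env.action_space)  # one arm per class label (10 digits for pendigits)
```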
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {
"id": "g1VHtmldi3A2"
},
@@ -263,38 +263,28 @@
"id": "UYIoDAGSLNGR"
},
"source": [
"\n",
"\n",
"```\n",
"# This is formatted as code\n",
"```\n",
"\n",
"## Contextual Bandits learners\n",
"The following code sections show how to implement the neural versions of SquareCB, LinUCB and LinTS with the Pearl library.\n",
"The following sections show how to implement the neural versions of SquareCB, LinUCB and LinTS with Pearl.\n",
"\n",
"## Contextual Bandits learners: SquareCB\n",
"\n",
"The SquareCB algorithm requires only a regression model with which it learns the reward function. Given the reward model, SquareCB executes the following policy.\n",
"The SquareCB algorithm requires only a regression model with which it learns the reward function. Given the reward model, SquareCB executes the following policy:\n",
"$$\n",
"\\widehat{a}_*\\in \\arg\\max_a\\widehat{r}(x,a)\\\\\n",
"\\widehat{r}_*\\in \\max_a\\widehat{r}(x,a)\\\\\n",
"\\text{If $a\\neq \\widehat{a}_*$}: \\pi(a,x)= \\frac{1}{A + \\gamma (\\widehat{r}_* - \\widehat{r}(x,a))}\\\\\n",
"\\text{If $a= \\widehat{a}_*$}: \\pi(a,x) = 1-\\sum_{a'\\neq \\widehat{a}_*}\\pi(a',x).\n",
"$$\n",
"This exploratiative policy, that balances exploration and exploitation in an intelligent way.\n",
"This policy balances exploration and exploitation in an intelligent way.\n",
"\n",
"To use the SquareCB algrorithm in Pearl we set the policy learner as NeuralBandit. NeuralBandit is a base class and supports the estimation of the reward function with a neural architecture. With access to an estimated reward model, we then instantiate the exploration module with SquareCBExploration module.\n",
"To use the SquareCB algrorithm in Pearl we set the policy learner as `NeuralBandit`. `NeuralBandit` is class supportings the estimation of the reward function with a neural architecture. With access to an estimated reward model, we then use an instance of `SquareCBExploration` as an exploration module.\n",
"\n",
"To further highlight the versatility of the modular design of Pearl, we use the OneHotActionTensorRepresentationModule as the action representation module. Namley, when the action set is a finite number of elementes\n",
"$\n",
"\\{1,2,.,,N\\}\n",
"$\n",
"as a one-hot vector.\n"
"To further highlight the versatility of the modular design of Pearl, we use the `OneHotActionTensorRepresentationModule` as the action representation module. This module internally converts actions from integers to one-hot-encoded vectors.\n"
]
},
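Before the tutorial's own (collapsed) code cell, here is a minimal sketch of the construction just described, assuming import paths and constructor arguments (`feature_dim`, `hidden_dims`, `learning_rate`, `gamma`, `max_number_actions`) that may not match the actual cell.

```python
# Hedged sketch: a SquareCB agent in Pearl. Import paths and argument names are
# assumptions; consult the collapsed notebook cell below for the exact code.
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.contextual_bandits.neural_bandit import NeuralBandit
from pearl.policy_learners.exploration_modules.contextual_bandits.squarecb_exploration import (
    SquareCBExploration,
)
from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)

num_actions = 10      # pendigits has 10 classes (digits 0-9)
observation_dim = 16  # pendigits feature dimension; adjust to the dataset actually loaded

agent = PearlAgent(
    policy_learner=NeuralBandit(
        feature_dim=observation_dim + num_actions,  # context concatenated with one-hot action
        hidden_dims=[64, 16],                       # assumed reward-network architecture
        learning_rate=0.01,
        exploration_module=SquareCBExploration(gamma=10.0),  # gamma as in the policy above
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions
        ),
    ),
)
```

The same one-hot action representation module is reused in the LinUCB and LinTS sketches below; only the policy learner and exploration module change.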
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -573,22 +563,22 @@
"source": [
"## Contextual Bandits learners: LinUCB\n",
"\n",
"Next, we describe how to use the neural version of the LinUCB algorithm with the Pearl library, which uses UCB type of exploration with neural architectures. The LinUCB and its neural version, are generalizations of the seminal Upper Confidence Bound (UCB) algorithm. Both execute a policy of the following form:\n",
"Next, we describe how to use the neural version of the LinUCB algorithm with Pearl, which uses UCB type of exploration with neural architectures. LinUCB and its neural version are generalizations of the seminal Upper Confidence Bound (UCB) algorithm. Both execute a policy of the following form:\n",
"$$\n",
"\\pi(a,x) \\in \\arg\\max_a \\widehat{r}(x,a) + \\mathrm{score}(x,a),\n",
"$$\n",
"namely, both uses a function that estimates the expected reward with an additional bonus term, that quantifies the potential of choosing an action given a certain context. A common way to estimate the score function, in the linear case, when the features are $\\phi(x,a)$ is via:\n",
"that is, both use a function that estimates the expected reward with an additional bonus term that quantifies the potential of choosing an action given a certain context. A common way to estimate the score function in the linear case with features $\\phi(x,a)$ is:\n",
"$$\n",
"\\mathrm{score}(x,a) = \\alpha ||\\phi(x,a) ||_{A^{-1}}\\\\\n",
"\\text{where } A= \\lambda I + \\sum_{n\\leq t} \\phi(x_n,a_n)\\phi^T(x_n,a_n).\n",
"$$\n",
"\n",
"To implement the LinUCB algorithm in Pearl, use the NeuralLinearBandit policy learner module. This module supports (i) learning a reward model, and, (ii) calculates a score function by estimating the uncertainty using the last layer features. Further, set the exploration module to be UCBExploration and set the alpha hyper-parameters to enable the agent with the UCB-like update rule.\n"
"To implement the LinUCB algorithm in Pearl, use the `NeuralLinearBandit` policy learner module. This module supports (i) learning a reward model, and (ii) calculating a score function by estimating the uncertainty using the last layer features. Further, we set the exploration module to an instance of `UCBExploration` and set the `alpha` hyper-parameter to enable the agent with the UCB-like update rule.\n"
]
},
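As with SquareCB, the following is only a sketch under assumed import paths and argument names (`feature_dim`, `hidden_dims`, `alpha`); the collapsed cell below contains the tutorial's actual code.

```python
# Hedged sketch: a neural LinUCB agent in Pearl. Paths and argument names are assumptions.
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.contextual_bandits.neural_linear_bandit import NeuralLinearBandit
from pearl.policy_learners.exploration_modules.contextual_bandits.ucb_exploration import (
    UCBExploration,
)
from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)

agent = PearlAgent(
    policy_learner=NeuralLinearBandit(
        feature_dim=16 + 10,   # pendigits context dim + one-hot action dim (assumed)
        hidden_dims=[64, 16],  # assumed neural feature extractor feeding the linear head
        learning_rate=0.01,
        exploration_module=UCBExploration(alpha=1.0),  # alpha scales the bonus score(x, a)
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=10
        ),
    ),
)
```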
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {
"id": "cDauzO74nS4c",
"colab": {
@@ -868,16 +858,16 @@
"id": "QNUmNO77LNGR"
},
"source": [
"## Contextual Bandits learners: LinTS\n",
"## Contextual Bandits learners: Linear Thompson Sampling\n",
"\n",
"Lastly, we describe how to use the neural version of the LinTS algorithm with the Pearl library, namely, the algorithm which uses Thompson sampling exploration with neural architectures. The LinTS sampling is closely related to the LinUCB algorithm, with a key modification that often improves its convergence in practice: sample the score function from a probability, instead of fixing it determinstically. Practically, this often reduces over-exploring arms, since the score may be smaller than in the LinUCB algorithm.\n",
"Lastly, we describe how to use the neural version of the Linear Thompson Sampling (LinTS) algorithm with Pearl. The algorithm which uses Thompson sampling exploration with neural architectures. The LinTS sampling is closely related to the LinUCB algorithm, with a key modification that often improves its convergence in practice: sample the score function from a probability, instead of fixing it determinstically. Practically, this often reduces the over-exploring of arms, since the score may be smaller than in the LinUCB algorithm.\n",
"\n",
"To implement the LinTS algorithm in Pearl, use the NeuralLinearBandit policy learner module. Further, set the exploration module to be ThompsonSamplingExplorationLinear. This enables the agent to sample the score based on its estimated uncertainty, rather to fix it as in LinUCB algorithm.\n"
"To implement the LinTS algorithm in Pearl, use the `NeuralLinearBandit` policy learner module combined with an exploration module of type `ThompsonSamplingExplorationLinear`. This enables the agent to sample the score based on its estimated uncertainty, rather than to fix it as in LinUCB algorithm.\n"
]
},
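Again a hedged sketch under assumed names: the only change relative to the LinUCB sketch is the exploration module, which samples the score instead of computing a deterministic bonus.

```python
# Hedged sketch: a neural LinTS agent in Pearl. Paths and argument names are assumptions.
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.contextual_bandits.neural_linear_bandit import NeuralLinearBandit
from pearl.policy_learners.exploration_modules.contextual_bandits.thompson_sampling_exploration import (
    ThompsonSamplingExplorationLinear,
)
from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)

agent = PearlAgent(
    policy_learner=NeuralLinearBandit(
        feature_dim=16 + 10,   # pendigits context dim + one-hot action dim (assumed)
        hidden_dims=[64, 16],
        learning_rate=0.01,
        # The score is sampled from the posterior implied by the last-layer features,
        # rather than fixed as in the LinUCB sketch above.
        exploration_module=ThompsonSamplingExplorationLinear(),
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=10
        ),
    ),
)
```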
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"id": "_7Cpzoi3nVAw",
"colab": {
@@ -1150,13 +1140,6 @@
"## Summary\n",
"In this example, we showed how to use popular contextual bandits algorithms in Pearl."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ANo74OTbLNGS"
},
"source": []
}
],
"metadata": {
