Use digital Reinforcement Learning to learn to balance an inverted pendulum
on a cart simulated on a Model-1 by
Analog Paradigm (http://analogparadigm.com)
Done from July to September 2019. Analog part by vaxman, digital part by sy2002. Download the paper on arXiv.org. Our paper was published in Andrew Adamatzky's book Handbook of Unconventional Computing (World Scientific Publishing).
Watch a short introduction to this experiment on YouTube by clicking the image below or by using this link: https://youtu.be/jDGLh8YWvNE
One of the classical "Hello World" examples of Reinforcement Learning is the inverted pendulum: a pole is mounted on a cart, and the cart can move in one dimension, to the left or to the right. The center of mass is located above the pivot point, as shown in the following image:
A force is applied to push the cart to the left or to the right so that the pendulum stays upright. The important parameters of this system are the position of the cart, the velocity of the cart, the angle of the pole and the angular velocity of the pole.
For training your reinforcement learning algorithms, OpenAI Gym offers a simulation of a simplified version of this model that delivers exactly these four parameters for your control algorithm and that expects that you keep applying impulses from the left or from the right to keep the pole upright.
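If you want to see the purely digital version of this environment in action, a minimal random-action loop looks roughly like the sketch below. It assumes the classic pre-0.26 Gym API (where `env.step` returns four values), which was current at the time of this experiment; the environment name `CartPole-v1` is the standard Gym registration, not something defined by our scripts.

```python
# Minimal OpenAI Gym CartPole loop with random actions (pre-0.26 Gym API).
# The observation is exactly the four parameters mentioned above:
# cart position, cart velocity, pole angle, angular velocity.
import gym

env = gym.make("CartPole-v1")
observation = env.reset()
for _ in range(200):
    action = env.action_space.sample()                  # 0 = push left, 1 = push right
    observation, reward, done, info = env.step(action)  # four floats come back in 'observation'
    if done:                                            # pole fell over or cart left the track
        observation = env.reset()
env.close()
```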
In our experiment, we used a Model-1 from Analog Paradigm to create an analog version of OpenAI Gym's simulation. Here is our setup:
You can see Model-1 on the right side. The algorithm (shown below) is wired on the Model-1 using the black, blue, yellow and red cables. On top of Model-1, there is an oscilloscope that shows the output of some training episodes of the reinforcement learning algorithm. On the left is a Mac, on which a Python script runs the reinforcement learning and which uses Model-1's Hybrid Controller to send commands from the digital computer to the analog computer.
The oscilloscope on the right shows the control actions of the Python script: a bar in the middle means "no control action, just calculating and learning", a bar at the top means "push the cart to the right", and a bar at the bottom means "push the cart to the left".
Programming an analog computer is quite different from programming a classic stored-program digital computer, as there is no algorithm controlling the operation of an analog computer. Instead, an analog computer program is a schematic describing how the various computing elements (such as summers, integrators, multipliers, coefficient potentiometers etc.) have to be interconnected in order to form an electronic model, an analogue, of the problem being investigated.
At first sight this looks (and feels) quite weird, but it turns out that it is much easier to program an analog computer than a digital computer, as there is no need to think about numerical algorithms for tasks like integration. Furthermore, an analog computer is much more energy efficient than a digital computer, since all of its computing elements work in parallel without any need for slow memory lookups. In addition, the signals on which an analog computer is based are continuous voltages or currents rather than binary signals, so the super-linear increase in power consumption that is typical for digital computers does not apply to analog computers.
Historical background on analog computers can be found in this great book by Bernd Ulmann.
A modern introduction to programming analog and hybrid computers can be found in this textbook.
The schematic shown below is the analog computer setup for simulating an inverted pendulum. The mathematical derivation can be found in this paper.
The picture below shows the subprogram controlling the moving cart by means of a hybrid controller, which couples the analog computer with a stored-program digital computer via a USB interface. D0 and D1 denote two digital outputs of this controller and control two electronic switches by which the cart can be pushed to the left or to the right.
analog-cartpole.py represents the digital part of this experiment. It is a Python 3 script that uses good old serial communication to control the analog computer. To make this possible, the Model-1 analog computer offers an analog/digital interface called the Hybrid Controller.
In our experiment, Model-1's Hybrid Controller is connected to a Mac using a standard USB cable. No driver or other additional software is necessary for the Mac and the Hybrid Controller to communicate. The reason this works so smoothly is that the Hybrid Controller uses an FTDI chip to implement the serial communication over USB, and Apple added native FTDI support starting with OS X Mavericks (10.9), as described in Technical Note TN2315. For Windows and Linux, have a look at the installation guides.
The bottom line is that, from the perspective of our Python 3 script, the analog computer can easily be controlled by sending serial commands to the Hybrid Controller using pySerial.
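As an illustration of what "sending serial commands" means in practice, here is a minimal pySerial sketch. The port name, baud rate and command byte are placeholders; the real command set is defined by the Hybrid Controller firmware and used in analog-cartpole.py.

```python
# Minimal pySerial sketch (placeholder port name, baud rate and command byte --
# not the Hybrid Controller's real protocol).
import serial

hc = serial.Serial(
    "/dev/cu.usbserial-XXXX",   # device name as it shows up on macOS (placeholder)
    baudrate=115200,            # assumed baud rate
    timeout=1
)

hc.write(b"?\n")                             # hypothetical command
response = hc.readline().decode().strip()    # read one line of response
print("Hybrid Controller says:", response)

hc.close()
```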
If you do not have an Analog Paradigm Model-1 analog computer available, then you can also run this experiment on any digital computer that supports Python 3 and OpenAI Gym. To do so, toggle the operation mode of analog-cartpole.py by setting the flag `SOFTWARE_ONLY` to `True`. When doing so, we advise you to also set `PERFORM_CALIBRATION` to `True`, as this yields much better results on OpenAI Gym's simulation.
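For illustration, the relevant settings could look like this near the top of analog-cartpole.py (a sketch; the exact surrounding code may differ):

```python
# Configuration flags in analog-cartpole.py (sketch):
SOFTWARE_ONLY       = True   # True = use OpenAI Gym's CartPole instead of the Model-1
PERFORM_CALIBRATION = True   # recommended when running in software-only mode
```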
We apply Reinforcement Learning to balance the CartPole. To do so, the following four features are fed into a Q-learning algorithm: cart position, cart velocity, pole angle and the angular velocity of the pole's tip.
As Q-learning is a model-free algorithm, it actually "does not matter for the algorithm" what the semantics of the four parameters/features mentioned above are. It "does not know" the "meaning" of a feature such as "cart position" or "angular velocity". For the Q-learning algorithm, the set of these features is just what the current State within the Environment is comprised of. And the State enables the Agent to decide which Action to perform next.
If you are new to Q-learning, then this simple introduction might be helpful. But in contrast to this simple example, the feature space of cart position, cart velocity, pole angle and angular velocity is pretty large: even after normalizing all of them to the interval [0, 1], each state consists of four continuous dimensions, which leads to an infinite number of possible states. Therefore a naive tabular or discrete approach to modeling the action value function Q(s, a) will be far from optimal, as the small sketch below illustrates.
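The following back-of-the-envelope sketch (our own illustration, not part of analog-cartpole.py) discretizes each of the four continuous features into a modest number of bins and counts the resulting table entries:

```python
# Illustration only: even a coarse discretization of the four continuous
# features explodes into a large Q-table, and the true state space is continuous.
import numpy as np

BINS = 20                                        # bins per feature
ACTIONS = 2                                      # push left / push right
q_table = np.zeros((BINS,) * 4 + (ACTIONS,))     # shape (20, 20, 20, 20, 2)
print(q_table.size)                              # 320000 entries to fill from experience
```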
- The current state `s` is equal to the observation obtained by querying the analog computer using the function `env_step(a)`, as shown here. It consists of the cart position, cart velocity, pole angle and angular velocity, represented as floating point numbers between 0 and 1.
- You can have an arbitrary number of possible actions on an analog computer by applying different forces for different durations to push the cart. We chose the following simplification: there are only two possible actions, "push the cart to the left" and "push the cart to the right". Both are implemented by applying a constant force for a constant period of time. The strength of the force is hard-wired on the analog computer and the constant period of time is configured in the variable `HC_IMPULSE_DURATION`. The function `hc_influence_sim` shows how the two possible actions `0` and `1` are translated into impulses that control the cart.
- We use Linear Regression to model the action value function Q. For each action there is one Linear Regression, so instead of having one action value function Q(s, a), we have two value functions Q_a=0(s) and Q_a=1(s). The reason for this is that value functions are complex beasts, and each simplification is welcome to increase the chance of fast and efficient learning.
- scikit-learn's Linear Regression class/algorithm `SGDRegressor` is used for the actual implementation. It conveniently offers a so-called `partial_fit` function that allows us to perform online learning. This is necessary, as the Q-learning algorithm itself is, among other things, based on Dynamic Programming principles and therefore cannot provide batch data. Instead, after each step in each episode, we can refine our model of the Q function by calling our update function `rl_set_Q_s_a`, which itself relies on `SGDRegressor`'s `partial_fit` function. Please note that `SGDRegressor` allows you to specify a learning rate (alpha) upon construction, so that the calculation of the Q update value can be simplified from `new_value = old_qsa + alpha * (r + GAMMA * max_q_s2a2 - old_qsa)` to `new_value = r + GAMMA * max_q_s2a2`.
- The two functions Q_a=0(s) and Q_a=1(s) are stored in the list `rbf_net`, which itself contains two instances of `SGDRegressor`. Therefore `Q_a=x(s) = rbf_net[x].predict(s)`.
- The CartPole challenge itself can be solved by merely performing a dot product of a vector of four wisely chosen values with the vector containing the four state variables (cart position, cart velocity, pole angle and angular velocity). But that is not the point of this experiment, which wants to show that a general-purpose machine learning algorithm like Q-learning can be trained efficiently using an analog computer performing the environment simulation. Therefore it needs to be stated that the value functions Q_a=0(s) and Q_a=1(s) are much more complex (as they contain predictions of future rewards) than a mere control policy for the CartPole. And this is why we need a model that is itself complex enough to match this complexity. Our experiments have shown that Linear Regression using just the four linear variables (i.e. the four state variables) underfits the data. So if four is not enough, how many are enough? Finding the right threshold for the model complexity is one of the many hyperparameters that need to be found and tuned in machine learning. A feature transformation from `s = (cart position, cart velocity, pole angle, angular velocity)` to `s_transformed = (f1, f2, f3, ... fn), n >> 4` is needed.
- We decided to use Radial Basis Functions (RBFs) for transforming `s` into `s_transformed`. An RBF is a function that maps a vector to a real number by calculating (usually) the Euclidean distance from the function's argument to a previously defined center, which is sometimes also called an exemplar. This distance is then used inside a kernel function to obtain the output value of the RBF. In our implementation we chose a Gaussian kernel function as described here.
- scikit-learn offers a class called `RBFSampler` that approximates the feature map of an RBF kernel by a Monte Carlo approximation of its Fourier transform. Our experiments have shown that `RBFSampler` is fast enough for our real-time requirements (after all, we are working with an analog computer) and that the approximation is good enough. This means that inside the function `rl_transform_s` the four state variables are transformed into a multitude of distances relative to RBF "exemplars" (aka "centers"). The number of these randomly chosen exemplars is specified by the variable `RBF_EXEMPLARS`. To increase the variance, we also use multiple shapes of the Gaussian functions, which can be specified in the constructor of `RBFSampler`. This is why the number of features that the original four features are transformed into is defined as the product of the two variables `RBF_EXEMPLARS` and `RBF_GAMMA_COUNT`. The code sketch below illustrates how the whole concept works.
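The sketch below reconstructs the idea under stated assumptions: one `RBFSampler` per Gaussian shape, combined via scikit-learn's `FeatureUnion`, feeding two `SGDRegressor` instances. The concrete values of `RBF_EXEMPLARS`, `RBF_GAMMA_COUNT` and the gamma list are illustrative and not taken from analog-cartpole.py; the dummy `partial_fit` calls merely initialize the regressors so that `predict` can be used before the first real update.

```python
# Sketch of the feature transformation described above (not a verbatim copy of
# analog-cartpole.py): RBF_EXEMPLARS exemplars are drawn for each of
# RBF_GAMMA_COUNT Gaussian shapes, so the four state variables are mapped to
# RBF_EXEMPLARS * RBF_GAMMA_COUNT features.
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor

RBF_EXEMPLARS = 250                    # number of random "exemplars" per sampler (illustrative)
RBF_GAMMA_COUNT = 4                    # number of different Gaussian shapes (illustrative)
gammas = [0.5, 1.0, 2.0, 4.0]          # one illustrative gamma value per shape

# one RBFSampler per gamma, concatenated into one long feature vector
transformer = FeatureUnion([
    ("rbf%d" % i, RBFSampler(gamma=g, n_components=RBF_EXEMPLARS, random_state=42))
    for i, g in enumerate(gammas)
])
transformer.fit(np.zeros((1, 4)))      # RBFSampler only needs to see the input dimension

def rl_transform_s(s):
    """Transform the 4-dimensional state into RBF_EXEMPLARS * RBF_GAMMA_COUNT features."""
    return transformer.transform(np.array(s).reshape(1, -1))

# one SGDRegressor per action, i.e. Q_a=0(s) and Q_a=1(s)
rbf_net = [SGDRegressor(learning_rate="constant", eta0=0.05) for _ in range(2)]
for net in rbf_net:                    # prime partial_fit with a dummy sample
    net.partial_fit(rl_transform_s([0, 0, 0, 0]), [0])
```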
The Q-learning itself is implemented pretty straightforwardly. This README.md does not contain an explanation of how Q-learning or Reinforcement Learning works, but focuses on the specific implementation choices we made. So the following list gives only a high-level view of our implementation of the Q-learning algorithm. More details can be found in Richard Sutton's and Andrew Barto's book.
- Given the current state `s`: decide, based on the probability `EPSILON`, whether we want to "exploit" (i.e. use) the currently learned Value Function to decide which action to take, or whether we'd like to "explore" other options (i.e. choose a random action).
- Perform the action `a`, get the reward `r` and enter state `s2`. (The state `s2` is the state that we arrive in when we perform action `a` in state `s`.)
- Get the next best action `a2`, given our current state `s2`, by greedily looking for the highest Value Function over all actions that are possible in `s2`.
- Learn: modify the Value Function for (`s`|`a`) by utilizing the highest possible rewards in the successor state `s2` after having performed action `a2`, and respect the discount factor `GAMMA` and the learning rate `ALPHA`.
- If not done ("done" meaning that the pole has fallen or the whole experiment has run a predefined number of steps): repeat and go back to the first step.
The following code snippet is abbreviated code from analog-cartpole.py that implements the steps shown above. `env_step` modifies the environment by performing action `a`; in our case this means that the cart receives an impulse either from the left or from the right. `rl_max_Q_s` looks for the next best action, given a certain state, and returns that action together with the respective value of the Value Function.
```python
while not done:
    # epsilon-greedy: do a random move with the decayed EPSILON probability
    # (in the full script, p is a random number drawn from [0, 1) in each iteration)
    if p > (1 - eps):
        a = np.random.choice(env_actions)

    # exploit or explore and collect reward
    observation, r, done = env_step(a)
    s2 = observation

    # Q-Learning
    # if this is not the terminal position (which has no actions), then we can proceed as normal
    if not done:
        # "Q-greedily" grab the gain of the best step forward from here
        a2, max_q_s2a2 = rl_max_Q_s(s2)
    else:
        a2 = a
        max_q_s2a2 = 0  # G (future rewards) of terminal position is 0 although the reward r is high

    # learn the new value for the Value Function
    new_value = r + GAMMA * max_q_s2a2
    rl_set_Q_s_a(s, a, new_value)

    # next state = current state, next action is the "Q-greedy" best action
    s = s2
    a = a2
```
A last note that is not necessarily relevant for understanding how the algorithm works, but that might still be interesting: despite its name, RBFSampler actually does not use any RBFs internally. You can find some experiments and explanations here to learn more about this.
The following table shows how the Python script running on a standard laptop was able to leverage the real-time simulation of the analog computer to learn quickly. After 120 episodes, for example, the mean number of steps was already at 4,674 and the maximum number of steps at 19,709, which is the equivalent of more than 15 minutes of successful balancing in a single episode.
Episode | Mean Steps | Median Steps | Min. Steps | Max. Steps | Epsilon |
---|---|---|---|---|---|
0 | 82.00 | 82.00 | 82 | 82 | 0.4545 |
20 | 81.75 | 61.00 | 34 | 217 | 0.4167 |
40 | 170.10 | 138.50 | 48 | 420 | 0.3846 |
60 | 351.80 | 256.50 | 48 | 1398 | 0.3571 |
80 | 633.35 | 477.00 | 95 | 2421 | 0.3333 |
100 | 589.35 | 387.00 | 39 | 2945 | 0.3125 |
120 | 4674.80 | 2825.00 | 14 | 19709 | 0.2941 |
140 | 3608.60 | 1708.00 | 105 | 14783 | 0.2778 |
160 | 4826.75 | 1839.50 | 27 | 23965 | 0.2632 |
180 | 3121.75 | 820.50 | 57 | 17782 | 0.2381 |
200 | 2181.90 | 1281.00 | 33 | 9920 | 0.2273 |
220 | 4216.50 | 1993.50 | 29 | 26387 | 0.2174 |
240 | 3867.75 | 446.50 | 22 | 25592 | 0.2083 |
260 | 6535.00 | 713.00 | 121 | 66848 | 0.2000 |
280 | 4120.25 | 737.50 | 38 | 39596 | 0.1923 |
300 | 4531.75 | 647.50 | 67 | 28202 | 0.1852 |
320 | 5513.95 | 1434.50 | 24 | 33491 | 0.1786 |
340 | 6595.30 | 1885.00 | 46 | 43789 | 0.1724 |
The key to success was to minimize the latency introduced by the serial communication between the digital and the analog computer, so that the Q-learning algorithm did not have to learn a lot about handling latency in addition to learning how to balance the pole. This was done by leveraging the Hybrid Controller's "bulk mode", which is able to transfer the whole state (cart position, cart velocity, pole angle, angular velocity) in one go.
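The following hypothetical sketch contrasts the two approaches; the command bytes and the response format are placeholders, not the Hybrid Controller's real protocol, but they illustrate why one round trip per step is so much cheaper than four.

```python
# Hypothetical illustration of the latency argument (placeholder protocol).

def read_state_naive(hc):
    """Four separate request/response round trips -- four times the serial latency."""
    state = []
    for cmd in (b"x", b"v", b"t", b"w"):           # placeholder commands, one per variable
        hc.write(cmd)
        state.append(float(hc.readline()))
    return state

def read_state_bulk(hc):
    """One 'bulk mode' round trip returning all four values in a single line."""
    hc.write(b"g")                                 # placeholder bulk-read command
    line = hc.readline().decode().strip()          # e.g. "0.12;-0.03;0.01;0.20" (assumed format)
    return [float(value) for value in line.split(";")]
```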
The result table also shows that episode 260 yielded a very good "brain" and that after episode 260 there might have been overfitting or other reasons for decreased performance. The Python script's "enhanced persistence mode" (activated by the command line parameter `-SP`) helps to find and store the right cutoff point for the learning, because it stores the current state of the "brain" every few episodes.