Implementing FrozenLake-v1 by following examples #618
-
I'm new to reinforcement learning. I'm trying to solve the FrozenLake-v1 game using the gymnasium learning environment (formerly OpenAI Gym) and BindsNet, a library for simulating spiking neural networks with PyTorch. I've gone over the examples provided by BindsNet, mainly BreakoutDeterministic-v4 and SpaceInvaders-v0. I understand that when using a DQN, the number of neurons in the input layer should map to the observation space, while the number of neurons in the output layer should map to the action space. I've followed the RL examples for Breakout and SpaceInvaders and made changes for my requirements (the number of neurons and the shape of the input and output layers).
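For context, a quick check of the FrozenLake-v1 spaces that these layer sizes are based on; the gymnasium calls are standard, but the sizes below assume the default 4x4 map:

```python
import gymnasium as gym

# FrozenLake-v1 on the default 4x4 map: 16 discrete states, 4 discrete actions.
env = gym.make("FrozenLake-v1")
print(env.observation_space)  # Discrete(16) -> 16 input neurons (one-hot state)
print(env.action_space)       # Discrete(4)  -> 4 output neurons (left/down/right/up)
```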
I also had to change the preprocess() function in BindsNet's environment.py: a condition for FrozenLake-v1 needed to be added so that the discrete observation is one-hot encoded.
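A minimal sketch of the one-hot encoding idea, assuming the default 16-state map (the helper name below is made up for illustration and is not BindsNet's actual preprocess() signature):

```python
import torch
import torch.nn.functional as F

def one_hot_obs(obs: int, n_states: int = 16) -> torch.Tensor:
    # FrozenLake returns a single integer state index; the SNN input layer
    # expects a vector, so encode the state index as a one-hot vector.
    return F.one_hot(torch.tensor(obs), num_classes=n_states).float()

# Example: state 5 -> tensor with a 1.0 at index 5 and 0.0 elsewhere.
x = one_hot_obs(5)
```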
After this, the algorithm runs without errors, but it doesn't seem to be learning. While debugging, I can see that the output layer returns 'S' as a 4x4 tensor. I'm confused about what the output should look like and what it represents in terms of Q-values. Since there are 4 output neurons, I would expect 4 outputs, with the one with the maximum value being the action to take. I'm also confused about how to get the Q-values; I think each observation should have 4 associated Q-values, one per action (so a 16 x 4 table). Based on my limited knowledge, and after going over the BindsNet documentation, I'm unable to figure out why my algorithm doesn't seem to be learning.

I'm also confused about why gymnasium specifies the reward for this game as either 1 or 0. How would we get an accumulated reward for an episode? Shouldn't the accumulated reward be affected by whether the agent reaches the goal in 5 steps vs. 10 steps? Any help would be really appreciated.
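To make the 5-steps-vs-10-steps point concrete, here is the kind of discounted-return calculation in question (gamma = 0.99 is an arbitrary illustrative value, not something FrozenLake prescribes):

```python
# Discounted return G = sum_t gamma^t * r_t. FrozenLake only gives r = 1 at the
# goal, so a shorter episode yields a larger discounted return.
gamma = 0.99  # illustrative discount factor

def discounted_return(rewards, gamma=gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

five_steps = [0, 0, 0, 0, 1]          # goal reached on step 5
ten_steps = [0] * 9 + [1]             # goal reached on step 10

print(discounted_return(five_steps))  # ~0.961
print(discounted_return(ten_steps))   # ~0.914
```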
-
At first glance, the code appears to be fine.

To clarify, the DQN example in BindsNET has two main steps. First, we train an artificial neural network (ANN) using DQN. Next, we copy the weights from the ANN (with slight modifications) to a spiking neural network (SNN) that has the same topology. Did you also train an ANN and then copy the weights in your code?

The other reinforcement learning (RL) code examples in BindsNET serve as proof of concept to demonstrate that the framework can utilize reward-modulated signals to change the weights, similar to the RL framework. However, this arrangement does not perform well.

Regarding your other question, if the weights are not changing, I suggest adding monitors to the network to plot the spike trains of the activity in each layer of the network, including the reward signal. Then, check if they are correlated. To change the weights, the spiking activity needs to be aligned, and only then will the STDP with the reward module be activated to modify the weights.

With regard to the desired output of gymnasium, each game has different controllers and scores.
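A minimal sketch of the kind of monitoring meant here, assuming `network` is your BindsNET Network instance and that you want to record the spike variable "s" of every layer; the simulation time is an assumed placeholder, and depending on the BindsNET version you may need to squeeze a batch dimension before plotting:

```python
from bindsnet.network.monitors import Monitor
from bindsnet.analysis.plotting import plot_spikes

time = 250  # simulation time per environment step (assumed value)

# Attach a monitor that records the spike state variable "s" of each layer.
for name, layer in network.layers.items():
    monitor = Monitor(obj=layer, state_vars=("s",), time=time)
    network.add_monitor(monitor, name=name)

# ... run the network on the environment, then inspect the recorded spikes ...
spikes = {name: network.monitors[name].get("s") for name in network.layers}
plot_spikes(spikes)
```

If the output-layer spike trains show no correlation with the reward signal, the reward-modulated STDP update has nothing to work with, which would explain weights that never change.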
-
Thank you for that clarification. I didn't train an ANN and copy the weights, so I will begin there. Thanks for the explanation and suggestions; things make more sense to me now, and I should be able to move forward. I appreciate you taking the time to respond to all my questions.