
Reinforcement learning introduction: Frozen Lake example

The Frozen Lake example is a classic reinforcement learning problem, where the goal is to teach an agent to navigate a frozen lake and reach the goal without falling through the ice.


In this example, we use the Q-learning algorithm to train an agent to navigate the FrozenLake environment. The Q-learning algorithm works by learning a Q-table, which is a table that maps each state-action pair to a value that represents the expected future reward of taking that action in that state. The Q-table is initialized to zeros, and is updated over time based on the rewards that the agent receives for taking actions in different states.

The training process involves repeatedly running episodes, where each episode consists of the agent taking actions in the environment until it reaches the goal or falls in a hole. During each time step of an episode, the agent selects an action based on the current state and the Q-table, and then takes that action and observes the resulting reward and next state. The Q-table is then updated based on the observed reward and the expected future reward of the next state, according to the Q-learning update rule.
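For reference, the update applied at every time step is the standard Q-learning rule, where $\alpha$ is the learning rate (lr in the code below) and $\gamma$ is the discount factor:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here $r$ is the observed reward and $s'$ is the next state; this is exactly the expression used in the training code of both versions below.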

Once the agent has been trained, we test its performance on a set of test episodes. During each test episode, the agent takes actions in the environment using the Q-table that was learned during training, and we observe whether it is able to reach the goal or not.

Overall, the Frozen Lake example is a simple but illustrative example of how reinforcement learning can be used to train an agent to navigate an environment and accomplish a task.



Description:

The Frozen Lake environment is a grid of frozen ice, holes, and a goal. Here we use a (4×4) grid, which means we have (16 STATES); each cell represents a state.

The agent starts at the top-left corner cell of the grid and can take one of four actions at each time step, so we have (4 ACTIONS):

  • move up
  • move down
  • move left
  • move right

The objective is to reach the goal cell in the bottom-right corner of the grid without falling through any holes. Therefore, our REWARDS are:

  • If the agent's current state is a hole, the reward = -1
  • If the agent's current state is the goal state, the reward = +1
  • Otherwise, the reward = 0

[Figure: the 4×4 Frozen Lake grid with the start cell, frozen cells, holes, and the goal]
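Before jumping into the code, here is a tiny, self-contained sketch (for illustration only, not part of either version below) of how the 16 states and this reward rule can be encoded in Python. The hole positions are the ones used by the custom environment in Version 2, and the row-major state numbering matches how Gym indexes the grid. Note that the built-in Gym environment used in Version 1 returns a reward of 0 (not -1) when the agent falls into a hole; the -1 penalty is implemented in the custom environment of Version 2.

# Illustration only: row-major state numbering for the 4x4 grid and the reward
# rule described above (hole positions taken from Version 2 below)
GRID_SIZE = 4
GOAL = (GRID_SIZE - 1, GRID_SIZE - 1)   # bottom-right cell
HOLES = {(1, 1), (2, 3), (3, 0)}        # hole cells (Version 2 layout)

def state_index(row, col):
    # Map a (row, col) cell to a state number 0..15, row by row
    return row * GRID_SIZE + col

def reward(cell):
    # -1 for a hole, +1 for the goal, 0 for any other frozen cell
    if cell in HOLES:
        return -1
    if cell == GOAL:
        return 1
    return 0

print(state_index(3, 3))   # 15 (the goal is the last of the 16 states)
print(reward((1, 1)))      # -1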

WE'RE GOING TO DO TWO VERSIONS OF THE FROZEN LAKE EXAMPLE:


Version [1]:

In this version we use 'Gym' to simplify the code.

Import packages and set up your environment

This code imports the FrozenLake environment from the OpenAI Gym library and creates an instance of the environment.

import gym
import numpy as np
import matplotlib.pyplot as plt


# Create the FrozenLake environment
env = gym.make('FrozenLake-v1')
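# Note: FrozenLake-v1 is 'slippery' by default (is_slippery=True), so actions have
# stochastic outcomes; this is why the learned policy will not succeed every time.
# This notebook also assumes the classic Gym API (gym < 0.26), where reset() returns
# the initial state and step() returns (state, reward, done, info).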

This code initializes the Q-table to zeros. The Q-table is a matrix where the rows represent the possible states of the environment and the columns represent the possible actions that the agent can take.

# Initialize the Q-table to zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])

Hyperparameters 📈

# Set hyperparameters
lr = 0.8  # learning rate
gamma = 0.95  # discount factor
num_episodes = 2000  # number of training episodes

TRAIN YOUR AGENT 🤖 ❄

This code trains the agent using Q-learning. During training, the agent interacts with the environment by selecting actions based on the Q-table and updating the Q-table based on the observed reward. The hyperparameters lr and gamma control how much the agent values immediate rewards versus future rewards. The num_episodes parameter controls how many times the agent interacts with the environment.

# Keep track of the total reward for each episode 
rewards = np.zeros(num_episodes)

# Train the agent using Q-learning
for i in range(num_episodes):
    # Reset the environment for each episode
    s = env.reset()
    done = False
    while not done:
        # Choose an action based on the Q-table, with some random noise
        a = np.argmax(Q[s,:] + np.random.randn(1, env.action_space.n)*(1./(i+1)))
        
        # Take the chosen action and observe the next state and reward
        s_new, r, done, _ = env.step(a)
        
        # Update the Q-table based on the observed reward
        Q[s,a] = Q[s,a] + lr*(r + gamma*np.max(Q[s_new,:]) - Q[s,a])
        
        
        # Add the reward to the total reward for the episode
        rewards[i] += r
        
        # Set the current state to the next state
        s = s_new

TEST YOUR AGENT 🧪

This code tests the agent on 100 episodes after training. During testing, the agent chooses actions based on the Q-table and tries to reach the goal state. The code keeps track of the number of successful episodes (where the agent reaches the goal state) and prints the success rate at the end.

# Test the agent on 100 episodes 
num_successes = 0
for i in range(100):
    s = env.reset()
    done = False
    while not done:
        # Choose an action based on the Q-table
        a = np.argmax(Q[s,:])
        s, r, done, _ = env.step(a)
    if r == 1:
        num_successes += 1

# Print the success rate
print("Success rate:", num_successes/100)

Success rate: 0.54

NOW LET'S PLOT THE LEARNING 🖌


# Calculate the rolling average of rewards 
rolling_avg_rewards = np.zeros(num_episodes)
window_size = 100
for i in range(num_episodes):
    rolling_avg_rewards[i] = np.mean(rewards[max(0,i-window_size+1):(i+1)])


# Plot the total rewards and rolling average rewards
fig, ax = plt.subplots(2, 1, figsize=(8,8))
ax[0].plot(rewards)
ax[0].set_xlabel('Episode')
ax[0].set_ylabel('Total reward')
ax[1].plot(rolling_avg_rewards)
ax[1].set_xlabel('Episode')
ax[1].set_ylabel(f'Rolling average reward (window size {window_size})')

Text(0, 0.5, 'Rolling average reward (window size 100)')

[Figure: total reward per episode (top) and rolling average reward over a 100-episode window (bottom)]



Version [2]


LET'S GET INTO MORE DETAIL:


Import packages

import numpy as np
import random

First, we define the FrozenLake environment as a class, with methods for resetting the environment, taking actions, rendering the current state, and showing the current Q-table and policy.

class FrozenLake:
    def __init__(self, size=4):
        self.size = size
        self.grid = np.zeros((size, size), dtype=int)
        self.start_state = (0, 0)
        self.goal_state = (size-1, size-1)
        self.hole_states = [(1, 1), (2, 3), (3, 0)]
        for i, j in self.hole_states:
            self.grid[i][j] = 1
    
    def reset(self):
        self.current_state = self.start_state
        return self.current_state
    
    def step(self, action):
        i, j = self.current_state
        if action == 0: # move up
            i = max(i-1, 0)
        elif action == 1: # move down
            i = min(i+1, self.size-1)
        elif action == 2: # move left
            j = max(j-1, 0)
        elif action == 3: # move right
            j = min(j+1, self.size-1)
        
        self.current_state = (i, j)
        
        if self.current_state == self.goal_state:
            reward = 1
            done = True
        elif self.current_state in self.hole_states:
            reward = -1
            done = True
        else:
            reward = 0
            done = False
        
        return self.current_state, reward, done


    # Print the grid to the console, marking the agent's current position
    def render(self):
        print('\n')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    if (i, j) == self.current_state:
                        print('S', end=' ')
                    elif (i, j) == self.goal_state:
                        print('G', end=' ')
                    else:
                        print('.', end=' ')
                elif self.grid[i][j] == 1:
                    if (i, j) == self.current_state:
                        print('S', end=' ')
                    else:
                        print('X', end=' ')
            print()
        print()
    
    # Print the Q-table of all state-action values
    def show_q_table(self, q_table):

        print('-----------------------------------------------------------------')
        print('Q-Table:')
        print('-----------------------------------------------------------------')

        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    print( '%.2f' % q_table[i][j][0], end='\t')
                    print('%.2f' % q_table[i][j][1], end='\t')
                    print('%.2f' % q_table[i][j][2], end='\t')
                    print('%.2f' % q_table[i][j][3])
                else:
                    print('NULL', end='\t')
                    print('NULL', end='\t')
                    print('NULL', end='\t')
                    print('NULL')
            print()


    # Print the greedy policy (the best action for each non-hole cell) on one line
    def show_policy(self, q_table):
        print('\n Policy:')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    action = np.argmax(q_table[i][j])
                    if action == 0:
                        print('UP', end=' ')
                    elif action == 1:
                        print('DOWN', end=' ')
                    elif action == 2:
                        print('LEFT', end=' ')
                    elif action == 3:
                        print('RIGHT', end=' ')
                else:
                    print('STAY', end=' ')

Next, we create an instance of the environment and initialize the Q-table with zeros.

# Create the environment
env = FrozenLake()

# Initialize Q-table with zeros
q_table = np.zeros((env.size, env.size, 4))

Hyperparameters 📈

We then set some hyperparameters for the Q-learning algorithm, such as the number of episodes to run, the maximum number of steps per episode, the learning rate, the discount factor, the starting exploration rate (epsilon), the minimum exploration rate, and the rate at which epsilon decays over time.

# Set hyperparameters
num_episodes = 10000
max_steps_per_episode = 100
learning_rate = 0.1
discount_factor = 0.99
epsilon = 1.0
min_epsilon = 0.01
epsilon_decay_rate = 0.001

We define an epsilon-greedy policy for selecting actions, which chooses a random action with probability epsilon or the greedy action (i.e., the action with the highest Q-value) with probability 1 - epsilon.
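Written out as a formula, with $\epsilon$ denoting the exploration rate, the action chosen in state $s$ is:

$$a = \begin{cases} \text{a random action} & \text{with probability } \epsilon \\ \arg\max_{a'} Q(s, a') & \text{with probability } 1 - \epsilon \end{cases}$$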


# Define epsilon-greedy policy
def epsilon_greedy_policy(state):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)
    else:
        return np.argmax(q_table[state[0]][state[1]])

We train the agent by running a loop over the specified number of episodes. In each episode, we start by resetting the environment and selecting actions according to the epsilon-greedy policy. We then update the Q-values for the current state-action pair using the Q-learning update rule. Finally, we update the current state and repeat until the episode ends (either because the agent reaches the goal or exceeds the maximum number of steps).

We decay the exploration rate (epsilon) after each episode to gradually shift the agent from exploration to exploitation over time.

Periodically, we render the current state of the environment and display the current Q-table and policy for visualization.

# Train agent
for episode in range(num_episodes):
    state = env.reset()
    done = False
    t = 0
    while not done and t < max_steps_per_episode:
        action = epsilon_greedy_policy(state)
        next_state, reward, done = env.step(action)
        q_table[state[0]][state[1]][action] += learning_rate * \
            (reward + discount_factor * np.max(q_table[next_state[0]][next_state[1]]) - q_table[state[0]][state[1]][action])
        state = next_state
        t += 1
    epsilon = max(min_epsilon, epsilon * (1 - epsilon_decay_rate))

    # Show progress
    if episode % 1000 == 0:
        env.render()
        env.show_q_table(q_table)
        env.show_policy(q_table)
. . . . 
. S . . 
. . . X 
X . . G 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00

0.00	0.00	0.00	-0.10
NULL	NULL	NULL	NULL
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00

0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00


Policy:
UP UP UP UP UP STAY UP UP UP UP UP STAY STAY UP UP UP 

. . . . 
. X . . 
. . . X 
X . . S 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.94	0.93	0.94	0.95
0.95	-1.00	0.94	0.96
0.96	0.97	0.95	0.95
0.84	0.85	0.96	0.82

0.94	0.58	0.85	-1.00
NULL	NULL	NULL	NULL
0.96	0.98	-1.00	0.96
0.65	-0.95	0.97	0.80

0.47	-1.00	0.32	0.81
-0.95	0.72	0.45	0.98
0.97	0.99	0.96	-1.00
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.58	0.71	-0.81	0.98
0.98	0.99	0.94	1.00
0.00	0.00	0.00	0.00


Policy:
RIGHT RIGHT DOWN LEFT UP STAY DOWN LEFT RIGHT RIGHT DOWN STAY STAY RIGHT RIGHT UP 

. . . . 
. X . . 
. . . X 
X . . S 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.94	0.93	0.94	0.95
0.95	-1.00	0.94	0.96
0.96	0.97	0.95	0.95
0.88	0.91	0.96	0.90

0.94	0.69	0.88	-1.00
NULL	NULL	NULL	NULL
0.96	0.98	-1.00	0.96
0.81	-0.96	0.97	0.90

0.56	-1.00	0.37	0.91
-0.95	0.79	0.66	0.98
0.97	0.99	0.97	-1.00
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.65	0.84	-0.88	0.99
0.98	0.99	0.98	1.00
0.00	0.00	0.00	0.00


Policy:
RIGHT RIGHT DOWN LEFT UP STAY DOWN LEFT RIGHT RIGHT DOWN STAY STAY RIGHT RIGHT UP 

. . . . 
. X . . 
. . . X 
X . . S 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.94	0.93	0.94	0.95
0.95	-1.00	0.94	0.96
0.96	0.97	0.95	0.95
0.89	0.91	0.96	0.90

0.94	0.73	0.89	-1.00
NULL	NULL	NULL	NULL
0.96	0.98	-1.00	0.96
0.81	-0.96	0.97	0.90

0.56	-1.00	0.37	0.92
-0.96	0.81	0.69	0.98
0.97	0.99	0.97	-1.00
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.68	0.84	-0.89	0.99
0.98	0.99	0.98	1.00
0.00	0.00	0.00	0.00


Policy:
RIGHT RIGHT DOWN LEFT UP STAY DOWN LEFT RIGHT RIGHT DOWN STAY STAY RIGHT RIGHT UP 

. . . . 
. X . . 
. . . X 
X . . S 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.94	0.93	0.94	0.95
0.95	-1.00	0.94	0.96
0.96	0.97	0.95	0.95
0.89	0.91	0.96	0.91

0.94	0.73	0.89	-1.00
NULL	NULL	NULL	NULL
0.96	0.98	-1.00	0.96
0.81	-0.96	0.97	0.90

0.56	-1.00	0.37	0.92
-0.96	0.81	0.69	0.98
0.97	0.99	0.97	-1.00
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.68	0.84	-0.89	0.99
0.98	0.99	0.98	1.00
0.00	0.00	0.00	0.00


Policy:
RIGHT RIGHT DOWN LEFT UP STAY DOWN LEFT RIGHT RIGHT DOWN STAY STAY RIGHT RIGHT UP 

(The rendered grid, Q-table, and policy printed at episodes 5000 through 9000 are identical to the episode-4000 output above, i.e., the Q-values have stopped changing.)

Once training is complete, we test the agent by running a loop until the agent reaches the goal or exceeds the maximum number of steps. In each step, we select the greedy action (i.e., the action with the highest Q-value) and update the current state. We render the environment at each step for visualization.


# Test agent
state = env.reset()
done = False
while not done:
    action = np.argmax(q_table[state[0]][state[1]])
    next_state, reward, done = env.step(action)
    env.render()
    state = next_state
. S . . 
. X . . 
. . . X 
X . . G 



. . S . 
. X . . 
. . . X 
X . . G 



. . . . 
. X S . 
. . . X 
X . . G 



. . . . 
. X . . 
. . S X 
X . . G 



. . . . 
. X . . 
. . . X 
X . S G 



. . . . 
. X . . 
. . . X 
X . . S