
Reinforcement learning introduction: Frozen Lake example

The Frozen Lake example is a classic reinforcement learning problem: the task is to teach an agent to navigate a frozen lake and reach the goal without falling through the ice.


Description:

The Frozen Lake environment is a grid of frozen ice, holes, and a goal. Here we use a 4x4 grid, which gives us 16 STATES; each cell represents a state.

The agent starts in the top-left corner cell of the grid and can take one of four actions at each time step, so we have 4 ACTIONS:

  • move up
  • move down
  • move left
  • move right

The goal is to reach the goal cell in the bottom-right corner of the grid without falling through any holes. Therefore, our REWARDS are as follows (a short code sketch of this scheme follows the list):

  • If the agent's current state is a hole, the reward is -1
  • If the agent's current state is the goal, the reward is +1
  • Otherwise, the reward is 0
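
The same scheme can be written as a small standalone function. This is only an illustrative sketch; reward_for is a hypothetical helper name, and the FrozenLake class below implements the identical logic inside its step method.

def reward_for(state, goal_state, hole_states):
    # Hypothetical helper mirroring the reward scheme listed above
    if state == goal_state:
        return 1   # reached the goal
    if state in hole_states:
        return -1  # fell into a hole
    return 0       # any other frozen cell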

Import packages and set up your environment

import numpy as np
import random

First, we define the FrozenLake environment as a class, with methods for resetting the environment, taking actions, rendering the current state, and showing the current Q-table and policy.

class FrozenLake:
    def __init__(self, size=4):
        self.size = size
        self.grid = np.zeros((size, size), dtype=int)  # 0 = frozen cell, 1 = hole
        self.start_state = (0, 0)  # top-left corner (row, column)
        self.goal_state = (3, 3)   # bottom-right corner (row, column)
        self.hole_states = [(1, 1), (2, 3), (3, 0)]
        for i, j in self.hole_states:
            self.grid[i][j] = 1
    
    def reset(self):
        self.current_state = self.start_state
        return self.current_state
    
    def step(self, action):  # actions are encoded as integers: 0 = up, 1 = down, 2 = left, 3 = right
        i, j = self.current_state
        if action == 0: # move up
            i = max(i-1, 0)
        elif action == 1: # move down
            i = min(i+1, self.size-1)
        elif action == 2: # move left
            j = max(j-1, 0)
        elif action == 3: # move right
            j = min(j+1, self.size-1)
        
        self.current_state = (i, j)
        
        # Reward scheme: +1 for reaching the goal, -1 for falling into a hole, 0 otherwise
        if self.current_state == self.goal_state:
            reward = 1   # reached the goal
            done = True
        elif self.current_state in self.hole_states:
            reward = -1  # fell into a hole
            done = True
        else:
            reward = 0   # still on a frozen cell
            done = False
        
        return self.current_state, reward, done


    def render(self):  # print the grid: S = agent, G = goal, X = hole, . = frozen cell
        print('\n')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    if (i, j) == self.current_state:
                        print('S', end=' ')
                    elif (i, j) == self.goal_state:
                        print('G', end=' ')
                    else:
                        print('.', end=' ')
                elif self.grid[i][j] == 1:
                    if (i, j) == self.current_state:
                        print('S', end=' ')
                    else:
                        print('X', end=' ')
            print()
        print()
    
    
    def show_q_table(self, q_table):  # one row per cell: Q-values for UP, DOWN, LEFT, RIGHT

        print('-----------------------------------------------------------------')
        print('Q-Table:')
        print('-----------------------------------------------------------------')

        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    print( '%.2f' % q_table[i][j][0], end='\t')
                    print('%.2f' % q_table[i][j][1], end='\t')
                    print('%.2f' % q_table[i][j][2], end='\t')
                    print('%.2f' % q_table[i][j][3])
                else:
                    print('NULL', end='\t')
                    print('NULL', end='\t')
                    print('NULL', end='\t')
                    print('NULL')
            print()


    # Show the policy (the greedy action in each cell) on one text line
    def show_policy(self, q_table):
        print('\n Policy:')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    action = np.argmax(q_table[i][j])
                    if action == 0:
                        print('UP', end=' ')
                    elif action == 1:
                        print('DOWN', end=' ')
                    elif action == 2:
                        print('LEFT', end=' ')
                    elif action == 3:
                        print('RIGHT', end=' ')
                else:
                    print('STAY', end=' ')

Next, we create an instance of the environment and initialize the Q-table with zeros.

env = FrozenLake()
q_table = np.zeros((env.size, env.size, 4))
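
Before training, it can help to sanity-check the environment with a few random actions. This is only an illustrative sketch using the class defined above; the actions are random, so your exact output will vary.

state = env.reset()
env.render()
for _ in range(3):
    action = random.randint(0, 3)  # pick one of the four actions at random
    state, reward, done = env.step(action)
    print('action:', action, 'state:', state, 'reward:', reward, 'done:', done)
    if done:
        break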

Hyperparameters

We then set the hyperparameters for the Q-learning algorithm: the number of episodes to run, the maximum number of steps per episode, the learning rate, the discount factor, the starting exploration rate (epsilon), the minimum exploration rate, and the rate at which epsilon decays over time.

# Hyperparameters for Q-learning
num_episodes = 2000
max_steps_per_episode = 10
learning_rate = 0.05
discount_factor = 0.99

epsilon = 1.0
min_epsilon = 0.01
epsilon_decay_rate = 0.001
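
With epsilon_decay_rate = 0.001, epsilon shrinks by a factor of 0.999 per episode. A quick check using the values above (a minimal sketch) shows that after 2000 episodes epsilon is still roughly 0.135, well above min_epsilon, so the agent keeps exploring throughout training:

eps = 1.0
for _ in range(num_episodes):
    eps = max(min_epsilon, eps * (1 - epsilon_decay_rate))
print(eps)  # approximately 0.135 after 2000 episodes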

We define an epsilon-greedy policy for selecting actions, which chooses a random action with probability epsilon or the greedy action (i.e., the action with the highest Q-value) with probability 1 - epsilon.

# Define epsilon-greedy policy
def epsilon_greedy_policy(state):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)
    else:
        return np.argmax(q_table[state[0]][state[1]])

We train the agent by running a loop over the specified number of episodes. In each episode, we start by resetting the environment and selecting actions according to the epsilon-greedy policy. We then update the Q-values for the current state-action pair using the Q-learning update rule. Finally, we update the current state and repeat until the episode ends (either because the agent reaches the goal or exceeds the maximum number of steps).
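
For reference, the update applied inside the loop is the standard Q-learning rule, written here with the variable names used in the code:

Q(s, a) ← Q(s, a) + learning_rate * (reward + discount_factor * max_a' Q(s', a') - Q(s, a))

where s is the current state, a the chosen action, s' the next state, and max_a' Q(s', a') is the largest Q-value over the four actions in the next state.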

We decay the exploration rate (epsilon) after each episode to gradually shift the agent from exploration to exploitation over time.

Periodically, we render the current state of the environment and display the current Q-table and policy for visualization.

for episode in range(num_episodes):
    state = env.reset()
    done = False
    t = 0
    while not done and t < max_steps_per_episode:

        action = epsilon_greedy_policy(state)

        next_state, reward, done = env.step(action)

        # Q-learning update: move Q(s, a) toward reward + discount_factor * max_a' Q(s', a')
        q_table[state[0]][state[1]][action] += learning_rate * (reward + discount_factor * np.max(q_table[next_state[0]][next_state[1]]) - q_table[state[0]][state[1]][action])

        state = next_state

        t += 1

    epsilon = max(min_epsilon, epsilon * (1 - epsilon_decay_rate))

    # Show progress
    if episode % 1000 == 0:
        env.render()
        env.show_q_table(q_table)
        env.show_policy(q_table)

. . . . 
. S . . 
. . . X 
X . . G 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.00	0.00	0.00	0.00
0.00	-0.05	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00

0.00	0.00	0.00	0.00
NULL	NULL	NULL	NULL
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00

0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00


Policy:
UP UP UP UP UP STAY UP UP UP UP UP STAY STAY UP UP UP 

. . . . 
. S . . 
. . . X 
X . . G 

-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.00	0.00	0.00	0.00
0.00	-1.00	0.00	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00

0.00	0.00	0.00	-1.00
NULL	NULL	NULL	NULL
0.00	0.00	-0.64	0.00
0.00	-0.19	0.00	0.00

0.00	-0.74	0.00	0.00
-0.26	0.00	0.00	0.00
0.00	0.00	0.00	-0.05
NULL	NULL	NULL	NULL

NULL	NULL	NULL	NULL
0.00	0.00	-0.10	0.00
0.00	0.00	0.00	0.00
0.00	0.00	0.00	0.00


Policy:
UP UP UP UP UP STAY UP UP UP DOWN UP STAY STAY UP UP UP 

Once training is complete, we test the agent by running a loop until it reaches the goal or exceeds the maximum number of steps. In each step, we select the greedy action (i.e., the action with the highest Q-value), update the current state, and render the environment for visualization.

# Test agent
state = env.reset()
done = False
t = 0
while not done and t < max_steps_per_episode:  # cap the steps so the test loop terminates even if the greedy policy gets stuck
    action = np.argmax(q_table[state[0]][state[1]])
    next_state, reward, done = env.step(action)
    env.render()
    state = next_state
    t += 1

S . . . 
. X . . 
. . . X 
X . . G 



S . . . 
. X . . 
. . . X 
X . . G 


