Reinforcement learning introduction - Challenge
Reinforcement learning introduction: Frozen Lake example
The Frozen Lake example is a classic reinforcement learning problem, where the goal is to teach an agent to navigate a frozen lake and reach the goal without falling through the ice.
Description:
The Frozen Lake environment is a grid of frozen ice, holes, and a goal. Here we use a 4x4 grid, which means we have 16 STATES; each cell represents a state.
The agent starts in the top-left corner cell of the grid and can take one of four actions at each time step, so we have 4 ACTIONS:
- move up
- move down
- move left
- move right
The goal is to reach the goal cell in the bottom-right corner of the grid without falling through any holes. Therefore, our REWARDS are (see the sketch just after this list):
- If the agent's current state is a hole state, the reward = -1
- If the agent's current state is the goal state, the reward = +1
- Otherwise, the reward = 0
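As a minimal sketch of this reward signal (the names GOAL, HOLES and reward_for are only illustrative; the environment class defined below builds the same logic into its step method):

GOAL = (3, 3)
HOLES = [(1, 1), (2, 3), (3, 0)]   # hole positions used in this example

def reward_for(state):
    if state == GOAL:
        return +1   # reached the goal
    if state in HOLES:
        return -1   # fell through a hole
    return 0        # any other step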
Import packages and set up your environment
import numpy as np
import random
First, we define the FrozenLake environment as a class, with methods for resetting the environment, taking actions, rendering the current state, and showing the current Q-table and policy.
class FrozenLake:
    def __init__(self, size=4):
        self.size = size
        self.grid = np.zeros((size, size), dtype=int)   # 0 = frozen ice, 1 = hole
        self.start_state = (0, 0)
        self.goal_state = (3, 3)                        # goal cell (row, column)
        self.hole_states = [(1, 1), (2, 3), (3, 0)]
        for i, j in self.hole_states:
            self.grid[i][j] = 1

    def reset(self):
        self.current_state = self.start_state
        return self.current_state

    def step(self, action):   # actions are encoded as integers 0-3
        i, j = self.current_state
        if action == 0:    # move up
            i = max(i - 1, 0)
        elif action == 1:  # move down
            i = min(i + 1, self.size - 1)
        elif action == 2:  # move left
            j = max(j - 1, 0)
        elif action == 3:  # move right
            j = min(j + 1, self.size - 1)

        self.current_state = (i, j)

        # Rewards
        if self.current_state == self.goal_state:
            reward = 1     # winning reward
            done = True
        elif self.current_state in self.hole_states:
            reward = -1    # losing reward for hole states
            done = True
        else:
            reward = 0     # any other step
            done = False

        return self.current_state, reward, done

    def render(self):
        print('\n')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    if (i, j) == self.current_state:
                        print('S', end=' ')
                    elif (i, j) == self.goal_state:
                        print('G', end=' ')
                    else:
                        print('.', end=' ')
                elif self.grid[i][j] == 1:
                    if (i, j) == self.current_state:
                        print('S', end=' ')
                    else:
                        print('X', end=' ')
            print()
        print()

    def show_q_table(self, q_table):
        print('-----------------------------------------------------------------')
        print('Q-Table:')
        print('-----------------------------------------------------------------')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    print('%.2f' % q_table[i][j][0], end='\t')
                    print('%.2f' % q_table[i][j][1], end='\t')
                    print('%.2f' % q_table[i][j][2], end='\t')
                    print('%.2f' % q_table[i][j][3])
                else:
                    print('NULL', end='\t')
                    print('NULL', end='\t')
                    print('NULL', end='\t')
                    print('NULL')
        print()

    # Show the policy (the greedy action in each state) on one text line
    def show_policy(self, q_table):
        print('\n Policy:')
        for i in range(self.size):
            for j in range(self.size):
                if self.grid[i][j] == 0:
                    action = np.argmax(q_table[i][j])
                    if action == 0:
                        print('UP', end=' ')
                    elif action == 1:
                        print('DOWN', end=' ')
                    elif action == 2:
                        print('LEFT', end=' ')
                    elif action == 3:
                        print('RIGHT', end=' ')
                else:
                    print('STAY', end=' ')
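As a quick, illustrative check of the environment API (demo_env is just a throwaway name; the training code below creates its own instance):

demo_env = FrozenLake()
print(demo_env.reset())   # (0, 0): the start state
print(demo_env.step(3))   # ((0, 1), 0, False): moved right, no reward yet
print(demo_env.step(1))   # ((1, 1), -1, True): stepped into a hole, episode ends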
Next, we create an instance of the environment and initialize the Q-table with zeros.
env = FrozenLake()
q_table = np.zeros((env.size, env.size, 4))
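For reference, the table is indexed as q_table[row][column][action], so each of the 16 grid cells stores four Q-values (one per action), all starting at zero:

print(q_table.shape)   # (4, 4, 4)
print(q_table[0][0])   # Q-values of the start state: [0. 0. 0. 0.]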
Hyperparameters
We then set some hyperparameters for the Q-learning algorithm, such as the number of episodes to run, the maximum number of steps per episode, the learning rate, the discount factor, the starting exploration rate (epsilon), the minimum exploration rate, and the rate at which epsilon decays over time.
# Set your hyperparameters
num_episodes = 2000
max_steps_per_episode = 10
learning_rate = 0.05
discount_factor = 0.99

epsilon = 1.0
min_epsilon = 0.01
epsilon_decay_rate = 0.001
We define an epsilon-greedy policy for selecting actions, which chooses a random action with probability epsilon or the greedy action (i.e., the action with the highest Q-value) with probability 1 - epsilon.
# Define epsilon-greedy policy
def epsilon_greedy_policy(state):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)                      # explore: random action
    else:
        return np.argmax(q_table[state[0]][state[1]])    # exploit: greedy action
We train the agent by running a loop over the specified number of episodes. In each episode, we start by resetting the environment and selecting actions according to the epsilon-greedy policy. We then update the Q-values for the current state-action pair using the Q-learning update rule. Finally, we update the current state and repeat until the episode ends (either because the agent reaches the goal or exceeds the maximum number of steps).
We decay the exploration rate (epsilon) after each episode to gradually shift the agent from exploration to exploitation over time.
Periodically, we render the current state of the environment and display the current Q-table and policy for visualization.
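The update applied inside the loop is the standard tabular Q-learning rule, with alpha = learning_rate and gamma = discount_factor:

Q(s, a) <- Q(s, a) + alpha * ( reward + gamma * max_a' Q(s', a') - Q(s, a) )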
for episode in range(num_episodes):
    state = env.reset()
    done = False
    t = 0
    while not done and t < max_steps_per_episode:
        action = epsilon_greedy_policy(state)

        next_state, reward, done = env.step(action)

        # Q-learning update for the current state-action pair
        q_table[state[0]][state[1]][action] += learning_rate * (
            reward
            + discount_factor * np.max(q_table[next_state[0]][next_state[1]])
            - q_table[state[0]][state[1]][action]
        )

        state = next_state
        t += 1

    # Decay the exploration rate after each episode
    epsilon = max(min_epsilon, epsilon * (1 - epsilon_decay_rate))

    # Show progress
    if episode % 1000 == 0:
        env.render()
        env.show_q_table(q_table)
        env.show_policy(q_table)
. . . .
. S . .
. . . X
X . . G
-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.00 0.00 0.00 0.00
0.00 -0.05 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
NULL NULL NULL NULL
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
NULL NULL NULL NULL
NULL NULL NULL NULL
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
Policy:
UP UP UP UP UP STAY UP UP UP UP UP STAY STAY UP UP UP
. . . .
. S . .
. . . X
X . . G
-----------------------------------------------------------------
Q-Table:
-----------------------------------------------------------------
0.00 0.00 0.00 0.00
0.00 -1.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 -1.00
NULL NULL NULL NULL
0.00 0.00 -0.64 0.00
0.00 -0.19 0.00 0.00
0.00 -0.74 0.00 0.00
-0.26 0.00 0.00 0.00
0.00 0.00 0.00 -0.05
NULL NULL NULL NULL
NULL NULL NULL NULL
0.00 0.00 -0.10 0.00
0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00
Policy:
UP UP UP UP UP STAY UP UP UP DOWN UP STAY STAY UP UP UP
Once training is complete, we test the agent by running a loop until it reaches a terminal state (the goal or a hole). In each step, we select the greedy action (i.e., the action with the highest Q-value) and update the current state. We render the environment at each step for visualization.
# Test agent
state = env.reset()
done = False
while not done:
    action = np.argmax(q_table[state[0]][state[1]])
    next_state, reward, done = env.step(action)

    env.render()
    state = next_state
S . . .
. X . .
. . . X
X . . G
S . . .
. X . .
. . . X
X . . G
(The same grid repeats from here on: the greedy action chosen in the start state keeps the agent in the top-left cell, so it never reaches the goal and the loop does not terminate.)
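The test loop above only stops when the agent reaches a terminal state, so it runs forever whenever the greedy policy keeps the agent away from both the goal and the holes, as in the output shown here. A minimal variant with a step cap (a sketch; reusing max_steps_per_episode as the bound is our choice, any limit works):

# Test agent with a step limit so the loop always terminates
state = env.reset()
done = False
steps = 0
while not done and steps < max_steps_per_episode:
    action = np.argmax(q_table[state[0]][state[1]])
    next_state, reward, done = env.step(action)

    env.render()
    state = next_state
    steps += 1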