❄️ Frozen Lake
Frozen Lake is a simple environment composed of tiles, where the AI has to move from an initial tile to a goal.
Tiles can be a safe frozen lake ✅, or a hole ❌ that gets you stuck forever.
The AI, or agent, has 4 possible actions: go ◀️ LEFT, 🔽 DOWN, ▶️ RIGHT, or 🔼 UP.
The agent must learn to avoid holes in order to reach the goal in a minimal number of actions.
Required Libraries
import gymnasium as gym
import random
import numpy as np
from IPython.display import Image
import matplotlib.pyplot as plt
Initialize the Environment
env = gym.make("FrozenLake-v1", is_slippery=False)  # in the non-slippery version, actions always move the agent in the intended direction
env.reset()
env.render()
- S: starting point, safe
- F: frozen surface, safe
- H: hole, stuck forever
- G: goal, safe
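To see which letter sits on which tile, we can print the map layout stored on the underlying environment. This is a minimal sketch, assuming the default 4x4 map and the desc attribute exposed by the toy-text implementation:

# Print the letter grid referred to by the legend above (default 4x4 map)
for row in env.unwrapped.desc:
    print(" ".join(cell.decode() for cell in row))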
= "FrozenLake.gif", width=400) Image(filename
<IPython.core.display.Image object>
= "Final.gif", width=400) Image(filename
<IPython.core.display.Image object>
Reward
Reward schedule:
- Reach goal (G): +1
- Reach hole (H): 0
- Reach frozen surface (F): 0
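To confirm the schedule above, we can walk a known safe path to the goal and print the reward at every step; only the final step into G should return +1. This is a minimal sketch, assuming the default 4x4 map and the non-slippery setting (demo_env is a throwaway copy so the main env is left untouched):

# Follow a safe path on the default map: DOWN, DOWN, RIGHT, RIGHT, DOWN, RIGHT
demo_env = gym.make("FrozenLake-v1", is_slippery=False)
state, info = demo_env.reset()
for action in [1, 1, 2, 2, 1, 2]:
    state, reward, done, truncated, info = demo_env.step(action)
    print(f"state={state}, reward={reward}, done={done}")
demo_env.close()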
Size of Action and State Space
print("State space: ", env.observation_space.n)
print("Action space: ", env.action_space.n)
State space: 16
Action space: 4
In Frozen Lake, there are 16 tiles, which means our agent can be found in 16 different positions, called states.
For each state, there are 4 possible actions:
- ◀️ LEFT: 0
- 🔽 DOWN: 1
- ▶️ RIGHT: 2
- 🔼 UP: 3
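As a quick sanity check of this encoding (a minimal sketch, assuming the non-slippery setting), taking action 2 from the start state should move the agent one tile to the right:

state, info = env.reset()
new_state, reward, done, truncated, info = env.step(2)  # 2 = RIGHT
print(state, "->", new_state)  # expected: 0 -> 1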
Initialize Q Table
= "QTable.gif", width=400) Image(filename
<IPython.core.display.Image object>
# Our table has the following dimensions:
# (rows x columns) = (states x actions) = (16 x 4)
nb_states = env.observation_space.n   # = 16
nb_actions = env.action_space.n       # = 4
qtable = np.zeros((nb_states, nb_actions))
# Let's see how it looks
print('Q-table =')
print(qtable)
Q-table =
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
Update Formula
$ Q_{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right) $
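The same update can be written as a small standalone function, which makes the temporal-difference arithmetic easy to inspect. This is a minimal sketch (q_update is a hypothetical helper, not part of the training loop below):

def q_update(qtable, state, action, reward, new_state, alpha, gamma):
    # TD target: immediate reward plus discounted value of the best next action
    td_target = reward + gamma * np.max(qtable[new_state])
    # TD error: how far the current estimate is from the target
    td_error = td_target - qtable[state, action]
    # Move the estimate a fraction alpha of the way towards the target
    return qtable[state, action] + alpha * td_error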
Epsilon-Greedy Algorithm
In this method, we want to allow our agent to either:
- Take the action with the highest value (exploitation);
- Choose a random action to try to find even better ones (exploration). A small sketch of this selection rule follows the figure below.
= "tradeoff.gif", width=700) Image(filename
<IPython.core.display.Image object>
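Written as code, the selection rule is just a coin flip against epsilon. This is a minimal sketch (epsilon_greedy is a hypothetical helper; the training loop below inlines the same logic):

def epsilon_greedy(qtable, state, epsilon, env):
    if np.random.random() < epsilon:
        return env.action_space.sample()   # explore: pick a random action
    return np.argmax(qtable[state])        # exploit: pick the best known action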
Hyperparameters
episodes = 1000        # Total number of episodes
alpha = 0.5            # Learning rate
gamma = 0.9            # Discount factor
epsilon = 1.0          # Amount of randomness in the action selection
epsilon_decay = 0.001  # Fixed amount by which epsilon is decreased each episode
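With this linear decay, epsilon drops from 1.0 (pure exploration) towards 0 (pure exploitation) over the course of training. A minimal sketch of the schedule, assuming the values above:

# Epsilon value at the start of selected episodes, under the linear decay above
for episode in [0, 250, 500, 750, 999]:
    print(episode, round(max(1.0 - 0.001 * episode, 0), 3))
# prints: 0 1.0 / 250 0.75 / 500 0.5 / 750 0.25 / 999 0.001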
Training
# List of outcomes to plot
outcomes = []

for _ in range(episodes):
    state, info = env.reset()
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Generate a random number between 0 and 1
        rnd = np.random.random()

        # If random number < epsilon, take a random action
        if rnd < epsilon:
            action = env.action_space.sample()
        # Else, take the action with the highest value in the current state
        else:
            action = np.argmax(qtable[state])

        # Implement this action and move the agent in the desired direction
        new_state, reward, done, _, info = env.step(action)

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
                                alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

    # Update epsilon
    epsilon = max(epsilon - epsilon_decay, 0)
Updated Q Table
print('===========================================')
print('Q-table after training:')
print(qtable)
===========================================
Q-table after training:
[[0.531441 0.59049 0.59048919 0.531441 ]
[0.531441 0. 0.65609983 0.56212481]
[0.50422703 0.72899998 0.12149802 0.61759207]
[0.45242846 0. 0. 0. ]
[0.59049 0.6561 0. 0.531441 ]
[0. 0. 0. 0. ]
[0. 0.81 0. 0.60486928]
[0. 0. 0. 0. ]
[0.6561 0. 0.729 0.59049 ]
[0.6561 0.81 0.81 0. ]
[0.72894994 0.9 0. 0.72001722]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]
[0. 0.81 0.9 0.729 ]
[0.81 0.9 1. 0.81 ]
[0. 0. 0. 0. ]]
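One way to read this table is to extract the greedy policy it implies: for each state, pick the action with the highest value. This is a minimal sketch (action_names is a hypothetical display mapping, not part of the original notebook); note that all-zero rows, which correspond to holes and the goal, default to LEFT under argmax:

action_names = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}
best_actions = np.argmax(qtable, axis=1).reshape(4, 4)   # 4x4 grid of best actions
for row in best_actions:
    print(" ".join(f"{action_names[a]:>5}" for a in row))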
Plot Outcomes
plt.figure(figsize=(12, 5))
plt.xlabel("Run number")
plt.ylabel("Outcome")
ax = plt.gca()
ax.set_facecolor('gainsboro')
plt.bar(range(len(outcomes)), outcomes, color="navy", width=1.0)
plt.show()
Evaluation
episodes = 1
nb_success = 0

state, info = env.reset()
env.render()
done = False

# Until the agent gets stuck or reaches the goal, keep acting greedily
while not done:
    # Choose the action with the highest value in the current state
    action = np.argmax(qtable[state])

    # Implement this action and move the agent in the desired direction
    new_state, reward, done, _, info = env.step(action)

    # Render the environment
    print()
    env.render()

    # Update our current state
    state = new_state

    # When we get a reward, it means we solved the game
    nb_success += reward

# Let's check our success rate!
print()
print(f"Success rate = {nb_success / episodes * 100}%")
Success rate = 100.0%
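A single episode is enough here because the environment is deterministic, but the same greedy policy can also be scored over many episodes. This is a minimal sketch (eval_episodes is a hypothetical name, not from the original notebook); it only becomes interesting with is_slippery=True:

eval_episodes = 100
successes = 0
for _ in range(eval_episodes):
    state, info = env.reset()
    done = False
    while not done:
        # Always act greedily with respect to the learned Q-table
        action = np.argmax(qtable[state])
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated   # stop on goal, hole, or step limit
    successes += reward
print(f"Success rate over {eval_episodes} episodes: {successes / eval_episodes * 100}%")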