Grid World

AI Simulator Platform

Grid World RL Lab

Interactive reinforcement learning sandbox with Q-Learning, Value Iteration, and Policy Iteration.

An agent moves on a 2D grid, receives rewards, and learns a policy that maximizes its return while navigating around walls toward terminal goal cells, where the episode ends.
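
A minimal sketch of such an environment follows. It assumes a deterministic grid in which moves into a wall or off the edge leave the agent in place, and entering a goal cell pays that cell's reward and ends the episode. The class name `GridWorld`, the action names, and the default step reward of -0.04 are hypothetical and not taken from the lab itself.

```python
from typing import Dict, List, Set, Tuple

State = Tuple[int, int]  # (row, col)

# Hypothetical action set: name -> (row delta, col delta)
ACTIONS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


class GridWorld:
    """Deterministic grid sketch: walls block moves, goal cells are terminal."""

    def __init__(self, rows: int, cols: int, walls: Set[State],
                 goals: Dict[State, float], step_reward: float = -0.04) -> None:
        self.rows, self.cols = rows, cols
        self.walls = walls
        self.goals = goals  # terminal cell -> reward received on entering it
        self.step_reward = step_reward  # reward for every non-terminal move

    def is_terminal(self, s: State) -> bool:
        return s in self.goals

    def states(self) -> List[State]:
        # Every cell that is not a wall, terminal cells included.
        return [(r, c) for r in range(self.rows) for c in range(self.cols)
                if (r, c) not in self.walls]

    def step(self, s: State, action: str) -> Tuple[State, float, bool]:
        dr, dc = ACTIONS[action]
        nxt = (s[0] + dr, s[1] + dc)
        # Moving off the grid or into a wall leaves the agent in place.
        if not (0 <= nxt[0] < self.rows and 0 <= nxt[1] < self.cols) or nxt in self.walls:
            nxt = s
        if nxt in self.goals:
            return nxt, self.goals[nxt], True
        return nxt, self.step_reward, False
```

For example, with `env = GridWorld(3, 4, walls={(1, 1)}, goals={(0, 3): 1.0, (1, 3): -1.0})`, calling `env.step((0, 2), "right")` returns `((0, 3), 1.0, True)`, while `env.step((0, 1), "down")` bumps into the wall and returns `((0, 1), -0.04, False)`.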

  • Q-Learning: model-free temporal-difference update with epsilon-greedy exploration.
  • Value Iteration: Bellman optimality backups over all states.
  • Policy Iteration: alternating policy evaluation and policy improvement.
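
The Q-Learning bullet can be made concrete with a tabular sketch. It assumes a hypothetical deterministic 3x4 layout (one wall, a +1 goal, zero step reward) and illustrative hyperparameters (`alpha=0.5`, `gamma=0.9`, `epsilon=0.1`); none of these values come from the lab. Each step applies the standard update Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).

```python
import random
from collections import defaultdict
from typing import Dict, Tuple

State = Tuple[int, int]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Hypothetical 3x4 layout: one wall, one terminal goal worth +1, zero step reward.
ROWS, COLS = 3, 4
WALLS = {(1, 1)}
GOAL, GOAL_REWARD = (0, 3), 1.0
START = (2, 0)


def step(s: State, a: str) -> Tuple[State, float, bool]:
    """Deterministic move; blocked moves leave the agent where it is."""
    dr, dc = ACTIONS[a]
    nxt = (s[0] + dr, s[1] + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALLS:
        nxt = s
    if nxt == GOAL:
        return nxt, GOAL_REWARD, True
    return nxt, 0.0, False


def q_learning(episodes: int = 1000, alpha: float = 0.5, gamma: float = 0.9,
               epsilon: float = 0.1, max_steps: int = 200,
               seed: int = 0) -> Dict[Tuple[State, str], float]:
    rng = random.Random(seed)
    q: Dict[Tuple[State, str], float] = defaultdict(float)  # unseen pairs start at 0

    def greedy(s: State) -> str:
        best = max(q[(s, a)] for a in ACTIONS)
        # Break ties at random so all-zero early estimates do not favor one action.
        return rng.choice([a for a in ACTIONS if q[(s, a)] == best])

    for _ in range(episodes):
        s = START
        for _ in range(max_steps):
            # Epsilon-greedy: a random action with probability epsilon, else greedy.
            a = rng.choice(list(ACTIONS)) if rng.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            # TD target: reward plus discounted best next value (no bootstrap at a terminal).
            target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
            if done:
                break
    return q
```

After training, `max(ACTIONS, key=lambda a: q[((0, 2), a)])` should pick `"right"`, the move into the goal. Random tie-breaking in `greedy` is a deliberate choice here: with every estimate at zero, always taking the first action would send the early agent into the top edge over and over.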

Policy & Values

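Policy arrows show the greedy action in each cell: the highest-valued action from the stored Q-values (Q-Learning), or the action that maximizes reward plus discounted next-state value under the known model (Value Iteration and Policy Iteration).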
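
One way to turn stored values into arrows is a per-cell argmax. The `policy_arrows` helper below is hypothetical: it assumes Q-values keyed by `(state, action)` pairs and breaks ties toward the first action in `ARROWS`. For Value Iteration or Policy Iteration, the same rendering applies after computing Q(s, a) = r + gamma * V(s') from the model.

```python
from typing import Dict, List, Set, Tuple

State = Tuple[int, int]
# Arrow glyph per action; dict order also fixes tie-breaking (first max wins).
ARROWS = {"up": "↑", "down": "↓", "left": "←", "right": "→"}


def policy_arrows(q: Dict[Tuple[State, str], float], rows: int, cols: int,
                  walls: Set[State], goals: Set[State]) -> List[str]:
    """Hypothetical helper: one text row per grid row, '#' for walls, 'G' for goals."""
    lines: List[str] = []
    for r in range(rows):
        cells = []
        for c in range(cols):
            s = (r, c)
            if s in walls:
                cells.append("#")
            elif s in goals:
                cells.append("G")
            else:
                # Greedy action = argmax_a Q(s, a); missing entries are treated as 0.
                best = max(ARROWS, key=lambda a: q.get((s, a), 0.0))
                cells.append(ARROWS[best])
        lines.append("".join(cells))
    return lines
```

For instance, `policy_arrows({((0, 0), "right"): 0.9, ((1, 0), "up"): 0.5}, 2, 2, walls={(1, 1)}, goals={(0, 1)})` returns `["→G", "↑#"]`.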


Algorithm Notes

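  • Q-Learning updates action values online, after every step, toward bootstrapped targets: the observed reward plus the discounted value of the best next action.
  • Value Iteration repeats Bellman optimality backups until the state values stop changing, then reads off the greedy policy.
  • Policy Iteration alternates policy evaluation and policy improvement until the policy no longer changes.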
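
Both model-based methods can be sketched on a known, deterministic model. The layout (3x4 grid, one wall, a +1 goal), `GAMMA = 0.9`, the threshold `theta`, and the all-`"up"` starting policy are assumptions for illustration. Value Iteration applies the backup V(s) <- max_a [r + gamma * V(s')] until the largest change falls below `theta`; Policy Iteration evaluates the current policy with the expectation backup, then switches each state to its greedy action, stopping when no state changes.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Hypothetical known model: 3x4 grid, one wall, a terminal goal worth +1 on entry.
ROWS, COLS = 3, 4
WALLS = {(1, 1)}
GOALS = {(0, 3): 1.0}
GAMMA = 0.9
STATES: List[State] = [(r, c) for r in range(ROWS) for c in range(COLS)
                       if (r, c) not in WALLS]


def transition(s: State, a: str) -> Tuple[State, float]:
    """Deterministic model: blocked moves stay put; entering a goal pays its reward."""
    dr, dc = ACTIONS[a]
    nxt = (s[0] + dr, s[1] + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALLS:
        nxt = s
    return nxt, GOALS.get(nxt, 0.0)


def q_value(v: Dict[State, float], s: State, a: str) -> float:
    # Terminal cells keep V = 0, so entering one contributes only its reward.
    nxt, r = transition(s, a)
    return r + GAMMA * v[nxt]


def greedy(v: Dict[State, float], s: State) -> str:
    return max(ACTIONS, key=lambda a: q_value(v, s, a))


def value_iteration(theta: float = 1e-8) -> Tuple[Dict[State, float], Dict[State, str]]:
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in GOALS:
                continue
            best = max(q_value(v, s, a) for a in ACTIONS)  # Bellman optimality backup
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    return v, {s: greedy(v, s) for s in STATES if s not in GOALS}


def policy_iteration(theta: float = 1e-8) -> Tuple[Dict[State, float], Dict[State, str]]:
    v = {s: 0.0 for s in STATES}
    policy = {s: "up" for s in STATES if s not in GOALS}
    while True:
        # Policy evaluation: repeat the Bellman expectation backup for the fixed policy.
        while True:
            delta = 0.0
            for s, a in policy.items():
                new = q_value(v, s, a)
                delta = max(delta, abs(new - v[s]))
                v[s] = new
            if delta < theta:
                break
        # Policy improvement: switch to the greedy action only if it is strictly better.
        stable = True
        for s in policy:
            best = greedy(v, s)
            if q_value(v, s, best) > q_value(v, s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return v, policy
```

On this layout both functions should return approximately the same values, for example V = 0.9**4 = 0.6561 at the bottom-left corner, which is five moves from the goal. Replacing an action only when the greedy one is strictly better keeps tied actions from flipping back and forth, so the improvement loop terminates.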