Grid World RL Lab
Interactive reinforcement learning sandbox with Q-Learning, Value Iteration, and Policy Iteration.
An agent moves on a 2D grid, collects rewards, and learns a policy that maximizes return. Walls block movement, and reaching a terminal goal cell ends the episode.
- Q-Learning: model-free temporal-difference update with epsilon-greedy exploration.
- Value Iteration: Bellman optimality backups over all states.
- Policy Iteration: alternating policy evaluation and policy improvement.
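The Q-Learning bullet above can be made concrete with a minimal sketch. The names `epsilon_greedy` and `q_update`, the action labels, the two example cells, and the values `alpha=0.5` and `gamma=0.9` are all assumptions chosen for illustration. They are not taken from the lab's own code.

```python
import random

# Hypothetical action labels for a four-connected grid.
ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_row.values())
    # Ties are broken by the order of ACTIONS.
    return next(a for a in ACTIONS if q_row[a] == best)

def q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=0.9):
    """One temporal-difference update toward the bootstrapped target."""
    # Terminal transitions do not bootstrap from the next state.
    target = reward if done else reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]

# Two hypothetical cells; every action value starts at zero.
q = {s: {a: 0.0 for a in ACTIONS} for s in [(0, 0), (0, 1)]}

# Stepping right from (0, 1) reaches a terminal goal with reward 1.
first = q_update(q, (0, 1), "right", 1.0, None, done=True)
# Stepping right from (0, 0) earns nothing now but bootstraps from (0, 1).
second = q_update(q, (0, 0), "right", 0.0, (0, 1), done=False)
```

The second update shows the bootstrapping: the cell next to the goal has already learned a positive value, so the cell one step further away picks up a discounted share of it without ever seeing the reward directly.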
Policy & Values
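Policy arrows show the greedy action in each cell, taken from the stored Q-values (Q-Learning) or from the values computed by the model-based methods (Value Iteration, Policy Iteration).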
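As a rough illustration of how arrows could be read off a Q-table, the sketch below picks the highest-valued action per cell. The function `greedy_arrows`, the `#` and `G` markers, and the 2x2 example grid are hypothetical and only meant to show the idea.

```python
# Hypothetical mapping from action labels to display arrows.
ARROWS = {"up": "↑", "down": "↓", "left": "←", "right": "→"}

def greedy_arrows(q, rows, cols, walls=(), terminals=()):
    """Render one text line per grid row with the greedy action of each cell."""
    lines = []
    for r in range(rows):
        cells = []
        for c in range(cols):
            if (r, c) in walls:
                cells.append("#")   # wall marker (assumed)
            elif (r, c) in terminals:
                cells.append("G")   # terminal goal marker (assumed)
            else:
                row = q[(r, c)]
                # Greedy action: the key with the largest Q-value.
                cells.append(ARROWS[max(row, key=row.get)])
        lines.append(" ".join(cells))
    return lines

# Example Q-table for a 2x2 grid with one wall and one goal.
q = {
    (0, 0): {"up": 0.0, "down": 0.0, "left": 0.0, "right": 1.0},
    (1, 1): {"up": 1.0, "down": 0.0, "left": 0.0, "right": 0.0},
}
lines = greedy_arrows(q, 2, 2, walls={(1, 0)}, terminals={(0, 1)})
print("\n".join(lines))
```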
Algorithm Notes
- Q-Learning updates action values online with bootstrapped targets.
- Value Iteration computes optimal state values by repeating Bellman optimality backups until the values stop changing.
- Policy Iteration alternates policy evaluation and policy improvement until the policy stops changing.
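The two model-based methods in the notes above can be compared on a tiny example. The sketch assumes a deterministic 2x3 grid with one wall, a single goal, a reward of 1 for entering the goal, and a discount of 0.9. The grid layout and all names (`step`, `backup`, `value_iteration`, `policy_iteration`) are illustrative assumptions, not the lab's implementation.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
ROWS, COLS, GOAL, WALLS, GAMMA = 2, 3, (0, 2), {(1, 1)}, 0.9
# Non-terminal, non-wall cells are the states whose values we compute.
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)
          if (r, c) not in WALLS and (r, c) != GOAL]

def step(state, action):
    """Deterministic move; bumping a wall or the edge leaves the agent in place."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALLS:
        nxt = state
    return nxt, (1.0 if nxt == GOAL else 0.0)

def backup(v, state, action):
    """One-step lookahead: reward plus discounted value of the next cell."""
    nxt, reward = step(state, action)
    return reward + GAMMA * v.get(nxt, 0.0)  # the terminal goal has value 0

def value_iteration(tol=1e-10):
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new = max(backup(v, s, a) for a in ACTIONS)  # optimality backup
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def policy_iteration(tol=1e-10):
    policy = {s: "up" for s in STATES}  # arbitrary starting policy
    while True:
        # Policy evaluation: iterate the backup for the fixed policy.
        v = {s: 0.0 for s in STATES}
        while True:
            delta = 0.0
            for s in STATES:
                new = backup(v, s, policy[s])
                delta = max(delta, abs(new - v[s]))
                v[s] = new
            if delta < tol:
                break
        # Policy improvement: switch only when an action is strictly better.
        stable = True
        for s in STATES:
            best = max(ACTIONS, key=lambda a: backup(v, s, a))
            if backup(v, s, best) > backup(v, s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, v

vi_values = value_iteration()
pi_policy, pi_values = policy_iteration()
```

On this grid both methods should arrive at the same state values. Switching actions only on strict improvement keeps tied actions from flipping back and forth, which is one simple way to make the stopping test reliable.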