Lecture 22: Reinforcement Learning
Learning Objectives¶
Define reinforcement learning (no model)
Implement passive RL: ADP, TD
Implement active RL: Q-learning
Handle exploration vs. exploitation
Apply deep RL
RL vs. MDP¶
MDP: Model known (P, R)
RL: Model unknown, learn from experience
Goal: Find optimal policy
Passive RL¶
Policy fixed: π given
Task: Learn V^π or Q^π
Direct utility estimation: average the observed returns from each state (sketched below)
ADP (adaptive dynamic programming): learn the transition model P and rewards R from experience, then solve by value iteration
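A minimal sketch of direct utility estimation, assuming each episode is given as a list of (state, reward) pairs collected while following the fixed policy π (this data format is an assumption, not from the lecture):

```python
from collections import defaultdict

def direct_utility_estimation(episodes, gamma=0.9):
    """Estimate V^pi by averaging the observed returns from each state.

    `episodes`: list of trajectories, each a list of (state, reward) pairs
    generated by the fixed policy pi (assumed format).
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is the discounted return-to-go.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)
    # V(s) is the sample mean of every return observed from s.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```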
Temporal-Difference Learning¶
TD(0): V(s) ← V(s) + α[r + γV(s') - V(s)]
No model needed: update from each observed transition (see the sketch below)
Bootstrap: the current estimate V(s') stands in for the true value of s'
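The same update as a short function. The dictionary-based value table with a default of 0 for unseen states is an assumption made for illustration:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup after observing the transition (s, r, s')."""
    # Bootstrapped target: observed reward plus the current estimate of V(s').
    target = r + gamma * V.get(s_next, 0.0)
    # Move V(s) a step of size alpha along the TD error (target - V(s)).
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
```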
Active RL¶
The agent chooses its own actions, so it must explore
ε-greedy: act randomly with probability ε, greedily otherwise (sketched below)
Q-learning: off-policy, learns Q* directly
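A sketch of ε-greedy selection over a tabular Q keyed by (state, action) pairs (an assumed representation):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniform random action
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```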
Q-Learning¶
Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Off-policy: Learn optimal while following exploratory policy
Convergence: to Q* if every (s,a) pair is visited infinitely often and α decays appropriately
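The update embedded in a training loop, as a minimal sketch. The `env` interface here (reset() returning a state, step(a) returning (next_state, reward, done), and an `actions` list) is an assumed convention to keep the example self-contained:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)], default 0 for unseen pairs
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy, so the agent keeps exploring.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # Off-policy: the target maxes over a', so we learn about the
            # greedy policy regardless of the exploratory action taken.
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```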
Exploration¶
Exploration-exploitation tradeoff
ε-greedy: simple, but explores blindly
UCB: add an exploration bonus for rarely tried actions, i.e., optimism under uncertainty (sketched below); optimistic initialization is a related trick
Safe exploration: constrain exploration to avoid catastrophic states
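A sketch of UCB-style selection, assuming visit counts N[(state, action)] and t as the total number of visits to the state (this bookkeeping is an assumption for illustration):

```python
import math

def ucb_action(Q, N, state, actions, t, c=1.0):
    """Greedy in Q plus an exploration bonus that shrinks with visit count."""
    def score(a):
        n = N.get((state, a), 0)
        if n == 0:
            return float("inf")  # untried actions get priority
        return Q.get((state, a), 0.0) + c * math.sqrt(math.log(t) / n)
    return max(actions, key=score)
```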
Deep RL¶
Q-network: approximate Q(s,a) with a neural network
DQN: stabilized by experience replay (sketched below) and a target network
Policy gradient: directly optimize a parameterized policy (e.g., REINFORCE)
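A minimal sketch of the experience-replay piece of DQN; the class name and interface are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions for training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The target network is the second stabilizer: a periodically refreshed copy of the Q-network supplies the bootstrap value in the target r + γ max_a' Q_target(s',a'), so the target does not shift on every gradient step.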
Summary¶
Passive: Learn V given π
Active: Learn π*
TD, Q-learning: Model-free
Deep RL: Function approximation
References¶
Russell & Norvig, AIMA 4e, Ch. 22
Chapter PDF: chapters/chapter-22.pdf
aima-python: reinforcement_learning4e.ipynb