
Lecture 22: Reinforcement Learning

AIMA Chapter 22 — 1 hour

Learning Objectives

  • Define reinforcement learning (learning without a known model)

  • Implement passive RL: ADP, TD

  • Implement active RL: Q-learning

  • Handle exploration vs. exploitation

  • Apply deep RL

RL vs. MDP

  • MDP: Model known (P, R)

  • RL: Model unknown, learn from experience

  • Goal: Find optimal policy

Passive RL

  • Policy fixed: π given

  • Task: Learn V^π or Q^π

  • Direct utility estimation: Average the returns observed from each state

  • ADP: Learn a transition model from experience, then solve it by value iteration (see the sketch below)
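
A minimal sketch of a passive ADP agent, assuming a tabular environment and a fixed policy given as a dict from states to actions; the class and method names are illustrative, not the aima-python API. The agent keeps transition counts and an observed-reward table, and re-evaluates the policy on the learned model after each observation:

```python
from collections import defaultdict

class PassiveADPAgent:
    """Passive ADP (illustrative sketch): learn P and R from experience,
    then evaluate the fixed policy pi on the learned model."""

    def __init__(self, pi, gamma=0.9):
        self.pi = pi                                          # state -> action
        self.gamma = gamma
        self.counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': n}
        self.rewards = {}                                     # s -> R(s)
        self.V = defaultdict(float)

    def observe(self, s, a, r, s_next):
        """Record one transition (s, a, r, s') generated by following pi."""
        self.counts[(s, a)][s_next] += 1
        self.rewards[s] = r
        self._evaluate_policy()           # cheap enough for small state spaces

    def _evaluate_policy(self, sweeps=50):
        """Iterative policy evaluation on the learned model:
        V(s) = R(s) + gamma * sum_{s'} P(s'|s, pi(s)) * V(s')."""
        for _ in range(sweeps):
            for s in list(self.rewards):
                outcomes = self.counts[(s, self.pi[s])]
                total = sum(outcomes.values())
                if total == 0:
                    continue              # no data for this state yet
                self.V[s] = self.rewards[s] + self.gamma * sum(
                    (n / total) * self.V[s2] for s2, n in outcomes.items()
                )
```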

Temporal-Difference Learning

  • TD(0): V(s) ← V(s) + α[r + γV(s′) - V(s)]

  • No model: Update from experience

  • Bootstrap: Use the current estimate V(s′) in the update target (see the sketch below)
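
The TD(0) backup fits in a few lines. A minimal sketch, assuming V is a plain dict from states to value estimates (unseen states read as 0):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup after observing (s, r, s') while following pi."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)  # bootstrap on the current V(s')
    V[s] = v_s + alpha * (target - v_s)      # move V(s) toward the sampled target
```

Unlike ADP, no transition model is stored; each observed transition is used once and discarded.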

Active RL

  • The agent chooses its own actions, so it must explore

  • ε-greedy: Act randomly with probability ε, greedily otherwise

  • Q-learning: Off-policy, learn Q*

Q-Learning

  • Update: Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]

  • Off-policy: Learn optimal while following exploratory policy

  • Convergence: To Q* if every state-action pair is tried infinitely often and α decays appropriately (see the sketch below)
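
A sketch of one episode of tabular Q-learning with an ε-greedy behavior policy. The environment interface (reset(), step(a) returning (s', r, done), actions(s)) and Q as a defaultdict(float) are assumptions for illustration, not a specific library API:

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode; Q maps (state, action) to an estimated value."""
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy
        if random.random() < epsilon:
            a = random.choice(env.actions(s))
        else:
            a = max(env.actions(s), key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # off-policy target: greedy max over next actions, not the action taken
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# Usage sketch: Q = defaultdict(float); call q_learning_episode(env, Q) many times.
```

Because the target uses the greedy max rather than the action actually taken, the estimates move toward Q* even while the agent behaves exploratorily; this is what "off-policy" means here.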

Exploration

  • Exploration-exploitation tradeoff

  • ε-greedy: Simple, but explores uniformly at random

  • UCB: Bonus for rarely tried actions; optimistic initialization is a related trick (see the sketch below)

  • Safe exploration: Constrain actions to avoid catastrophic states while learning
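
A sketch of UCB1-style action selection for the tabular case, assuming visit counts N and values Q keyed by (state, action); the helper name and the constant c are illustrative:

```python
import math

def ucb_action(Q, N, s, actions, c=1.4):
    """Pick the action maximizing Q plus an exploration bonus."""
    total = sum(N[(s, a)] for a in actions)
    def score(a):
        if N[(s, a)] == 0:
            return float('inf')          # every action gets tried at least once
        return Q[(s, a)] + c * math.sqrt(math.log(total) / N[(s, a)])
    return max(actions, key=score)
```

The bonus term shrinks as an action is tried more often, so exploration fades where the estimates are well supported.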

Deep RL

  • Q-network: Approximate Q(s,a) with neural net

  • DQN: Experience replay and a target network (see the sketch below)

  • Policy gradient: Directly optimize policy
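
A minimal sketch of the DQN loss in PyTorch, assuming a 4-dimensional state and 2 actions (sizes are illustrative). It shows the two ingredients named above: minibatches come from a replay buffer (omitted here), and the bootstrapped target comes from a periodically synced target network rather than the online network:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # re-synced every N steps

def dqn_loss(batch, gamma=0.99):
    """batch: tensors (s, a, r, s_next, done) sampled from the replay buffer;
    a is int64, done is a 0/1 float mask."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s,a) of taken actions
    with torch.no_grad():                                  # no gradient into target
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_next
    return nn.functional.mse_loss(q_sa, target)
```

Freezing the target network between syncs stabilizes training; chasing a target produced by the very network being updated tends to diverge.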

Summary

  • Passive: Learn V given π

  • Active: Learn π*

  • TD, Q-learning: Model-free

  • Deep RL: Function approximation

References

  • Russell, S. & Norvig, P., Artificial Intelligence: A Modern Approach (4th ed.), Ch. 22

  • Chapter PDF: chapters/chapter-22.pdf

  • aima-python: reinforcement_learning4e.ipynb

Questions?

Next lecture: Natural Language Processing (Chapter 23)