Lecture 17: Making Complex Decisions
Learning Objectives
Define Markov Decision Processes (MDPs)
Implement value iteration and policy iteration
Handle partially observable MDPs (POMDPs)
Analyze bandit problems and the exploration-exploitation trade-off
Sequential Decisions

MDP: States, actions, transition model P(s'|s,a), and rewards (see the sketch below)
Policy: π(s) maps each state to an action
Utility: Expected sum of (discounted) rewards
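To make these ingredients concrete, here is a minimal sketch of a two-state MDP as plain Python dictionaries. Everything in it (the state and action names, the `T`/`R` layouts, `gamma`) is an illustrative assumption, not the representation used in aima-python.

```python
# A tiny two-state MDP encoded as plain dictionaries.
# All names here (states, actions, T, R) are illustrative, not from aima-python.

states = ["s0", "s1"]
actions = ["stay", "go"]
gamma = 0.9  # discount factor

# Transition model: T[s][a] is a list of (probability, next_state) pairs.
T = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}

# Reward model: R[s][a][s2] is the reward for landing in s2 after doing a in s.
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 2.0}, "go": {"s0": 0.0}},
}

# A policy maps each state to an action.
policy = {"s0": "go", "s1": "stay"}
```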
Value Iteration

V*(s): Optimal expected utility starting from state s
Bellman: V*(s) = max_a Σₛ' P(s'|s,a)[R(s,a,s') + γV*(s')]
Iterate: Apply the Bellman update until the values converge (sketch below)
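A minimal value-iteration sketch over the dictionaries from the MDP example above; the function name and the simple stopping threshold `eps` are illustrative choices, not a reference implementation.

```python
def value_iteration(states, actions, T, R, gamma, eps=1e-6):
    """Repeat Bellman-update sweeps until the value function converges."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Q(s,a) = sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
            q_values = [
                sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in T[s][a])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

V_star = value_iteration(states, actions, T, R, gamma)
print(V_star)  # optimal value for each state
```

Each sweep applies the Bellman equation above as an assignment; with γ < 1 the updates contract toward V*.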
Policy Iteration
Policy evaluation: Compute V^π for the current policy π
Policy improvement: π'(s) = argmax_a Q^π(s,a), a one-step lookahead using V^π
Repeat: Until the policy is stable (sketch below)
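A matching policy-iteration sketch, reusing the same MDP dictionaries. For simplicity it evaluates the policy iteratively instead of solving the linear system exactly; all names are illustrative.

```python
def policy_evaluation(pi, states, T, R, gamma, eps=1e-8):
    """Compute V^pi for a fixed policy by iterating its Bellman equation."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = pi[s]
            v = sum(p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in T[s][a])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V

def policy_iteration(states, actions, T, R, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, states, T, R, gamma)
        stable = True
        for s in states:
            # One-step lookahead: pick the action that maximizes Q^pi(s, a).
            best = max(actions, key=lambda a: sum(
                p * (R[s][a][s2] + gamma * V[s2]) for p, s2 in T[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V

pi_star, V_pi = policy_iteration(states, actions, T, R, gamma)
```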
POMDPs
Belief state: Probability distribution over the underlying states
Belief-state MDP: The POMDP becomes an MDP over the continuous space of beliefs
Value iteration: Runs over belief space (belief-update sketch below)
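The belief state evolves by ordinary Bayesian filtering. The sketch below layers a hypothetical sensor model O[a][s'][o] = P(o | s', a), with made-up observation names, on top of the MDP sketch above; it shows the belief update only, not full POMDP value iteration.

```python
# Hypothetical sensor model: O[a][s2][o] = P(o | s2, a). Numbers are made up.
O = {
    "stay": {"s0": {"ping": 0.1, "quiet": 0.9}, "s1": {"ping": 0.7, "quiet": 0.3}},
    "go":   {"s0": {"ping": 0.1, "quiet": 0.9}, "s1": {"ping": 0.7, "quiet": 0.3}},
}

def belief_update(b, a, o, states, T, O):
    """Bayes filter: b'(s') is proportional to P(o|s',a) * sum_s P(s'|s,a) b(s)."""
    b_new = {}
    for s2 in states:
        # Predict: probability of landing in s2 after doing a, under belief b.
        predicted = sum(p * b[s] for s in states for p, nxt in T[s][a] if nxt == s2)
        # Correct: weight by the likelihood of the observation o in s2.
        b_new[s2] = O[a][s2][o] * predicted
    z = sum(b_new.values())  # normalizer: P(o | b, a)
    return {s: v / z for s, v in b_new.items()}

b = {"s0": 0.5, "s1": 0.5}  # uniform initial belief
b = belief_update(b, "go", "ping", states, T, O)
print(b)  # belief shifts toward s1, where "ping" is more likely
```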
Bandit Problems
Arms: Each arm has an unknown reward distribution
Exploration vs. exploitation: Sample arms to learn their payoffs vs. play the best arm found so far (sketch below)
Gittins index: Gives the optimal policy for the discounted case
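Computing Gittins indices is involved; as a lighter illustration of the exploration-exploitation trade-off, here is an epsilon-greedy sketch on hypothetical Bernoulli arms. Epsilon-greedy is a simple heuristic, not the optimal Gittins policy.

```python
import random

def epsilon_greedy(arms, pulls=10_000, eps=0.1):
    """Pull a random arm with probability eps; otherwise pull the best arm so far."""
    counts = [0] * len(arms)   # times each arm has been pulled
    means = [0.0] * len(arms)  # running average reward per arm
    total = 0.0
    for _ in range(pulls):
        if random.random() < eps:
            i = random.randrange(len(arms))                   # explore
        else:
            i = max(range(len(arms)), key=means.__getitem__)  # exploit
        r = arms[i]()                           # sample this arm's reward
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # incremental mean update
        total += r
    return means, total

# Three arms with Bernoulli payoffs unknown to the agent (p = 0.2, 0.5, 0.7).
arms = [lambda p=p: float(random.random() < p) for p in (0.2, 0.5, 0.7)]
means, total = epsilon_greedy(arms)
print(means)  # estimated payoffs should approach 0.2, 0.5, 0.7
```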
Summary
MDP: States, actions, rewards
Value/policy iteration
POMDP: Belief states
Bandits: Exploration-exploitation
References
Russell, S. & Norvig, P., Artificial Intelligence: A Modern Approach, 4th ed., Ch. 17
Chapter PDF: chapters/chapter-17.pdf
aima-python: mdp4e.ipynb