Lecture 17: Making Complex Decisions

AIMA Chapter 17 — 1 hour

Learning Objectives

  • Define Markov Decision Processes (MDPs)

  • Implement value iteration and policy iteration

  • Handle partially observable MDPs (POMDPs)

  • Analyze bandit problems and the exploration-exploitation tradeoff

Sequential Decisions

Figure: MDP grid world
  • MDP: States, actions, transition model, reward (a minimal example follows this list)

  • Policy: π(s) → action

  • Utility: Sum of (discounted) rewards
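
Below is a minimal, illustrative sketch of an MDP written as plain Python dictionaries; the two-state example and the names `states`, `actions`, `P`, `R`, `gamma`, and `pi` are invented for illustration and are not taken from the aima-python code.

```python
gamma = 0.9                     # discount factor

states  = ["s0", "s1"]
actions = ["stay", "go"]

# Transition model: P[(s, a)] is a list of (probability, next_state) pairs.
P = {
    ("s0", "stay"): [(1.0, "s0")],
    ("s0", "go"):   [(0.9, "s1"), (0.1, "s0")],
    ("s1", "stay"): [(1.0, "s1")],
    ("s1", "go"):   [(0.9, "s0"), (0.1, "s1")],
}

# Reward received on entering each state (a special case of R(s, a, s')).
R = {"s0": 0.0, "s1": 1.0}

# A policy maps each state to an action.
pi = {"s0": "go", "s1": "stay"}
```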

Value Iteration

  • V*(s): Optimal value (expected utility) of state s

  • Bellman equation: V*(s) = max_a Σ_{s′} P(s′ | s, a) [R(s, a, s′) + γ V*(s′)]

  • Iterate: Apply Bellman updates until convergence (see the sketch after this list)
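
A value-iteration sketch over the dictionary MDP defined above; it mirrors the Bellman update directly and is not the aima-python API.

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Return V* by repeated Bellman updates until values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: V(s) <- max_a sum_s' P(s'|s,a) [R(s') + gamma V(s')]
            v = max(
                sum(p * (R[s2] + gamma * V[s2]) for p, s2 in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:          # converged: largest update is tiny
            return V

V = value_iteration(states, actions, P, R, gamma)
```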

Policy Iteration

  • Policy evaluation: Compute V^π

  • Policy improvement: π′(s) = argmax_a Q^π(s, a)

  • Repeat: Until the policy is stable (sketched after this list)
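
A policy-iteration sketch for the same dictionary MDP. As a simplification, the policy-evaluation step approximates V^π with a fixed number of iterative sweeps rather than solving the linear system exactly; `eval_sweeps` is an invented parameter.

```python
def policy_iteration(states, actions, P, R, gamma, eval_sweeps=100):
    pi = {s: actions[0] for s in states}      # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: approximate V^pi with fixed-policy sweeps.
        for _ in range(eval_sweeps):
            for s in states:
                V[s] = sum(p * (R[s2] + gamma * V[s2])
                           for p, s2 in P[(s, pi[s])])

        # Policy improvement: make pi greedy with respect to Q^pi.
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (R[s2] + gamma * V[s2])
                                         for p, s2 in P[(s, a)]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                            # no change: policy is optimal
            return pi, V
```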

POMDPs

  • Belief state: Probability distribution over states, updated by filtering (sketched after this list)

  • Belief-state MDP: Continuous state space

  • Value iteration: Over belief space
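
A sketch of the belief update that turns a POMDP into a belief-state MDP; `O[(s, o)]`, an assumed sensor model giving P(o | s), is not defined in the earlier snippets.

```python
def update_belief(b, a, o, states, P, O):
    """Return b'(s') proportional to P(o | s') * sum_s P(s' | s, a) b(s)."""
    b_new = {}
    for s2 in states:
        # Prediction step: push the old belief through the transition model.
        pred = sum(p * b[s]
                   for s in states
                   for p, nxt in P[(s, a)] if nxt == s2)
        # Correction step: weight by the likelihood of the observation.
        b_new[s2] = O[(s2, o)] * pred
    z = sum(b_new.values())                   # normalizing constant
    return {s: v / z for s, v in b_new.items()}
```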

Bandit Problems

  • Arms: Each with unknown reward distribution

  • Exploration vs. exploitation tradeoff (a simple strategy is sketched after this list)

  • Gittins index: Optimal for discounted case
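
A sketch of UCB1, one standard exploration-exploitation strategy (simpler than, and distinct from, the Gittins index); the Bernoulli arm probabilities below are made up for illustration.

```python
import math
import random

def ucb1(true_means, steps=10_000):
    """Play a k-armed Bernoulli bandit with the UCB1 selection rule."""
    k = len(true_means)
    counts = [0] * k                 # pulls per arm
    values = [0.0] * k               # running mean reward per arm
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1              # pull each arm once to initialize
        else:
            # Choose the arm maximizing mean reward + exploration bonus.
            arm = max(range(k),
                      key=lambda i: values[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

estimates, pulls = ucb1([0.2, 0.5, 0.7])     # the best arm gets most pulls
```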

Summary

  • MDP: States, actions, rewards

  • Value/policy iteration

  • POMDP: Belief states

  • Bandits: Exploration-exploitation

References

  • Russell, S. & Norvig, P., Artificial Intelligence: A Modern Approach (AIMA), 4th ed., Ch. 17

  • Chapter PDF: chapters/chapter-17.pdf

  • aima-python: mdp4e.ipynb

Questions?

Next lecture: Multiagent Decision Making (Chapter 18)