Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Chapter 20: Learning Probabilistic Models — Exercises

Chapter 20: Learning Probabilistic Models

Russell & Norvig, Artificial Intelligence: A Modern Approach (4th ed.), end-of-chapter Exercise prompts — full text is included on this site (same material as the exercises/ch20/ sources in the aima monorepo).


Exercise 20.1

The data used for Figure bayes-candy-figure on page bayes-candy-figure can be viewed as being generated by h5h_5. For each of the other four hypotheses, generate a data set of length 100 and plot the corresponding graphs for P(hid1,,dN)P(h_i|d_1,\ldots,d_N) and P(DN+1=limed1,,dN)P(D_{N+1}=lime|d_1,\ldots,d_N). Comment on your results.


Exercise 20.2

Repeat Exercise bayes-candy-exercise, this time plotting the values of P(DN+1=limehMAP)P(D_{N+1}=lime|h_{MAP}) and P(DN+1=limehML)P(D_{N+1}=lime|h_{ML}).


Exercise 20.3

Suppose that Ann’s utilities for cherry and lime candies are cAc_A and A\ell_A, whereas Bob’s utilities are cBc_B and B\ell_B. (But once Ann has unwrapped a piece of candy, Bob won’t buy it.) Presumably, if Bob likes lime candies much more than Ann, it would be wise for Ann to sell her bag of candies once she is sufficiently sure of its lime content. On the other hand, if Ann unwraps too many candies in the process, the bag will be worth less. Discuss the problem of determining the optimal point at which to sell the bag. Determine the expected utility of the optimal procedure, given the prior distribution from Section statistical-learning-section.


Exercise 20.4

Two statisticians go to the doctor and are both given the same prognosis: A 40% chance that the problem is the deadly disease AA, and a 60% chance of the fatal disease BB. Fortunately, there are anti-AA and anti-BB drugs that are inexpensive, 100% effective, and free of side-effects. The statisticians have the choice of taking one drug, both, or neither. What will the first statistician (an avid Bayesian) do? How about the second statistician, who always uses the maximum likelihood hypothesis?

The doctor does some research and discovers that disease BB actually comes in two versions, dextro-BB and levo-BB, which are equally likely and equally treatable by the anti-BB drug. Now that there are three hypotheses, what will the two statisticians do?


Exercise 20.5

Explain how to apply the boosting method of Chapter concept-learning-chapter to naive Bayes learning. Test the performance of the resulting algorithm on the restaurant learning problem.


Exercise 20.6

Consider NN data points (xj,yj)(x_j,y_j), where the yjy_js are generated from the xjx_js according to the linear Gaussian model in Equation (linear-gaussian-likelihood-equation). Find the values of θ1\theta_1, θ2\theta_2, and σ\sigma that maximize the conditional log likelihood of the data.


Exercise 20.7

Consider the noisy-OR model for fever described in Section canonical-distribution-section. Explain how to apply maximum-likelihood learning to fit the parameters of such a model to a set of complete data. (Hint: use the chain rule for partial derivatives.)


Exercise 20.8

This exercise investigates properties of the Beta distribution defined in Equation (beta-equation).

  1. By integrating over the range [0,1][0,1], show that the normalization constant for the distribution beta[a,b]{{\rm beta}}[a,b] is given by α=Γ(a+b)/Γ(a)Γ(b)\alpha = \Gamma(a+b)/\Gamma(a)\Gamma(b) where Γ(x)\Gamma(x) is the Gamma function, defined by Γ(x+1)=xΓ(x)\Gamma(x+1){{\,=\,}}x\cdot\Gamma(x) and Γ(1)=1\Gamma(1){{\,=\,}}1. (For integer xx, Γ(x+1)=x!\Gamma(x+1){{\,=\,}}x!.)

  2. Show that the mean is a/(a+b)a/(a+b).

  3. Find the mode(s) (the most likely value(s) of θ\theta).

  4. Describe the distribution beta[ϵ,ϵ]{{\rm beta}}[\epsilon,\epsilon] for very small ϵ\epsilon. What happens as such a distribution is updated?


Exercise 20.9

Consider an arbitrary Bayesian network, a complete data set for that network, and the likelihood for the data set according to the network. Give a simple proof that the likelihood of the data cannot decrease if we add a new link to the network and recompute the maximum-likelihood parameter values.


Exercise 20.10

Consider a single Boolean random variable YY (the “classification”). Let the prior probability P(Y=true)P(Y=true) be π\pi. Let’s try to find π\pi, given a training set D=(y1,,yN)D=(y_1,\ldots,y_N) with NN independent samples of YY. Furthermore, suppose pp of the NN are positive and nn of the NN are negative.

  1. Write down an expression for the likelihood of DD (i.e., the probability of seeing this particular sequence of examples, given a fixed value of π\pi) in terms of π\pi, pp, and nn.

  2. By differentiating the log likelihood LL, find the value of π\pi that maximizes the likelihood.

  3. Now suppose we add in kk Boolean random variables X1,X2,,XkX_1, X_2,\ldots,X_k (the “attributes”) that describe each sample, and suppose we assume that the attributes are conditionally independent of each other given the goal YY. Draw the Bayes net corresponding to this assumption.

  4. Write down the likelihood for the data including the attributes, using the following additional notation:

    • αi\alpha_i is P(Xi=trueY=true)P(X_i=true \| Y=true).

    • βi\beta_i is P(Xi=trueY=false)P(X_i=true \| Y=false).

    • pi+p_i^+ is the count of samples for which Xi=trueX_i=true and Y=trueY=true.

    • ni+n_i^+ is the count of samples for which Xi=falseX_i=false and Y=trueY=true.

    • pip_i^- is the count of samples for which Xi=trueX_i=true and Y=falseY=false.

    • nin_i^- is the count of samples for which Xi=falseX_i=false and Y=falseY=false.

    [Hint: consider first the probability of seeing a single example with specified values for X1,X2,,XkX_1, X_2,\ldots,X_k and YY.]

  5. By differentiating the log likelihood LL, find the values of αi\alpha_i and βi\beta_i (in terms of the various counts) that maximize the likelihood and say in words what these values represent.

  6. Let k=2k = 2, and consider a data set with 4 all four possible examples of thexor function. Compute the maximum likelihood estimates of π\pi, α1\alpha_1, α2\alpha_2, β1\beta_1, and β2\beta_2.

  7. Given these estimates of π\pi, α1\alpha_1, α2\alpha_2, β1\beta_1, and β2\beta_2, what are the posterior probabilities P(Y=truex1,x2)P(Y=true | x_1,x_2) for each example?


Exercise 20.11

Consider the application of EM to learn the parameters for the network in Figure mixture-networks-figure(a), given the true parameters in Equation (candy-true-equation).

  1. Explain why the EM algorithm would not work if there were just two attributes in the model rather than three.

  2. Show the calculations for the first iteration of EM starting from Equation (candy-64-equation).

  3. What happens if we start with all the parameters set to the same value pp? (Hint: you may find it helpful to investigate this empirically before deriving the general result.)

  4. Write out an expression for the log likelihood of the tabulated candy data on page candy-counts-page in terms of the parameters, calculate the partial derivatives with respect to each parameter, and investigate the nature of the fixed point reached in part (c).