MAP Derivation from the GPML Book

Published: 2023-05-27

In this post, we will derive the Maximum A Posteriori (MAP) estimate following Rasmussen's book Gaussian Processes for Machine Learning (GPML). We will refer to equations 2.7 and 2.8 of the book. This post is not intended to explain what the MAP estimate is or why it is used; for those interested in more detail, chapter 4.5 of the PML book is a good start.

The Maximum A Posteriori (MAP) estimate is an estimate of the unknown parameter $\mathbf{w}$. Keep in mind that even though MAP incorporates a prior distribution, it is not a fully Bayesian treatment, since we never compute the normalization term of Bayes' rule; it serves only as a point estimate at the mode of the posterior distribution of $\mathbf{w}$.
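As a quick illustration (a minimal sketch of my own, not from the book), the MAP estimate is simply the maximizer of the unnormalized log posterior, i.e. log likelihood plus log prior. Here is a toy one-dimensional version with a Gaussian likelihood and a zero-mean Gaussian prior:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy 1-D illustration: MAP = argmax of log likelihood + log prior.
x = np.array([0.5, 1.0, 1.5, 2.0])   # inputs
y = np.array([0.4, 1.1, 1.4, 2.1])   # noisy targets, y ~ N(x * w, sigma_n^2)
sigma_n, sigma_p = 0.1, 1.0          # noise std and prior std (made-up values)

def neg_log_posterior(w):
    log_lik = -0.5 / sigma_n**2 * np.sum((y - x * w) ** 2)
    log_prior = -0.5 / sigma_p**2 * w**2
    return -(log_lik + log_prior)

w_map = minimize_scalar(neg_log_posterior).x
print(w_map)  # the mode of the posterior over w
```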

Let's start with Bayes' rule, from equation 2.5 of the book:

$$p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w}) \, p(\mathbf{w})}{p(\mathbf{y} \mid X)}$$

Note that in this case we assume that the likelihood $p(\mathbf{y} \mid X, \mathbf{w})$ and the prior $p(\mathbf{w})$ are normally distributed (see equations 2.3 and 2.4). Because of this, the posterior is also normal, since the normal prior is conjugate to the normal likelihood. Then, since the normalization term (also called the evidence) does not depend on $\mathbf{w}$, we can drop it and write the posterior up to proportionality as:
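For reference, equations 2.3 and 2.4 of the book define the linear model with additive Gaussian noise and a zero-mean Gaussian prior on the weights (in the GPML convention, $X$ is the $D \times n$ matrix of inputs, with input vectors as columns):

$$\begin{aligned}
p(\mathbf{y} \mid X, \mathbf{w}) &= \mathcal{N}(X^\intercal \mathbf{w},\, \sigma_n^2 I), \\
\mathbf{w} &\sim \mathcal{N}(\mathbf{0},\, \Sigma_p).
\end{aligned}$$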

$$p(\mathbf{w} \mid X,\mathbf{y}) \propto \exp\left(- \frac{1}{2 \sigma_n^2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w})\right) \exp\left(- \frac{1}{2} \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right)$$

To derive the closed-form formula of the posterior on the left-hand side of the proportionality above, we use the completing-the-square identity:

$$\mathbf{w}^\intercal \mathbf{M} \mathbf{w} - 2 \mathbf{b}^\intercal \mathbf{w} = (\mathbf{w} - \mathbf{M}^{-1} \mathbf{b})^\intercal \mathbf{M} (\mathbf{w}-\mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b}$$
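To verify the identity, expand the right-hand side (using that $\mathbf{M}$ is symmetric, so the two cross terms both equal $\mathbf{b}^\intercal \mathbf{w}$):

$$\begin{aligned}
(\mathbf{w} - \mathbf{M}^{-1} \mathbf{b})^\intercal \mathbf{M} (\mathbf{w}-\mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b}
&= \mathbf{w}^\intercal \mathbf{M} \mathbf{w} - 2\,\mathbf{b}^\intercal \mathbf{w} + \mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b} - \mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b} \\
&= \mathbf{w}^\intercal \mathbf{M} \mathbf{w} - 2\,\mathbf{b}^\intercal \mathbf{w}
\end{aligned}$$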

The trick is to rearrange the unnormalized posterior to match the left-hand side of the completing-the-square identity above; the identity then lets us recover a Gaussian distribution from its right-hand side.

$$\begin{aligned}
p(\mathbf{w} \mid X,\mathbf{y}) & \propto \exp\left(- \frac{1}{2 \sigma_n^2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w})\right) \exp\left(- \frac{1}{2} \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right) \\
& \propto \exp\left(- \frac{1}{2 \sigma_n^2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w}) - \frac{1}{2} \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right) \\
& \propto \exp\left(- \frac{1}{2}\left(\sigma_n^{-2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w}) + \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right)\right) \\
& \propto \exp\left(- \frac{1}{2}\left( \sigma_n^{-2}(\cancel{\mathbf{y}^\intercal\mathbf{y}} - \mathbf{y}^\intercal X^\intercal \mathbf{w} - \mathbf{w}^\intercal X\mathbf{y} + \mathbf{w}^\intercal X X^\intercal \mathbf{w}) + \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right)\right) && \text{(remove constant)} \\
& \propto \exp\left(- \frac{1}{2}\left( -\sigma_n^{-2}\mathbf{y}^\intercal X^\intercal \mathbf{w} - \sigma_n^{-2} \mathbf{w}^\intercal X\mathbf{y} + \sigma_n^{-2}\mathbf{w}^\intercal X X^\intercal \mathbf{w} + \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right)\right) \\
& \propto \exp\left(- \frac{1}{2}\left( -\sigma_n^{-2}\mathbf{y}^\intercal X^\intercal \mathbf{w} - \sigma_n^{-2} \mathbf{w}^\intercal X\mathbf{y} + \mathbf{w}^\intercal \left( \sigma_n^{-2} X X^\intercal + \Sigma^{-1}_p\right) \mathbf{w}\right)\right) && (\text{note that } \mathbf{y}^\intercal X^\intercal \mathbf{w} = \mathbf{w}^\intercal X\mathbf{y}\text{, a scalar}) \\
& \propto \exp\left(- \frac{1}{2}\left( -2\sigma_n^{-2}\mathbf{y}^\intercal X^\intercal \mathbf{w} + \mathbf{w}^\intercal \left( \sigma_n^{-2} X X^\intercal + \Sigma^{-1}_p\right) \mathbf{w}\right)\right)
\end{aligned}$$

Let us define:

$$\begin{aligned}
\mathbf{M} &= \sigma_n^{-2}XX^\intercal + \Sigma_p^{-1} \\
\mathbf{b} &= \sigma_n^{-2}X\mathbf{y}
\end{aligned}$$
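With these definitions, the exponent above is exactly $-\frac{1}{2}(\mathbf{w}^\intercal \mathbf{M} \mathbf{w} - 2\,\mathbf{b}^\intercal \mathbf{w})$, matching the left-hand side of the completing-the-square identity, since

$$\mathbf{b}^\intercal \mathbf{w} = \sigma_n^{-2}(X\mathbf{y})^\intercal \mathbf{w} = \sigma_n^{-2}\,\mathbf{y}^\intercal X^\intercal \mathbf{w}$$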

Thus:

$$\begin{aligned}
\mathbf{M}^{-1}\mathbf{b} &= (\sigma_n^{-2}XX^\intercal + \Sigma_p^{-1})^{-1} \sigma_n^{-2}X\mathbf{y} \\
&= \sigma_n^{-2}(\sigma_n^{-2}XX^\intercal + \Sigma_p^{-1})^{-1} X\mathbf{y} \\
&= \tilde{\mathbf{w}}
\end{aligned}$$

Finally, using the completing-the-square identity, we get:

$$\begin{aligned}
p(\mathbf{w} \mid X,\mathbf{y}) & \propto \exp\left(-\frac{1}{2}\left( (\mathbf{w} - \mathbf{M}^{-1} \mathbf{b})^\intercal \mathbf{M} (\mathbf{w}-\mathbf{M}^{-1} \mathbf{b}) - \cancel{\mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b}}\right)\right) && \text{(remove constant)} \\
& \propto \exp\left(-\frac{1}{2} (\mathbf{w} - \tilde{\mathbf{w}})^\intercal \mathbf{M} (\mathbf{w}-\tilde{\mathbf{w}})\right) \\
& \propto \exp\left(-\frac{1}{2} (\mathbf{w} - \tilde{\mathbf{w}})^\intercal \left(\sigma_n^{-2}XX^\intercal + \Sigma_p^{-1}\right) (\mathbf{w}-\tilde{\mathbf{w}})\right)
\end{aligned}$$

As we can see, the expression above is the kernel of a Gaussian distribution, so we can rewrite the posterior in the following form:

$$p(\mathbf{w} \mid X, \mathbf{y}) \sim \mathcal{N}\left(\tilde{\mathbf{w}} = \frac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$$

where $A = \sigma_n^{-2}XX^\intercal + \Sigma_p^{-1}$.
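As a sanity check, here is a small numerical sketch (my own, not from the book). It computes the closed-form mean $\tilde{\mathbf{w}} = \sigma_n^{-2} A^{-1} X \mathbf{y}$ and compares it against a generic optimizer run on the negative log posterior; the two should agree up to optimizer tolerance. Note the GPML convention that $X$ is $D \times n$, so the model is $\mathbf{y} = X^\intercal \mathbf{w} + \boldsymbol{\varepsilon}$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D, n = 3, 50                      # GPML convention: X is D x n (inputs as columns)
sigma_n = 0.2                     # observation noise std (made-up value)
Sigma_p = np.eye(D)               # prior covariance of w (made-up value)
X = rng.normal(size=(D, n))
w_true = rng.normal(size=D)
y = X.T @ w_true + sigma_n * rng.normal(size=n)

# Closed form: A = sigma_n^{-2} X X^T + Sigma_p^{-1},  w_map = sigma_n^{-2} A^{-1} X y
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_map = np.linalg.solve(A, X @ y) / sigma_n**2

# Numerical MAP: minimize the negative log posterior (up to an additive constant)
Sigma_p_inv = np.linalg.inv(Sigma_p)
def neg_log_post(w):
    r = y - X.T @ w
    return 0.5 * (r @ r) / sigma_n**2 + 0.5 * w @ Sigma_p_inv @ w

w_opt = minimize(neg_log_post, np.zeros(D)).x
print(np.allclose(w_map, w_opt, atol=1e-4))  # should print True
```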

That's it: we have derived the closed-form formula for the Gaussian posterior, whose mode coincides with its mean $\tilde{\mathbf{w}}$ and gives the MAP estimate (equation 2.8 of the GPML book).

Note: If you notice any errors in the derivation and/or the note, please let me know!

References:

- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. (GPML)
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. (PML)