MAP Derivation from the GPML Book

Published: 2023-05-27
In this post, we will derive the maximum a posteriori (MAP) estimate for Bayesian linear regression, following Rasmussen and Williams' book Gaussian Processes for Machine Learning (GPML). We will refer to equations 2.7 and 2.8 in the book. This post is not intended to explain what the MAP estimate is or why it is used; for those interested in more detail, chapter 4.5 of the PML book is a good start.
The maximum a posteriori (MAP) estimate is an estimate of the unknown parameter $\mathbf{w}$. Keep in mind that even though MAP incorporates a prior distribution, it is not a fully Bayesian treatment, since we never compute the normalizing integral in Bayes' rule. Rather, it serves as a point estimate of the mode of the posterior distribution of $\mathbf{w}$.
Let's start with Bayes' rule, from equation 2.5 of the book:
$$
p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}
$$

Note that in this case we assume the likelihood $p(\mathbf{y} \mid X, \mathbf{w})$ and the prior $p(\mathbf{w})$ are normally distributed (see equations 2.3 and 2.4). Because of this, the posterior is also normal, since the normal prior is conjugate to the normal likelihood. Then, since the normalization term (also called the evidence) does not depend on $\mathbf{w}$, we can drop it and write the posterior up to a constant of proportionality:
$$
p(\mathbf{w} \mid X,\mathbf{y}) \propto \exp\left(- \frac{1}{2 \sigma_n^2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w})\right) \exp\left(- \frac{1}{2} \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}\right)
$$

To derive the closed-form formula of the posterior on the left-hand side of the above equation, we use the completing-the-square formula:
$$
\mathbf{w}^\intercal \mathbf{M} \mathbf{w} - 2 \mathbf{b}^\intercal \mathbf{w} = (\mathbf{w} - \mathbf{M}^{-1} \mathbf{b})^\intercal \mathbf{M} (\mathbf{w}-\mathbf{M}^{-1} \mathbf{b}) - \mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b}
$$
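This identity holds for any symmetric $\mathbf{M}$ (which is the case here, since the $\mathbf{M}$ we define below is a sum of symmetric matrices). As a quick numerical sanity check (my own addition, not part of the book), we can verify it with random values:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Random symmetric positive-definite M, random b and w
C = rng.standard_normal((d, d))
M = C @ C.T + d * np.eye(d)
b = rng.standard_normal(d)
w = rng.standard_normal(d)

lhs = w @ M @ w - 2 * b @ w
M_inv_b = np.linalg.solve(M, b)
rhs = (w - M_inv_b) @ M @ (w - M_inv_b) - b @ M_inv_b

print(np.isclose(lhs, rhs))  # True: both sides agree for any w
```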
The trick is to rearrange the unnormalized posterior to match the left-hand side of the completing-the-square formula above; then, using the equality, we can recover a Gaussian distribution from the right-hand side.
$$
\begin{aligned}
p(\mathbf{w}|X,\mathbf{y}) & \propto \exp(- \frac{1}{2 \sigma_n^2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w})) \exp(- \frac{1}{2} \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}) \\
& \propto \exp(- \frac{1}{2 \sigma_n^2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w}) - \frac{1}{2} \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}) \\
& \propto \exp(- \frac{1}{2}(\sigma_n^{-2}(\mathbf{y}-X^\intercal \mathbf{w})^\intercal (\mathbf{y}-X^\intercal \mathbf{w})+ \mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w})) \\
& \propto \exp(- \frac{1}{2}( \sigma_n^{-2}(\cancel{\mathbf{y}^\intercal\mathbf{y}} - \mathbf{y}^\intercal X^\intercal \mathbf{w} - \mathbf{w}^\intercal X\mathbf{y} + \mathbf{w}^\intercal X X^\intercal \mathbf{w}) +\mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w})) &\text{(remove constant)}\\
& \propto \exp(- \frac{1}{2}( -\sigma_n^{-2}\mathbf{y}^\intercal X^\intercal \mathbf{w} - \sigma_n^{-2} \mathbf{w}^\intercal X\mathbf{y} + \sigma_n^{-2}\mathbf{w}^\intercal X X^\intercal \mathbf{w} +\mathbf{w}^\intercal \Sigma^{-1}_p \mathbf{w}))\\
& \propto \exp(- \frac{1}{2}( -\sigma_n^{-2}\mathbf{y}^\intercal X^\intercal \mathbf{w} - \sigma_n^{-2} \mathbf{w}^\intercal X\mathbf{y} + \mathbf{w}^\intercal ( \sigma_n^{-2} X X^\intercal + \Sigma^{-1}_p) \mathbf{w})) &(\text{note that } \mathbf{y}^\intercal X^\intercal \mathbf{w} = \mathbf{w}^\intercal X\mathbf{y})\\
& \propto \exp(- \frac{1}{2}( -2\sigma_n^{-2}\mathbf{y}^\intercal X^\intercal \mathbf{w} + \mathbf{w}^\intercal ( \sigma_n^{-2} X X^\intercal + \Sigma^{-1}_p) \mathbf{w}))
\end{aligned}
$$

Let us define:
$$
\begin{aligned}
\mathbf{M} &= \sigma_n^{-2}XX^\intercal + \Sigma_p^{-1} \\
\mathbf{b} &= \sigma_n^{-2}X\mathbf{y}
\end{aligned}
$$

Thus:
$$
\begin{aligned}
\mathbf{M}^{-1}\mathbf{b} &= (\sigma_n^{-2}XX^\intercal + \Sigma_p^{-1})^{-1} \sigma_n^{-2}X\mathbf{y}\\
&= \sigma_n^{-2}(\sigma_n^{-2}XX^\intercal + \Sigma_p^{-1})^{-1} X\mathbf{y}\\
&= \tilde{\mathbf{w}}
\end{aligned}
$$

Finally, using the equality of the completing-the-square formula, we get:
$$
\begin{aligned}
p(\mathbf{w}|X,\mathbf{y}) & \propto \exp(-\frac{1}{2} ((\mathbf{w} - \mathbf{M}^{-1} \mathbf{b})^\intercal \mathbf{M} (\mathbf{w}-\mathbf{M}^{-1} \mathbf{b}) - \cancel{\mathbf{b}^\intercal \mathbf{M}^{-1} \mathbf{b}})) &\text{(remove constant)}\\
& \propto \exp(-\frac{1}{2} (\mathbf{w} - \tilde{\mathbf{w}})^\intercal \mathbf{M} (\mathbf{w}-\tilde{\mathbf{w}})) \\
& \propto \exp(-\frac{1}{2} (\mathbf{w} - \tilde{\mathbf{w}})^\intercal (\sigma_n^{-2}XX^\intercal + \Sigma_p^{-1}) (\mathbf{w}-\tilde{\mathbf{w}}))
\end{aligned}
$$

As we can see from the formula above, we can rewrite it as a Gaussian distribution of the following form:
$$
p(\mathbf{w} \mid X, \mathbf{y}) \sim \mathcal{N}\left(\tilde{\mathbf{w}}=\frac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\ A^{-1}\right)
$$

where $A = \sigma_n^{-2}XX^\intercal + \Sigma_p^{-1}$ (the book's $A$ is exactly our $\mathbf{M}$).
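To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the book) that computes $\tilde{\mathbf{w}}$ and the posterior covariance $A^{-1}$, following the book's convention that $X$ is a $D \times n$ matrix whose columns are the training inputs:

```python
import numpy as np

rng = np.random.default_rng(42)
D, n = 3, 50            # input dimension, number of training points
sigma_n = 0.5           # noise standard deviation
Sigma_p = np.eye(D)     # prior covariance of w

# Synthetic data: X is D x n (inputs stored as columns, as in GPML)
X = rng.standard_normal((D, n))
w_true = rng.standard_normal(D)
y = X.T @ w_true + sigma_n * rng.standard_normal(n)

# Posterior precision A and mean w_tilde (equation 2.8)
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_tilde = np.linalg.solve(A, X @ y) / sigma_n**2
posterior_cov = np.linalg.inv(A)

print(w_true)   # with enough data, w_tilde should be close to w_true
print(w_tilde)
```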
That's it: we have derived the closed-form formula for the Gaussian posterior, whose mode (and mean) $\tilde{\mathbf{w}}$ is the MAP estimate (equation 2.8 of the GPML book).
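As a final sanity check of the whole derivation (again my own addition), we can confirm that the log of the unnormalized posterior $p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})$ differs from the log density of $\mathcal{N}(\tilde{\mathbf{w}}, A^{-1})$ by the same constant for every $\mathbf{w}$, reusing the variables from the sketch above:

```python
from scipy.stats import multivariate_normal

def log_unnormalized_posterior(w):
    # log p(y | X, w) + log p(w), dropping w-independent constants
    r = y - X.T @ w
    return -0.5 * (r @ r) / sigma_n**2 - 0.5 * w @ np.linalg.inv(Sigma_p) @ w

posterior = multivariate_normal(mean=w_tilde, cov=posterior_cov)

# The difference should be the same constant for every w
for _ in range(3):
    w = rng.standard_normal(D)
    print(log_unnormalized_posterior(w) - posterior.logpdf(w))
```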
Note: if you notice any errors in the derivation and/or the notes, please let me know!
References:
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.