Chapter 28: Parameter Estimation II - Priors & Maximum A Posteriori (MAP)

In machine learning and statistics, parameter estimation is a central task. Having explored Maximum Likelihood Estimation (MLE) in previous chapters, we now turn to a more general technique: Maximum A Posteriori (MAP) Estimation. MAP incorporates prior knowledge about the parameters; in fact, MLE can be viewed as the special case of MAP with a flat (uniform) prior.

1. Introduction to Priors

In Bayesian statistics, a prior represents our beliefs about the parameters before observing any data. It's a probability distribution that reflects our knowledge or assumptions about the parameter's values. Incorporating priors allows us to update our beliefs in light of new evidence, leading to the posterior distribution.

Given a parameter θ and data X, the prior distribution is denoted P(θ). The likelihood, P(X|θ), represents the probability of observing the data given the parameter θ.
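
To make this notation concrete, the snippet below maps the two quantities to code: the prior P(θ) as a distribution over the parameter, and the likelihood P(X|θ) as the probability (density) of the observed data for a candidate value of θ. This is a minimal sketch with assumed, illustrative numbers and an assumed Gaussian observation model; it is not part of the chapter's worked example.

python:

import numpy as np
from scipy import stats

theta_candidate = 6.0              # a candidate parameter value (assumed)
X = np.array([5.0, 6.0, 7.0])      # observed data (assumed)

# Prior P(theta): our belief about theta before seeing X, here N(4, 1)
prior = stats.norm(loc=4.0, scale=1.0)
prior_density = prior.pdf(theta_candidate)

# Likelihood P(X|theta): density of the data if theta equals theta_candidate,
# assuming Gaussian observations with known standard deviation sqrt(2)
likelihood = np.prod(stats.norm.pdf(X, loc=theta_candidate, scale=np.sqrt(2.0)))

print(f"Prior density at theta = {theta_candidate}: {prior_density:.4f}")
print(f"Likelihood of X given theta = {theta_candidate}: {likelihood:.6f}")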

2. Posterior Distribution

The posterior distribution combines the prior and the likelihood to form the updated belief about θ after observing the data X. It is given by Bayes' theorem:

P(\theta|X) = \frac{P(X|\theta) \cdot P(\theta)}{P(X)}

Where:

  • P(θ|X) is the posterior distribution.
  • P(X|θ) is the likelihood of the data given the parameters.
  • P(θ) is the prior distribution.
  • P(X) is the marginal likelihood, or evidence, which acts as a normalizing constant.
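
As a concrete illustration of Bayes' theorem, the (unnormalized) posterior can be evaluated on a grid of parameter values: multiply the likelihood by the prior pointwise, then divide by a numerical approximation of the evidence P(X) so the result integrates to one. The sketch below is a minimal example, assuming a Gaussian prior and a Gaussian likelihood with known variance; the numbers are illustrative.

python:

import numpy as np
from scipy import stats

X = np.array([5.0, 6.0, 7.0])            # observed data (assumed)
sigma = np.sqrt(2.0)                     # known likelihood standard deviation (assumed)
prior = stats.norm(loc=4.0, scale=1.0)   # prior P(theta): N(4, 1) (assumed)

theta_grid = np.linspace(0.0, 10.0, 1001)
step = theta_grid[1] - theta_grid[0]

# Likelihood P(X|theta) for every theta on the grid
likelihood = np.array([np.prod(stats.norm.pdf(X, loc=t, scale=sigma)) for t in theta_grid])

# Bayes' theorem on the grid: posterior is likelihood * prior, normalized by the evidence
unnormalized = likelihood * prior.pdf(theta_grid)
evidence = np.sum(unnormalized) * step   # numerical approximation of P(X)
posterior = unnormalized / evidence

print("Approximate posterior mode:", theta_grid[np.argmax(posterior)])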

3. Maximum A Posteriori (MAP) Estimation

MAP estimation finds the parameter value that maximizes the posterior distribution. Mathematically, MAP is expressed as:

\theta_{\text{MAP}} = \underset{\theta}{\text{argmax}} \, P(\theta|X)

Applying Bayes' theorem, and dropping the evidence P(X) because it does not depend on θ and therefore does not affect the maximizer, this can be rewritten as:

\theta_{\text{MAP}} = \underset{\theta}{\text{argmax}} \, P(X|\theta) \cdot P(\theta)

Since the logarithm is monotonically increasing, maximizing this product is equivalent to maximizing its logarithm, which is usually more convenient to work with:

\theta_{\text{MAP}} = \underset{\theta}{\text{argmax}} \, \left[ \log P(X|\theta) + \log P(\theta) \right]

This contrasts with MLE, which maximizes only the likelihood P(X|θ), without considering the prior.
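
When no closed-form maximizer is available, the log-posterior objective above can be optimized numerically. The sketch below is a minimal illustration: it minimizes the negative of log P(X|θ) + log P(θ) with scipy.optimize.minimize_scalar and, for contrast, the negative log-likelihood alone to obtain the MLE. The Gaussian likelihood with known variance and the Gaussian prior are assumptions chosen to match the worked example in the next section; they are not the only possible choice.

python:

import numpy as np
from scipy import optimize, stats

X = np.array([5.0, 6.0, 7.0, 8.0, 9.0])  # observed data (same illustrative numbers as the next section)
sigma = np.sqrt(2.0)                     # known likelihood standard deviation
mu0, tau = 4.0, 1.0                      # Gaussian prior on the mean: N(mu0, tau^2)

def neg_log_likelihood(mu):
    # MLE objective: -log P(X|mu)
    return -np.sum(stats.norm.logpdf(X, loc=mu, scale=sigma))

def neg_log_posterior(mu):
    # MAP objective: -[log P(X|mu) + log P(mu)], with the constant log P(X) dropped
    return neg_log_likelihood(mu) - stats.norm.logpdf(mu, loc=mu0, scale=tau)

mu_mle = optimize.minimize_scalar(neg_log_likelihood).x
mu_map = optimize.minimize_scalar(neg_log_posterior).x
print(f"Numerical MLE: {mu_mle:.2f}")
print(f"Numerical MAP: {mu_map:.2f}")

For these numbers the two estimates differ because the prior pulls the MAP solution toward μ₀; the numerical MAP value should agree with the closed-form expression derived in the next section.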

4. Example: MAP Estimation with Python

Let's consider a simple example where we estimate the mean of a Gaussian distribution using MAP.

Problem Setup:

  • Assume we have data X = {x₁, x₂, …, xₙ} drawn from a Gaussian distribution with an unknown mean μ and a known variance σ².
  • We assume a Gaussian prior on μ with mean μ₀ and variance τ².

The likelihood is:

P(X|\mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

The prior is:

P(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{(\mu - \mu_0)^2}{2\tau^2}\right)

The posterior (ignoring the normalizing constant) is:

P(\mu|X) \propto \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2\tau^2} (\mu - \mu_0)^2 \right)

Maximizing this with respect to μ (equivalently, setting the derivative of the log-posterior to zero) gives the MAP estimate in closed form:

\mu_{\text{MAP}} = \frac{\sigma^2 \mu_0 + \tau^2 \sum_{i=1}^n x_i}{\sigma^2 + n\tau^2}

Python Implementation:

python:

import numpy as np

# Given data
X = np.array([5.0, 6.0, 7.0, 8.0, 9.0])  # Sample data points
n = len(X)
sigma2 = 2.0  # Known variance of the likelihood
mu0 = 4.0     # Prior mean
tau2 = 1.0    # Prior variance

# MLE estimate of the mean: the sample mean
mu_mle = np.mean(X)

# MAP estimate of the mean: closed form derived above
mu_map = (sigma2 * mu0 + tau2 * np.sum(X)) / (sigma2 + n * tau2)

print(f"MLE Estimate: {mu_mle:.2f}")
print(f"MAP Estimate: {mu_map:.2f}")

Output:

python:

MLE Estimate: 7.00
MAP Estimate: 6.14

In this example, the MLE estimate is simply the sample mean, while the MAP estimate incorporates the prior, pulling the estimate towards the prior mean μ₀ = 4.
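
How strongly the prior pulls the estimate depends on how much data is available and how concentrated the prior is. The short sketch below (same assumed prior and noise level as above, applied to synthetic data) illustrates this: as the sample size grows, the MAP estimate converges to the MLE, so the prior has the greatest influence when data are scarce.

python:

import numpy as np

mu0, tau2, sigma2 = 4.0, 1.0, 2.0  # same illustrative prior mean/variance and noise variance as above

def map_mean(X):
    # Closed-form MAP estimate for a Gaussian mean with a Gaussian prior (derived above)
    n = len(X)
    return (sigma2 * mu0 + tau2 * np.sum(X)) / (sigma2 + n * tau2)

rng = np.random.default_rng(0)
for n in (5, 50, 500):
    X = rng.normal(loc=7.0, scale=np.sqrt(sigma2), size=n)  # synthetic data centered at 7
    print(f"n = {n:3d}   MLE = {X.mean():.2f}   MAP = {map_mean(X):.2f}")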

5. Conclusion

MAP estimation provides a powerful framework for parameter estimation by integrating prior knowledge. This approach is particularly useful when the dataset is small or when incorporating domain knowledge is crucial. In contrast to MLE, which only considers the likelihood, MAP offers a more flexible estimation method by leveraging priors.

Understanding the role of priors and how they influence the posterior distribution is key to effectively applying MAP in practical scenarios. By experimenting with different priors and analyzing their impact, one can gain deeper insights into the underlying parameters of a model.

This chapter provides a foundation for further exploration into Bayesian methods, which are instrumental in many advanced machine learning applications.
