Machine Learning (Chapter 29): Parameter Estimation III - Bayesian Estimation

Bayesian Estimation is a powerful statistical technique that extends the principles of Bayesian inference to parameter estimation. Unlike Maximum Likelihood Estimation (MLE), which focuses solely on finding the parameter values that maximize the likelihood function, Bayesian Estimation incorporates prior knowledge or beliefs about the parameters and updates these beliefs as data are observed. This chapter explores the mathematical foundation of Bayesian Estimation, provides examples, and includes Python code to demonstrate the concepts.

1. Introduction to Bayesian Estimation

Bayesian Estimation is based on Bayes' Theorem, which is used to update the probability of a hypothesis as more evidence becomes available. In the context of parameter estimation, the hypothesis represents the parameters of a model, and the evidence is the observed data.

Bayes' Theorem is mathematically expressed as:

$$P(\theta | \mathbf{X}) = \frac{P(\mathbf{X} | \theta) \cdot P(\theta)}{P(\mathbf{X})}$$

Where:

  • $P(\theta | \mathbf{X})$ is the posterior probability of the parameter $\theta$ given the data $\mathbf{X}$.
  • $P(\mathbf{X} | \theta)$ is the likelihood of the data $\mathbf{X}$ given the parameter $\theta$.
  • $P(\theta)$ is the prior probability of the parameter $\theta$.
  • $P(\mathbf{X})$ is the marginal likelihood, or evidence, which acts as a normalizing constant.

The goal of Bayesian Estimation is to derive the posterior distribution $P(\theta | \mathbf{X})$, which combines the prior distribution with the likelihood.
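To make the update in Bayes' Theorem concrete before moving to continuous parameters, the following minimal sketch applies it on a discrete grid of candidate values for $\theta$; the Bernoulli (coin-flip) model, the grid, the observations, and the uniform prior are all illustrative assumptions rather than part of the chapter's running example.

python:

import numpy as np

# Hypothetical setup: estimate a coin's heads probability theta on a grid
theta_grid = np.linspace(0.01, 0.99, 99)    # candidate parameter values
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # assumed observations (1 = heads)

prior = np.ones_like(theta_grid)             # non-informative (uniform) prior
prior /= prior.sum()

# Likelihood P(X | theta) of the data under each candidate theta (Bernoulli model)
k, n = data.sum(), len(data)
likelihood = theta_grid**k * (1 - theta_grid)**(n - k)

# Bayes' Theorem: posterior = likelihood * prior / evidence
unnormalized = likelihood * prior
evidence = unnormalized.sum()                # P(X), the normalizing constant
posterior = unnormalized / evidence

print(f'Posterior mean of theta: {np.sum(theta_grid * posterior):.3f}')

The evidence term plays exactly the role described above: it rescales likelihood times prior so that the posterior sums (or integrates) to one.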

2. Prior Distributions

The prior distribution $P(\theta)$ represents our belief about the parameters before observing any data. Choosing a prior is crucial, as it can significantly influence the posterior distribution. Priors can be:

  • Non-informative (Uniform): Assumes no prior knowledge about the parameter.
  • Informative: Incorporates specific prior knowledge.
  • Conjugate: Chosen to simplify the posterior calculation.

For example, if we are estimating the parameter $\mu$ of a normal distribution, a common choice for the prior might be another normal distribution:

$$\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$$
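As a quick illustration of these three options, the snippet below writes them down with scipy.stats; it is only a sketch, and the parameter ranges and hyperparameter values are assumptions chosen for demonstration.

python:

from scipy import stats

# Non-informative (uniform) prior over an assumed plausible range for mu
uniform_prior = stats.uniform(loc=0.0, scale=10.0)   # mu in [0, 10]

# Informative prior: we believe mu is near 3.0 with fairly small uncertainty
informative_prior = stats.norm(loc=3.0, scale=0.3)

# Conjugate prior: for a normal likelihood with known variance, a normal prior
# on mu is conjugate, so the posterior is again normal (used in Section 3)
conjugate_prior = stats.norm(loc=3.0, scale=0.3)

# Evaluate each prior density at a candidate value of mu
for name, prior in [('uniform', uniform_prior),
                    ('informative', informative_prior),
                    ('conjugate', conjugate_prior)]:
    print(f'{name:12s} p(mu = 2.9) = {prior.pdf(2.9):.3f}')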

3. Posterior Distribution

The posterior distribution combines the prior with the likelihood of the observed data to provide a new distribution reflecting our updated beliefs. For many models, the posterior distribution may not have a closed-form solution, requiring numerical methods like Markov Chain Monte Carlo (MCMC) for estimation.
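To show what such a numerical method looks like in practice, here is a minimal random-walk Metropolis-Hastings sketch for the normal-mean model used in the rest of this chapter; the data and prior values match the example in Section 4, while the proposal width, number of iterations, and burn-in length are assumptions made for illustration.

python:

import numpy as np

rng = np.random.default_rng(0)

# Data and model from the Section 4 example: normal likelihood with known
# variance sigma2, normal prior N(mu0, sigma0_2) on mu
X = np.array([2.5, 3.0, 2.8, 3.2, 2.9])
sigma2, mu0, sigma0_2 = 0.25, 3.0, 0.1

def log_posterior(mu):
    # log prior + log likelihood, up to an additive constant
    log_prior = -0.5 * (mu - mu0)**2 / sigma0_2
    log_lik = -0.5 * np.sum((X - mu)**2) / sigma2
    return log_prior + log_lik

samples = []
mu_current = 3.0                                   # arbitrary starting point
for _ in range(5000):
    mu_proposal = mu_current + rng.normal(0, 0.2)  # random-walk proposal
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < log_posterior(mu_proposal) - log_posterior(mu_current):
        mu_current = mu_proposal
    samples.append(mu_current)

samples = np.array(samples[1000:])                 # discard burn-in
print(f'MCMC estimate of the posterior mean: {samples.mean():.3f}')

For this conjugate model the sampler is overkill, since the exact posterior is available in closed form (derived next), but the same loop applies when no closed form exists.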

Let's consider a simple example of Bayesian Estimation with a normal likelihood and a normal prior.

Example:

Given data $\mathbf{X} = \{x_1, x_2, \dots, x_n\}$ sampled from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is known, the likelihood function is:

$$P(\mathbf{X} | \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

Assume a normal prior for $\mu$:

$$\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$$

The posterior distribution of $\mu$ given the data $\mathbf{X}$ is:

$$P(\mu | \mathbf{X}) \propto P(\mathbf{X} | \mu) \cdot P(\mu)$$

By multiplying the likelihood and the prior and collecting the terms that depend on $\mu$, the posterior distribution can be shown to be:

$$\mu | \mathbf{X} \sim \mathcal{N}\left(\frac{\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^n x_i}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},\ \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}\right)$$
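The step behind this result is the usual completing-the-square argument: the exponent of $P(\mathbf{X} | \mu) \cdot P(\mu)$ is quadratic in $\mu$,

$$-\frac{1}{2}\left[\frac{(\mu - \mu_0)^2}{\sigma_0^2} + \sum_{i=1}^n \frac{(x_i - \mu)^2}{\sigma^2}\right] = -\frac{1}{2}\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)\mu^2 + \left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^n x_i}{\sigma^2}\right)\mu + \text{const}$$

so the posterior is again normal, with precision $\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$ and mean equal to the linear coefficient divided by that precision, which is exactly the expression above.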

4. Python Implementation

Let's implement Bayesian Estimation in Python using the above example.

python:

import numpy as np
import matplotlib.pyplot as plt

# Given data (X) and known variance (sigma^2)
X = np.array([2.5, 3.0, 2.8, 3.2, 2.9])
sigma2 = 0.25

# Prior parameters
mu0 = 3.0
sigma0_2 = 0.1

# Compute the posterior mean and variance
n = len(X)
mu_numerator = mu0 / sigma0_2 + np.sum(X) / sigma2
mu_denominator = 1 / sigma0_2 + n / sigma2
mu_post = mu_numerator / mu_denominator
sigma_post_2 = 1 / mu_denominator

# Print posterior mean and variance
print(f'Posterior Mean: {mu_post}')
print(f'Posterior Variance: {sigma_post_2}')

# Plot the prior, likelihood, and posterior
mu_values = np.linspace(2.6, 3.4, 100)

# Prior distribution N(mu0, sigma0_2)
prior = (1 / np.sqrt(2 * np.pi * sigma0_2)) * np.exp(-0.5 * (mu_values - mu0)**2 / sigma0_2)

# Likelihood, viewed as a function of mu and rescaled to the density of the sample mean
likelihood = (1 / np.sqrt(2 * np.pi * sigma2 / n)) * np.exp(-0.5 * n * (mu_values - np.mean(X))**2 / sigma2)

# Posterior distribution N(mu_post, sigma_post_2)
posterior = (1 / np.sqrt(2 * np.pi * sigma_post_2)) * np.exp(-0.5 * (mu_values - mu_post)**2 / sigma_post_2)

plt.plot(mu_values, prior, label='Prior')
plt.plot(mu_values, likelihood, label='Likelihood')
plt.plot(mu_values, posterior, label='Posterior')
plt.xlabel('Mu')
plt.ylabel('Density')
plt.legend()
plt.show()

In this code:

  • We compute the posterior mean and variance based on the prior and observed data.
  • The prior, likelihood, and posterior distributions are plotted to visualize how Bayesian Estimation updates the belief about the parameter $\mu$.
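Because the posterior here is itself a normal distribution, summaries beyond the point estimate come almost for free. The short continuation below (an added sketch, assuming the variables X, mu0, mu_post, and sigma_post_2 from the code above are still in scope) reports a 95% credible interval and shows how the posterior mean sits between the sample mean (the MLE) and the prior mean.

python:

import numpy as np
from scipy import stats

# Continues the script above: X, mu0, mu_post, sigma_post_2 are assumed defined
posterior_dist = stats.norm(loc=mu_post, scale=np.sqrt(sigma_post_2))
lower, upper = posterior_dist.interval(0.95)   # central 95% credible interval

print(f'Sample mean (MLE):     {np.mean(X):.3f}')
print(f'Prior mean:            {mu0:.3f}')
print(f'Posterior mean:        {mu_post:.3f}')
print(f'95% credible interval: [{lower:.3f}, {upper:.3f}]')

With the values used above, the posterior mean (about 2.92) lies between the sample mean (2.88) and the prior mean (3.0), illustrating the shrinkage toward the prior that an informative prior produces.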

5. Conclusion

Bayesian Estimation provides a coherent and flexible approach to parameter estimation, allowing the incorporation of prior knowledge. It’s particularly useful in scenarios where data is sparse or when integrating domain knowledge is crucial. The posterior distribution resulting from Bayesian Estimation not only provides point estimates but also quantifies uncertainty, making it a powerful tool for decision-making in uncertain environments.

This chapter has introduced the mathematical foundation of Bayesian Estimation, illustrated it with a simple example, and provided a Python implementation to demonstrate its practical application.
