Machine Learning (Chapter 6): Statistical Decision Theory - Classification

Introduction

Statistical decision theory forms the backbone of many machine learning techniques, especially in the domain of classification. Classification is a fundamental problem in machine learning, where the goal is to assign a label to a given input based on its features. In this article, we will delve into the mathematical foundation of classification using statistical decision theory, explore key concepts, and provide Python code to illustrate these ideas.

1. The Basics of Classification

Classification is the task of predicting a discrete label $y$ (such as a class membership) given an input vector $\mathbf{x}$. The input vector $\mathbf{x}$ represents the features of the data, and $y$ is the class label. In the context of statistical decision theory, we aim to find a decision rule that minimizes the probability of misclassification.

2. Probability Distributions in Classification

In statistical decision theory, we assume that the data is generated according to some probability distribution. Let $p(\mathbf{x}|y)$ be the class-conditional probability density function, representing the probability of observing $\mathbf{x}$ given that the true class label is $y$. Additionally, $p(y)$ is the prior probability of class $y$.

The joint probability of $\mathbf{x}$ and $y$ is given by:

$$p(\mathbf{x}, y) = p(\mathbf{x}|y)\, p(y)$$

The goal is to infer the class label $y$ given the input $\mathbf{x}$, which can be achieved using Bayes' theorem:

$$p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\, p(y)}{p(\mathbf{x})}$$

where $p(\mathbf{x}) = \sum_{y} p(\mathbf{x}|y)\, p(y)$ is the marginal probability of $\mathbf{x}$.
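
To make the formula concrete, here is a minimal sketch that turns class-conditional likelihoods and priors into posterior probabilities. The numbers are made up for illustration: assume two classes and that the likelihoods $p(\mathbf{x}|y)$ have already been evaluated at one observed $\mathbf{x}$.

python:

import numpy as np

# Hypothetical likelihoods p(x|y) evaluated at one observed x, for classes 0 and 1
likelihoods = np.array([0.05, 0.20])   # p(x | y=0), p(x | y=1)
priors = np.array([0.7, 0.3])          # p(y=0), p(y=1)

# Bayes' theorem: posterior is proportional to likelihood * prior, normalized by the evidence p(x)
joint = likelihoods * priors           # p(x, y)
evidence = joint.sum()                 # p(x) = sum_y p(x|y) p(y)
posterior = joint / evidence           # p(y | x)

print(posterior)                       # [0.368..., 0.631...]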

3. Bayesian Decision Rule

The Bayesian decision rule is to assign the class label $y$ that maximizes the posterior probability $p(y|\mathbf{x})$. This is known as Maximum A Posteriori (MAP) estimation:

$$y_{\text{MAP}} = \arg\max_{y} p(y|\mathbf{x}) = \arg\max_{y} p(\mathbf{x}|y)\, p(y)$$
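
Continuing the toy numbers from the sketch above, the MAP decision is just the argmax of the unnormalized posterior; the evidence $p(\mathbf{x})$ does not depend on $y$, so it can be dropped.

python:

import numpy as np

likelihoods = np.array([0.05, 0.20])   # same hypothetical p(x|y) as in the previous sketch
priors = np.array([0.7, 0.3])          # same hypothetical p(y)

# MAP rule: argmax over p(x|y) p(y); dividing by p(x) would not change the winner
y_map = int(np.argmax(likelihoods * priors))
print(y_map)                           # 1, since 0.20 * 0.3 = 0.06 > 0.05 * 0.7 = 0.035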

4. Loss Function and Risk

In practice, misclassification may have different consequences depending on the true class and the predicted class. This can be captured using a loss function $L(y, \hat{y})$, which represents the cost of predicting $\hat{y}$ when the true class is $y$.

The expected loss, also known as the risk, is given by:

$$R(\hat{y}|\mathbf{x}) = \sum_{y} L(y, \hat{y})\, p(y|\mathbf{x})$$

The goal is to minimize the expected loss. The Bayes risk is the minimum risk that can be achieved:

$$R_{\text{Bayes}} = \min_{\hat{y}} R(\hat{y}|\mathbf{x})$$

In the case of 0-1 loss, where the cost of misclassification is the same for all errors, the Bayes classifier reduces to the MAP rule.
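
The sketch below uses a made-up asymmetric loss matrix (and the posterior from the earlier toy example) to show how the risk-minimizing decision can differ from the MAP decision when errors have unequal costs.

python:

import numpy as np

posterior = np.array([0.37, 0.63])     # hypothetical p(y=0|x), p(y=1|x)

# Hypothetical loss matrix L[true, predicted]: predicting 1 when the truth is 0 is very costly
L = np.array([[0.0, 10.0],
              [1.0,  0.0]])

# Risk of each candidate prediction: R(y_hat|x) = sum_y L[y, y_hat] p(y|x)
risk = posterior @ L
y_bayes = int(np.argmin(risk))

print(risk)                            # [0.63, 3.7]: predicting 0 is cheaper overall
print(y_bayes)                         # 0, whereas the MAP rule would pick the more probable class 1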

5. Generative vs. Discriminative Models

There are two main approaches to classification, generative and discriminative models; a brief comparison of the two on the same data is sketched after the list below.

  • Generative Models: These models explicitly model the joint probability $p(\mathbf{x}, y)$, or equivalently the class-conditional distribution $p(\mathbf{x}|y)$ and the prior $p(y)$. The decision rule is then derived using Bayes' theorem. Examples include Gaussian Naive Bayes and Hidden Markov Models (HMMs).

  • Discriminative Models: These models directly model the posterior probability $p(y|\mathbf{x})$, or directly learn the decision boundary between classes without modeling the underlying distribution. Examples include Logistic Regression and Support Vector Machines (SVMs).
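
As a simplified illustration of the two approaches, the sketch below fits one generative classifier (Gaussian Naive Bayes) and one discriminative classifier (Logistic Regression) from scikit-learn to the same synthetic data. The dataset parameters are chosen arbitrarily here, mirroring the example in Section 7.

python:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-feature, two-class data (parameters are illustrative)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Generative: Gaussian Naive Bayes models p(x|y) and p(y), then applies Bayes' theorem
gnb = GaussianNB().fit(X_train, y_train)

# Discriminative: Logistic Regression models p(y|x) directly
logreg = LogisticRegression().fit(X_train, y_train)

print("GaussianNB accuracy:         ", accuracy_score(y_test, gnb.predict(X_test)))
print("LogisticRegression accuracy: ", accuracy_score(y_test, logreg.predict(X_test)))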

6. Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis (QDA) is a generative model in which the class-conditional densities $p(\mathbf{x}|y)$ are assumed to be Gaussian, with a different covariance matrix for each class.

The class-conditional density for class $y$ is given by:

$$p(\mathbf{x}|y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu_y)^\top \Sigma_y^{-1} (\mathbf{x} - \mu_y)\right)$$

where $\mu_y$ is the mean vector and $\Sigma_y$ is the covariance matrix of the Gaussian distribution for class $y$.

The decision rule for QDA is to assign $\mathbf{x}$ to the class $y$ that maximizes the following discriminant function:

$$\delta_y(\mathbf{x}) = -\frac{1}{2} \log|\Sigma_y| - \frac{1}{2} (\mathbf{x} - \mu_y)^\top \Sigma_y^{-1} (\mathbf{x} - \mu_y) + \log p(y)$$

The QDA decision boundary is quadratic, making it more flexible but also more prone to overfitting compared to linear models.
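
To connect the discriminant function to code before turning to scikit-learn, here is a minimal NumPy sketch that plugs in sample estimates of $\mu_y$, $\Sigma_y$, and $p(y)$ and evaluates $\delta_y(\mathbf{x})$ directly. The qda_discriminants helper and the toy data are illustrative, not part of any library API.

python:

import numpy as np

def qda_discriminants(X_train, y_train, x):
    """Evaluate the QDA discriminant delta_y(x) for each class, using plug-in estimates."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)                       # class mean mu_y
        Sigma = np.cov(Xc, rowvar=False)           # class covariance Sigma_y
        prior = Xc.shape[0] / X_train.shape[0]     # prior p(y) as the class frequency
        diff = x - mu
        scores[c] = (-0.5 * np.log(np.linalg.det(Sigma))
                     - 0.5 * diff @ np.linalg.solve(Sigma, diff)
                     + np.log(prior))
    return scores

# Tiny synthetic example: two roughly Gaussian 2-D classes with different spreads
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2, 2], scale=1.5, size=(50, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

scores = qda_discriminants(X_train, y_train, np.array([1.5, 1.8]))
print(scores, "->", max(scores, key=scores.get))   # predicted class = argmax of delta_y(x)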

7. Python Implementation

Let's implement QDA in Python using the scikit-learn library.

python:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset
# (n_informative=2, n_redundant=0 so that the informative/redundant features fit into n_features=2)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and fit the QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# Make predictions
y_pred = qda.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"QDA Accuracy: {accuracy * 100:.2f}%")

8. Conclusion

Statistical decision theory provides a solid mathematical foundation for classification problems in machine learning. By understanding the underlying probability distributions, we can develop decision rules that minimize the risk of misclassification. Generative models like QDA offer a powerful approach to classification, especially when the assumptions of Gaussian distributions hold. However, the choice between generative and discriminative models should be guided by the specific characteristics of the data and the problem at hand.

Understanding these concepts and implementing them in Python equips us with the tools to tackle a wide range of classification tasks effectively.
