Machine Learning (Chapter 6): Statistical Decision Theory - Classification

Introduction

Statistical decision theory forms the backbone of many machine learning techniques, especially in the domain of classification. Classification is a fundamental problem in machine learning, where the goal is to assign a label to a given input based on its features. In this article, we will delve into the mathematical foundation of classification using statistical decision theory, explore key concepts, and provide Python code to illustrate these ideas.

1. The Basics of Classification

Classification is the task of predicting a discrete label $y$ (such as a class membership) given an input vector $\mathbf{x}$. The input vector $\mathbf{x}$ represents the features of the data, and $y$ is the class label. In the context of statistical decision theory, we aim to find a decision rule that minimizes the probability of misclassification.

2. Probability Distributions in Classification

In statistical decision theory, we assume that the data is generated according to some probability distribution. Let $p(\mathbf{x}|y)$ be the class-conditional probability density function, representing the probability of observing $\mathbf{x}$ given that the true class label is $y$. Additionally, $p(y)$ is the prior probability of class $y$.

The joint probability of $\mathbf{x}$ and $y$ is given by:

$$p(\mathbf{x}, y) = p(\mathbf{x}|y)\, p(y)$$

The goal is to infer the class label $y$ given the input $\mathbf{x}$, which can be achieved using Bayes' theorem:

$$p(y|\mathbf{x}) = \frac{p(\mathbf{x}|y)\, p(y)}{p(\mathbf{x})}$$

where $p(\mathbf{x}) = \sum_{y} p(\mathbf{x}|y)\, p(y)$ is the marginal probability of $\mathbf{x}$.
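
To make the formula concrete, here is a minimal sketch that turns class-conditional likelihoods and priors into posterior probabilities. The numbers are made up for illustration: assume two classes and that the likelihoods $p(\mathbf{x}|y)$ have already been evaluated at one observed $\mathbf{x}$.

python:

import numpy as np

# Hypothetical likelihoods p(x|y) evaluated at one observed x, for classes 0 and 1
likelihoods = np.array([0.05, 0.20])   # p(x | y=0), p(x | y=1)
priors = np.array([0.7, 0.3])          # p(y=0), p(y=1)

# Bayes' theorem: posterior is proportional to likelihood * prior, normalized by the evidence p(x)
joint = likelihoods * priors           # p(x, y)
evidence = joint.sum()                 # p(x) = sum_y p(x|y) p(y)
posterior = joint / evidence           # p(y | x)

print(posterior)                       # [0.368..., 0.631...]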

3. Bayesian Decision Rule

The Bayesian decision rule is to assign the class label $y$ that maximizes the posterior probability $p(y|\mathbf{x})$. This is known as Maximum A Posteriori (MAP) estimation:

$$y_{\text{MAP}} = \arg\max_{y} p(y|\mathbf{x}) = \arg\max_{y} p(\mathbf{x}|y)\, p(y)$$
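
Continuing the toy numbers from the sketch above, the MAP decision is just the argmax of the unnormalized posterior; the evidence $p(\mathbf{x})$ does not depend on $y$, so it can be dropped.

python:

import numpy as np

likelihoods = np.array([0.05, 0.20])   # same hypothetical p(x|y) as in the previous sketch
priors = np.array([0.7, 0.3])          # same hypothetical p(y)

# MAP rule: argmax over p(x|y) p(y); dividing by p(x) would not change the winner
y_map = int(np.argmax(likelihoods * priors))
print(y_map)                           # 1, since 0.20 * 0.3 = 0.06 > 0.05 * 0.7 = 0.035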

4. Loss Function and Risk

In practice, misclassification may have different consequences depending on the true class and the predicted class. This can be captured using a loss function $L(y, \hat{y})$, which represents the cost of predicting $\hat{y}$ when the true class is $y$.

The expected loss, also known as the risk, is given by:

$$R(\hat{y}|\mathbf{x}) = \sum_{y} L(y, \hat{y})\, p(y|\mathbf{x})$$

The goal is to minimize the expected loss. The Bayes risk is the minimum risk that can be achieved:

$$R_{\text{Bayes}} = \min_{\hat{y}} R(\hat{y}|\mathbf{x})$$

In the case of 0-1 loss, where the cost of misclassification is the same for all errors, the Bayes classifier reduces to the MAP rule.
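
The sketch below uses a made-up asymmetric loss matrix (and the posterior from the earlier toy example) to show how the risk-minimizing decision can differ from the MAP decision when errors have unequal costs.

python:

import numpy as np

posterior = np.array([0.37, 0.63])     # hypothetical p(y=0|x), p(y=1|x)

# Hypothetical loss matrix L[true, predicted]: predicting 1 when the truth is 0 is very costly
L = np.array([[0.0, 10.0],
              [1.0,  0.0]])

# Risk of each candidate prediction: R(y_hat|x) = sum_y L[y, y_hat] p(y|x)
risk = posterior @ L
y_bayes = int(np.argmin(risk))

print(risk)                            # [0.63, 3.7]: predicting 0 is cheaper overall
print(y_bayes)                         # 0, whereas the MAP rule would pick the more probable class 1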

5. Generative vs. Discriminative Models

There are two main approaches to classification, generative and discriminative models; a brief comparison of the two on the same data is sketched after the list below.

  • Generative Models: These models explicitly model the joint probability $p(\mathbf{x}, y)$, or equivalently the class-conditional distribution $p(\mathbf{x}|y)$ and the prior $p(y)$. The decision rule is then derived using Bayes' theorem. Examples include Gaussian Naive Bayes and Hidden Markov Models (HMMs).

  • Discriminative Models: These models directly model the posterior probability $p(y|\mathbf{x})$, or directly learn the decision boundary between classes without modeling the underlying distribution. Examples include Logistic Regression and Support Vector Machines (SVMs).
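
As a simplified illustration of the two approaches, the sketch below fits one generative classifier (Gaussian Naive Bayes) and one discriminative classifier (Logistic Regression) from scikit-learn to the same synthetic data. The dataset parameters are chosen arbitrarily here, mirroring the example in Section 7.

python:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-feature, two-class data (parameters are illustrative)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Generative: Gaussian Naive Bayes models p(x|y) and p(y), then applies Bayes' theorem
gnb = GaussianNB().fit(X_train, y_train)

# Discriminative: Logistic Regression models p(y|x) directly
logreg = LogisticRegression().fit(X_train, y_train)

print("GaussianNB accuracy:         ", accuracy_score(y_test, gnb.predict(X_test)))
print("LogisticRegression accuracy: ", accuracy_score(y_test, logreg.predict(X_test)))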

6. Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis (QDA) is a generative model in which the class-conditional densities $p(\mathbf{x}|y)$ are assumed to be Gaussian, with a different covariance matrix for each class.

The class-conditional density for class $y$ is given by:

$$p(\mathbf{x}|y) = \frac{1}{(2\pi)^{d/2} |\Sigma_y|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mu_y)^\top \Sigma_y^{-1} (\mathbf{x} - \mu_y)\right)$$

where $\mu_y$ is the mean vector and $\Sigma_y$ is the covariance matrix of the Gaussian distribution for class $y$.

The decision rule for QDA is to assign $\mathbf{x}$ to the class $y$ that maximizes the following discriminant function:

$$\delta_y(\mathbf{x}) = -\frac{1}{2} \log|\Sigma_y| - \frac{1}{2} (\mathbf{x} - \mu_y)^\top \Sigma_y^{-1} (\mathbf{x} - \mu_y) + \log p(y)$$

The QDA decision boundary is quadratic, making it more flexible but also more prone to overfitting compared to linear models.
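
To connect the discriminant function to code before turning to scikit-learn, here is a minimal NumPy sketch that plugs in sample estimates of $\mu_y$, $\Sigma_y$, and $p(y)$ and evaluates $\delta_y(\mathbf{x})$ directly. The qda_discriminants helper and the toy data are illustrative, not part of any library API.

python:

import numpy as np

def qda_discriminants(X_train, y_train, x):
    """Evaluate the QDA discriminant delta_y(x) for each class, using plug-in estimates."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)                       # class mean mu_y
        Sigma = np.cov(Xc, rowvar=False)           # class covariance Sigma_y
        prior = Xc.shape[0] / X_train.shape[0]     # prior p(y) as the class frequency
        diff = x - mu
        scores[c] = (-0.5 * np.log(np.linalg.det(Sigma))
                     - 0.5 * diff @ np.linalg.solve(Sigma, diff)
                     + np.log(prior))
    return scores

# Tiny synthetic example: two roughly Gaussian 2-D classes with different spreads
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[2, 2], scale=1.5, size=(50, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

scores = qda_discriminants(X_train, y_train, np.array([1.5, 1.8]))
print(scores, "->", max(scores, key=scores.get))   # predicted class = argmax of delta_y(x)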

7. Python Implementation

Let's implement QDA in Python using the scikit-learn library.

python:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score

# Generate a synthetic dataset
# (n_informative=2, n_redundant=0 so that the informative/redundant features fit into n_features=2)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0,
                           n_classes=2, n_clusters_per_class=1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and fit the QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

# Make predictions
y_pred = qda.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"QDA Accuracy: {accuracy * 100:.2f}%")

8. Conclusion

Statistical decision theory provides a solid mathematical foundation for classification problems in machine learning. By understanding the underlying probability distributions, we can develop decision rules that minimize the risk of misclassification. Generative models like QDA offer a powerful approach to classification, especially when the assumptions of Gaussian distributions hold. However, the choice between generative and discriminative models should be guided by the specific characteristics of the data and the problem at hand.

Understanding these concepts and implementing them in Python equips us with the tools to tackle a wide range of classification tasks effectively.
