Machine Learning (Chapter 16): Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a powerful technique in machine learning used for classification and dimensionality reduction. It is especially useful for multi-class datasets where you want to reduce the number of features while preserving as much of the class-discriminatory information as possible.
Mathematical Formulation
The core idea behind LDA is to find a linear combination of features that best separates the classes. The goal is to project the data onto a lower-dimensional space where the classes are as distinct as possible.
Let's break down the mathematics behind LDA:
Compute the Within-Class Scatter Matrix $S_W$: For each class $c$, compute the scatter matrix $S_c$:

$$S_c = \sum_{x \in D_c} (x - \mu_c)(x - \mu_c)^T$$

where $D_c$ is the set of data points in class $c$, and $\mu_c$ is the mean vector of class $c$.

The total within-class scatter matrix is:

$$S_W = \sum_{c=1}^{C} S_c$$

where $C$ is the number of classes.

Compute the Between-Class Scatter Matrix $S_B$: Compute the scatter matrix between classes as:

$$S_B = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T$$

where $N_c$ is the number of samples in class $c$, $\mu_c$ is the mean vector of class $c$, and $\mu$ is the overall mean vector of all samples.

Solve the Generalized Eigenvalue Problem: To find the linear discriminants, solve the eigenvalue problem:

$$S_W^{-1} S_B w = \lambda w$$

where $w$ represents the eigenvectors (discriminant vectors) and $\lambda$ are the corresponding eigenvalues.

Project the Data: Use the top $k$ eigenvectors, where $k$ is at most the number of classes minus one ($k \le C - 1$), to project the data onto a $k$-dimensional space:

$$Y = X W$$

where $X$ is the matrix of input features, and $W$ is the matrix whose columns are the top $k$ eigenvectors.
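To make the linear algebra above concrete, here is a minimal from-scratch sketch of these steps in NumPy. It is a didactic illustration, not the implementation scikit-learn uses internally (which relies on more numerically stable solvers); the function name `lda_fit` and the use of `np.linalg.pinv` for robustness are choices made for this sketch.

```python
import numpy as np

def lda_fit(X, y, k):
    """Fit LDA via the scatter-matrix recipe above; returns the projection matrix W (d x k)."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)                       # overall mean vector
    S_W = np.zeros((d, d))                    # within-class scatter
    S_B = np.zeros((d, d))                    # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        S_W += (X_c - mu_c).T @ (X_c - mu_c)  # accumulate per-class scatter S_c
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T) # N_c (mu_c - mu)(mu_c - mu)^T
    # Solve S_W^{-1} S_B w = lambda w (pinv for numerical robustness)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]    # sort by decreasing eigenvalue
    return eigvecs[:, order[:k]].real

# Usage: project the Iris data onto the top k = C - 1 = 2 discriminants
from sklearn.datasets import load_iris
data = load_iris()
W = lda_fit(data.data, data.target, k=2)
Y = data.data @ W                             # Y = X W
print(Y.shape)  # (150, 2)
```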
Example in Python
Let's apply LDA to a simple dataset using Python. We will use the Iris dataset for demonstration purposes.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Split data into training and testing sets (before scaling, so no test-set
# statistics leak into preprocessing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features, fitting the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Apply LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
# Train a classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train_lda, y_train)
# Predict and evaluate
y_pred = clf.predict(X_test_lda)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Plot the results
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], data.target_names):
    plt.scatter(X_train_lda[y_train == i, 0], X_train_lda[y_train == i, 1],
                color=color, alpha=0.8, label=target_name)
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.title('LDA: Iris dataset')
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.show()
```
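If you want to quantify how much of the between-class discriminatory information each component captures, scikit-learn's `LinearDiscriminantAnalysis` exposes an `explained_variance_ratio_` attribute after fitting. A short check, continuing from the fitted `lda` object above:

```python
# Fraction of between-class variance explained by each discriminant
print(lda.explained_variance_ratio_)
# For Iris, LD1 typically dominates (roughly 0.99 vs 0.01), so even a single
# component preserves most of the class-discriminatory information.
```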
Explanation of the Code
Data Loading:
- We load the Iris dataset.
Splitting and Scaling:
- The dataset is split into training and testing sets first; the features are then standardized to zero mean and unit variance using statistics computed on the training set only, which avoids leaking test-set information into preprocessing.
Applying LDA:
- We apply LDA to reduce the dimensionality to 2 components.
Training a Classifier:
- A logistic regression classifier is trained on the reduced feature set.
Evaluation:
- We evaluate the accuracy of the classifier on the test set.
Plotting:
- We visualize the results to see how well LDA separates the classes in the reduced dimensional space.
By applying LDA, we can achieve a lower-dimensional representation of the data while maintaining class separability, which is useful for both visualization and improving the performance of machine learning models.
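Note that `LinearDiscriminantAnalysis` is itself a classifier, so the separate logistic regression step above is optional. A minimal sketch of using it directly for classification, reusing the same train/test split as above:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

# Fit LDA as a classifier on the (unreduced) training features
lda_clf = LinearDiscriminantAnalysis()
lda_clf.fit(X_train, y_train)

# Predict classes directly; no separate downstream classifier is needed
y_pred = lda_clf.predict(X_test)
print(f'LDA-as-classifier accuracy: {accuracy_score(y_test, y_pred):.2f}')
```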
