Machine Learning (Chapter 12): Principal Components Regression

By Ritesh Sahu August 16, 2024

Principal Components Regression (Chapter 12): An Overview

Principal Components Regression (PCR) combines Principal Component Analysis (PCA) with linear regression. It is particularly useful when the predictors are highly correlated (multicollinearity), which makes ordinary least squares coefficient estimates unstable. PCR transforms the original predictors into a set of uncorrelated components and then regresses the response on a subset of these components.

Understanding Principal Components Regression

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a set of correlated variables into a set of linearly uncorrelated variables called principal components. These components are ordered by the amount of variance they explain in the data.
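
The two properties just described (uncorrelated components, ordered by explained variance) can be checked numerically. The following is a minimal sketch on made-up data; the array `X_demo` and its correlation structure are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: three predictors, the first two strongly correlated.
base = rng.normal(size=200)
X_demo = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

pca_demo = PCA()
scores = pca_demo.fit_transform(X_demo)

# Covariance of the scores: off-diagonal entries are numerically zero,
# so the components are uncorrelated.
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))

# Share of variance explained by each component, in decreasing order.
print(pca_demo.explained_variance_ratio_)
```

Because two of the three columns are nearly copies of each other, most of the variance should end up in the first two components.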

Principal Components Regression involves two main steps:

PCA: Compute the principal components from the predictor variables.
Regression: Fit a linear regression model using a subset of these principal components.
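
One way to keep the two steps together is a scikit-learn pipeline. This is only a sketch: the toy data (`X_toy`, `y_toy`), the scaling step, and the choice of two components are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical toy data with four predictors.
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=50)

# Step 1 (PCA, after scaling) and step 2 (regression) chained together.
# n_components=2 is an arbitrary illustrative choice.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X_toy, y_toy)

print(pcr.predict(X_toy[:3]))
```

Wrapping both steps in one estimator means the scaler and the PCA are always fitted on the same data as the regression, which helps avoid leaking test information into the transform.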

Mathematical Formulation

Principal Component Analysis:
Given a data matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations and $p$ is the number of predictors, with each column centered (and usually standardized), PCA transforms $X$ into a set of orthogonal components.
The principal components $Z$ are computed as follows:
$Z = X W$
where the columns of $W$ are the eigenvectors of the sample covariance matrix $\frac{1}{n-1} X^T X$, ordered by decreasing eigenvalue, and $Z$ contains the principal component scores. PCR keeps only the first $k \le p$ columns of $W$, so that $Z \in \mathbb{R}^{n \times k}$.
Regression:
Once the principal components are computed, you can perform linear regression using these components.
The regression model is:
$Y = Z \beta + \epsilon$
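where $Y$ is the response variable (centered, or with an intercept added to the model), $\beta$ is the vector of regression coefficients on the components, and $\epsilon$ is the error term. The implied coefficients on the original predictors are $W \beta$.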
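
The formulas above can be traced step by step with NumPy alone. The sketch below uses hypothetical data (`X_raw`, `y_raw`) and an assumed choice of $k = 2$; it centers the columns, builds $W$ from the covariance matrix, forms $Z = X W$ from the first $k$ columns, and solves the regression by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 4, 2

# Hypothetical data with one near-collinear pair of predictors.
X_raw = rng.normal(size=(n, p))
X_raw[:, 1] = X_raw[:, 0] + 0.05 * rng.normal(size=n)
y_raw = X_raw @ np.array([1.0, 1.0, -0.5, 2.0]) + 0.1 * rng.normal(size=n)

# Center the columns so that X^T X / (n - 1) is the sample covariance.
Xc = X_raw - X_raw.mean(axis=0)
yc = y_raw - y_raw.mean()

# Eigendecomposition of the covariance matrix. eigh returns eigenvalues
# in ascending order, so reorder to get decreasing variance.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order]

# Scores on the first k components: Z = X W_k.
W_k = W[:, :k]
Z = Xc @ W_k

# Least-squares fit of Y = Z beta + eps.
beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)

# Implied coefficients on the original (centered) predictors.
coef_original = W_k @ beta
print(beta)
print(coef_original)
```

When all $p$ components are kept, the implied coefficients coincide with ordinary least squares on the centered predictors; PCR differs from OLS only once components are dropped.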

Example in Python

Let's work through an example using Python. We'll use the scikit-learn library to perform Principal Components Regression on a sample dataset.

1. Importing Libraries

python:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

2. Generating Sample Data

python:
# Generate sample data: 5 independent uniform predictors and a linear response with Gaussian noise
np.random.seed(0)
X = np.random.rand(100, 5)
y = X @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(100) * 0.5

3. Splitting Data

python:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

4. Standardizing the Data

python:
# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

5. Performing PCA

python:
# Apply PCA
pca = PCA(n_components=3)  # Keep 3 of the 5 components (an illustrative choice, not a tuned one)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
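
The choice of 3 components above is arbitrary. One common heuristic is to look at the cumulative explained variance and keep enough components to reach some threshold. The sketch below recreates the same sample data so it runs on its own; the 0.90 threshold and the names `pca_full`, `cumulative`, and `k` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the sample predictors and the training split used above.
np.random.seed(0)
X = np.random.rand(100, 5)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit PCA with all components to see how the variance is distributed.
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative share reaches the
# threshold (0.90 is an assumed, illustrative value).
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)
print(k)
```

Explained variance only describes the predictors, not the response, so this heuristic does not guarantee the best predictive model.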

6. Fitting the Regression Model

python:
# Fit linear regression model on principal components
model = LinearRegression()
model.fit(X_train_pca, y_train)

7. Making Predictions and Evaluating

python:
# Predict on test data
y_pred = model.predict(X_test_pca)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
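
A single test MSE for one fixed number of components says little about whether that number was a good choice. A possible refinement, sketched here under the assumption that 5-fold cross-validation on the training set is acceptable, is to let `GridSearchCV` pick the number of components. The step names `scale`, `pca`, and `reg` are labels invented for this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Recreate the sample data and split used in this example.
np.random.seed(0)
X = np.random.rand(100, 5)
y = X @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(100) * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# Try every possible number of components with 5-fold cross-validation.
search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [1, 2, 3, 4, 5]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["pca__n_components"]
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(f"Best number of components: {best_k}")
print(f"Test MSE with that choice: {test_mse:.4f}")
```

Since the sample predictors here are generated independently, there is little redundancy to remove, and the search may well prefer keeping most or all of the components.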

Summary

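Principal Components Regression addresses multicollinearity by replacing the predictors with a smaller set of uncorrelated components and then fitting a linear regression on them. Dropping low-variance components can stabilize the coefficient estimates and reduce their variance, at the cost of some bias. Because the components are chosen without reference to the response, the number of components is best selected by validation rather than fixed in advance. The components are linear combinations of all predictors, so they are often harder to interpret than the original variables. The Python code provided demonstrates a practical implementation of PCR using sample data.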

