Machine Learning (Chapter 12): Principal Components Regression

 



Principal Components Regression (Chapter 12): An Overview

Principal Components Regression (PCR) is a technique that combines Principal Component Analysis (PCA) with linear regression. It is particularly useful when the predictors are highly correlated (multicollinearity), which makes ordinary least squares coefficient estimates unstable. PCR works by transforming the original predictors into a set of uncorrelated components and then regressing the response on a subset of these components.

Understanding Principal Components Regression

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a set of correlated variables into a set of linearly uncorrelated variables called principal components. These components are ordered by the amount of variance they explain in the data.
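
The two properties just described (uncorrelated components, ordered by explained variance) can be checked numerically. The following is a minimal sketch on synthetic data; the data and the names X_demo, pca_demo, and scores are illustrative assumptions, not part of the worked example later in this post.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic predictors: the first two columns are deliberately correlated
# (illustrative assumption, not data from the post).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

pca_demo = PCA()
scores = pca_demo.fit_transform(X_demo)

# Covariance matrix of the component scores: off-diagonal entries
# should be numerically zero, i.e. the components are uncorrelated.
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

# Variance explained by each component, in the order PCA returns them.
explained = pca_demo.explained_variance_

print("largest off-diagonal covariance:", np.abs(off_diag).max())
print("explained variance per component:", explained)
```

Because two of the three columns carry nearly the same signal, the first component should absorb most of the variance, which is the situation PCR is designed for.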

Principal Components Regression involves two main steps:

  1. PCA: Compute the principal components from the predictor variables.
  2. Regression: Fit a linear regression model using a subset of these principal components.
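
The two steps above can be sketched as a single scikit-learn pipeline. This is one possible arrangement, not the only one: the standardization step, the choice of three components, and the synthetic data (X_toy, y_toy) are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + 0.5 * rng.normal(size=100)

# Step 1 (PCA) and step 2 (regression) chained together.
# Standardizing first is a common choice so that no predictor
# dominates the components purely because of its scale.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X_toy, y_toy)

preds = pcr.predict(X_toy[:5])
print(preds)
```

Wrapping the steps in a pipeline keeps the scaling and the PCA rotation tied to the data the model was fitted on, so new observations are transformed the same way before prediction.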

Mathematical Formulation

  1. Principal Component Analysis:

    Given a data matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations and $p$ is the number of predictors, PCA transforms $X$ into a set of orthogonal components.

    The principal components $Z$ are computed as follows:

    $$Z = XW$$

    where $W$ is the matrix whose columns are the eigenvectors of $X^T X$ (proportional to the sample covariance matrix when the columns of $X$ are centered), and $Z$ contains the principal component scores.

  2. Regression:

    Once the principal components are computed, you can perform linear regression on them, typically keeping only the first few components.

    The regression model is:

    $$Y = Z\beta + \epsilon$$

    where $Y$ is the response variable, $\beta$ are the regression coefficients, and $\epsilon$ is the error term.
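
Both parts of the formulation can be traced in plain NumPy. The sketch below is a hand-rolled illustration under stated assumptions: the data are synthetic, the predictors and response are centered so that no intercept is needed, and k = 2 components are kept. All variable names are hypothetical.

```python
import numpy as np

# Synthetic data with correlated columns (illustrative assumption).
rng = np.random.default_rng(0)
n, p, k = 80, 4, 2
X_raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
y = X_raw @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=n)

# Center predictors and response so the regression needs no intercept.
Xc = X_raw - X_raw.mean(axis=0)
yc = y - y.mean()

# Part 1: eigenvectors of X^T X, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]
Z = Xc @ W  # principal component scores, Z = XW

# The score columns are orthogonal: Z^T Z is diagonal,
# with the eigenvalues on the diagonal.
gram = Z.T @ Z

# Part 2: least squares of the response on the first k score columns.
Z_k = Z[:, :k]
beta_z, *_ = np.linalg.lstsq(Z_k, yc, rcond=None)

# The same fit expressed as coefficients on the original (centered) predictors.
beta_x = W[:, :k] @ beta_z

pred_from_scores = Z_k @ beta_z + y.mean()
pred_from_predictors = Xc @ beta_x + y.mean()
print("coefficients on components:", beta_z)
print("implied coefficients on predictors:", beta_x)
```

Mapping the component coefficients back through W is what lets a PCR fit be reported in terms of the original predictors, even though the regression itself runs on the components.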

Example in Python

Let's work through an example using Python. We'll use the scikit-learn library to perform Principal Components Regression on a sample dataset.

1. Importing Libraries

python:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

2. Generating Sample Data

python:
# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 5)
y = X @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(100) * 0.5

3. Splitting Data

python:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

4. Standardizing the Data

python:
# Standardize the data (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

5. Performing PCA

python:
# Apply PCA
pca = PCA(n_components=3)  # Choose number of components
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
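
The number of components in the step above is fixed by hand. One common heuristic is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen threshold. The sketch below assumes a 90% threshold, which is arbitrary, and for simplicity regenerates the sample predictors and uses all 100 rows rather than only the training split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same generator settings as the sample data in this post,
# regenerated here so the block runs on its own.
np.random.seed(0)
X_demo = np.random.rand(100, 5)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to inspect the variance profile.
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the
# threshold (0.90 is an illustrative choice, not a rule).
threshold = 0.90
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", cumulative)
print("components needed for 90%:", n_keep)
```

Note that this criterion looks only at the predictors. A component that explains little variance in X can still be useful for predicting y, so the threshold is a starting point rather than a guarantee.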

6. Fitting the Regression Model

python:
# Fit linear regression model on principal components
model = LinearRegression()
model.fit(X_train_pca, y_train)

7. Making Predictions and Evaluating

python:
# Predict on test data
y_pred = model.predict(X_test_pca)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
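
A single test-set error says little about whether three components was a good choice. One way to pick the number of components is cross-validation over a pipeline. The following is a sketch under assumptions: the step names ("scale", "pca", "reg"), the 5-fold split, and the candidate grid are illustrative, and the sample data are regenerated so the block is self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same generator settings as the sample data in this post.
np.random.seed(0)
X_cv = np.random.rand(100, 5)
y_cv = X_cv @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(100) * 0.5

# Scaling and PCA live inside the pipeline, so they are refitted on
# each training fold and never see the held-out fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# Try every possible number of components and score by (negated) MSE.
search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [1, 2, 3, 4, 5]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_cv, y_cv)

best_k = search.best_params_["pca__n_components"]
print("best number of components:", best_k)
print("cross-validated MSE:", -search.best_score_)
```

Because the sample predictors here are generated independently of one another, there is little redundancy for PCA to remove, so cross-validation may well favor keeping most or all of the components.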

Summary

Principal Components Regression is a useful technique for dealing with multicollinearity and reducing the dimensionality of the predictor variables. By transforming the predictors into a set of uncorrelated components and then performing linear regression on a subset of them, PCR can stabilize the coefficient estimates and may improve predictive performance, although the components themselves can be harder to interpret than the original predictors. The Python code provided demonstrates a practical implementation of PCR using sample data.
