Machine Learning (Chapter 12): Principal Components Regression
Principal Components Regression (Chapter 12): An Overview
Principal Components Regression (PCR) is a technique that combines Principal Component Analysis (PCA) with linear regression. It is particularly useful when the predictors are highly correlated (multicollinearity), which makes ordinary least-squares coefficient estimates unstable. PCR works by transforming the original predictors into a set of uncorrelated components and then performing regression on these components.
Understanding Principal Components Regression
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a set of correlated variables into a set of linearly uncorrelated variables called principal components. These components are ordered by the amount of variance they explain in the data.
Principal Components Regression involves two main steps:
- PCA: Compute the principal components from the predictor variables.
- Regression: Fit a linear regression model using a subset of these principal components.
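The two steps above can be chained into a single estimator. The following is a minimal sketch using a scikit-learn pipeline; the helper name `make_pcr`, the synthetic data, and the choice of three components are illustrative assumptions, not part of the method itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_pcr(n_components):
    """Hypothetical helper: standardize, project onto components, regress."""
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        LinearRegression(),
    )


# Illustrative synthetic data (an assumption for this sketch)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

pcr = make_pcr(n_components=3)
pcr.fit(X_demo, y_demo)
print(pcr.predict(X_demo[:5]).shape)
```

Wrapping the steps in one pipeline means the scaler and the PCA projection are always fitted on the same data as the regression, which avoids leaking test-set statistics into the components.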
Mathematical Formulation
Principal Component Analysis:
Given a data matrix $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations and $p$ is the number of predictors, PCA transforms $X$ into a set of orthogonal components.
The principal components are computed as follows:
$$Z = XW$$
where $W$ is the matrix of eigenvectors of the covariance matrix $\frac{1}{n-1} X^\top X$ (with the columns of $X$ centered), and the columns of $Z$ are the principal component scores.
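As a rough illustration of this computation, the sketch below builds the scores directly from an eigendecomposition of the covariance matrix with NumPy. The data, and the variable names `W` (eigenvectors) and `Z` (scores), are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 200 observations, 3 predictors, two of them correlated
A = rng.normal(size=(200, 3))
A[:, 1] = A[:, 0] + 0.1 * A[:, 1]

# Center the columns so that Xc^T Xc / (n - 1) is the covariance matrix
Xc = A - A.mean(axis=0)
cov = Xc.T @ Xc / (Xc.shape[0] - 1)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigvals, W = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # reorder by explained variance, descending
eigvals, W = eigvals[order], W[:, order]

Z = Xc @ W  # principal component scores

# The scores are uncorrelated: their covariance matrix is diagonal,
# with the eigenvalues on the diagonal
cov_Z = np.cov(Z, rowvar=False)
print(np.allclose(cov_Z, np.diag(eigvals)))
```

Library implementations such as scikit-learn's `PCA` typically use a singular value decomposition instead, which gives the same components up to the sign of each eigenvector.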
Regression:
Once the principal components are computed, you can perform linear regression using these components.
The regression model is:
$$y = Z\beta + \epsilon$$
where $y$ is the response variable, $Z$ holds the scores of the retained components, $\beta$ are the regression coefficients, and $\epsilon$ is the error term.
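Because the scores are a linear transformation of the predictors, coefficients fitted on the components can be mapped back to the original predictors. The sketch below checks this numerically; the data, the choice `k = 2`, and the names `W_k` and `beta_x` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

k = 2  # number of retained components (an arbitrary choice here)
pca = PCA(n_components=k)
Z = pca.fit_transform(X_demo)  # scores, shape (100, k)

reg = LinearRegression().fit(Z, y_demo)
beta = reg.coef_  # coefficients on the components

# Z = (X - mean) W_k, so y_hat = intercept + (X - mean) W_k beta,
# which means W_k beta are the implied coefficients on the original predictors
W_k = pca.components_.T  # shape (4, k)
beta_x = W_k @ beta

y_hat_direct = reg.predict(Z)
y_hat_mapped = reg.intercept_ + (X_demo - pca.mean_) @ beta_x
print(np.allclose(y_hat_direct, y_hat_mapped))
```

The mapped coefficients are constrained to lie in the span of the retained eigenvectors, which is where PCR's regularizing effect comes from.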
Example in Python
Let's work through an example using Python. We'll use the scikit-learn library to perform Principal Components Regression on a sample dataset.
1. Importing Libraries
python:import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
2. Generating Sample Data
python:# Generate sample data: 100 observations of 5 predictors, with a linear response plus Gaussian noise
np.random.seed(0)
X = np.random.rand(100, 5)
y = X @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(100) * 0.5
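The sample data draws each predictor independently, so the predictors are essentially uncorrelated and the multicollinearity that motivates PCR is absent. A hypothetical variant with strongly correlated predictors might look like the following sketch; the two-factor construction, the noise levels, and the names `X_corr` and `y_corr` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two hidden factors drive all five predictors (a hypothetical setup)
factors = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 5))
X_corr = factors @ loadings + 0.05 * rng.normal(size=(n, 5))

# Same coefficient vector as the sample data, applied to the correlated predictors
y_corr = X_corr @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + rng.normal(size=n) * 0.5

# A large condition number of the correlation matrix signals multicollinearity
corr = np.corrcoef(X_corr, rowvar=False)
print(np.linalg.cond(corr))
```

With data like this, most of the predictor variance sits in the first two components, which is the setting where dropping the remaining components costs little.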
3. Splitting Data
python:# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
4. Standardizing the Data
python:# Standardize using training-set statistics only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
5. Performing PCA
python:# Apply PCA
pca = PCA(n_components=3)  # Keep the first 3 components (a tunable choice)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
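The number of components is a tuning choice. One common heuristic is to keep the smallest number of components whose cumulative explained variance reaches a threshold. The sketch below illustrates this; the 90% cutoff and the stand-in data are assumptions, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))  # stand-in for the training predictors
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_full = PCA().fit(X_demo_scaled)  # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90  # arbitrary cutoff, an assumption for this sketch
# Smallest k whose cumulative explained variance reaches the threshold
k = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), k)
```

Explained variance looks only at the predictors, not the response, so a component with little variance can still matter for prediction. Cross-validating the downstream regression is a more direct way to choose.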
6. Fitting the Regression Model
python:# Fit linear regression model on principal components
model = LinearRegression()
model.fit(X_train_pca, y_train)
7. Making Predictions and Evaluating
python:# Predict on test data
y_pred = model.predict(X_test_pca)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
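A single test-set MSE for one fixed number of components says little about whether that number is a good choice. As a sketch, the number of components can instead be selected by cross-validation. The step names (`scale`, `pca`, `reg`), the grid, and the five folds are assumptions made here. For brevity the search runs on the full sample; in practice it would run on the training split only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same data-generating recipe as the walkthrough
np.random.seed(0)
X = np.random.rand(100, 5)
y = X @ np.array([1.5, -2.0, 0.5, 3.0, -1.0]) + np.random.randn(100) * 0.5

pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("reg", LinearRegression()),
])

# Try every possible number of components and score by cross-validated MSE
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": [1, 2, 3, 4, 5]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_k = search.best_params_["pca__n_components"]
best_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_k, best_mse)
```

Keeping all five components is equivalent to ordinary least squares on the standardized predictors, so the grid also shows whether dropping components helps at all on a given dataset.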
Summary
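Principal Components Regression is a useful technique for dealing with multicollinearity and reducing the dimensionality of the predictor variables. By transforming the predictors into a set of uncorrelated components and then performing linear regression on a subset of them, PCR can stabilize the coefficient estimates and may improve predictive performance. Two caveats apply: the components are chosen without reference to the response, so the leading components are not guaranteed to be the most predictive ones, and coefficients on components are harder to interpret than coefficients on the original predictors. The Python code provided demonstrates a practical implementation of PCR using sample data.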
