Machine Learning (Chapter 13): Partial Least Squares
Partial Least Squares (PLS) is a statistical technique used in machine learning and data analysis for modeling relationships between two sets of variables. It is particularly useful for high-dimensional datasets with correlated predictors, where methods such as ordinary least squares regression become unstable. This article provides an overview of PLS, including the mathematics behind it, and demonstrates its implementation with Python code.
Understanding Partial Least Squares (PLS)
PLS is a regression technique that models the relationships between two matrices: one containing predictors (X) and the other containing responses (Y). Unlike ordinary least squares (OLS) regression, whose coefficient estimates become unstable when predictors are strongly correlated with one another or outnumber the observations, PLS can handle multicollinearity and high-dimensional data.
Key Concepts
Latent Variables: PLS identifies latent (hidden) variables that capture the covariance between predictors and responses, rather than the variance of either set alone. These latent variables are linear combinations of the original predictors and responses.
Orthogonal Transformation: PLS projects the data onto a new set of basis vectors that better represent the relationships between predictors and responses. In the standard regression form of the algorithm, the resulting predictor scores are mutually orthogonal.
Mathematical Formulation
The PLS algorithm aims to find linear combinations of the predictors (X) and responses (Y) that maximize the covariance between them. The model can be summarized as follows:
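Model Definition:
$$X = T P^\top + E, \qquad Y = U Q^\top + F$$
where:
- $X$ is the matrix of predictors.
- $Y$ is the matrix of responses.
- $T$ and $U$ are matrices of latent variables (scores).
- $P$ and $Q$ are matrices of loadings.
- $E$ and $F$ are matrices of residuals.
Objective: Maximize the covariance between $T$ and $U$. For each component, with score vectors $t = Xw$ and $u = Yc$ defined by unit-norm weight vectors $w$ and $c$:
$$\max_{\|w\| = \|c\| = 1} \operatorname{cov}(Xw, Yc)$$
Algorithm:
- Compute the latent variables $T$ and $U$ by finding the directions in which $X$ and $Y$ vary together.
- Extract loadings $P$ and $Q$ from the computed latent variables.
- Iterate to refine the latent variables and loadings until convergence.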
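The steps above can be sketched in NumPy for the single-response case (often called PLS1). This is an illustrative sketch under stated assumptions, not the implementation used by any particular library. The function name `pls1_fit` and the demo data are hypothetical, the data are only centered (not scaled), and because there is a single response the inner refinement loop converges in one pass and is therefore omitted.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Single-response PLS via NIPALS-style deflation (illustrative sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean      # centered working copies
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # direction in which X and y vary together
        w = w / np.linalg.norm(w)
        t = Xk @ w                       # scores for this latent variable
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings for this component
        c = yk @ t / tt                  # y loading (a scalar for one response)
        Xk = Xk - np.outer(t, p)         # deflate: remove what t explains
        yk = yk - c * t
        W.append(w)
        P.append(p)
        T.append(t)
        q.append(c)
    W, P, T = (np.column_stack(m) for m in (W, P, T))
    # Map the latent-space regression back to the original predictors
    coef = W @ np.linalg.solve(P.T @ W, np.array(q))
    intercept = y_mean - x_mean @ coef
    return coef, intercept, T

# Hypothetical demo data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)

coef, intercept, T = pls1_fit(X_demo, y_demo, n_components=2)
print("coefficients:", coef)
print("score Gram matrix:\n", T.T @ T)
```

When the number of components equals the number of predictors, this sketch reproduces the OLS solution; using fewer components is what regularizes the fit.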
Python Implementation
We can use the scikit-learn library to implement PLS regression. Below is an example demonstrating how to perform PLS regression using Python.
Example Code
```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = load_diabetes()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and fit PLS model
pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)

# Make predictions
y_pred = pls.predict(X_test)

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# Display the coefficients
print("PLS Coefficients:")
print(pls.coef_)
```
Explanation of the Code
Data Preparation: We use the load_diabetes dataset from sklearn.datasets and split it into training and testing sets.
Model Initialization: We create an instance of PLSRegression with 2 components. The number of components is a hyperparameter that controls the complexity of the model.
Model Fitting: We fit the PLS model to the training data.
Prediction and Evaluation: We use the model to make predictions on the test set and evaluate the performance using mean squared error (MSE).
Coefficients: We print the coefficients of the PLS model to understand the influence of each predictor on the response.
Conclusion
Partial Least Squares (PLS) is a versatile tool for regression and dimensionality reduction, especially useful in scenarios with high-dimensional data or multicollinearity. By transforming predictors and responses into latent variables, PLS finds the directions of maximum covariance and provides a robust approach to modeling complex relationships. The provided Python example demonstrates how to implement PLS regression using scikit-learn and highlights its application in real-world data analysis.
