Machine Learning (Chapter 13): Partial Least Squares

Partial Least Squares (PLS) is a powerful statistical technique used in machine learning and data analysis for modeling complex relationships between variables. It is particularly useful when dealing with high-dimensional datasets where traditional methods may struggle. This article provides an overview of PLS, including the mathematics behind it, and demonstrates its implementation with Python code.

Understanding Partial Least Squares (PLS)

PLS is a regression technique that models the relationship between two matrices: one containing the predictors (X) and the other containing the responses (Y). Unlike ordinary least squares (OLS) regression, which becomes unstable or undefined when predictors are highly correlated or outnumber the observations, PLS handles multicollinearity and high-dimensional data gracefully.

Key Concepts

  1. Latent Variables: PLS identifies latent (hidden) variables: linear combinations of the original predictors chosen to covary as strongly as possible with the responses. A small number of these latent variables can often summarize the predictive information in a much larger set of correlated predictors.

  2. Orthogonal Transformation: PLS constructs a new set of orthogonal score vectors as a basis, so that successive components capture distinct, non-overlapping parts of the predictor-response relationship. A small numerical sketch of the first latent variable follows this list.
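
To make the first concept concrete, below is a minimal NumPy sketch of the first latent variable for a single response (the variable names are illustrative, not from any library). For mean-centered data, the first weight vector is proportional to X^T y, and the score t1 = X w1 is the linear combination of predictors that covaries most strongly with y.

python:
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5))       # 100 samples, 5 predictors
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)

# Mean-center both X and y (PLS assumes centered data)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS weight vector: the direction in predictor space
# with maximal covariance with the response
w1 = Xc.T @ yc
w1 /= np.linalg.norm(w1)

# First latent variable (score vector): a linear combination of the predictors
t1 = Xc @ w1
print("Covariance of t1 with y:", np.cov(t1, yc)[0, 1])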

Mathematical Formulation

The PLS algorithm aims to find linear combinations of the predictors (X) and responses (Y) that maximize the covariance between them. The model can be summarized as follows:

  1. Model Definition:

    \mathbf{X} = \mathbf{T} \mathbf{P}^{T} + \mathbf{E}, \qquad \mathbf{Y} = \mathbf{U} \mathbf{Q}^{T} + \mathbf{F}

    where:

    • \mathbf{X} is the matrix of predictors.
    • \mathbf{Y} is the matrix of responses.
    • \mathbf{T} and \mathbf{U} are matrices of latent variables (scores).
    • \mathbf{P} and \mathbf{Q} are matrices of loadings.
    • \mathbf{E} and \mathbf{F} are matrices of residuals.
  2. Objective: Maximize the covariance between the score matrices \mathbf{T} and \mathbf{U}:

    \max \operatorname{Cov}(\mathbf{T}, \mathbf{U})
  3. Algorithm (a minimal sketch follows this list):

    • Compute the latent variables \mathbf{T} and \mathbf{U} by finding the directions along which \mathbf{X} and \mathbf{Y} vary together.
    • Extract the loadings \mathbf{P} and \mathbf{Q} by regressing \mathbf{X} and \mathbf{Y} onto the scores.
    • Iterate until the scores converge, then deflate \mathbf{X} and \mathbf{Y} and repeat for the next component.
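
This iteration is the classic NIPALS algorithm. Below is a minimal didactic sketch of extracting one component, assuming mean-centered X and Y; the function and variable names are illustrative, and this is not scikit-learn's internal implementation.

python:
import numpy as np

def nipals_component(X, Y, n_iter=500, tol=1e-10):
    """Extract one PLS component from mean-centered X (n x p) and Y (n x m)."""
    u = Y[:, [0]]                      # initialize the Y-score with a column of Y
    for _ in range(n_iter):
        w = X.T @ u                    # X weights: direction of maximal covariance
        w /= np.linalg.norm(w)
        t = X @ w                      # X scores (a column of T)
        q = Y.T @ t / (t.T @ t)        # Y loadings
        u_new = Y @ q / (q.T @ q)      # Y scores (a column of U)
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    p = X.T @ t / (t.T @ t)            # X loadings
    # Deflate: remove this component's contribution before extracting the next
    X_next = X - t @ p.T
    Y_next = Y - t @ q.T
    return t, u, p, q, X_next, Y_next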

Python Implementation

We can use the scikit-learn library to implement PLS regression. Below is an example demonstrating how to perform PLS regression using Python.

Example Code

python:
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = load_diabetes()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Initialize and fit PLS model
pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)

# Make predictions
y_pred = pls.predict(X_test)

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")

# Display the coefficients
print("PLS Coefficients:")
print(pls.coef_)

Explanation of the Code

  1. Data Preparation: We load the diabetes dataset via load_diabetes from sklearn.datasets and split it into training and testing sets.

  2. Model Initialization: We create an instance of PLSRegression with 2 components. The number of components is a hyperparameter that controls model complexity; a sketch of choosing it by cross-validation follows this list.

  3. Model Fitting: We fit the PLS model to the training data.

  4. Prediction and Evaluation: We use the model to make predictions on the test set and evaluate the performance using mean squared error (MSE).

  5. Coefficients: We print the coefficients of the PLS model to understand the influence of each predictor on the response.
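
Because the number of components is the main hyperparameter, in practice it is usually tuned rather than fixed at 2. A minimal sketch using scikit-learn's GridSearchCV, reusing X_train and y_train from the example above:

python:
from sklearn.model_selection import GridSearchCV

# Search over the number of PLS components with 5-fold cross-validation
param_grid = {"n_components": list(range(1, X_train.shape[1] + 1))}
search = GridSearchCV(PLSRegression(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)

print("Best n_components:", search.best_params_["n_components"])
print("Cross-validated MSE:", -search.best_score_)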

Conclusion

Partial Least Squares (PLS) is a versatile tool for regression and dimensionality reduction, especially useful in scenarios with high-dimensional data or multicollinearity. By transforming predictors and responses into latent variables, PLS finds the directions of maximum covariance and provides a robust approach to modeling complex relationships. The provided Python example demonstrates how to implement PLS regression using scikit-learn and highlights its application in real-world data analysis.
