Machine Learning (Chapter 10): Subset Selection

By Ritesh Sahu August 05, 2024

Introduction

Subset selection is a feature selection technique in machine learning: it chooses a subset of relevant features from a larger set so that the resulting model is simpler, cheaper to fit, and often more accurate on unseen data. This chapter covers the mathematical formulation of subset selection and a practical implementation in Python.

Mathematical Formulation

Objective Function

The goal of subset selection is to select a subset of features $S$ from a total set of features $F$ such that a chosen objective function is optimized. Typically, this means minimizing an error metric while limiting or penalizing the number of features used.

Let:

$\mathbf{X}$ be the matrix of features (with dimensions $n \times p$), where $n$ is the number of samples and $p$ is the number of features.
$\mathbf{y}$ be the vector of target values (with dimensions $n \times 1$).
$S \subseteq \{1, 2, \ldots, p\}$ be the subset of feature indices selected.
$\mathbf{X}_S$ be the submatrix of $\mathbf{X}$ containing only the features in subset $S$.

The objective function often used in subset selection is the Mean Squared Error (MSE) of a model, which can be expressed as:

$\text{MSE}(\mathbf{X}_S, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

where $\hat{y}_i$ is the predicted value for the $i$-th sample from a model fitted using only the features in $S$.
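To make the objective concrete, here is a minimal sketch that evaluates $\text{MSE}(\mathbf{X}_S, \mathbf{y})$ for a given subset $S$. It assumes the model is ordinary least squares with an intercept, and it uses synthetic data; the helper name `subset_mse` and the data are illustrative assumptions, not part of the original example.

```python
import numpy as np

def subset_mse(X, y, subset):
    """Training MSE of a least-squares fit that uses only the columns in `subset`."""
    X_s = X[:, list(subset)]
    # Prepend a column of ones so the fit includes an intercept term
    A = np.column_stack([np.ones(len(y)), X_s])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    return float(np.mean((y - y_hat) ** 2))

# Synthetic data (illustrative only): y depends on features 0 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

mse_relevant = subset_mse(X, y, [0, 2])    # subset containing the informative features
mse_irrelevant = subset_mse(X, y, [1, 3])  # subset containing only noise features
print("MSE with S = {0, 2}:", mse_relevant)
print("MSE with S = {1, 3}:", mse_irrelevant)
```

The subset that contains the informative features should give a much lower error than the subset of noise features. Note that this is the training error; in practice the objective is usually evaluated on held-out data.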

Feature Selection Criteria

Two common greedy search strategies for feature selection are:

Forward Selection: Start with an empty set of features and add features one by one, at each step choosing the feature whose addition improves model performance the most.
Backward Elimination: Start with all features and remove features one by one, at each step choosing the feature whose removal improves model performance the most (or degrades it the least).
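Both strategies can be sketched in a few lines. The version below is a minimal illustration, assuming a least-squares model scored by training MSE and a fixed target number of features `k`; the function names and the synthetic data are assumptions made for this sketch.

```python
import numpy as np

def training_mse(X, y, subset):
    """Training MSE of a least-squares fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, subset]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ coef) ** 2))

def forward_selection(X, y, k):
    """Greedily add the feature that lowers the error the most until k are chosen."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k:
        best = min(remaining, key=lambda j: training_mse(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_elimination(X, y, k):
    """Greedily drop the feature whose removal leaves the lowest error until k remain."""
    selected = list(range(X.shape[1]))
    while len(selected) > k:
        drop = min(
            selected,
            key=lambda j: training_mse(X, y, [i for i in selected if i != j]),
        )
        selected.remove(drop)
    return selected

# Synthetic data (illustrative only): y depends on features 0 and 3
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

forward_result = forward_selection(X, y, 2)
backward_result = sorted(backward_elimination(X, y, 2))
print("Forward selection:", forward_result)
print("Backward elimination:", backward_result)
```

Training error never increases when a feature is added, so this sketch stops at a fixed `k`. A practical version would score candidates on a validation set or with cross-validation instead.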

Example Python Code

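Here's an example using Python and the scikit-learn library to perform subset selection through Recursive Feature Elimination (RFE). RFE is a variant of backward elimination: it repeatedly fits the model and drops the least important feature (for a linear model, the one with the smallest absolute coefficient) until the requested number of features remains. Because coefficients are compared directly, the features should be on comparable scales.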

Import Libraries

python:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Load Dataset

python:
# Load the diabetes regression dataset (10 features) bundled with scikit-learn.
# load_boston was removed in scikit-learn 1.2, so it can no longer be imported.
data = load_diabetes()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Perform RFE for Feature Selection

python:
# Create a linear regression model
model = LinearRegression()

# Create the RFE model and select the top 5 features
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X_train, y_train)

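# support_ is a boolean mask over the features (True = selected)
print("Selected features:", fit.support_)
# ranking_ is 1 for selected features; larger values were eliminated earlier
print("Feature ranking:", fit.ranking_)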
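The boolean mask and the ranking are easier to read when mapped back to feature names. The following is a small self-contained sketch, assuming scikit-learn's bundled diabetes dataset as a stand-in (an assumption for illustration); the variable names `selected_names` and `eliminated` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Assumption: the diabetes dataset is used only as a convenient bundled example
data = load_diabetes()
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(data.data, data.target)

feature_names = np.array(data.feature_names)

# Features kept by RFE (ranking_ == 1)
selected_names = feature_names[rfe.support_].tolist()

# Eliminated features as (rank, name) pairs; a higher rank means dropped earlier
eliminated = sorted(
    zip(rfe.ranking_[~rfe.support_].tolist(), feature_names[~rfe.support_].tolist())
)

print("Selected:", selected_names)
print("Eliminated (rank, name):", eliminated)
```

With the default `step=1`, RFE drops one feature per iteration, so each eliminated feature receives its own rank.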

Evaluate Model Performance

python:
# Fit the model using the selected features
selected_features = np.where(fit.support_)[0]
X_train_selected = X_train[:, selected_features]
X_test_selected = X_test[:, selected_features]

# Train the model
model.fit(X_train_selected, y_train)

# Make predictions
y_pred = model.predict(X_test_selected)

# Calculate and print the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
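A test error for the selected subset is only meaningful next to a baseline. The sketch below compares a model trained on all features with one restricted to the RFE-selected subset. It again assumes the bundled diabetes dataset and the same split settings as above; neither result should be read as evidence that one approach is better in general.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: diabetes dataset and a 70/30 split, chosen for illustration
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: linear regression on all features
full_model = LinearRegression().fit(X_train, y_train)
mse_full = mean_squared_error(y_test, full_model.predict(X_test))

# RFE refits the estimator on the selected features, so predict() can be
# called on the fitted selector directly
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X_train, y_train)
mse_subset = mean_squared_error(y_test, rfe.predict(X_test))

print("Test MSE, all features:", mse_full)
print("Test MSE, 5 selected features:", mse_subset)
```

If the subset model's error is close to the baseline, the discarded features contributed little; a large gap suggests that too many features were removed.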

Conclusion

Subset selection improves machine learning models by restricting them to the most relevant features. Methods such as Recursive Feature Elimination reduce the dimensionality of a dataset, which can make models cheaper to train and easier to interpret. The mathematical formulation above defines what is being optimized, and the Python implementation shows how to apply it in practice.
