Machine Learning (Chapter 39): Bootstrapping & Cross-Validation
Introduction
In machine learning, evaluating model performance is a critical step. Two widely used resampling techniques for model evaluation are Bootstrapping and Cross-Validation. Both make efficient use of the available data to estimate how well a model will perform on unseen data, which helps detect overfitting and gives a more honest picture of generalization.
1. Bootstrapping
Bootstrapping is a resampling method in which new datasets are drawn from the original data with replacement, typically each of the same size as the original. The goal is to create multiple training datasets, build a model on each one, and assess the variability and performance of the resulting models.
Key Concept: When we sample with replacement, some observations may appear multiple times in the bootstrap sample, while others may not appear at all.
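To make this concrete: when a sample of size $n$ is drawn with replacement from $n$ observations, the probability that a given observation is never drawn is $(1 - 1/n)^n$, which approaches $e^{-1} \approx 0.368$ for large $n$. The sketch below checks this empirically. The sample size `n` and the seed are arbitrary choices made for illustration, not values from the original example.

```python
import numpy as np

# Hypothetical setup: n observations identified by their indices 0..n-1.
rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n draws with replacement from the n indices.
bootstrap_idx = rng.integers(0, n, size=n)

# Fraction of observations drawn at least once, and fraction never drawn.
unique_fraction = np.unique(bootstrap_idx).size / n
left_out_fraction = 1.0 - unique_fraction

print(f"Distinct observations in the sample: {unique_fraction:.3f}")
print(f"Observations left out:               {left_out_fraction:.3f}")
```

The left-out fraction should land close to 0.368, which is why each bootstrap sample effectively sees only about two thirds of the distinct observations.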
Mathematical Foundation
Given a dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, a bootstrap sample is created by drawing $n$ observations from $D$, with replacement. For each bootstrap sample, the model is trained and evaluated.
The estimated performance (such as accuracy, error) of the model is then averaged over all bootstrap samples.
Let:
- $D_b^*$ be the $b$-th bootstrap sample.
- $f_b$ be the model trained on $D_b^*$.
- $\text{Err}_b$ be the error on the test set for model $f_b$.
Then the bootstrap estimate of the error is given by:
$$\widehat{\text{Err}}_{\text{boot}} = \frac{1}{B} \sum_{b=1}^{B} \text{Err}_b$$
where $B$ is the number of bootstrap samples.
Python Code Example: Bootstrapping
```python
import numpy as np
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Number of bootstrap samples
B = 1000
bootstrap_accuracies = []

# Perform Bootstrapping
for i in range(B):
    # Create a bootstrap sample (same size as the training set, drawn with replacement)
    X_resampled, y_resampled = resample(X_train, y_train, random_state=i)

    # Train model
    model = DecisionTreeClassifier()
    model.fit(X_resampled, y_resampled)

    # Evaluate model on the held-out test set
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    bootstrap_accuracies.append(accuracy)

# Estimate of the model's performance
bootstrap_mean_accuracy = np.mean(bootstrap_accuracies)
print(f"Bootstrap Estimate of Accuracy: {bootstrap_mean_accuracy:.4f}")
```
In this example, we create multiple bootstrap samples, train a decision tree on each, and estimate the accuracy by averaging over all bootstrap samples.
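Since one stated goal of bootstrapping is to assess model variability, the spread of the per-sample accuracies is as informative as their mean. The following is a minimal sketch, assuming the same Iris setup as above, that reports the standard deviation and a 95% percentile interval. The smaller `B = 200` and the fixed `random_state=0` on the tree are choices made here to keep the run short and repeatable. The interval reflects only the variability that comes from resampling the training set, because the test set stays fixed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

B = 200  # assumption: fewer resamples than above, only to keep the sketch fast
accuracies = np.empty(B)

for b in range(B):
    # Draw a bootstrap sample of the training set (with replacement)
    X_b, y_b = resample(X_train, y_train, random_state=b)
    model = DecisionTreeClassifier(random_state=0).fit(X_b, y_b)
    # For classifiers, score() returns the mean accuracy on the given data
    accuracies[b] = model.score(X_test, y_test)

std_accuracy = accuracies.std(ddof=1)
lower, upper = np.percentile(accuracies, [2.5, 97.5])

print(f"Mean accuracy:           {accuracies.mean():.4f}")
print(f"Standard deviation:      {std_accuracy:.4f}")
print(f"95% percentile interval: [{lower:.4f}, {upper:.4f}]")
```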
2. Cross-Validation
Cross-Validation is a model evaluation technique that splits the data into several "folds." The model is trained on some folds and tested on the remaining ones. K-Fold Cross-Validation is a commonly used variation where the data is divided into $K$ equal-sized folds.
Mathematical Foundation
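Given a dataset $D$, we divide it into $K$ equally sized folds $D_1, D_2, \dots, D_K$.
For each fold $k$, the model is trained on $D \setminus D_k$ (all data except the $k$-th fold) and tested on $D_k$ (the $k$-th fold).
The error estimate for K-fold cross-validation is given by:
$$\text{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k$$
where $\text{Err}_k$ is the test error on the $k$-th fold.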
Because every observation is used for testing exactly once, Cross-Validation provides a more reliable estimate of the model’s performance than a single train/test split and makes overfitting easier to detect.
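The fold construction described above can be sketched with plain NumPy. This is an illustrative toy, not how any particular library implements it: the sizes `n = 10` and `K = 5` and the seed are hypothetical, chosen so the folds are easy to read. It shows that the test folds are disjoint and together cover every observation.

```python
import numpy as np

n, K = 10, 5  # hypothetical sizes chosen for readability
rng = np.random.default_rng(42)

# Shuffle the indices once, then cut them into K folds of (nearly) equal size.
indices = rng.permutation(n)
folds = np.array_split(indices, K)

for k, test_idx in enumerate(folds):
    # Training set for fold k: every index that is not in fold k.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    print(f"fold {k}: train={np.sort(train_idx)}, test={np.sort(test_idx)}")

# Every observation appears in exactly one test fold.
all_test = np.sort(np.concatenate(folds))
print("Each index tested exactly once:", np.array_equal(all_test, np.arange(n)))
```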
Python Code Example: Cross-Validation
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Define KFold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_accuracies = []

# Perform Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train model on the K-1 training folds
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Evaluate model on the held-out fold
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cv_accuracies.append(accuracy)

# Estimate of the model's performance
cv_mean_accuracy = np.mean(cv_accuracies)
print(f"Cross-Validation Estimate of Accuracy: {cv_mean_accuracy:.4f}")
```
In this example, we split the data into 5 folds and calculate the accuracy for each fold. The mean accuracy across all folds gives a more stable estimate of the model’s performance than the accuracy from any single train/test split.
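The same procedure can be written more compactly with scikit-learn's `cross_val_score` helper, which runs the train/evaluate loop internally. The sketch below assumes the same 5-fold setup as above. The `random_state=0` on the tree is an addition made here for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same splitter as in the manual loop: 5 shuffled folds with a fixed seed.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=kf, scoring="accuracy"
)

print("Per-fold accuracy:", scores.round(4))
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the per-fold scores alongside the mean makes it easier to see how much the estimate depends on the particular split.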
Bootstrapping vs. Cross-Validation
| Aspect | Bootstrapping | Cross-Validation |
|---|---|---|
| Resampling Method | Sampling with replacement from the original dataset | Splitting the dataset into K subsets without replacement |
| Data Utilization | Some samples may appear multiple times in the training set | Every sample is used exactly once for testing |
| Bias | May have a higher bias due to sampling with replacement | Tends to have lower bias but may have slightly higher variance |
| Variance | Can be used to estimate model variability | Provides a reliable estimate of model performance |
Java Code Example: K-Fold Cross-Validation
```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import java.util.Random;

public class KFoldCrossValidation {
    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("iris.arff");
        Instances data = source.getDataSet();
        // The class attribute is the last column
        data.setClassIndex(data.numAttributes() - 1);

        // Define classifier
        J48 tree = new J48();

        // 5-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 5, new Random(42));

        // Output accuracy
        System.out.println("Cross-Validation Accuracy: " + (1 - eval.errorRate()));
    }
}
```
In this Java example, we use Weka's J48 decision tree and perform 5-fold cross-validation on the Iris dataset.
Conclusion
Bootstrapping and Cross-Validation are both powerful tools for model evaluation, and each has different strengths. Bootstrapping is often preferred when we want to estimate the variability of a model or have a smaller dataset. Cross-Validation is usually the better choice when we have sufficient data and want a reliable estimate of how well the model generalizes. Both techniques play a crucial role in building robust machine learning models.