Machine Learning (Chapter 39): Bootstrapping & Cross-Validation

Introduction

In the context of machine learning, evaluating the performance of models is a critical step. Two essential techniques widely used for model evaluation and resampling are Bootstrapping and Cross-Validation. Both estimate a model's performance by making efficient use of the available data, which helps detect overfitting and gauge how well the model generalizes.

1. Bootstrapping

Bootstrapping is a resampling method in which datasets of the same size as the original are drawn with replacement. The goal is to create multiple training datasets, build a model on each, and assess model variability and performance.

Key Concept: When we sample with replacement, some observations may appear multiple times in the bootstrap sample, while others may not appear at all.
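In fact, for a sample of size $n$, each observation has probability $1 - (1 - 1/n)^n \approx 63.2\%$ (for large $n$) of appearing in a given bootstrap sample. A quick numerical sketch of this, using NumPy with an illustrative sample size of $n = 1000$:

python
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # illustrative sample size
trials = 200      # number of bootstrap samples to simulate

fractions = []
for _ in range(trials):
    # Draw n indices with replacement, then count how many distinct
    # original observations made it into the bootstrap sample
    idx = rng.integers(0, n, size=n)
    fractions.append(len(np.unique(idx)) / n)

print(f"Average fraction of unique observations: {np.mean(fractions):.3f}")
# Should be close to 1 - (1 - 1/n)**n ≈ 0.632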

Mathematical Foundation

Given a dataset $D = \{x_1, x_2, \dots, x_n\}$, a bootstrap sample $D_b$ is created by drawing $n$ observations from $D$ with replacement. For each bootstrap sample, the model is trained and evaluated.

The estimated performance (such as accuracy, error) of the model is then averaged over all bootstrap samples.

Let:

  • $D_b^{(i)}$ be the $i$-th bootstrap sample.
  • $f_{\theta^{(i)}}$ be the model trained on $D_b^{(i)}$.
  • $\text{Error}^{(i)}$ be the error on the test set for model $f_{\theta^{(i)}}$.

Then the bootstrap estimate of the error is given by:

$$\hat{E} = \frac{1}{B} \sum_{i=1}^{B} \text{Error}^{(i)}$$

where $B$ is the number of bootstrap samples.

Python Code Example: Bootstrapping

python
import numpy as np
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Number of bootstrap samples
B = 1000
bootstrap_accuracies = []

# Perform bootstrapping
for i in range(B):
    # Create a bootstrap sample (sampling with replacement)
    X_resampled, y_resampled = resample(X_train, y_train, random_state=i)

    # Train model
    model = DecisionTreeClassifier()
    model.fit(X_resampled, y_resampled)

    # Evaluate model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    bootstrap_accuracies.append(accuracy)

# Estimate of the model's performance
bootstrap_mean_accuracy = np.mean(bootstrap_accuracies)
print(f"Bootstrap Estimate of Accuracy: {bootstrap_mean_accuracy:.4f}")

In this example, we create multiple bootstrap samples, train a decision tree on each, and estimate the accuracy by averaging over all bootstrap samples.
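Because bootstrapping produces a whole distribution of accuracy scores rather than a single number, it can also quantify the variability of the model, not just its average performance. A minimal sketch, assuming the bootstrap_accuracies list from the example above is still in scope:

python
import numpy as np

# Spread of the bootstrap accuracy distribution
std_accuracy = np.std(bootstrap_accuracies)

# A simple 95% percentile interval over the bootstrap samples
lower, upper = np.percentile(bootstrap_accuracies, [2.5, 97.5])

print(f"Std of bootstrap accuracies: {std_accuracy:.4f}")
print(f"95% percentile interval: [{lower:.4f}, {upper:.4f}]")

A narrow interval suggests the model's performance is stable across resampled training sets; a wide one signals high sensitivity to the particular training data.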


2. Cross-Validation

Cross-Validation is a model evaluation technique that splits the data into several "folds." The model is trained on some folds and tested on the remaining ones. K-Fold Cross-Validation is a commonly used variation where the data is divided into $K$ equal-sized folds.

Mathematical Foundation

Given a dataset $D = \{x_1, x_2, \dots, x_n\}$, we divide it into $K$ equally sized folds $D_1, D_2, \dots, D_K$.

For each fold $i$, the model is trained on $D \setminus D_i$ (all data except the $i$-th fold) and tested on $D_i$ (the $i$-th fold).

The error estimate for K-fold cross-validation is given by:

$$\hat{E}_{CV} = \frac{1}{K} \sum_{i=1}^{K} \text{Error}(D_i)$$

where $\text{Error}(D_i)$ is the test error on the $i$-th fold.

Cross-Validation helps mitigate overfitting and provides a more reliable estimate of the model’s performance.

Python Code Example: Cross-Validation

python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Define K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_accuracies = []

# Perform Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train model
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Evaluate model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cv_accuracies.append(accuracy)

# Estimate of the model's performance
cv_mean_accuracy = np.mean(cv_accuracies)
print(f"Cross-Validation Estimate of Accuracy: {cv_mean_accuracy:.4f}")

In this example, we split the data into 5 folds and calculate the accuracy for each fold. The mean accuracy across all folds gives us a better estimate of the model’s performance.
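The same evaluation can be written more compactly with scikit-learn's cross_val_score helper, which runs the splitting, training, and scoring loop internally. A minimal sketch, assuming the same Iris data and decision tree classifier as above:

python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Same 5-fold setup as the manual loop above
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)

print(f"Cross-Validation Estimate of Accuracy: {np.mean(scores):.4f}")

Passing the same KFold object keeps the splits identical to the manual loop, so the two estimates agree; the manual version remains useful when you need per-fold control.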


Bootstrapping vs. Cross-Validation

| Aspect | Bootstrapping | Cross-Validation |
|---|---|---|
| Resampling Method | Sampling with replacement from the original dataset | Splitting the dataset into $K$ subsets without replacement |
| Data Utilization | Some samples may appear multiple times in a training set, while others are left out | Every sample is used exactly once for testing |
| Bias | May have higher bias due to sampling with replacement | Tends to have lower bias, but may have slightly higher variance |
| Variance | The spread of the bootstrap estimates can be used to measure model variability | The averaged estimate across folds provides a reliable measure of model performance |

Java Code Example: K-Fold Cross-Validation

java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import java.util.Random;

public class KFoldCrossValidation {
    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("iris.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Define classifier
        J48 tree = new J48();

        // 5-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 5, new Random(42));

        // Output accuracy
        System.out.println("Cross-Validation Accuracy: " + (1 - eval.errorRate()));
    }
}

In this Java example, we use Weka's J48 decision tree and perform 5-fold cross-validation on the Iris dataset.

Conclusion

Bootstrapping and Cross-Validation are both powerful tools for model evaluation, and each offers different strengths. Bootstrapping is often preferred when we want to estimate variability or when the dataset is small. Cross-Validation is the more reliable choice when we have sufficient data and want a trustworthy estimate of how the model will generalize. Both techniques play a crucial role in building robust machine learning models.
