Machine Learning (Chapter 39): Bootstrapping & Cross-Validation

Introduction

In the context of machine learning, evaluating the performance of models is a critical step. Two essential techniques widely used for model evaluation and resampling are Bootstrapping and Cross-Validation. Both estimate a model's performance by making efficient use of the available data, which helps detect overfitting and gauge how well the model generalizes.

1. Bootstrapping

Bootstrapping is a resampling method in which datasets of the same size as the original are drawn with replacement. The goal is to create multiple training datasets, build a model on each, and assess model variability and performance.

Key Concept: When we sample with replacement, some observations may appear multiple times in the bootstrap sample, while others may not appear at all.
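In fact, for a sample of size $n$, each observation has probability $1 - (1 - 1/n)^n \approx 63.2\%$ (for large $n$) of appearing in a given bootstrap sample. A quick numerical sketch of this, using NumPy with an illustrative sample size of $n = 1000$:

python
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # illustrative sample size
trials = 200      # number of bootstrap samples to simulate

fractions = []
for _ in range(trials):
    # Draw n indices with replacement, then count how many distinct
    # original observations made it into the bootstrap sample
    idx = rng.integers(0, n, size=n)
    fractions.append(len(np.unique(idx)) / n)

print(f"Average fraction of unique observations: {np.mean(fractions):.3f}")
# Should be close to 1 - (1 - 1/n)**n ≈ 0.632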

Mathematical Foundation

Given a dataset $D = \{x_1, x_2, \dots, x_n\}$, a bootstrap sample $D_b$ is created by drawing $n$ observations from $D$ with replacement. For each bootstrap sample, the model is trained and evaluated.

The estimated performance (such as accuracy, error) of the model is then averaged over all bootstrap samples.

Let:

  • $D_b^{(i)}$ be the $i$-th bootstrap sample.
  • $f_{\theta^{(i)}}$ be the model trained on $D_b^{(i)}$.
  • $\text{Error}^{(i)}$ be the error on the test set for model $f_{\theta^{(i)}}$.

Then the bootstrap estimate of the error is given by:

$$\hat{E} = \frac{1}{B} \sum_{i=1}^{B} \text{Error}^{(i)}$$

where $B$ is the number of bootstrap samples.

Python Code Example: Bootstrapping

python
import numpy as np
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Number of bootstrap samples
B = 1000
bootstrap_accuracies = []

# Perform bootstrapping
for i in range(B):
    # Create a bootstrap sample (sampling with replacement)
    X_resampled, y_resampled = resample(X_train, y_train, random_state=i)

    # Train model
    model = DecisionTreeClassifier()
    model.fit(X_resampled, y_resampled)

    # Evaluate model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    bootstrap_accuracies.append(accuracy)

# Estimate of the model's performance
bootstrap_mean_accuracy = np.mean(bootstrap_accuracies)
print(f"Bootstrap Estimate of Accuracy: {bootstrap_mean_accuracy:.4f}")

In this example, we create multiple bootstrap samples, train a decision tree on each, and estimate the accuracy by averaging over all bootstrap samples.
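Because bootstrapping produces a whole distribution of accuracy scores rather than a single number, it can also quantify the variability of the model, not just its average performance. A minimal sketch, assuming the bootstrap_accuracies list from the example above is still in scope:

python
import numpy as np

# Spread of the bootstrap accuracy distribution
std_accuracy = np.std(bootstrap_accuracies)

# A simple 95% percentile interval over the bootstrap samples
lower, upper = np.percentile(bootstrap_accuracies, [2.5, 97.5])

print(f"Std of bootstrap accuracies: {std_accuracy:.4f}")
print(f"95% percentile interval: [{lower:.4f}, {upper:.4f}]")

A narrow interval suggests the model's performance is stable across resampled training sets; a wide one signals high sensitivity to the particular training data.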


2. Cross-Validation

Cross-Validation is a model evaluation technique that splits the data into several "folds." The model is trained on some folds and tested on the remaining ones. K-Fold Cross-Validation is a commonly used variation where the data is divided into $K$ equal-sized folds.

Mathematical Foundation

Given a dataset $D = \{x_1, x_2, \dots, x_n\}$, we divide it into $K$ equally sized folds $D_1, D_2, \dots, D_K$.

For each fold $i$, the model is trained on $D \setminus D_i$ (all data except the $i$-th fold) and tested on $D_i$ (the $i$-th fold).

The error estimate for K-fold cross-validation is given by:

$$\hat{E}_{CV} = \frac{1}{K} \sum_{i=1}^{K} \text{Error}(D_i)$$

where $\text{Error}(D_i)$ is the test error on the $i$-th fold.

Cross-Validation helps mitigate overfitting and provides a more reliable estimate of the model’s performance.

Python Code Example: Cross-Validation

python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Define K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_accuracies = []

# Perform Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train model
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Evaluate model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    cv_accuracies.append(accuracy)

# Estimate of the model's performance
cv_mean_accuracy = np.mean(cv_accuracies)
print(f"Cross-Validation Estimate of Accuracy: {cv_mean_accuracy:.4f}")

In this example, we split the data into 5 folds and calculate the accuracy for each fold. The mean accuracy across all folds gives us a better estimate of the model’s performance.
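The same evaluation can be written more compactly with scikit-learn's cross_val_score helper, which runs the splitting, training, and scoring loop internally. A minimal sketch, assuming the same Iris data and decision tree classifier as above:

python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Same 5-fold setup as the manual loop above
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)

print(f"Cross-Validation Estimate of Accuracy: {np.mean(scores):.4f}")

Passing the same KFold object keeps the splits identical to the manual loop, so the two estimates agree; the manual version remains useful when you need per-fold control.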


Bootstrapping vs. Cross-Validation

| Aspect | Bootstrapping | Cross-Validation |
|---|---|---|
| Resampling Method | Sampling with replacement from the original dataset | Splitting the dataset into $K$ subsets without replacement |
| Data Utilization | Some samples may appear multiple times in a training set, while others are left out | Every sample is used exactly once for testing |
| Bias | May have higher bias due to sampling with replacement | Tends to have lower bias, but may have slightly higher variance |
| Variance | The spread of the bootstrap estimates can be used to measure model variability | The averaged estimate across folds provides a reliable measure of model performance |

Java Code Example: K-Fold Cross-Validation

java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import java.util.Random;

public class KFoldCrossValidation {
    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("iris.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Define classifier
        J48 tree = new J48();

        // 5-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 5, new Random(42));

        // Output accuracy
        System.out.println("Cross-Validation Accuracy: " + (1 - eval.errorRate()));
    }
}

In this Java example, we use Weka's J48 decision tree and perform 5-fold cross-validation on the Iris dataset.

Conclusion

Bootstrapping and Cross-Validation are both powerful tools for model evaluation, and each offers different strengths. Bootstrapping is often preferred when we want to estimate variability or when the dataset is small. Cross-Validation is the more reliable choice when we have sufficient data and want a trustworthy estimate of how the model will generalize. Both techniques play a crucial role in building robust machine learning models.
