Machine Learning (Chapter 36): Decision Trees - Missing Values, Imputation & Surrogate Splits

Decision trees are a popular method for both classification and regression tasks. However, they can face challenges when data contains missing values. To handle such issues, various strategies, such as missing value imputation and surrogate splits, are employed. This article covers how decision trees address missing values, along with mathematical formulas and Python/Java code examples.

1. Missing Values in Decision Trees

Missing values in decision trees can occur in both features (predictor variables) and target variables. Some common strategies to handle missing values include:

  • Removing records: This is a simple method but can lead to loss of important data.
  • Imputation: Missing values are replaced using statistical methods.
  • Surrogate Splits: When a feature used for splitting has missing values, alternative features (surrogates) are used.
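As a quick illustration of the first two strategies, here is a small pandas sketch (the column names and values are made up for the example):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (the column names are hypothetical)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "income": [50, 60, np.nan, 80, 55],
})

# Strategy 1: remove records that contain any missing value
dropped = df.dropna()

# Strategy 2: impute each missing value with its column mean
imputed = df.fillna(df.mean())
```

Dropping leaves only 2 of the 5 rows here, which shows how record removal can discard a large share of the data.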

2. Mathematical Formulation

Imputation Methods:
  1. Mean/Median Imputation: For numeric features, missing values can be replaced by the mean or median.

    \hat{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_j \quad \text{(mean)}

    or

    \hat{X}_i = \text{median}(X)

  2. Mode Imputation: For categorical features, the missing value can be imputed by the mode (the most frequent category).

    \hat{X}_i = \text{mode}(X)
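These three imputation rules can be written out directly in NumPy; the arrays below are toy data for illustration:

```python
import numpy as np

# Toy numeric feature with missing entries
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 2.0])
observed = x[~np.isnan(x)]

# Mean and median imputation fill NaNs with a statistic of the observed values
mean_filled = np.where(np.isnan(x), observed.mean(), x)
median_filled = np.where(np.isnan(x), np.median(observed), x)

# Mode imputation for a categorical feature: fill with the most frequent category
cats = np.array(["a", "b", None, "a", "a"], dtype=object)
vals, counts = np.unique([c for c in cats if c is not None], return_counts=True)
mode = vals[counts.argmax()]
cats_filled = np.where([c is None for c in cats], mode, cats)
```

Note that the statistics are computed over the observed (non-missing) values only.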
Surrogate Splits:

Surrogate splits find an alternative feature whose split closely mimics the primary split. Let the original splitting variable be X and the surrogate variable be S. If X has missing values for some samples, a split on S is used to route those samples instead.

In CART, the surrogate split on S is chosen to maximize agreement with the primary split: among the samples where X is observed, the split on S should send as many of them as possible to the same child node as the split on X does. Split quality itself is still measured by an impurity criterion such as the Gini index:

G(S) = \sum_{i=1}^{k} p_i(1 - p_i)

where p_i is the proportion of class i in the node and k is the number of classes.

At prediction time, the surrogate split on S is used whenever the value of X is missing.
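A minimal sketch of how a surrogate threshold could be selected, assuming the CART-style criterion of maximizing agreement with the primary split (the data and thresholds here are synthetic):

```python
import numpy as np

def agreement(primary_left, surrogate_left):
    """Fraction of samples the surrogate routes to the same side as the primary split."""
    return np.mean(primary_left == surrogate_left)

rng = np.random.default_rng(0)
X = rng.normal(size=200)                      # primary split feature
S = X + rng.normal(scale=0.3, size=200)       # correlated candidate surrogate

primary_left = X < 0.0                        # primary split: X < 0 goes left

# Scan candidate thresholds on S, keeping the one that best mimics the primary split
best_thr, best_agree = None, -1.0
for thr in np.quantile(S, np.linspace(0.05, 0.95, 19)):
    a = agreement(primary_left, S < thr)
    if a > best_agree:
        best_thr, best_agree = thr, a

# At prediction time, a sample with X missing would be routed left if S < best_thr
```

Because S is strongly correlated with X here, the best surrogate threshold reproduces the primary routing for the vast majority of samples.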

3. Python Code Example: Decision Trees with Missing Values Handling

Below is an example using scikit-learn to handle missing values in decision trees. We’ll use imputation and explore how decision trees manage missing data.

python:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample dataset with missing values
X = np.array([[1, 2, np.nan],
              [3, np.nan, 1],
              [np.nan, 2, 0],
              [3, 3, 1],
              [2, np.nan, 0]])
y = np.array([0, 1, 0, 1, 0])

# Imputation of missing values (mean strategy for simplicity)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42)

# Decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Prediction and evaluation
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with imputed data:", accuracy)

In this code:

  • We create a dataset with missing values (np.nan).
  • The SimpleImputer replaces missing values using the mean strategy.
  • A decision tree is trained on the imputed data, and the accuracy is measured.
A note on surrogate splits:

scikit-learn's decision trees do not implement surrogate splits; that strategy comes from CART implementations such as R's rpart. Recent scikit-learn releases (1.3+) can instead handle np.nan natively in DecisionTreeClassifier by sending missing values to whichever child yields the better split. For true surrogate behavior in Python, you would need to implement the routing logic yourself.

4. Java Code Example: Decision Trees with Missing Values Handling

Below is an example using Java and the Weka library, which supports decision trees (like J48) and handles missing values.

java:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class DecisionTreeWithMissingValues {
    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("data/your-dataset.arff");
        Instances data = source.getDataSet();

        // Set class attribute first so the filter does not impute the class column
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);

        // Impute missing values
        ReplaceMissingValues replaceMissing = new ReplaceMissingValues();
        replaceMissing.setInputFormat(data);
        Instances newData = Filter.useFilter(data, replaceMissing);

        // Train decision tree (J48 is Weka's implementation of C4.5)
        J48 tree = new J48();
        tree.buildClassifier(newData);
        System.out.println(tree);
    }
}

In this Java example:

  • The Weka ReplaceMissingValues filter imputes each missing value with the mean (numeric attributes) or mode (nominal attributes).
  • The J48 classifier (a decision tree algorithm) builds a model on the imputed data.
  • Note that J48 (C4.5) can also handle missing values natively by distributing instances fractionally across branches, so the imputation step is optional.

5. Conclusion

Handling missing values is crucial for building robust decision tree models. Imputation techniques like mean, median, or mode filling are straightforward, but surrogate splits offer a more refined way of dealing with missing data during the splitting process. The choice of strategy depends on the problem and the data characteristics. The Python and Java examples demonstrate how to apply these techniques in practical decision tree models.

By properly managing missing data, you can enhance the performance and reliability of decision trees in machine learning applications.
