Machine Learning (Chapter 36): Decision Trees - Missing Values, Imputation & Surrogate Splits

Decision trees are a popular method for both classification and regression tasks. However, they can face challenges when data contains missing values. To handle such issues, various strategies, such as missing value imputation and surrogate splits, are employed. This article covers how decision trees address missing values, along with mathematical formulas and Python/Java code examples.

1. Missing Values in Decision Trees

Missing values in decision trees can occur in both features (predictor variables) and target variables. Some common strategies to handle missing values include:

  • Removing records: This is a simple method but can lead to loss of important data.
  • Imputation: Missing values are replaced using statistical methods.
  • Surrogate Splits: When a feature used for splitting has missing values, alternative features (surrogates) are used.
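As a quick illustration of the first two strategies, here is a small pandas sketch (the column names and values are made up for the example):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing values (the column names are hypothetical)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "income": [50, 60, np.nan, 80, 55],
})

# Strategy 1: remove records that contain any missing value
dropped = df.dropna()

# Strategy 2: impute each missing value with its column mean
imputed = df.fillna(df.mean())
```

Dropping leaves only 2 of the 5 rows here, which shows how record removal can discard a large share of the data.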

2. Mathematical Formulation

Imputation Methods:
  1. Mean/Median Imputation: For numeric features, missing values can be replaced by the mean or median.

    \hat{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_j \quad \text{(mean)}

    or

    \hat{X}_i = \text{median}(X)

  2. Mode Imputation: For categorical features, the missing value can be imputed by the mode (the most frequent category).

    \hat{X}_i = \text{mode}(X)
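These three imputation rules can be written out directly in NumPy; the arrays below are toy data for illustration:

```python
import numpy as np

# Toy numeric feature with missing entries
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 2.0])
observed = x[~np.isnan(x)]

# Mean and median imputation fill NaNs with a statistic of the observed values
mean_filled = np.where(np.isnan(x), observed.mean(), x)
median_filled = np.where(np.isnan(x), np.median(observed), x)

# Mode imputation for a categorical feature: fill with the most frequent category
cats = np.array(["a", "b", None, "a", "a"], dtype=object)
vals, counts = np.unique([c for c in cats if c is not None], return_counts=True)
mode = vals[counts.argmax()]
cats_filled = np.where([c is None for c in cats], mode, cats)
```

Note that the statistics are computed over the observed (non-missing) values only.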
Surrogate Splits:

Surrogate splits find an alternative feature whose split closely mimics the primary split. Let the original splitting variable be X and the surrogate variable be S. If X has missing values for some samples, a split on S is used to route those samples instead.

In CART, the surrogate split on S is chosen to maximize agreement with the primary split: among the samples where X is observed, the split on S should send as many of them as possible to the same child node as the split on X does. Split quality itself is still measured by an impurity criterion such as the Gini index:

G(S) = \sum_{i=1}^{k} p_i(1 - p_i)

where p_i is the proportion of class i in the node and k is the number of classes.

At prediction time, the surrogate split on S is used whenever the value of X is missing.
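A minimal sketch of how a surrogate threshold could be selected, assuming the CART-style criterion of maximizing agreement with the primary split (the data and thresholds here are synthetic):

```python
import numpy as np

def agreement(primary_left, surrogate_left):
    """Fraction of samples the surrogate routes to the same side as the primary split."""
    return np.mean(primary_left == surrogate_left)

rng = np.random.default_rng(0)
X = rng.normal(size=200)                      # primary split feature
S = X + rng.normal(scale=0.3, size=200)       # correlated candidate surrogate

primary_left = X < 0.0                        # primary split: X < 0 goes left

# Scan candidate thresholds on S, keeping the one that best mimics the primary split
best_thr, best_agree = None, -1.0
for thr in np.quantile(S, np.linspace(0.05, 0.95, 19)):
    a = agreement(primary_left, S < thr)
    if a > best_agree:
        best_thr, best_agree = thr, a

# At prediction time, a sample with X missing would be routed left if S < best_thr
```

Because S is strongly correlated with X here, the best surrogate threshold reproduces the primary routing for the vast majority of samples.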

3. Python Code Example: Decision Trees with Missing Values Handling

Below is an example using scikit-learn to handle missing values in decision trees. We’ll use imputation and explore how decision trees manage missing data.

python:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample dataset with missing values
X = np.array([[1, 2, np.nan],
              [3, np.nan, 1],
              [np.nan, 2, 0],
              [3, 3, 1],
              [2, np.nan, 0]])
y = np.array([0, 1, 0, 1, 0])

# Imputation of missing values (mean strategy for simplicity)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.2, random_state=42)

# Decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Prediction and evaluation
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with imputed data:", accuracy)

In this code:

  • We create a dataset with missing values (np.nan).
  • The SimpleImputer replaces missing values using the mean strategy.
  • A decision tree is trained on the imputed data, and the accuracy is measured.
A note on surrogate splits:

scikit-learn's decision trees do not implement surrogate splits; that strategy comes from CART implementations such as R's rpart. Recent scikit-learn releases (1.3+) can instead handle np.nan natively in DecisionTreeClassifier by sending missing values to whichever child yields the better split. For true surrogate behavior in Python, you would need to implement the routing logic yourself.

4. Java Code Example: Decision Trees with Missing Values Handling

Below is an example using Java and the Weka library, which supports decision trees (like J48) and handles missing values.

java:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class DecisionTreeWithMissingValues {
    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("data/your-dataset.arff");
        Instances data = source.getDataSet();

        // Set class attribute first so the filter does not impute the class column
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);

        // Impute missing values
        ReplaceMissingValues replaceMissing = new ReplaceMissingValues();
        replaceMissing.setInputFormat(data);
        Instances newData = Filter.useFilter(data, replaceMissing);

        // Train decision tree (J48 is Weka's implementation of C4.5)
        J48 tree = new J48();
        tree.buildClassifier(newData);
        System.out.println(tree);
    }
}

In this Java example:

  • The Weka ReplaceMissingValues filter imputes each missing value with the mean (numeric attributes) or mode (nominal attributes).
  • The J48 classifier (a decision tree algorithm) builds a model on the imputed data.
  • Note that J48 (C4.5) can also handle missing values natively by distributing instances fractionally across branches, so the imputation step is optional.

5. Conclusion

Handling missing values is crucial for building robust decision tree models. Imputation techniques like mean, median, or mode filling are straightforward, but surrogate splits offer a more refined way of dealing with missing data during the splitting process. The choice of strategy depends on the problem and the data characteristics. The Python and Java examples demonstrate how to apply these techniques in practical decision tree models.

By properly managing missing data, you can enhance the performance and reliability of decision trees in machine learning applications.
