Machine Learning (Chapter 36): Decision Trees - Missing Values, Imputation & Surrogate Splits
Decision trees are a popular method for both classification and regression tasks. However, they can face challenges when data contains missing values. To handle such issues, various strategies, such as missing value imputation and surrogate splits, are employed. This article covers how decision trees address missing values, along with mathematical formulas and Python/Java code examples.
1. Missing Values in Decision Trees
Missing values in decision trees can occur in both features (predictor variables) and target variables. Some common strategies to handle missing values include:
- Removing records: This is a simple method but can lead to loss of important data.
- Imputation: Missing values are replaced using statistical methods.
- Surrogate Splits: When a feature used for splitting has missing values, alternative features (surrogates) are used.
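The first two strategies can be illustrated in a few lines of NumPy; the small array below is invented for the example:

```python
import numpy as np

# Toy feature matrix with missing entries (illustrative only)
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 1.0],
              [2.0, 4.0, 0.0]])

# Strategy 1: removing records -- drop any row containing a NaN.
# Here only one of three rows survives, showing how much data can be lost.
complete_rows = X[~np.isnan(X).any(axis=1)]
print(complete_rows.shape)  # (1, 3)

# Strategy 2: imputation -- fill each NaN with its column mean
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled)
```

Even on this tiny example, listwise deletion discards two thirds of the rows, while imputation keeps every record at the cost of introducing estimated values.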
2. Mathematical Formulation
Imputation Methods:
Mean/Median Imputation: For a numeric feature x with observed values x_1, …, x_n, a missing entry is replaced by the mean

x̂ = (1/n) Σ_i x_i

or by the median

x̂ = median(x_1, …, x_n).

Mode Imputation: For categorical features, the missing value is imputed with the mode (the most frequent category) of the observed values.
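These three rules map directly onto NumPy and the standard library; the feature values below are made up for illustration:

```python
import numpy as np
from collections import Counter

# Numeric feature with one missing entry (toy values)
age = np.array([22.0, 35.0, np.nan, 28.0, 90.0])

# Mean vs. median imputation of the missing entry
mean_filled = np.where(np.isnan(age), np.nanmean(age), age)
median_filled = np.where(np.isnan(age), np.nanmedian(age), age)
print(mean_filled[2], median_filled[2])  # 43.75 31.5

# Categorical feature: impute with the mode (most frequent category)
color = ["red", "blue", None, "red", "red"]
mode = Counter(v for v in color if v is not None).most_common(1)[0][0]
color_filled = [mode if v is None else v for v in color]
print(color_filled)  # ['red', 'blue', 'red', 'red', 'red']
```

Note how the outlier (90.0) pulls the mean well above the median; the median is the more robust choice for skewed numeric features.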
Surrogate Splits:
Surrogate splits find an alternative feature whose split closely mimics the one on the primary feature. Let the primary splitting variable be X_p and the surrogate variable be X_s. When X_p is missing for a sample, the split decision is made on X_s instead.
Given that X_p is the primary split feature, a good surrogate X_s should reproduce the primary partition as closely as possible while keeping the impurity (Gini index or entropy) of the resulting child nodes low:

Gini = 1 − Σ_k p_k²        Entropy = −Σ_k p_k log₂ p_k

where p_k is the probability of class k in a node. In CART, candidate surrogates are ranked by their agreement with the primary split on the samples where both features are observed.
The surrogate split on X_s is used whenever the value of X_p is missing.
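The selection of a surrogate can be sketched as follows. This is a simplified illustration of the CART idea, not a full implementation: the features, thresholds, and the plain agreement measure are all assumptions made for the example.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def surrogate_agreement(primary_goes_left, candidate, threshold):
    """Fraction of samples that the candidate split sends to the
    same side as the primary split."""
    candidate_goes_left = candidate <= threshold
    return np.mean(primary_goes_left == candidate_goes_left)

# Toy data: the primary split is on feature x0 at x0 <= 2.5
x0 = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 3.5])
x1 = np.array([10., 11., 20., 22., 9., 21.])  # closely tracks x0
x2 = np.array([5., 1., 4., 2., 3., 6.])       # unrelated to x0
primary_left = x0 <= 2.5

# Rank candidate surrogates by agreement with the primary split
for name, feat, thr in [("x1", x1, 15.0), ("x2", x2, 3.5)]:
    print(name, surrogate_agreement(primary_left, feat, thr))
```

Here x1 agrees with the primary split on every sample, so it would be chosen as the first surrogate; x2 agrees on only 4 of 6 samples and would rank below it.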
3. Python Code Example: Decision Trees with Missing Values Handling
Below is an example using scikit-learn to handle missing values in decision trees. We’ll use imputation and explore how decision trees manage missing data.
python:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Sample dataset with missing values
X = np.array([[1, 2, np.nan],
              [3, np.nan, 1],
              [np.nan, 2, 0],
              [3, 3, 1],
              [2, np.nan, 0]])
y = np.array([0, 1, 0, 1, 0])
# Imputation of missing values (Mean strategy for simplicity)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
# Decision Tree Classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Prediction and Evaluation
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with imputed data:", accuracy)
In this code:
- We create a dataset with missing values (np.nan).
- The SimpleImputer replaces missing values using the mean strategy.
- A decision tree is trained on the imputed data, and the accuracy is measured.
Imputation with Surrogate Splits:
scikit-learn's DecisionTreeClassifier does not implement surrogate splits; CART-style surrogates are found in other implementations, such as R's rpart. As of scikit-learn 1.3, the tree estimators can instead accept np.nan directly and route missing values at each split to whichever child yields the better split, so explicit imputation as above is one option rather than a requirement.
4. Java Code Example: Decision Trees with Missing Values Handling
Below is an example using Java and the Weka library, which supports decision trees (like J48) and handles missing values.
java:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
public class DecisionTreeWithMissingValues {
    public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("data/your-dataset.arff");
        Instances data = source.getDataSet();

        // Set the class attribute before filtering so it is not imputed
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);

        // Impute missing values (means for numeric, modes for nominal attributes)
        ReplaceMissingValues replaceMissing = new ReplaceMissingValues();
        replaceMissing.setInputFormat(data);
        Instances newData = Filter.useFilter(data, replaceMissing);

        // Train decision tree (J48 is Weka's implementation of C4.5)
        J48 tree = new J48();
        tree.buildClassifier(newData);
        System.out.println(tree);
    }
}
In this Java example:
- The Weka ReplaceMissingValues filter imputes missing values (means for numeric attributes, modes for nominal ones).
- The J48 classifier (Weka's C4.5 decision tree) builds a model on the imputed data.

Note that J48 can also handle missing values natively in the C4.5 style, distributing an instance fractionally across branches at a split, so the explicit imputation step is optional.
5. Conclusion
Handling missing values is crucial for building robust decision tree models. Imputation techniques like mean, median, or mode filling are straightforward, but surrogate splits offer a more refined way of dealing with missing data during the splitting process. The choice of strategy depends on the problem and the data characteristics. The Python and Java examples demonstrate how to apply these techniques in practical decision tree models.
By properly managing missing data, you can enhance the performance and reliability of decision trees in machine learning applications.