Machine Learning (Chapter 35): Decision Trees - Multiway Splits

Introduction

Decision trees are a powerful and intuitive method for both classification and regression tasks in machine learning. They work by recursively splitting the data into subsets based on the values of input features until a prediction can be made at a leaf node. A key decision when building a decision tree is how to split the data at each node. Binary (two-way) splits are the most common choice, but multiway splits, where a node is split into more than two branches, can be beneficial, particularly for categorical features with several levels.

Multiway Splits

A multiway split occurs when a decision node divides the dataset into more than two subsets based on the possible values of a feature. For example, a feature Color with values {Red, Green, Blue} would produce three child nodes in a single split. This approach is especially useful for categorical features with multiple levels, where splitting on all categories at once can lead to a shallower and more interpretable tree.

Mathematics Behind Multiway Splits

The goal of a multiway split is to maximize the separation of classes, or equivalently to reduce the impurity within each subset. The most common criteria for evaluating the quality of a split are Gini impurity and entropy (used to compute information gain). Let's consider how these metrics are applied in a multiway split; a short numeric sketch follows the definitions below.

  1. Gini Impurity:

    The Gini impurity for a single node is calculated as:

    Gini(t) = 1 - \sum_{i=1}^{n} p_i^2

    where p_i is the proportion of instances belonging to class i in node t, and n is the number of classes.

    For a multiway split, the weighted Gini impurity is computed as:

    Gini_{\text{split}} = \sum_{j=1}^{k} \frac{N_j}{N} \times Gini(t_j)

    where N_j is the number of instances in subset j, N is the total number of instances before the split, k is the number of branches produced by the split, and Gini(t_j) is the Gini impurity of subset j.

  2. Entropy and Information Gain:

    The entropy for a single node is calculated as:

    Entropy(t) = -\sum_{i=1}^{n} p_i \log_2(p_i)

    For a multiway split, the weighted entropy is computed as:

    Entropy_{\text{split}} = \sum_{j=1}^{k} \frac{N_j}{N} \times Entropy(t_j)

    The information gain for a split is then:

    IG = Entropy_{\text{parent}} - Entropy_{\text{split}}
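
Before moving on to the library examples, here is a small self-contained sketch of these formulas in Python. It is only an illustration: the parent labels and the three-way branch assignment are invented, and the helper functions use plain NumPy rather than any particular tree library.

python:

import numpy as np

def gini(labels):
    # Gini impurity of a single node: 1 - sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy of a single node: -sum of p_i * log2(p_i).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(parent, children, impurity):
    # Weighted impurity of a k-way split: sum_j (N_j / N) * impurity(child_j).
    n = len(parent)
    return sum(len(child) / n * impurity(child) for child in children)

# Hypothetical parent node (5 'Yes', 4 'No') split three ways on a
# made-up categorical feature with values A, B and C.
parent = ['Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No']
children = [
    ['Yes', 'Yes', 'Yes'],  # branch A
    ['Yes', 'No', 'No'],    # branch B
    ['Yes', 'No', 'No'],    # branch C
]

print(f"Parent Gini:       {gini(parent):.3f}")
print(f"Split Gini:        {weighted_impurity(parent, children, gini):.3f}")
print(f"Information gain:  {entropy(parent) - weighted_impurity(parent, children, entropy):.3f}")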

Example in Python

Let's consider an example using Python and the popular scikit-learn library. Note that scikit-learn's DecisionTreeClassifier implements the CART algorithm and therefore only produces binary splits; the example below demonstrates the entropy criterion on an encoded categorical feature, while the Weka example in the next section shows true multiway splits.

python:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Sample dataset
data = {
    'Feature1': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'B'],
    'Feature2': [1, 2, 1, 1, 2, 2, 1, 2, 1],
    'Label': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Convert the categorical feature to numeric codes
df['Feature1'] = df['Feature1'].astype('category').cat.codes

X = df[['Feature1', 'Feature2']]
y = df['Label'].astype('category').cat.codes

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision tree classifier using the entropy criterion (information gain)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)

# Prediction
y_pred = clf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Plot the tree
plt.figure(figsize=(10, 6))
plot_tree(clf, feature_names=['Feature1', 'Feature2'], class_names=['No', 'Yes'], filled=True)
plt.show()

In this example:

  • We create a simple dataset with categorical and numeric features.
  • The DecisionTreeClassifier from scikit-learn is used to train a decision tree.
  • scikit-learn's CART implementation only performs binary splits, so the categorical feature is first converted to numeric codes; the entropy criterion is used to choose the best splits (a one-hot encoding alternative is sketched below).
  • Finally, the accuracy of the model is evaluated, and the tree is visualized.
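
Since scikit-learn's trees are strictly binary, one common way to get behaviour closer to a per-category (multiway) partition is to one-hot encode the categorical feature so that every category gets its own indicator column. A minimal sketch, reusing the data dictionary defined in the example above:

python:

import pandas as pd

# Rebuild the raw DataFrame (before the cat.codes conversion) and one-hot
# encode Feature1 so that each category ('A', 'B', 'C') gets its own 0/1
# indicator column; successive binary splits on these columns can then
# mimic a per-category partition of the data.
df_raw = pd.DataFrame(data)
X_encoded = pd.get_dummies(df_raw[['Feature1', 'Feature2']], columns=['Feature1'])
print(X_encoded.head())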

Example in Java

Here’s how you can implement a similar decision tree in Java using the Weka library.

java:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class DecisionTreeMultiwaySplit {
    public static void main(String[] args) throws Exception {
        // Load dataset from an ARFF file
        DataSource source = new DataSource("data.arff");
        Instances dataset = source.getDataSet();

        // Set class index to the last attribute
        if (dataset.classIndex() == -1) {
            dataset.setClassIndex(dataset.numAttributes() - 1);
        }

        // Build the J48 (C4.5) decision tree classifier
        J48 tree = new J48();
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2")); // pruning confidence and min instances per leaf
        tree.buildClassifier(dataset);

        // Print the decision tree
        System.out.println(tree);
    }
}

In this example:

  • We load a dataset from an ARFF file using Weka’s DataSource.
  • A decision tree classifier (J48, which is an implementation of the C4.5 algorithm) is built.
  • J48 performs multiway splits on nominal (categorical) attributes by default; the -C and -M options set the pruning confidence factor and the minimum number of instances per leaf, while the -B option would restrict the tree to binary splits instead.
  • The resulting decision tree is printed out.

Conclusion

Multiway splits in decision trees allow for more granular decision-making, especially when dealing with categorical features that have multiple levels. The mathematical foundation relies on reducing impurity or maximizing information gain across multiple branches. Weka's J48 (C4.5) supports true multiway splits on nominal attributes, while scikit-learn's binary trees can approximate them through successive splits; both libraries make it straightforward to implement and visualize decision trees in real-world applications.
