Chapter 34: Decision Trees - Categorical Attributes

Introduction

Decision trees are powerful tools in machine learning for both classification and regression tasks. They work by recursively partitioning the data into subsets based on attribute values, ultimately leading to a decision or prediction. While decision trees can handle numerical attributes by splitting the data at certain thresholds, they can also handle categorical attributes by splitting based on category values. This chapter focuses on decision trees using categorical attributes.

Mathematics Behind Decision Trees with Categorical Attributes

Entropy and Information Gain

When dealing with categorical attributes, the key metrics used to make decisions about splits in the tree are Entropy and Information Gain.

  1. Entropy measures the uncertainty or impurity in a dataset. For a categorical attribute with k different categories, the entropy is defined as:

    H(D) = -\sum_{i=1}^{k} p_i \log_2 p_i

    where:

    • p_i is the proportion of samples in the dataset belonging to category i.
  2. Information Gain (IG) is used to decide which attribute to split on at each step of building the tree. It measures the reduction in entropy after the dataset is split on an attribute. The information gain for a categorical attribute A is defined as:

    IG(D, A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D_v|}{|D|} H(D_v)

    where:

    • D is the dataset.
    • D_v is the subset of D where attribute A has value v.
    • |D| is the number of samples in the dataset.
    • |D_v| is the number of samples in the subset D_v.

The attribute with the highest information gain is chosen for the split at each node.
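
To make the two formulas concrete, here is a minimal sketch in plain Python (the helper names entropy, information_gain, and best_attribute are illustrative, not from any particular library) that computes entropy and information gain for a list of labelled samples and then picks the attribute to split on:

python:

import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum over categories of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """IG(D, A) = H(D) - sum over values v of A of |D_v|/|D| * H(D_v)."""
    total = len(labels)
    weighted_entropy = 0.0
    for value in set(sample[attribute] for sample in samples):
        subset = [label for sample, label in zip(samples, labels)
                  if sample[attribute] == value]
        weighted_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted_entropy

def best_attribute(samples, labels, attributes):
    """Choose the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(samples, labels, a))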

Example: Building a Decision Tree with Categorical Attributes

Let's consider a simple example. Suppose we have a dataset about customers and their likelihood of purchasing a product based on their gender and marital status.

Gender | Marital Status | Purchase
-------|----------------|---------
Male   | Single         | No
Female | Married        | Yes
Female | Single         | Yes
Male   | Married        | No
Female | Married        | Yes
Male   | Single         | No

Step 1: Calculate the Entropy of the Target Variable

The target variable is "Purchase," with two possible values: Yes and No. First, we calculate the entropy of the entire dataset.

H(D) = -\left(\frac{3}{6} \log_2 \frac{3}{6} + \frac{3}{6} \log_2 \frac{3}{6}\right) = 1

Step 2: Calculate the Information Gain for Each Attribute

  • Gender:

    • For "Male" (3 samples, all "No"), using the convention 0 \log_2 0 = 0:
      • H(D_{\text{Male}}) = -\left(\frac{0}{3} \log_2 \frac{0}{3} + \frac{3}{3} \log_2 \frac{3}{3}\right) = 0
    • For "Female" (3 samples, all "Yes"):
      • H(D_{\text{Female}}) = -\left(\frac{3}{3} \log_2 \frac{3}{3} + \frac{0}{3} \log_2 \frac{0}{3}\right) = 0
    • Information Gain: IG(D, \text{Gender}) = 1 - \left(\frac{3}{6} \times 0 + \frac{3}{6} \times 0\right) = 1
  • Marital Status:

    • For "Single" (1 "Yes", 2 "No"):
      • H(D_{\text{Single}}) = -\left(\frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3}\right) \approx 0.918
    • For "Married" (2 "Yes", 1 "No"):
      • H(D_{\text{Married}}) = -\left(\frac{1}{3} \log_2 \frac{1}{3} + \frac{2}{3} \log_2 \frac{2}{3}\right) \approx 0.918
    • Information Gain: IG(D, \text{Marital Status}) = 1 - \left(\frac{3}{6} \times 0.918 + \frac{3}{6} \times 0.918\right) \approx 0.082

Since the Information Gain is higher for "Gender," we split the dataset based on this attribute.
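
As a sanity check, the hand calculations above can be reproduced with the entropy and information_gain helpers sketched earlier (assuming they have been defined in the same session):

python:

samples = [
    {'Gender': 'Male',   'Marital Status': 'Single'},
    {'Gender': 'Female', 'Marital Status': 'Married'},
    {'Gender': 'Female', 'Marital Status': 'Single'},
    {'Gender': 'Male',   'Marital Status': 'Married'},
    {'Gender': 'Female', 'Marital Status': 'Married'},
    {'Gender': 'Male',   'Marital Status': 'Single'},
]
labels = ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']

print(entropy(labels))                                      # 1.0
print(information_gain(samples, labels, 'Gender'))          # 1.0
print(information_gain(samples, labels, 'Marital Status'))  # ~0.082
print(best_attribute(samples, labels, ['Gender', 'Marital Status']))  # Gender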

Python Implementation

Below is a Python implementation of a decision tree classifier using the scikit-learn library. Because scikit-learn's trees operate on numeric arrays, the categorical attributes are first converted to integer codes with LabelEncoder.

python:

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import matplotlib.pyplot as plt
import pandas as pd

# Sample dataset
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Marital Status': ['Single', 'Married', 'Single', 'Married', 'Married', 'Single'],
    'Purchase': ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Encode categorical variables as integer codes
le_gender = LabelEncoder()
le_marital_status = LabelEncoder()
le_purchase = LabelEncoder()

df['Gender'] = le_gender.fit_transform(df['Gender'])
df['Marital Status'] = le_marital_status.fit_transform(df['Marital Status'])
df['Purchase'] = le_purchase.fit_transform(df['Purchase'])

# Features and target
X = df[['Gender', 'Marital Status']]
y = df['Purchase']

# Build decision tree model using entropy (information gain) as the split criterion
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)

# Visualize the decision tree
tree.plot_tree(clf, feature_names=['Gender', 'Marital Status'],
               class_names=['No', 'Yes'], filled=True)
plt.show()
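
As a quick usage check (this snippet is not part of the original listing), the fitted model can classify a new, hypothetical customer; the same encoders translate the raw categories into the integer codes used during training:

python:

# Classify a new (hypothetical) customer: Female, Single
new_customer = pd.DataFrame({
    'Gender': le_gender.transform(['Female']),
    'Marital Status': le_marital_status.transform(['Single'])
})
print(le_purchase.inverse_transform(clf.predict(new_customer)))  # expected: ['Yes']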

Java Implementation

Here is a Java implementation using the Weka library (J48 is Weka's implementation of the C4.5 algorithm). The attributes are declared as nominal so that the tree splits on category values rather than on numeric thresholds.

java:

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.classifiers.trees.J48;
import weka.core.FastVector;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        // Declare the nominal (categorical) attributes for the dataset
        FastVector genderValues = new FastVector(2);
        genderValues.addElement("Male");
        genderValues.addElement("Female");
        Attribute gender = new Attribute("gender", genderValues);

        FastVector maritalValues = new FastVector(2);
        maritalValues.addElement("Single");
        maritalValues.addElement("Married");
        Attribute maritalStatus = new Attribute("maritalStatus", maritalValues);

        // Declare the class attribute along with its values
        FastVector classValues = new FastVector(2);
        classValues.addElement("No");
        classValues.addElement("Yes");
        Attribute purchase = new Attribute("purchase", classValues);

        // Declare the feature vector
        FastVector attributes = new FastVector(3);
        attributes.addElement(gender);
        attributes.addElement(maritalStatus);
        attributes.addElement(purchase);

        // Create an empty dataset and set the class index
        Instances dataset = new Instances("DecisionTreeExample", attributes, 0);
        dataset.setClassIndex(dataset.numAttributes() - 1);

        // Create instances and add them to the dataset
        double[] values = new double[dataset.numAttributes()];
        values[0] = gender.indexOfValue("Male");
        values[1] = maritalStatus.indexOfValue("Single");
        values[2] = purchase.indexOfValue("No");
        dataset.add(new DenseInstance(1.0, values));

        // Add more instances similarly...

        // Build the decision tree model
        J48 tree = new J48();
        tree.buildClassifier(dataset);

        // Print the decision tree
        System.out.println(tree);
    }
}

Conclusion

Decision trees are intuitive and interpretable models that can handle both numerical and categorical attributes. By calculating entropy and information gain, they effectively split the dataset to make accurate predictions. The Python and Java implementations provided show how these concepts can be applied in practice.

