Chapter 34: Decision Trees - Categorical Attributes

Introduction

Decision trees are powerful tools in machine learning for both classification and regression tasks. They work by recursively partitioning the data into subsets based on attribute values, ultimately leading to a decision or prediction. While decision trees can handle numerical attributes by splitting the data at certain thresholds, they can also handle categorical attributes by splitting based on category values. This chapter focuses on decision trees using categorical attributes.

Mathematics Behind Decision Trees with Categorical Attributes

Entropy and Information Gain

When dealing with categorical attributes, the key metrics used to make decisions about splits in the tree are Entropy and Information Gain.

  1. Entropy measures the uncertainty or impurity in a dataset. For a categorical attribute with k different categories, the entropy is defined as:

    H(D) = -\sum_{i=1}^{k} p_i \log_2 p_i

    where:

    • p_i is the proportion of samples in the dataset belonging to category i.
  2. Information Gain (IG) is used to decide which attribute to split on at each step of building the tree. It measures the reduction in entropy after the dataset is split on an attribute. The information gain for a categorical attribute A is defined as:

    IG(D, A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D_v|}{|D|} H(D_v)

    where:

    • D is the dataset.
    • D_v is the subset of D where attribute A has value v.
    • |D| is the number of samples in the dataset.
    • |D_v| is the number of samples in the subset D_v.

The attribute with the highest information gain is chosen for the split at each node.
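
To make the two formulas concrete, here is a minimal sketch in plain Python (the helper names entropy, information_gain, and best_attribute are illustrative, not from any particular library) that computes entropy and information gain for a list of labelled samples and then picks the attribute to split on:

python:

import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum over categories of p_i * log2(p_i)."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """IG(D, A) = H(D) - sum over values v of A of |D_v|/|D| * H(D_v)."""
    total = len(labels)
    weighted_entropy = 0.0
    for value in set(sample[attribute] for sample in samples):
        subset = [label for sample, label in zip(samples, labels)
                  if sample[attribute] == value]
        weighted_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted_entropy

def best_attribute(samples, labels, attributes):
    """Choose the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(samples, labels, a))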

Example: Building a Decision Tree with Categorical Attributes

Let's consider a simple example. Suppose we have a dataset about customers and their likelihood of purchasing a product based on their gender and marital status.

Gender | Marital Status | Purchase
-------|----------------|---------
Male   | Single         | No
Female | Married        | Yes
Female | Single         | Yes
Male   | Married        | No
Female | Married        | Yes
Male   | Single         | No

Step 1: Calculate the Entropy of the Target Variable

The target variable is "Purchase," with two possible values: Yes and No. First, we calculate the entropy of the entire dataset.

H(D) = -\left(\frac{3}{6} \log_2 \frac{3}{6} + \frac{3}{6} \log_2 \frac{3}{6}\right) = 1

Step 2: Calculate the Information Gain for Each Attribute

  • Gender:

    • For "Male" (3 samples, all "No"), using the convention 0 \log_2 0 = 0:
      • H(D_{\text{Male}}) = -\left(\frac{0}{3} \log_2 \frac{0}{3} + \frac{3}{3} \log_2 \frac{3}{3}\right) = 0
    • For "Female" (3 samples, all "Yes"):
      • H(D_{\text{Female}}) = -\left(\frac{3}{3} \log_2 \frac{3}{3} + \frac{0}{3} \log_2 \frac{0}{3}\right) = 0
    • Information Gain: IG(D, \text{Gender}) = 1 - \left(\frac{3}{6} \times 0 + \frac{3}{6} \times 0\right) = 1
  • Marital Status:

    • For "Single" (1 "Yes", 2 "No"):
      • H(D_{\text{Single}}) = -\left(\frac{2}{3} \log_2 \frac{2}{3} + \frac{1}{3} \log_2 \frac{1}{3}\right) \approx 0.918
    • For "Married" (2 "Yes", 1 "No"):
      • H(D_{\text{Married}}) = -\left(\frac{1}{3} \log_2 \frac{1}{3} + \frac{2}{3} \log_2 \frac{2}{3}\right) \approx 0.918
    • Information Gain: IG(D, \text{Marital Status}) = 1 - \left(\frac{3}{6} \times 0.918 + \frac{3}{6} \times 0.918\right) \approx 0.082

Since the Information Gain is higher for "Gender," we split the dataset based on this attribute.
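
As a sanity check, the hand calculations above can be reproduced with the entropy and information_gain helpers sketched earlier (assuming they have been defined in the same session):

python:

samples = [
    {'Gender': 'Male',   'Marital Status': 'Single'},
    {'Gender': 'Female', 'Marital Status': 'Married'},
    {'Gender': 'Female', 'Marital Status': 'Single'},
    {'Gender': 'Male',   'Marital Status': 'Married'},
    {'Gender': 'Female', 'Marital Status': 'Married'},
    {'Gender': 'Male',   'Marital Status': 'Single'},
]
labels = ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']

print(entropy(labels))                                      # 1.0
print(information_gain(samples, labels, 'Gender'))          # 1.0
print(information_gain(samples, labels, 'Marital Status'))  # ~0.082
print(best_attribute(samples, labels, ['Gender', 'Marital Status']))  # Gender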

Python Implementation

Below is a Python implementation of a decision tree classifier using the scikit-learn library. Because scikit-learn's trees operate on numeric arrays, the categorical attributes are first converted to integer codes with LabelEncoder.

python:

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import matplotlib.pyplot as plt
import pandas as pd

# Sample dataset
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Marital Status': ['Single', 'Married', 'Single', 'Married', 'Married', 'Single'],
    'Purchase': ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Encode categorical variables as integer codes
le_gender = LabelEncoder()
le_marital_status = LabelEncoder()
le_purchase = LabelEncoder()

df['Gender'] = le_gender.fit_transform(df['Gender'])
df['Marital Status'] = le_marital_status.fit_transform(df['Marital Status'])
df['Purchase'] = le_purchase.fit_transform(df['Purchase'])

# Features and target
X = df[['Gender', 'Marital Status']]
y = df['Purchase']

# Build decision tree model using entropy (information gain) as the split criterion
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)

# Visualize the decision tree
tree.plot_tree(clf, feature_names=['Gender', 'Marital Status'],
               class_names=['No', 'Yes'], filled=True)
plt.show()
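
As a quick usage check (this snippet is not part of the original listing), the fitted model can classify a new, hypothetical customer; the same encoders translate the raw categories into the integer codes used during training:

python:

# Classify a new (hypothetical) customer: Female, Single
new_customer = pd.DataFrame({
    'Gender': le_gender.transform(['Female']),
    'Marital Status': le_marital_status.transform(['Single'])
})
print(le_purchase.inverse_transform(clf.predict(new_customer)))  # expected: ['Yes']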

Java Implementation

Here is a Java implementation using the Weka library (J48 is Weka's implementation of the C4.5 algorithm). The attributes are declared as nominal so that the tree splits on category values rather than on numeric thresholds.

java:

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.classifiers.trees.J48;
import weka.core.FastVector;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        // Declare the nominal (categorical) attributes for the dataset
        FastVector genderValues = new FastVector(2);
        genderValues.addElement("Male");
        genderValues.addElement("Female");
        Attribute gender = new Attribute("gender", genderValues);

        FastVector maritalValues = new FastVector(2);
        maritalValues.addElement("Single");
        maritalValues.addElement("Married");
        Attribute maritalStatus = new Attribute("maritalStatus", maritalValues);

        // Declare the class attribute along with its values
        FastVector classValues = new FastVector(2);
        classValues.addElement("No");
        classValues.addElement("Yes");
        Attribute purchase = new Attribute("purchase", classValues);

        // Declare the feature vector
        FastVector attributes = new FastVector(3);
        attributes.addElement(gender);
        attributes.addElement(maritalStatus);
        attributes.addElement(purchase);

        // Create an empty dataset and set the class index
        Instances dataset = new Instances("DecisionTreeExample", attributes, 0);
        dataset.setClassIndex(dataset.numAttributes() - 1);

        // Create instances and add them to the dataset
        double[] values = new double[dataset.numAttributes()];
        values[0] = gender.indexOfValue("Male");
        values[1] = maritalStatus.indexOfValue("Single");
        values[2] = purchase.indexOfValue("No");
        dataset.add(new DenseInstance(1.0, values));

        // Add more instances similarly...

        // Build the decision tree model
        J48 tree = new J48();
        tree.buildClassifier(dataset);

        // Print the decision tree
        System.out.println(tree);
    }
}

Conclusion

Decision trees are intuitive and interpretable models that can handle both numerical and categorical attributes. By calculating entropy and information gain, they effectively split the dataset to make accurate predictions. The Python and Java implementations provided show how these concepts can be applied in practice.

