Chapter 34: Decision Trees - Categorical Attributes
Introduction
Decision trees are powerful tools in machine learning for both classification and regression tasks. They work by recursively partitioning the data into subsets based on attribute values, ultimately leading to a decision or prediction. While decision trees can handle numerical attributes by splitting the data at certain thresholds, they can also handle categorical attributes by splitting based on category values. This chapter focuses on decision trees using categorical attributes.
Mathematics Behind Decision Trees with Categorical Attributes
Entropy and Information Gain
When dealing with categorical attributes, the key metrics used to make decisions about splits in the tree are Entropy and Information Gain.
Entropy measures the uncertainty or impurity in a dataset. For a categorical attribute with $k$ different categories, the entropy of a dataset $S$ is defined as:

$$H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

where:
- $p_i$ is the proportion of samples in the dataset belonging to category $i$.
Information Gain (IG) is used to decide which attribute to split on at each step of building the tree. It measures the reduction in entropy after the dataset is split on an attribute. The information gain for a categorical attribute $A$ is defined as:

$$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

where:
- $S$ is the dataset.
- $S_v$ is the subset of $S$ where attribute $A$ has value $v$.
- $|S|$ is the number of samples in the dataset.
- $|S_v|$ is the number of samples in the subset $S_v$.
The attribute with the highest information gain is chosen for the split at each node.
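To make these formulas concrete, here is a minimal from-scratch sketch in Python. The helper names `entropy` and `information_gain` are illustrative, not from any library.

python:
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy of each subset
    induced by the categorical attribute."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(subset) / n * entropy(subset) for subset in subsets.values())
    return entropy(labels) - weighted

print(entropy(['Yes', 'No', 'Yes', 'No']))                                 # 1.0
print(information_gain(['A', 'A', 'B', 'B'], ['Yes', 'Yes', 'No', 'No']))  # 1.0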
Example: Building a Decision Tree with Categorical Attributes
Let's consider a simple example. Suppose we have a dataset about customers and their likelihood of purchasing a product based on their gender and marital status.
| Gender | Marital Status | Purchase |
|---|---|---|
| Male | Single | No |
| Female | Married | Yes |
| Female | Single | Yes |
| Male | Married | No |
| Female | Married | Yes |
| Male | Single | No |
Step 1: Calculate the Entropy of the Target Variable
The target variable is "Purchase," with two possible values: Yes and No. The dataset contains 3 "Yes" and 3 "No" samples, so the entropy of the entire dataset is:

$$H(S) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$$
Step 2: Calculate the Information Gain for Each Attribute
Gender:
- For "Male" (3 samples, all "No"): $H(S_{\text{Male}}) = 0$
- For "Female" (3 samples, all "Yes"): $H(S_{\text{Female}}) = 0$
- Information Gain: $IG(S, \text{Gender}) = 1 - \frac{3}{6}(0) - \frac{3}{6}(0) = 1$

Marital Status:
- For "Single" (3 samples: 1 "Yes", 2 "No"): $H(S_{\text{Single}}) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.918$
- For "Married" (3 samples: 2 "Yes", 1 "No"): $H(S_{\text{Married}}) \approx 0.918$
- Information Gain: $IG(S, \text{Marital Status}) = 1 - \frac{3}{6}(0.918) - \frac{3}{6}(0.918) \approx 0.082$
Since the Information Gain is higher for "Gender," we split the dataset based on this attribute.
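As a sanity check, the same numbers can be reproduced with a short pandas script. This is only a sketch; the `information_gain` helper below mirrors the formula from the previous section, and the column names match the table above.

python:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Marital Status': ['Single', 'Married', 'Single', 'Married', 'Married', 'Single'],
    'Purchase': ['No', 'Yes', 'Yes', 'No', 'Yes', 'No'],
})

def entropy(series):
    # proportions of each class, then -sum(p * log2(p))
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(data, attribute, target='Purchase'):
    weights = data[attribute].value_counts(normalize=True)
    conditional = sum(w * entropy(data.loc[data[attribute] == v, target])
                      for v, w in weights.items())
    return entropy(data[target]) - conditional

print(information_gain(df, 'Gender'))          # 1.0
print(information_gain(df, 'Marital Status'))  # ~0.082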
Python Implementation
Below is a Python implementation of a decision tree classifier for the categorical dataset above, using the scikit-learn library.
python:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import matplotlib.pyplot as plt
import pandas as pd
# Sample dataset
data = {
'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
'Marital Status': ['Single', 'Married', 'Single', 'Married', 'Married', 'Single'],
'Purchase': ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Encode categorical variables
le_gender = LabelEncoder()
le_marital_status = LabelEncoder()
le_purchase = LabelEncoder()
df['Gender'] = le_gender.fit_transform(df['Gender'])
df['Marital Status'] = le_marital_status.fit_transform(df['Marital Status'])
df['Purchase'] = le_purchase.fit_transform(df['Purchase'])
# Features and target
X = df[['Gender', 'Marital Status']]
y = df['Purchase']
# Build decision tree model
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)
# Visualize the decision tree
tree.plot_tree(clf, feature_names=['Gender', 'Marital Status'], class_names=['No', 'Yes'], filled=True)
plt.show()
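One caveat: label encoding maps each category to an integer, and scikit-learn's trees then treat those integers as ordered numbers. With binary attributes such as Gender and Marital Status this makes no difference, but for unordered attributes with more than two categories a one-hot encoding is usually safer. A minimal sketch of that alternative, reusing the `data` dictionary defined above:

python:
# One-hot encode the features instead of label-encoding them
# (sketch; assumes the original string-valued `data` dict from above)
raw = pd.DataFrame(data)
X_onehot = pd.get_dummies(raw[['Gender', 'Marital Status']])
y_raw = raw['Purchase']  # scikit-learn accepts string class labels directly

clf_onehot = DecisionTreeClassifier(criterion='entropy')
clf_onehot.fit(X_onehot, y_raw)
print(X_onehot.columns.tolist())  # e.g. ['Gender_Female', 'Gender_Male', ...]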
Java Implementation
Here is an equivalent Java implementation using the Weka library, with the features declared as nominal (categorical) attributes.
java:
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.classifiers.trees.J48;

import java.util.ArrayList;

public class DecisionTreeExample {
    public static void main(String[] args) throws Exception {
        // Declare the categorical (nominal) feature attributes with their possible values
        ArrayList<String> genderValues = new ArrayList<>();
        genderValues.add("Male");
        genderValues.add("Female");
        Attribute gender = new Attribute("gender", genderValues);

        ArrayList<String> maritalValues = new ArrayList<>();
        maritalValues.add("Single");
        maritalValues.add("Married");
        Attribute maritalStatus = new Attribute("maritalStatus", maritalValues);

        // Declare the class attribute along with its values
        ArrayList<String> classValues = new ArrayList<>();
        classValues.add("No");
        classValues.add("Yes");
        Attribute purchase = new Attribute("purchase", classValues);

        // Declare the feature vector
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(gender);
        attributes.add(maritalStatus);
        attributes.add(purchase);

        // Create an empty dataset and mark the last attribute as the class
        Instances dataset = new Instances("DecisionTreeExample", attributes, 0);
        dataset.setClassIndex(dataset.numAttributes() - 1);

        // Create instances and add them to the dataset
        // (nominal values are stored as the index of the value in the attribute's list)
        double[] values = new double[dataset.numAttributes()];
        values[0] = gender.indexOfValue("Male");
        values[1] = maritalStatus.indexOfValue("Single");
        values[2] = purchase.indexOfValue("No");
        dataset.add(new DenseInstance(1.0, values));
        // Add more instances similarly...

        // Build the decision tree model (J48 is Weka's C4.5 implementation)
        J48 tree = new J48();
        tree.buildClassifier(dataset);

        // Print the decision tree
        System.out.println(tree);
    }
}
Conclusion
Decision trees are intuitive and interpretable models that can handle both numerical and categorical attributes. By calculating entropy and information gain, they effectively split the dataset to make accurate predictions. The Python and Java implementations provided show how these concepts can be applied in practice.
