Machine Learning (Chapter 20): SVMs for Linearly Non-Separable Data

Support Vector Machines (SVMs) are a powerful class of supervised learning models used for classification and regression tasks. While SVMs are often associated with linearly separable data, real-world data is frequently non-linearly separable. In such cases, we need to extend the basic SVM framework to handle these complexities. This chapter will explore how SVMs can be adapted for linearly non-separable data using techniques like the kernel trick and the soft margin approach.

1. Introduction to SVMs

In the simplest case of linear separability, SVMs aim to find a hyperplane that best separates data into different classes. For linearly separable data, the SVM solves the following optimization problem:

\text{Minimize} \quad \frac{1}{2} \| \mathbf{w} \|^2

Subject to:

y_i (\mathbf{w} \cdot \mathbf{x_i} + b) \geq 1 \quad \text{for all } i

where \mathbf{w} is the weight vector, b is the bias, \mathbf{x_i} are the input features, and y_i are the class labels.
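Although this chapter does not give code for the separable case, the hard-margin problem above can be approximated numerically: a linear SVC with a very large C behaves essentially like a hard-margin classifier, because margin violations become prohibitively expensive. A minimal sketch on made-up toy data (the points and the value C=1e6 are illustrative assumptions, not from the text):

python:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical toy data)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin solution
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)

The margin width 2/\|\mathbf{w}\| printed at the end is exactly the quantity the objective above maximizes.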

2. Handling Non-Separable Data

When the data is not linearly separable, the optimization problem must be modified to tolerate some misclassifications. This is done using a soft margin approach, which allows some points to lie inside the margin or even on the wrong side of the decision boundary.

2.1 Soft Margin SVM

The soft margin SVM introduces a penalty for misclassification using slack variables \xi_i. The modified optimization problem becomes:

\text{Minimize} \quad \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^N \xi_i

Subject to:

y_i (\mathbf{w} \cdot \mathbf{x_i} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \text{for all } i

Here, C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error.
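To make the role of C concrete, the short sketch below (an illustrative addition, not part of the original walkthrough) fits a linear soft-margin SVM on overlapping synthetic clusters for a few values of C. Smaller C tolerates more slack and typically keeps more support vectors; larger C penalizes violations more heavily.

python:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so the data is not linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:<6} support vectors={clf.n_support_.sum():3d} "
          f"train accuracy={clf.score(X, y):.3f}")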

2.2 Kernel Trick

To handle more complex, non-linearly separable data, we use the kernel trick. This technique involves transforming the data into a higher-dimensional space where it is more likely to be linearly separable. The kernel function computes the dot product in this higher-dimensional space without explicitly performing the transformation.

Common kernels include:

  • Linear Kernel: K(\mathbf{x_i}, \mathbf{x_j}) = \mathbf{x_i} \cdot \mathbf{x_j}
  • Polynomial Kernel: K(\mathbf{x_i}, \mathbf{x_j}) = (\mathbf{x_i} \cdot \mathbf{x_j} + c)^d
  • Radial Basis Function (RBF) Kernel: K(\mathbf{x_i}, \mathbf{x_j}) = \exp\left(-\frac{\| \mathbf{x_i} - \mathbf{x_j} \|^2}{2\sigma^2}\right)
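Since a kernel is just a function of pairs of points, it can be checked directly. The sketch below (illustrative, with arbitrary toy data and σ = 1.5) computes the RBF kernel matrix from the list above by hand and compares it with scikit-learn's rbf_kernel; note that scikit-learn parameterizes the kernel with gamma = 1/(2σ²) rather than σ.

python:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 points in 3 dimensions (toy data)
sigma = 1.5
gamma = 1.0 / (2.0 * sigma**2)

# Explicit pairwise computation: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-sq_dists / (2.0 * sigma**2))

K_sklearn = rbf_kernel(X, X, gamma=gamma)
print("max abs difference:", np.abs(K_manual - K_sklearn).max())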

The dual form of the SVM optimization problem, which uses the kernel trick, is:

\text{Maximize} \quad \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(\mathbf{x_i}, \mathbf{x_j})

Subject to:

0 \leq \alpha_i \leq C, \quad \sum_{i=1}^N \alpha_i y_i = 0

where \alpha_i are the Lagrange multipliers. Once the \alpha_i are found, new points are classified with the decision function \operatorname{sign}\left( \sum_{i=1}^N \alpha_i y_i K(\mathbf{x_i}, \mathbf{x}) + b \right), which depends on the data only through kernel evaluations.
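Scikit-learn exposes the ingredients of this dual solution after fitting, so the kernel-based decision function can be rebuilt by hand as a sanity check. In the sketch below (dataset and hyperparameters are arbitrary choices for illustration), dual_coef_ holds the products \alpha_i y_i for the support vectors and intercept_ holds b.

python:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

# Kernel between the support vectors and all training points
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)

# Decision values reconstructed from the dual variables:
# sum_i alpha_i y_i K(x_i, x) + b
manual = clf.dual_coef_ @ K + clf.intercept_      # shape (1, n_samples)
print("max abs difference vs decision_function:",
      np.abs(manual.ravel() - clf.decision_function(X)).max())

The reconstructed values should match decision_function up to floating-point error, which is a useful way to convince yourself that the dual formulation and the kernel trick are really what the library computes.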

3. Python Implementation

Let's implement a soft margin SVM with an RBF kernel using scikit-learn.

python:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Convert to binary classification for simplicity
X = X[y != 2]
y = y[y != 2]

# Keep only the first two features so the decision boundary can be plotted in 2D
X = X[:, :2]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the SVM model with RBF kernel
model = SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Plot decision boundary (the model must be trained on exactly two features)
def plot_decision_boundary(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    XX, YY = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = model.predict(np.c_[XX.ravel(), YY.ravel()])
    Z = Z.reshape(XX.shape)
    plt.contourf(XX, YY, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o')
    plt.title('SVM Decision Boundary with RBF Kernel')
    plt.show()

plot_decision_boundary(X_test, y_test, model)

Explanation:

  1. Data Preparation: We use the Iris dataset, convert it to a binary classification problem for simplicity, and keep only the first two features so the decision boundary can be plotted in 2D. We then standardize the features.

  2. Model Training: We create an SVC model with an RBF kernel and train it on the data.

  3. Evaluation: We evaluate the model's accuracy and print a classification report.

  4. Visualization: We plot the decision boundary to visualize how the model separates the data.
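As a follow-up to steps 2 and 3, the two hyperparameters introduced in this chapter, C and the RBF width (exposed as gamma), are usually tuned rather than fixed. A brief, optional sketch using cross-validated grid search (the parameter grid below is an arbitrary illustration, not a recommendation from the text):

python:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same binary Iris problem as above
iris = datasets.load_iris()
X, y = iris.data[iris.target != 2], iris.target[iris.target != 2]

# Pipeline so that scaling is re-fit inside each cross-validation fold
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))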

4. Conclusion

SVMs are versatile and can be adapted for linearly non-separable data using the soft margin approach and kernel methods. By allowing some misclassifications and transforming data into higher-dimensional spaces, SVMs can handle complex datasets effectively. The combination of mathematical rigor and practical implementation makes SVMs a powerful tool in machine learning.

This chapter provided a comprehensive overview of handling non-separable data with SVMs, including mathematical formulations and a practical example in Python.
