Machine Learning (Chapter 20): SVMs for Linearly Non-Separable Data

Support Vector Machines (SVMs) are a powerful class of supervised learning models used for classification and regression tasks. While SVMs are often associated with linearly separable data, real-world data is frequently non-linearly separable. In such cases, we need to extend the basic SVM framework to handle these complexities. This chapter will explore how SVMs can be adapted for linearly non-separable data using techniques like the kernel trick and the soft margin approach.

1. Introduction to SVMs

In the simplest case of linear separability, SVMs aim to find a hyperplane that best separates data into different classes. For linearly separable data, the SVM solves the following optimization problem:

\text{Minimize} \quad \frac{1}{2} \| \mathbf{w} \|^2

Subject to:

y_i (\mathbf{w} \cdot \mathbf{x_i} + b) \geq 1 \quad \text{for all } i

where \mathbf{w} is the weight vector, b is the bias, \mathbf{x_i} are the input features, and y_i are the class labels.
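Although this chapter does not give code for the separable case, the hard-margin problem above can be approximated numerically: a linear SVC with a very large C behaves essentially like a hard-margin classifier, because margin violations become prohibitively expensive. A minimal sketch on made-up toy data (the points and the value C=1e6 are illustrative assumptions, not from the text):

python:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical toy data)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin solution
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)

The margin width 2/\|\mathbf{w}\| printed at the end is exactly the quantity the objective above maximizes.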

2. Handling Non-Separable Data

When the data is not linearly separable, the optimization problem must be modified to tolerate some misclassifications. This is done using a soft margin approach, which allows some points to lie inside the margin or even on the wrong side of the decision boundary.

2.1 Soft Margin SVM

The soft margin SVM introduces a penalty for misclassification using slack variables \xi_i. The modified optimization problem becomes:

\text{Minimize} \quad \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^N \xi_i

Subject to:

y_i (\mathbf{w} \cdot \mathbf{x_i} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \text{for all } i

Here, C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error.
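To make the role of C concrete, the short sketch below (an illustrative addition, not part of the original walkthrough) fits a linear soft-margin SVM on overlapping synthetic clusters for a few values of C. Smaller C tolerates more slack and typically keeps more support vectors; larger C penalizes violations more heavily.

python:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so the data is not linearly separable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C:<6} support vectors={clf.n_support_.sum():3d} "
          f"train accuracy={clf.score(X, y):.3f}")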

2.2 Kernel Trick

To handle more complex, non-linearly separable data, we use the kernel trick. This technique involves transforming the data into a higher-dimensional space where it is more likely to be linearly separable. The kernel function computes the dot product in this higher-dimensional space without explicitly performing the transformation.

Common kernels include:

  • Linear Kernel: K(\mathbf{x_i}, \mathbf{x_j}) = \mathbf{x_i} \cdot \mathbf{x_j}
  • Polynomial Kernel: K(\mathbf{x_i}, \mathbf{x_j}) = (\mathbf{x_i} \cdot \mathbf{x_j} + c)^d
  • Radial Basis Function (RBF) Kernel: K(\mathbf{x_i}, \mathbf{x_j}) = \exp\left(-\frac{\| \mathbf{x_i} - \mathbf{x_j} \|^2}{2\sigma^2}\right)
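Since a kernel is just a function of pairs of points, it can be checked directly. The sketch below (illustrative, with arbitrary toy data and σ = 1.5) computes the RBF kernel matrix from the list above by hand and compares it with scikit-learn's rbf_kernel; note that scikit-learn parameterizes the kernel with gamma = 1/(2σ²) rather than σ.

python:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 points in 3 dimensions (toy data)
sigma = 1.5
gamma = 1.0 / (2.0 * sigma**2)

# Explicit pairwise computation: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-sq_dists / (2.0 * sigma**2))

K_sklearn = rbf_kernel(X, X, gamma=gamma)
print("max abs difference:", np.abs(K_manual - K_sklearn).max())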

The dual form of the SVM optimization problem, which uses the kernel trick, is:

\text{Maximize} \quad \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(\mathbf{x_i}, \mathbf{x_j})

Subject to:

0 \leq \alpha_i \leq C, \quad \sum_{i=1}^N \alpha_i y_i = 0

where \alpha_i are the Lagrange multipliers. Once the \alpha_i are found, new points are classified with the decision function \operatorname{sign}\left( \sum_{i=1}^N \alpha_i y_i K(\mathbf{x_i}, \mathbf{x}) + b \right), which depends on the data only through kernel evaluations.
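Scikit-learn exposes the ingredients of this dual solution after fitting, so the kernel-based decision function can be rebuilt by hand as a sanity check. In the sketch below (dataset and hyperparameters are arbitrary choices for illustration), dual_coef_ holds the products \alpha_i y_i for the support vectors and intercept_ holds b.

python:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)

# Kernel between the support vectors and all training points
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)

# Decision values reconstructed from the dual variables:
# sum_i alpha_i y_i K(x_i, x) + b
manual = clf.dual_coef_ @ K + clf.intercept_      # shape (1, n_samples)
print("max abs difference vs decision_function:",
      np.abs(manual.ravel() - clf.decision_function(X)).max())

The reconstructed values should match decision_function up to floating-point error, which is a useful way to convince yourself that the dual formulation and the kernel trick are really what the library computes.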

3. Python Implementation

Let's implement a soft margin SVM with an RBF kernel using scikit-learn.

python:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Convert to binary classification for simplicity
X = X[y != 2]
y = y[y != 2]

# Keep only the first two features so the decision boundary can be plotted in 2D
X = X[:, :2]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create and train the SVM model with RBF kernel
model = SVC(kernel='rbf', C=1.0, gamma='auto')
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Plot decision boundary (the model must be trained on exactly two features)
def plot_decision_boundary(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    XX, YY = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = model.predict(np.c_[XX.ravel(), YY.ravel()])
    Z = Z.reshape(XX.shape)
    plt.contourf(XX, YY, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o')
    plt.title('SVM Decision Boundary with RBF Kernel')
    plt.show()

plot_decision_boundary(X_test, y_test, model)

Explanation:

  1. Data Preparation: We use the Iris dataset, convert it to a binary classification problem for simplicity, and keep only the first two features so the decision boundary can be plotted in 2D. We then standardize the features.

  2. Model Training: We create an SVC model with an RBF kernel and train it on the data.

  3. Evaluation: We evaluate the model's accuracy and print a classification report.

  4. Visualization: We plot the decision boundary to visualize how the model separates the data.
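As a follow-up to steps 2 and 3, the two hyperparameters introduced in this chapter, C and the RBF width (exposed as gamma), are usually tuned rather than fixed. A brief, optional sketch using cross-validated grid search (the parameter grid below is an arbitrary illustration, not a recommendation from the text):

python:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same binary Iris problem as above
iris = datasets.load_iris()
X, y = iris.data[iris.target != 2], iris.target[iris.target != 2]

# Pipeline so that scaling is re-fit inside each cross-validation fold
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))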

4. Conclusion

SVMs are versatile and can be adapted for linearly non-separable data using the soft margin approach and kernel methods. By allowing some misclassifications and transforming data into higher-dimensional spaces, SVMs can handle complex datasets effectively. The combination of mathematical rigor and practical implementation makes SVMs a powerful tool in machine learning.

This chapter provided a comprehensive overview of handling non-separable data with SVMs, including mathematical formulations and a practical example in Python.
