Machine Learning (Chapter 15): Logistic Regression

By Ritesh Sahu August 30, 2024

Introduction

Logistic Regression is a fundamental classification algorithm used in machine learning and statistics. It is widely employed to predict the probability of a binary outcome based on one or more predictor variables. Despite its name, Logistic Regression is used for classification tasks rather than regression tasks.

Mathematical Formulation

Logistic Regression models the probability that a given input belongs to a particular class. For a single predictor variable $X$, this probability is obtained by passing a linear function of $X$ through the logistic function:

$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

Here:

$P(Y = 1 \mid X)$ is the probability that the response variable $Y$ is 1 given the input $X$ .
$\beta_0$ is the intercept term.
$\beta_1$ is the coefficient for the predictor variable $X$ .
$e$ is the base of the natural logarithm.

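The logistic function, also known as the sigmoid function, outputs a value strictly between 0 and 1, so its output can be read as a probability. Comparing that probability against a threshold (commonly 0.5) turns it into a binary class label.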
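To see why the coefficients are usually described in terms of odds, the formula above can be rearranged. This is a short worked derivation, writing $p$ for $P(Y = 1 \mid X)$ and $z$ for $\beta_0 + \beta_1 X$ (both shorthands are introduced here for convenience):

$1 - p = \frac{e^{-z}}{1 + e^{-z}}$

$\frac{p}{1 - p} = \frac{1}{e^{-z}} = e^{z}$

$\log \frac{p}{1 - p} = \beta_0 + \beta_1 X$

So the log-odds are linear in $X$, and increasing $X$ by one unit multiplies the odds by $e^{\beta_1}$.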

Logistic Function

The logistic function is given by:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

where $z$ is a linear combination of the input features. In the single-predictor model above, $z = \beta_0 + \beta_1 X$.
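A minimal sketch of the logistic function in plain Python may help make this concrete. The helper name `sigmoid` and the branch on the sign of `z` are choices made here, not part of the original text. The branch is only there to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The output moves from near 0 to near 1 as z grows, passing 0.5 at z = 0.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

Two properties are worth checking by hand: `sigmoid(0)` is exactly 0.5, and `sigmoid(-z)` equals `1 - sigmoid(z)`, which is why the curve is symmetric about the point (0, 0.5).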

Cost Function

To fit the model to the data, we choose the coefficients that minimize a cost function, which measures the mismatch between the predicted probabilities and the actual class labels. The cost function for logistic regression is the log-loss (also called binary cross-entropy):

$J(\beta_0, \beta_1) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\beta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\beta(x^{(i)})) \right]$

where:

$m$ is the number of training examples.
$y^{(i)} \in \{0, 1\}$ is the actual class label of the $i$-th training example.
$h_\beta(x^{(i)}) = \sigma(\beta_0 + \beta_1 x^{(i)})$ is the hypothesis function, i.e. the predicted probability that $y^{(i)} = 1$.
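The cost function above can be evaluated and minimized in a few lines of plain Python. The following is a sketch under stated assumptions, not how scikit-learn fits the model: it uses batch gradient descent, and the function names, learning rate, iteration count, and toy data are all made up for illustration. The gradients used are $\frac{\partial J}{\partial \beta_0} = \frac{1}{m} \sum_i (h^{(i)} - y^{(i)})$ and $\frac{\partial J}{\partial \beta_1} = \frac{1}{m} \sum_i (h^{(i)} - y^{(i)}) x^{(i)}$, where $h^{(i)}$ is the predicted probability for example $i$.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(b0, b1, xs, ys, eps=1e-12):
    # Average negative log-likelihood; predictions are clipped to avoid log(0).
    total = 0.0
    for x, y in zip(xs, ys):
        h = min(max(sigmoid(b0 + b1 * x), eps), 1.0 - eps)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(xs)


def fit(xs, ys, lr=0.1, n_iter=2000):
    # Batch gradient descent on the log-loss, starting from zero coefficients.
    b0, b1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iter):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        grad_b0 = sum(errors) / m
        grad_b1 = sum(e * x for e, x in zip(errors, xs)) / m
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Made-up toy data with overlapping classes (illustration only).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit(xs, ys)
print(f"loss at zero coefficients: {log_loss(0.0, 0.0, xs, ys):.4f}")
print(f"loss after fitting:        {log_loss(b0_hat, b1_hat, xs, ys):.4f}")
print(f"beta_0 = {b0_hat:.3f}, beta_1 = {b1_hat:.3f}")
```

With both coefficients at zero every prediction is 0.5, so the starting loss is $\log 2 \approx 0.693$. Any useful fit should end below that value.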

Python Implementation

Let’s implement a simple logistic regression model using Python with the scikit-learn library.

python:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

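# Generate synthetic data. With only 2 features, both must be informative:
# the default n_redundant=2 would raise a ValueError.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)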

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Plot decision boundary
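def plot_decision_boundary(X, y, model):
    """Shade the predicted class over a 2D grid and overlay the points in X."""
    h = .02  # grid step size
    # Pad the plotting range by 1 unit around the data on each axis
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict a class for every grid point, then restore the grid shape
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Filled contours show the two predicted regions; points are colored by true label
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', marker='o')
    plt.title("Logistic Regression Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()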

plot_decision_boundary(X_test, y_test, model)

Explanation

Data Generation: We generate synthetic data with two features and binary labels using make_classification.
Model Training: We split the data into training and test sets and train a logistic regression model using scikit-learn.
Model Evaluation: We evaluate the model’s performance by calculating accuracy, confusion matrix, and classification report.
Visualization: We plot the decision boundary of the logistic regression model to visualize how well it separates the classes.
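The steps above work with hard class labels from `predict`, but a fitted scikit-learn model also exposes the underlying probabilities. The sketch below is an illustration rather than part of the original walkthrough: it regenerates similar synthetic data (assuming `n_informative=2` and `n_redundant=0`), then rebuilds the probabilities by hand from `intercept_` and `coef_` to show that they follow the logistic formula, and that thresholding at 0.5 gives the predicted labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (parameters chosen here for illustration).
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)
model = LogisticRegression().fit(X, y)

# Probability of class 1 as reported by scikit-learn.
proba = model.predict_proba(X)[:, 1]

# The same quantity computed by hand: sigmoid of the linear combination.
z = model.intercept_[0] + X @ model.coef_[0]
manual = 1.0 / (1.0 + np.exp(-z))

# Thresholding the probability at 0.5 gives the hard class labels.
labels = (proba > 0.5).astype(int)

print("max difference between predict_proba and manual sigmoid:",
      np.abs(proba - manual).max())
print("labels match model.predict:", np.array_equal(labels, model.predict(X)))
```

Working with probabilities instead of labels makes it possible to pick a threshold other than 0.5 when the two kinds of error have different costs.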

Conclusion

Logistic Regression is a simple yet effective classification algorithm that uses the logistic function to model binary outcomes. Its coefficients are fit by minimizing the log-loss, and its implementation in Python with scikit-learn makes it accessible for practical use. Because the log-odds of the outcome are linear in the predictors, the fitted coefficients also show how each predictor variable relates to the probability of a specific outcome.

Feel free to adjust the parameters and explore more advanced topics like regularization, multiclass classification, and feature engineering to further enhance your understanding of logistic regression.
