Machine Learning (Chapter 15): Logistic Regression

 



Introduction

Logistic Regression is a fundamental classification algorithm used in machine learning and statistics. It is widely employed to predict the probability of a binary outcome based on one or more predictor variables. Despite its name, Logistic Regression is used for classification tasks rather than regression tasks.

Mathematical Formulation

Logistic Regression models the probability that a given input belongs to a particular class. The probability is modeled using the logistic function, which is defined as:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

Here:

  • $P(Y = 1 \mid X)$ is the probability that the response variable $Y$ is 1 given the input $X$.
  • $\beta_0$ is the intercept term.
  • $\beta_1$ is the coefficient for the predictor variable $X$.
  • $e$ is the base of the natural logarithm.

The logistic function, also known as the sigmoid function, outputs a value between 0 and 1, making it suitable for binary classification.
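As a quick numeric illustration of the formula above, the sketch below evaluates $P(Y = 1 \mid X)$ for a single predictor. The coefficient values (`beta0 = -1.0`, `beta1 = 2.0`), the helper name `predict_probability`, and the 0.5 decision threshold are assumptions chosen for illustration, not values from a fitted model.

```python
import math

def predict_probability(x, beta0, beta1):
    """Return P(Y = 1 | X = x) under a one-predictor logistic model."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 2.0

for x in (-1.0, 0.5, 2.0):
    p = predict_probability(x, beta0, beta1)
    label = 1 if p >= 0.5 else 0  # assumed 0.5 decision threshold
    print(f"x = {x:+.1f}  P(Y=1|x) = {p:.3f}  predicted class = {label}")
```

With these coefficients, x = 0.5 makes the linear term zero, so the probability there is exactly 0.5. That point is the decision boundary for a 0.5 threshold.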

Logistic Function

The logistic function is given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is a linear combination of the input features.
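For readers who want to try the function on several features at once, here is a minimal NumPy sketch. The feature matrix, weights, and intercept are made-up values, and the two-branch form is one common way to avoid overflow for large inputs, not necessarily what any particular library does internally.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function, applied elementwise."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

# Hypothetical feature matrix (3 samples, 2 features) and parameters.
X = np.array([[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5]])
weights = np.array([0.8, -0.4])
intercept = 0.1

z = X @ weights + intercept  # linear combination of the input features
print(sigmoid(z))
print(sigmoid([-1000.0, 0.0, 1000.0]))  # extreme inputs, no overflow
```

Two properties are easy to check with this function: `sigmoid(0)` is 0.5, and `sigmoid(-z)` equals `1 - sigmoid(z)`.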

Cost Function

To fit the model to the data, we minimize a cost function that measures the mismatch between the predicted probabilities and the actual class labels. For logistic regression, this cost function is the log-loss (also called binary cross-entropy):

$$J(\beta_0, \beta_1) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$

where:

  • $m$ is the number of training examples.
  • $h_\theta(x^{(i)}) = \sigma(\beta_0 + \beta_1 x^{(i)})$ is the hypothesis function.
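To make the cost function concrete, the sketch below computes $J(\beta_0, \beta_1)$ with NumPy and then lowers it with plain gradient descent. Gradient descent is swapped in here because it is easy to read; scikit-learn uses other solvers by default. The synthetic data, the generating coefficients, the learning rate, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta0, beta1, x, y, eps=1e-15):
    """Log-loss J(beta0, beta1), averaged over the m training examples."""
    h = np.clip(sigmoid(beta0 + beta1 * x), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Synthetic one-feature data; the generating coefficients are hypothetical.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < sigmoid(-0.5 + 2.0 * x)).astype(float)

beta0, beta1 = 0.0, 0.0
initial_cost = cost(beta0, beta1, x, y)  # log(2): every prediction is 0.5

learning_rate = 0.5  # assumed step size
for _ in range(500):
    h = sigmoid(beta0 + beta1 * x)
    # Partial derivatives of J with respect to beta0 and beta1.
    grad0 = np.mean(h - y)
    grad1 = np.mean((h - y) * x)
    beta0 -= learning_rate * grad0
    beta1 -= learning_rate * grad1

final_cost = cost(beta0, beta1, x, y)
print(f"cost before: {initial_cost:.4f}, cost after: {final_cost:.4f}")
print(f"estimated beta0 = {beta0:.3f}, beta1 = {beta1:.3f}")
```

With both coefficients at zero the model predicts 0.5 for every example, so the starting cost is log(2), roughly 0.693. Any useful fit should end up below that value.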

Python Implementation

Let’s implement a simple logistic regression model using Python with the scikit-learn library.

python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Generate synthetic data.
# With only two features, n_informative and n_redundant must be set explicitly:
# the defaults (2 informative + 2 redundant) exceed n_features and raise a ValueError.
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

# Split the data into training and testing sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Plot decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # step size of the mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Predict the class of every grid point, then reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', marker='o')
    plt.title("Logistic Regression Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()

plot_decision_boundary(X_test, y_test, model)

Explanation

  1. Data Generation: We generate synthetic data with two features and binary labels using make_classification.
  2. Model Training: We split the data into training and test sets and train a logistic regression model using scikit-learn.
  3. Model Evaluation: We evaluate the model’s performance by calculating accuracy, confusion matrix, and classification report.
  4. Visualization: We plot the decision boundary of the logistic regression model to visualize how well it separates the classes.
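To connect the steps above back to the formula, the sketch below refits a model on the same kind of synthetic data and checks that the probabilities from `predict_proba` match the sigmoid applied to the learned intercept and coefficients. The explicit `n_informative=2, n_redundant=0` arguments are there so that a two-feature dataset is valid for `make_classification`. The variable names `manual_prob` and `manual_pred` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)
model = LogisticRegression().fit(X, y)

# Linear combination of the features using the fitted parameters.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))          # sigmoid by hand
sklearn_prob = model.predict_proba(X)[:, 1]     # P(Y = 1 | X) from scikit-learn

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("max abs difference:", np.max(np.abs(manual_prob - sklearn_prob)))

# Thresholding the probability at 0.5 gives the predicted class labels.
manual_pred = (manual_prob >= 0.5).astype(int)
print("labels agree with model.predict:", np.array_equal(manual_pred, model.predict(X)))
```

Reading `model.coef_` and `model.intercept_` this way is also a simple route to interpreting the fitted model: a positive coefficient means larger values of that feature push the predicted probability of class 1 upward.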

Conclusion

Logistic Regression is a simple yet effective classification algorithm that uses the logistic function to model binary outcomes. It rests on a well-understood statistical foundation, and scikit-learn makes it straightforward to apply in Python. Because each coefficient describes how a predictor shifts the log-odds of the outcome, the fitted model also offers insight into the relationship between predictor variables and the probability of a specific outcome.

Feel free to adjust the parameters and explore more advanced topics like regularization, multiclass classification, and feature engineering to further enhance your understanding of logistic regression.
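As one possible starting point for the regularization experiments mentioned above, the sketch below varies the `C` parameter of scikit-learn's `LogisticRegression`. `C` is the inverse of the regularization strength, and the default penalty is L2. The specific values of `C` and the synthetic dataset are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=42)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means stronger regularization.
    model = LogisticRegression(C=C).fit(X, y)
    coef_norms[C] = np.linalg.norm(model.coef_)
    print(f"C = {C:>6}: coefficients = {model.coef_.ravel()}, "
          f"norm = {coef_norms[C]:.3f}")
```

Stronger regularization (smaller `C`) shrinks the coefficients toward zero, which shows up here as a smaller coefficient norm.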
