Machine Learning (Chapter 25): ANN III - Backpropagation II

Introduction

In this chapter, we continue our exploration of Artificial Neural Networks (ANNs) with a deeper dive into the Backpropagation algorithm. Backpropagation is a critical component of training neural networks, allowing them to learn from data by minimizing the error between predicted and actual outcomes. This chapter builds on the foundations laid in the previous sections, introducing more advanced concepts and mathematical techniques involved in Backpropagation.

Recap: The Role of Backpropagation

Backpropagation is a supervised learning algorithm used to train neural networks. It works by propagating the error backward from the output layer to the input layer, updating the weights of the network to minimize the loss function. The key steps involved in Backpropagation are:

  1. Forward Pass: Compute the output of the network using the current weights.
  2. Compute Loss: Calculate the loss using a loss function (e.g., Mean Squared Error, Cross-Entropy).
  3. Backward Pass: Compute the gradient of the loss with respect to each weight using the chain rule.
  4. Update Weights: Adjust the weights using gradient descent or another optimization technique (a minimal sketch of one full training step follows this list).
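
As a concrete illustration of these four steps, here is a minimal sketch of a single training step for one sigmoid neuron with a mean squared error loss; the toy data, network size, and learning rate are assumptions for demonstration only.

python:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Toy data: 3 examples, 2 features, binary targets
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0]])

w = np.random.randn(2, 1)   # weights of a single sigmoid neuron
eta = 0.1                   # learning rate

# 1. Forward pass
y_hat = sigmoid(X @ w)

# 2. Compute loss (mean squared error)
loss = np.mean((y - y_hat) ** 2)

# 3. Backward pass: dL/dw via the chain rule
#    dL/dy_hat = -2(y - y_hat)/N,  dy_hat/dz = y_hat(1 - y_hat),  dz/dw = x
grad = X.T @ (-2 * (y - y_hat) / len(y) * y_hat * (1 - y_hat))

# 4. Update weights with gradient descent
w = w - eta * grad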

Advanced Backpropagation Techniques

1. Momentum

One challenge with plain gradient descent is slow convergence and the risk of getting stuck in local minima. Momentum is an enhancement that accumulates past gradients to accelerate updates along directions of consistent descent, leading to faster convergence.

Mathematically, momentum modifies the weight update rule as follows:

$$v_{t+1} = \beta v_t + \eta \nabla_w L(w_t), \qquad w_{t+1} = w_t - v_{t+1}$$

Here:

  • $\beta$ is the momentum term, typically between 0.9 and 0.99.
  • $\eta$ is the learning rate.
  • $\nabla_w L(w_t)$ is the gradient of the loss function with respect to the weights.

Momentum helps the network gain speed in directions with consistent gradients and reduces oscillations in directions with fluctuating gradients.
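
To make the momentum update concrete, here is a minimal NumPy sketch of momentum-based gradient descent on an illustrative quadratic loss; the loss_grad function and the hyperparameter values are assumptions chosen for demonstration, not part of the chapter's example.

python:

import numpy as np

# Illustrative loss: L(w) = ||w||^2 / 2, so the gradient is simply w
def loss_grad(w):
    return w

w = np.array([5.0, -3.0])   # initial weights
v = np.zeros_like(w)        # velocity (accumulated update)
eta, beta = 0.1, 0.9        # learning rate and momentum term

for t in range(200):
    v = beta * v + eta * loss_grad(w)   # v_{t+1} = beta * v_t + eta * grad
    w = w - v                           # w_{t+1} = w_t - v_{t+1}

print(w)  # the weights approach the minimum at the origin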

2. Nesterov Accelerated Gradient (NAG)

Nesterov Accelerated Gradient is a variant of momentum that evaluates the gradient at the anticipated future position of the parameters rather than at their current position, resulting in a more accurate gradient step.

The update rule is:

$$v_{t+1} = \beta v_t + \eta \nabla_w L(w_t - \beta v_t), \qquad w_{t+1} = w_t - v_{t+1}$$

NAG looks ahead at the gradient of the loss, effectively providing a correction to the velocity, leading to better convergence properties.
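
The look-ahead step can be illustrated with a similarly minimal sketch, again on an assumed quadratic loss (loss_grad and the hyperparameters are illustrative only):

python:

import numpy as np

# Illustrative loss: L(w) = ||w||^2 / 2, gradient is w
def loss_grad(w):
    return w

w = np.array([5.0, -3.0])
v = np.zeros_like(w)
eta, beta = 0.1, 0.9

for t in range(200):
    lookahead = w - beta * v                    # anticipated future position w_t - beta * v_t
    v = beta * v + eta * loss_grad(lookahead)   # gradient evaluated at the look-ahead point
    w = w - v

print(w)  # the weights approach the minimum at the origin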

3. Adaptive Learning Rates (AdaGrad, RMSProp, Adam)

AdaGrad, RMSProp, and Adam are optimization algorithms that adapt the learning rate during training.

  • AdaGrad: Adjusts the learning rate for each parameter based on its accumulated historical gradients (see the sketch after this list).

    $$w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, \nabla_w L(w_t)$$

    where $G_t$ is the element-wise sum of the squared gradients up to time $t$, and $\epsilon$ is a small constant to prevent division by zero.

  • RMSProp: Improves on AdaGrad by replacing the full sum of squared gradients with an exponentially decaying average, so that recent gradients dominate.

    $$E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta)\,(\nabla_w L(w_t))^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, \nabla_w L(w_t)$$
  • Adam: Combines the advantages of both momentum (a first-moment estimate) and RMSProp (a second-moment estimate).

    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \nabla_w L(w_t), \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, (\nabla_w L(w_t))^2$$

    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

    where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected moment estimates that compensate for the zero initialization of $m_0$ and $v_0$.
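
As referenced in the AdaGrad item above, here is a minimal NumPy sketch of the AdaGrad and RMSProp updates on an illustrative quadratic loss; loss_grad and all hyperparameter values are assumptions for demonstration, not recommended settings. A full Adam example follows in the next section.

python:

import numpy as np

# Illustrative loss: L(w) = ||w||^2 / 2, gradient is w
def loss_grad(w):
    return w

eta, eps, beta = 0.5, 1e-8, 0.9

# AdaGrad: accumulate the full history of squared gradients
w = np.array([5.0, -3.0])
G = np.zeros_like(w)
for t in range(200):
    g = loss_grad(w)
    G += g**2                             # element-wise sum of squared gradients
    w -= eta / np.sqrt(G + eps) * g       # per-parameter effective learning rate shrinks over time

print("AdaGrad:", w)

# RMSProp: exponentially decaying average of squared gradients
w = np.array([5.0, -3.0])
Eg2 = np.zeros_like(w)
for t in range(200):
    g = loss_grad(w)
    Eg2 = beta * Eg2 + (1 - beta) * g**2  # E[g^2]_t
    w -= eta / np.sqrt(Eg2 + eps) * g

print("RMSProp:", w)  # both runs move the weights toward the minimum at the origin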

Python Code Example: Backpropagation with Adam Optimizer

python:

import numpy as np

# Define the sigmoid activation function and its derivative
# (the derivative expects the sigmoid *output* as its argument)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# Initialize input data (X) and target output (y)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Set seed for reproducibility
np.random.seed(42)

# Initialize weights randomly
weights_input_hidden = np.random.rand(2, 2)
weights_hidden_output = np.random.rand(2, 1)

# Set hyperparameters
learning_rate = 0.01
epochs = 10000

# Adam parameters and moment accumulators
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8
m_wih = np.zeros_like(weights_input_hidden)
v_wih = np.zeros_like(weights_input_hidden)
m_who = np.zeros_like(weights_hidden_output)
v_who = np.zeros_like(weights_hidden_output)

for epoch in range(epochs):
    # Forward pass
    hidden_input = np.dot(X, weights_input_hidden)
    hidden_output = sigmoid(hidden_input)
    final_input = np.dot(hidden_output, weights_hidden_output)
    final_output = sigmoid(final_input)

    # Compute error
    error = y - final_output

    # Backward pass (gradient calculation)
    d_output = error * sigmoid_derivative(final_output)
    error_hidden_layer = d_output.dot(weights_hidden_output.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_output)

    # Update input-to-hidden weights with the Adam optimizer
    grad_wih = X.T.dot(d_hidden_layer)
    m_wih = beta1 * m_wih + (1 - beta1) * grad_wih
    v_wih = beta2 * v_wih + (1 - beta2) * grad_wih**2
    m_wih_corr = m_wih / (1 - beta1**(epoch + 1))
    v_wih_corr = v_wih / (1 - beta2**(epoch + 1))
    weights_input_hidden += learning_rate * m_wih_corr / (np.sqrt(v_wih_corr) + epsilon)

    # Update hidden-to-output weights with the Adam optimizer
    grad_who = hidden_output.T.dot(d_output)
    m_who = beta1 * m_who + (1 - beta1) * grad_who
    v_who = beta2 * v_who + (1 - beta2) * grad_who**2
    m_who_corr = m_who / (1 - beta1**(epoch + 1))
    v_who_corr = v_who / (1 - beta2**(epoch + 1))
    weights_hidden_output += learning_rate * m_who_corr / (np.sqrt(v_who_corr) + epsilon)

# Print the final output
print("Final output after training:")
print(final_output)

Conclusion

Backpropagation remains a cornerstone of neural network training, and understanding its advanced techniques is crucial for improving model performance. Momentum, Nesterov Accelerated Gradient, and adaptive learning rates like Adam optimize the training process by enhancing convergence speed and stability. The mathematical foundations and Python code provided in this chapter should equip you with the tools to implement and experiment with these techniques in your own neural networks.

In the next chapter, we'll explore techniques for regularization in neural networks, including L1 and L2 regularization, dropout, and batch normalization.
