Machine Learning (Chapter 26): ANN IV - Initialization, Training & Validation

Introduction

In the previous chapters, we explored the fundamental concepts and mechanisms behind Artificial Neural Networks (ANNs), including early models, backpropagation, and its extensions. In this chapter, we'll dive into the crucial steps of initializing, training, and validating ANNs. Proper initialization, effective training techniques, and rigorous validation methods are key to building neural networks that generalize well and perform effectively on unseen data.

1. Initialization of Weights

The initialization of weights in a neural network significantly influences the training process. Poor initialization can lead to slow convergence or even prevent the network from converging at all. Two popular initialization methods are Xavier Initialization and He Initialization.

1.1 Xavier Initialization

Xavier initialization (also known as Glorot initialization) is designed to keep the scale of the gradients roughly the same across all layers. This method works well with activation functions like sigmoid and tanh.

Mathematical Formula:

For a layer with $n_{in}$ input units and $n_{out}$ output units, the weights are initialized as:

$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$

Where $\mathcal{U}$ denotes a uniform distribution.

1.2 He Initialization

He initialization is similar to Xavier but is tailored to ReLU and its variants: because ReLU zeroes out roughly half of its inputs, the weight variance is doubled relative to Xavier to keep the signal variance stable across layers, and the saturation-driven vanishing gradients of sigmoid and tanh are not a concern.

Mathematical Formula:

For a layer with $n_{in}$ input units, the weights are initialized as:

$W \sim \mathcal{N}\left(0,\ \sqrt{\frac{2}{n_{in}}}\right)$

Where $\mathcal{N}(\mu, \sigma)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$.

1.3 Python Implementation

python:

import numpy as np

def xavier_init(size, n_in, n_out):
    # Xavier/Glorot: uniform on [-limit, limit] with limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=size)

def he_init(size, n_in):
    # He: zero-mean normal with standard deviation sqrt(2 / n_in)
    stddev = np.sqrt(2 / n_in)
    return np.random.normal(0, stddev, size=size)

# Example usage:
layer_dims = [784, 128, 64, 10]  # example dimensions for a 3-layer neural network
weights_xavier = [xavier_init((layer_dims[i], layer_dims[i + 1]), layer_dims[i], layer_dims[i + 1])
                  for i in range(len(layer_dims) - 1)]
weights_he = [he_init((layer_dims[i], layer_dims[i + 1]), layer_dims[i])
              for i in range(len(layer_dims) - 1)]
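
To see that these initializers produce the intended scales, a quick sanity check can compare the sampled weights against the theoretical targets. This is a minimal sketch, assuming the `xavier_init`/`he_init` functions and `layer_dims` defined above:

python:

# Sanity check: compare the empirical spread of the sampled weights
# against the theoretical targets for each layer.
for i, (wx, wh) in enumerate(zip(weights_xavier, weights_he)):
    n_in, n_out = layer_dims[i], layer_dims[i + 1]
    xavier_limit = np.sqrt(6 / (n_in + n_out))   # uniform bound for Xavier
    he_std = np.sqrt(2 / n_in)                   # target std for He
    print(f"Layer {i}: Xavier max |w| = {np.abs(wx).max():.3f} (limit {xavier_limit:.3f}), "
          f"He std = {wh.std():.3f} (target {he_std:.3f})")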

2. Training Process

The training process of a neural network involves updating the weights to minimize the loss function. This process is typically done using Stochastic Gradient Descent (SGD) and its variants like Adam.

2.1 Loss Function

The loss function measures how well the network's predictions match the actual data. For classification problems, Cross-Entropy Loss is commonly used, while Mean Squared Error (MSE) is used for regression tasks.

Mathematical Formula:

For (binary) cross-entropy loss:

$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

For mean squared error:

$L = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$

Where:

  • $m$ is the number of samples
  • $y_i$ is the actual label
  • $\hat{y}_i$ is the predicted label
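
Both losses translate directly into NumPy. The snippet below is a minimal sketch; the small epsilon clamp in the cross-entropy is an added assumption for numerical stability, not part of the formula:

python:

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy; predictions are clipped to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mse_loss(y_true, y_pred):
    # Mean squared error for regression targets.
    return np.mean((y_true - y_pred) ** 2)

# Example usage:
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
print(cross_entropy_loss(y_true, y_pred))  # ~0.266
print(mse_loss(y_true, y_pred))            # 0.0625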

2.2 Gradient Descent

The gradients of the loss function with respect to the weights are computed, and the weights are updated accordingly.

Mathematical Formula:

For a weight WW, the update rule is:

$W = W - \eta \cdot \nabla_W L$

Where:

  • $\eta$ is the learning rate
  • $\nabla_W L$ is the gradient of the loss function with respect to $W$

2.3 Python Implementation

python:

def sgd_update(weights, gradients, learning_rate):
    # Vanilla SGD: step each weight matrix against its gradient.
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

def adam_update(weights, gradients, learning_rate, beta1, beta2, eps, t, m, v):
    # Adam: exponentially weighted first (m) and second (v) moment estimates,
    # with bias correction (m_hat, v_hat) for the early time steps.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, gradients)]
    v = [beta2 * vi + (1 - beta2) * (gi ** 2) for vi, gi in zip(v, gradients)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    weights = [w - learning_rate * mhi / (np.sqrt(vhi) + eps)
               for w, mhi, vhi in zip(weights, m_hat, v_hat)]
    return weights, m, v

# Example usage:
weights = [np.random.randn(*w.shape) for w in weights_xavier]
gradients = [np.random.randn(*w.shape) for w in weights]
learning_rate = 0.01

# SGD update
weights = sgd_update(weights, gradients, learning_rate)

# Adam update
beta1, beta2, eps = 0.9, 0.999, 1e-8
t = 1  # time step; increment after every update when used inside a training loop
m = [np.zeros_like(w) for w in weights]
v = [np.zeros_like(w) for w in weights]
weights, m, v = adam_update(weights, gradients, learning_rate, beta1, beta2, eps, t, m, v)

3. Validation

Validation is essential for assessing the model's performance on unseen data, preventing overfitting, and fine-tuning hyperparameters. The dataset is typically split into training, validation, and test sets.

3.1 Train-Validation Split

A common approach is to use 80% of the data for training and 20% for validation.

python:

from sklearn.model_selection import train_test_split

# X: feature matrix, y: labels (assumed to be defined earlier)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
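
The introduction to this section also mentioned a held-out test set. As a minimal sketch (the 60/20/20 proportions are an illustrative assumption, not a rule), two calls to train_test_split produce all three subsets:

python:

# First carve out 20% as the test set, then split the remainder 75/25,
# which yields roughly 60% train / 20% validation / 20% test overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)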

3.2 Early Stopping

Early stopping is a technique to halt training when the validation loss stops decreasing, indicating that the model may be overfitting.

python:

min_val_loss = float('inf')
patience, patience_counter = 10, 0

for epoch in range(epochs):
    # Training step
    # ...

    # Validation step
    val_loss = compute_loss(X_val, y_val)

    if val_loss < min_val_loss:
        # Validation loss improved: remember the best weights and reset patience.
        min_val_loss = val_loss
        best_weights = weights
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping...")
            break

# Restore the weights from the epoch with the lowest validation loss.
weights = best_weights

Conclusion

Initialization, training, and validation are crucial steps in developing effective neural networks. Proper initialization sets the foundation for efficient learning, while careful training and validation ensure that the model generalizes well to unseen data. By understanding and applying these techniques, one can significantly improve the performance of ANNs and build robust machine learning models.
