Machine Learning (Chapter 26): ANN IV - Initialization, Training & Validation

Introduction

In the previous chapters, we explored the fundamental concepts and mechanisms behind Artificial Neural Networks (ANNs), including early models, backpropagation, and its extensions. In this chapter, we'll dive into the crucial steps of initializing, training, and validating ANNs. Proper initialization, effective training techniques, and rigorous validation methods are key to building neural networks that generalize well and perform effectively on unseen data.

1. Initialization of Weights

The initialization of weights in a neural network significantly influences the training process. Poor initialization can lead to slow convergence or even prevent the network from converging at all. Two popular initialization methods are Xavier Initialization and He Initialization.

1.1 Xavier Initialization

Xavier initialization (also known as Glorot initialization) is designed to keep the scale of the gradients roughly the same across all layers. This method works well with activation functions like sigmoid and tanh.

Mathematical Formula:

For a layer with $n_{in}$ input units and $n_{out}$ output units, the weights are initialized as:

$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$

Where $\mathcal{U}$ denotes a uniform distribution.

1.2 He Initialization

He initialization is similar to Xavier but is tailored to ReLU and its variants: because ReLU zeroes out roughly half of its inputs, the weight variance is doubled relative to Xavier to keep the signal variance stable across layers, and the saturation-driven vanishing gradients of sigmoid and tanh are not a concern.

Mathematical Formula:

For a layer with $n_{in}$ input units, the weights are initialized as:

$W \sim \mathcal{N}\left(0,\ \sqrt{\frac{2}{n_{in}}}\right)$

Where $\mathcal{N}(\mu, \sigma)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$.

1.3 Python Implementation

python:

import numpy as np

def xavier_init(size, n_in, n_out):
    # Xavier/Glorot: uniform on [-limit, limit] with limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=size)

def he_init(size, n_in):
    # He: zero-mean normal with standard deviation sqrt(2 / n_in)
    stddev = np.sqrt(2 / n_in)
    return np.random.normal(0, stddev, size=size)

# Example usage:
layer_dims = [784, 128, 64, 10]  # example dimensions for a 3-layer neural network
weights_xavier = [xavier_init((layer_dims[i], layer_dims[i + 1]), layer_dims[i], layer_dims[i + 1])
                  for i in range(len(layer_dims) - 1)]
weights_he = [he_init((layer_dims[i], layer_dims[i + 1]), layer_dims[i])
              for i in range(len(layer_dims) - 1)]
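
To see that these initializers produce the intended scales, a quick sanity check can compare the sampled weights against the theoretical targets. This is a minimal sketch, assuming the `xavier_init`/`he_init` functions and `layer_dims` defined above:

python:

# Sanity check: compare the empirical spread of the sampled weights
# against the theoretical targets for each layer.
for i, (wx, wh) in enumerate(zip(weights_xavier, weights_he)):
    n_in, n_out = layer_dims[i], layer_dims[i + 1]
    xavier_limit = np.sqrt(6 / (n_in + n_out))   # uniform bound for Xavier
    he_std = np.sqrt(2 / n_in)                   # target std for He
    print(f"Layer {i}: Xavier max |w| = {np.abs(wx).max():.3f} (limit {xavier_limit:.3f}), "
          f"He std = {wh.std():.3f} (target {he_std:.3f})")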

2. Training Process

The training process of a neural network involves updating the weights to minimize the loss function. This process is typically done using Stochastic Gradient Descent (SGD) and its variants like Adam.

2.1 Loss Function

The loss function measures how well the network's predictions match the actual data. For classification problems, Cross-Entropy Loss is commonly used, while Mean Squared Error (MSE) is used for regression tasks.

Mathematical Formula:

For (binary) cross-entropy loss:

$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

For mean squared error:

$L = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$

Where:

  • $m$ is the number of samples
  • $y_i$ is the actual label
  • $\hat{y}_i$ is the predicted label
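
Both losses translate directly into NumPy. The snippet below is a minimal sketch; the small epsilon clamp in the cross-entropy is an added assumption for numerical stability, not part of the formula:

python:

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy; predictions are clipped to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mse_loss(y_true, y_pred):
    # Mean squared error for regression targets.
    return np.mean((y_true - y_pred) ** 2)

# Example usage:
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
print(cross_entropy_loss(y_true, y_pred))  # ~0.266
print(mse_loss(y_true, y_pred))            # 0.0625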

2.2 Gradient Descent

The gradients of the loss function with respect to the weights are computed, and the weights are updated accordingly.

Mathematical Formula:

For a weight WW, the update rule is:

$W = W - \eta \cdot \nabla_W L$

Where:

  • $\eta$ is the learning rate
  • $\nabla_W L$ is the gradient of the loss function with respect to $W$

2.3 Python Implementation

python:

def sgd_update(weights, gradients, learning_rate):
    # Vanilla SGD: step each weight matrix against its gradient.
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

def adam_update(weights, gradients, learning_rate, beta1, beta2, eps, t, m, v):
    # Adam: exponentially weighted first (m) and second (v) moment estimates,
    # with bias correction (m_hat, v_hat) for the early time steps.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, gradients)]
    v = [beta2 * vi + (1 - beta2) * (gi ** 2) for vi, gi in zip(v, gradients)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    weights = [w - learning_rate * mhi / (np.sqrt(vhi) + eps)
               for w, mhi, vhi in zip(weights, m_hat, v_hat)]
    return weights, m, v

# Example usage:
weights = [np.random.randn(*w.shape) for w in weights_xavier]
gradients = [np.random.randn(*w.shape) for w in weights]
learning_rate = 0.01

# SGD update
weights = sgd_update(weights, gradients, learning_rate)

# Adam update
beta1, beta2, eps = 0.9, 0.999, 1e-8
t = 1  # time step; increment after every update when used inside a training loop
m = [np.zeros_like(w) for w in weights]
v = [np.zeros_like(w) for w in weights]
weights, m, v = adam_update(weights, gradients, learning_rate, beta1, beta2, eps, t, m, v)

3. Validation

Validation is essential for assessing the model's performance on unseen data, preventing overfitting, and fine-tuning hyperparameters. The dataset is typically split into training, validation, and test sets.

3.1 Train-Validation Split

A common approach is to use 80% of the data for training and 20% for validation.

python:

from sklearn.model_selection import train_test_split

# X: feature matrix, y: labels (assumed to be defined earlier)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
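
The introduction to this section also mentioned a held-out test set. As a minimal sketch (the 60/20/20 proportions are an illustrative assumption, not a rule), two calls to train_test_split produce all three subsets:

python:

# First carve out 20% as the test set, then split the remainder 75/25,
# which yields roughly 60% train / 20% validation / 20% test overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)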

3.2 Early Stopping

Early stopping is a technique to halt training when the validation loss stops decreasing, indicating that the model may be overfitting.

python:

min_val_loss = float('inf')
patience, patience_counter = 10, 0

for epoch in range(epochs):
    # Training step
    # ...

    # Validation step
    val_loss = compute_loss(X_val, y_val)

    if val_loss < min_val_loss:
        # Validation loss improved: remember the best weights and reset patience.
        min_val_loss = val_loss
        best_weights = weights
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping...")
            break

# Restore the weights from the epoch with the lowest validation loss.
weights = best_weights

Conclusion

Initialization, training, and validation are crucial steps in developing effective neural networks. Proper initialization sets the foundation for efficient learning, while careful training and validation ensure that the model generalizes well to unseen data. By understanding and applying these techniques, one can significantly improve the performance of ANNs and build robust machine learning models.
