
Stephen's Blog

Gradient Descent-Based Optimization Algorithms for Training Deep Learning Models

Stephen Cheng

 

Intro

Gradient descent is an optimization method for finding the minimum of a function. It is commonly used in deep learning to update the weights of a neural network through backpropagation. Understanding different optimization algorithms and their strengths and weaknesses is crucial for any data scientist training deep learning models. Selecting the right optimizer for the task at hand is paramount to achieving the best possible training results in the shortest amount of time. Next, we’ll explore the most commonly used deep learning optimization algorithms, including Gradient Descent, Stochastic Gradient Descent, and the Adam optimizer. By the end of this article, you’ll have a clear idea of how to choose the best algorithm for training your models.

Gradient Descent

Gradient Descent is an algorithm designed to minimize a function by iteratively moving towards the minimum value of the function. It’s akin to a hiker trying to find the lowest point in a valley shrouded in fog. The hiker starts at a random location and can only feel the slope of the ground beneath their feet. To reach the valley’s lowest point, the hiker takes steps in the direction of the steepest descent. All deep learning model optimization algorithms widely used today are based on Gradient Descent.

How it Works

  • Initialization: Start with random values for the model’s weights.
  • Gradient computation: Calculate the gradient of the cost function with respect to each parameter. The gradient is a vector that points in the direction of the steepest increase of the function. In the context of optimization, we’re interested in the negative gradient, which points towards the direction of the steepest decrease.
  • Update parameters: Adjust the model’s parameters in the direction opposite to the gradient. This step is done by subtracting a fraction of the gradient from the current values of the parameters. The size of this step is determined by the learning rate, a hyperparameter that controls how fast or slow we move toward the optimal weights.

Mathematical Representation

The update rule for each parameter w can be mathematically represented as:

w ← w − α ∇wJ(w)

where w represents the model’s parameters (weights), α is the learning rate, and ∇wJ(w) is the gradient of the cost function J(w) with respect to w.
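
For concreteness, the loop below is a minimal NumPy sketch of this update rule on a made-up least-squares problem; the data, cost function, and learning rate are invented purely for illustration and are not prescribed by the article.

```python
import numpy as np

# Toy problem (illustrative only): J(w) = (1/2n) * ||X @ w - y||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = rng.normal(size=3)   # initialization: random weights
alpha = 0.1              # learning rate

for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of J(w) over the full dataset
    w = w - alpha * grad                # step in the direction opposite to the gradient
```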

The learning rate is a crucial hyperparameter that needs to be chosen carefully. If it’s too small, the algorithm will converge very slowly. If it’s too large, the algorithm might overshoot the minimum and fail to converge.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variant of the traditional Gradient Descent optimization algorithm that introduces randomness into the optimization process to improve convergence speed and potentially escape local minima. To understand the intuition behind SGD, we can again invoke the analogy of a hiker descending a foggy valley. If Gradient Descent represents a cautious hiker who carefully evaluates the slope around them before taking a step, Stochastic Gradient Descent is akin to a more impulsive hiker who decides their next step based only on the slope of the ground immediately beneath their feet. This approach can lead to a quicker descent but might involve more meandering.

How it Works

  • Initialization: Start with a random set of parameters for the model.
  • Gradient computation: Instead of calculating the gradient of the cost function over the entire training data, SGD computes the gradient based on a single randomly selected training example.
  • Update parameters: Update the model’s parameters using this computed gradient. The parameters are adjusted in the direction opposite to the gradient, similar to basic Gradient Descent.

Mathematical Representation

The parameter update rule in SGD is similar to that of Gradient Descent but applies to a single example i:

w ← w − α ∇wJᵢ(w)

Here, w represents the model’s parameters (weights), α is the learning rate, and ∇wJᵢ(w) is the gradient of the cost function Jᵢ(w) for the ith training example with respect to w.
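
As a sketch, the same kind of toy least-squares setup used above can be optimized with SGD by replacing the full-dataset gradient with the gradient of a single randomly chosen example; again, the data and hyperparameters are invented for illustration.

```python
import numpy as np

# Toy least-squares problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = rng.normal(size=3)
alpha = 0.01

for _ in range(5000):
    i = rng.integers(len(y))             # pick one training example at random
    grad_i = X[i] * (X[i] @ w - y[i])    # gradient of J_i(w) for that example
    w = w - alpha * grad_i               # noisy step opposite to the gradient
```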

Mini-batch Gradient Descent

Mini-batch Gradient Descent strikes a balance between the thorough, calculated approach of Gradient Descent and the unpredictable, swift nature of Stochastic Gradient Descent (SGD). Imagine a group of hikers navigating through a foggy valley. Each hiker independently assesses a small, distinct section of the surrounding area before the group decides on the best direction to take. This collective decision-making process, based on a broader but still limited view of the terrain, allows for a more informed and steady progression toward the valley’s lowest point than an individual hiker’s erratic journey.

How it Works

  • Initialization: Start with initial random values for the model’s parameters.
  • Gradient computation: Instead of calculating the gradient using the entire dataset (as in Gradient Descent) or a single example (as in SGD), Mini-batch Gradient Descent computes the gradient using a small subset of the training data, known as a mini-batch.
  • Update parameters: Adjust the parameters in the direction opposite to the computed gradient. This adjustment is made based on the gradient derived from the mini-batch, aiming to reduce the cost function.

Mathematical Representation

The parameter update rule for Mini-batch Gradient Descent can be represented as:

w ← w − α ∇wJmini-batch(w)

where w represents the model’s parameters (weights), α is the learning rate, and ∇wJmini-batch(w) is the gradient of the cost function Jmini-batch(w), computed over the current mini-batch of training samples, with respect to w.
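
A minimal NumPy sketch of this variant, on the same kind of made-up least-squares problem as before (batch size and learning rate are arbitrary choices for illustration):

```python
import numpy as np

# Toy least-squares problem, optimized with mini-batches (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = rng.normal(size=3)
alpha, batch_size = 0.05, 16

for _ in range(1000):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size   # gradient over the mini-batch only
    w = w - alpha * grad
```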

AdaGrad (Adaptive Gradient Algorithm)

AdaGrad (Adaptive Gradient Algorithm) introduces an innovative twist to the conventional Gradient Descent optimization technique by dynamically adapting the learning rate, allowing for a more nuanced and effective optimization process. Imagine a scenario where our group of hikers, navigating the foggy valley, now has access to a map highlighting areas of varying difficulty. With this map, they can adjust their pace — taking smaller steps in steep, difficult terrain and larger strides in flatter regions — to optimize their path toward the valley’s bottom.

How it Works

  • Initialization: Begin with random values for the model’s parameters and initialize a gradient accumulation variable, typically a vector of zeros, of the same size as the parameters.
  • Gradient computation: Square and accumulate the gradients in the gradient accumulation variable, which consequently tracks the sum of squares of the gradients for each parameter.
  • Adjust learning rate: Scale the learning rate for each parameter inversely proportionally to the square root of its accumulated squared gradients, so that parameters with smaller accumulated gradients receive larger updates.
  • Update parameters: Update each parameter using its adjusted learning rate and the computed gradient.

Mathematical Representation

The parameter update rule for AdaGrad can be represented as:

w ← w − (α / √(G + ε)) ∇wJ(w)

where w represents the model’s parameters (weights), α is the initial learning rate, G is the accumulation of the squared gradients, ε is a small smoothing term to prevent division by zero, and ∇wJ(w) is the gradient of the cost function J(w) for the training samples with respect to w.
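
Below is a minimal sketch of a single AdaGrad step, written as a stand-alone NumPy function; the function name, default hyperparameters, and the example gradient are assumptions made for illustration.

```python
import numpy as np

def adagrad_step(w, grad, G, alpha=0.01, eps=1e-8):
    """One AdaGrad update; G accumulates squared gradients per parameter."""
    G = G + grad ** 2                          # accumulate squared gradients
    w = w - alpha * grad / np.sqrt(G + eps)    # per-parameter scaled step
    return w, G

# Usage: G starts as a vector of zeros with the same shape as w.
w, G = np.zeros(3), np.zeros(3)
w, G = adagrad_step(w, np.array([0.5, -1.0, 0.2]), G)
```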

RMSprop (Root Mean Square Propagation)

RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to address AdaGrad’s diminishing learning rates issue. Continuing with the analogy of hikers navigating a foggy valley, RMSprop equips our hikers with an adaptive tool that allows them to maintain a consistent pace despite the terrain’s complexity. This tool evaluates the recent terrain and adjusts their steps accordingly, ensuring they neither get stuck in difficult areas due to excessively small steps nor overshoot their target with overly large steps. RMSprop achieves this by modulating the learning rate based on a moving average of the squared gradients.

How it Works

  • Initialization: Start with random initial values for the model’s parameters and initialize a running average of squared gradients, typically as a vector of zeros of the same size as the parameters.
  • Compute gradient: Calculate the gradient of the cost function with respect to each parameter using a selected subset of the training data (mini-batch).
  • Update squared gradient average: Update the running average of squared gradients using a decay factor, γ, often set to 0.9. This moving average emphasizes more recent gradients, preventing the learning rate from diminishing too rapidly.
  • Adjust learning rate: Scale down the gradient by the square root of the updated running average, normalizing the updates and allowing for a consistent pace of learning across parameters.
  • Update parameters: Apply the adjusted gradients to update the model’s parameters.

Mathematical Representation

The parameter update rule for RMSprop can be represented as follows:

E[g²]ₜ = γ E[g²]ₜ₋₁ + (1 − γ) gₜ²

w ← w − (α / √(E[g²]ₜ + ε)) gₜ

Here, w represents the parameters, α is the initial learning rate, γ is the decay factor, E[g²]ₜ is the running average of squared gradients at time step t, ε is a small smoothing term to prevent division by zero, and gₜ = ∇wJ(w) is the gradient of the cost function J(w) with respect to w.
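
A minimal NumPy sketch of one RMSprop step (the function name and default hyperparameters are illustrative assumptions, not part of the original article):

```python
import numpy as np

def rmsprop_step(w, grad, Eg2, alpha=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update; Eg2 is the running average of squared gradients."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2    # decayed average of g^2
    w = w - alpha * grad / np.sqrt(Eg2 + eps)      # normalized step size
    return w, Eg2

# Usage: Eg2 starts as a vector of zeros with the same shape as w.
w, Eg2 = np.zeros(3), np.zeros(3)
w, Eg2 = rmsprop_step(w, np.array([0.5, -1.0, 0.2]), Eg2)
```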

AdaDelta

AdaDelta is an extension of AdaGrad that seeks to reduce its aggressively decreasing learning rate. Imagine our group of hikers now has an advanced tool that not only adapts to recent terrain changes but also ensures their gear weight doesn’t hinder their progress. This tool dynamically adjusts their stride length, ensuring they can maintain a steady pace without the burden of past terrain slowing them down. Similarly, AdaDelta modifies the learning rate based on a window of recent gradients rather than accumulating all past squared gradients. This approach allows for a more robust and adaptive learning rate adjustment over time.

How it Works

  • Initialization: Start with random initial parameters and two state variables, one for accumulating gradients and another for accumulating parameter updates, both initially set to zero.
  • Compute gradient: Determine the gradient of the cost function with respect to each parameter using a subset of the training data (mini-batch).
  • Accumulate gradients: Update the first state variable with the squared gradients, applying a decay factor to maintain a focus on recent gradients.
  • Compute update: Determine the parameter updates based on the square root of the accumulated updates state variable divided by the square root of the accumulated gradients state variable, adding a small constant to avoid division by zero.
  • Accumulate updates: Update the second state variable with the squared parameter updates.
  • Update parameters: Modify the model’s parameters using the computed updates.

Mathematical Representation

The parameter update rule for AdaDelta is more complex due to its iterative nature but can be generalized as:

Δwₜ = −(RMS[u]ₜ₋₁ / RMS[g]ₜ) gₜ,  wₜ₊₁ = wₜ + Δwₜ

where w represents the parameters, RMS[u]ₜ₋₁ is the root mean square of the parameter updates up to the previous time step, RMS[g]ₜ is the root mean square of the gradients at the current time step, and gₜ = ∇wJ(w) is the gradient of the cost function J(w) with respect to w.
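
A minimal NumPy sketch of one AdaDelta step, following the decayed-average formulation described above; the function name, decay factor, and smoothing constant are illustrative assumptions.

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edw2, rho=0.95, eps=1e-6):
    """One AdaDelta update; Eg2 and Edw2 are decayed averages of squared
    gradients and squared parameter updates, respectively."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2                    # accumulate gradients
    delta = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * grad   # no global learning rate
    Edw2 = rho * Edw2 + (1 - rho) * delta ** 2                 # accumulate updates
    w = w + delta
    return w, Eg2, Edw2

# Usage: both state variables start as zeros with the same shape as w.
w, Eg2, Edw2 = np.zeros(3), np.zeros(3), np.zeros(3)
w, Eg2, Edw2 = adadelta_step(w, np.array([0.5, -1.0, 0.2]), Eg2, Edw2)
```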

Adam (Adaptive Moment Estimation)

Adam (Adaptive Moment Estimation) combines the best properties of AdaGrad and RMSprop to provide an optimization algorithm that can handle sparse gradients on noisy problems. Using our hiking analogy, imagine that the hikers now have access to a state-of-the-art navigation tool that not only adapts to the terrain’s difficulty but also keeps track of their direction to ensure smooth progress. This tool adjusts their pace based on both the recent and accumulated gradients, ensuring they efficiently navigate towards the valley’s bottom without veering off course. Adam achieves this by maintaining estimates of the first and second moments of the gradients, thus providing an adaptive learning rate mechanism.

How it Works

  • Initialization: Start with random initial parameter values and initialize a first moment vector (m) and a second moment vector (v). Each “moment vector” stores aggregated information about the gradients of the cost function with respect to the model’s parameters. The first moment vector accumulates the means (or the first moments) of the gradients, acting like a momentum by averaging past gradients to determine the direction to update the parameters. The second moment vector accumulates the variances (or second moments) of the gradients, helping adjust the size of the updates by considering the variability of past gradients.
  • Compute gradient: For each mini-batch, compute the gradients of the cost function with respect to the parameters.
  • Update moments: Update the first moment vector (m) as an exponential moving average of the gradients, and the second moment vector (v) as an exponential moving average of the squared gradients. Both are then bias-corrected to offset their initialization at zero.
  • Adjust learning rate: Calculate the adaptive learning rate for each parameter using the updated first and second moment vectors, ensuring effective parameter updates.
  • Update parameters: Use the adaptive learning rates to update the model’s parameters.

Mathematical Representation

The parameter update rule for Adam can be expressed as:

mₜ = β₁mₜ₋₁ + (1 − β₁)gₜ,  vₜ = β₂vₜ₋₁ + (1 − β₂)gₜ²

m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)

w ← w − α m̂ₜ / (√v̂ₜ + ε)

where w represents the parameters, α is the learning rate, gₜ is the gradient at time step t, β₁ and β₂ are the decay rates of the moment estimates, m̂ₜ and v̂ₜ are the bias-corrected estimates of the first and second moments of the gradients, and ε is a small smoothing term to prevent division by zero.
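
A minimal NumPy sketch of one Adam step with the commonly cited default hyperparameters (the function name and example values are assumptions for illustration):

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: m and v start as zeros with the same shape as w.
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([0.5, -1.0, 0.2]), m, v, t=1)
```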

Cheat Sheet of Mathematical Representation of Optimizers

  • Gradient Descent: w ← w − α ∇wJ(w)
  • SGD: w ← w − α ∇wJᵢ(w)
  • Mini-batch Gradient Descent: w ← w − α ∇wJmini-batch(w)
  • AdaGrad: w ← w − (α / √(G + ε)) ∇wJ(w)
  • RMSprop: w ← w − (α / √(E[g²]ₜ + ε)) gₜ
  • AdaDelta: wₜ₊₁ = wₜ − (RMS[u]ₜ₋₁ / RMS[g]ₜ) gₜ
  • Adam: w ← w − α m̂ₜ / (√v̂ₜ + ε)

Comparisons between Different Optimization Algorithms

  • Gradient Descent — Pros: simple and easy to implement. Cons: computationally expensive on large datasets; may get stuck in local minima. When to use: small datasets; when simplicity and clarity are preferred.
  • SGD — Pros: fast; handles large datasets well. Cons: high variance in updates can lead to instability. When to use: large datasets; online or incremental learning scenarios.
  • Mini-batch Gradient Descent — Pros: balances efficiency and stability; more generalizable updates. Cons: requires tuning of the batch size. When to use: most deep learning tasks, especially with moderate to large datasets.
  • AdaGrad — Pros: adapts the learning rate for each parameter; good for sparse data. Cons: the learning rate can diminish to zero, stopping learning. When to use: sparse datasets such as text and images, where the learning rate needs to adapt to feature frequency.
  • RMSprop — Pros: addresses AdaGrad’s diminishing learning rate; adapts the learning rate based on recent gradients. Cons: can still get stuck in local minima on non-convex problems. When to use: non-convex optimization problems; situations where AdaGrad’s learning rate diminishes too fast.
  • AdaDelta — Pros: eliminates the need to set a default learning rate; addresses the diminishing learning rate issue. Cons: more complex than RMSprop and AdaGrad. When to use: similar to RMSprop, but when you want to avoid setting a learning rate manually.
  • Adam — Pros: combines the advantages of AdaGrad and RMSprop; adaptive learning rates; includes bias correction. Cons: requires tuning of hyperparameters (though it often works well with defaults). When to use: most deep learning applications, due to its efficiency and effectiveness.
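
In practice these optimizers are rarely written by hand. As a minimal sketch, assuming PyTorch is available, switching between them is usually a one-line change; the tiny model and learning rates below are placeholders for illustration.

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model for illustration

# Pick one; each algorithm discussed above has a torch.optim counterpart.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
optimizer = torch.optim.Adadelta(model.parameters())
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```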

— Oct 18, 2024
