Gradient descent is an optimization method for finding the minimum of a function. In deep learning, it is used to update a neural network's weights, with the required gradients computed via backpropagation. Understanding different optimization algorithms and their strengths and weaknesses is crucial for any data scientist training deep learning models. Selecting the right optimizer for the task at hand is paramount to achieving the best possible training results in the shortest amount of time. Next, we’ll explore the most commonly used deep learning optimization algorithms, including Gradient Descent, Stochastic Gradient Descent, and the Adam optimizer. By the end of this article, you’ll have a clear idea of how to choose the best algorithm for training your models.
Gradient Descent is an algorithm designed to minimize a function by iteratively moving towards the minimum value of the function. It’s akin to a hiker trying to find the lowest point in a valley shrouded in fog. The hiker starts at a random location and can only feel the slope of the ground beneath their feet. To reach the valley’s lowest point, the hiker takes steps in the direction of the steepest descent. All deep learning model optimization algorithms widely used today are based on Gradient Descent.
The update rule for each parameter w can be mathematically represented as:

w ← w − α ∇J(w)

where w represents the model’s parameters (weights), α is the learning rate, and ∇J(w) is the gradient of the cost function J(w) with respect to w.
The learning rate is a crucial hyperparameter that needs to be chosen carefully. If it’s too small, the algorithm will converge very slowly. If it’s too large, the algorithm might overshoot the minimum and fail to converge.
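To make the update rule concrete, here is a minimal NumPy sketch, assuming a toy quadratic loss J(w) = ‖w‖² whose gradient is 2w; the starting point and learning rate are arbitrary illustrative choices.

```python
import numpy as np

# Toy loss J(w) = ||w||^2, whose gradient is 2w (an assumption for illustration).
def grad_J(w):
    return 2 * w

w = np.array([5.0, -3.0])   # arbitrary starting point
alpha = 0.1                 # learning rate

for step in range(100):
    w = w - alpha * grad_J(w)   # w <- w - alpha * grad J(w)

print(w)  # approaches the minimum at [0, 0]
```

With alpha = 0.1 each step shrinks w by a factor of 0.8, giving steady convergence; setting alpha above 1.0 on this toy loss would make the iterates diverge, which is exactly the overshooting problem described above.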
Stochastic Gradient Descent (SGD) is a variant of the traditional Gradient Descent optimization algorithm that introduces randomness into the optimization process to improve convergence speed and potentially escape local minima. To understand the intuition behind SGD, we can again invoke the analogy of a hiker descending a foggy valley. If Gradient Descent represents a cautious hiker who carefully evaluates the slope around them before taking a step, Stochastic Gradient Descent is akin to a more impulsive hiker who decides their next step based only on the slope of the ground immediately beneath their feet. This approach can lead to a quicker descent but might involve more meandering.
The parameter update rule in SGD is similar to that of Gradient Descent but is applied using a single training example i:

w ← w − α ∇Jᵢ(w)

Here, w represents the model’s parameters (weights), α is the learning rate, and ∇Jᵢ(w) is the gradient of the cost function Jᵢ(w) for the i-th training example with respect to w.
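A minimal NumPy sketch of SGD is shown below, assuming a synthetic linear-regression problem with a per-example squared-error loss; the data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

# Synthetic linear-regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
alpha = 0.01

for epoch in range(20):
    for i in rng.permutation(len(X)):          # visit examples in random order
        grad_i = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of J_i(w) for example i
        w = w - alpha * grad_i                 # update from a single example

print(w)  # close to true_w, with some noise from the per-example updates
```

Because each update is driven by a single example, the trajectory is noisy, but every update is cheap, which is what makes SGD attractive on large datasets.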
Mini-batch Gradient Descent strikes a balance between the thorough, calculated approach of Gradient Descent and the unpredictable, swift nature of Stochastic Gradient Descent (SGD). Imagine a group of hikers navigating through a foggy valley. Each hiker independently assesses a small, distinct section of the surrounding area before the group decides on the best direction to take. This collective decision-making process, based on a broader but still limited view of the terrain, allows for a more informed and steady progression toward the valley’s lowest point than an individual hiker’s erratic journey.
The parameter update rule for Mini-batch Gradient Descent can be represented as:

w ← w − α ∇J_mini-batch(w)

where w represents the model’s parameters (weights), α is the learning rate, and ∇J_mini-batch(w) is the gradient of the cost function J_mini-batch(w), computed over the current mini-batch of training samples, with respect to w.
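The sketch below adapts the same synthetic regression problem to mini-batches; the batch size of 32 and the other hyperparameters are illustrative assumptions.

```python
import numpy as np

# Same kind of synthetic linear-regression data as before (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
alpha, batch_size = 0.05, 32

for epoch in range(50):
    idx = rng.permutation(len(X))                      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # average gradient over the mini-batch
        w = w - alpha * grad

print(w)  # close to true_w, with smoother progress than per-example SGD
```

Averaging the gradient over a batch reduces the variance of each update, which is the stability benefit captured by the hiking analogy.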
AdaGrad (Adaptive Gradient Algorithm) introduces an innovative twist to the conventional Gradient Descent optimization technique by dynamically adapting the learning rate, allowing for a more nuanced and effective optimization process. Imagine a scenario where our group of hikers, navigating the foggy valley, now has access to a map highlighting areas of varying difficulty. With this map, they can adjust their pace — taking smaller steps in steep, difficult terrain and larger strides in flatter regions — to optimize their path toward the valley’s bottom.
The parameter update rule for AdaGrad can be represented as:

w ← w − (α / √(G + ε)) ∇J(w)

where w represents the model’s parameters (weights), α is the initial learning rate, G is the per-parameter accumulation of the squared gradients (the division is applied element-wise), ε is a small smoothing term to prevent division by zero, and ∇J(w) is the gradient of the cost function J(w) with respect to w.
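A minimal sketch of the AdaGrad update on the same toy quadratic loss follows; the loss, initial learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Toy loss J(w) = ||w||^2 with gradient 2w (an assumption for illustration).
def grad_J(w):
    return 2 * w

w = np.array([5.0, -3.0])
alpha, eps = 0.5, 1e-8
G = np.zeros_like(w)          # per-parameter accumulation of squared gradients

for step in range(200):
    g = grad_J(w)
    G += g ** 2                              # G only ever grows
    w = w - alpha * g / (np.sqrt(G) + eps)   # per-parameter adaptive step

print(w)  # approaches the minimum, with steps that shrink as G accumulates
```

Because G never decreases, the effective learning rate keeps shrinking, which is precisely the weakness RMSprop addresses next.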
RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to address AdaGrad’s diminishing learning rates issue. Continuing with the analogy of hikers navigating a foggy valley, RMSprop equips our hikers with an adaptive tool that allows them to maintain a consistent pace despite the terrain’s complexity. This tool evaluates the recent terrain and adjusts their steps accordingly, ensuring they neither get stuck in difficult areas due to excessively small steps nor overshoot their target with overly large steps. RMSprop achieves this by modulating the learning rate based on a moving average of the squared gradients.
The parameter update rule for RMSprop can be represented as follows:

E[g²]ₜ = β E[g²]ₜ₋₁ + (1 − β) gₜ²
w ← w − (α / √(E[g²]ₜ + ε)) gₜ

Here, w represents the parameters, α is the initial learning rate, β is the decay rate of the moving average, E[g²]ₜ is the running average of squared gradients at time step t, ε is a small smoothing term to prevent division by zero, and gₜ = ∇J(w) is the gradient of the cost function J(w) with respect to w.
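Here is a minimal sketch of the RMSprop update on the same toy loss; the decay rate β = 0.9, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

# Toy loss J(w) = ||w||^2 with gradient 2w (an assumption for illustration).
def grad_J(w):
    return 2 * w

w = np.array([5.0, -3.0])
alpha, beta, eps = 0.05, 0.9, 1e-8
Eg2 = np.zeros_like(w)        # running average of squared gradients, E[g^2]

for step in range(500):
    g = grad_J(w)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2     # decaying average, not a full sum
    w = w - alpha * g / (np.sqrt(Eg2) + eps)

print(w)  # hovers near the minimum at [0, 0]
```

Unlike AdaGrad’s ever-growing G, the decaying average E[g²] can shrink again when gradients get smaller, so the step size does not vanish over time.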
AdaDelta is an extension of AdaGrad that seeks to counteract its aggressively, monotonically decreasing learning rate. Imagine our group of hikers now has an advanced tool that not only adapts to recent terrain changes but also ensures their gear weight doesn’t hinder their progress. This tool dynamically adjusts their stride length, ensuring they can maintain a steady pace without the burden of past terrain slowing them down. Similarly, AdaDelta modifies the learning rate based on a window of recent gradients rather than accumulating all past squared gradients. This approach allows for a more robust and adaptive learning rate adjustment over time.
The parameter update rule for AdaDelta is more complex due to its iterative nature but can be generalized as:

Δwₜ = − (RMS[Δw]ₜ₋₁ / RMS[g]ₜ) gₜ
wₜ₊₁ = wₜ + Δwₜ

where w represents the parameters, RMS[Δw]ₜ₋₁ is the root mean square of parameter updates up to the previous time step, RMS[g]ₜ is the root mean square of gradients at the current time step, and gₜ = ∇J(w) is the gradient of the cost function J(w) with respect to w.
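A minimal sketch of the AdaDelta update on the same toy loss follows; the decay rate ρ = 0.95 and ε = 1e-6 are commonly quoted values for AdaDelta, and the loss and step count are illustrative assumptions.

```python
import numpy as np

# Toy loss J(w) = ||w||^2 with gradient 2w (an assumption for illustration).
def grad_J(w):
    return 2 * w

w = np.array([5.0, -3.0])
rho, eps = 0.95, 1e-6         # note: no learning rate to set
Eg2 = np.zeros_like(w)        # running average of squared gradients
Edw2 = np.zeros_like(w)       # running average of squared parameter updates

for step in range(3000):
    g = grad_J(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    delta = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g   # RMS[Δw] / RMS[g] scaling
    Edw2 = rho * Edw2 + (1 - rho) * delta ** 2
    w = w + delta

print(w)  # moves toward the minimum; AdaDelta ramps up from very small initial steps
```

The step size is derived entirely from the two running averages, which is why no learning rate appears in the update.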
Adam (Adaptive Moment Estimation) combines the best properties of AdaGrad and RMSprop to provide an optimization algorithm that can handle sparse gradients on noisy problems. Using our hiking analogy, imagine that the hikers now have access to a state-of-the-art navigation tool that not only adapts to the terrain’s difficulty but also keeps track of their direction to ensure smooth progress. This tool adjusts their pace based on both the recent and accumulated gradients, ensuring they efficiently navigate towards the valley’s bottom without veering off course. Adam achieves this by maintaining estimates of the first and second moments of the gradients, thus providing an adaptive learning rate mechanism.
The parameter update rule for Adam can be expressed as:

mₜ = β₁ mₜ₋₁ + (1 − β₁) gₜ
vₜ = β₂ vₜ₋₁ + (1 − β₂) gₜ²
m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)
w ← w − α m̂ₜ / (√v̂ₜ + ε)

where w represents the parameters, α is the learning rate, gₜ is the gradient at time step t, β₁ and β₂ are decay rates for the moment estimates, ε is a small smoothing term, and m̂ₜ and v̂ₜ are the bias-corrected estimates of the first and second moments of the gradients, respectively.
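A minimal sketch of the Adam update on the same toy loss is given below; the hyperparameters follow the commonly quoted defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8), and the loss, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Toy loss J(w) = ||w||^2 with gradient 2w (an assumption for illustration).
def grad_J(w):
    return 2 * w

w = np.array([5.0, -3.0])
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = np.zeros_like(w)          # first-moment (mean) estimate
v = np.zeros_like(w)          # second-moment (uncentered variance) estimate

for t in range(1, 501):
    g = grad_J(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # close to the minimum at [0, 0]
```

The table below summarizes the pros, cons, and typical use cases of each algorithm.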
Algorithm | Pros | Cons | When to use |
---|---|---|---|
Gradient Descent | Simple and easy to implement | Computationally expensive on large datasets; May get stuck in local minima | Small datasets; When simplicity and clarity are preferred |
SGD | Fast; Handles large datasets well | High variance in updates can lead to instability | Large datasets; Online or incremental learning scenarios |
Mini-batch Gradient Descent | Balances between efficiency and stability; More generalizable updates | Requires tuning of batch size | Most deep learning tasks, especially when working with moderate to large datasets |
AdaGrad | Adapts learning rate for each parameter; Good for sparse data | Learning rate can diminish to zero, stopping learning | Sparse datasets like text and images where learning rate needs to adapt to feature frequency |
RMSprop | Addresses diminishing learning rate of AdaGrad; Adapts learning rate based on recent gradients | Can still get stuck in local minima on non-convex problems | Non-convex optimization problems; Situations where AdaGrad’s learning rate diminishes too fast |
AdaDelta | Eliminates the need to set a default learning rate; Addresses diminishing learning rate issue | More complex than RMSprop and AdaGrad | Similar to RMSprop, but when you want to avoid manual learning rate setting |
Adam | Combines advantages of AdaGrad and RMSprop; Adaptive learning rates; Includes bias correction | Requires tuning of hyperparameters (though it often works well with defaults) | Most deep learning applications, due to its efficiency and effectiveness |
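In practice you would rarely hand-code these updates; frameworks ship them ready to use. The sketch below shows how the same optimizers are selected in PyTorch, assuming a tiny placeholder model; the learning rates shown are common starting points, not recommendations.

```python
import torch
from torch import nn

# Placeholder model purely for illustration.
model = nn.Linear(10, 1)

# Whether training is full-batch, single-example, or mini-batch is decided by
# how much data you feed per step, not by the optimizer class itself.
optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=0.01),
    "AdaGrad": torch.optim.Adagrad(model.parameters(), lr=0.01),
    "RMSprop": torch.optim.RMSprop(model.parameters(), lr=0.001),
    "AdaDelta": torch.optim.Adadelta(model.parameters()),  # no learning rate required
    "Adam": torch.optim.Adam(model.parameters(), lr=0.001),
}
```

If in doubt, Adam with its defaults is a common first choice for most deep learning applications, and the table above can guide you toward an alternative when its trade-offs matter.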