Published on
19/11/2019

Gradient Descent Variants

Understanding this fundamental concept is crucial for anyone interested in working in the ML space.

General Algorithm Overview

In modeling most machine learning problems, we generally start by defining the function/model (e.g., linear regression, i.e., a function fitting a straight line to the data). We then choose the loss function to minimize, such as MSE loss. Finally, we optimize for the parameter values $\theta$ (the parameters of the model) that minimize the loss function; gradient descent is typically the go-to approach for this optimization step.

In its simplest form, for a function of one variable (a one-dimensional problem), the gradient descent equation is as follows:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \, \frac{d}{d\theta} L\bigl(\theta^{(t)}\bigr)$$

Mean squared error (MSE) could be chosen as the loss function $L(\theta)$ for this problem:

$$\mathrm{MSE}(\theta) = \frac{1}{n} \sum_{i=1}^n \bigl(y_i - \theta x_i \bigr)^2$$

Hence the derivative of the loss function is:

$$\frac{d}{d\theta} L\bigl(\theta^{(t)}\bigr) = \frac{-2}{n} \sum_{i=1}^n \bigl(y_i - \theta^{(t)} x_i \bigr)\, x_i$$

With that equation, we first guess initial values for $\theta$, then compute the derivative of $L\bigl(\theta^{(t)}\bigr)$ at each step: when the slope is negative we move in the positive direction, and when the slope is positive we move in the negative direction.
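As a concrete sketch, the one-parameter update above can be written in a few lines of NumPy. The synthetic data, learning rate, and step count here are illustrative choices, not prescribed values:

```python
import numpy as np

def gradient_descent_1d(x, y, eta=0.01, steps=100):
    """Fit y ≈ theta * x by minimizing MSE with 1-D gradient descent."""
    theta = 0.0  # initial guess
    n = len(x)
    for _ in range(steps):
        # dL/dtheta = (-2/n) * sum((y_i - theta * x_i) * x_i), as derived above
        grad = (-2.0 / n) * np.sum((y - theta * x) * x)
        theta -= eta * grad  # step against the slope
    return theta

# Noiseless synthetic data with true slope 3.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x
theta = gradient_descent_1d(x, y)
```

On this clean data, `theta` converges to the true slope 3.0.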

The above is the simplest form, with only one parameter $\theta$. In reality, most ML problems have many features, meaning many $\theta$ parameters to be optimized by gradient descent. For this, we now have to take partial derivatives ($\partial$) to observe how the function changes if we vary only one variable while holding all others constant.

Therefore, we shall now have:

$$\begin{bmatrix} \theta_0^{(t+1)} \\ \theta_1^{(t+1)} \\ \vdots \\ \theta_n^{(t+1)} \end{bmatrix} = \begin{bmatrix} \theta_0^{(t)} \\ \theta_1^{(t)} \\ \vdots \\ \theta_n^{(t)} \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial L}{\partial \theta_0} \\ \frac{\partial L}{\partial \theta_1} \\ \vdots \\ \frac{\partial L}{\partial \theta_n} \end{bmatrix}$$

Formally summarized as:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \, \nabla_{\boldsymbol{\theta}} L\bigl(\boldsymbol{\theta}^{(t)}\bigr)$$
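The vector update is exactly what a vectorized implementation computes. A minimal NumPy sketch, assuming a linear model $y \approx X\boldsymbol{\theta}$ with MSE loss (the data and hyperparameters are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, steps=500):
    """Vectorized update: theta <- theta - eta * grad L(theta).

    Model: y ≈ X @ theta; loss: MSE; gradient: (-2/n) * X^T (y - X theta).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = (-2.0 / n) * X.T @ (y - X @ theta)  # all partials at once
        theta -= eta * grad
    return theta

# Two features, noiseless targets with known true weights [2.0, -1.0]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
theta = gradient_descent(X, y)
```

The single matrix expression for `grad` replaces the column of partial derivatives written out above.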

Algorithm Variations

There are three primary variations of gradient descent, each differing in how much data they use to compute the gradient at each step:

  1. Batch Gradient Descent
    Calculates the gradient of the loss function using the entire dataset. This is computationally expensive because all the data is used in each iteration, making it hard to scale to large datasets: both memory and processing time can become substantial bottlenecks.

  2. Stochastic Gradient Descent (SGD)
    Updates model parameters after processing each training example, unlike batch gradient descent. This eliminates the redundancy of recomputing gradients for similar data points, making SGD faster and more suitable for large datasets. However, frequent updates in SGD introduce high variance, causing the objective function to fluctuate significantly. While batch gradient descent converges smoothly to the minimum, SGD's fluctuations allow it to explore new and potentially better local minima. These fluctuations can also make it harder to converge. Gradually decreasing the learning rate can help SGD mimic batch gradient descent's stable convergence.

  3. Mini-batch Gradient Descent
    Combines the strengths of batch and stochastic gradient descent by updating model parameters based on small, random subsets of the data, called mini-batches. This approach reduces the variance of updates, leading to more stable convergence compared to SGD while being more computationally efficient than batch gradient descent. One caveat is that "SGD" is often used interchangeably with mini-batch gradient descent in practice.
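The three variants differ only in how many examples feed each gradient computation, so a single mini-batch loop can express all of them: `batch_size = n` recovers batch gradient descent, and `batch_size = 1` recovers SGD. A sketch, with illustrative shuffling scheme and hyperparameters:

```python
import numpy as np

def minibatch_gd(X, y, eta=0.1, epochs=1000, batch_size=2, seed=0):
    """Mini-batch gradient descent on MSE for y ≈ X @ theta.

    batch_size = len(X) gives batch GD; batch_size = 1 gives SGD.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient estimated from the mini-batch only
            grad = (-2.0 / len(idx)) * Xb.T @ (yb - Xb @ theta)
            theta -= eta * grad
    return theta

# Noiseless data with known true weights [2.0, -1.0]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
theta = minibatch_gd(X, y)
```

On noisy real data the per-batch gradients would fluctuate around the full-batch gradient; here the data is noiseless, so all three settings converge to the same solution.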

Optimization Algorithms

  • Momentum
    Accelerates gradient descent by smoothing updates, reducing oscillations on steep slopes near local optima by combining a fraction of the previous step with the current gradient. A typical momentum coefficient is $\gamma = 0.9$.

  • Nesterov Accelerated Gradient (NAG)
    Anticipates future positions by calculating gradients at estimated future locations for faster convergence. It is particularly effective in training recurrent neural networks (RNNs).

  • Adagrad
    Adapts learning rates for each parameter based on past gradients, making it suitable for sparse data. However, its diminishing learning rate limits long-term training.

  • RMSProp
    Maintains an exponentially decaying average of squared gradients to stabilize learning rates, with a default $\eta = 0.001$, addressing Adagrad's limitations.

  • Adam
    Integrates momentum and RMSProp, adapting learning rates using first and second gradient moment estimates with bias correction. Adam is widely applicable with default parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$).
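The momentum and Adam update rules can be sketched directly from the descriptions above. The coefficient defaults follow the text; the quadratic test function and learning rates are illustrative choices:

```python
import numpy as np

def momentum_step(theta, grad, v, eta=0.01, gamma=0.9):
    """Momentum: blend a fraction of the previous step with the current gradient."""
    v = gamma * v + eta * grad
    return theta - v, v

def adam_step(theta, grad, m, s, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: first/second moment estimates with bias correction (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # first moment (mean of gradients)
    s = b2 * s + (1 - b2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)             # bias-corrected estimates
    s_hat = s / (1 - b2**t)
    return theta - eta * m_hat / (np.sqrt(s_hat) + eps), m, s

# Usage: minimize f(theta) = (theta - 3)^2, so grad = 2 * (theta - 3)
theta_m, v = 0.0, 0.0
for _ in range(500):
    theta_m, v = momentum_step(theta_m, 2 * (theta_m - 3), v)

theta_a, m, s = 0.0, 0.0, 0.0
for t in range(1, 2001):
    theta_a, m, s = adam_step(theta_a, 2 * (theta_a - 3), m, s, t, eta=0.05)
```

Both runs drive the parameter toward the minimum at 3; in practice these updates are applied element-wise to whole parameter vectors, which NumPy broadcasting handles unchanged.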

Which Optimizer Should You Use?

  • For Sparse Data: Adaptive learning-rate methods like Adagrad, RMSProp, or Adam are often the best choice. RMSProp addresses Adagrad's issue of diminishing learning rates, and Adam builds on RMSProp by incorporating bias correction and momentum, making it a strong overall choice for most tasks.

  • For Simpler Scenarios: Momentum or Nesterov Accelerated Gradient (NAG) can accelerate convergence and smooth updates, though they require careful learning rate tuning.

  • For Research or Simplicity: Vanilla SGD is sometimes used but often converges slowly, relies heavily on initialization, and risks getting stuck in saddle points.

Conclusion

Gradient descent is a fundamental concept in machine learning, and understanding its variations and optimizers is crucial for training models effectively. While the choice of optimizer depends on the dataset and task, adaptive learning-rate methods like Adam are generally reliable and widely applicable.

References:

  1. Ruder, S. (2017). An overview of gradient descent optimization algorithms. Retrieved from https://www.ruder.io/deep-learning-optimization-2017/

For comments, please send me an email.