Gradient Descent
Gradient descent is an optimization algorithm that’s used when training a machine learning model. It tweaks the model’s parameters iteratively to minimize a given cost function, driving it toward a local minimum (for a convex function, that local minimum is also the global minimum).
You start by defining the initial parameter values, and from there the gradient descent algorithm uses calculus to iteratively adjust those values so they minimize the given cost function. To understand this concept fully, it’s important to know about gradients.
A gradient simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. In mathematical terms, a gradient is the vector of partial derivatives of a function with respect to its inputs.
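To make that concrete, here’s a small Python sketch. The function F(w1, w2) = w1² + w2² and the point (3, 4) are illustrative choices, not something from the article: the gradient is the vector of partial derivatives, and a quick finite-difference check confirms one of them numerically.

```python
# Illustrative example: for F(w1, w2) = w1**2 + w2**2,
# the gradient is the vector of partial derivatives (2*w1, 2*w2).

def F(w1, w2):
    return w1**2 + w2**2

def gradient(w1, w2):
    # (dF/dw1, dF/dw2)
    return (2 * w1, 2 * w2)

# Finite-difference check of dF/dw1 at the point (3, 4)
eps = 1e-6
numeric = (F(3 + eps, 4) - F(3, 4)) / eps

print(gradient(3, 4))  # (6, 8)
print(numeric)         # approximately 6
```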
Imagine a blindfolded man who wants to climb to the top of a hill with the fewest steps possible. He might start climbing the hill by taking really big steps in the steepest direction. But as he comes closer to the top, his steps will get smaller and smaller to avoid overshooting it. Picture our hill from a top-down view, with red arrows marking the steps of our climber. A gradient in this context is a vector that contains the direction of the steepest step the blindfolded man can take and how long that step should be.
How Does Gradient Descent Work?
Instead of climbing up a hill, think of gradient descent as hiking down to the bottom of a valley. The equation below describes what the gradient descent algorithm does:

b = a − γ∇F(a)

Here, b is the next position of our climber, while a represents his current position. The minus sign refers to the minimization part of the gradient descent algorithm. The gamma (γ) in the middle is a weighting factor, the learning rate, and the gradient term ∇F(a) points in the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.
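As a rough illustration of that update rule, here’s a minimal Python sketch. The one-dimensional cost function F(a) = a², the starting point and the learning rate are all illustrative assumptions; the point is just the repeated step b = a − γ∇F(a).

```python
# A minimal sketch of the update rule b = a - gamma * grad_F(a),
# assuming the illustrative cost function F(a) = a**2.

def grad_F(a):
    # Derivative of F(a) = a**2
    return 2 * a

a = 10.0      # current position of the climber
gamma = 0.1   # learning rate (the weighting factor gamma)

for step in range(50):
    b = a - gamma * grad_F(a)  # next position: move against the gradient
    a = b

print(a)  # approaches the minimum at a = 0
```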
Types of Gradient Descent
Batch Gradient Descent
Batch gradient descent, also called vanilla gradient descent, calculates the error for each example within the training dataset, but the model only gets updated after all training examples have been evaluated. This process is like a cycle, and one full pass over the data is called a training epoch.
An advantage of batch gradient descent is its computational efficiency: it produces a stable error
gradient and a stable convergence. But the stable error gradient can sometimes result in a state of
convergence that isn’t the best the model can achieve. It also requires the entire training dataset to
be in memory and available to the algorithm.
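As a sketch of what a batch update looks like in practice, here’s a minimal Python example for linear regression with squared error. The synthetic dataset, learning rate and epoch count are illustrative assumptions; the key detail is that each update uses the gradient computed over the entire training set.

```python
# Batch gradient descent sketch: one parameter update per epoch,
# using the gradient of the mean squared error over ALL examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.1

for epoch in range(200):
    # Gradient computed over the entire training set
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad                       # one stable update per epoch

print(w)  # close to true_w
```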
Stochastic Gradient Descent
By contrast, stochastic gradient descent (SGD) calculates the error and updates the parameters for each training example in the dataset, one example at a time. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is that the frequent updates give us a pretty detailed rate of improvement.
The frequent updates, however, are more computationally expensive than the batch gradient
descent approach. Additionally, the frequency of those updates can result in noisy gradients, which
may cause the error rate to jump around instead of slowly decreasing.
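For comparison, here’s the same linear-regression setup updated stochastically, one example at a time. The data and hyperparameters are again illustrative assumptions; note how the gradient in the inner loop comes from a single example, which is what makes the updates frequent but noisy.

```python
# Stochastic gradient descent sketch: one (noisy) update per training example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.01

for epoch in range(20):
    indices = rng.permutation(len(y))    # shuffle examples each epoch
    for i in indices:
        error = X[i] @ w - y[i]
        grad = 2 * error * X[i]          # gradient from a single example
        w -= lr * grad                   # update immediately

print(w)  # close to true_w, but the path there is noisier than batch descent
```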
Mini-Batch Gradient Descent
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of SGD and
batch gradient descent. It simply splits the training dataset into small batches and performs an
update for each of those batches. This creates a balance between the robustness of stochastic
gradient descent and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but as with other machine learning techniques, there is no hard rule because the best size varies across applications. Mini-batch gradient descent is the algorithm of choice when training a neural network, and it’s the most common type of gradient descent within deep learning.
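As a sketch, here’s the same linear-regression setup with mini-batches: the data is shuffled each epoch, split into batches of 64 (an illustrative choice within the range mentioned above), and one update is performed per batch.

```python
# Mini-batch gradient descent sketch: shuffle, split into small batches,
# and perform one update per batch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr = 0.05
batch_size = 64

for epoch in range(30):
    indices = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient over one mini-batch: less noisy than SGD,
        # cheaper per update than full-batch descent
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
        w -= lr * grad

print(w)  # close to true_w
```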