The goal of the extensions of gradient descent such as Momentum, RMSProp, and Adam
is to speed up the gradient descent algorithm and address the challenges it faces.
This post presents those challenges and the optimizations that handle them.
Challenges of gradient descent
Learning rate: a large value causes oscillations or divergence, while a small value is slow to converge (see the sketch after this list).
Choosing a single learning rate that works for all dimensions, especially when some dimensions are much steeper than others.
Saddle points.
Oscillations/noise of the SGD updates.
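To make the learning-rate trade-off concrete, here is a small sketch on a hypothetical 1-D quadratic loss $L(w) = \frac{1}{2} \cdot 100 \cdot w^2$ (the loss, the step count, and the learning-rate values are chosen only for illustration):

```python
# Hypothetical 1-D loss L(w) = 0.5 * 100 * w^2, so dL/dw = 100 * w.
def gd(lr, steps=50, w=1.0):
    for _ in range(steps):
        w = w - lr * (100.0 * w)   # plain gradient descent step
    return w

print(gd(lr=0.021))    # too large: iterates flip sign and grow (divergence)
print(gd(lr=0.0001))   # too small: still around 0.6, far from the minimum at 0
print(gd(lr=0.015))    # well chosen: essentially 0 after 50 steps
```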
Momentum
What does it solve?
It addresses the case where learning is slow along one dimension
while the updates oscillate along other, steeper dimensions.
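A standard form of the momentum update consistent with the description here, with $V_t$ the moving average of the gradients, $W_t$ the weights, $\alpha$ the learning rate, and $V_0 = 0$ (symbols assumed for this sketch), is:

$$
V_t = \beta\, V_{t-1} + (1 - \beta)\, \nabla L_t, \qquad W_t = W_{t-1} - \alpha\, V_t
$$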
Here $\beta \in (0, 1)$; a common value is 0.9.
An alternative form used in the literature omits the factor $1-\beta$.
The update computes an exponentially weighted moving average (EWMA) of the gradients.
This also becomes important in the context of SGD, where the gradients
are noisy and averaging helps significantly: it dampens the oscillations.
Momentum also helps navigate ravines efficiently.
A ravine is a region where the surface is much steeper in some directions than in others.
Plain gradient descent then takes a large number of steps, because we cannot
afford a large learning rate: it would cause divergence along the steeper dimensions.
Momentum mitigates this by producing a large effective step along the less steep
dimensions and a small effective step along the steeper ones.
The momentum term is essentially an average of the last few gradients. Along the steeper dimension this average is small, because the oscillating gradients alternate in sign and cancel out, whereas along
the slower dimension the gradients have a consistent sign and keep adding up to a larger value.
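A minimal NumPy sketch of this behavior on a hypothetical ravine-shaped quadratic $L(w) = \frac{1}{2}(100\, w_1^2 + 0.1\, w_2^2)$ (the loss and the hyperparameter values are assumed only for illustration):

```python
import numpy as np

# Hypothetical ravine: steep along w1, shallow along w2.
grad = lambda w: np.array([100.0 * w[0], 0.1 * w[1]])

def plain_gd(lr, steps):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum_gd(lr, steps, beta=0.9):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)                     # EWMA of the gradients
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)  # sign-flipping w1 gradients average out
        w = w - lr * v
    return w

print(plain_gd(lr=0.05, steps=300))     # w1 blows up; w2 only reaches ~0.22
print(momentum_gd(lr=0.05, steps=300))  # same lr: w1 converges to ~0, w2 makes similar progress
```

Because the steep dimension no longer limits stability, the learning rate can be raised further in this toy example to speed up the shallow dimension as well.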
Newtonian interpretation: one can think of GD/SGD as a person walking down
a hill, steadily taking constant-size steps. GD with Momentum is like a heavy
ball rolling down the hill: it gains momentum in the downhill direction while
dampening the oscillations and noise in the other directions.
RMSProp
RMSProp solves the same problem as Momentum: it dampens the oscillations in the directions of steep gradients.
The name comes from the fact that it uses the root of the mean of the squared gradients; more precisely, it computes
an EWMA of the squared gradients.
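A standard per-dimension form of the two update equations discussed below, with $\alpha$ the learning rate, $\beta$ the decay factor, and $\epsilon$ a small constant (symbols assumed for this sketch), is:

$$
S_t^i = \beta\, S_{t-1}^i + (1 - \beta)\, (\nabla L_t^i)^2
$$

$$
W_t^i = W_{t-1}^i - \frac{\alpha}{\sqrt{S_t^i + \epsilon}}\, \nabla L_t^i
$$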
The first equation is the EWMA of the squared gradients.
It is written per dimension: the index $i$ denotes the dimension,
and $\nabla L_{t}^i$ is the $i$-th component of the gradient, that is, $\nabla L_{t}^i = \frac{\partial L(W)}{\partial W^i}$.
In the second equation, the gradient is divided by the square root of $S_t^i + \epsilon$.
The small value $\epsilon$ is added purely for numerical stability: it avoids
division by zero. A common value of $\epsilon$ is $10^{-8}$.
In the directions with steep (oscillating) gradients, $S_t^i$ is large, so dividing by it
yields only a small change in the second equation.
In the slow-moving directions, $S_t^i$ is small, so the update is relatively larger,
and we can afford a bigger learning rate in the second equation.
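A minimal NumPy sketch of this per-dimension scaling, on the same hypothetical ravine-shaped quadratic used above (hyperparameter values assumed only for illustration):

```python
import numpy as np

# Hypothetical ravine: L(w) = 0.5 * (100 * w1^2 + 0.1 * w2^2).
grad = lambda w: np.array([100.0 * w[0], 0.1 * w[1]])

def rmsprop(lr=0.01, beta=0.9, eps=1e-8, steps=500):
    w = np.array([1.0, 1.0])
    s = np.zeros_like(w)                    # EWMA of squared gradients, per dimension
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g**2    # large along steep w1, small along shallow w2
        w = w - lr * g / np.sqrt(s + eps)   # effective step is roughly lr per dimension
    return w

print(rmsprop())   # both coordinates end up close to 0 (within about the step size lr)
```

Because the effective step along each dimension is roughly $\alpha \cdot \nabla L_t^i / \sqrt{S_t^i}$, the steep and shallow directions make comparable progress despite their very different curvatures.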
Adam
Momentum and RMSProp both have their limitations.
Adam is a more robust optimizer: it combines Momentum and RMSProp.
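For reference, the standard Adam update keeps an EWMA of the gradients ($V_t$, as in Momentum, with decay $\beta_1$) and an EWMA of the squared gradients ($S_t$, as in RMSProp, with decay $\beta_2$), corrects both for their zero initialization, and then scales the step (symbols follow the earlier sections):

$$
V_t = \beta_1\, V_{t-1} + (1 - \beta_1)\, \nabla L_t, \qquad
S_t = \beta_2\, S_{t-1} + (1 - \beta_2)\, (\nabla L_t)^2
$$

$$
\hat{V}_t = \frac{V_t}{1 - \beta_1^t}, \qquad
\hat{S}_t = \frac{S_t}{1 - \beta_2^t}, \qquad
W_t = W_{t-1} - \frac{\alpha}{\sqrt{\hat{S}_t} + \epsilon}\, \hat{V}_t
$$

Common defaults are $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The bias-correction factors $1 - \beta_1^t$ and $1 - \beta_2^t$ matter mostly in the early steps, when $V_t$ and $S_t$ are still close to their zero initialization.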