Manjeet Dahiya

Gradient Descent Extensions

The goal of gradient descent extensions such as Momentum, RMSProp, and Adam is to speed up the gradient descent algorithm and address the challenges it faces. This post presents those challenges and the optimizations that handle them.

Challenges of gradient descent

Plain gradient descent tends to oscillate across steep directions of the loss surface while making slow progress along shallow ones, it stalls near plateaus and saddle points, and it is sensitive to the choice of learning rate. The extensions below address these issues.

Momentum

What does it solve?

Momentum smooths the update direction by keeping an exponentially weighted average of past gradients: components that oscillate from step to step cancel out, while components that consistently point in the same direction accumulate. This damps zig-zagging and speeds up progress along ravines of the loss surface.

\[G_t = \beta G_{t-1} + (1 - \beta) \nabla L_{t} \\ W_t = W_{t-1} - \eta G_t\]
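
As a concrete illustration (my own sketch, not code from the post), a single momentum step in NumPy might look like this, where `grad` is assumed to be the gradient of the loss at the current weights:

```python
import numpy as np

def momentum_step(w, g_avg, grad, lr=0.01, beta=0.9):
    """One momentum update on weight vector w.

    g_avg carries the exponentially weighted average of past gradients
    (G_{t-1} in the formula above); initialize it to np.zeros_like(w).
    """
    g_avg = beta * g_avg + (1 - beta) * grad  # G_t = beta * G_{t-1} + (1 - beta) * grad
    w = w - lr * g_avg                        # W_t = W_{t-1} - eta * G_t
    return w, g_avg
```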

RMSProp (Root Mean Square Propagation)

RMSProp adapts the step size of each parameter i individually: it divides the gradient by the square root of a running average of its squared values, so dimensions with consistently large gradients take smaller steps and dimensions with small gradients take relatively larger ones.

\[S_t^i = \beta S_{t-1}^i + (1 - \beta) (\nabla L_{t}^i)^2 \\ W_t^i = W_{t-1}^i - \frac{\eta \nabla L_{t}^i}{\sqrt{S_t^i} + \epsilon}\]
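
A minimal sketch of the corresponding update (again my own illustration, with commonly used default hyperparameters); all operations are element-wise, so each parameter gets its own effective step size:

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update on weight vector w.

    s holds the running average of squared gradients (S_{t-1} above);
    initialize it to np.zeros_like(w).
    """
    s = beta * s + (1 - beta) * grad**2     # S_t = beta * S_{t-1} + (1 - beta) * grad^2
    w = w - lr * grad / (np.sqrt(s) + eps)  # per-parameter scaled step
    return w, s
```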

Adam

Momentum and RMSProp both have their limitations. Adam is a more robust optimizer: it combines the two, maintaining a Momentum-style average of the gradients and an RMSProp-style average of their squares, and it additionally corrects the bias that these zero-initialized averages have during the early steps.

\[G_t^i = \beta_1 G_{t-1}^i + (1 - \beta_1) \nabla L_{t}^i \\ S_t^i = \beta_2 S_{t-1}^i + (1 - \beta_2) (\nabla L_{t}^i)^2 \\ \hat{G}_t^i = \frac{G_t^i}{1-\beta_1^t} \qquad \hat{S}_t^i = \frac{S_t^i}{1-\beta_2^t} \\ W_t^i = W_{t-1}^i - \frac{\eta \hat{G}_t^i}{\sqrt{\hat{S}_t^i} + \epsilon}\]
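
Putting the two together, a sketch of one Adam step (the function name and defaults are my choice; beta1=0.9, beta2=0.999, eps=1e-8 are the commonly used values):

```python
import numpy as np

def adam_step(w, g_avg, s, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on weight vector w.

    g_avg and s are the running first and second moments (initialize both
    to np.zeros_like(w)); t is the 1-based step count, needed for bias correction.
    """
    g_avg = beta1 * g_avg + (1 - beta1) * grad   # Momentum-style first moment
    s = beta2 * s + (1 - beta2) * grad**2        # RMSProp-style second moment
    g_hat = g_avg / (1 - beta1**t)               # bias-corrected first moment
    s_hat = s / (1 - beta2**t)                   # bias-corrected second moment
    w = w - lr * g_hat / (np.sqrt(s_hat) + eps)
    return w, g_avg, s
```

The bias correction matters mostly in the first few steps, when the zero-initialized moments would otherwise underestimate the true gradient statistics.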

Learning rate decay

Independently of the optimizer, the learning rate itself can be decayed over training: large steps early on, smaller steps as the weights approach a minimum. Common schedules include the following, where the initial learning rate, the decay rate, and k are hyperparameters; epoch is the epoch number and t is the iteration number.

\[\eta = \frac{\eta_0}{1 + \text{decay rate} \times \text{epoch}}\]

\[\eta = \frac{k}{\sqrt{\text{epoch}}} \, \eta_0\]

\[\eta = 0.95^{\text{epoch}} \, \eta_0\]

\[\eta = 0.95^{t} \, \eta_0\]
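
These schedules translate directly into code; a sketch (the function names are mine), where epoch starts at 1 and step can be either the epoch or the iteration t:

```python
def lr_inverse_decay(lr0, epoch, decay_rate=0.1):
    """eta = eta_0 / (1 + decay_rate * epoch)"""
    return lr0 / (1 + decay_rate * epoch)

def lr_sqrt_decay(lr0, epoch, k=1.0):
    """eta = (k / sqrt(epoch)) * eta_0"""
    return (k / epoch ** 0.5) * lr0

def lr_exponential_decay(lr0, step, base=0.95):
    """eta = base**step * eta_0"""
    return (base ** step) * lr0
```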

© 2018-19 Manjeet Dahiya