DL optimizer
http://ruder.io/optimizing-gradient-descent/
Stochastic gradient descent (SGD)
updates the weights using the gradient of the loss: W = W − η * ∂L/∂W

W = weight
L = loss
η = learning rate
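A minimal sketch of the SGD update on a toy 1-D loss L(W) = (W − 3)^2; the toy loss, the names (w, lr, grad) and the step count are illustrative assumptions, not from the source.

```python
# SGD sketch: minimize L(W) = (W - 3)^2 with W <- W - η * dL/dW.
def grad(w):                # dL/dW for the toy loss (W - 3)^2
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1            # weight W and learning rate η
for _ in range(100):
    w -= lr * grad(w)       # plain SGD update step
print(w)                    # ~= 3.0, the minimum of the toy loss
```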
Momentum (speeds up or slows down based on the previous gradients)
simulates particle motion:
speeds up when successive gradients point in the same direction, slows down when the direction changes.

Each update combines the current gradient with the last update direction: v = γ * v − η * ∂L/∂W, then W = W + v (γ = momentum coefficient).
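A momentum sketch on the same toy loss, showing how the velocity v accumulates past gradients; γ = 0.9 and the other names are assumed for illustration.

```python
# Momentum sketch: steps in a consistent direction grow, direction
# changes are damped by the accumulated velocity v.
def grad(w):
    return 2.0 * (w - 3.0)

w, lr, gamma, v = 0.0, 0.1, 0.9, 0.0
for _ in range(100):
    v = gamma * v - lr * grad(w)   # blend previous direction with new gradient
    w += v                         # move along the accumulated velocity
print(w)                           # ~= 3.0
```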
AdaGrad (adaptive learning rate)
adjusts the learning rate η based on the previous gradients:
early stage = small n, large effective learning rate
later stage = big n, small effective learning rate

n = sum(square(all previous gradients))
W = W − η / sqrt(n + ε) * ∂L/∂W (ε = small constant for numerical stability)
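An AdaGrad sketch on the same toy loss; n accumulates squared gradients, so the effective step size η / sqrt(n + ε) shrinks over time. The constants here are assumed for illustration.

```python
# AdaGrad sketch: n grows monotonically, so the effective learning
# rate keeps shrinking as training goes on.
def grad(w):
    return 2.0 * (w - 3.0)

w, lr, n, eps = 0.0, 1.0, 0.0, 1e-8
for _ in range(200):
    g = grad(w)
    n += g * g                          # n = sum of squared gradients so far
    w -= lr / ((n + eps) ** 0.5) * g    # per-step learning rate shrinks
print(w)                                # ~= 3.0
```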
RMSprop
n = exponentially decaying average of square(previous gradients), so old gradients are gradually forgotten and the learning rate does not keep shrinking
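An RMSprop sketch; the only change from AdaGrad is that n is a decaying average instead of a sum. The decay rate ρ = 0.9 and learning rate are assumed, typical values.

```python
# RMSprop sketch: n tracks recent squared gradients, so the effective
# learning rate adapts without vanishing over time.
def grad(w):
    return 2.0 * (w - 3.0)

w, lr, rho, n, eps = 0.0, 0.01, 0.9, 0.0, 1e-8
for _ in range(500):
    g = grad(w)
    n = rho * n + (1 - rho) * g * g     # decaying average of squared gradients
    w -= lr / ((n + eps) ** 0.5) * g
print(w)                                # ~= 3.0 (within about lr of the minimum)
```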
Adam
momentum + RMSprop-style adaptive learning rate, with bias correction of both running estimates
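An Adam sketch combining the two ideas: m is a momentum-style average of gradients, n an RMSprop-style average of squared gradients, both bias-corrected. The β₁, β₂, ε values are the commonly used defaults, assumed here.

```python
# Adam sketch: first moment m (momentum) and second moment n (adaptive
# rate), each bias-corrected before the update.
def grad(w):
    return 2.0 * (w - 3.0)

w, lr, beta1, beta2, eps = 0.0, 0.1, 0.9, 0.999, 1e-8
m, n = 0.0, 0.0
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # decaying average of gradients
    n = beta2 * n + (1 - beta2) * g * g      # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction (early steps)
    n_hat = n / (1 - beta2 ** t)
    w -= lr * m_hat / (n_hat ** 0.5 + eps)
print(w)                                     # ~= 3.0
```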