class nbla::AMSGRAD

template<typename T>
class AMSGRAD : public nbla::Solver

AMSGRAD solver, defined as:

\[\begin{split} m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{v_t} = \max(\hat{v_{t-1}}, v_t)\\ \theta_{t+1} \leftarrow \theta_t - \alpha \frac{m_t}{\sqrt{\hat{v_t}} + \epsilon} \end{split}\]
where \(g_t\) is the gradient of a parameter \(\theta_t\) at step \(t\), \(m_t\) and \(v_t\) are the moving average and the moving 0-mean variance of the sequence of gradients \(g_1,\dots,g_t\), and \(\hat{v_t}\) is the element-wise maximum of \(v_t\) over all steps so far.
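As a rough illustration, the following standalone C++ sketch applies one AMSGRAD step to a parameter vector according to the equations above. It is not the nbla::AMSGRAD implementation; the function and variable names are assumptions chosen for clarity.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One AMSGRAD update over a flat parameter vector (illustrative sketch).
// theta: parameters, grad: gradient g_t, m/v: running moments,
// v_hat: running element-wise maximum of v.
void amsgrad_update(std::vector<float> &theta, const std::vector<float> &grad,
                    std::vector<float> &m, std::vector<float> &v,
                    std::vector<float> &v_hat, float alpha, float beta1,
                    float beta2, float eps) {
  for (std::size_t i = 0; i < theta.size(); ++i) {
    // m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
    m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
    // v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t^2
    v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
    // v_hat_t = max(v_hat_{t-1}, v_t)
    v_hat[i] = std::max(v_hat[i], v[i]);
    // theta_{t+1} <- theta_t - alpha * m_t / (sqrt(v_hat_t) + eps)
    theta[i] -= alpha * m[i] / (std::sqrt(v_hat[i]) + eps);
  }
}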

See also

See the paper linked below for more details: Reddi et al., On the Convergence of Adam and Beyond. https://openreview.net/pdf?id=ryQu7f-RZ

Param alpha:

\(\alpha\) Learning rate.

Param beta1:

\(\beta_1\) Decay rate of moving mean.

Param beta2:

\(\beta_2\) Decay rate of moving 0-mean variance.

Param eps:

\(\epsilon\) Small value added to avoid division by zero. Note this does not appear in the paper.

Param bias_correction:

Apply the bias correction to the moving averages as defined in Adam (sketched below). Note this does not appear in the paper.
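For reference, the bias correction referred to here is the one used in Adam; a sketch of the rescaled moments at step \(t\), written with bars to avoid clashing with \(\hat{v_t}\) above:

\[\bar{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \bar{v}_t = \frac{v_t}{1 - \beta_2^t}\]

When bias_correction is enabled, rescaled moments of this form take the place of \(m_t\) and \(v_t\) in the update; exactly how this combines with the \(\max\) step is an implementation detail.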

Public Functions

inline virtual float learning_rate()

Get the learning rate.