class nbla::AMSGRAD

template<typename T>
class AMSGRAD : public nbla::Solver

AMSGRAD solver, defined as:

\[\begin{split} m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{v_t} = \max(\hat{v_{t-1}}, v_t)\\ \theta_{t+1} \leftarrow \theta_t - \alpha \frac{m_t}{\sqrt{\hat{v_t}} + \epsilon} \end{split}\]
where \(g_t\) is the gradient of a parameter \(\theta_t\) at step \(t\), \(m_t\) and \(v_t\) are the moving average and the moving 0-mean variance of the sequence of gradients \(g_1,\dots,g_t\), and \(\hat{v_t}\) is the element-wise maximum of \(v_t\) over all steps so far.
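As a rough illustration, the following standalone C++ sketch applies one AMSGRAD step to a parameter vector according to the equations above. It is not the nbla::AMSGRAD implementation; the function and variable names are assumptions chosen for clarity.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One AMSGRAD update over a flat parameter vector (illustrative sketch).
// theta: parameters, grad: gradient g_t, m/v: running moments,
// v_hat: running element-wise maximum of v.
void amsgrad_update(std::vector<float> &theta, const std::vector<float> &grad,
                    std::vector<float> &m, std::vector<float> &v,
                    std::vector<float> &v_hat, float alpha, float beta1,
                    float beta2, float eps) {
  for (std::size_t i = 0; i < theta.size(); ++i) {
    // m_t <- beta1 * m_{t-1} + (1 - beta1) * g_t
    m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
    // v_t <- beta2 * v_{t-1} + (1 - beta2) * g_t^2
    v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
    // v_hat_t = max(v_hat_{t-1}, v_t)
    v_hat[i] = std::max(v_hat[i], v[i]);
    // theta_{t+1} <- theta_t - alpha * m_t / (sqrt(v_hat_t) + eps)
    theta[i] -= alpha * m[i] / (std::sqrt(v_hat[i]) + eps);
  }
}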

See also

See the paper linked below for more details: Reddi et al., On the Convergence of Adam and Beyond. https://openreview.net/pdf?id=ryQu7f-RZ

Param alpha:

\(\alpha\) Learning rate.

Param beta1:

\(\beta_1\) Decay rate of moving mean.

Param beta2:

\(\beta_2\) Decay rate of moving 0-mean variance.

Param eps:

\(\epsilon\) Small value added to avoid division by zero. Note this does not appear in the paper.

Param bias_correction:

Apply the bias correction to the moving averages as defined in Adam (sketched below). Note this does not appear in the paper.
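For reference, the bias correction referred to here is the one used in Adam; a sketch of the rescaled moments at step \(t\), written with bars to avoid clashing with \(\hat{v_t}\) above:

\[\bar{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \bar{v}_t = \frac{v_t}{1 - \beta_2^t}\]

When bias_correction is enabled, rescaled moments of this form take the place of \(m_t\) and \(v_t\) in the update; exactly how this combines with the \(\max\) step is an implementation detail.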

Public Functions

inline virtual float learning_rate()

Get the learning rate.