class nbla::SgdW

template<typename T>
class SgdW : public nbla::Solver

SGDW solver (SGD with momentum and decoupled weight decay).

\[\begin{split}
m_{t} &\leftarrow \gamma m_{t-1} + \eta_t \alpha g_t\\
w_{t} &\leftarrow w_{t-1} - m_{t} - \eta_t \lambda w_{t-1}
\end{split}\]
where \(g_t\) denotes a gradient, \(m_t\) is momentum of the gradient initialized with 0 at \(t=0\), \(\eta _t\) is the scheduled learning rate, \(\lambda\) is the decoupled weight decay rate set by weight_decay method (lazy evaluation), and the rest is described in the argument documentation.
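
For concreteness, the following is a minimal, standalone C++ sketch of the update rule above. It is illustration only, not nnabla's implementation: the function name sgdw_update and the dense std::vector buffers are assumptions.

#include <cstddef>
#include <vector>

// Illustration only: one SgdW step on a flat parameter vector.
void sgdw_update(std::vector<float> &w,       // parameters, w_{t-1} on entry, w_t on exit
                 const std::vector<float> &g, // gradient g_t
                 std::vector<float> &m,       // momentum m_{t-1}, zero-initialized at t = 0
                 float alpha,                 // initial learning rate
                 float gamma,                 // momentum decay rate
                 float lambda,                // decoupled weight decay rate
                 float eta_t) {               // scheduled learning rate, eta_t = alpha_t / alpha
  for (std::size_t i = 0; i < w.size(); ++i) {
    // m_t <- gamma * m_{t-1} + eta_t * alpha * g_t
    m[i] = gamma * m[i] + eta_t * alpha * g[i];
    // w_t <- w_{t-1} - m_t - eta_t * lambda * w_{t-1}
    w[i] = w[i] - m[i] - eta_t * lambda * w[i];
  }
}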

See also

See Loshchilov and Hutter, Decoupled Weight Decay Regularization, for more details. https://arxiv.org/abs/1711.05101

Param lr:

Initial learning rate ( \(\alpha\)). Note that you have to manage the scheduled learning rate \(\eta_t\) yourself. Denoting the learning rate set via set_learning_rate by \(\alpha_t\), we define \(\eta_t = \frac{\alpha_t}{\alpha}\) (see the usage sketch after the parameter list).

Param momentum:

Decay rate of momentum ( \(\gamma\)).

Param wd:

The default weight decay rate ( \(\lambda\)). The weight decay operation is fused into the update operation in SgdW. This default rate is used unless you override it by calling weight_decay, which sets the decay rate applied by the next call of update (lazy evaluation; see the sketch below).
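
The following hedged sketch ties the two notes above together: scheduling the learning rate yourself via set_learning_rate and overriding the decoupled decay rate via weight_decay before an update. The cosine schedule, the SolverLike stand-in, the train function, and the 1e-2 override value are assumptions for illustration only.

#include <cmath>

// Stand-in interface so this sketch is self-contained; in real code `solver`
// would be the SgdW solver itself, whose set_learning_rate, weight_decay,
// and update methods are the ones referred to in this documentation.
struct SolverLike {
  void set_learning_rate(float /*lr*/) {}    // no-op stand-in
  void weight_decay(float /*decay_rate*/) {} // no-op stand-in
  void update() {}                           // no-op stand-in
};

void train(SolverLike &solver, float alpha, int t_max) {
  for (int t = 0; t < t_max; ++t) {
    // Any schedule of your choosing defines eta_t; a cosine schedule is
    // used here purely as an example.
    float eta_t = 0.5f * (1.0f + std::cos(3.14159265f * float(t) / float(t_max)));
    solver.set_learning_rate(eta_t * alpha); // alpha_t = eta_t * alpha, i.e. eta_t = alpha_t / alpha

    // ... forward pass, backward pass ...

    solver.weight_decay(1e-2f); // optional: sets the decay rate used for the next update
    solver.update();            // momentum step and decoupled weight decay, fused
  }
}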

Public Functions

inline virtual float learning_rate()

Get the learning rate.