class nbla::SgdW

template<typename T>
class SgdW : public nbla::Solver

SGDW.

\[\begin{split} m_{t} &\leftarrow \gamma m_{t-1} + \eta_t \alpha g_t\\ w_{t} &\leftarrow w_{t-1} - m_{t} - \eta_t \lambda w_{t-1} \end{split}\]
where \(g_t\) denotes a gradient, \(m_t\) is the momentum of the gradient initialized with 0 at \(t=0\), \(\eta_t\) is the scheduled learning rate, \(\lambda\) is the decoupled weight decay rate set by the weight_decay method (lazy evaluation), and the rest is described in the argument documentation.
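The two assignments above can be read as a per-element loop over each parameter vector. The following is a minimal standalone sketch of that loop in plain C++, not the nbla implementation; the function name sgdw_step and the use of std::vector are illustrative only.

#include <cstddef>
#include <vector>

// One SgdW step for a single parameter vector.
// w: weights, g: gradients, m: momentum buffer (same size as w, zero-initialized),
// alpha: initial learning rate, eta_t: scheduled rate factor (alpha_t / alpha),
// gamma: momentum decay rate, lambda: decoupled weight decay rate.
void sgdw_step(std::vector<float> &w, const std::vector<float> &g,
               std::vector<float> &m, float alpha, float eta_t, float gamma,
               float lambda) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    // m_t <- gamma * m_{t-1} + eta_t * alpha * g_t
    m[i] = gamma * m[i] + eta_t * alpha * g[i];
    // w_t <- w_{t-1} - m_t - eta_t * lambda * w_{t-1}
    // (the right-hand side reads the old w[i], so the decay term uses w_{t-1})
    w[i] = w[i] - m[i] - eta_t * lambda * w[i];
  }
}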

References

See the following paper for more details. Loshchilov and Hutter, Decoupled Weight Decay Regularization. https://arxiv.org/abs/1711.05101

Param lr:

Initial learning rate ( \(\alpha\)). Note that you have to manage the scheduled learning rate \(\eta_t\) yourself: denoting the learning rate set via set_learning_rate at step \(t\) by \(\alpha_t\), we define \(\eta_t = \frac{\alpha_t}{\alpha}\) (see the usage sketch after this parameter list).

Param momentum:

Decay rate of momentum ( \(\gamma\)).

Param wd:

The default weight decay rate ( \(\lambda\)). The weight decay operation is fused into the update operation in SgdW. This default rate is used unless you override it via the weight_decay method, which applies to the next call of update (lazy evaluation).
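The usage sketch below shows how the scheduled learning rate and the lazy weight decay override fit together. It assumes the Solver interface described on this page (set_learning_rate, weight_decay, update) and the header path nbla/solver.hpp; solver creation and parameter registration are omitted, and the decay value 1e-3f is purely illustrative.

#include <memory>
#include <nbla/solver.hpp>

// One training step at iteration t.
void train_step(std::shared_ptr<nbla::Solver> solver, float alpha_t,
                bool stronger_decay_this_step) {
  // alpha_t is the scheduled learning rate you manage yourself; SgdW derives
  // eta_t = alpha_t / alpha from it, where alpha is the initial learning rate.
  solver->set_learning_rate(alpha_t);
  if (stronger_decay_this_step) {
    // Lazy evaluation: this rate replaces the default lambda for the next
    // call of update only.
    solver->weight_decay(1e-3f);
  }
  // Fused SgdW update, including the decoupled weight decay term.
  solver->update();
}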

Public Functions

inline virtual float learning_rate()

Get the current learning rate.