class nbla::SgdW
-
template<typename T>
class SgdW : public nbla::Solver

SGDW optimizer.

\[\begin{split} m_{t} &\leftarrow \gamma m_{t-1} + \eta_t \alpha g_t\\ w_{t} &\leftarrow w_{t-1} - m_{t} - \eta_t \lambda w_{t-1} \end{split}\]

where \(g_t\) denotes a gradient, \(m_t\) is the momentum of the gradient initialized with 0 at \(t=0\), \(\eta_t\) is the scheduled learning rate, \(\lambda\) is the decoupled weight decay rate set by the weight_decay method (lazy evaluation), and the rest is described in the argument documentation.

See also
See the paper linked below for more details. Loshchilov and Hutter, Decoupled Weight Decay Regularization. https://arxiv.org/abs/1711.05101
- Param lr:
Initial learning rate ( \(\alpha\)). Note that you have to manage the scheduled learning rate \(\eta_t\) yourself. Denoting the learning rate updated via set_learning_rate by \(\alpha_t\), we define \(\eta_t = \frac{\alpha_t}{\alpha}\).
- Param momentum:
Decay rate of momentum ( \(\gamma\)).
- Param wd:
The default weight decay rate ( \(\lambda\)). The weight decay operation is fused into the update operation in SgdW. This default decay rate is used unless you override it via weight_decay before the next call of update.
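To make the update rule concrete, here is a minimal standalone sketch of one SGDW step over a flat parameter buffer. This is an illustrative reimplementation of the formulas above, not the actual nbla::SgdW code; the function name sgdw_update and its argument order are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of one SGDW step (not the actual nbla::SgdW code).
// w holds w_{t-1}, m holds m_{t-1}, g is the gradient g_t.
// alpha: initial learning rate, gamma: momentum decay rate,
// lambda: decoupled weight decay rate, eta_t: scheduled learning rate.
void sgdw_update(std::vector<float> &w, std::vector<float> &m,
                 const std::vector<float> &g, float alpha, float gamma,
                 float lambda, float eta_t) {
  assert(w.size() == m.size() && w.size() == g.size());
  for (std::size_t i = 0; i < w.size(); ++i) {
    // m_t <- gamma * m_{t-1} + eta_t * alpha * g_t
    m[i] = gamma * m[i] + eta_t * alpha * g[i];
    // w_t <- w_{t-1} - m_t - eta_t * lambda * w_{t-1}
    // (weight decay applied to w_{t-1} directly, decoupled from the gradient)
    const float w_prev = w[i];
    w[i] = w_prev - m[i] - eta_t * lambda * w_prev;
  }
}
```

Note that the decay term \(\eta_t \lambda w_{t-1}\) is subtracted from the weights directly rather than folded into the gradient, which is the point of decoupled weight decay.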
Public Functions
-
inline virtual float learning_rate()
Get the learning rate.