Solvers¶
The nnabla.solvers.Solver class represents a stochastic-gradient-descent-based optimizer for updating the parameters of a computation graph. NNabla provides the various solvers listed below.
Solver¶
- class nnabla.solvers.Solver¶
Solver interface class.
The same API provided by this class can be used to implement various types of solvers.
Example:

    # Network building comes above
    import nnabla as nn
    import nnabla.solvers as S

    solver = S.Sgd(lr=1e-3)
    solver.set_parameters(nn.get_parameters())

    for itr in range(num_itr):
        x.d = ...  # set data
        t.d = ...  # set label
        loss.forward()
        solver.zero_grad()  # Set all gradient buffers to 0
        loss.backward()
        solver.weight_decay(decay_rate)  # Apply weight decay
        solver.clip_grad_by_norm(clip_norm)  # Apply gradient clipping by norm
        solver.update()  # Update parameters
Note
All solvers provided by NNabla are subclasses of Solver. This class itself is never instantiated directly.
- check_inf_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any inf in the gradients that have been set up.
- check_inf_or_nan_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any inf or nan in the gradients that have been set up.
- check_nan_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any nan in the gradients that have been set up.
- clear_parameters(self)¶
Clear all registered parameters and states.
- clip_grad_by_norm(self, float clip_norm, pre_hook=None, post_hook=None)¶
Clip gradients by norm. When called, each gradient is clipped by the given norm value.
- Parameters
clip_norm (float) – The value of clipping norm.
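Conceptually, a sketch of the clipping rule (the exact per-parameter behavior is defined in the C++ core): each gradient \(g\) is rescaled so that its norm does not exceed the threshold \(c\) (clip_norm):
\[g \leftarrow \frac{c}{\max(c, \|g\|_2)} g\]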
- get_parameters(self)¶
Get all registered parameters.
- get_states(self)¶
Get all states.
- info¶
- Type
object
- learning_rate(self)¶
Get the learning rate.
- load_states(self, path)¶
Load solver states.
- Parameters
path – Path to the state file to be loaded.
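A minimal checkpointing sketch combining save_states and load_states (file names are hypothetical):

    import nnabla as nn
    import nnabla.solvers as S

    solver = S.Adam(alpha=1e-3)
    solver.set_parameters(nn.get_parameters())

    # Save network parameters and solver states (e.g. Adam moments) separately.
    nn.save_parameters("params.h5")
    solver.save_states("solver_states.h5")

    # To resume: restore parameters, register them, then load the states.
    nn.load_parameters("params.h5")
    solver.set_parameters(nn.get_parameters())
    solver.load_states("solver_states.h5")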
- name¶
Get the name of the solver.
- remove_parameters(self, vector[string] keys)¶
Remove previously registered parameters, specified by a vector of keys.
- save_states(self, path)¶
Save solver states.
- Parameters
path – Path or file object.
- scale_grad(self, scale, pre_hook=None, post_hook=None)¶
Rescale gradients by the given scale factor.
- set_learning_rate(self, learning_rate)¶
Set the learning rate.
- set_parameters(self, param_dict, bool reset=True, bool retain_state=False)¶
Set parameters by a dictionary of keys and parameter Variables.
- Parameters
param_dict (dict) – key: string, value: Variable.
reset (bool) – If true, clear all parameters before setting them. If false, parameters are overwritten or added (if new).
retain_state (bool) – The value is only considered if reset is false. If true and a key already exists (overwriting), a state (such as momentum) associated with the key will be kept if the shape of the existing parameter and that of the new parameter match.
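A sketch of registering parameters incrementally while keeping solver states for overwritten keys (new_params is a hypothetical dict of name to Variable):

    # Initial registration.
    solver.set_parameters(nn.get_parameters())

    # Add or overwrite parameters without clearing the existing ones,
    # retaining momentum-like states where parameter shapes match.
    solver.set_parameters(new_params, reset=False, retain_state=True)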
- set_states(self, states)¶
Set states. Call set_parameters to initialize the states of the solver first; otherwise this method raises a ValueError.
- set_states_from_protobuf(self, optimizer_proto)¶
Set states to the solver from the protobuf file.
Internally used helper method.
- set_states_to_protobuf(self, optimizer)¶
Set states to the protobuf file from the solver.
Internally used helper method.
- setup(self, params)¶
Deprecated. Call set_parameters with param_dict instead.
- update(self, update_pre_hook=None, update_post_hook=None)¶
When this function is called, parameter values are updated using the gradients accumulated during backpropagation, stored in the grad field of each parameter Variable. Update rules are implemented in the C++ core, in classes derived from Solver. The updated parameter values are stored in the data field of each parameter Variable.
- Parameters
update_pre_hook (callable) – This callable object is called immediately before each parameter update. The default is None.
update_post_hook (callable) – This callable object is called immediately after each parameter update. The default is None.
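A minimal sketch of using the hooks, here to time each update (the hook functions are hypothetical):

    import time

    t0 = [None]

    def before_update():
        t0[0] = time.time()

    def after_update():
        print("update took {:.6f} s".format(time.time() - t0[0]))

    solver.update(update_pre_hook=before_update, update_post_hook=after_update)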
- weight_decay(self, float decay_rate, pre_hook=None, post_hook=None)¶
Apply weight decay to gradients. When called, the current parameter value scaled by the decay rate is added to each gradient.
- Parameters
decay_rate (float) – The coefficient of weight decay.
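This corresponds to the standard L2 weight-decay modification of the gradient:
\[g \leftarrow g + \lambda w\]
where \(\lambda\) is decay_rate and \(w\) is the current parameter value.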
- zero_grad(self)¶
Initialize the gradients of all registered parameters to zero.
List of solvers¶
- nnabla.solvers.Sgd(lr=0.001)¶
Stochastic gradient descent (SGD) optimizer.
\[w_{t+1} \leftarrow w_t - \eta \Delta w_t\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.Momentum(lr=0.001, momentum=0.9)¶
SGD with Momentum.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} + \eta \Delta w_t\\ w_{t+1} &\leftarrow w_t - v_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
momentum (float) – Decay rate of momentum (\(\gamma\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
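As a cross-check of the Momentum update rule shown above, a minimal NumPy sketch of a single step (illustrative only; the actual rule is implemented in the C++ core):

    import numpy as np

    def momentum_step(w, grad, v, lr=0.001, momentum=0.9):
        # v_t <- gamma * v_{t-1} + eta * grad
        v = momentum * v + lr * grad
        # w_{t+1} <- w_t - v_t
        return w - v, v

    w, v = np.ones(3), np.zeros(3)
    grad = np.array([0.1, -0.2, 0.3])
    w, v = momentum_step(w, grad, v)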
- nnabla.solvers.Lars(lr=0.001, momentum=0.9, coefficient=0.001, eps=1e-06)¶
LARS with Momentum.
\[\begin{split}\lambda &\leftarrow \eta \frac{\| w_t \|}{\| \Delta w_t + \beta w_t \|} \\ v_{t+1} &\leftarrow m v_t + \gamma \lambda (\Delta w_t + \beta w_t) \\ w_{t+1} &\leftarrow w_t - v_{t+1}\end{split}\]
- Parameters
lr (float) – Learning rate (\(\gamma\)).
momentum (float) – Decay rate of momentum (\(m\)).
coefficient (float) – Trust coefficient (\(\eta\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.Nesterov(lr=0.001, momentum=0.9)¶
Nesterov Accelerated Gradient optimizer.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} - \eta \Delta w_t\\ w_{t+1} &\leftarrow w_t - \gamma v_{t-1} + \left(1 + \gamma \right) v_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
momentum (float) – Decay rate of momentum (\(\gamma\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
References
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\).
- nnabla.solvers.Adadelta(lr=1.0, decay=0.95, eps=1e-06)¶
AdaDelta optimizer.
\[\begin{split}g_t &\leftarrow \Delta w_t\\ v_t &\leftarrow - \frac{RMS \left[ v \right]_{t-1}}{RMS \left[ g \right]_t} g_t\\ w_{t+1} &\leftarrow w_t + \eta v_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
decay (float) – Decay rate (\(\gamma\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.Adagrad(lr=0.01, eps=1e-08)¶
ADAGrad optimizer.
\[\begin{split}g_t &\leftarrow \Delta w_t\\ G_t &\leftarrow G_{t-1} + g_t^2\\ w_{t+1} &\leftarrow w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AdaBelief(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, wd=0.0, amsgrad=False, weight_decouple=False, fixed_decay=False, rectify=False)¶
AdaBelief optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ s_t &\leftarrow \beta_2 s_{t-1} + (1 - \beta_2) (g_t - m_t)^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{s_t + \epsilon} + \epsilon}\end{split}\]
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
wd (float) – Weight decay rate. This option only takes effect when weight_decouple option is enabled.
amsgrad (bool) – Perform AMSGrad variant of AdaBelief.
weight_decouple (bool) – Perform decoupled weight decay as in AdamW.
fixed_decay (bool) – If True, the weight decay ratio will be kept fixed. Note that this option only takes effect when weight_decouple option is enabled.
rectify (bool) – Perform RAdam variant of AdaBelief.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.RMSprop(lr=0.001, decay=0.9, eps=1e-08)¶
RMSprop optimizer (Geoffrey Hinton).
\[\begin{split}g_t &\leftarrow \Delta w_t\\ v_t &\leftarrow \gamma v_{t-1} + \left(1 - \gamma \right) g_t^2\\ w_{t+1} &\leftarrow w_t - \eta \frac{g_t}{\sqrt{v_t} + \epsilon}\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
decay (float) – Decay rate (\(\gamma\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.RMSpropGraves(lr=0.0001, decay=0.95, momentum=0.9, eps=0.0001)¶
RMSpropGraves optimizer (Alex Graves).
\[\begin{split}n_t &\leftarrow \rho n_{t-1} + \left(1 - \rho \right) {e_t}^2\\ g_t &\leftarrow \rho g_{t-1} + \left(1 - \rho \right) e_t\\ d_t &\leftarrow \beta d_{t-1} - \eta \frac{e_t}{\sqrt{n_t - {g_t}^2 + \epsilon}}\\ w_{t+1} &\leftarrow w_t + d_t\end{split}\]
where \(e_t\) denotes the gradient.
- Parameters
lr (float) – Learning rate (\(\eta\)).
decay (float) – Decay rate (\(\rho\)).
momentum (float) – Momentum (\(\beta\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.Adam(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08)¶
ADAM optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{v_t} + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, and let \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
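As a cross-check of the Adam update rule shown above, a minimal NumPy sketch of a single step (illustrative only; the actual rule is implemented in the C++ core):

    import numpy as np

    def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad       # first-order moment
        v = beta2 * v + (1 - beta2) * grad ** 2  # second-order moment
        # Bias-corrected step size, as in the formula above.
        step = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        return w - step * m / (np.sqrt(v) + eps), m, v

    w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    grad = np.array([0.1, -0.2, 0.3])
    w, m, v = adam_step(w, grad, m, v, t=1)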
- nnabla.solvers.AdaBound(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, final_lr=0.1, gamma=0.001)¶
AdaBound optimizer applies dynamic bounds on learning rates to Adam.
\[\begin{split}w_{t+1} &\leftarrow w_t - \eta_t m_t\\ \eta_t &= \mathrm{clip}\left( \alpha\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)(\sqrt{v_t} + \epsilon)},\ \eta_l(t),\ \eta_u(t)\right)\\ \eta_l(t) &= \left(1 - \frac{1}{(1-\gamma)t+1}\right)\alpha^*\\ \eta_u(t) &= \left(1 + \frac{1}{(1-\gamma)t}\right)\alpha^*\end{split}\]
where \(\alpha^*\) (final_lr) is scaled by a factor defined as the current value of \(\alpha\) (set by set_learning_rate(lr)) over the initial value of \(\alpha\), so that learning rate scheduling is properly applied to both \(\alpha\) and \(\alpha^*\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
final_lr (float) – Final (SGD) learning rate.
gamma (float) – Convergence speed of the bound functions.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
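Because \(\alpha^*\) tracks the ratio of the current to the initial \(\alpha\), a learning rate schedule applied through set_learning_rate also scales the bounds. A minimal sketch (the parameter name "x" is hypothetical):

    import nnabla as nn
    import nnabla.solvers as S

    x = nn.parameter.get_parameter_or_create("x", (3,))
    solver = S.AdaBound(alpha=1e-3, final_lr=0.1)
    solver.set_parameters(nn.get_parameters())

    # Halving alpha also halves the effective final_lr internally,
    # so the bound functions shrink together with the step size.
    solver.set_learning_rate(0.5e-3)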
- nnabla.solvers.Adamax(alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-08)¶
ADAMAX optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \max\left(\beta_2 v_{t-1}, |g_t|\right)\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{v_t + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\), and \(v_t\) is an exponentially weighted infinity norm of the sequence of gradients up to \(t\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AMSGRAD(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, bias_correction=False)¶
AMSGRAD optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{v}_t &= \max(\hat{v}_{t-1}, v_t)\\ w_{t+1} &\leftarrow w_t - \alpha \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, and let \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)). Note this does not appear in the paper.
bias_correction (bool) – Apply bias correction to moving averages defined in ADAM. Note this does not appear in the paper.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AMSBound(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, final_lr=0.1, gamma=0.001, bias_correction=False)¶
AMSBound optimizer applies dynamic bounds on learning rates to AMSGrad.
\[\begin{split}w_{t+1} &\leftarrow w_t - \eta_t m_t\\ \eta_t &= \mathrm{clip}\left( \alpha\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)(\sqrt{\hat{v}_t} + \epsilon)},\ \eta_l(t),\ \eta_u(t)\right)\\ \hat{v}_t &= \max(\hat{v}_{t-1}, v_t)\\ \eta_l(t) &= \left(1 - \frac{1}{(1-\gamma)t+1}\right)\alpha^*\\ \eta_u(t) &= \left(1 + \frac{1}{(1-\gamma)t}\right)\alpha^*\end{split}\]
where \(\alpha^*\) (final_lr) is scaled by a factor defined as the current value of \(\alpha\) (set by set_learning_rate(lr)) over the initial value of \(\alpha\), so that learning rate scheduling is properly applied to both \(\alpha\) and \(\alpha^*\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)). Note this does not appear in the paper.
final_lr (float) – Final (SGD) learning rate.
gamma (float) – Convergence speed of the bound functions.
bias_correction (bool) – Apply bias correction to moving averages defined in ADAM. Note this does not appear in the paper.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AdamW(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, wd=0.0001)¶
ADAM optimizer with decoupled weight decay.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{v_t} + \epsilon} - \eta_t\lambda w_t\end{split}\]
where \(g_t\) denotes a gradient, \(\lambda\) is the decoupled weight decay rate, and \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
wd (float) – Decoupled weight decay rate (\(\lambda\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
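Since the decay term \(\eta_t \lambda w_t\) is applied inside update() as the formula shows, a separate weight_decay() call is not needed with AdamW; a minimal sketch:

    import nnabla.solvers as S

    # wd is the decoupled decay rate (lambda); decay happens in update().
    solver = S.AdamW(alpha=1e-3, wd=1e-4)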
- nnabla.solvers.SgdW(lr=0.001, momentum=0.9, wd=0.0001)¶
Stochastic gradient descent (SGD) optimizer with decoupled weight decay.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} + \eta g_t - (\eta / \eta_0)\lambda v_{t-1}\\ w_{t+1} &\leftarrow w_t - v_t\end{split}\]
where \(\lambda\) is the decoupled weight decay rate and \(\eta_0\) is the initial learning rate.
- Parameters
lr (float) – Learning rate (\(\eta\)).
momentum (float) – Decay rate of momentum (\(\gamma\)).
wd (float) – Decoupled weight decay rate (\(\lambda\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.