Solvers¶
The nnabla.solvers.Solver class represents a stochastic gradient descent based optimizer for optimizing the parameters in a computation graph. NNabla provides the various solvers listed below.
Solver¶
- class nnabla.solvers.Solver¶
Solver interface class.
The same API provided in this class can be used to implement various types of solvers.
Example:
# Network building comes above
import nnabla.solvers as S

solver = S.Sgd(lr=1e-3)
solver.set_parameters(nn.get_parameters())

for itr in range(num_itr):
    x.d = ...  # set data
    t.d = ...  # set label
    loss.forward()
    solver.zero_grad()  # Zero all gradient buffers
    loss.backward()
    solver.weight_decay(decay_rate)  # Apply weight decay
    solver.clip_grad_by_norm(clip_norm)  # Clip gradients by norm
    solver.update()  # Update parameters
Note
All solvers provided by NNabla are subclasses of Solver; this class itself is never instantiated directly.
- check_inf_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any inf in the gradients which were set up.
- check_inf_or_nan_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any inf or nan in the gradients which were set up.
- check_nan_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any nan in the gradients which were set up.
- clear_parameters(self)¶
Clear all registered parameters and states.
- clip_grad_by_norm(self, float clip_norm, pre_hook=None, post_hook=None)¶
Clip gradients by norm. When called, each gradient is clipped so that its norm does not exceed the given clip_norm.
- Parameters
clip_norm (float) – The value of clipping norm.
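As an illustration, norm clipping can be sketched for a single gradient vector in plain Python. This is a hypothetical helper, not NNabla's C++ implementation; it assumes the common rule that a gradient whose L2 norm exceeds clip_norm is rescaled onto that norm.

```python
import math

def clip_grad_by_norm(grads, clip_norm):
    # Rescale the gradient vector so its L2 norm does not exceed clip_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > clip_norm:
        scale = clip_norm / norm
        return [g * scale for g in grads]
    return list(grads)

clipped = clip_grad_by_norm([3.0, 4.0], 1.0)  # norm 5.0, rescaled to norm 1.0
```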
- get_parameters(self)¶
Get all registered parameters.
- get_states(self)¶
Get all solver states.
- info¶
Solver information.
- Type
object
- learning_rate(self)¶
Get the learning rate.
- load_states(self, path)¶
Load solver states.
- Parameters
path – path to the state file to be loaded.
- name¶
Get the name of the solver.
- remove_parameters(self, vector[string] keys)¶
Remove previously registered parameters, specified by a vector of their keys.
- save_states(self, path)¶
Save solver states.
- Parameters
path – path or file object
- scale_grad(self, scale, pre_hook=None, post_hook=None)¶
Rescale all gradients by the given scale factor.
- set_learning_rate(self, learning_rate)¶
Set the learning rate.
- set_parameters(self, param_dict, bool reset=True, bool retain_state=False)¶
Set parameters by dictionary of keys and parameter Variables.
- Parameters
param_dict (dict) – key:string, value: Variable.
reset (bool) – If true, clear all parameters before setting parameters. If false, parameters are overwritten or added (if it’s new).
retain_state (bool) – The value is only considered if reset is false. If true and a key already exists (overwriting), a state (such as momentum) associated with the key will be kept if the shape of the parameter and that of the new param match.
- set_states(self, states)¶
Set states. Call set_parameters first to initialize the states of a solver; otherwise this method raises a ValueError.
- set_states_from_protobuf(self, optimizer_proto)¶
Set states to the solver from the protobuf file.
Internally used helper method.
- set_states_to_protobuf(self, optimizer)¶
Set states to the protobuf file from the solver.
Internally used helper method.
- setup(self, params)¶
Deprecated. Call set_parameters with param_dict instead.
- update(self, update_pre_hook=None, update_post_hook=None)¶
When this function is called, parameter values are updated using the gradients accumulated during backpropagation, stored in the grad field of each parameter Variable. Update rules are implemented in the C++ core, in classes derived from Solver. The updated parameter values are written back into the data field of each parameter Variable.
- Parameters
update_pre_hook (callable) – This callable object is called immediately before each update of parameters. The default is None.
update_post_hook (callable) – This callable object is called immediately after each update of parameters. The default is None.
- weight_decay(self, float decay_rate, pre_hook=None, post_hook=None)¶
Apply weight decay to gradients. When called, each gradient is decayed by the current parameter value multiplied by the decay rate.
- Parameters
decay_rate (float) – The coefficient of weight decay.
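A minimal scalar sketch of this decay, assuming the standard L2 formulation in which the decay term is added to the gradient before the update (the helper name is hypothetical):

```python
def decayed_grad(grad, param, decay_rate):
    # L2-style weight decay folded into the gradient: g <- g + decay_rate * w
    return grad + decay_rate * param

g = decayed_grad(grad=0.5, param=2.0, decay_rate=0.1)  # 0.5 + 0.1 * 2.0
```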
- zero_grad(self)¶
Initialize the gradients of all registered parameters to zero.
List of solvers¶
- nnabla.solvers.Sgd(lr=0.001)¶
Stochastic gradient descent (SGD) optimizer.
\[w_{t+1} \leftarrow w_t - \eta \Delta w_t\]
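The rule above amounts to a one-line scalar update, sketched here in plain Python (illustrative only; the real update runs in NNabla's C++ core over parameter Variables):

```python
def sgd_update(w, grad, lr=0.001):
    # w_{t+1} = w_t - eta * grad
    return w - lr * grad

w = sgd_update(1.0, grad=0.5, lr=0.1)  # 1.0 - 0.1 * 0.5
```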
- nnabla.solvers.Momentum(lr=0.001, momentum=0.9)¶
SGD with Momentum.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} + \eta \Delta w_t\\ w_{t+1} &\leftarrow w_t - v_t\end{split}\]
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See API docs.
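A scalar sketch of the momentum rule above (a hypothetical helper, not NNabla's implementation; the velocity v is the solver state):

```python
def momentum_update(w, v, grad, lr=0.001, momentum=0.9):
    # v_t = gamma * v_{t-1} + eta * grad ; w_{t+1} = w_t - v_t
    v = momentum * v + lr * grad
    return w - v, v

w, v = momentum_update(1.0, 0.0, grad=1.0, lr=0.1, momentum=0.9)
```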
- nnabla.solvers.Lars(lr=0.001, momentum=0.9, coefficient=0.001, eps=1e-06)¶
LARS with Momentum.
\[\begin{split}\lambda &\leftarrow \eta \frac{\| w_t \|}{\| \Delta w_t + \beta w_t \|} \\ v_{t+1} &\leftarrow m v_t + \gamma \lambda (\Delta w_t + \beta w_t) \\ w_{t+1} &\leftarrow w_t - v_{t+1}\end{split}\]
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.Nesterov(lr=0.001, momentum=0.9)¶
Nesterov Accelerated Gradient optimizer.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} - \eta \Delta w_t\\ w_{t+1} &\leftarrow w_t - \gamma v_{t-1} + \left(1 + \gamma \right) v_t\end{split}\]
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
References
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\).
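A scalar sketch of the Nesterov rule as written above (hypothetical helper; note that both the previous and the new velocity appear in the weight update):

```python
def nesterov_update(w, v, grad, lr=0.001, momentum=0.9):
    # v_t = gamma * v_{t-1} - eta * grad
    # w_{t+1} = w_t - gamma * v_{t-1} + (1 + gamma) * v_t
    v_new = momentum * v - lr * grad
    w_new = w - momentum * v + (1.0 + momentum) * v_new
    return w_new, v_new

w, v = nesterov_update(1.0, 0.0, grad=1.0, lr=0.1, momentum=0.9)
```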
- nnabla.solvers.Adadelta(lr=1.0, decay=0.95, eps=1e-06)¶
AdaDelta optimizer.
\[\begin{split}g_t &\leftarrow \Delta w_t\\ v_t &\leftarrow - \frac{RMS \left[ v \right]_{t-1}} {RMS \left[ g \right]_t}g_t\\ w_{t+1} &\leftarrow w_t + \eta v_t\end{split}\]
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.Adagrad(lr=0.01, eps=1e-08)¶
ADAGrad optimizer.
\[\begin{split}g_t &\leftarrow \Delta w_t\\ G_t &\leftarrow G_{t-1} + g_t^2\\ w_{t+1} &\leftarrow w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t\end{split}\]
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
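The AdaGrad rule above can be sketched per scalar parameter (hypothetical helper; the accumulator G is the solver state):

```python
import math

def adagrad_update(w, G, grad, lr=0.01, eps=1e-8):
    # Accumulate squared gradients, then scale the step per parameter.
    G = G + grad * grad
    w = w - lr * grad / (math.sqrt(G) + eps)
    return w, G

w, G = adagrad_update(1.0, 0.0, grad=2.0, lr=0.01)
```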
- nnabla.solvers.AdaBelief(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, wd=0.0, amsgrad=False, weight_decouple=False, fixed_decay=False, rectify=False)¶
AdaBelief optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ s_t &\leftarrow \beta_2 s_{t-1} + (1 - \beta_2) (g_t - m_t)^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{s_t + \epsilon} + \epsilon}\end{split}\]
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
wd (float) – Weight decay rate. This option only takes effect when weight_decouple option is enabled.
amsgrad (bool) – Perform AMSGrad variant of AdaBelief.
weight_decouple (bool) – Perform decoupled weight decay as in AdamW.
fixed_decay (bool) – If True, the weight decay ratio will be kept fixed. Note that this option only takes effect when weight_decouple option is enabled.
rectify (bool) – Perform RAdam variant of AdaBelief.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.RMSprop(lr=0.001, decay=0.9, eps=1e-08)¶
RMSprop optimizer (Geoffery Hinton).
\[\begin{split}g_t &\leftarrow \Delta w_t\\ v_t &\leftarrow \gamma v_{t-1} + \left(1 - \gamma \right) g_t^2\\ w_{t+1} &\leftarrow w_t - \eta \frac{g_t}{\sqrt{v_t} + \epsilon}\end{split}\]
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
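A scalar sketch of the RMSprop rule above (hypothetical helper; v is the running average of squared gradients kept as solver state):

```python
import math

def rmsprop_update(w, v, grad, lr=0.001, decay=0.9, eps=1e-8):
    # v_t = gamma * v_{t-1} + (1 - gamma) * g^2 ; step scaled by 1/sqrt(v_t)
    v = decay * v + (1.0 - decay) * grad * grad
    w = w - lr * grad / (math.sqrt(v) + eps)
    return w, v

w, v = rmsprop_update(1.0, 0.0, grad=1.0, lr=0.1, decay=0.9)
```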
- nnabla.solvers.RMSpropGraves(lr=0.0001, decay=0.95, momentum=0.9, eps=0.0001)¶
RMSpropGraves optimizer (Alex Graves).
\[\begin{split}n_t &\leftarrow \rho n_{t-1} + \left(1 - \rho \right) {e_t}^2\\ g_t &\leftarrow \rho g_{t-1} + \left(1 - \rho \right) e_t\\ d_t &\leftarrow \beta d_{t-1} - \eta \frac{e_t}{\sqrt{n_t - {g_t}^2 + \epsilon}}\\ w_{t+1} &\leftarrow w_t + d_t\end{split}\]
where \(e_t\) denotes the gradient.
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.Adam(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08)¶
ADAM optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{v_t} + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, and let \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
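A scalar sketch of one Adam step as written above (hypothetical helper; m and v are solver state, and t is the 1-based update count used for bias correction). On the first step the effective step size is approximately alpha:

```python
import math

def adam_update(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First- and second-moment running averages with bias-corrected step size.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    alpha_t = alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    w = w - alpha_t * m / (math.sqrt(v) + eps)
    return w, m, v

w, m, v = adam_update(1.0, 0.0, 0.0, grad=1.0, t=1, alpha=0.001)
```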
- nnabla.solvers.AdaBound(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, final_lr=0.1, gamma=0.001)¶
AdaBound optimizer applies dynamic bounds on learning rates to Adam.
\[\begin{split}w_{t+1} &\leftarrow w_t - \eta_t*m_t\\ \eta_t = clip( \alpha\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)(\sqrt{v_t} + \epsilon)}, \eta_l(t), \eta_u(t))\\ \eta_l(t) = (1 - (1/((1-\gamma)t+1)))\alpha^*\\ \eta_u(t) = (1 + (1/((1-\gamma)t)))\alpha^*\end{split}\]
where \(\alpha^*\) (final_lr) is scaled by a factor defined as the current value of \(\alpha\) (set by set_learning_rate(lr)) over the initial value of \(\alpha\), so that learning rate scheduling is properly applied to both \(\alpha\) and \(\alpha^*\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
final_lr (float) – Final (SGD) learning rate.
gamma (float) – Convergence speed of the bound functions.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
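The bound functions \(\eta_l(t)\) and \(\eta_u(t)\) can be sketched directly from the formula above (hypothetical helper; it clips an unbounded Adam-style step size into the shrinking interval around final_lr):

```python
def adabound_lr(eta, t, final_lr=0.1, gamma=0.001):
    # Lower and upper bounds converge toward final_lr as t grows.
    eta_l = final_lr * (1.0 - 1.0 / ((1.0 - gamma) * t + 1.0))
    eta_u = final_lr * (1.0 + 1.0 / ((1.0 - gamma) * t))
    return min(max(eta, eta_l), eta_u)

early = adabound_lr(10.0, t=1)        # clipped to the upper bound
late = adabound_lr(10.0, t=10 ** 9)   # bounds have converged near final_lr
```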
- nnabla.solvers.Adamax(alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-08)¶
ADAMAX Optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \max\left(\beta_2 v_{t-1}, |g_t|\right)\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{v_t + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\), and \(v_t\) is an exponentially weighted infinity norm of the sequence of gradients up to time \(t\).
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.AMSGRAD(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, bias_correction=False)¶
AMSGRAD optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{v_t} = \max(\hat{v_{t-1}}, v_t)\\ w_{t+1} &\leftarrow w_t - \alpha \frac{m_t}{\sqrt{\hat{v_t}} + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, and let \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)). Note this does not appear in the paper.
bias_correction (bool) – Apply bias correction to moving averages defined in ADAM. Note this does not appear in the paper.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.AMSBound(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, final_lr=0.1, gamma=0.001, bias_correction=False)¶
AMSBound optimizer applies dynamic bounds on learning rates to AMSGrad.
\[\begin{split}w_{t+1} &\leftarrow w_t - \eta_t*m_t\\ \eta_t = clip( \alpha\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)(\sqrt{\hat{v_t}} + \epsilon)}, \eta_l(t), \eta_u(t))\\ \hat{v_t} = \max(\hat{v_{t-1}}, v_t)\\ \eta_l(t) = (1 - (1/((1-\gamma)t+1)))\alpha^*\\ \eta_u(t) = (1 + (1/((1-\gamma)t)))\alpha^*\end{split}\]
where \(\alpha^*\) (final_lr) is scaled by a factor defined as the current value of \(\alpha\) (set by set_learning_rate(lr)) over the initial value of \(\alpha\), so that learning rate scheduling is properly applied to both \(\alpha\) and \(\alpha^*\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)). Note this does not appear in the paper.
final_lr (float) – Final (SGD) learning rate.
gamma (float) – Convergence speed of the bound functions.
bias_correction (bool) – Apply bias correction to moving averages defined in ADAM. Note this does not appear in the paper.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.AdamW(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, wd=0.0001)¶
ADAM optimizer with decoupled weight decay.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{v_t} + \epsilon} - \eta_t\lambda w_t\end{split}\]
where \(g_t\) denotes a gradient, \(\lambda\) is the decoupled weight decay rate, and \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
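A scalar sketch of the AdamW rule above, contrasting it with plain Adam: the decay term \(\eta_t\lambda w_t\) is subtracted from the weight directly rather than folded into the gradient. The helper and the eta_t schedule multiplier are hypothetical illustrations:

```python
import math

def adamw_update(w, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-4, eta_t=1.0):
    # Standard Adam moments and bias correction ...
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    alpha_t = alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    # ... plus the decoupled decay term applied to the weight itself.
    w = w - alpha_t * m / (math.sqrt(v) + eps) - eta_t * wd * w
    return w, m, v

w, m, v = adamw_update(1.0, 0.0, 0.0, grad=1.0, t=1)
```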
- nnabla.solvers.SgdW(lr=0.001, momentum=0.9, wd=0.0001)¶
Stochastic gradient descent (SGD) optimizer with decoupled weight decay.
\[\begin{split}v_t \leftarrow \gamma v_{t-1} + \eta g_t - (\eta / \eta_0)\lambda v_{t-1}\\ w_{t+1} \leftarrow w_t - v_t\end{split}\]
where \(\lambda\) is the decoupled weight decay rate.
- Parameters
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver