Solvers¶
The nnabla.solvers.Solver class represents a stochastic-gradient-descent-based optimizer for updating the parameters of a computation graph. NNabla provides the various solvers listed below.
Solver¶
- class nnabla.solvers.Solver¶
Solver interface class.
The same API provided by this class can be used to implement various types of solvers.
Example:

    # Network building comes above
    import nnabla as nn
    import nnabla.solvers as S

    solver = S.Sgd(lr=1e-3)
    solver.set_parameters(nn.get_parameters())

    for itr in range(num_itr):
        x.d = ...  # set data
        t.d = ...  # set label
        loss.forward()
        solver.zero_grad()  # Set all gradient buffers to 0
        loss.backward()
        solver.weight_decay(decay_rate)  # Apply weight decay
        solver.clip_grad_by_norm(clip_norm)  # Apply gradient clipping by norm
        solver.update()  # Update parameters
Note
All solvers provided by NNabla are subclasses of Solver. This class itself is never instantiated directly.
- check_inf_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any inf in the gradients that have been set up.
- check_inf_or_nan_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any inf or nan in the gradients that have been set up.
- check_nan_grad(self, pre_hook=None, post_hook=None)¶
Check if there is any nan in the gradients that have been set up.
- clear_parameters(self)¶
Clear all registered parameters and states.
- clip_grad_by_norm(self, float clip_norm, pre_hook=None, post_hook=None)¶
Clip gradients by norm. When called, each gradient is clipped by the given norm value.
- Parameters
clip_norm (float) – The value of clipping norm.
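Conceptually, a sketch of the clipping rule (the exact per-parameter behavior is defined in the C++ core): each gradient \(g\) is rescaled so that its norm does not exceed the threshold \(c\) (clip_norm):
\[g \leftarrow \frac{c}{\max(c, \|g\|_2)} g\]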
- get_parameters(self)¶
Get all registered parameters.
- get_states(self)¶
Get all states.
- info¶
- Type
object
- learning_rate(self)¶
Get the learning rate.
- load_states(self, path)¶
Load solver states.
- Parameters
path – Path to the state file to be loaded.
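A minimal checkpointing sketch combining save_states and load_states (file names are hypothetical):

    import nnabla as nn
    import nnabla.solvers as S

    solver = S.Adam(alpha=1e-3)
    solver.set_parameters(nn.get_parameters())

    # Save network parameters and solver states (e.g. Adam moments) separately.
    nn.save_parameters("params.h5")
    solver.save_states("solver_states.h5")

    # To resume: restore parameters, register them, then load the states.
    nn.load_parameters("params.h5")
    solver.set_parameters(nn.get_parameters())
    solver.load_states("solver_states.h5")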
- name¶
Get the name of the solver.
- remove_parameters(self, vector[string] keys)¶
Remove previously registered parameters, specified by a vector of keys.
- save_states(self, path)¶
Save solver states.
- Parameters
path – Path or file object.
- scale_grad(self, scale, pre_hook=None, post_hook=None)¶
Rescale gradients by the given scale factor.
- set_learning_rate(self, learning_rate)¶
Set the learning rate.
- set_parameters(self, param_dict, bool reset=True, bool retain_state=False)¶
Set parameters by a dictionary of keys and parameter Variables.
- Parameters
param_dict (dict) – key: string, value: Variable.
reset (bool) – If true, clear all parameters before setting them. If false, parameters are overwritten or added (if new).
retain_state (bool) – The value is only considered if reset is false. If true and a key already exists (overwriting), a state (such as momentum) associated with the key will be kept if the shape of the existing parameter and that of the new parameter match.
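A sketch of registering parameters incrementally while keeping solver states for overwritten keys (new_params is a hypothetical dict of name to Variable):

    # Initial registration.
    solver.set_parameters(nn.get_parameters())

    # Add or overwrite parameters without clearing the existing ones,
    # retaining momentum-like states where parameter shapes match.
    solver.set_parameters(new_params, reset=False, retain_state=True)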
- set_states(self, states)¶
Set states. Call set_parameters to initialize the states of the solver first; otherwise this method raises a ValueError.
- set_states_from_protobuf(self, optimizer_proto)¶
Set states to the solver from the protobuf file.
Internally used helper method.
- set_states_to_protobuf(self, optimizer)¶
Set states to the protobuf file from the solver.
Internally used helper method.
- setup(self, params)¶
Deprecated. Call set_parameters with param_dict instead.
- update(self, update_pre_hook=None, update_post_hook=None)¶
When this function is called, parameter values are updated using the gradients accumulated during backpropagation, stored in the grad field of each parameter Variable. Update rules are implemented in the C++ core, in classes derived from Solver. The updated parameter values are stored in the data field of each parameter Variable.
- Parameters
update_pre_hook (callable) – This callable object is called immediately before each parameter update. The default is None.
update_post_hook (callable) – This callable object is called immediately after each parameter update. The default is None.
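A minimal sketch of using the hooks, here to time each update (the hook functions are hypothetical):

    import time

    t0 = [None]

    def before_update():
        t0[0] = time.time()

    def after_update():
        print("update took {:.6f} s".format(time.time() - t0[0]))

    solver.update(update_pre_hook=before_update, update_post_hook=after_update)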
- weight_decay(self, float decay_rate, pre_hook=None, post_hook=None)¶
Apply weight decay to gradients. When called, the current parameter value scaled by the decay rate is added to each gradient.
- Parameters
decay_rate (float) – The coefficient of weight decay.
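This corresponds to the standard L2 weight-decay modification of the gradient:
\[g \leftarrow g + \lambda w\]
where \(\lambda\) is decay_rate and \(w\) is the current parameter value.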
- zero_grad(self)¶
Initialize the gradients of all registered parameters to zero.
List of solvers¶
- nnabla.solvers.Sgd(lr=0.001)¶
Stochastic gradient descent (SGD) optimizer.
\[w_{t+1} \leftarrow w_t - \eta \Delta w_t\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
- nnabla.solvers.Momentum(lr=0.001, momentum=0.9)¶
SGD with Momentum.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} + \eta \Delta w_t\\ w_{t+1} &\leftarrow w_t - v_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
momentum (float) – Decay rate of momentum (\(\gamma\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
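As a cross-check of the Momentum update rule shown above, a minimal NumPy sketch of a single step (illustrative only; the actual rule is implemented in the C++ core):

    import numpy as np

    def momentum_step(w, grad, v, lr=0.001, momentum=0.9):
        # v_t <- gamma * v_{t-1} + eta * grad
        v = momentum * v + lr * grad
        # w_{t+1} <- w_t - v_t
        return w - v, v

    w, v = np.ones(3), np.zeros(3)
    grad = np.array([0.1, -0.2, 0.3])
    w, v = momentum_step(w, grad, v)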
- nnabla.solvers.Lars(lr=0.001, momentum=0.9, coefficient=0.001, eps=1e-06)¶
LARS with Momentum.
\[\begin{split}\lambda &\leftarrow \eta \frac{\| w_t \|}{\| \Delta w_t + \beta w_t \|} \\ v_{t+1} &\leftarrow m v_t + \gamma \lambda (\Delta w_t + \beta w_t) \\ w_{t+1} &\leftarrow w_t - v_{t+1}\end{split}\]
- Parameters
lr (float) – Learning rate (\(\gamma\)).
momentum (float) – Decay rate of momentum (\(m\)).
coefficient (float) – Trust coefficient (\(\eta\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.Nesterov(lr=0.001, momentum=0.9)¶
Nesterov Accelerated Gradient optimizer.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} - \eta \Delta w_t\\ w_{t+1} &\leftarrow w_t - \gamma v_{t-1} + \left(1 + \gamma \right) v_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
momentum (float) – Decay rate of momentum (\(\gamma\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
References
Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\).
- nnabla.solvers.Adadelta(lr=1.0, decay=0.95, eps=1e-06)¶
AdaDelta optimizer.
\[\begin{split}g_t &\leftarrow \Delta w_t\\ v_t &\leftarrow - \frac{RMS \left[ v \right]_{t-1}}{RMS \left[ g \right]_t} g_t\\ w_{t+1} &\leftarrow w_t + \eta v_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
decay (float) – Decay rate (\(\gamma\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.Adagrad(lr=0.01, eps=1e-08)¶
ADAGrad optimizer.
\[\begin{split}g_t &\leftarrow \Delta w_t\\ G_t &\leftarrow G_{t-1} + g_t^2\\ w_{t+1} &\leftarrow w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AdaBelief(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, wd=0.0, amsgrad=False, weight_decouple=False, fixed_decay=False, rectify=False)¶
AdaBelief optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ s_t &\leftarrow \beta_2 s_{t-1} + (1 - \beta_2) (g_t - m_t)^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{s_t + \epsilon} + \epsilon}\end{split}\]
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
wd (float) – Weight decay rate. This option only takes effect when weight_decouple option is enabled.
amsgrad (bool) – Perform AMSGrad variant of AdaBelief.
weight_decouple (bool) – Perform decoupled weight decay as in AdamW.
fixed_decay (bool) – If True, the weight decay ratio will be kept fixed. Note that this option only takes effect when weight_decouple option is enabled.
rectify (bool) – Perform RAdam variant of AdaBelief.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.RMSprop(lr=0.001, decay=0.9, eps=1e-08)¶
RMSprop optimizer (Geoffrey Hinton).
\[\begin{split}g_t &\leftarrow \Delta w_t\\ v_t &\leftarrow \gamma v_{t-1} + \left(1 - \gamma \right) g_t^2\\ w_{t+1} &\leftarrow w_t - \eta \frac{g_t}{\sqrt{v_t} + \epsilon}\end{split}\]
- Parameters
lr (float) – Learning rate (\(\eta\)).
decay (float) – Decay rate (\(\gamma\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.RMSpropGraves(lr=0.0001, decay=0.95, momentum=0.9, eps=0.0001)¶
RMSpropGraves optimizer (Alex Graves).
\[\begin{split}n_t &\leftarrow \rho n_{t-1} + \left(1 - \rho \right) {e_t}^2\\ g_t &\leftarrow \rho g_{t-1} + \left(1 - \rho \right) e_t\\ d_t &\leftarrow \beta d_{t-1} - \eta \frac{e_t}{\sqrt{n_t - {g_t}^2 + \epsilon}}\\ w_{t+1} &\leftarrow w_t + d_t\end{split}\]
where \(e_t\) denotes the gradient.
- Parameters
lr (float) – Learning rate (\(\eta\)).
decay (float) – Decay rate (\(\rho\)).
momentum (float) – Momentum (\(\beta\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.Adam(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08)¶
ADAM optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{v_t} + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, and let \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
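As a cross-check of the Adam update rule shown above, a minimal NumPy sketch of a single step (illustrative only; the actual rule is implemented in the C++ core):

    import numpy as np

    def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad       # first-order moment
        v = beta2 * v + (1 - beta2) * grad ** 2  # second-order moment
        # Bias-corrected step size, as in the formula above.
        step = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        return w - step * m / (np.sqrt(v) + eps), m, v

    w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    grad = np.array([0.1, -0.2, 0.3])
    w, m, v = adam_step(w, grad, m, v, t=1)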
- nnabla.solvers.AdaBound(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, final_lr=0.1, gamma=0.001)¶
AdaBound optimizer applies dynamic bounds on learning rates to Adam.
\[\begin{split}w_{t+1} &\leftarrow w_t - \eta_t m_t\\ \eta_t &= \mathrm{clip}\left( \alpha\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)(\sqrt{v_t} + \epsilon)},\ \eta_l(t),\ \eta_u(t)\right)\\ \eta_l(t) &= \left(1 - \frac{1}{(1-\gamma)t+1}\right)\alpha^*\\ \eta_u(t) &= \left(1 + \frac{1}{(1-\gamma)t}\right)\alpha^*\end{split}\]
where \(\alpha^*\) (final_lr) is scaled by a factor defined as the current value of \(\alpha\) (set by set_learning_rate(lr)) over the initial value of \(\alpha\), so that learning rate scheduling is properly applied to both \(\alpha\) and \(\alpha^*\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
final_lr (float) – Final (SGD) learning rate.
gamma (float) – Convergence speed of the bound functions.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
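Because \(\alpha^*\) tracks the ratio of the current to the initial \(\alpha\), a learning rate schedule applied through set_learning_rate also scales the bounds. A minimal sketch (the parameter name "x" is hypothetical):

    import nnabla as nn
    import nnabla.solvers as S

    x = nn.parameter.get_parameter_or_create("x", (3,))
    solver = S.AdaBound(alpha=1e-3, final_lr=0.1)
    solver.set_parameters(nn.get_parameters())

    # Halving alpha also halves the effective final_lr internally,
    # so the bound functions shrink together with the step size.
    solver.set_learning_rate(0.5e-3)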
- nnabla.solvers.Adamax(alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-08)¶
ADAMAX optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \max\left(\beta_2 v_{t-1}, |g_t|\right)\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{v_t + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\), and \(v_t\) is an exponentially weighted infinity norm of the sequence of gradients up to \(t\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AMSGRAD(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, bias_correction=False)¶
AMSGRAD optimizer.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ \hat{v}_t &= \max(\hat{v}_{t-1}, v_t)\\ w_{t+1} &\leftarrow w_t - \alpha \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}\end{split}\]
where \(g_t\) denotes a gradient, and let \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)). Note this does not appear in the paper.
bias_correction (bool) – Apply bias correction to moving averages defined in ADAM. Note this does not appear in the paper.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AMSBound(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, final_lr=0.1, gamma=0.001, bias_correction=False)¶
AMSBound optimizer applies dynamic bounds on learning rates to AMSGrad.
\[\begin{split}w_{t+1} &\leftarrow w_t - \eta_t m_t\\ \eta_t &= \mathrm{clip}\left( \alpha\frac{\sqrt{1 - \beta_2^t}}{(1 - \beta_1^t)(\sqrt{\hat{v}_t} + \epsilon)},\ \eta_l(t),\ \eta_u(t)\right)\\ \hat{v}_t &= \max(\hat{v}_{t-1}, v_t)\\ \eta_l(t) &= \left(1 - \frac{1}{(1-\gamma)t+1}\right)\alpha^*\\ \eta_u(t) &= \left(1 + \frac{1}{(1-\gamma)t}\right)\alpha^*\end{split}\]
where \(\alpha^*\) (final_lr) is scaled by a factor defined as the current value of \(\alpha\) (set by set_learning_rate(lr)) over the initial value of \(\alpha\), so that learning rate scheduling is properly applied to both \(\alpha\) and \(\alpha^*\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)). Note this does not appear in the paper.
final_lr (float) – Final (SGD) learning rate.
gamma (float) – Convergence speed of the bound functions.
bias_correction (bool) – Apply bias correction to moving averages defined in ADAM. Note this does not appear in the paper.
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
- nnabla.solvers.AdamW(alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-08, wd=0.0001)¶
ADAM optimizer with decoupled weight decay.
\[\begin{split}m_t &\leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\ v_t &\leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\ w_{t+1} &\leftarrow w_t - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \frac{m_t}{\sqrt{v_t} + \epsilon} - \eta_t\lambda w_t\end{split}\]
where \(g_t\) denotes a gradient, \(\lambda\) is the decoupled weight decay rate, and \(m_0 \leftarrow 0\) and \(v_0 \leftarrow 0\).
- Parameters
alpha (float) – Step size (\(\alpha\)).
beta1 (float) – Decay rate of first-order momentum (\(\beta_1\)).
beta2 (float) – Decay rate of second-order momentum (\(\beta_2\)).
eps (float) – Small value for avoiding zero division (\(\epsilon\)).
wd (float) – Decoupled weight decay rate (\(\lambda\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.
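Since the decay term \(\eta_t \lambda w_t\) is applied inside update() as the formula shows, a separate weight_decay() call is not needed with AdamW; a minimal sketch:

    import nnabla.solvers as S

    # wd is the decoupled decay rate (lambda); decay happens in update().
    solver = S.AdamW(alpha=1e-3, wd=1e-4)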
- nnabla.solvers.SgdW(lr=0.001, momentum=0.9, wd=0.0001)¶
Stochastic gradient descent (SGD) optimizer with decoupled weight decay.
\[\begin{split}v_t &\leftarrow \gamma v_{t-1} + \eta g_t - (\eta / \eta_0)\lambda v_{t-1}\\ w_{t+1} &\leftarrow w_t - v_t\end{split}\]
where \(\lambda\) is the decoupled weight decay rate and \(\eta_0\) is the initial learning rate.
- Parameters
lr (float) – Learning rate (\(\eta\)).
momentum (float) – Decay rate of momentum (\(\gamma\)).
wd (float) – Decoupled weight decay rate (\(\lambda\)).
- Returns
An instance of Solver class. See Solver API guide for details.
- Return type
Solver
Note
You can instantiate a preferred target implementation (e.g. CUDA) of a Solver given a Context. A Context can be set by nnabla.set_default_context(ctx) or nnabla.context_scope(ctx). See the API docs.