# Parametric Functions¶

In NNabla, trainable models are created by composing functions that have optimizable parameters. These functions are called parametric functions. Parametric functions are provided by nnabla.parametric_functions.

## Parameter Management API¶

Parameters registered by the functions in List of Parametric Functions can be managed using the APIs described in this section.

nnabla.parameter.parameter_scope(name, scope=None)[source]

Groups parameters registered by the parametric functions listed in nnabla.parametric_functions.

Parameters
• name (str) – Parameter scope name.

• scope (OrderedDict, optional) – Specify the current parameter scope as a local dictionary. The default value is None, in which case the parameter scope maintained globally is used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.functions as F

with nn.parameter_scope('conv1'):
    conv_out1 = PF.convolution(x, 32, (5, 5))
    bn_out1 = PF.batch_normalization(conv_out1)
    act_out1 = F.relu(bn_out1)
with nn.parameter_scope('conv2'):
    conv_out2 = PF.convolution(act_out1, 64, (3, 3))
    bn_out2 = PF.batch_normalization(conv_out2)
    act_out2 = F.relu(bn_out2)


Nesting with statement blocks allows you to nest parameter scopes. This can also be done by using “/” inside the parameter names.

Example:

with nn.parameter_scope('network1'):
    with nn.parameter_scope('conv1'):
        conv_out1 = PF.convolution(x, 32, (5, 5))
        bn_out1 = PF.batch_normalization(conv_out1)
        act_out1 = F.relu(bn_out1)
    with nn.parameter_scope('conv2'):
        conv_out2 = PF.convolution(act_out1, 64, (3, 3))
        bn_out2 = PF.batch_normalization(conv_out2)
        act_out2 = F.relu(bn_out2)


is equivalent to

with nn.parameter_scope('network1/conv1'):
    conv_out1 = PF.convolution(x, 32, (5, 5))
    bn_out1 = PF.batch_normalization(conv_out1)
    act_out1 = F.relu(bn_out1)
with nn.parameter_scope('network1/conv2'):
    conv_out2 = PF.convolution(act_out1, 64, (3, 3))
    bn_out2 = PF.batch_normalization(conv_out2)
    act_out2 = F.relu(bn_out2)

nnabla.parameter.get_current_parameter_scope()[source]

Returns current parameter scope.

nnabla.parameter.get_parameters(params=None, path='', grad_only=True)[source]

Get parameter Variables under the current parameter scope.

Parameters
• params (dict) – Internal use. User doesn’t set it manually.

• path (str) – Internal use. User doesn’t set it manually.

• grad_only (bool) – Retrieve all parameters under the current scope if False, while only parameters with need_grad=True are retrieved if True.

Returns

Parameter Variables under the current parameter scope.

Return type

dict
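For example, the registered parameters can be inspected as follows (a minimal sketch; the scope name and shapes are illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((8, 10))
with nn.parameter_scope('fc'):
    y = PF.affine(x, 4)

# Keys are slash-delimited scope paths; values are parameter Variables.
for name, param in nn.get_parameters().items():
    print(name, param.shape)  # fc/affine/W (10, 4), fc/affine/b (4,)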

nnabla.parameter.clear_parameters()[source]

Clear all parameters in the current scope.

nnabla.parameter.save_parameters(path, params=None, extension=None)[source]

Save all parameters into a file with the specified format.

Currently hdf5 and protobuf formats are supported.

Parameters

nnabla.parameter.load_parameters(path, ...)[source]

Load parameters from a file with the specified format.

Parameters

path – path or file object
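A minimal save/load round trip might look as follows (a sketch; the file name is illustrative, and the format is chosen by the extension, e.g. .h5 for HDF5):

import nnabla as nn

# Save all parameters in the current scope to an HDF5 file.
nn.save_parameters('parameters.h5')

# Later, e.g. in another process: clear the scope and restore the
# parameters before rebuilding the graph.
nn.clear_parameters()
nn.load_parameters('parameters.h5')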

nnabla.parameter.get_parameter_or_create(name, shape=None, initializer=None, need_grad=True, as_need_grad=None)[source]

Returns an existing parameter variable in the current parameter scope with the provided name.

If a variable with the provided name does not exist, a new variable is created and registered to the current parameter scope with the name, then returned.

Parameters
• name (str) – The name under the current scope. If it already exists, the name is queried from the parameter manager.

• shape (tuple of int) – Shape of the created parameter. The shape of the specified parameter must match this shape. The default is None, which is only valid if initializer is given as a numpy.ndarray.

• initializer (nnabla.initializer.BaseInitializer or numpy.ndarray) – An initialization function to be applied to the parameter. numpy.ndarray can also be given to initialize parameters from numpy array data.

• need_grad (bool) – Register the parameter with the specified need_grad flag. The default is True. If the flag is different from the previously specified one, the flag will be overwritten, but the values will be kept.

• as_need_grad (bool) – Get a parameter variable with the specified need_grad flag. Note that this doesn’t overwrite the flag of the registered parameter variable with the provided name. Instead, if the given flag mismatches with the previously registered need_grad flag, it returns a new variable referring to the same array contents but with need_grad=as_need_grad.

Note

It returns a Variable which is unlinked from the registered one in the current parameter scope (using nnabla.Variable.get_unlinked_variable()). That means changing a need_grad attribute doesn’t affect the variable existing in the current parameter scope.
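Example (a minimal sketch; the scope and parameter names are illustrative):

import numpy as np
import nnabla as nn
from nnabla.parameter import get_parameter_or_create

# Creates 'my_layer/W' on the first call; subsequent calls with the same
# name return the already-registered parameter.
init = np.random.randn(10, 4).astype(np.float32)
with nn.parameter_scope('my_layer'):
    W = get_parameter_or_create('W', shape=init.shape, initializer=init)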

## List of Parametric Functions¶

Parametric functions are provided by nnabla.parametric_functions, as listed below. Like the functions listed in Functions, they take Variable(s) as their first argument(s), followed by options specific to each parametric function. In addition, they register parameter Variable(s) into the parameter scope.

The parameter variables are registered with need_grad properties specific to each parametric function. Variables with the need_grad=False flag will not be updated by gradient descent; hence, backward computation is not executed for those variables. False is usually specified when the parameters are updated during the forward and/or backward pass, e.g., in batch normalization.

All parametric functions take an optional argument fix_parameters=False. By giving True, the associated parameter variables are connected to the computation graph with the property need_grad=False regardless of the properties of the registered variables, so backward gradient computation is not executed for those variables. This is useful when you create a computation graph for evaluation purposes, fix parameters partially in a graph, and so on, as in the sketch below.
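A minimal sketch (the layer name is illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.functions as F

x = nn.Variable((8, 3, 32, 32))
# The 'conv1' parameters enter this graph with need_grad=False, so they
# are not updated even if the graph is used for training.
h = F.relu(PF.convolution(x, 16, (3, 3), pad=(1, 1),
                          fix_parameters=True, name='conv1'))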

All parametric functions listed below are decorated with the following decorator.

nnabla.parametric_functions.parametric_function_api(scope_name=None, param_desc=None)[source]

Decorator for parametric functions.

The decorated function is always called under a parameter scope scope_name. Also, the decorator adds an additional argument name (str, default is None) at the end. If name is specified, the scope scope_name is nested under a scope named name. This feature can reduce the vertical space usage of the source code. Any parametric function should be decorated with this.

Parameters
• scope_name (str, optional) – The original function will be called under a parameter scope named by scope_name.

• param_desc (list, optional) – Descriptions of parameters will be automatically included into docstring. This must be a list of tuples with 4 elements composed of (name (str), description (str), shape info (str), need_grad (bool)).

Returns

A decorated parametric function.

Return type

function

See Parameter Management API to know how to query and manipulate registered variables.
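The following is a minimal sketch of a user-defined parametric function built with this decorator; the layer, its scope name "my_scale", and its parameter g are illustrative and not part of NNabla:

import numpy as np
import nnabla as nn
from nnabla.parameter import get_parameter_or_create
from nnabla.parametric_functions import parametric_function_api

@parametric_function_api("my_scale", [
    ('g', 'Per-channel scaling factor.', '(1, C)', True),
])
def my_scale(x, fix_parameters=False):
    # Registered under the scope "my_scale", e.g. as 'my_scale/g' or
    # 's1/my_scale/g' when called with name='s1'.
    g = get_parameter_or_create(
        'g', (1, x.shape[1]),
        initializer=np.ones((1, x.shape[1]), dtype=np.float32),
        need_grad=not fix_parameters)
    return x * g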

Here is the list of parametric functions.

nnabla.parametric_functions.affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)[source]

The affine layer, also known as the fully connected layer. Computes

${\mathbf y} = {\mathbf A} {\mathbf x} + {\mathbf b}.$

where $${\mathbf x}, {\mathbf y}$$ are the inputs and outputs respectively, and $${\mathbf A}, {\mathbf b}$$ are constants.

Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "affine";

• W (need_grad=True) : Weight matrix. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector. (shape: (outputs,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = affine(<args>)
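Example (a minimal sketch; names and shapes are illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((64, 128))
# Registers 'fc1/affine/W' with shape (128, 10) and 'fc1/affine/b'
# with shape (10,).
y = PF.affine(x, 10, name='fc1')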

nnabla.parametric_functions.convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)[source]

N-D Convolution with a bias term.

For Dilated Convolution (a.k.a. Atrous Convolution), refer to:

Note

Convolution is a computationally intensive operation that should preferably be run with the cudnn backend. NNabla then uses CuDNN library functions to determine and cache the fastest algorithm for the given set of convolution parameters, which results in additional memory consumption that may pose a problem for GPUs with insufficient memory. In that case, the NNABLA_CUDNN_WORKSPACE_LIMIT environment variable can be used to restrict the choice of algorithms to those that fit the given workspace memory limit, expressed in bytes. In some cases it may also be desirable to restrict the automatic search to algorithms that produce deterministic (reproducible) results. This can be requested by setting the environment variable NNABLA_CUDNN_DETERMINISTIC to a non-zero value.

Parameters
Returns

N-D array. See convolution for the output shape.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "conv";

• W (need_grad=True) : Filter weights. (shape: (outmaps, inmaps // group, *kernel))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = convolution(<args>)
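Example (a minimal sketch; names and shapes are illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((16, 3, 32, 32))
# 3x3 kernels with 1-pixel padding keep the spatial size at 32x32;
# registers 'conv1/conv/W' and 'conv1/conv/b'.
h = PF.convolution(x, 64, (3, 3), pad=(1, 1), name='conv1')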

nnabla.parametric_functions.depthwise_convolution(inp, kernel, pad=None, stride=None, dilation=None, multiplier=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

N-D Depthwise Convolution with a bias term.

Reference:

Parameters
Returns

N-D array. See depthwise_convolution for the output shape.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "depthwise_conv";

• W (need_grad=True) : Filter weights. (shape: (inmaps * multiplier, *kernel))

• b (need_grad=True) : Bias vector. (shape: (inmaps * multiplier,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = depthwise_convolution(<args>)

nnabla.parametric_functions.deconvolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, output_padding=None, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)[source]

Deconvolution layer.

Parameters
Returns

N-D array. See deconvolution for the output shape.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "deconv";

• W (need_grad=True) : Filter weights. (shape: (inmaps, outmaps // group, *kernel))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = deconvolution(<args>)

nnabla.parametric_functions.depthwise_deconvolution(inp, kernel, pad=None, stride=None, dilation=None, divisor=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Depthwise deconvolution computes the transposed depthwise convolution for one-dimensional and two-dimensional input data.

Parameters
Returns

N-D array. See depthwise_deconvolution for the output shape.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "depthwise_deconv";

• W (need_grad=True) : Filter weights. (shape: (inmaps,) + kernel)

• b (need_grad=True) : Bias vector. (shape: (inmaps / divisor,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = depthwise_deconvolution(<args>)

nnabla.parametric_functions.deformable_convolution(inp, outmaps, kernel, offset, mask=None, pad=None, stride=None, dilation=None, group=1, deformable_group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)[source]

2D Deformable Convolution with a bias term. If mask is given, this function becomes Deformable Convolution v2.

Parameters
Returns

N-D array. See convolution for the output shape.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "deformable_conv";

• W (need_grad=True) : Filter weights. (shape: (outmaps, inmaps // group, *kernel))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = deformable_convolution(<args>)

nnabla.parametric_functions.batch_normalization(inp, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]

Batch normalization layer.

$\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i\\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2\\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon }}\\ y_i &= & \hat{x}_i \gamma + \beta. \end{array}\end{split}$

where $$x_i, y_i$$ are the inputs. At test time, the mean and variance computed by the moving averages recorded during training are used.

Parameters
• inp (Variable) – N-D array of input.

• axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the rest axes. For example, if an input is 4 dimensions, and axes is [1], batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using numpy expression as an example).

• decay_rate (float) – Decay rate of running mean and variance.

• eps (float) – Tiny value to avoid zero division by std.

• batch_stat (bool) – Use mini-batch statistics rather than running ones.

• output_stat (bool) – Output batch mean and variance.

• fix_parameters (bool) – When set to True, the beta and gamma will not be updated.

• param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'beta', 'gamma', 'mean' or 'var'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'beta': ConstantInitializer(0), 'gamma': np.ones(gamma_shape) * 2}.

• no_scale (bool) – If True, the scale term is omitted.

• no_bias (bool) – If True, the bias term is omitted.

Returns

N-D array.

Return type

Variable

References

The parameters have the same number of dimensions as the input data; the dimensions listed in axes match those of the input, while the rest are 1. If the input is 4-dimensional and axes=[1], the parameter shape will be param_shape = np.mean(inp.d, axis=(0, 2, 3), keepdims=True).shape (using numpy expression as an example).

Parameters to be registered

The following variables are registered in a parameter scope "bn";

• beta (need_grad=True) : Trainable bias $$\beta$$. (shape: <see above>)

• gamma (need_grad=True) : Trainable scaling factor $$\gamma$$. (shape: <see above>)

• mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)

• var (need_grad=False) : Moving average of batch variance. (shape: <see above>)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = batch_normalization(<args>)
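Since batch_stat switches between mini-batch and running statistics, a common pattern is to build a training graph and an evaluation graph over the same parameter scope (a minimal sketch; the scope name is illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((32, 16, 8, 8))
# Training graph: uses mini-batch statistics and updates the running
# mean/var.
h_train = PF.batch_normalization(x, batch_stat=True, name='bn1')
# Evaluation graph reusing the same 'bn1' parameters: uses the moving
# averages instead.
h_eval = PF.batch_normalization(x, batch_stat=False, name='bn1')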

nnabla.parametric_functions.fused_batch_normalization(inp, z=None, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, nonlinearity='relu', output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]

Batch normalization layer fused with the following add2 operation for a residual input and a nonlinear activation.

Parameters
• inp (Variable) – N-D array of input.

• z (Variable, optional) – A residual input. If None is specified, the activation function follows immediately after the BN operation.

• axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the rest axes. For example, if an input is 4 dimensions, and axes is [1], batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using numpy expression as an example).

• decay_rate (float) – Decay rate of running mean and variance.

• eps (float) – Tiny value to avoid zero division by std.

• batch_stat (bool) – Use mini-batch statistics rather than running ones.

• nonlinearity (string) – Activation function. The default is ‘relu’.

• output_stat (bool) – Output batch mean and variance.

• fix_parameters (bool) – When set to True, the beta and gamma will not be updated.

• no_scale (bool) – If True, the scale term is omitted.

• no_bias (bool) – If True, the bias term is omitted.

Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bn";

• beta (need_grad=True) : Trainable bias $$\beta$$. (shape: <see above>)

• gamma (need_grad=True) : Trainable scaling factor $$\gamma$$. (shape: <see above>)

• mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)

• var (need_grad=False) : Moving average of batch variance. (shape: <see above>)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = fused_batch_normalization(<args>)

nnabla.parametric_functions.sync_batch_normalization(inp, comm, group='world', axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]

Synchronized batch normalization layer.

For some tasks (e.g., semantic segmentation), the batch size may be too small, and the BatchNormalization layer might not work well. The SyncBatchNormalization layer solves this problem by synchronizing the batch statistics (mean and variance) between multiple processes.

$\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i\\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2\\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon }}\\ y_i &= & \hat{x}_i \gamma + \beta. \end{array}\end{split}$

where $$x_i, y_i$$ are the inputs.

Parameters
• inp (Variable) – N-D array of input.

• comm (Communicator) – The communicator

• group (string) – The name of the communicator group

• axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the rest axes. For example, if an input is 4 dimensions, and axes is [1], batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using numpy expression as an example).

• decay_rate (float) – Decay rate of running mean and variance.

• eps (float) – Tiny value to avoid zero division by std.

• batch_stat (bool) – Use mini-batch statistics rather than running ones.

• output_stat (bool) – Output batch mean and variance.

• fix_parameters (bool) – When set to True, the beta and gamma will not be updated.

• param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'beta', 'gamma', 'mean' or 'var'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'beta': ConstantInitializer(0), 'gamma': np.ones(gamma_shape) * 2}.

• no_scale (bool) – If True, the scale term is omitted.

• no_bias (bool) – If True, the bias term is omitted.

Returns

N-D array.

Return type

Variable

References

The parameters have the same number of dimensions as the input data; the dimensions listed in axes match those of the input, while the rest are 1. If the input is 4-dimensional and axes=[1], the parameter shape will be param_shape = np.mean(inp.d, axis=(0, 2, 3), keepdims=True).shape (using numpy expression as an example).

Parameters to be registered

The following variables are registered in a parameter scope "bn";

• beta (need_grad=True) : Trainable bias $$\beta$$. (shape: <see above>)

• gamma (need_grad=True) : Trainable scaling factor $$\gamma$$. (shape: <see above>)

• mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)

• var (need_grad=False) : Moving average of batch variance. (shape: <see above>)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = sync_batch_normalization(<args>)

nnabla.parametric_functions.mean_subtraction(inp, base_axis=1, update_running_mean=True, fix_parameters=False, name=None)[source]

Mean subtraction layer.

It subtracts the mean of the elements of the input array, normalizing the mean to $$0$$. Preprocessing arrays with this function has the effect of improving accuracy in various tasks such as image classification.

At training time, this function is defined as

$\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i \\ y_i &=& x_i - \mu \end{array}\end{split}$

At testing time, the mean values used are those that were computed during training by moving average.

Note

The backward performs an approximated differentiation that takes into account only the latest mini-batch.

Parameters
• inp (Variable) – N-D array of input.

• base_axis (int) – Base axis of the mean subtraction operation. Dimensions up to base_axis are treated as sample dimensions.

• update_running_mean (bool) – When set to True, the running mean is updated during the forward pass.

• fix_parameters (bool) – Dummy parameter. This argument does not affect anything.

Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "mean_subtraction";

• mean (need_grad=False) : Moving average. (shape: inp.shape[base_axis:])

• t (need_grad=False) : Minibatch counter used in forward pass. (shape: (1,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = mean_subtraction(<args>)

nnabla.parametric_functions.layer_normalization(inp, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]

Applies Layer Normalization over an input variable, which is defined as:

$\begin{split}\begin{eqnarray} \mu^l &=& \frac{1}{H} \sum_{i=1}^{H} x_i^l \\ \sigma^l &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^l - \mu^l\right)^2} \\ y &=& \frac{x - \mu^l}{\sigma^l + \epsilon} \gamma + \beta \end{eqnarray}\end{split}$

where $$x$$ and $$y$$ are the input and output variables, $$\mu^l$$ and $$\sigma^l$$ are the mean and std of each layer along the batch axis, and $$\gamma$$ and $$\beta$$ are trainable parameters.

Note

Unlike other normalizations, which apply a scalar scale and bias to each entire channel/plane, Layer Normalization applies per-element scale and bias.

References

Parameters
• inp (Variable) – An input variable.

• batch_axis (int or repeated int) – Axes along which the mean and variance are taken.

• eps (float) – Tiny value to avoid zero division by std.

• output_stat (bool) – If True, the calculated mean and variance are also returned.

• fix_parameters (bool) – When set to True, the beta and gamma will not be updated.

• param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma', 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.

• no_scale (bool) – If True, the scale term is omitted.

• no_bias (bool) – If True, the bias term is omitted.

Returns

• Variable: Normalized output.

• Variable: Mean (if output_stat=True).

• Variable: Std (if output_stat=True).

Return type

Parameters to be registered

The following variables are registered in a parameter scope "layer_normalization";

• beta (need_grad=True) : Trainable bias $$\beta$$. (shape: <see above>)

• gamma (need_grad=True) : Trainable scaling factor $$\gamma$$. (shape: <see above>)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = layer_normalization(<args>)

nnabla.parametric_functions.instance_normalization(inp, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]

Applies Instance Normalization over an input variable, which is defined as:

$\begin{split}\begin{eqnarray} \mu^i &=& \frac{1}{H} \sum_{i=1}^{H} x_i^i \\ \sigma^i &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^i - \mu^i\right)^2} \\ y &=& \frac{x - \mu^i}{\sigma^i + \epsilon} \gamma + \beta \end{eqnarray}\end{split}$

where $$x$$ and $$y$$ are the input and output variables, $$\mu^i$$ and $$\sigma^i$$ are the mean and std of each instance, calculated separately for each batch and channel, and $$\gamma$$ and $$\beta$$ are adaptive gains and biases.

If the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the shapes of the calculated mean and std are [B, C, 1, 1].

References

Parameters
Parameters to be registered

The following variables are registered in a parameter scope "instance_normalization";

• beta (need_grad=True) : Trainable bias $$\beta$$. (shape: <see above>)

• gamma (need_grad=True) : Trainable scaling factor $$\gamma$$. (shape: <see above>)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = instance_normalization(<args>)

nnabla.parametric_functions.group_normalization(inp, num_groups, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]

Applies Group Normalization over an input tensor, which is defined as:

$\begin{split}\begin{eqnarray} \mu^g &=& \frac{1}{H} \sum_{i=1}^{H} x_i^g \\ \sigma^g &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^g - \mu^g\right)^2} \\ y &=& \frac{x - \mu^g}{\sigma^g + \epsilon} \gamma + \beta \end{eqnarray}\end{split}$

where $$x$$ and $$y$$ are the input and output variables, $$\mu^g$$ and $$\sigma^g$$ are the mean and std of each group, which contains num_channels / num_groups channels, and $$\gamma$$ and $$\beta$$ are adaptive gains and biases.

The input channels, specified by channel_axis, are separated into num_groups groups, and the mean and std are calculated over each group. For example, if the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the input variable is first reshaped to [B, num_groups, C / num_groups, H, W] and standardized by its mean and std, whose shapes are [B, num_groups, C / num_groups, 1, 1]. Before returning, the output variable is reshaped again to the original input shape (= [B, C, H, W] in the case above).

References

Parameters
• inp (Variable) – An input variable.

• num_groups (int) – The number of groups. The channel dimension of x must be an integer multiple of num_groups.

• channel_axis (int) – Channel axis.

• batch_axis (int or repeated int) – Axes along which the mean and variance are taken.

• eps (float) – Tiny value to avoid zero division by std.

• output_stat (bool) – If True, the calculated mean and variance are also returned.

• fix_parameters (bool) – When set to True, the beta and gamma will not be updated.

• param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma', 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.

• no_scale (bool) – If True, the scale term is omitted.

• no_bias (bool) – If True, the bias term is omitted.

Returns

• Variable: Normalized output.

• Variable: Mean (if output_stat=True).

• Variable: Std (if output_stat=True).

Return type

Parameters to be registered

The following variables are registered in a parameter scope "group_normalization";

• beta (need_grad=True) : Trainable bias $$\beta$$. (shape: <see above>)

• gamma (need_grad=True) : Trainable scaling factor $$\gamma$$. (shape: <see above>)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = group_normalization(<args>)
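Example (a minimal sketch; shapes are illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((8, 32, 16, 16))
# The 32 channels are split into 8 groups of 4; the mean and std are
# computed per group.
h = PF.group_normalization(x, num_groups=8)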

nnabla.parametric_functions.rnn(x, h, w0_init=None, w_init=None, b_init=None, num_layers=1, nonlinearity='tanh', dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]

N-Step RNN (recurrent neural networks).

The N-Step RNN function implements an Elman RNN with a nonlinearity applied to the input sequence. It is defined as follows:

$h_t = \tanh(w_{ih}x_t+b_{ih}+w_{hh}h_{(t-1)}).$

We use the following notations to describe the inputs and outputs below. $$T$$: sequence length, $$B$$: batch size, $$I$$: input size, $$L$$: number of layers, $$D$$: number of directions, which can be either 1 or 2, $$H$$: hidden size.

References

Jeffrey L. Elman. “Finding Structure in Time.” Cognitive Science. 1990.

Parameters
Returns

• Variable: Output $$y$$ with shape $$(T, B, D * H)$$.

• Variable: Output $$h_n$$ with shape $$(L, D, B, H)$$.

Return type

Variable

Example

x = nn.Variable((seq_len, batch_size, input_size))
h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
y, hn = PF.rnn(x, h)

Parameters to be registered

The following variables are registered in a parameter scope "rnn";

• weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, H, I + H))

• weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, H, DH + H))

• bias (need_grad=True) : Biases. (shape: (L, D, H))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = rnn(<args>)

nnabla.parametric_functions.lstm(x, h, c, w0_init=None, w_init=None, b_init=None, num_layers=1, dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]

LSTM (long short-term memory).

Long Short-Term Memory, or LSTM, is a building block for recurrent neural network (RNN) layers. An LSTM unit consists of a cell and input, output, and forget gates, whose functions are defined as follows:

$\begin{split}f_t&&=\sigma(W_fx_t+U_fh_{t-1}+b_f) \\ i_t&&=\sigma(W_ix_t+U_ih_{t-1}+b_i) \\ o_t&&=\sigma(W_ox_t+U_oh_{t-1}+b_o) \\ c_t&&=f_t\odot c_{t-1}+i_t\odot\tanh(W_cx_t+U_ch_{t-1}+b_c) \\ h_t&&=o_t\odot\tanh(c_t).\end{split}$

We use the following notations to describe the inputs and outputs below. $$T$$: sequence length, $$B$$: batch size, $$I$$: input size, $$L$$: number of layers, $$D$$: number of directions, which can be either 1 or 2, $$H$$: hidden size.

References

S. Hochreiter, and J. Schmidhuber. “Long Short-Term Memory.” Neural Computation. 1997.

Parameters
Returns

• Variable: Output $$y$$ with shape $$(T, B, D * H)$$.

• Variable: Output $$h_n$$ with shape $$(L, D, B, H)$$.

• Variable: Output $$c_n$$ with shape $$(L, D, B, H)$$.

Return type

Variable

Example

x = nn.Variable((seq_len, batch_size, input_size))
h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
c = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
y, hn, cn = PF.lstm(x, h, c)

Parameters to be registered

The following variables are registered in a parameter scope "lstm";

• weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, 4, H, I + H))

• weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, 4, H, DH + H))

• bias (need_grad=True) : Biases. (shape: (L, D, 4, H))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = lstm(<args>)

nnabla.parametric_functions.gru(x, h, w0_init=None, w_init=None, b_init=None, num_layers=1, dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]

GRU (gated recurrent units).

GRU is defined as follows:

$\begin{split}r_t&&=\sigma(W_rx_t+U_rh_{t-1}+b_r) \\ z_t&&=\sigma(W_zx_t+U_zh_{t-1}+b_z) \\ n_t&&=\tanh(W_nx_t+b_{in}+r_n \odot (U_nh_{t-1}+b_{hn})) \\ h_t&&=(1-z_t) \odot n_t+z_t \odot h_{t-1}.\end{split}$

We use the following notations to describe the inputs and outputs below. $$T$$: sequence length, $$B$$: batch size, $$I$$: input size, $$L$$: number of layers, $$D$$: number of directions, which can be either 1 or 2, $$H$$: hidden size.

References

K. Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” Empirical Methods in Natural Language Processing. 2014.

Parameters
Returns

• Variable: Output $$y$$ with shape $$(T, B, D * H)$$.

• Variable: Output $$h_n$$ with shape $$(L, D, B, H)$$.

Return type

Variable

Example

x = nn.Variable((seq_len, batch_size, input_size))
h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
y, hn = PF.gru(x, h)

Parameters to be registered

The following variables are registered in a parameter scope "gru";

• weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, 3, H, I + H))

• weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, 3, H, DH + H))

• bias (need_grad=True) : Biases. (shape: (L, D, 4, H))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = gru(<args>)

nnabla.parametric_functions.embed(inp, n_inputs, n_features, initializer=None, fix_parameters=False, apply_w=None, name=None)[source]

Embed.

Embed slices a matrix/tensor with an indexing array/tensor. Weights are initialized with nnabla.initializer.UniformInitializer within the range of $$-\sqrt{3}$$ to $$\sqrt{3}$$.

Parameters
• x (Variable) – [Integer] Indices with shape $$(I_0, ..., I_N)$$

• n_inputs – Number of possible inputs, words, or vocabulary entries.

• n_features – Number of embedding features.

• fix_parameters (bool) – When set to True, the embedding weight matrix will not be updated.

• apply_w (function) – Lambda, function, or callable object applied to the weights.

Returns

Output with shape $$(I_0, ..., I_N, W_1, ..., W_M)$$

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "embed";

• W (need_grad=True) : Embedding matrix. (shape: (n_inputs, n_features))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = embed(<args>)
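Example (a minimal sketch; the vocabulary and feature sizes are illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

# A 100-entry vocabulary embedded into 16-dimensional vectors.
idx = nn.Variable((8, 20))      # integer indices in [0, 100)
emb = PF.embed(idx, 100, 16)    # output shape: (8, 20, 16)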

nnabla.parametric_functions.prelu(inp, base_axis=1, shared=True, fix_parameters=False, slope_init=None, name=None)[source]

Parametrized Rectified Linear Unit function defined as

$y_i = \max(0, x_i) + w_i \min(0, x_i)$

where negative slope $$w$$ is learned and can vary across channels (an axis specified with base_axis). Weights are initialized with $$-1$$.

Parameters
Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "prelu";

• slope (need_grad=True) : Negative slope. (shape: tuple() if shared else (inp.shape[base_axis],))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = prelu(<args>)

nnabla.parametric_functions.svd_affine(inp, n_outmaps, r, base_axis=1, uv_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

SVD affine is a low-rank approximation of the affine layer. It can be seen as two consecutive affine layers with a bottleneck. It computes:

${\mathbf y} = {\mathbf U} {\mathbf V} {\mathbf x} + {\mathbf b}.$

where $${\mathbf x}, {\mathbf y}$$ are the inputs and outputs respectively, and $${\mathbf U}, {\mathbf V}, {\mathbf b}$$ are constants.

The weights $${\mathbf U}$$ and $${\mathbf V}$$ are approximated with singular value decomposition (SVD) of the original weight matrix $${\mathbf W}$$ and by selecting the $${R}$$ dominant singular values and the corresponding singular vectors. Therefore the low rank $${R}$$ is the size of the bottleneck.

If uv_init is a numpy array, $${\mathbf U}$$ and $${\mathbf V}$$ are computed such that uv_init is approximated by $${\mathbf{UV}}$$. If uv_init is None or an initializer, the product of $${\mathbf U}$$ and $${\mathbf V}$$ approximates the random initialization.

If $${\mathbf U}$$ and $${\mathbf V}$$ exist in the context, they take precedence over uv_init.

Suppose the weight matrix of the affine layer is $${I \times O}$$ and the compression rate you want to specify is $${CR}$$; then you set $${R}$$ as

$R = \left\lfloor \frac{(1 - CR)OI}{O + I} \right\rfloor.$
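For example, the rank for a target compression rate can be computed as follows (a minimal sketch with illustrative sizes):

import math

# An affine layer with I=1000 inputs and O=500 outputs, compressed
# at rate CR=0.75.
I, O, CR = 1000, 500, 0.75
R = math.floor((1 - CR) * O * I / (O + I))
print(R)  # 83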
Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "svd_affine";

• U (need_grad=True) : $${\mathbf U}$$. (shape: (inmaps, r))

• V (need_grad=True) : $${\mathbf V}$$. (shape: (r, outmaps))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = svd_affine(<args>)

nnabla.parametric_functions.svd_convolution(inp, outmaps, kernel, r, pad=None, stride=None, dilation=None, uv_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

SVD convolution is a low-rank approximation of the convolution layer. It can be seen as a depthwise convolution followed by a 1x1 convolution.

The flattened kernels for the i-th input map are expressed by their low rank approximation. The kernels for the i-th input $${\mathbf W_i}$$ are approximated with the singular value decomposition (SVD) and by selecting the $${R}$$ dominant singular values and the corresponding singular vectors.

${\mathbf W_{:,i,:}} \approx {\mathbf U_i} {\mathbf V_i}.$

$${\mathbf U}$$ contains the weights of the depthwise convolution with multiplier $${R}$$ and $${\mathbf V}$$ contains the weights of the 1x1 convolution.

If uv_init is a numpy array, $${\mathbf U}$$ and $${\mathbf V}$$ are computed such that uv_init is approximated by $${\mathbf{UV}}$$. If uv_init is None or an initializer, the product of $${\mathbf U}$$ and $${\mathbf V}$$ approximates the random initialization.

If $${\mathbf U}$$ and $${\mathbf V}$$ exist in the context, they take precedence over uv_init.

Suppose the kernel tensor of the convolution is $${O \times I \times K \times K}$$ and the compression rate you want to specify is $${CR}$$; then you set $${R}$$ as

$R = \left\lfloor \frac{(1 - CR)OIK^2}{I(O + K^2)} \right\rfloor.$
Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "svd_conv";

• U (need_grad=True) : Decomposed filter weights $${\mathbf U}$$. (shape: (inmaps * r, *kernel))

• V (need_grad=True) : Decomposed filter weights $${\mathbf V}$$. (shape: (outmaps, inmaps * r, 1, ...))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = svd_convolution(<args>)

nnabla.parametric_functions.cpd3_convolution(inp, outmaps, kernel, r, pad=None, stride=None, dilation=None, oik_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, max_iter=500, stopping_criterion=1e-05, lambda_reg=0.0, name=None)[source]

CP convolution is a low-rank approximation of a convolution layer. A 3D tensor containing the parameters is built by collapsing the N-D kernels into 1D; the tensor is then decomposed into three matrices. The decomposed layer can be seen as a linear combination of the input feature maps into $${R}$$ feature maps, followed by a depthwise convolution, followed by a linear combination of the feature maps to compute the output feature maps.

The CP decomposition approximates the kernel tensor by $${R}$$ rank-1 tensors of the form:

$\sum_{r=1}^{R} \lambda_r {\mathbf{o}^{(r)} \otimes \mathbf{i}^{(r)} \otimes \mathbf{k}^{(r)}},$

where $${\lambda}_r$$ is the normalization coefficient and $${\otimes}$$ is the outer product.

If oik_init is a numpy array, $${\mathbf O}$$, $${\mathbf I}$$ and $${\mathbf K}$$ are computed such that oik_init is approximated by their composition. If oik_init is None or an initializer, the composition approximates the random initialization.

If O, I and K exist in context, they are used to initialize the layer and oik_init is not used.

Suppose the kernel tensor of the convolution is $${O \times I \times K \times K}$$ and the compression rate you want to specify is $${CR}$$; then you set $${R}$$ as

$R = \left\lfloor \frac{(1 - CR)OIK^2}{O + I + K^2} \right\rfloor.$

References

• Lebedev, Vadim, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky, “Speeding-up convolutional neural networks using fine-tuned cp-decomposition.”, arXiv preprint arXiv:1412.6553 (2014).

• Marcella Astrid, Seung-Ik Lee, “CP-decomposition with Tensor Power Method for Convolutional Neural Networks Compression”, BigComp 2017.

Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "cpd3_conv";

• I (need_grad=True) : Decomposed filter weights $${\mathbf I}$$. (shape: (r, inmaps, 1, ...))

• K (need_grad=True) : Decomposed filter weights $${\mathbf K}$$. (shape: (r, *kernel))

• O (need_grad=True) : Decomposed filter weights $${\mathbf O}$$. (shape: (outmaps, r, 1, ...))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = cpd3_convolution(<args>)

nnabla.parametric_functions.binary_connect_affine(inp, n_outmaps, base_axis=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Connect Affine, multiplier-less inner-product.

Binary Connect Affine is an affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_j = \sum_{i} sign(w_{ji}) x_i.$

Therefore $$sign(w_{ji})$$ is either $$1$$ or $$-1$$, and the inner product simplifies to addition.

This function should be used together with Batch Normalization.

References

M. Courbariaux, Y. Bengio, and J.-P. David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” Advances in Neural Information Processing Systems. 2015.

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating-point weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point number for binary_weight, since this function is only for simulation purposes.
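For instance, to read the binarized weights after building the graph, run a forward pass first (a minimal sketch; names are illustrative):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((16, 64))
y = PF.binary_connect_affine(x, 10, name='bc1')
y.forward()  # syncs the float weights 'W' with the binarized weights 'Wb'
# grad_only=False also retrieves need_grad=False parameters such as 'Wb'.
params = nn.get_parameters(grad_only=False)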

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bicon_affine";

• W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))

• Wb (need_grad=False) : Binarized weights. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_connect_affine(<args>)

nnabla.parametric_functions.binary_connect_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Connect Convolution, multiplier-less inner-product.

Binary Connect Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} sign(w_{n, m, i, j}) x_{m, a + i, b + j}.$

Therefore $$sign(w_{n, m, i, j})$$ is either $$1$$ or $$-1$$, and the inner product simplifies to addition.

This function should be used together with BatchNormalization.

References

M. Courbariaux, Y. Bengio, and J.-P. David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” Advances in Neural Information Processing Systems. 2015.

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating-point weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point number for binary_weight, since this function is only for simulation purposes.

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bicon_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))

• Wb (need_grad=False) : Binarized filter weights. (shape: (outmaps, inmaps, *kernel))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_connect_convolution(<args>)

nnabla.parametric_functions.binary_weight_affine(inp, n_outmaps, base_axis=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Weight Affine, multiplier-less inner-product with a scale factor.

Binary Weight Affine is the affine function, but the inner product in this function is the following,

$y_j = \frac{1}{\|\mathbf{w}_j\|_{\ell_1}} \sum_{i} sign(w_{ji}) x_i$

Therefore $$sign(w_{ji})$$ is either $$1$$ or $$-1$$, and the inner product simplifies to addition followed by the scaling factor $$\alpha = \frac{1}{\|\mathbf{w}_j\|_{\ell_1}}$$. The number of $$\alpha$$ values equals the number of outmaps of the affine function.

References

Rastegari, Mohammad, et al. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv preprint arXiv:1603.05279 (2016).

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating-point weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point number for binary_weight, since this function is only for simulation purposes.

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bwn_affine";

• W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))

• Wb (need_grad=False) : Binarized weights. (shape: (inmaps, outmaps))

• alpha (need_grad=False) : Scaling factor $$\alpha$$. (shape: (outmaps,))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_weight_affine(<args>)

nnabla.parametric_functions.binary_weight_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Weight Convolution, multiplier-less inner-product with a scale factor.

Binary Weight Convolution is the convolution function, but the inner product in this function is the following,

$y_{n, a, b} = \frac{1}{\|\mathbf{w}_n\|_{\ell_1}} \sum_{m} \sum_{i} \sum_{j} sign(w_{n, m, i, j}) x_{m, a + i, b + j}.$

Therefore $$sign(w_{n, m, i, j})$$ is either $$1$$ or $$-1$$, and the inner product simplifies to addition followed by the scaling factor $$\alpha = \frac{1}{\|\mathbf{w}_n\|_{\ell_1}}$$. The number of $$\alpha$$ values equals the number of outmaps of the convolution function.

References

Rastegari, Mohammad, et al. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv preprint arXiv:1603.05279 (2016).

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating-point weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point number for binary_weight, since this function is only for simulation purposes.

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bwn_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))

• Wb (need_grad=False) : Binarized filter weights. (shape: (outmaps, inmaps, *kernel))

• alpha (need_grad=False) : Scaling factor $$\alpha$$. (shape: (outmaps,))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_weight_convolution(<args>)

nnabla.parametric_functions.inq_affine(inp, n_outmaps, base_axis=1, num_bits=4, inq_iterations=(), selection_algorithm='random', seed=-1, w_init=None, i_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Incremental Network Quantization Affine Layer

During training, the weights are sequentially quantized to power-of-two values, which allows the training of a multiplierless network.

Using inq_iterations, one can specify after how many forward passes half of the learnable weights are fixed and quantized to powers-of-two. After reaching the last value in inq_iterations, all weights are fixed.

For more details, please refer to the reference.

Reference: Zhou A, Yao A, Guo Y, Xu L, Chen Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. <https://arxiv.org/abs/1702.03044>

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "inq_affine";

• W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))

• I (need_grad=False) : Binary indicator matrix of fixed weights. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = inq_affine(<args>)

nnabla.parametric_functions.inq_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, num_bits=4, inq_iterations=(), selection_algorithm='random', seed=-1, w_init=None, i_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Incremental Network Quantization Convolution Layer

During training, the weights are sequentially quantized to power-of-two values, which allows the training of a multiplierless network.

Using inq_iterations, one can specify after how many forward passes half of the learnable weights are fixed and quantized to powers-of-two. After reaching the last value in inq_iterations, all weights are fixed.

For more details, please refer to the reference.

Reference: Zhou A, Yao A, Guo Y, Xu L, Chen Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. <https://arxiv.org/abs/1702.03044>

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "inq_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))

• I (need_grad=False) : Binary indicator matrix of fixed weights. (shape: (outmaps, inmaps, *kernel))

• b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = inq_convolution(<args>)

nnabla.parametric_functions.fixed_point_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, n_w=8, delta_w=0.0625, ste_fine_grained_w=True, quantize_b=True, sign_b=True, n_b=8, delta_b=0.0625, ste_fine_grained_b=True, name=None)[source]

Fixed-Point Quantized Affine.

Fixed-Point Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_j = \sum_{i} Q(w_{ji}) x_i,$

where $$Q(w_{ji})$$ is the fixed-point quantization function.

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the quantized weights will not be in sync.

3) CPU and GPU implementations now use float value for quantized weight, since this function is only for simulation purposes.

Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "fp_quantized_affine";

• W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = fixed_point_quantized_affine(<args>)
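Example (a hedged sketch with illustrative sizes). With sign_w=True, n_w=8, and delta_w=0.0625, the weights are snapped to the uniform grid $$k \times 0.0625$$; under a symmetric signed convention this covers roughly $$[-(2^{7}-1) \times 0.0625, (2^{7}-1) \times 0.0625] \approx [-7.94, 7.94]$$ (the exact clipping rule is an assumption; see F.fixed_point_quantize):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((32, 128))
h = PF.fixed_point_quantized_affine(x, 64, n_w=8, delta_w=0.0625,
                                    name='fpq_affine1')
h.forward()  # W and W_q become synced only after forward() (see note 2)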

nnabla.parametric_functions.fixed_point_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, n_w=8, delta_w=0.0625, ste_fine_grained_w=True, quantize_b=True, sign_b=True, n_b=8, delta_b=0.0625, ste_fine_grained_b=True, name=None)[source]

Fixed-Point Quantized Convolution.

Fixed-Point Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},$

where $$Q(w_{n, m, i, j})$$ is the fixed-point quantization function.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the quantized weights will not be in sync.

3) CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters
Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "fp_quantized_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = fixed_point_quantized_convolution(<args>)

nnabla.parametric_functions.min_max_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, ql_min_w=0, ql_max_w=255, w_min_max=False, qr_min_w_init=None, qr_max_w_init=None, ste_fine_grained_w=True, quantize_b=True, ql_min_b=0, ql_max_b=255, b_min_max=False, qr_min_b_init=None, qr_max_b_init=None, ste_fine_grained_b=True, eps=0.01, name=None)[source]

Min-max Quantized Affine.

Min-max Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_j = \sum_{i} Q(w_{ji}) x_i,$

where $$Q(w_{ji})$$ is the min-max quantization function.

In min_max_quantized_affine, the exponential moving average is not used. The min and max quantization ranges are either the min-max of the weights and bias, or are trained.

Notice that the min and max values of inputs are always used instead of the exponential moving average.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the quantized weights will not be in sync.

3) CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "min_max_quantized_affine";

• W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

• qr_min (need_grad=False) : Minimum quantization range, the minimum values of the inputs or a trainable range. (shape: ql_min.shape)

• qr_max (need_grad=False) : Maximum quantization range, the maximum values of the inputs or a trainable range. (shape: ql_max.shape)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = min_max_quantized_affine(<args>)
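Example (a hedged sketch with illustrative sizes): quantize the weights to the 8-bit levels $$[0, 255]$$ using the min-max of the weights as the quantization range:

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((32, 128))
# w_min_max=True: use the min/max of the weights as the quantization
# range instead of a trained range.
h = PF.min_max_quantized_affine(x, 64, ql_min_w=0, ql_max_w=255,
                                w_min_max=True, name='mmq_affine1')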

nnabla.parametric_functions.min_max_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, ql_min_w=0, ql_max_w=255, w_min_max=False, qr_min_w_init=None, qr_max_w_init=None, ste_fine_grained_w=True, quantize_b=True, ql_min_b=0, ql_max_b=255, b_min_max=False, qr_min_b_init=None, qr_max_b_init=None, ste_fine_grained_b=True, eps=0.01, name=None)[source]

Min-max Quantized Convolution.

Min-max Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},$

where $$Q(w_{n, m, i, j})$$ is the min-max quantization function.

In min_max_quantized_convolution, the exponential moving average is not used. The min and max quantization ranges are either the min-max of the weights and bias, or are trained.

Notice that the min and max values of inputs are always used instead of the exponential moving average.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the quantized weights will not be in sync.

3) CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters
Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "min_max_quantized_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

• qr_min (need_grad=False) : Minimum quantization range, the minimum values of the inputs or a trainable range. (shape: ql_min.shape)

• qr_max (need_grad=False) : Maximum quantization range, the maximum values of the inputs or a trainable range. (shape: ql_max.shape)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = min_max_quantized_convolution(<args>)

nnabla.parametric_functions.pow2_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, with_zero_w=False, n_w=8, m_w=2, ste_fine_grained_w=True, quantize_b=True, sign_b=True, with_zero_b=False, n_b=8, m_b=2, ste_fine_grained_b=True, name=None)[source]

Pow2 Quantized Affine.

Pow2 Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_j = \sum_{i} Q(w_{ji}) x_i,$

where $$Q(w_{ji})$$ is the power-of-2 quantization function.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the quantized weights will not be in sync.

3) Quantized values are stored as floating-point numbers for the quantized weights, since this function is only for simulation purposes.

Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pow2_quantized_affine";

• W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pow2_quantized_affine(<args>)
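Example (a hedged sketch with illustrative sizes; the interpretation of m_w as bounding the dynamic range at $$2^{m_w}$$ is an assumption, see F.pow2_quantize for the exact rule):

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((32, 128))
# weights are snapped to signed powers of two using n_w bits
h = PF.pow2_quantized_affine(x, 64, n_w=8, m_w=2, name='p2q_affine1')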

nnabla.parametric_functions.pow2_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, with_zero_w=False, sign_w=True, n_w=8, m_w=2, ste_fine_grained_w=True, quantize_b=True, with_zero_b=False, sign_b=True, n_b=8, m_b=2, ste_fine_grained_b=True, name=None)[source]

Pow2 Quantized Convolution.

Pow2 Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},$

where $$Q(w_{n, m, i, j})$$ is the power-of-2 quantization function.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the quantized weights will not be in sync.

3) Quantized values are stored as floating-point numbers for the quantized weights, since this function is only for simulation purposes.

Parameters
Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pow2_quantized_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pow2_quantized_convolution(<args>)

nnabla.parametric_functions.pruned_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, prune_w=True, rate_w=0.9, prune_b=True, rate_b=0.9, name=None)[source]

Pruned Affine.

Pruned Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_j = \sum_{i} Q(w_{ji}) x_i,$

where $$Q(w_{ji})$$ is the pruning function, i.e., F.prune.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the pruned weights (quantized weight).

2) The weights and the pruned weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the pruned weights will not be in sync.

3) CPU and GPU implementations currently use float values for the pruned weights, since this function is only for simulation purposes.

Parameters
Returns

$$(B + 1)$$-D array. ($$M_0 \times \ldots \times M_{B-1} \times L$$)

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pruned_affine";

• W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pruned_affine(<args>)
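Example (a hedged sketch with illustrative sizes): prune a rate_w fraction of the weights (see F.prune); W_q holds the pruned copy after a forward pass:

import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable((32, 128))
h = PF.pruned_affine(x, 64, rate_w=0.9, name='pruned_affine1')
h.forward()  # W and W_q become synced only after forward()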

nnabla.parametric_functions.pruned_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, prune_w=True, rate_w=0.9, prune_b=True, rate_b=0.9, name=None)[source]

Pruned Convolution.

Pruned Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

$y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},$

where $$Q(w_{ji})$$ is the pruning function, i.e., F.prune.

Note

1) If you would like to share weights between some layers, please make sure to share the standard floating-point weights (weight) and not the pruned weights (quantized weight).

2) The weights and the pruned weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once beforehand; otherwise the float weights and the pruned weights will not be in sync.

3) CPU and GPU implementations currently use float values for the pruned weights, since this function is only for simulation purposes.

Parameters
Returns

N-D array.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pruned_conv";

• W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))

• b (need_grad=True) : Bias vector in float. (shape: (outmaps,))

• W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))

• b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pruned_convolution(<args>)

nnabla.parametric_functions.min_max_quantize(x, ql_min=0, ql_max=255, decay=0.999, x_min_max=False, ema=False, ste_fine_grained=True, eps=0.01, qr_min_init=None, qr_max_init=None, fix_parameters=False, outputs=None, name=None)[source]

Min-max quantization.

This function uniformly quantizes values in the range of min and max quantization levels.

Min-max quantization is defined as the following equation

$y = round \left(\frac{\min(\max(x, m), M) - m}{scale} \right) \times scale + m,$

where the $$scale$$ is defined as

$scale = \frac{M - m}{M_q - m_q},$

and

$\begin{split}m_q = ql_{min}, \\ M_q = ql_{max}, \\ m = qr_{min}, \\ M = qr_{max}.\end{split}$

In the backward pass, when ste_fine_grained is False,

$\frac{\partial q_i}{\partial x_i} = 1.$

In the backward pass, when ste_fine_grained is True,

$\begin{split} \frac{\partial q_i}{\partial x_i}= \left\{ \begin{array}{ll} 0 & if \ \ \ x_i > M \\ 1 & if \ \ m \le x_i \le M \\ 0 & if \ \ x_i < m \\ \end{array} \right..\end{split}$

$$qr_{min}$$ and $$qr_{max}$$ are treated as follows.

More precisely, at inference time of the min-max quantization, one has to consider the zero-point (zp), an integer value that corresponds to the real value 0. The zero-point is defined as

$\begin{split} && zp_f = ql_{min} -\frac{qr_{min}}{scale}, \\ && zp = \left\{ \begin{array}{ll} ql_{max} & if \ \ \ zp_f >= ql_{max} \\ round(zp_f) & if \ \ otherwise \\ ql_{min} & if \ \ zp_f <= ql_{min} \\ \end{array} \right..\end{split}$

Accordingly, in order to simulate the quantization effect of the zero-point, during both the forward and backward passes, $$qr_{min}$$ and $$qr_{max}$$ are adjusted as follows:

$\begin{split}qr_{min}^{adj} = (ql_{min} - zp) \times scale, \\ qr_{max}^{adj} = (ql_{max} - zp) \times scale.\end{split}$

These operations are often called nudge.

Finally, in the formulas of the min-max quantization, $$m$$ and $$M$$ are replaced by $$qr_{min}^{adj}$$ and $$qr_{max}^{adj}$$ respectively.
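As a worked example with assumed values (illustrative, not defaults): let $$ql_{min} = 0$$, $$ql_{max} = 255$$, $$qr_{min} = -0.5$$, and $$qr_{max} = 1.5$$. Then $$scale = (1.5 - (-0.5)) / (255 - 0) \approx 0.00784$$ and $$zp_f = 0 - (-0.5 / 0.00784) \approx 63.75$$, which rounds to $$zp = 64$$. The nudged range becomes $$qr_{min}^{adj} = (0 - 64) \times 0.00784 \approx -0.502$$ and $$qr_{max}^{adj} = (255 - 64) \times 0.00784 \approx 1.498$$, so that the real value 0 falls exactly on an integer quantization level.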

Parameters

References

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877

Parameters to be registered

The following variables are registered in a parameter scope "min_max_quantize";

• qr_min (need_grad=False) : Minimum quantization range, the exponential moving average of the min values of inputs, initialized with -6.0 if ema is True. (shape: ql_min.shape)

• qr_max (need_grad=False) : Maximum quantization range, the exponential moving average of the max values of inputs, initialized with 6.0 if ema is True. (shape: ql_max.shape)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = min_max_quantize(<args>)

nnabla.parametric_functions.lstm_cell(x, h, c, state_size, w_init=None, b_init=None, fix_parameters=False, name=None)[source]

Long Short-Term Memory.

Long Short-Term Memory, or LSTM, is a building block for recurrent neural network (RNN) layers. An LSTM unit consists of a cell and input, output, and forget gates, whose functions are defined as follows:

$\begin{split}f_t&&=\sigma(W_fx_t+U_fh_{t-1}+b_f) \\ i_t&&=\sigma(W_ix_t+U_ih_{t-1}+b_i) \\ o_t&&=\sigma(W_ox_t+U_oh_{t-1}+b_o) \\ c_t&&=f_t\odot c_{t-1}+i_t\odot\tanh(W_cx_t+U_ch_{t-1}+b_c) \\ h_t&&=o_t\odot\tanh(c_t).\end{split}$

References

S. Hochreiter, and J. Schmidhuber. “Long Short-Term Memory.” Neural Computation. 1997.

Parameters
Returns

Variable

Parameters to be registered

The following variables are registered in a parameter scope "lstm";

• affine/W (need_grad=True) : Stacked weight matrices of the LSTM block. (shape: (inmaps, 4, state_size))

• affine/b (need_grad=True) : Stacked bias vectors of LSTM block. (shape: (4, state_size,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = lstm_cell(<args>)
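Example (a hedged usage sketch; the sizes are illustrative, and per the Returns section above a single Variable, assumed here to be the updated hidden state, is returned):

import nnabla as nn
import nnabla.parametric_functions as PF

batch_size, input_dim, state_size = 8, 16, 32
x = nn.Variable((batch_size, input_dim))
h = nn.Variable((batch_size, state_size))
c = nn.Variable((batch_size, state_size))
h_t = PF.lstm_cell(x, h, c, state_size, name='lstm1')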

class nnabla.parametric_functions.LSTMCell(batch_size, state_size, h=None, c=None, name=None)[source]
__call__(x, w_init, b_init, fix_parameters)[source]

Updates h and c by calling the lstm_cell function.

Parameters
nnabla.parametric_functions.spectral_norm(w, dim=0, itr=1, eps=1e-12, test=False, u_init=None, fix_parameters=True, name=None)[source]

Spectral Normalization.

$W_{sn} = \frac{W}{\sigma(W)}.$

where $$W$$ is the input matrix, and the $$\sigma(W)$$ is the spectral norm of $$W$$. The spectral norm is approximately computed by the power iteration.
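The following is a minimal numpy sketch of that power iteration for a 2-D matrix (illustrative only; the built-in function additionally handles the dim transposition, the persistent u parameter, and test mode):

import numpy as np

def spectral_norm_sketch(W, itr=1, eps=1e-12):
    # power iteration: u approaches the leading left singular vector
    u = np.random.randn(W.shape[0])
    for _ in range(itr):
        v = W.T @ u
        v = v / (np.linalg.norm(v) + eps)
        u = W @ v
        u = u / (np.linalg.norm(u) + eps)
    sigma = u @ W @ v  # approximate largest singular value of W
    return W / sigma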

References

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, “Spectral Normalization for Generative Adversarial Networks”, International Conference on Learning Representations. 2018.

Parameters
• W (Variable) – Input N-D array of weights. This is normally a network parameter.

• dim (int) – Output dimension. Default is 0. If the dimension is not 0, the specified dimension is moved to the left-most position by transposing.

• itr (int) – Number of iterations. Default is 1.

• eps (float) – Epsilon for the normalization. Default is 1e-12.

• test (bool) – Use test mode. Default is False.

Returns

Spectrally normalized $$W_{sn}$$ with the same shape as $$W$$.

Return type

Variable

Example

import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF

b, c, h, w = 4, 64, 32, 32

# Spectrally normalized convolution
apply_w = lambda w: PF.spectral_norm(w, dim=0)
h = nn.Variable.from_numpy_array(np.random.randn(b, c, h, w))
h = PF.convolution(h, with_bias=False, apply_w=apply_w)

# Spectrally normalized affine
apply_w = lambda w: PF.spectral_norm(w, dim=1)
h = nn.Variable.from_numpy_array(np.random.randn(b, c))
h = PF.affine(h, with_bias=False, apply_w=apply_w)

# Spectrally normalized embed
apply_w = lambda w: PF.spectral_norm(w, dim=1)
h = nn.Variable.from_numpy_array(np.random.randn(b, c))
h = PF.embed(h, c, apply_w=apply_w)

Parameters to be registered

The following variables are registered in a parameter scope "spectral-norm";

• u (need_grad=False) : singular vector. (shape: (w.shape[dim], ))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = spectral_norm(<args>)

nnabla.parametric_functions.weight_normalization(w, dim=0, eps=1e-12, g_init=None, fix_parameters=False, name=None)[source]

Weight Normalization.

$\mathbf{w}_{WN} = g \dfrac{\mathbf{w}}{\|\mathbf{w}\|}$

where $$\mathbf{w}$$ is the input weight to be normalized and $$g$$ is a learnable multiplicative factor, one per slice of the input weight along dim. This function is in general used as a callback passed to apply_w of PF.convolution, PF.affine, and so on. According to the authors' original implementation, $$\mathbf{w}$$ should be initialized by $$N(0, 0.05)$$. To meet this condition, the initializer should be passed to the convolution to which Weight Normalization is applied, as in the example below.

References

Tim Salimans and Diederik P. Kingma, "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks", Advances in Neural Information Processing Systems. 2016.

Parameters
Returns

$$\mathbf{w}_{WN}$$ with the same shape as $$\mathbf{w}$$.

Return type

Variable

Example

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

# h is nn.Variable.

# convolution
# according to the original implementation, w should be initialized by N(0, 0.05).
h = PF.convolution(h, ..., apply_w=PF.weight_normalization, w_init=I.NormalInitializer(0.05))

# affine
h = PF.affine(h, ..., apply_w=lambda w: PF.weight_normalization(w, dim=1), w_init=I.NormalInitializer(0.05))


Warning

Up to version 1.10.0, this had been implemented as a composite of functions.

Parameters to be registered

The following variables are registered in a parameter scope "wn";

• g (need_grad=True) : Weight Normalization adaptive scale scalar. (shape: w.shape[dim])

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = weight_normalization(<args>)


nnabla.parametric_functions.multi_head_attention(query, key, value, ...)[source]

Multi-Head Attention.

Computes multi-headed attention with query, key, and value. We use the following notations to describe the inputs and outputs below. $$L_T$$: target sequence length, $$L_S$$: source sequence length, $$B$$: batch size, $$D$$: input dimension, $$E$$: embedding dimension.

References

A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>

Example:

# input dims chosen divisible by the default num_heads=12
tgt_len, src_len, batch_size = 10, 12, 4
q_input_dim = k_input_dim = v_input_dim = 12

q = nn.Variable((tgt_len, batch_size, q_input_dim))
k = nn.Variable((src_len, batch_size, k_input_dim))
v = nn.Variable((src_len, batch_size, v_input_dim))

out, w = PF.multi_head_attention(q, k, v)
out.forward()
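To prevent attention to future positions (e.g. for auto-regressive decoding), an additive mask with large negative values above the diagonal can be passed via additive_mask; the mask construction below is an illustrative sketch, not part of the API:

import numpy as np

# positions j > i receive a large negative bias and are effectively masked
mask = np.triu(np.full((tgt_len, src_len), -1e9, dtype=np.float32), k=1)
additive_mask = nn.Variable.from_numpy_array(mask)
out, w = PF.multi_head_attention(q, k, v, additive_mask=additive_mask)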

Parameters
• query (Variable) – Input N-D array with shape $$(L_T, B, D_q)$$.

• key (Variable) – Input N-D array with shape $$(L_S, B, D_k)$$.

• value (Variable) – Input N-D array with shape $$(L_S, B, D_v)$$.

• num_heads (int, optional) – Number of attention heads. Note that the embedding dimension E must be divisible by the number of heads. Default is 12, which is conventional.

• dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.

• k_embed_dim (int, optional) – Embedding dimension for key. If specified, embedding dimensions for both query and key are set to that value. Otherwise, k_embed_dim is set to the same value as the embedding dimension for query.

• v_embed_dim (int, optional) – Embedding dimension for value. If not specified, it defaults to the same value as the embedding dimension for query.

• out_dim (int, optional) – Embedding dimension for the output weight. If not specified, it defaults to the same value as the embedding dimension for value.

• rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.

• with_bias (bool, optional) – Specify whether to include the bias parameters. Default is True.

• add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.

• additive_mask (Variable, optional) – Input N-D array with shape $$(L_T, L_S)$$. Values will be added to the attention layer to prevent attention to certain positions.

• key_padding_mask (Variable, optional) – Input N-D array with shape $$(B, L_S)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.

• param_init (dict, optional) – Parameter initializers can be set with a dict. Possible keys of the dict include q_weight, k_weight, v_weight, q_bias, k_bias, v_bias, out_weight, out_bias, attn_bias_k, attn_bias_v. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'q_bias': ConstantInitializer(0)}.

Returns

Output $$y$$ with shape $$(L_T, B, E)$$, and attention weights $$h_n$$ with shape $$(B, L_T, L_S)$$.

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "multi_head_attention";

• q_weight (need_grad=True) : weights for query. (shape: (E, E))

• k_weight (need_grad=True) : weights for key. (shape: (E_k, E))

• v_weight (need_grad=True) : weights for value. (shape: (E_v, E))

• out_weight (need_grad=True) : weights for out projection. (shape: (E, E))

• q_bias (need_grad=True) : bias for query. (shape: (E, ))

• k_bias (need_grad=True) : bias for key. (shape: (E, ))

• v_bias (need_grad=True) : bias for value. (shape: (E, ))

• out_bias (need_grad=True) : bias for out projection. (shape: (E, ))

• attn_bias_k (need_grad=True) : attention bias for k. (shape: (E, 1))

• attn_bias_v (need_grad=True) : attention bias for v. (shape: (E, 1))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = multi_head_attention(<args>)

nnabla.parametric_functions.transformer(src, tgt, ...)[source]

Transformer.

We use the following notations to describe the inputs and outputs below. $$L_T$$: target sequence length, $$L_S$$: source sequence length, $$B$$: batch size, $$E$$: embedding dimension.

References

A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>

Examples:

# embed_dim chosen to match the default 512 and to be divisible by num_heads
src_len, tgt_len, batch_size, embed_dim = 12, 10, 4, 512
src = nn.Variable((src_len, batch_size, embed_dim), need_grad=True)
tgt = nn.Variable((tgt_len, batch_size, embed_dim), need_grad=True)
out = PF.transformer(src, tgt, num_heads=16, num_encoder_layers=12)
out.forward()

Parameters
• src (Variable) – Input source sequence to the encoder with shape $$(L_S, B, E)$$.

• tgt (Variable) – Input target sequence to the decoder with shape $$(L_T, B, E)$$.

• embed_dim (int, optional) – Embedding dimension to be used. Default is 512.

• num_heads (int, optional) – Number of attention heads. Default is 12.

• num_encoder_layers (int, optional) – Number of encoder layers to stack. Default is 6.

• num_decoder_layers (int, optional) – Number of decoder layers to stack. Default is 6.

• dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.

• dropout (float, optional) – Dropout ratio applied. Default is 0.1.

• activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.

• src_additive_mask (Variable, optional) – Additive mask for the src sequence (optional). $$(L_S, L_S)$$.

• tgt_additive_mask (Variable, optional) – Additive mask for the tgt sequence (optional). $$(L_T, L_T)$$.

• memory_additive_mask (Variable, optional) – Additive mask for the encoder output (optional). $$(L_T, L_S)$$.

• src_key_padding_mask (Variable, optional) – Key padding mask for src keys per batch (optional). $$(B, L_S)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• tgt_key_padding_mask (Variable, optional) – Key padding mask for tgt keys per batch (optional). $$(B, L_T)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• memory_key_padding_mask (Variable, optional) – Key padding mask for memory keys per batch (optional). $$(B, L_S)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.

• add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.

• fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.

Returns

Output $$y$$ with shape $$(L_T, B, E)$$

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "transformer";

• encoder{layer#} (need_grad=True) : parameters for the n’th encoder layer. (shape: Refer to transformer_encode for details)

• decoder{layer#} (need_grad=True) : parameters for the n’th decoder layer. (shape: Refer to transformer_decode for details)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = transformer(<args>)


nnabla.parametric_functions.transformer_encode(src, embed_dim, ...)[source]

Transformer Encoder.

Parameters
• src (Variable) – Input sequence to the encoder layer with shape $$(L_S, B, E)$$.

• embed_dim (int) – Number of embedding dimension.

• dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.

• dropout (float, optional) – Dropout ratio. Default is 0.1.

• activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.

• src_additive_mask (Variable, optional) – Additive mask for the source sequence with shape $$(L_S, L_S)$$

• src_key_padding_mask (Variable, optional) – Padding mask for the source sequence with shape $$(B, L_S)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.

• add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.

• fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.

Returns

Output $$y$$ with shape $$(L_S, B, E)$$

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "transformer_encode";

• src_self_attn (need_grad=True) : self-attention parameters for source sequence. (shape: Refer to multi_head_attention for details)

• enc_affine1 (need_grad=True) : first affine used in encoder. (shape: Refer to affine for details)

• enc_affine2 (need_grad=True) : second affine used in encoder. (shape: Refer to affine for details)

• enc_layer_norm1 (need_grad=True) : first layer normalization used in encoder. (shape: Refer to layer_normalization for details)

• enc_layer_norm2 (need_grad=True) : second layer normalization used in encoder. (shape: Refer to layer_normalization for details)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = transformer_encode(<args>)


nnabla.parametric_functions.transformer_decode(tgt, memory, embed_dim, ...)[source]

Transformer Decoder.

Parameters
• tgt (Variable) – Input sequence to the decoder layer with shape $$(L_T, B, E)$$.

• memory (Variable) – Output sequence from the last layer of the encoder with shape $$(L_S, B, E)$$.

• embed_dim (int) – Number of embedding dimension.

• dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.

• dropout (float, optional) – Dropout ratio. Default is 0.1.

• activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.

• tgt_additive_mask (Variable, optional) – Additive mask for the target sequence with shape $$(L_T, L_T)$$.

• memory_additive_mask (Variable, optional) – Additive mask for the memory sequence with shape $$(L_T, L_S)$$.

• tgt_key_padding_mask (Variable, optional) – Padding mask for the target sequence with shape $$(B, L_T)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• memory_key_padding_mask (Variable, optional) – Padding mask for the memory sequence with shape $$(B, L_S)$$. Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.

• rng (numpy.random.RandomState) – Random generator for Initializer. Default is None.

• add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.

• fix_parameters (bool) – When set to True, the weights and biases will not be updated. Default is False.

Returns

Output $$y$$ with shape $$(L_T, B, E)$$

Return type

Variable

Parameters to be registered

The following variables are registered in a parameter scope "transformer_decode";

• tgt_self_attn (need_grad=True) : self-attention parameters for target sequence. (shape: Refer to multi_head_attention for details)

• tgt_memory_attn (need_grad=True) : attention parameters for target sequence with output from encoder as key. (shape: Refer to multi_head_attention for details)

• dec_affine1 (need_grad=True) : first affine used in decoder. (shape: Refer to affine for details)

• dec_affine2 (need_grad=True) : second affine used in decoder. (shape: Refer to affine for details)

• dec_layer_norm1 (need_grad=True) : first layer normalization used in decoder. (shape: Refer to layer_normalization for details)

• dec_layer_norm2 (need_grad=True) : second layer normalization used in decoder. (shape: Refer to layer_normalization for details)

• dec_layer_norm3 (need_grad=True) : third layer normalization used in decoder. (shape: Refer to layer_normalization for details)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = transformer_decode(<args>)


## Parameter Initializer¶

Some of the parametric functions optionally take a parameter initializer, listed below.

class nnabla.initializer.BaseInitializer[source]

Base class of the parameter initializer.

__call__(shape)[source]

Generates an array with an initializer.

Parameters

shape (tuple of int) – Shape of the numpy.ndarray to be created.

Returns

Array.

Return type

numpy.ndarray

Note

Subclasses of BaseInitializer must override this method.
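For example, a custom initializer can be defined by subclassing BaseInitializer and overriding __call__; a minimal sketch (the clipped-normal logic is purely illustrative):

import numpy as np
import nnabla.initializer as I

class ClippedNormalInitializer(I.BaseInitializer):
    def __init__(self, sigma=1.0, rng=None):
        self.sigma = sigma
        self.rng = rng if rng is not None else np.random.RandomState()

    def __call__(self, shape):
        # normal samples clipped to two standard deviations
        x = self.rng.randn(*shape) * self.sigma
        return np.clip(x, -2 * self.sigma, 2 * self.sigma)

It can then be passed like any other initializer, e.g. as w_init of a parametric function.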

class nnabla.initializer.ConstantInitializer(value=0)[source]

Generates a constant valued array.

Parameters

value (float) – A constant value.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.ConstantInitializer(0.1)
b = I.ConstantInitializer() # this generates constant valued array of default value 0
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

class nnabla.initializer.NormalInitializer(sigma=1.0, rng=None)[source]

Generates a random array from a specified normal distribution.

$\mathbf x \sim {\cal N} (\mathbf 0, \sigma^2 \mathbf I)$
Parameters

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.NormalInitializer(5e-5)
b = I.NormalInitializer(0.0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

class nnabla.initializer.UniformInitializer(lim=(-1, 1), rng=None)[source]

Generates a random array from a specified uniform distribution.

$\mathbf x \sim {\cal U} (a, b)$
Parameters

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.UniformInitializer() # this generates uniform distribution within the default range of (-1,1)
b = I.UniformInitializer((-0.5,0.5))
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

class nnabla.initializer.UniformIntInitializer(lim=(0, 10), rng=None)[source]

Generates a random array from a specified integer uniform distribution.

$\mathbf x \sim {\cal U} ([a, b))$
Parameters

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.UniformIntInitializer() # this generates uniform integer distribution within the default range of (0,10)
b = I.UniformIntInitializer((-1,1))
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

class nnabla.initializer.RangeInitializer(start=0, step=1)[source]

Generates an array with a sequence of numbers.

$\mathbf x[i] = start + step * i$
Parameters
• start (int) – A start value.

• step (int) – A step value.

Example:

import nnabla as nn
import nnabla.initializer as I

x = nn.Variable([100])
x.d = I.RangeInitializer(0, 1)(x.shape)

class nnabla.initializer.OrthogonalInitializer(gain=1.0, rng=None)[source]

Generates orthogonal matrix weights, as proposed by Saxe et al.

Parameters
• gain (float) – Scaling factor, which should be decided depending on the type of units.

• rng (numpy.random.RandomState) – Random number generator.

Example:

import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.OrthogonalInitializer(np.sqrt(2.0))
b = I.ConstantInitializer(0.0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')


References

Andrew M. Saxe, James L. McClelland, and Surya Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", International Conference on Learning Representations. 2014.

class nnabla.initializer.WeightNormalizationScaleInitializer(w, dim=0, eps=1e-12)[source]

Compute the L2-norm for each weight kernel.

This initializer is specific to the weight normalization scale; it keeps the magnitude of the originally initialized weights even after the application of weight normalization, at initialization only.

Parameters
• w (Variable) – Weight to which the weight normalization is applied.

• dim (int) – Output dimension of the weight normalization.

• eps (float) – Epsilon of the weight normalization.
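Example (a hedged sketch; wiring the initializer through g_init of PF.weight_normalization inside apply_w is an assumption based on the signatures above):

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60, 1, 28, 28])
# g starts at the per-kernel L2 norm of w, so the effective weights keep
# their original magnitude right after initialization.
apply_w = lambda w: PF.weight_normalization(
    w, dim=0, g_init=I.WeightNormalizationScaleInitializer(w, dim=0))
h = PF.convolution(x, 64, [3, 3], apply_w=apply_w, pad=[1, 1], name='conv')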

nnabla.initializer.calc_normal_std_he_forward(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the standard deviation proposed by He et al.

$\sigma = \sqrt{\frac{2}{NK}}$
Parameters
• inmaps (int) – Map size of an input Variable, $$N$$.

• outmaps (int) – Map size of an output Variable, $$M$$.

• kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, $$K$$ is the product of the shape dimensions. In Affine, the default value should be used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_he_forward(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')


References

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", IEEE International Conference on Computer Vision. 2015.

nnabla.initializer.calc_normal_std_he_backward(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the standard deviation of He et al. (backward case).

$\sigma = \sqrt{\frac{2}{MK}}$
Parameters
• inmaps (int) – Map size of an input Variable, $$N$$.

• outmaps (int) – Map size of an output Variable, $$M$$.

• kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, $$K$$ is the product of the shape dimensions. In Affine, the default value should be used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_he_backward(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')


References

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", IEEE International Conference on Computer Vision. 2015.

nnabla.initializer.calc_normal_std_glorot(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the standard deviation proposed by Glorot et al.

Note

We have updated the definition as follows from v1.2; this may affect the behavior of existing scripts that rely on the default initialization.

$\sigma = \sqrt{\frac{2}{K(N + M)}}$
Parameters
• inmaps (int) – Map size of an input Variable, $$N$$.

• outmaps (int) – Map size of an output Variable, $$M$$.

• kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, $$K$$ is the product of the shape dimensions. In Affine, the default value should be used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_glorot(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')


References

Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", International Conference on Artificial Intelligence and Statistics. 2010.

nnabla.initializer.calc_uniform_lim_glorot(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the lower bound and the upper bound of the uniform distribution proposed by Glorot et al.

Note

We have updated the definition as follows from v1.3; this may affect the behavior of existing scripts that rely on the default initialization.

$\begin{split}b &= \sqrt{\frac{6}{K(N + M)}}\\ a &= -b\end{split}$
Parameters
• inmaps (int) – Map size of an input Variable, $$N$$.

• outmaps (int) – Map size of an output Variable, $$M$$.

• kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, $$K$$ is the product of the shape dimensions. In Affine, the default value should be used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
lb,ub= I.calc_uniform_lim_glorot(x.shape[1],64)
w = I.UniformInitializer((lb,ub))
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

References

Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", International Conference on Artificial Intelligence and Statistics. 2010.