Parametric Functions

In NNabla, trainable models are created by composing functions that have optimizable parameters. These functions are called parametric functions. Parametric functions are provided by nnabla.parametric_functions.

See also:
Python API Tutorial.

Parameter Management API

The parameters registered by List of Parametric Functions can be managed using APIs listed in this section.

nnabla.parameter.parameter_scope(*args, **kwds)

Groups parameters registered by the parametric functions listed in nnabla.parametric_functions.

Parameters:
  • name (str) – Parameter scope name.
  • scope (OrderedDict, optional) – Specify the current parameter scope as a local dictionary. The default value is None, in which case the globally maintained current parameter scope is used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.functions as F

with nn.parameter_scope('conv1'):
    conv_out1 = PF.convolution(x, 32, (5, 5))
    bn_out1 = PF.batch_normalization(conv_out1)
    act_out1 = F.relu(bn_out1)
with nn.parameter_scope('conv2'):
    conv_out2 = PF.convolution(act_out1, 64, (3, 3))
    bn_out2 = PF.batch_normalization(conv_out2)
    act_out2 = F.relu(bn_out2)

Nesting the with statement blocks allows you to nest parameter scopes. The same can be done by using “/” inside the parameter scope names.

Example:

with nn.parameter_scope('network1'):
    with nn.parameter_scope('conv1'):
        conv_out1 = PF.convolution(x, 32, (5, 5))
        bn_out1 = PF.batch_normalization(conv_out1)
        act_out1 = F.relu(bn_out1)
    with nn.parameter_scope('conv2'):
        conv_out2 = PF.convolution(act_out1, 64, (3, 3))
        bn_out2 = PF.batch_normalization(conv_out2)
        act_out2 = F.relu(bn_out2)

is equivalent to

with nn.parameter_scope('network1/conv1'):
    conv_out1 = PF.convolution(x, 32, (5, 5))
    bn_out1 = PF.batch_normalization(conv_out1)
    act_out1 = F.relu(bn_out1)
with nn.parameter_scope('network1/conv2'):
    conv_out2 = PF.convolution(act_out1, 64, (3, 3))
    bn_out2 = PF.batch_normalization(conv_out2)
    act_out2 = F.relu(bn_out2)

nnabla.parameter.get_current_parameter_scope()

Returns the current parameter scope.

nnabla.parameter.get_parameters(params=None, path='', grad_only=True)

Get parameter Variables under the current parameter scope.

Parameters:
  • params (dict) – Internal use. Users should not set it manually.
  • path (str) – Internal use. Users should not set it manually.
  • grad_only (bool) – Retrieve all parameters under the current scope if False, while only parameters with need_grad=True are retrieved if True.
Returns:

{str : Variable}

Return type:

dict

nnabla.parameter.clear_parameters()

Clear all parameters in the current scope.

nnabla.parameter.save_parameters(path, params=None)

Save all parameters into a file with the specified format.

Currently hdf5 and protobuf formats are supported.

Parameters:
  • path – path or file object
  • params (dict, optional) – Parameters to be saved. Dictionary is of a parameter name (str) to Variable.

nnabla.parameter.load_parameters(path, proto=None, needs_proto=False)

Load parameters from a file with the specified format.

Parameters:
  • path – path or file object

nnabla.parameter.get_parameter_or_create(name, shape=None, initializer=None, need_grad=True, as_need_grad=None)

Returns an existing parameter variable in the current parameter scope with the provided name.

If a variable with the provided name does not exist, a new variable is created and registered to the current parameter scope with the name, then returned.

Parameters:
  • name (str) – The name under the current scope. If it already exists, the name is queried from the parameter manager.
  • shape (tuple of int) – Shape of the created parameter. If the parameter already exists, its shape must match this shape. The default is None, which is only valid if initializer is given as a numpy.ndarray.
  • initializer (nnabla.initializer.BaseInitializer or numpy.ndarray) – An initialization function to be applied to the parameter. numpy.ndarray can also be given to initialize parameters from numpy array data.
  • need_grad (bool) – Register the parameter with the specified need_grad flag. The default is True. If the flag is different from the previously specified one, the flag will be overwritten, but the values will be kept.
  • as_need_grad (bool) – Get a parameter variable with the specified need_grad flag. Note that this doesn’t overwrite the flag of the registered parameter variable with the provided name. Instead, if the given flag mismatches with the previously registered need_grad flag, it returns a new variable referring to the same array contents but with need_grad=as_need_grad.

Note

It returns a Variable which is unlinked from the registered one in the current parameter scope (using nnabla.Variable.get_unlinked_variable()). That means changing the need_grad attribute of the returned Variable doesn’t affect the variable existing in the current parameter scope.

List of Parametric Functions

Parametric functions are provided by nnabla.parametric_functions, as listed below. Like the functions listed in Functions, they take Variable(s) as first argument(s), followed by options specific to each parametric function. In addition, they register parameter Variable(s) into the parameter scope.

The parameter variables are registered with need_grad properties specific to each parametric function. Variables with need_grad=False are not updated by gradient descent, so backward computation is not executed for them. need_grad=False is usually specified when the parameters are updated during the forward and/or backward pass, e.g., the moving averages in batch normalization.

All parametric functions take an optional argument fix_parameters=False. When True is given, the associated parameter variables are connected to the computation graph with need_grad=False regardless of the properties of the registered variables, so backward gradient computation is not executed for those variables. This is useful, for example, when you create a computation graph for evaluation or fix only some of the parameters in a graph.

All parametric functions listed below are decorated with the following decorator.

nnabla.parametric_functions.parametric_function_api(scope_name=None, param_desc=None)

Decorator for parametric functions.

The decorated function is always called under a parameter scope scope_name. Also, the decorator adds an additional argument name (str, default is None) at the end. If name is specified, the scope scope_name comes under a scope name. This feature could reduce vertical space usage of the source code. Any parametric function should be decorated by this.

Parameters:
  • scope_name (str, optional) – The original function will be called under a parameter scope named by scope_name.
  • param_desc (list, optional) – Descriptions of parameters will be automatically included into docstring. This must be a list of tuples with 4 elements composed of (name (str), description (str), shape info (str), need_grad (bool)).
Returns:

A decorated parametric function.

Return type:

function

See Parameter Management API to know how to query and manipulate registered variables.

Here is the list of parametric functions.

nnabla.parametric_functions.affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)

The affine layer, also known as the fully connected layer. Computes

\[{\mathbf y} = {\mathbf A} {\mathbf x} + {\mathbf b}.\]

where \({\mathbf x}, {\mathbf y}\) are the inputs and outputs respectively, and \({\mathbf A}, {\mathbf b}\) are constants.

Parameters:
  • inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if the input were a matrix.
  • n_outmaps (int or tuple of int) – Number of output neurons per data.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • apply_w (function) – Lambda, function, or callable object applied to the weights.
  • apply_b (function) – Lambda, function, or callable object applied to the bias.
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "affine";

  • W (need_grad=True) : Weight matrix. (shape: (inmaps, outmaps))
  • b (need_grad=True) : bias vector. (shape: (outputs,))
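
The flattening around base_axis and the parameter shapes listed above can be illustrated with plain NumPy. This is a sketch of the arithmetic only, not NNabla's implementation; affine_sketch is a hypothetical helper, it handles only an integer number of output maps, and it uses the row-vector convention y = xW + b that matches a weight of shape (inmaps, outmaps).

```python
import numpy as np


def affine_sketch(x, W, b=None, base_axis=1):
    """NumPy sketch of the affine arithmetic, not NNabla's implementation."""
    sample_shape = x.shape[:base_axis]
    # Flatten the sample dimensions into rows and the rest into columns.
    x2d = x.reshape(int(np.prod(sample_shape)), -1)
    y2d = x2d @ W               # W has shape (inmaps, outmaps)
    if b is not None:
        y2d = y2d + b           # b has shape (outmaps,)
    return y2d.reshape(sample_shape + (W.shape[1],))


rng = np.random.RandomState(0)
x = rng.randn(4, 3, 5, 5)       # batch of 4, inmaps = 3 * 5 * 5 = 75
W = rng.randn(75, 10)
b = np.zeros(10)
y = affine_sketch(x, W, b)
print(y.shape)  # (4, 10)
```

With base_axis=2 the same input would be treated as 4 × 3 samples of 25 features each, so W would need shape (25, outmaps).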

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = affine(<args>)

nnabla.parametric_functions.convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)

N-D Convolution with a bias term.

For Dilated Convolution (a.k.a. Atrous Convolution), refer to:

Note

Convolution is a computationally intensive operation that should preferably be run with the cudnn backend. NNabla then uses CuDNN library functions to determine and cache the fastest algorithm for the given set of convolution parameters. This results in additional memory consumption, which may pose a problem for GPUs with insufficient memory. In that case, the NNABLA_CUDNN_WORKSPACE_LIMIT environment variable can be used to restrict the choice of algorithms to those that fit the given workspace memory limit, expressed in bytes. In some cases it may also be desirable to restrict the automatic search to algorithms that produce deterministic (reproducible) results. This can be requested by setting the environment variable NNABLA_CUDNN_DETERMINISTIC to a non-zero value.

Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
  • channel_last (bool) – If True, the last dimension is considered as channel dimension, a.k.a. NHWC order.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • apply_w (function) – Lambda, function, or callable object applied to the weights.
  • apply_b (function) – Lambda, function, or callable object applied to the bias.
Returns:

N-D array. See convolution for the output shape.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "conv";

  • W (need_grad=True) : Filter weights. (shape: (outmaps, inmaps // group, *kernel))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))
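
To make the registered shapes concrete, the following sketch computes the weight, bias and output shapes for a single channel-first sample. conv_shapes_sketch is a hypothetical helper, and the spatial output size uses the conventional convolution formula, which is an assumption here since this section does not state it.

```python
def conv_shapes_sketch(in_shape, outmaps, kernel, pad=None, stride=None,
                       dilation=None, group=1):
    """Return (W shape, b shape, output shape) for one channel-first sample.

    in_shape is (inmaps, *spatial). The spatial size uses the conventional
    formula floor((i + 2p - d(k - 1) - 1) / s) + 1, assumed here.
    """
    n = len(kernel)
    pad = pad or (0,) * n
    stride = stride or (1,) * n
    dilation = dilation or (1,) * n
    inmaps, spatial = in_shape[0], in_shape[1:]
    # Shapes as listed under "Parameters to be registered".
    w_shape = (outmaps, inmaps // group) + tuple(kernel)
    b_shape = (outmaps,)
    out_spatial = tuple(
        (i + 2 * p - d * (k - 1) - 1) // s + 1
        for i, k, p, s, d in zip(spatial, kernel, pad, stride, dilation)
    )
    return w_shape, b_shape, (outmaps,) + out_spatial


print(conv_shapes_sketch((3, 32, 32), 16, (3, 5)))
# ((16, 3, 3, 5), (16,), (16, 30, 28))
```

Padding of (1, 2) with the same 3 by 5 kernel keeps the 32 by 32 spatial size unchanged under this formula.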

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = convolution(<args>)

nnabla.parametric_functions.depthwise_convolution(inp, kernel, pad=None, stride=None, dilation=None, multiplier=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)

N-D Depthwise Convolution with a bias term.

Reference:

Parameters:
Returns:

N-D array. See depthwise_convolution for the output shape.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "depthwise_conv";

  • W (need_grad=True) : Filter weights. (shape: (inmaps * multiplier, *kernel))
  • b (need_grad=True) : Bias vector. (shape: (inmaps * multiplier,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = depthwise_convolution(<args>)

nnabla.parametric_functions.deconvolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)

Deconvolution layer.

Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of deconvolution kernels (which is equal to the number of output channels). For example, to apply deconvolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply deconvolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • group (int) – Number of groups of channels. This makes connections across channels sparser by grouping connections along map direction.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • apply_w (function) – Lambda, function, or callable object applied to the weights.
  • apply_b (function) – Lambda, function, or callable object applied to the bias.
Returns:

N-D array. See deconvolution for the output shape.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "deconv";

  • W (need_grad=True) : Filter weights. (shape: (inmaps, outmaps // group, *kernel))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))
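
The shapes for deconvolution can be sketched the same way. deconv_shapes_sketch is a hypothetical helper; the spatial output size uses the usual transposed-convolution formula s(i - 1) + d(k - 1) + 1 - 2p, which is an assumption here because this section does not spell it out.

```python
def deconv_shapes_sketch(in_shape, outmaps, kernel, pad=None, stride=None,
                         dilation=None, group=1):
    """Return (W shape, b shape, output shape) for one channel-first sample.

    in_shape is (inmaps, *spatial). The spatial size uses the usual
    transposed-convolution formula s(i - 1) + d(k - 1) + 1 - 2p, assumed here.
    """
    n = len(kernel)
    pad = pad or (0,) * n
    stride = stride or (1,) * n
    dilation = dilation or (1,) * n
    inmaps, spatial = in_shape[0], in_shape[1:]
    # Shapes as listed under "Parameters to be registered".
    w_shape = (inmaps, outmaps // group) + tuple(kernel)
    b_shape = (outmaps,)
    out_spatial = tuple(
        s * (i - 1) + d * (k - 1) + 1 - 2 * p
        for i, k, p, s, d in zip(spatial, kernel, pad, stride, dilation)
    )
    return w_shape, b_shape, (outmaps,) + out_spatial


print(deconv_shapes_sketch((16, 8, 8), 3, (4, 4), pad=(1, 1), stride=(2, 2)))
# ((16, 3, 4, 4), (3,), (3, 16, 16))
```

Note that the first weight dimension is the number of input maps, the reverse of the convolution layout.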

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = deconvolution(<args>)

nnabla.parametric_functions.depthwise_deconvolution(inp, kernel, pad=None, stride=None, dilation=None, divisor=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)

Depthwise deconvolution computes the transposed depthwise convolution for one-dimensional and two-dimensional input data.

Parameters:
Returns:

N-D array. See depthwise_deconvolution for the output shape.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "depthwise_deconv";

  • W (need_grad=True) : Filter weights. (shape: (inmaps,) + kernel)
  • b (need_grad=True) : Bias vector. (shape: (inmaps / divisor,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = depthwise_deconvolution(<args>)

nnabla.parametric_functions.batch_normalization(inp, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)

Batch normalization layer.

\[\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i\\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2\\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon }}\\ y_i &= & \hat{x}_i \gamma + \beta. \end{array}\end{split}\]

where \(x_i\) and \(y_i\) are the inputs and outputs, respectively. At test time, the mean and variance estimated by a moving average during training are used.

Parameters:
  • inp (Variable) – N-D array of input.
  • axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the remaining axes. For example, if the input has 4 dimensions and axes is [1], the batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using a numpy expression as an example).
  • decay_rate (float) – Decay rate of running mean and variance.
  • eps (float) – Tiny value to avoid zero division by std.
  • batch_stat (bool) – Use mini-batch statistics rather than running ones.
  • output_stat (bool) – Output batch mean and variance.
  • fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
  • param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'beta', 'gamma', 'mean' or 'var'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'beta': ConstantInitializer(0), 'gamma': np.ones(gamma_shape) * 2}.
  • no_scale (bool) – If True, the scale term is omitted.
  • no_bias (bool) – If True, the bias term is omitted.
Returns:

N-D array.

Return type:

Variable

References

The parameters have the same number of dimensions as the input data. Along the dimensions listed in axes, their sizes match those of the input, while all other dimensions have size 1. If an input is 4-dim and axes=[1], the parameter shape will be param_shape = np.mean(inp.d, axis=(0, 2, 3), keepdims=True).shape (using a numpy expression as an example).

Parameters to be registered

The following variables are registered in a parameter scope "bn";

  • beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
  • gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
  • mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)
  • var (need_grad=False) : Moving average of batch variance. (shape: <see above>)
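
The equations and the parameter shape rule can be checked with a small NumPy sketch. batch_norm_sketch is a hypothetical helper, not NNabla's implementation, and the way the moving averages are updated with decay_rate is an assumption (an exponential moving average); only the normalization itself follows the formulas given in this entry.

```python
import numpy as np


def batch_norm_sketch(x, beta, gamma, mean, var, axes=(1,),
                      decay_rate=0.9, eps=1e-5, batch_stat=True):
    """NumPy sketch of batch normalization; running stats are updated in place."""
    reduce_axes = tuple(i for i in range(x.ndim) if i not in axes)
    if batch_stat:
        mu = x.mean(axis=reduce_axes, keepdims=True)
        sigma2 = x.var(axis=reduce_axes, keepdims=True)
        # Assumed update rule: exponential moving average with decay_rate.
        mean[...] = decay_rate * mean + (1 - decay_rate) * mu
        var[...] = decay_rate * var + (1 - decay_rate) * sigma2
    else:
        # Test time: use the stored moving averages instead.
        mu, sigma2 = mean, var
    x_hat = (x - mu) / np.sqrt(sigma2 + eps)
    return x_hat * gamma + beta


rng = np.random.RandomState(0)
x = rng.randn(8, 3, 4, 4)
param_shape = np.mean(x, axis=(0, 2, 3), keepdims=True).shape
beta, gamma = np.zeros(param_shape), np.ones(param_shape)
mean, var = np.zeros(param_shape), np.ones(param_shape)
y = batch_norm_sketch(x, beta, gamma, mean, var)
print(param_shape)  # (1, 3, 1, 1)
```

Calling the function again with batch_stat=False normalizes with the stored mean and var, which corresponds to the test-time behaviour.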

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = batch_normalization(<args>)

nnabla.parametric_functions.sync_batch_normalization(inp, comm, group='world', axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)

Synchronized batch normalization layer.

For some tasks (e.g., semantic segmentation), the batch size can be too small for the BatchNormalization layer to work well. The SyncBatchNormalization layer solves this problem by synchronizing batch stats (mean and var) between multiple processes.

\[\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i\\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2\\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon }}\\ y_i &= & \hat{x}_i \gamma + \beta. \end{array}\end{split}\]

where \(x_i\) and \(y_i\) are the inputs and outputs, respectively.

Parameters:
  • inp (Variable) – N-D array of input.
  • comm (Communicator) – The communicator
  • group (string) – The name of the communicator group
  • axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the remaining axes. For example, if the input has 4 dimensions and axes is [1], the batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using a numpy expression as an example).
  • decay_rate (float) – Decay rate of running mean and variance.
  • eps (float) – Tiny value to avoid zero division by std.
  • batch_stat (bool) – Use mini-batch statistics rather than running ones.
  • output_stat (bool) – Output batch mean and variance.
  • fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
  • param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'beta', 'gamma', 'mean' or 'var'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'beta': ConstantInitializer(0), 'gamma': np.ones(gamma_shape) * 2}.
  • no_scale (bool) – If True, the scale term is omitted.
  • no_bias (bool) – If True, the bias term is omitted.
Returns:

N-D array.

Return type:

Variable

References

The parameters have the same number of dimensions as the input data. Along the dimensions listed in axes, their sizes match those of the input, while all other dimensions have size 1. If an input is 4-dim and axes=[1], the parameter shape will be param_shape = np.mean(inp.d, axis=(0, 2, 3), keepdims=True).shape (using a numpy expression as an example).

Parameters to be registered

The following variables are registered in a parameter scope "bn";

  • beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
  • gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
  • mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)
  • var (need_grad=False) : Moving average of batch variance. (shape: <see above>)
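
The idea of synchronizing batch statistics can be simulated in NumPy without any communicator. The sketch below is only an illustration of the principle: splitting one batch into two shards stands in for two processes, and a plain Python sum stands in for the all-reduce. How NNabla's communicator actually exchanges the statistics is not shown in this section and is not modelled here.

```python
import numpy as np

rng = np.random.RandomState(0)
full_batch = rng.randn(8, 3, 4, 4)
# Pretend the batch is split across two processes.
shards = np.split(full_batch, 2, axis=0)

reduce_axes = (0, 2, 3)
# Each process contributes its local sum, sum of squares, and element count...
local_sums = [s.sum(axis=reduce_axes, keepdims=True) for s in shards]
local_sq_sums = [(s ** 2).sum(axis=reduce_axes, keepdims=True) for s in shards]
local_counts = [s.size // s.shape[1] for s in shards]

# ...and an all-reduce (here a plain Python sum) combines them.
total = sum(local_counts)
mu = sum(local_sums) / total
sigma2 = sum(local_sq_sums) / total - mu ** 2

# The combined statistics match those of the undivided batch.
print(np.allclose(mu, full_batch.mean(axis=reduce_axes, keepdims=True)))
print(np.allclose(sigma2, full_batch.var(axis=reduce_axes, keepdims=True)))
```

Each shard alone would give noisier statistics from only four samples; after the reduction every process normalizes with the statistics of all eight.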

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = sync_batch_normalization(<args>)

nnabla.parametric_functions.mean_subtraction(inp, base_axis=1, update_running_mean=True, fix_parameters=False, name=None)

Mean subtraction layer.

It subtracts the mean of the elements of the input array so that the result has zero mean. Preprocessing arrays with this function can improve accuracy in various tasks such as image classification.

At training time, this function is defined as

\[\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i \\ y_i &=& x_i - \mu \end{array}\end{split}\]

At testing time, the mean values used are those that were computed during training by moving average.

Note

The backward performs an approximated differentiation that takes into account only the latest mini-batch.

Parameters:
  • inp (Variable) – N-D array of input.
  • base_axis (int) – Base axis of Mean Subtraction operation. Dimensions up to base_axis are treated as sample dimensions.
  • update_running_mean (bool) – When set to True, the running mean will be updated.
  • fix_parameters (bool) – Dummy parameter. This argument does not affect anything.
Returns:

N-D array.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "mean_subtraction";

  • mean (need_grad=False) : Moving average. (shape: inp.shape[base_axis:])
  • t (need_grad=False) : Minibatch counter used in forward pass. (shape: (1,))
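
The roles of the registered mean and t variables can be sketched in NumPy. mean_subtraction_sketch is a hypothetical helper with a hypothetical training flag, and the update rule (a cumulative average over the mini-batches counted by t) is an assumption; this section only states that a moving average and a mini-batch counter are kept.

```python
import numpy as np


def mean_subtraction_sketch(x, mean, t, base_axis=1, training=True):
    """NumPy sketch; `mean` and `t` are updated in place during training."""
    if training:
        # Average over the sample dimensions, i.e. those before base_axis.
        batch_mean = x.mean(axis=tuple(range(base_axis)))
        # Assumed rule: cumulative average over the t batches seen so far.
        mean[...] = mean + (batch_mean - mean) / (t[0] + 1)
        t[0] += 1
        return x - batch_mean
    # Test time: subtract the stored running mean.
    return x - mean


rng = np.random.RandomState(0)
x1, x2 = rng.randn(6, 3, 2), rng.randn(6, 3, 2)
mean = np.zeros(x1.shape[1:])        # shape: inp.shape[base_axis:]
t = np.zeros((1,), dtype=np.int64)   # shape: (1,)

y1 = mean_subtraction_sketch(x1, mean, t)
y2 = mean_subtraction_sketch(x2, mean, t)
y_test = mean_subtraction_sketch(x1, mean, t, training=False)
print(t[0])  # 2
```

After the two training calls, mean holds the average of the two batch means, and the test-time call subtracts that stored value instead of the statistics of its own input.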

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = mean_subtraction(<args>)

nnabla.parametric_functions.layer_normalization(inp, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)

Applies Layer Normalization over an input variable, which is defined as:

\[\begin{split}\begin{eqnarray} \mu^l &=& \frac{1}{H} \sum_{i=1}^{H} x_i^l \\ \sigma^l &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^l - \mu^l\right)^2} \\ y &=& \frac{x - \mu^l}{\sigma^l + \epsilon} \gamma + \beta \end{eqnarray}\end{split}\]

where \(x\) and \(y\) are the input and output variables, \(\mu^l\) and \(\sigma^l\) are the mean and std of each layer along the batch axis, and \(\gamma\) and \(\beta\) are trainable parameters.

Note

Unlike other normalizations, which apply a scalar scale and bias to each entire channel/plane, Layer Normalization applies per-element scale and bias.

References

Parameters:
  • inp (Variable) – An input variable.
  • batch_axis (int or repeated int) – Axes mean and variance are taken.
  • eps (float) – Tiny value to avoid zero division by std.
  • output_stat (bool) – If True, the calculated mean and variance are also returned.
  • fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
  • param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma', 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.
  • no_scale (bool) – If True, the scale term is omitted.
  • no_bias (bool) – If True, the bias term is omitted.
Returns:

Normalized output variable.

  • Variable: Mean (if output_stat=True).
  • Variable: Std (if output_stat=True).

Return type:

Parameters to be registered

The following variables are registered in a parameter scope "layer_normalization";

  • beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
  • gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
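
The layer normalization equations can be written out in a few lines of NumPy. layer_norm_sketch is a hypothetical helper, not NNabla's implementation. It follows the formula in this entry, where eps is added to the std, and it assumes that gamma and beta have the input shape with the batch axis set to 1, in line with the note on per-element scale and bias.

```python
import numpy as np


def layer_norm_sketch(x, gamma, beta, batch_axis=0, eps=1e-5):
    """NumPy sketch of layer normalization (eps is added to the std)."""
    batch_axes = (batch_axis,) if isinstance(batch_axis, int) else tuple(batch_axis)
    # Statistics are taken over every axis that is not a batch axis.
    reduce_axes = tuple(i for i in range(x.ndim) if i not in batch_axes)
    mu = x.mean(axis=reduce_axes, keepdims=True)
    sigma = x.std(axis=reduce_axes, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta


rng = np.random.RandomState(0)
x = rng.randn(2, 3, 4, 4)
# Assumed parameter shape: the input shape with the batch axis set to 1,
# which gives one scale and one bias per element.
param_shape = (1,) + x.shape[1:]
y = layer_norm_sketch(x, np.ones(param_shape), np.zeros(param_shape))
print(y.mean(axis=(1, 2, 3)))  # close to 0 for every sample
```

Each sample is normalized with its own statistics, so the result does not depend on which other samples share the batch.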

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = layer_normalization(<args>)

nnabla.parametric_functions.instance_normalization(inp, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)

Applies Instance Normalization over an input variable, which is defined as:

\[\begin{split}\begin{eqnarray} \mu^i &=& \frac{1}{H} \sum_{i=1}^{H} x_i^i \\ \sigma^i &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^i - \mu^i\right)^2} \\ y &=& \frac{x - \mu^i}{\sigma^i + \epsilon} \gamma + \beta \end{eqnarray}\end{split}\]

where \(x\) and \(y\) are the input and output variables, \(\mu^i\) and \(\sigma^i\) are the mean and std of each instance, calculated separately for each batch element and channel, and \(\gamma\) and \(\beta\) are adaptive gains and biases.

If the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the shapes of the calculated mean and std are [B, C, 1, 1].

References

Parameters:
  • inp (Variable) – An input variable.
  • channel_axis (int or repeated int) – Channel axes.
  • batch_axis (int or repeated int) – Batch axes.
  • eps (float) – Tiny value to avoid zero division by std.
  • output_stat (bool) – If True, the batch statistics of mean and variance are also returned.
  • fix_parameters (bool) – If True, the beta and gamma will not be updated.
  • param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma', 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.
  • no_scale (bool) – If True, the scale term is omitted.
  • no_bias (bool) – If True, the bias term is omitted.
  • Returns
Parameters to be registered

The following variables are registered in a parameter scope "instance_normalization";

  • beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
  • gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
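
A NumPy sketch of the instance normalization equations shows where the [B, C, 1, 1] statistics come from. instance_norm_sketch is a hypothetical helper, not NNabla's implementation, and the per-channel shape (1, C, 1, 1) used for gamma and beta is an assumption.

```python
import numpy as np


def instance_norm_sketch(x, gamma, beta, channel_axis=1, batch_axis=0, eps=1e-5):
    """NumPy sketch; returns the output together with the mean and std."""
    keep = {channel_axis, batch_axis}
    # Statistics are taken over the axes that are neither batch nor channel.
    reduce_axes = tuple(i for i in range(x.ndim) if i not in keep)
    mu = x.mean(axis=reduce_axes, keepdims=True)
    sigma = x.std(axis=reduce_axes, keepdims=True)
    return (x - mu) / (sigma + eps) * gamma + beta, mu, sigma


rng = np.random.RandomState(0)
x = rng.randn(2, 3, 4, 4)            # [B, C, H, W]
gamma = np.ones((1, 3, 1, 1))        # assumed per-channel gain
beta = np.zeros((1, 3, 1, 1))        # assumed per-channel bias
y, mu, sigma = instance_norm_sketch(x, gamma, beta)
print(mu.shape, sigma.shape)  # (2, 3, 1, 1) (2, 3, 1, 1)
```

Every (sample, channel) plane is standardized on its own, which is what distinguishes this from batch normalization with axes=[1].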

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = instance_normalization(<args>)

nnabla.parametric_functions.group_normalization(inp, num_groups, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)

Applies Group Normalization over an input tensor, which is defined as:

\[\begin{split}\begin{eqnarray} \mu^g &=& \frac{1}{H} \sum_{i=1}^{H} x_i^g \\ \sigma^g &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^g - \mu^g\right)^2} \\ y &=& \frac{x - \mu^g}{\sigma^g + \epsilon} \gamma + \beta \end{eqnarray}\end{split}\]

where \(x\) and \(y\) are the input and output variables, \(\mu^g\) and \(\sigma^g\) are the mean and std of each group, which contains num_channels / num_groups channels, and \(\gamma\) and \(\beta\) are adaptive gains and biases.

The input channels, specified by channel_axis, are separated into num_groups groups, and the mean and std are calculated over each group. For example, if the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the input variable is first reshaped to [B, num_groups, C / num_groups, H, W] and standardized by its mean and std, whose shapes are [B, num_groups, 1, 1, 1]. Before returning, the output variable is reshaped back to the original input shape (= [B, C, H, W] in the case above).

References

Parameters:
  • inp (Variable) – An input variable.
  • num_groups (int) – The number of groups. The channel dim of ‘x’ must be an integer multiple of num_groups.
  • channel_axis (int) – Channel axis.
  • batch_axis (int or repeated int) – Axes mean and variance are taken.
  • eps (float) – Tiny value to avoid zero division by std.
  • output_stat (bool) – If True, the batch statistics of mean and variance are also returned.
  • fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
  • param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma', 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.
  • no_scale (bool) – If True, the scale term is omitted.
  • no_bias (bool) – If True, the bias term is omitted.
Returns:

Normalized output variable.

  • Variable: Mean (if output_stat=True).
  • Variable: Std (if output_stat=True).

Return type:

Parameters to be registered

The following variables are registered in a parameter scope "group_normalization";

  • beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
  • gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
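
The reshape, standardize, reshape-back procedure can be sketched in NumPy for the [B, C, H, W] case. group_norm_sketch is a hypothetical helper, not NNabla's implementation. It assumes one mean and one std per sample and per group, as in the usual definition of group normalization, and a per-channel shape (1, C, 1, 1) for gamma and beta.

```python
import numpy as np


def group_norm_sketch(x, gamma, beta, num_groups, eps=1e-5):
    """NumPy sketch for channel_axis=1, batch_axis=0 and [B, C, H, W] input."""
    B, C, H, W = x.shape
    assert C % num_groups == 0, "channels must be a multiple of num_groups"
    grouped = x.reshape(B, num_groups, C // num_groups, H, W)
    # Assumption: one mean and std per sample and per group.
    mu = grouped.mean(axis=(2, 3, 4), keepdims=True)
    sigma = grouped.std(axis=(2, 3, 4), keepdims=True)
    # Standardize within each group, then restore the original shape.
    normalized = ((grouped - mu) / (sigma + eps)).reshape(B, C, H, W)
    return normalized * gamma + beta


rng = np.random.RandomState(0)
x = rng.randn(2, 6, 4, 4)
gamma = np.ones((1, 6, 1, 1))        # assumed per-channel gain
beta = np.zeros((1, 6, 1, 1))        # assumed per-channel bias
y = group_norm_sketch(x, gamma, beta, num_groups=3)
print(y.shape)  # (2, 6, 4, 4)
```

With num_groups equal to the number of channels, each group holds a single channel and the computation reduces to the per-instance statistics of instance normalization.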

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = group_normalization(<args>)
nnabla.parametric_functions.rnn(x, h, w0_init=None, w_init=None, b_init=None, num_layers=1, nonlinearity='tanh', dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]

N-Step RNN (recurrent neural networks).

The N-Step RNN function applies an Elman RNN with a nonlinearity to the input sequence. It is defined as follows:

\[h_t = \tanh(w_{ih}x_t+b_{ih}+w_{hh}h_{(t-1)}).\]

We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
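As a rough illustration of the recurrence, the following NumPy sketch runs a single-layer, unidirectional Elman RNN over a \((T, B, I)\) input. The function elman_rnn and the separate w_ih / w_hh arrays are hypothetical; how NNabla packs them into one weight of shape \((D, H, I + H)\) is not shown here.

```python
import numpy as np

def elman_rnn(x, h0, w_ih, w_hh, b_ih):
    """Apply h_t = tanh(w_ih x_t + b_ih + w_hh h_{t-1}) along the time axis."""
    h = h0                        # (B, H)
    outputs = []
    for x_t in x:                 # x_t: (B, I)
        h = np.tanh(x_t @ w_ih.T + b_ih + h @ w_hh.T)
        outputs.append(h)
    return np.stack(outputs), h   # (T, B, H) and (B, H)

T, B, I, H = 5, 2, 3, 4
rng = np.random.RandomState(0)
x = rng.randn(T, B, I)
y, h_n = elman_rnn(x, np.zeros((B, H)), rng.randn(H, I), rng.randn(H, H), np.zeros(H))
```

The last time step of y coincides with the final hidden state h_n.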

References

Jeffrey L. Elman. “Finding Structure in Time.” Cognitive Science. 1990.

Parameters:
  • x (Variable) – Input N-D array with shape \((T, B, I)\).
  • h (Variable) – Input N-D array with shape \((L, D, B, H)\).
  • w0_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight at the first layer. Shape is \((D, H, I + H)\).
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weights at the second layer and up. Shape is \((L-1, D, H, D*H + H)\).
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. Shape is \((L, D, H)\).
  • num_layers (int, optional) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. Default is 1.
  • nonlinearity (str, optional) – Type of nonlinearity applied to the input sequence. Must be either tanh or relu. Default is tanh.
  • dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.0.
  • bidirectional (bool, optional) – If True, bidirectional computation will be performed in each layer. Default is False.
  • training (bool, optional) – Backpropagation will be performed only when it is true. Default is True.
  • with_bias (bool, optional) – Specify whether to include the bias term.
Returns:

  • Variable: Output \(y\) with shape \((T, B, D * H)\)
  • Variable: Output \(h_n\) with shape \((L, D, B, H)\)

Return type:

Variable

Example

x = nn.Variable((seq_len, batch_size, input_size))
h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
y, hn = PF.rnn(x, h)
Parameters to be registered

The following variables are registered in a parameter scope "rnn";

  • weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, H, I + H))
  • weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, H, D * H + H))
  • bias (need_grad=True) : Biases. (shape: (L, D, H))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = rnn(<args>)
nnabla.parametric_functions.lstm(x, h, c, w0_init=None, w_init=None, b_init=None, num_layers=1, dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]

LSTM (long short-term memory).

Long Short-Term Memory, or LSTM, is a building block for recurrent neural network (RNN) layers. An LSTM unit consists of a cell and input, output, and forget gates, whose functions are defined as follows:

\[\begin{split}f_t&=\sigma(W_fx_t+U_fh_{t-1}+b_f) \\ i_t&=\sigma(W_ix_t+U_ih_{t-1}+b_i) \\ o_t&=\sigma(W_ox_t+U_oh_{t-1}+b_o) \\ c_t&=f_t\odot c_{t-1}+i_t\odot\tanh(W_cx_t+U_ch_{t-1}+b_c) \\ h_t&=o_t\odot\tanh(c_t).\end{split}\]

We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
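A single time step of these gate equations might look as follows in NumPy. The helper lstm_step is hypothetical, and stacking the four gates in the order f, i, o, c along the first axis is an assumption made for this sketch, not necessarily the layout NNabla uses internally.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4, H, I), U: (4, H, H), b: (4, H); assumed gate order f, i, o, c."""
    pre = [x_t @ W[k].T + h_prev @ U[k].T + b[k] for k in range(4)]
    f_t, i_t, o_t = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
    c_t = f_t * c_prev + i_t * np.tanh(pre[3])   # new cell state
    h_t = o_t * np.tanh(c_t)                     # new hidden state
    return h_t, c_t

B, I, H = 2, 3, 4
rng = np.random.RandomState(0)
h_t, c_t = lstm_step(rng.randn(B, I), np.zeros((B, H)), np.zeros((B, H)),
                     rng.randn(4, H, I), rng.randn(4, H, H), np.zeros((4, H)))
```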

References

S. Hochreiter, and J. Schmidhuber. “Long Short-Term Memory.” Neural Computation. 1997.

Parameters:
  • x (Variable) – Input N-D array with shape \((T, B, I)\).
  • h (Variable) – Input N-D array with shape \((L, D, B, H)\).
  • c (Variable) – Input N-D array with shape \((L, D, B, H)\) .
  • w0_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight at the first layer. Shape is \((D, 4, H, I + H)\).
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weights at the second layer and up. Shape is \((L-1, D, 4, H, D * H + H)\).
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. Shape is \((L, D, 4, H)\).
  • num_layers (int, optional) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. Default is 1.
  • dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.0.
  • bidirectional (bool, optional) – If True, bidirectional computation will be performed in each layer. Default is False.
  • training (bool, optional) – Backpropagation will be performed only when it is true. Default is True.
  • with_bias (bool, optional) – Specify whether to include the bias term.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
Returns:

  • Variable: Output \(y\) with shape \((T, B, D * H)\)
  • Variable: Output \(h_n\) with shape \((L, D, B, H)\)
  • Variable: Output \(c_n\) with shape \((L, D, B, H)\)

Return type:

Variable

Example

x = nn.Variable((seq_len, batch_size, input_size))
h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
c = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
y, hn, cn = PF.lstm(x, h, c)
Parameters to be registered

The following variables are registered in a parameter scope "lstm";

  • weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, 4, H, I + H))
  • weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, 4, H, D * H + H))
  • bias (need_grad=True) : Biases. (shape: (L, D, 4, H))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = lstm(<args>)
nnabla.parametric_functions.gru(x, h, w0_init=None, w_init=None, b_init=None, num_layers=1, dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]

GRU (gated recurrent units).

GRU is defined as follows:

\[\begin{split}r_t&=\sigma(W_rx_t+U_rh_{t-1}+b_r) \\ z_t&=\sigma(W_zx_t+U_zh_{t-1}+b_z) \\ n_t&=\tanh(W_nx_t+b_{in}+r_t \odot (U_nh_{t-1}+b_{hn})) \\ h_t&=(1-z_t) \odot n_t+z_t \odot h_{t-1}.\end{split}\]

We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
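The GRU update can be illustrated with one NumPy time step. The helper gru_step is hypothetical; the ordering of the three weight blocks (r, z, n) and of the four bias vectors (b_r, b_z, b_in, b_hn) is an assumption for this sketch. The candidate state is gated by the reset gate \(r_t\).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step. W: (3, H, I), U: (3, H, H), b: (4, H) as b_r, b_z, b_in, b_hn (assumed)."""
    r_t = sigmoid(x_t @ W[0].T + h_prev @ U[0].T + b[0])               # reset gate
    z_t = sigmoid(x_t @ W[1].T + h_prev @ U[1].T + b[1])               # update gate
    n_t = np.tanh(x_t @ W[2].T + b[2] + r_t * (h_prev @ U[2].T + b[3]))  # candidate state
    return (1.0 - z_t) * n_t + z_t * h_prev

B, I, H = 2, 3, 4
rng = np.random.RandomState(0)
x_t, h_prev = rng.randn(B, I), rng.randn(B, H)
W, U, b = rng.randn(3, H, I), rng.randn(3, H, H), np.zeros((4, H))
h_t = gru_step(x_t, h_prev, W, U, b)
```

When the update gate saturates at 1, the step simply carries h_prev forward, which is a handy sanity check for an implementation.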

References

K. Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” Empirical Methods in Natural Language Processing. 2014.

Parameters:
  • x (Variable) – Input N-D array with shape \((T, B, I)\).
  • h (Variable) – Input N-D array with shape \((L, D, B, H)\).
  • w0_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight at the first layer. Shape is \((D, 3, H, I + H)\).
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weights at the second layer and up. Shape is \((L-1, D, 3, H, D * H + H)\).
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. Shape is \((L, D, 4, H)\).
  • num_layers (int, optional) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. Default is 1.
  • dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.0.
  • bidirectional (bool, optional) – If True, bidirectional computation will be performed in each layer. Default is False.
  • training (bool, optional) – Backpropagation will be performed only when it is true. Default is True.
  • with_bias (bool, optional) – Specify whether to include the bias term.
Returns:

  • Variable: Output \(y\) with shape \((T, B, D * H)\)
  • Variable: Output \(h_n\) with shape \((L, D, B, H)\)

Return type:

Variable

Example

x = nn.Variable((seq_len, batch_size, input_size))
h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
y, hn = PF.gru(x, h)
Parameters to be registered

The following variables are registered in a parameter scope "gru";

  • weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, 3, H, I + H))
  • weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, 3, H, D * H + H))
  • bias (need_grad=True) : Biases. (shape: (L, D, 4, H))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = gru(<args>)
nnabla.parametric_functions.embed(inp, n_inputs, n_features, initializer=None, fix_parameters=False, apply_w=None, name=None)[source]

Embed.

Embed slices a matrix/tensor with indexing array/tensor. Weights are initialized with nnabla.initializer.UniformInitializer within the range of \(-\sqrt{3}\) and \(\sqrt{3}\).
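The slicing behaviour corresponds to integer-array indexing of the embedding matrix. The NumPy sketch below only illustrates the shapes; the sizes and index values are made up for the example.

```python
import numpy as np

n_inputs, n_features = 10, 4
rng = np.random.RandomState(0)
# Embedding matrix drawn uniformly from [-sqrt(3), sqrt(3)].
W = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_inputs, n_features))

indices = np.array([[1, 5, 5], [0, 9, 2]])   # integer indices with shape (2, 3)
out = W[indices]                              # result has shape (2, 3, n_features)
```

Every index is replaced by the corresponding row of W, so repeated indices yield identical feature vectors.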

Parameters:
  • inp (Variable) – [Integer] Indices with shape \((I_0, ..., I_N)\)
  • n_inputs – number of possible inputs, e.g. words in a vocabulary
  • n_features – number of embedding features
  • fix_parameters (bool) – When set to True, the embedding weight matrix will not be updated.
  • apply_w (function) – Lambda, function, or callable object applied to the weights.
Returns:

Output with shape \((I_0, ..., I_N, W_1, ..., W_M)\)

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "embed";

  • W (need_grad=True) : Embedding matrix. (shape: (n_inputs, n_features))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = embed(<args>)
nnabla.parametric_functions.prelu(inp, base_axis=1, shared=True, fix_parameters=False, slope_init=None, name=None)[source]

Parametrized Rectified Linear Unit function defined as

\[y_i = \max(0, x_i) + w_i \min(0, x_i)\]

where the negative slope \(w\) is learned and can vary across channels (an axis specified with base_axis). By default, the slopes are initialized with 0.25 (see slope_init).
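The formula can be checked with a small NumPy sketch. The helper prelu_reference is hypothetical and the slope values are arbitrary; a scalar slope plays the role of shared=True, while a 1-D slope holds one value per channel along base_axis.

```python
import numpy as np

def prelu_reference(x, slope, base_axis=1):
    """y = max(0, x) + w * min(0, x), with a scalar or per-channel slope w."""
    w = np.asarray(slope, dtype=x.dtype)
    if w.ndim == 1:
        # Reshape the per-channel slopes so they broadcast along base_axis.
        shape = [1] * x.ndim
        shape[base_axis] = w.shape[0]
        w = w.reshape(shape)
    return np.maximum(0, x) + w * np.minimum(0, x)

x = np.array([[-2.0, 4.0], [6.0, -8.0]])
shared = prelu_reference(x, 0.25)               # [[-0.5, 4.0], [6.0, -2.0]]
per_channel = prelu_reference(x, [0.5, 0.1])    # [[-1.0, 4.0], [6.0, -0.8]]
```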

Parameters:
  • inp (Variable) – N-D array as input
  • base_axis (int) – Dimensions up to base_axis are treated as sample dimensions.
  • shared (bool) – Use shared weight value or not
  • fix_parameters (bool) – When set to True, the negative slope values will not be updated.
  • slope_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer of negative slopes. By default, they are initialized with 0.25.
Returns:

N-D array.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "prelu";

  • slope (need_grad=True) : Negative slope. (shape: tuple() if shared else (inp.shape[base_axis],))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = prelu(<args>)
nnabla.parametric_functions.svd_affine(inp, n_outmaps, r, base_axis=1, uv_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

SVD affine is a low rank approximation of the affine layer. It can be seen as two consecutive affine layers with a bottleneck. It computes:

\[{\mathbf y} = {\mathbf U} {\mathbf V} {\mathbf x} + {\mathbf b}.\]

where \({\mathbf x}, {\mathbf y}\) are the inputs and outputs respectively, and \({\mathbf U}, {\mathbf V}, {\mathbf b}\) are constants.

The weights \({\mathbf U}\) and \({\mathbf V}\) are approximated with singular value decomposition (SVD) of the original weight matrix \({\mathbf W}\) and by selecting the \({R}\) dominant singular values and the corresponding singular vectors. Therefore the low rank \({R}\) is the size of the bottleneck.

If uv_init is a numpy array, \({\mathbf U}\) and \({\mathbf V}\) are computed such that uv_init is approximated by \({\mathbf{UV}}\). If uv_init is None or an initializer, the product of \({\mathbf U}\) and \({\mathbf V}\) approximates the random initialization.
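A factorization of this kind can be sketched with a truncated SVD in NumPy. The helper low_rank_factors is hypothetical, and splitting the singular values evenly between the two factors is an arbitrary choice for this illustration, not necessarily what NNabla does. The shapes follow the registered parameters, U of shape (inmaps, r) and V of shape (r, outmaps), so the layer is applied to row vectors as x @ U @ V.

```python
import numpy as np

def low_rank_factors(w, r):
    """Split w (inmaps, outmaps) into U (inmaps, r) and V (r, outmaps) by truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    root = np.sqrt(s[:r])
    # Keep the r dominant singular values, shared evenly between both factors.
    return u[:, :r] * root, root[:, None] * vt[:r]

rng = np.random.RandomState(0)
w = rng.randn(8, 3) @ rng.randn(3, 6)   # a rank-3 matrix of shape (8, 6)
U, V = low_rank_factors(w, r=3)
x = rng.randn(5, 8)
```

Because w has rank 3 here, the rank-3 factorization reproduces it up to rounding; for a full-rank w the product U @ V would only approximate it.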

If \({\mathbf U}\) and \({\mathbf V}\) exist in the context, they take precedence over uv_init.

Suppose the weight of the affine is of \({I \times O}\) and the compression rate you want to specify is \({CR}\), then you set \({R}\) as

\[R = \left\lfloor \frac{(1 - CR)OI}{O + I} \right\rfloor.\]
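As a worked example of this formula, the short helper below (a hypothetical name, plain Python) computes \(R\) and compares the parameter counts of the dense and factorized weights, ignoring the bias.

```python
import math

def svd_affine_rank(inmaps, outmaps, compression_rate):
    """R = floor((1 - CR) * O * I / (O + I))."""
    return math.floor((1.0 - compression_rate) * outmaps * inmaps / (outmaps + inmaps))

r = svd_affine_rank(inmaps=1024, outmaps=512, compression_rate=0.75)
original = 1024 * 512          # parameters of the dense weight matrix
factorized = r * (1024 + 512)  # parameters of U (I x R) plus V (R x O)
```

Here r is 85, and the factorized layer holds 130560 weights, slightly under a quarter of the original 524288.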
Parameters:
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "svd_affine";

  • U (need_grad=True) : \({\mathbf U}\). (shape: (inmaps, r))
  • V (need_grad=True) : \({\mathbf V}\). (shape: (r, outmaps))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = svd_affine(<args>)
nnabla.parametric_functions.svd_convolution(inp, outmaps, kernel, r, pad=None, stride=None, dilation=None, uv_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

SVD convolution is a low rank approximation of the convolution layer. It can be seen as a depthwise convolution followed by a 1x1 convolution.

The flattened kernels for the i-th input map are expressed by their low rank approximation. The kernels for the i-th input \({\mathbf W_i}\) are approximated with the singular value decomposition (SVD) and by selecting the \({R}\) dominant singular values and the corresponding singular vectors.

\[{\mathbf W_{:,i,:}} \approx {\mathbf U_i} {\mathbf V_i}.\]

\({\mathbf U}\) contains the weights of the depthwise convolution with multiplier \({R}\) and \({\mathbf V}\) contains the weights of the 1x1 convolution.

If uv_init is a numpy array, \({\mathbf U}\) and \({\mathbf V}\) are computed such that uv_init is approximated by \({\mathbf{UV}}\). If uv_init is None or an initializer, the product of \({\mathbf U}\) and \({\mathbf V}\) approximates the random initialization.

If \({\mathbf U}\) and \({\mathbf V}\) exist in the context, they take precedence over uv_init.

Suppose the kernel tensor of the convolution is of \({O \times I \times K \times K}\) and the compression rate you want to specify is \({CR}\), then you set \({R}\) as

\[R = \left\lfloor \frac{(1 - CR)OIK^2}{I(O + K^2)} \right\rfloor.\]
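The denominator \(I(O + K^2)\) is the number of weights that each unit of rank adds to the factorized layer, which the following plain-Python sketch makes explicit. The helper name is hypothetical and a square \(K \times K\) kernel is assumed.

```python
import math

def svd_convolution_rank(outmaps, inmaps, kernel_size, compression_rate):
    """R = floor((1 - CR) * O * I * K^2 / (I * (O + K^2)))."""
    k2 = kernel_size * kernel_size
    dense = outmaps * inmaps * k2
    return math.floor((1.0 - compression_rate) * dense / (inmaps * (outmaps + k2)))

O, I, K = 64, 32, 3
r = svd_convolution_rank(O, I, K, compression_rate=0.5)
u_params = I * r * K * K   # U: (inmaps * r, K, K), depthwise part
v_params = O * I * r       # V: (outmaps, inmaps * r, 1, 1), 1x1 part
```

For these sizes r is 3, giving 864 + 6144 = 7008 weights against 18432 in the dense kernel.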
Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3, 5).
  • r (int) – Rank of the factorized layer.
  • pad (tuple) – Padding sizes (int) for dimensions.
  • stride (tuple) – Stride sizes (int) for dimensions.
  • dilation (tuple) – Dilation sizes (int) for dimensions.
  • uv_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "svd_conv";

  • U (need_grad=True) : Decomposed filter weights \({\mathbf U}\). (shape: (inmaps * r, *kernel))
  • V (need_grad=True) : Decomposed filter weights \({\mathbf V}\). (shape: (outmaps, inmaps * r, 1, ...))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = svd_convolution(<args>)
nnabla.parametric_functions.cpd3_convolution(inp, outmaps, kernel, r, pad=None, stride=None, dilation=None, oik_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, max_iter=500, stopping_criterion=1e-05, lambda_reg=0.0, name=None)[source]

CP convolution is a low rank approximation of a convolution layer. A 3D tensor containing the parameter is built by collapsing the N-D kernels into 1D, then the tensor is decomposed into three matrices. The decomposed layer can be seen as linear combinations of the input feature maps to \({R}\) feature maps followed by a depthwise convolution and followed by linear combinations of the feature maps to compute the output feature maps.

The CP decomposition allows the kernel tensor to be approximated by \({R}\) rank-1 tensors of the form:

\[\sum_{r=1}^{R} \lambda_r {\mathbf{o}^{(r)} \otimes \mathbf{i}^{(r)} \otimes \mathbf{k}^{(r)}},\]

where \({\lambda}_r\) is the normalization coefficient and \({\otimes}\) is the outer product.
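The sum of rank-1 terms can be written as a single einsum in NumPy. In this sketch the factor matrices store one column per rank-1 term, which is a layout chosen for the example and differs from the registered parameter shapes; all names are hypothetical.

```python
import numpy as np

def cp_reconstruct(lam, o_f, i_f, k_f):
    """Sum over r of lam[r] * outer(o_f[:, r], i_f[:, r], k_f[:, r])."""
    return np.einsum("r,or,ir,kr->oik", lam, o_f, i_f, k_f)

rng = np.random.RandomState(0)
O, I, K, R = 4, 3, 9, 2        # K is the flattened kernel size, e.g. 3 * 3
lam = rng.rand(R)
o_f, i_f, k_f = rng.randn(O, R), rng.randn(I, R), rng.randn(K, R)
tensor = cp_reconstruct(lam, o_f, i_f, k_f)   # shape (O, I, K)
```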

If oik_init is a numpy array, O, I and K are computed such that oik_init is approximated by their CP decomposition. If oik_init is None or an initializer, the decomposition given by O, I and K approximates the randomly initialized array.

If O, I and K exist in context, they are used to initialize the layer and oik_init is not used.

Suppose the kernel tensor of the convolution is of \({O \times I \times K \times K}\) and the compression rate you want to specify is \({CR}\), then you set \({R}\) as

\[R = \left\lfloor \frac{(1 - CR)OIK^2}{O + I + K^2} \right\rfloor.\]

References

  • Lebedev, Vadim, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky, “Speeding-up convolutional neural networks using fine-tuned cp-decomposition.”, arXiv preprint arXiv:1412.6553 (2014).
  • Marcella Astrid, Seung-Ik Lee, “CP-decomposition with Tensor Power Method for Convolutional Neural Networks Compression”, BigComp 2017.
Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • r (int) – rank of the factorized layer
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • oik_init (numpy array or nnabla.initializer.BaseInitializer) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. It is initialized with zeros if with_bias is True.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • max_iter (int) – Max iteration of the ALS.
  • stopping_criterion (float) – Threshold for stopping the ALS. If the value is negative, the convergence check is ignored; in other words, it may reduce the computation time.
  • lambda_reg (float) – regularization parameter for the ALS. Larger lambda_reg means larger regularization.
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "cpd3_conv";

  • I (need_grad=True) : Decomposed filter weights \({\mathbf I}\). (shape: (r, inmaps, 1, ...))
  • K (need_grad=True) : Decomposed filter weights \({\mathbf K}\). (shape: (r, *kernel))
  • O (need_grad=True) : Decomposed filter weights \({\mathbf O}\). (shape: (outmaps, r, 1, ...))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = cpd3_convolution(<args>)
nnabla.parametric_functions.binary_connect_affine(inp, n_outmaps, base_axis=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Connect Affine, multiplier-less inner-product.

Binary Connect Affine is an affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_j = \sum_{i} sign(w_{ji}) x_i.\]

Therefore \(sign(w_{ji})\) is either \(1\) or \(-1\) and the inner product simplifies to addition.
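A forward pass with binarized weights can be sketched in NumPy as below. The helper names are hypothetical, the bias is omitted, and mapping exact zeros to quantize_zero_to is an assumption based on the argument name in the signature.

```python
import numpy as np

def binarize(w, quantize_zero_to=1.0):
    """Map every weight to +1 or -1; exact zeros go to quantize_zero_to (assumed)."""
    return np.where(w == 0, quantize_zero_to, np.sign(w))

def binary_connect_affine_reference(x, w):
    """x: (B, inmaps), w: (inmaps, outmaps); bias omitted."""
    return x @ binarize(w)

w = np.array([[0.3, -1.2], [-0.7, 0.0], [2.0, 0.5]])
x = np.array([[1.0, 2.0, 3.0]])
y = binary_connect_affine_reference(x, w)   # [[1 - 2 + 3, -1 + 2 + 3]] = [[2.0, 4.0]]
```

Since the binarized weights are only signs, each output is a signed sum of the inputs.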

This function should be used together with Batch Normalization.

References

M. Courbariaux, Y. Bengio, and J.-P. David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” Advances in Neural Information Processing Systems. 2015.

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point numbers for binary_weight, since this function is only for simulation purposes.

Parameters:
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bicon_affine";

  • W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))
  • Wb (need_grad=False) : Binarized weights. (shape: (inmaps, outmaps))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_connect_affine(<args>)
nnabla.parametric_functions.binary_connect_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Connect Convolution, multiplier-less inner-product.

Binary Connect Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} sign(w_{n, m, i, j}) x_{m, a + i, b + j}.\]

Therefore \(sign(w_{n, m, i, j})\) is either \(1\) or \(-1\) and the inner product simplifies to addition.
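The triple sum can be spelled out with a direct, unoptimized NumPy loop. This sketch assumes a single sample, stride 1, no padding and no bias, and it maps zero weights to +1; the helper name is hypothetical.

```python
import numpy as np

def binary_connect_conv2d_reference(x, w):
    """x: (M, H, W), w: (N, M, Kh, Kw); returns y of shape (N, H - Kh + 1, W - Kw + 1)."""
    n_out, _, kh, kw = w.shape
    wb = np.where(w >= 0, 1.0, -1.0)               # sign of each weight (zero -> +1, assumed)
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    y = np.zeros((n_out, out_h, out_w))
    for a in range(out_h):
        for b in range(out_w):
            patch = x[:, a:a + kh, b:b + kw]       # (M, Kh, Kw)
            # Sum over m, i, j of sign(w[n, m, i, j]) * x[m, a + i, b + j].
            y[:, a, b] = np.tensordot(wb, patch, axes=3)
    return y

x = np.ones((1, 3, 3))
w = np.array([[[[0.2, -0.4], [1.5, 0.1]]]])        # signs: +, -, +, +
y = binary_connect_conv2d_reference(x, w)          # every output equals 1 - 1 + 1 + 1 = 2
```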

This function should be used together with Batch Normalization.

References

M. Courbariaux, Y. Bengio, and J.-P. David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” Advances in Neural Information Processing Systems. 2015.

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point numbers for binary_weight, since this function is only for simulation purposes.

Parameters:
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bicon_conv";

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))
  • Wb (need_grad=False) : Binarized filter weights. (shape: (outmaps, inmaps, *kernel))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_connect_convolution(<args>)
nnabla.parametric_functions.binary_weight_affine(inp, n_outmaps, base_axis=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Weight Affine, multiplier-less inner-product with a scale factor.

Binary Weight Affine is the affine function, but the inner product in this function is the following,

\[y_j = \frac{1}{\|\mathbf{w}_j\|_{\ell_1}} \sum_{i} sign(w_{ji}) x_i\]

Therefore \(sign(w_{ji})\) is either \(1\) or \(-1\) and the inner product simplifies to addition followed by multiplication with the scaling factor \(\alpha = \frac{1}{\|\mathbf{w}_j\|_{\ell_1}}\). The number of scaling factors \(\alpha\) equals the number of outmaps of the affine function.
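The scaled inner product, exactly as written in the formula above, can be sketched in NumPy. The helper name is hypothetical, the bias is omitted, zero weights are mapped to +1 as an assumption, and column j of the (inmaps, outmaps) weight matrix is taken to be \(\mathbf{w}_j\).

```python
import numpy as np

def binary_weight_affine_reference(x, w):
    """x: (B, inmaps), w: (inmaps, outmaps); returns the output and the scaling factors."""
    alpha = 1.0 / np.abs(w).sum(axis=0)     # one factor per outmap, shape (outmaps,)
    wb = np.where(w >= 0, 1.0, -1.0)        # binarized weights (zero -> +1, assumed)
    return (x @ wb) * alpha, alpha

w = np.array([[0.5, -1.0], [-1.5, 3.0]])
x = np.array([[2.0, 1.0]])
y, alpha = binary_weight_affine_reference(x, w)
# alpha = [1 / 2, 1 / 4]; x @ wb = [[1, -1]]; y = [[0.5, -0.25]]
```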

References

Rastegari, Mohammad, et al. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv preprint arXiv:1603.05279 (2016).

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point numbers for binary_weight, since this function is only for simulation purposes.

Parameters:
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bwn_affine";

  • W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))
  • Wb (need_grad=False) : Binarized weights. (shape: (inmaps, outmaps))
  • alpha (need_grad=False) : Scaling factor \(\alpha\). (shape: (outmaps,))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_weight_affine(<args>)
nnabla.parametric_functions.binary_weight_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Binary Weight Convolution, multiplier-less inner-product with a scale factor.

Binary Weight Convolution is the convolution function, but the inner product in this function is the following,

\[y_{n, a, b} = \frac{1}{\|\mathbf{w}_n\|_{\ell_1}} \sum_{m} \sum_{i} \sum_{j} sign(w_{n, m, i, j}) x_{m, a + i, b + j}.\]

Therefore \(sign(w_{n, m, i, j})\) is either \(1\) or \(-1\) and the inner product simplifies to addition followed by multiplication with the scaling factor \(\alpha = \frac{1}{\|\mathbf{w}_n\|_{\ell_1}}\). The index \(n\) runs over the outmaps of the convolution function, so there is one \(\alpha\) per outmap.

References

Rastegari, Mohammad, et al. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv preprint arXiv:1603.05279 (2016).

Note

1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).

2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.

3) Quantized values are stored as floating point numbers for binary_weight, since this function is only for simulation purposes.

Parameters:
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "bwn_conv";

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))
  • Wb (need_grad=False) : Binarized filter weights. (shape: (outmaps, inmaps, *kernel))
  • alpha (need_grad=False) : Scaling factor \(\alpha\). (shape: (outmaps,))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = binary_weight_convolution(<args>)
nnabla.parametric_functions.inq_affine(inp, n_outmaps, base_axis=1, num_bits=4, inq_iterations=(), selection_algorithm='random', seed=-1, w_init=None, i_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Incremental Network Quantization Affine Layer

During training, the weights are sequentially quantized to power-of-two values, which allows the training of a multiplierless network.

Using inq_iterations, one can specify after how many forward passes half of the learnable weights are fixed and quantized to powers-of-two. After reaching the last value in inq_iterations, all weights are fixed.
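The resulting schedule can be illustrated with a small plain-Python helper. The name fixed_fraction is hypothetical, and treating each entry as taking effect once the pass count reaches it (>=) is an assumption; the sketch only tracks what fraction of the weights is fixed, not which weights are selected or how they are quantized.

```python
def fixed_fraction(forward_passes, inq_iterations):
    """Fraction of weights that are fixed after a given number of forward passes."""
    if not inq_iterations:
        return 0.0                # no schedule given: nothing is fixed (assumed)
    if forward_passes >= inq_iterations[-1]:
        return 1.0                # after the last entry, all weights are fixed
    learnable = 1.0
    for iteration in inq_iterations[:-1]:
        if forward_passes >= iteration:
            learnable /= 2.0      # half of the still-learnable weights get fixed
    return 1.0 - learnable

schedule = (100, 200, 300)
fractions = [fixed_fraction(n, schedule) for n in (50, 150, 250, 350)]
# fractions == [0.0, 0.5, 0.75, 1.0]
```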

For more details, please refer to the reference.

Reference: Zhou A, Yao A, Guo Y, Xu L, Chen Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. <https://arxiv.org/abs/1702.03044>

Parameters:
  • inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it were a matrix.
  • n_outmaps (int or tuple of int) – Number of output neurons per data.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • num_bits (int) – Number of bits per weight. The value has to be larger than 1, as one bit is already used to code the value “0”.
  • inq_iterations (tuple of int) – Tuple of iteration numbers at which we fix half of the weights.
  • selection_algorithm (str) – Chooses the algorithm that is used to decide which weights are fixed (“largest_abs”: fix weights with the largest absolute value; “random”: fix weights randomly).
  • seed (int) – Random seed for INQ algorithm
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • i_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for indicators (0 … learnable, 1 … fixed). By default, it is initialized with zeros.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • fix_parameters (bool) – When set to True, the weight and bias will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "inq_affine":

  • W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))
  • I (need_grad=False) : Binary indicator matrix of fixed weights. (shape: (inmaps, outmaps))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = inq_affine(<args>)
nnabla.parametric_functions.inq_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, num_bits=4, inq_iterations=(), selection_algorithm='random', seed=-1, w_init=None, i_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]

Incremental Network Quantization Convolution Layer

During training, the weights are sequentially quantized to power-of-two values, which allows the training of a multiplierless network.

Using inq_iterations, one can specify after how many forward passes half of the learnable weights are fixed and quantized to powers-of-two. After reaching the last value in inq_iterations, all weights are fixed.

For more details, please refer to the reference.

Reference: Zhou A, Yao A, Guo Y, Xu L, Chen Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. <https://arxiv.org/abs/1702.03044>

Parameters:
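  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.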
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • num_bits (int) – Number of bits per weight. The value has to be larger than 1, as one bit is already used to code the value “0”.
  • inq_iterations (tuple of int) – Tuple of iteration numbers at which we fix half of the weights.
  • selection_algorithm (str) – Chooses algorithm that is used to decide which weights are fixed. (“largest_abs” … fix weights with largest absolute value, “random” … fix weights randomly)
  • seed (int) – Random seed for INQ algorithm
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • i_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the indicators (0 … learnable, 1 … fixed). By default, it is initialized with zeros.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the bias. By default, it is initialized with zeros if with_bias is True.
  • fix_parameters (bool) – When set to True, the weight and bias will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "inq_conv":

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))
  • I (need_grad=False) : Binary indicator matrix of fixed weights. (shape: (outmaps, inmaps, *kernel))
  • b (need_grad=True) : Bias vector. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = inq_convolution(<args>)
nnabla.parametric_functions.fixed_point_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, n_w=8, delta_w=0.0625, ste_fine_grained_w=True, quantize_b=True, sign_b=True, n_b=8, delta_b=0.0625, ste_fine_grained_b=True, name=None)[source]

Fixed-Point Quantized Affine.

Fixed-Point Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_j = \sum_{i} Q(w_{ji}) x_i,\]

where \(Q(w_{ji})\) is the fixed-point quantization function.
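
The relation above can be sketched in plain Python. This is a minimal illustration, assuming a common definition of symmetric fixed-point quantization (round to a multiple of the step size, then clip to the n-bit range, with one bit reserved for the sign). The helper names are hypothetical, and the library's exact rounding and range may differ.

```python
import math

def fixed_point_quantize(w, sign=True, n=8, delta=0.0625):
    """Round w to a multiple of delta and clip it to the n-bit range (assumed definition)."""
    if sign:
        q_max = (2 ** (n - 1) - 1) * delta  # one bit is spent on the sign
        q_min = -q_max
    else:
        q_max = (2 ** n - 1) * delta
        q_min = 0.0
    # Round half away from zero to the nearest multiple of delta.
    q = math.copysign(math.floor(abs(w) / delta + 0.5) * delta, w)
    return min(max(q, q_min), q_max)

def quantized_affine(x, weights):
    """y_j = sum_i Q(w_ji) * x_i, with weights[j][i] holding w_ji."""
    return [sum(fixed_point_quantize(w_ji) * x_i for w_ji, x_i in zip(row, x))
            for row in weights]

y = quantized_affine([1.0, 2.0], [[0.1, -0.3], [20.0, 0.0]])
```

With the default n=8 and delta=0.0625, the weight 20.0 saturates at 127 × 0.0625 = 7.9375, which shows how the bit width and the step size together bound the representable range.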

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) The CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters:
  • inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
  • n_outmaps (int or tuple of int) – Number of output neurons per data.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • quantize_w (bool) – Quantize weights if True.
  • sign_w (bool) – Use signed quantization if True.
  • n_w (int) – Bit width used for weight.
  • delta_w (float) – Step size for weight.
  • ste_fine_grained_w (bool) – STE is fine-grained if True.
  • quantize_b (bool) – Quantize bias if True.
  • sign_b (bool) – Use signed quantization if True.
  • n_b (int) – Bit width used for bias.
  • delta_b (float) – Step size for bias.
  • ste_fine_grained_b (bool) – STE is fine-grained if True.
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "fp_quantized_affine":

  • W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = fixed_point_quantized_affine(<args>)
nnabla.parametric_functions.fixed_point_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, n_w=8, delta_w=0.0625, ste_fine_grained_w=True, quantize_b=True, sign_b=True, n_b=8, delta_b=0.0625, ste_fine_grained_b=True, name=None)[source]

Fixed-Point Quantized Convolution.

Fixed-Point Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]

where \(Q(w_{n, m, i, j})\) is the fixed-point quantization function.
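
The triple sum above can be written out with explicit loops. The sketch below is only illustrative: it assumes stride 1, no padding, no dilation and a single group, and it uses a hypothetical stand-in quantize for \(Q\) that rounds to the nearest multiple of a step size.

```python
import math

def quantize(w, delta=0.25):
    """Hypothetical stand-in for Q: round to the nearest multiple of delta."""
    return math.floor(w / delta + 0.5) * delta

def quantized_conv2d(x, w):
    """x[m][row][col]: input maps; w[n][m][i][j]: filters. Stride 1, no padding."""
    n_out, n_in = len(w), len(x)
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    out_h = len(x[0]) - k_h + 1
    out_w = len(x[0][0]) - k_w + 1
    y = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    for n in range(n_out):
        for a in range(out_h):
            for b in range(out_w):
                # y[n, a, b] = sum over m, i, j of Q(w[n, m, i, j]) * x[m, a+i, b+j]
                y[n][a][b] = sum(
                    quantize(w[n][m][i][j]) * x[m][a + i][b + j]
                    for m in range(n_in)
                    for i in range(k_h)
                    for j in range(k_w)
                )
    return y

x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]]   # 1 input map of size 2 x 3
w = [[[[0.3, -0.6]]]]     # 1 output map, 1 input map, 1 x 2 kernel
y = quantized_conv2d(x, w)
```

Here the filter (0.3, -0.6) is quantized to (0.25, -0.5) before it is applied, so the output differs from what the float filter would produce.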

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) The CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • quantize_w (bool) – Quantize weights if True.
  • sign_w (bool) – Use signed quantization if True.
  • n_w (int) – Bit width used for weight.
  • delta_w (float) – Step size for weight.
  • ste_fine_grained_w (bool) – STE is fine-grained if True.
  • quantize_b (bool) – Quantize bias if True.
  • sign_b (bool) – Use signed quantization if True.
  • n_b (int) – Bit width used for bias.
  • delta_b (float) – Step size for bias.
  • ste_fine_grained_b (bool) – STE is fine-grained if True.
Returns:

N-D array.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "fp_quantized_conv":

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = fixed_point_quantized_convolution(<args>)
nnabla.parametric_functions.min_max_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, ql_min_w=0, ql_max_w=255, w_min_max=False, qr_min_w_init=None, qr_max_w_init=None, ste_fine_grained_w=True, quantize_b=True, ql_min_b=0, ql_max_b=255, b_min_max=False, qr_min_b_init=None, qr_max_b_init=None, ste_fine_grained_b=True, eps=0.01, name=None)[source]

Min-max Quantized Affine.

Min-max Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_j = \sum_{i} Q(w_{ji}) x_i,\]

where \(Q(w_{ji})\) is the min-max quantization function.

In min_max_quantized_affine, the exponential moving average is not used. The min and max quantization ranges are either the min and max of the weights and bias, or trained.

Notice that the min and max values of inputs are always used instead of the exponential moving average.

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) The CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters:
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "min_max_quantized_affine":

  • W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
  • qr_min (need_grad=False) : Minimum quantization range. Minimum values of inputs or trainable range. (shape: ql_min.shape)
  • qr_max (need_grad=False) : Maximum quantization range. Maximum values of inputs or trainable range. (shape: ql_max.shape)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = min_max_quantized_affine(<args>)
nnabla.parametric_functions.min_max_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, ql_min_w=0, ql_max_w=255, w_min_max=False, qr_min_w_init=None, qr_max_w_init=None, ste_fine_grained_w=True, quantize_b=True, ql_min_b=0, ql_max_b=255, b_min_max=False, qr_min_b_init=None, qr_max_b_init=None, ste_fine_grained_b=True, eps=0.01, name=None)[source]

Min-max Quantized Convolution.

Min-max Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]

where \(Q(w_{n, m, i, j})\) is the min-max quantization function.

In min_max_quantized_convolution, the exponential moving average is not used. The min and max quantization ranges are either the min and max of the weights and bias, or trained.

Notice that the min and max values of inputs are always used instead of the exponential moving average.

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) The CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters:
Returns:

N-D array.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "min_max_quantized_conv":

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
  • qr_min (need_grad=False) : Minimum quantization range. Minimum values of inputs or trainable range. (shape: ql_min.shape)
  • qr_max (need_grad=False) : Maximum quantization range. Maximum values of inputs or trainable range. (shape: ql_max.shape)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = min_max_quantized_convolution(<args>)
nnabla.parametric_functions.pow2_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, with_zero_w=False, n_w=8, m_w=2, ste_fine_grained_w=True, quantize_b=True, sign_b=True, with_zero_b=False, n_b=8, m_b=2, ste_fine_grained_b=True, name=None)[source]

Pow2 Quantized Affine.

Pow2 Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_j = \sum_{i} Q(w_{ji}) x_i,\]

where \(Q(w_{ji})\) is the power-of-2 quantization function.
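
As a rough picture of what a power-of-2 quantizer does, the sketch below maps each weight to a signed power of two. It rests on assumptions that are not taken from the text above: the exponent is obtained by rounding \(\log_2 |w|\), and n_exponents is a hypothetical parameter giving the number of representable exponents ending at the upper bound \(m\). How the bit width n_w and with_zero_w determine the actual range is not modeled.

```python
import math

def pow2_quantize(w, m=2, n_exponents=8):
    """Map w to sign(w) * 2**e, with e rounded from log2|w| and clipped.

    n_exponents is a hypothetical parameter of this sketch: the number of
    representable exponents, the largest of which is m.
    """
    if w == 0.0:
        return 0.0  # returned as-is to avoid log2(0); with_zero is not modeled
    e = math.floor(math.log2(abs(w)) + 0.5)
    e = min(max(e, m - n_exponents + 1), m)  # keep the exponent in range
    return math.copysign(2.0 ** e, w)

q = [pow2_quantize(v) for v in (0.3, -3.0, 100.0, 0.01)]
```

Values above \(2^m\) saturate at the upper bound, and very small values saturate at the smallest representable power of two instead of collapsing to zero.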

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) Quantized values are stored as floating-point numbers in the quantized weights, since this function is only for simulation purposes.

Parameters:
  • inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
  • n_outmaps (int or tuple of int) – Number of output neurons per data.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • quantize_w (bool) – Quantize weights if True.
  • sign_w (bool) – Use signed quantization if True.
  • with_zero_w (bool) – Indicate using zero as a quantized value. Default is False.
  • n_w (int) – Bit width used for weight.
  • m_w (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for weights. Default is 2.
  • ste_fine_grained_w (bool) – STE is fine-grained if True.
  • quantize_b (bool) – Quantize bias if True.
  • sign_b (bool) – Use signed quantization if True.
  • with_zero_b (bool) – Indicate using zero as a quantized value. Default is False.
  • n_b (int) – Bit width used for bias.
  • m_b (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for bias. Default is 2.
  • ste_fine_grained_b (bool) – STE is fine-grained if True.
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pow2_quantized_affine":

  • W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pow2_quantized_affine(<args>)
nnabla.parametric_functions.pow2_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, with_zero_w=False, sign_w=True, n_w=8, m_w=2, ste_fine_grained_w=True, quantize_b=True, with_zero_b=False, sign_b=True, n_b=8, m_b=2, ste_fine_grained_b=True, name=None)[source]

Pow2 Quantized Convolution.

Pow2 Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]

where \(Q(w_{n, m, i, j})\) is the power-of-2 quantization function.

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) Quantized values are stored as floating-point numbers in the quantized weights, since this function is only for simulation purposes.

Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • quantize_w (bool) – Quantize weights if True.
  • sign_w (bool) – Use signed quantization if True.
  • with_zero_w (bool) – Indicate using zero as a quantized value. Default is False.
  • n_w (int) – Bit width used for weight.
  • m_w (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for weights. Default is 2.
  • ste_fine_grained_w (bool) – STE is fine-grained if True.
  • quantize_b (bool) – Quantize bias if True.
  • sign_b (bool) – Use signed quantization if True.
  • with_zero_b (bool) – Indicate using zero as a quantized value. Default is False.
  • n_b (int) – Bit width used for bias.
  • m_b (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for bias. Default is 2.
  • ste_fine_grained_b (bool) – STE is fine-grained if True.
Returns:

N-D array.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pow2_quantized_conv":

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pow2_quantized_convolution(<args>)
nnabla.parametric_functions.pruned_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, prune_w=True, rate_w=0.9, prune_b=True, rate_b=0.9, name=None)[source]

Pruned Affine.

Pruned Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_j = \sum_{i} Q(w_{ji}) x_i,\]

where \(Q(w_{ji})\) is the pruning function, i.e., F.prune.
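
A minimal sketch of magnitude pruning may help to picture \(Q\) here. It assumes that pruning zeroes the entries whose magnitude falls below a threshold taken from the sorted magnitudes at index int((size - 1) * rate). This rule and the helper name prune are assumptions of the sketch; consult the F.prune documentation for the exact definition.

```python
def prune(values, rate=0.9):
    """Zero the entries whose magnitude is below a rate-dependent threshold.

    Assumes a non-empty list of floats.
    """
    magnitudes = sorted(abs(v) for v in values)
    threshold = magnitudes[int((len(values) - 1) * rate)]
    return [0.0 if abs(v) < threshold else v for v in values]

def pruned_affine_row(x, w_row, rate=0.9):
    """One output y_j = sum_i Q(w_ji) * x_i, with Q pruning the row of weights."""
    return sum(q * x_i for q, x_i in zip(prune(w_row, rate), x))

pruned = prune([0.1, -0.5, 0.3, 2.0, -1.0], rate=0.5)
y_j = pruned_affine_row([1.0, 1.0, 1.0, 1.0, 1.0],
                        [0.1, -0.5, 0.3, 2.0, -1.0], rate=0.5)
```

In this example the two smallest-magnitude weights (0.1 and 0.3) are set to zero, and the surviving weights keep their original float values.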

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) The CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters:
  • inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
  • n_outmaps (int or tuple of int) – Number of output neurons per data.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • prune_w (bool) – Prune weights if True.
  • rate_w (float) – Pruning rate for weights.
  • prune_b (bool) – Prune bias if True.
  • rate_b (float) – Pruning rate for bias.
Returns:

\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pruned_affine":

  • W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pruned_affine(<args>)
nnabla.parametric_functions.pruned_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, prune_w=True, rate_w=0.9, prune_b=True, rate_b=0.9, name=None)[source]

Pruned Convolution.

Pruned Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:

\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]

where \(Q(w_{n, m, i, j})\) is the pruning function, i.e., F.prune.

Note

1) If you would like to share weights between layers, make sure to share the standard floating-point weights (weight) and not the quantized weights (quantized weight).

2) The weights and the quantized weights are synced only after forward() is called, not after a call to backward(). Before accessing the parameters of the network, call forward() once; otherwise the float weights and the quantized weights will not be in sync.

3) The CPU and GPU implementations currently use float values for the quantized weights, since this function is only for simulation purposes.

Parameters:
  • inp (Variable) – N-D array.
  • outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
  • kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
  • pad (tuple of int) – Padding sizes for dimensions.
  • stride (tuple of int) – Stride sizes for dimensions.
  • dilation (tuple of int) – Dilation sizes for dimensions.
  • group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
  • w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight.
  • b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias.
  • base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated.
  • rng (numpy.random.RandomState) – Random generator for Initializer.
  • with_bias (bool) – Specify whether to include the bias term.
  • prune_w (bool) – Prune weights if True.
  • rate_w (float) – Pruning rate for weights.
  • prune_b (bool) – Prune bias if True.
  • rate_b (float) – Pruning rate for bias.
Returns:

N-D array.

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "pruned_conv":

  • W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
  • b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
  • W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
  • b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = pruned_convolution(<args>)
nnabla.parametric_functions.min_max_quantize(x, ql_min=0, ql_max=255, decay=0.999, x_min_max=False, ema=False, ste_fine_grained=True, eps=0.01, qr_min_init=None, qr_max_init=None, fix_parameters=False, outputs=None, name=None)[source]

Min-max quantization.

This function uniformly quantizes values in the range of min and max quantization levels.

Min-max quantization is defined by the following equation:

\[y = round \left(\frac{\min(\max(x, m), M) - m}{scale} \right) \times scale + m,\]

where the \(scale\) is defined as

\[scale = \frac{M - m}{M_q - m_q},\]

and

\[\begin{split}m_q = ql_{min}, \\ M_q = ql_{max}, \\ m = qr_{min}, \\ M = qr_{max}.\end{split}\]

In the backward pass when using ste_fine_grained as false,

\[\frac{\partial q_i}{\partial x_i} = 1.\]

In the backward pass when using ste_fine_grained as true,

\[\begin{split} \frac{\partial q_i}{\partial x_i}= \left\{ \begin{array}{ll} 0 & if \ \ \ x_i > M \\ 1 & if \ \ m \le x_i \le M \\ 0 & if \ \ x_i < m \\ \end{array} \right..\end{split}\]
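
As a rough illustration of the forward formula and the two backward-pass rules above, the following pure-Python sketch works on one scalar at a time. The helper names `min_max_quantize_scalar` and `ste_grad` are hypothetical and not part of NNabla. The sketch also omits the zero-point adjustment that this section goes on to describe.

```python
def min_max_quantize_scalar(x, qr_min, qr_max, ql_min=0, ql_max=255):
    """Quantize one value: y = round((clip(x, m, M) - m) / scale) * scale + m."""
    scale = (qr_max - qr_min) / (ql_max - ql_min)
    clipped = min(max(x, qr_min), qr_max)
    # Note: Python's round() rounds exact ties to the nearest even integer,
    # which may differ from the library's rounding at exact ties.
    return round((clipped - qr_min) / scale) * scale + qr_min


def ste_grad(x, qr_min, qr_max, ste_fine_grained=True):
    """Gradient of the output w.r.t. x under the straight-through estimator."""
    if not ste_fine_grained:
        return 1.0  # gradient passes through everywhere
    return 1.0 if qr_min <= x <= qr_max else 0.0  # blocked outside [m, M]


# Hypothetical quantization range [-1, 1] with the default 256 levels.
print(min_max_quantize_scalar(0.3, -1.0, 1.0))
print(min_max_quantize_scalar(5.0, -1.0, 1.0))  # clipped to the upper bound
print(ste_grad(5.0, -1.0, 1.0), ste_grad(5.0, -1.0, 1.0, ste_fine_grained=False))
```

The quantized value is always within half a quantization step of the clipped input, which is why a finer grid (a larger ql_max - ql_min) reduces the rounding error.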

\(qr_{min}\) and \(qr_{max}\) are treated as follows.

  • x_min_max is True and ema is True: Exponential moving averages of \(min(x)\) and \(max(x)\) are computed and then stored in \(qr_{min}\) and \(qr_{max}\).
  • x_min_max is True and ema is False: \(min(x)\) and \(max(x)\) are computed then stored in \(qr_{min}\) and \(qr_{max}\).
  • x_min_max is False and ema is True: Exponential moving average stored in \(qr_{min}\) and \(qr_{max}\) are used.
  • x_min_max is False and ema is False: Gradients of \(qr_{min}\) and \(qr_{max}\) are computed in the backward pass.
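
The four cases can be sketched as a small range-selection helper. This is only an illustration: `update_range` is a hypothetical name, and the moving-average form `decay * old + (1 - decay) * observed` is an assumption about how the decay argument is applied, not something stated in this section.

```python
def update_range(x, qr_min, qr_max, x_min_max, ema, decay=0.999):
    """Return the (qr_min, qr_max) pair used for a batch x (a list of floats)."""
    if x_min_max and ema:
        # Assumed EMA form: new = decay * old + (1 - decay) * observed.
        qr_min = decay * qr_min + (1.0 - decay) * min(x)
        qr_max = decay * qr_max + (1.0 - decay) * max(x)
    elif x_min_max:
        # Use the batch statistics directly.
        qr_min, qr_max = min(x), max(x)
    # x_min_max False: the stored values are returned unchanged here.
    # With ema False they would instead be trained through their gradients,
    # which this sketch does not model.
    return qr_min, qr_max


batch = [-2.0, 0.5, 4.0]
print(update_range(batch, -6.0, 6.0, x_min_max=True, ema=True))
print(update_range(batch, -6.0, 6.0, x_min_max=True, ema=False))
print(update_range(batch, -6.0, 6.0, x_min_max=False, ema=True))
```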

More precisely, at inference time min-max quantization has to take into account the zero-point (zp), which is the integer quantization level that corresponds to the real value 0. The zero-point is defined as

\[\begin{split} zp_f &= ql_{min} - \frac{qr_{min}}{scale}, \\ zp &= \left\{ \begin{array}{ll} ql_{max} & if \ \ zp_f \ge ql_{max} \\ ql_{min} & if \ \ zp_f \le ql_{min} \\ round(zp_f) & otherwise \\ \end{array} \right..\end{split}\]

Accordingly, to simulate the quantization effect of the zero-point, \(qr_{min}\) and \(qr_{max}\) are adjusted as follows during both the forward and backward passes:

\[\begin{split}qr_{min}^{adj} = (ql_{min} - zp) * scale, \\ qr_{max}^{adj} = (ql_{max} - zp) * scale.\end{split}\]

This adjustment is often called nudging.

Finally, in the formulas of the min-max quantization, \(m\) and \(M\) are replaced by \(qr_{min}^{adj}\) and \(qr_{max}^{adj}\) respectively.
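
A minimal sketch of the zero-point computation and range adjustment follows. It reads the adjustment as (ql - zp) × scale, the form under which the integer level zp maps back to the real value 0. The function name `nudge_range` is hypothetical, and Python's `round()` (ties to even) stands in for whatever rounding the library uses.

```python
def nudge_range(qr_min, qr_max, ql_min=0, ql_max=255):
    """Return (qr_min_adj, qr_max_adj, zp) so that real 0 sits on an integer level."""
    scale = (qr_max - qr_min) / (ql_max - ql_min)
    zp_f = ql_min - qr_min / scale  # real-valued zero-point
    if zp_f <= ql_min:
        zp = ql_min
    elif zp_f >= ql_max:
        zp = ql_max
    else:
        zp = round(zp_f)
    # The real value represented by integer level q is (q - zp) * scale.
    return (ql_min - zp) * scale, (ql_max - zp) * scale, zp


# Hypothetical range [-0.26, 0.74] on an 11-level grid (levels 0..10).
adj_min, adj_max, zp = nudge_range(-0.26, 0.74, ql_min=0, ql_max=10)
print(adj_min, adj_max, zp)
```

The adjusted range keeps the same width as the original one; it is only shifted so that 0 is exactly representable.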

Parameters:

References

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877

Parameters to be registered

The following variables are registered in a parameter scope "min_max_quantize";

  • qr_min (need_grad=False) : Minimum quantization range, the exponential moving average of min values of inputs initialized with -6.0 if ema is True. (shape: ql_min.shape)
  • qr_max (need_grad=False) : Maximum quantization range, the exponential moving average of max values of inputs initialized with 6.0 if ema is True. (shape: ql_max.shape)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = min_max_quantize(<args>)
nnabla.parametric_functions.lstm_cell(x, h, c, state_size, w_init=None, b_init=None, fix_parameters=False, name=None)[source]

Long Short-Term Memory.

Long Short-Term Memory, or LSTM, is a building block for recurrent neural network (RNN) layers. An LSTM unit consists of a cell and input, output, and forget gates, whose functions are defined as follows:

\[\begin{split}f_t&&=\sigma(W_fx_t+U_fh_{t-1}+b_f) \\ i_t&&=\sigma(W_ix_t+U_ih_{t-1}+b_i) \\ o_t&&=\sigma(W_ox_t+U_oh_{t-1}+b_o) \\ c_t&&=f_t\odot c_{t-1}+i_t\odot\tanh(W_cx_t+U_ch_{t-1}+b_c) \\ h_t&&=o_t\odot\tanh(c_t).\end{split}\]
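
To make the gate equations concrete, here is a small pure-Python sketch of a single LSTM step on plain lists. The names `lstm_step` and `affine` and the `params` layout are hypothetical. The sketch keeps the four gates as separate (W, U, b) triples for readability, whereas the parameters registered by `lstm_cell` are stacked into a single affine weight and bias.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xj for w, xj in zip(W[i], x))
        + sum(u * hj for u, hj in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'c' to a (W, U, b) triple."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # cell candidate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# With all-zero parameters every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])  # input size 2, state size 1
h, c = lstm_step([1.0, 2.0], [0.3], [2.0], {k: zero for k in "fioc"})
print(h, c)
```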

References

S. Hochreiter, and J. Schmidhuber. “Long Short-Term Memory.” Neural Computation. 1997.

Parameters:
Returns:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "lstm";

  • affine/W (need_grad=True) : Stacked weight matrices of LSTM block. (shape: (inmaps, 4, state_size))
  • affine/b (need_grad=True) : Stacked bias vectors of LSTM block. (shape: (4, state_size,))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = lstm_cell(<args>)
class nnabla.parametric_functions.LSTMCell(batch_size, state_size, h=None, c=None, name=None)[source]
__call__(x, w_init, b_init, fix_parameters)[source]

Updates h and c by calling lstm function.

Parameters:
nnabla.parametric_functions.spectral_norm(w, dim=0, itr=1, eps=1e-12, test=False, u_init=None, fix_parameters=True, name=None)[source]

Spectral Normalization.

\[W_{sn} = \frac{W}{\sigma(W)}.\]

where \(W\) is the input matrix and \(\sigma(W)\) is its spectral norm, that is, its largest singular value. The spectral norm is approximated by power iteration.
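
The power iteration can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the library's implementation: the function names are hypothetical, the matrix is a plain list of rows, and the start vector is fixed to all ones for reproducibility.

```python
import math


def spectral_norm_estimate(W, itr=1, eps=1e-12):
    """Estimate the largest singular value of W by power iteration (itr >= 1)."""
    rows, cols = len(W), len(W[0])
    u = [1.0] * rows  # assumed fixed start vector
    for _ in range(itr):
        # v = W^T u / ||W^T u||
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(t * t for t in v)) + eps
        v = [t / norm_v for t in v]
        # u = W v / ||W v||
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(t * t for t in u)) + eps
        u = [t / norm_u for t in u]
    # sigma is approximated by u^T W v
    Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return sum(ui * t for ui, t in zip(u, Wv))


def spectral_normalize(W, itr=1):
    """Return W / sigma(W) using the estimate above."""
    sigma = spectral_norm_estimate(W, itr)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]  # singular values are 3 and 1
print(spectral_norm_estimate(W, itr=1))
print(spectral_norm_estimate(W, itr=20))
```

With a single iteration the estimate can be noticeably below the true value; more iterations tighten it.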

References

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, “Spectral Normalization for Generative Adversarial Networks”, International Conference on Learning Representations. 2018.

Parameters:
  • w (Variable) – Input N-D array. This is normally a network parameter.
  • dim (int) – Output dimension. Default is 0. If the dimension is not 0, the specified dimension is moved to the leftmost position by transposing.
  • itr (int) – Number of iterations. Default is 1.
  • eps (float) – Epsilon for the normalization. Default is 1e-12.
  • test (bool) – Use test mode. Default is False.
Returns:

Spectrally normalized \(W_{sn}\) with the same shape as \(W\).

Return type:

Variable

Example

import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF

b, c, h, w = 4, 64, 32, 32

# Spectrally normalized convolution (64 output maps and a 3x3 kernel are illustrative)
apply_w = lambda w: PF.spectral_norm(w, dim=0)
x = nn.Variable.from_numpy_array(np.random.randn(b, c, h, w))
y = PF.convolution(x, 64, (3, 3), with_bias=False, apply_w=apply_w)

# Spectrally normalized affine (64 output units are illustrative)
apply_w = lambda w: PF.spectral_norm(w, dim=1)
x = nn.Variable.from_numpy_array(np.random.randn(b, c))
y = PF.affine(x, 64, with_bias=False, apply_w=apply_w)

# Spectrally normalized embed (c input ids, 64 features are illustrative)
apply_w = lambda w: PF.spectral_norm(w, dim=1)
x = nn.Variable.from_numpy_array(np.random.randint(0, c, size=(b, 10)))
y = PF.embed(x, c, 64, apply_w=apply_w)
Parameters to be registered

The following variables are registered in a parameter scope "spectral-norm";

  • W_sn (need_grad=False) : Spectral Normalized Weight matrix. (shape: w.shape)
  • u (need_grad=False) : singular vector. (shape: (w.shape[dim], ))

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = spectral_norm(<args>)
nnabla.parametric_functions.multi_head_attention(query, key, value, num_heads=12, dropout=0.0, rng=None, with_bias=True, add_attn_bias=False, additive_mask=None, key_padding_mask=None, fix_parameters=False, param_init=None, name=None)[source]

MultiHeadAttention.

Computes multi-headed attention with query, key, and value. We use the following notations to describe the inputs and outputs below. \(L_T\): target sequence length, \(L_S\): source sequence length, \(B\): batch size, \(E\): embedding dimension.

References

A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>

Example:

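import nnabla as nn
import nnabla.parametric_functions as PF

# Illustrative sizes; embed_dim must be divisible by num_heads (12 by default).
tgt_len, src_len, batch_size = 5, 7, 2
embed_dim = kdim = vdim = 48
q = nn.Variable((tgt_len, batch_size, embed_dim))
k = nn.Variable((src_len, batch_size, kdim))
v = nn.Variable((src_len, batch_size, vdim))

out, w = PF.multi_head_attention(q, k, v)
out.forward()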
Parameters:
  • query (Variable) – Input N-D array with shape \((L_T, B, E)\).
  • key (Variable) – Input N-D array with shape \((L_S, B, E_k)\).
  • value (Variable) – Input N-D array with shape \((L_S, B, E_v)\).
  • num_heads (int, optional) – Number of attention heads. Note that the embedding dimension E must be divisible by the number of heads. Default is 12, which is conventional.
  • dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.
  • rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
  • with_bias (bool, optional) – Specify whether to include the bias parameters. Default is True.
  • add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
  • additive_mask (Variable, optional) – Input N-D array with shape \((L_T, L_S)\). Values will be added to the attention layer to prevent attention to certain positions.
  • key_padding_mask (Variable, optional) – Input N-D array with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
  • param_init (dict, optional) – Parameter initializers can be set with a dict. Possible keys of the dict include q_weight, k_weight, v_weight, q_bias, k_bias, v_bias, out_weight, out_bias, attn_bias_k, attn_bias_v. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'q_bias': ConstantInitializer(0)}.
Returns:

Output \(y\) with shape \((L_T, B, E)\), and output \(h_n\) with shape \((B, L_T, L_S)\).

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "multi_head_attention";

  • q_weight (need_grad=True) : weights for query. (shape: (E, E))
  • k_weight (need_grad=True) : weights for key. (shape: (E_k, E))
  • v_weight (need_grad=True) : weights for value. (shape: (E_v, E))
  • out_weight (need_grad=True) : weights for out projection. (shape: (E, E))
  • q_bias (need_grad=True) : bias for query. (shape: (E, ))
  • k_bias (need_grad=True) : bias for key. (shape: (E, ))
  • v_bias (need_grad=True) : bias for value. (shape: (E, ))
  • out_bias (need_grad=True) : bias for out projection. (shape: (E, ))
  • attn_bias_k (need_grad=True) : attention bias for k. (shape: (E, 1))
  • attn_bias_v (need_grad=True) : attention bias for v. (shape: (E, 1))
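
The shapes in this list can be tabulated with a small helper, sketched below. The function `mha_parameter_shapes` is hypothetical, and the sketch assumes that the bias entries are only registered when with_bias is True and the attention biases only when add_attn_bias is True.

```python
def mha_parameter_shapes(E, E_k, E_v, num_heads=12, with_bias=True, add_attn_bias=False):
    """Return {parameter name: shape} following the list of registered parameters."""
    if E % num_heads != 0:
        raise ValueError("embedding dimension E must be divisible by num_heads")
    shapes = {
        "q_weight": (E, E),
        "k_weight": (E_k, E),
        "v_weight": (E_v, E),
        "out_weight": (E, E),
    }
    if with_bias:  # assumption: biases exist only when with_bias is True
        for name in ("q_bias", "k_bias", "v_bias", "out_bias"):
            shapes[name] = (E,)
    if add_attn_bias:  # assumption: registered only when add_attn_bias is True
        shapes["attn_bias_k"] = (E, 1)
        shapes["attn_bias_v"] = (E, 1)
    return shapes


for name, shape in mha_parameter_shapes(48, 32, 16).items():
    print(name, shape)
```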

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = multi_head_attention(<args>)
class nnabla.parametric_functions.transformer[source]

Transformer.

We use the following notations to describe the inputs and outputs below. \(L_T\): target sequence length, \(L_S\): source sequence length, \(B\): batch size, \(E\): embedding dimension.

References

A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>

Examples:

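import nnabla as nn
import nnabla.parametric_functions as PF

src_len, tgt_len, batch_size, embed_dim = 7, 5, 2, 512  # illustrative sizes
src = nn.Variable((src_len, batch_size, embed_dim), need_grad=True)
tgt = nn.Variable((tgt_len, batch_size, embed_dim), need_grad=True)
out = PF.transformer(src, tgt, num_heads=16, num_encoder_layers=12)
out.forward()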
Parameters:
  • src (Variable) – Input source sequence to the encoder with shape \((L_S, B, E)\).
  • tgt (Variable) – Input target sequence to the decoder with shape \((L_T, B, E)\).
  • embed_dim (int, optional) – Embedding dimension to be used. Default is 512.
  • num_heads (int, optional) – Number of attention heads. Default is 12.
  • num_encoder_layers (int, optional) – Number of encoder layers to stack. Default is 6.
  • num_decoder_layers (int, optional) – Number of decoder layers to stack. Default is 6.
  • dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.
  • dropout (float, optional) – Dropout ratio applied. Default is 0.1.
  • activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.
  • src_additive_mask (Variable, optional) – Additive mask for the src sequence (optional). \((L_S, L_S)\).
  • tgt_additive_mask (Variable, optional) – Additive mask for the tgt sequence (optional). \((L_T, L_T)\).
  • memory_additive_mask (Variable, optional) – Additive mask for the encoder output (optional). \((L_T, L_S)\).
  • src_key_padding_mask (Variable, optional) – Key padding mask for src keys per batch (optional). \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • tgt_key_padding_mask (Variable, optional) – Key padding mask for tgt keys per batch (optional). \((B, L_T)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • memory_key_padding_mask (Variable, optional) – Key padding mask for memory keys per batch (optional). \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
  • add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
  • fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
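
Since the additive masks are added to the attention values, a common way to hide future target positions is a mask that is 0 where attention is allowed and a large negative number elsewhere. The sketch below builds such a \((L_T, L_T)\) mask as a nested list. The helper name `causal_additive_mask` and the constant -1e9 (a finite stand-in for negative infinity) are assumptions made for illustration.

```python
def causal_additive_mask(length, blocked=-1e9):
    """Return a (length, length) nested list for use as an additive mask.

    Row t holds the values added when target position t attends to each
    position: 0.0 for positions <= t, `blocked` for later positions.
    """
    return [
        [0.0 if col <= row else blocked for col in range(length)]
        for row in range(length)
    ]


for row in causal_additive_mask(4):
    print(row)
```

Such a nested list could be converted to a NumPy array and wrapped with nn.Variable.from_numpy_array before being passed as tgt_additive_mask.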
Returns:

Output \(y\) with shape \((L_T, B, E)\)

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "transformer";

  • encoder{layer#} (need_grad=True) : parameters for the n’th encoder layer. (shape: Refer to transformer_encode for details)
  • decoder{layer#} (need_grad=True) : parameters for the n’th decoder layer. (shape: Refer to transformer_decode for details)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = transformer(<args>)
class nnabla.parametric_functions.transformer_encode[source]

Transformer Encoder.

Parameters:
  • src (Variable) – Input sequence to the encoder layer with shape \((L_S, B, E)\).
  • embed_dim (int) – Number of embedding dimension.
  • num_heads (int) – Number of attention heads.
  • dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.
  • dropout (float, optional) – Dropout ratio. Default is 0.1.
  • activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.
  • src_additive_mask (Variable, optional) – Additive mask for the source sequence with shape \((L_S, L_S)\)
  • src_key_padding_mask (Variable, optional) – Padding mask for the source sequence with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
  • add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
  • fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
Returns:

Output \(y\) with shape \((L_S, B, E)\)

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "transformer_encode";

  • src_self_attn (need_grad=True) : self-attention parameters for source sequence. (shape: Refer to multi_head_attention for details)
  • enc_affine1 (need_grad=True) : first affine used in encoder. (shape: Refer to affine for details)
  • enc_affine2 (need_grad=True) : second affine used in encoder. (shape: Refer to affine for details)
  • enc_layer_norm1 (need_grad=True) : first layer normalization used in encoder. (shape: Refer to layer_normalization for details)
  • enc_layer_norm2 (need_grad=True) : second layer normalization used in encoder. (shape: Refer to layer_normalization for details)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = transformer_encode(<args>)
class nnabla.parametric_functions.transformer_decode[source]

Transformer Decoder.

Parameters:
  • tgt (Variable) – Input sequence to the decoder layer with shape \((L_T, B, E)\).
  • memory (Variable) – Output sequence from the last layer of the encoder with shape \((L_S, B, E)\).
  • embed_dim (int) – Number of embedding dimension.
  • num_heads (int) – Number of attention heads.
  • dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.
  • dropout (float, optional) – Dropout ratio. Default is 0.1.
  • activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.
  • tgt_additive_mask (Variable, optional) – Additive mask for the target sequence with shape \((L_T, L_T)\).
  • memory_additive_mask (Variable, optional) – Additive mask for the memory sequence with shape \((L_T, L_S)\).
  • tgt_key_padding_mask (Variable, optional) – Padding mask for the target sequence with shape \((B, L_T)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • memory_key_padding_mask (Variable, optional) – Padding mask for the memory sequence with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
  • rng (numpy.random.RandomState) – Random generator for Initializer. Default is None.
  • add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
  • fix_parameters (bool) – When set to True, the weights and biases will not be updated. Default is False.
Returns:

Output \(y\) with shape \((L_T, B, E)\)

Return type:

Variable

Parameters to be registered

The following variables are registered in a parameter scope "transformer_decode";

  • tgt_self_attn (need_grad=True) : self-attention parameters for target sequence. (shape: Refer to multi_head_attention for details)
  • tgt_memory_attn (need_grad=True) : attention parameters for target sequence with output from encoder as key. (shape: Refer to multi_head_attention for details)
  • dec_affine1 (need_grad=True) : first affine used in decoder. (shape: Refer to affine for details)
  • dec_affine2 (need_grad=True) : second affine used in decoder. (shape: Refer to affine for details)
  • dec_layer_norm1 (need_grad=True) : first layer normalization used in decoder. (shape: Refer to layer_normalization for details)
  • dec_layer_norm2 (need_grad=True) : second layer normalization used in decoder. (shape: Refer to layer_normalization for details)
  • dec_layer_norm3 (need_grad=True) : third layer normalization used in decoder. (shape: Refer to layer_normalization for details)

Note

If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.

with parameter_scope(name):
    output = transformer_decode(<args>)

Parameter Initializer

Some of the parametric functions optionally take one of the parameter initializers listed below.

class nnabla.initializer.BaseInitializer[source]

Base class of the parameter initializer.

__call__(shape)[source]

Generates an array with an initializer.

Parameters: shape (tuple of int) – Shape of the numpy.ndarray to be created.
Returns: Array.
Return type: numpy.ndarray

Note

Subclasses of BaseInitializer must override this method.

class nnabla.initializer.ConstantInitializer(value=0)[source]

Bases: nnabla.initializer.BaseInitializer

Generates a constant valued array.

Parameters: value (float) – A constant value.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.ConstantInitializer(0.1)
b = I.ConstantInitializer() # this generates constant valued array of default value 0
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
class nnabla.initializer.NormalInitializer(sigma=1.0, rng=None)[source]

Bases: nnabla.initializer.BaseInitializer

Generates a random array from a specified normal distribution.

\[\mathbf x \sim {\cal N} (\mathbf 0 | \sigma^2 \mathbf I)\]
Parameters:
  • sigma (float) – \(\sigma\).
  • rng (numpy.random.RandomState) – Random number generator.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.NormalInitializer(5e-5)
b = I.NormalInitializer(0.0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
class nnabla.initializer.UniformInitializer(lim=(-1, 1), rng=None)[source]

Bases: nnabla.initializer.BaseInitializer

Generates a random array from a specified uniform distribution.

\[\mathbf x \sim {\cal U} (a, b)\]
Parameters:
  • lim (tuple of float) – A tuple of two floats, \((a, b)\).
  • rng (numpy.random.RandomState) – Random number generator.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.UniformInitializer() # this generates uniform distribution within the default range of (-1,1)
b = I.UniformInitializer((-0.5,0.5))
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
nnabla.initializer.calc_normal_std_he_forward(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the standard deviation proposed by He et al.

\[\sigma = \sqrt{\frac{2}{NK}}\]
Parameters:
  • inmaps (int) – Map size of an input Variable, \(N\).
  • outmaps (int) – Map size of an output Variable, \(M\).
  • kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, \(K\) is the product of the shape dimensions. For Affine, the default value should be used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_he_forward(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

nnabla.initializer.calc_normal_std_he_backward(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the standard deviation of He et al. (backward case).

\[\sigma = \sqrt{\frac{2}{MK}}\]
Parameters:
  • inmaps (int) – Map size of an input Variable, \(N\).
  • outmaps (int) – Map size of an output Variable, \(M\).
  • kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, \(K\) is the product of the shape dimensions. For Affine, the default value should be used.
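
The forward and backward variants differ only in which map count enters the denominator: \(N\) (inmaps) for the forward case and \(M\) (outmaps) for the backward case. The following stdlib-only sketch evaluates both formulas. `he_std` is a hypothetical helper written from the equations in this section, not the library function itself.

```python
import math


def he_std(inmaps, outmaps, kernel=(1, 1), mode="forward"):
    """sigma = sqrt(2 / (N K)) for 'forward', sqrt(2 / (M K)) for 'backward'."""
    k = 1
    for size in kernel:  # K is the product of the kernel's spatial sizes
        k *= size
    fan = inmaps if mode == "forward" else outmaps
    return math.sqrt(2.0 / (fan * k))


# A 3x3 convolution from 1 input map to 64 output maps.
print(he_std(1, 64, kernel=(3, 3), mode="forward"))
print(he_std(1, 64, kernel=(3, 3), mode="backward"))
```

For a 3x3 convolution, passing kernel=(3, 3) makes \(K = 9\); leaving the default (1, 1) gives \(K = 1\) and therefore a larger standard deviation.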

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_he_backward(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

nnabla.initializer.calc_normal_std_glorot(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the standard deviation proposed by Glorot et al.

Note

The definition has been updated as follows since v1.3. This may affect the behavior of existing scripts that rely on the default initialization.

\[\sigma = \sqrt{\frac{2}{K(N + M)}}\]
Parameters:
  • inmaps (int) – Map size of an input Variable, \(N\).
  • outmaps (int) – Map size of an output Variable, \(M\).
  • kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, \(K\) is the product of the shape dimensions. For Affine, the default value should be used.

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_glorot(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')

nnabla.initializer.calc_uniform_lim_glorot(inmaps, outmaps, kernel=(1, 1))[source]

Calculates the lower bound and the upper bound of the uniform distribution proposed by Glorot et al.

Note

The definition has been updated as follows since v1.3. This may affect the behavior of existing scripts that rely on the default initialization.

\[\begin{split}b &= \sqrt{\frac{6}{K(N + M)}}\\ a &= -b\end{split}\]
Parameters:
  • inmaps (int) – Map size of an input Variable, \(N\).
  • outmaps (int) – Map size of an output Variable, \(M\).
  • kernel (tuple of int) – Convolution kernel spatial shape. In the above definition, \(K\) is the product of the shape dimensions. For Affine, the default value should be used.
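
The Glorot standard deviation and the Glorot uniform limit are related: a uniform distribution on \((-b, b)\) has variance \(b^2 / 3\), so \(b = \sqrt{3}\,\sigma\) gives both initializers the same variance. The stdlib-only sketch below checks this numerically. `glorot_std` and `glorot_uniform_lim` are hypothetical helpers written from the two formulas in this section.

```python
import math


def _kernel_size(kernel):
    k = 1
    for size in kernel:  # K is the product of the kernel's spatial sizes
        k *= size
    return k


def glorot_std(inmaps, outmaps, kernel=(1, 1)):
    """sigma = sqrt(2 / (K (N + M)))."""
    return math.sqrt(2.0 / (_kernel_size(kernel) * (inmaps + outmaps)))


def glorot_uniform_lim(inmaps, outmaps, kernel=(1, 1)):
    """(a, b) with b = sqrt(6 / (K (N + M))) and a = -b."""
    b = math.sqrt(6.0 / (_kernel_size(kernel) * (inmaps + outmaps)))
    return -b, b


sigma = glorot_std(1, 64, kernel=(3, 3))
lb, ub = glorot_uniform_lim(1, 64, kernel=(3, 3))
print(sigma, lb, ub)
print(math.isclose(ub, math.sqrt(3.0) * sigma))
```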

Example:

import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
lb,ub= I.calc_uniform_lim_glorot(x.shape[1],64)
w = I.UniformInitializer((lb,ub))
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
