Parametric Functions
In NNabla, trainable models are created by composing functions that have optimizable parameters.
These functions are called parametric functions.
Parametric functions are provided by nnabla.parametric_functions.
Parameter Management API
The parameters registered by the functions in List of Parametric Functions can be managed using the APIs listed in this section.
- nnabla.parameter.parameter_scope(name, scope=None)
Groups parameters registered by the parametric functions listed in nnabla.parametric_functions.
- Parameters
name (str) – Parameter scope name.
scope (OrderedDict, optional) – Specify the current parameter scope as a local dictionary. The default value is None. In this case, the globally maintained current parameter scope is used.
Example:
    import nnabla as nn
    import nnabla.parametric_functions as PF
    import nnabla.functions as F

    # x is assumed to be an existing input nn.Variable (e.g. an image batch).
    with nn.parameter_scope('conv1'):
        conv_out1 = PF.convolution(x, 32, (5, 5))
        bn_out1 = PF.batch_normalization(conv_out1)
        act_out1 = F.relu(bn_out1)
    with nn.parameter_scope('conv2'):
        conv_out2 = PF.convolution(act_out1, 64, (3, 3))
        bn_out2 = PF.batch_normalization(conv_out2)
        act_out2 = F.relu(bn_out2)
Nesting with statement blocks allows you to nest parameter scopes. This can also be done by using “/” inside the parameter scope names.
Example:
    with nn.parameter_scope('network1'):
        with nn.parameter_scope('conv1'):
            conv_out1 = PF.convolution(x, 32, (5, 5))
            bn_out1 = PF.batch_normalization(conv_out1)
            act_out1 = F.relu(bn_out1)
        with nn.parameter_scope('conv2'):
            conv_out2 = PF.convolution(act_out1, 64, (3, 3))
            bn_out2 = PF.batch_normalization(conv_out2)
            act_out2 = F.relu(bn_out2)
is equivalent to
    with nn.parameter_scope('network1/conv1'):
        conv_out1 = PF.convolution(x, 32, (5, 5))
        bn_out1 = PF.batch_normalization(conv_out1)
        act_out1 = F.relu(bn_out1)
    with nn.parameter_scope('network1/conv2'):
        conv_out2 = PF.convolution(act_out1, 64, (3, 3))
        bn_out2 = PF.batch_normalization(conv_out2)
        act_out2 = F.relu(bn_out2)
- nnabla.parameter.get_parameters(params=None, path='', grad_only=True)
Get parameter Variables under the current parameter scope.
- Parameters
- Returns
- Return type
- nnabla.parameter.save_parameters(path, params=None, extension=None)
Save all parameters into a file with the specified format.
Currently hdf5 and protobuf formats are supported.
- nnabla.parameter.load_parameters(path, proto=None, needs_proto=False, extension='.nntxt')
Load parameters from a file with the specified format.
- Parameters
path – path or file object
- nnabla.parameter.get_parameter_or_create(name, shape=None, initializer=None, need_grad=True, as_need_grad=None)
Returns an existing parameter variable in current parameter scope with the provided name.
If a variable with the provided name does not exist, a new variable is created and registered to the current parameter scope with the name, then returned.
- Parameters
name (str) – The name under the current scope. If it already exists, the name is queried from the parameter manager.
shape (tuple of int) – Shape of the created parameter. The shape of the specified parameter must match this shape. The default is None, which is only valid if initializer is given as a numpy.ndarray.
initializer (nnabla.initializer.BaseInitializer or numpy.ndarray) – An initialization function to be applied to the parameter. A numpy.ndarray can also be given to initialize parameters from numpy array data.
need_grad (bool) – Register the parameter with the specified need_grad flag. The default is True. If the flag is different from the previously specified one, the flag will be overwritten, but the values will be kept.
as_need_grad (bool) – Get a parameter variable with the specified need_grad flag. Note that this doesn’t overwrite the flag of the registered parameter variable with the provided name. Instead, if the given flag does not match the previously registered need_grad flag, it returns a new variable referring to the same array contents but with need_grad=as_need_grad.
Note
It returns a Variable which is unlinked from the registered one in the current parameter scope (using nnabla.Variable.get_unlinked_variable()). That means changing a need_grad attribute doesn’t affect the variable existing in the current parameter scope.
List of Parametric Functions
Parametric functions are provided by nnabla.parametric_functions, as listed below.
Like functions listed in Functions, they take Variable(s) as first argument(s) followed by options specific to a parametric function. In addition, they register parameter Variable(s) into the parameter scope.
The parameter variables are registered with need_grad properties specific to a parametric function. The variables with the need_grad=False flag will not be updated by gradient descent. Hence, backward computation is not executed for those variables. False is usually specified when the parameters are updated during the forward pass and/or backward pass, e.g., batch normalization.
All parametric functions take an optional argument fix_parameters=False. By giving True, the associated parameter variables are connected to a computation graph with a property need_grad=False regardless of the properties of the registered variables, so backward gradient computation is not executed for those variables. This is useful when you create a computation graph for evaluation purposes, fix parameters partially in a graph, and so on.
All parametric functions listed below are decorated with the following decorator.
- nnabla.parametric_functions.parametric_function_api(scope_name=None, param_desc=None)
Decorator for parametric functions.
The decorated function is always called under a parameter scope scope_name. Also, the decorator adds an additional argument name (str, default is None) at the end. If name is specified, the scope scope_name comes under a scope name. This feature could reduce vertical space usage of the source code. Any parametric function should be decorated by this.
- Parameters
scope_name (str, optional) – The original function will be called under a parameter scope named by scope_name.
param_desc (list, optional) – Descriptions of parameters will be automatically included into docstring. This must be a list of tuples with 4 elements composed of (name (str), description (str), shape info (str), need_grad (bool)).
- Returns
A decorated parametric function.
- Return type
function
See Parameter Management API to know how to query and manipulate registered variables.
Here is the list of parametric functions.
- nnabla.parametric_functions.affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)
The affine layer, also known as the fully connected layer. Computes
\[{\mathbf y} = {\mathbf A} {\mathbf x} + {\mathbf b}.\]
where \({\mathbf x}, {\mathbf y}\) are the inputs and outputs respectively, and \({\mathbf A}, {\mathbf b}\) are the weight matrix and bias vector, which are registered as parameters.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
apply_w (function) – Lambda, function, or callable object applied to the weights.
apply_b (function) – Lambda, function, or callable object applied to the bias.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "affine";
W (need_grad=True) : Weight matrix. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = affine(<args>)
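The flattening around base_axis and the registered parameter shapes can be illustrated with a plain NumPy sketch. This is a reference computation under the shapes stated above, not NNabla's implementation; affine_reference and the example sizes are hypothetical. Because W is stored as (inmaps, outmaps), the product is written as x W rather than A x.

```python
import numpy as np


def affine_reference(x, W, b, base_axis=1):
    """NumPy sketch of the affine layer with base_axis flattening."""
    sample_shape = x.shape[:base_axis]             # M_0 x ... x M_{B-1}
    n_samples = int(np.prod(sample_shape))
    x2d = x.reshape(n_samples, -1)                 # flatten D_B x ... x D_N
    y2d = x2d @ W + b                              # W: (inmaps, outmaps)
    return y2d.reshape(sample_shape + (W.shape[1],))


rng = np.random.RandomState(0)
x = rng.randn(4, 3, 8, 8)       # base_axis=1: 4 samples, 3*8*8 = 192 inputs
W = rng.randn(3 * 8 * 8, 10)    # (inmaps, outmaps)
b = np.zeros(10)                # (outmaps,)

y = affine_reference(x, W, b)
print(y.shape)
```

With base_axis=1 the leading dimension is kept as the sample dimension and everything after it is treated as one feature vector per sample.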
- nnabla.parametric_functions.convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)
N-D Convolution with a bias term.
For Dilated Convolution (a.k.a. Atrous Convolution), refer to:
Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. https://arxiv.org/abs/1606.00915
Yu et al., Multi-Scale Context Aggregation by Dilated Convolutions. https://arxiv.org/abs/1511.07122
Note
Convolution is a computationally intensive operation that should preferably be run with the cudnn backend. NNabla then uses CuDNN library functions to determine and cache the fastest algorithm for the given set of convolution parameters. This results in additional memory consumption, which may pose a problem for GPUs with insufficient memory. In that case, the NNABLA_CUDNN_WORKSPACE_LIMIT environment variable can be used to restrict the choice of algorithms to those that fit the given workspace memory limit, expressed in bytes. In some cases it may also be desired to restrict the automatic search to algorithms that produce deterministic (reproducible) results. This can be requested by setting the environment variable NNABLA_CUDNN_DETERMINISTIC to a non-zero value.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
channel_last (bool) – If True, the last dimension is considered as channel dimension, a.k.a. NHWC order.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
apply_w (function) – Lambda, function, or callable object applied to the weights.
apply_b (function) – Lambda, function, or callable object applied to the bias.
- Returns
N-D array. See convolution for the output shape.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "conv";
W (need_grad=True) : Filter weights. (shape: (outmaps, inmaps // group, *kernel))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = convolution(<args>)
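The registered weight shape and the expected output shape can be worked out with a small helper. This is a sketch under assumptions: conv_shapes is a hypothetical name, the input is channel-first (N, C, *spatial), None for pad, stride, and dilation is taken to mean 0, 1, and 1, and the output size uses the usual convolution formula, for which this page defers to convolution.

```python
def conv_shapes(in_shape, outmaps, kernel, pad=None, stride=None,
                dilation=None, group=1):
    """Return (weight_shape, output_shape) for a channel-first input."""
    n_dims = len(kernel)
    pad = pad or (0,) * n_dims            # assumed default: no padding
    stride = stride or (1,) * n_dims      # assumed default: unit stride
    dilation = dilation or (1,) * n_dims  # assumed default: no dilation
    inmaps = in_shape[1]

    # Shape of W as listed under "Parameters to be registered".
    weight_shape = (outmaps, inmaps // group) + tuple(kernel)

    # Usual output-size formula, applied per spatial dimension.
    spatial = tuple(
        (i + 2 * p - d * (k - 1) - 1) // s + 1
        for i, k, p, s, d in zip(in_shape[2:], kernel, pad, stride, dilation)
    )
    return weight_shape, (in_shape[0], outmaps) + spatial


w_shape, out_shape = conv_shapes((8, 3, 32, 32), 32, (5, 5))
gw_shape, gout_shape = conv_shapes((8, 16, 32, 32), 32, (3, 3),
                                   pad=(1, 1), group=4)
print(w_shape, out_shape)
print(gw_shape, gout_shape)
```

The grouped case shows how group shrinks the second dimension of W: each filter only sees inmaps // group input channels.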
- nnabla.parametric_functions.depthwise_convolution(inp, kernel, pad=None, stride=None, dilation=None, multiplier=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)
N-D Depthwise Convolution with a bias term.
Reference:
Chollet, Francois. “Xception: Deep Learning with Depthwise Separable Convolutions.” https://arxiv.org/abs/1610.02357
- Parameters
inp (Variable) – N-D array.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
multiplier (int) – Number of output feature maps per input feature map.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
N-D array. See depthwise_convolution for the output shape.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "depthwise_conv";
W (need_grad=True) : Filter weights. (shape: (inmaps * multiplier, *kernel))
b (need_grad=True) : Bias vector. (shape: (inmaps * multiplier,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = depthwise_convolution(<args>)
- nnabla.parametric_functions.deconvolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, output_padding=None, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)
Deconvolution layer.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of deconvolution kernels (which is equal to the number of output channels). For example, to apply deconvolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply deconvolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels sparser by grouping connections along map direction.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
apply_w (function) – Lambda, function, or callable object applied to the weights.
apply_b (function) – Lambda, function, or callable object applied to the bias.
- Returns
N-D array. See deconvolution for the output shape.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "deconv";
W (need_grad=True) : Filter weights. (shape: (inmaps, outmaps // group, *kernel))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = deconvolution(<args>)
- nnabla.parametric_functions.depthwise_deconvolution(inp, kernel, pad=None, stride=None, dilation=None, divisor=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)
Depthwise deconvolution computes the transposed depthwise convolution for one-dimensional and two-dimensional input data.
- Parameters
inp (Variable) – N-D array.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
divisor (int) – Number of input feature maps per output feature map.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
N-D array. See depthwise_deconvolution for the output shape.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "depthwise_deconv";
W (need_grad=True) : Filter weights. (shape: (inmaps,) + kernel)
b (need_grad=True) : Bias vector. (shape: (inmaps / divisor,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = depthwise_deconvolution(<args>)
- nnabla.parametric_functions.deformable_convolution(inp, outmaps, kernel, offset, mask=None, pad=None, stride=None, dilation=None, group=1, deformable_group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, apply_w=None, apply_b=None, name=None)
2D Deformable Convolution with a bias term. If a mask is given, this function is Deformable Convolution v2.
Dai et al., Deformable Convolutional Networks. https://arxiv.org/abs/1703.06211
Zhu et al., Deformable ConvNets v2: More Deformable, Better Results. https://arxiv.org/abs/1811.11168
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
offset (Variable) – Offsets for deformable convolutions. Shape is fixed to \((N, deformable\_group \times 2 \times Kh \times Kw, H, W)\). Offsets must be calculated externally through a separate convolution layer.
mask (Variable) – Normalized mask for deformable convolutions v2. Shape is fixed to \((N, deformable\_group \times 1 \times Kh \times Kw, H, W)\). Masks must be calculated externally together with the offsets through a separate convolution layer.
group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
deformable_group (int) – Number of deformable groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
channel_last (bool) – If True, the last dimension is considered as channel dimension, a.k.a. NHWC order.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
apply_w (function) – Lambda, function, or callable object applied to the weights.
apply_b (function) – Lambda, function, or callable object applied to the bias.
- Returns
N-D array. See convolution for the output shape.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "deformable_conv";
W (need_grad=True) : Filter weights. (shape: (outmaps, inmaps // group, *kernel))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = deformable_convolution(<args>)
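Since the offset and mask inputs must be produced by a separate layer with exactly the shapes given in the parameter list, a small helper can spell out the channel counts. The function name deformable_aux_shapes is hypothetical, and H and W are simply taken as given.

```python
def deformable_aux_shapes(batch, height, width, kernel, deformable_group=1):
    """Return the (offset_shape, mask_shape) required for a 2D kernel."""
    kh, kw = kernel
    # Two offset values (one per spatial direction) for each kernel position.
    offset_shape = (batch, deformable_group * 2 * kh * kw, height, width)
    # One modulation value for each kernel position (v2 only).
    mask_shape = (batch, deformable_group * 1 * kh * kw, height, width)
    return offset_shape, mask_shape


offset_shape, mask_shape = deformable_aux_shapes(8, 32, 32, (3, 3))
print(offset_shape, mask_shape)
```

For a 3 by 3 kernel with one deformable group, the separate layer that predicts offsets therefore needs 18 output channels, and the one that predicts masks needs 9.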
- nnabla.parametric_functions.batch_normalization(inp, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)
Batch normalization layer.
\[\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i\\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2\\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon }}\\ y_i &= & \hat{x}_i \gamma + \beta. \end{array}\end{split}\]
where \(x_i, y_i\) are the inputs and outputs, respectively. In testing, the moving averages of the mean and variance calculated during training are used.
- Parameters
inp (Variable) – N-D array of input.
axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the remaining axes. For example, if the input has 4 dimensions and axes is [1], the batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using numpy expression as an example).
decay_rate (float) – Decay rate of running mean and variance.
eps (float) – Tiny value to avoid zero division by std.
batch_stat (bool) – Use mini-batch statistics rather than running ones.
output_stat (bool) – Output batch mean and variance.
fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'beta', 'gamma', 'mean' or 'var'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'beta': ConstantInitializer(0), 'gamma': np.ones(gamma_shape) * 2}.
- Returns
N-D array.
- Return type
References
Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. https://arxiv.org/abs/1502.03167
The parameters have the same number of dimensions as the input data. Along the axes listed in axes their size matches that of the input, while along all other axes the size is 1. If an input is 4-dim and axes=[1], the parameter shape will be param_shape = np.mean(inp.d, axis=(0, 2, 3), keepdims=True).shape (using numpy expression as an example).
- Parameters to be registered
The following variables are registered in a parameter scope "bn";
beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)
var (need_grad=False) : Moving average of batch variance. (shape: <see above>)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = batch_normalization(<args>)
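The training-time formulas and the parameter shape rule can be checked with a plain NumPy sketch. This is a reference computation following the equations on this page, not NNabla's implementation; batch_norm_train and the example input are hypothetical.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, axes=(1,), eps=1e-5):
    """Training-time batch normalization using mini-batch statistics."""
    # Statistics are taken over every axis that is NOT listed in `axes`.
    reduce_axes = tuple(i for i in range(x.ndim) if i not in axes)
    mu = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)   # (1/M) * sum (x_i - mu)^2
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat * gamma + beta, mu, var


rng = np.random.RandomState(0)
x = rng.randn(8, 3, 4, 4) * 2.0 + 5.0

# Parameter shape rule: input size along `axes`, 1 everywhere else.
param_shape = np.mean(x, axis=(0, 2, 3), keepdims=True).shape
gamma = np.ones(param_shape)
beta = np.zeros(param_shape)

y, mu, var = batch_norm_train(x, gamma, beta)
print(param_shape)
```

With gamma set to ones and beta to zeros, each channel of y has approximately zero mean and unit variance over the batch and spatial axes.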
- nnabla.parametric_functions.fused_batch_normalization(inp, z=None, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, nonlinearity='relu', output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)
Batch normalization layer fused with the subsequent add2 operation on a residual input and a nonlinear activation.
- Parameters
inp (Variable) – N-D array of input.
z (Variable, optional) – A residual input. By specifying None, the activation function will follow immediately after BN operation.
axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the remaining axes. For example, if the input has 4 dimensions and axes is [1], the batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using numpy expression as an example).
decay_rate (float) – Decay rate of running mean and variance.
eps (float) – Tiny value to avoid zero division by std.
batch_stat (bool) – Use mini-batch statistics rather than running ones.
nonlinearity (string) – Activation function. The default is ‘relu’.
output_stat (bool) – Output batch mean and variance.
fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "bn";
beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)
var (need_grad=False) : Moving average of batch variance. (shape: <see above>)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = fused_batch_normalization(<args>)
- nnabla.parametric_functions.sync_batch_normalization(inp, comm, group='world', axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)
Synchronized batch normalization layer.
For some tasks (e.g., semantic segmentation), the per-process batch size can be too small for the BatchNormalization layer to work well. The SyncBatchNormalization layer solves this problem by synchronizing batch stats (mean and var) between multiple processes.
\[\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i\\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2\\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon }}\\ y_i &= & \hat{x}_i \gamma + \beta. \end{array}\end{split}\]
where \(x_i, y_i\) are the inputs and outputs, respectively.
- Parameters
inp (Variable) – N-D array of input.
comm (Communicator) – The communicator
group (string) – The name of the communicator group
axes (tuple of int) – Mean and variance for each element in axes are calculated using elements on the remaining axes. For example, if the input has 4 dimensions and axes is [1], the batch mean is calculated as np.mean(inp.d, axis=(0, 2, 3), keepdims=True) (using numpy expression as an example).
decay_rate (float) – Decay rate of running mean and variance.
eps (float) – Tiny value to avoid zero division by std.
batch_stat (bool) – Use mini-batch statistics rather than running ones.
output_stat (bool) – Output batch mean and variance.
fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'beta', 'gamma', 'mean' or 'var'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'beta': ConstantInitializer(0), 'gamma': np.ones(gamma_shape) * 2}.
- Returns
N-D array.
- Return type
References
Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, https://arxiv.org/abs/1502.03167
Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, Amit Agrawal, Context Encoding for Semantic Segmentation, https://arxiv.org/abs/1803.08904
Implementing Synchronized Multi-GPU Batch Normalization https://hangzhang.org/PyTorch-Encoding/notes/syncbn.html
The parameters have the same number of dimensions as the input data. Along the axes listed in axes their size matches that of the input, while along all other axes the size is 1. If an input is 4-dim and axes=[1], the parameter shape will be param_shape = np.mean(inp.d, axis=(0, 2, 3), keepdims=True).shape (using numpy expression as an example).
- Parameters to be registered
The following variables are registered in a parameter scope "bn";
beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
mean (need_grad=False) : Moving average of batch mean. (shape: <see above>)
var (need_grad=False) : Moving average of batch variance. (shape: <see above>)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = sync_batch_normalization(<args>)
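The idea of synchronizing batch statistics can be illustrated without a communicator by splitting one batch into shards that stand in for separate processes. This NumPy sketch is only an illustration of the principle under the assumption of equally sized shards; it is not how NNabla's communicator-based implementation is written.

```python
import numpy as np

rng = np.random.RandomState(0)
full_batch = rng.randn(8, 3, 4, 4)

# Four simulated processes, each holding two samples.
shards = np.split(full_batch, 4, axis=0)
reduce_axes = (0, 2, 3)  # corresponds to axes=[1]

# Each process computes its local first and second moments ...
local_mean = [s.mean(axis=reduce_axes, keepdims=True) for s in shards]
local_sq_mean = [(s ** 2).mean(axis=reduce_axes, keepdims=True)
                 for s in shards]

# ... which are then averaged across processes (an all-reduce in practice).
mu = np.mean(local_mean, axis=0)
var = np.mean(local_sq_mean, axis=0) - mu ** 2

print(mu.shape, var.shape)
```

The synchronized statistics match those of the full batch, whereas each shard on its own would estimate them from only two samples.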
- nnabla.parametric_functions.mean_subtraction(inp, base_axis=1, update_running_mean=True, fix_parameters=False, name=None)
Mean subtraction layer.
It subtracts the mean of the elements of the input array, and normalizes it to \(0\). Preprocessing arrays with this function has the effect of improving accuracy in various tasks such as image classification.
At training time, this function is defined as
\[\begin{split}\begin{array}{lcl} \mu &=& \frac{1}{M} \sum x_i \\ y_i &=& x_i - \mu \end{array}\end{split}\]
At testing time, the mean values used are those that were computed during training by moving average.
Note
The backward performs an approximated differentiation that takes into account only the latest mini-batch.
- Parameters
inp (Variable) – N-D array of input.
base_axis (int) – Base axis of Mean Subtraction operation. Dimensions up to base_axis are treated as sample dimensions.
update_running_mean (bool) – When set to True, the running mean will be updated.
fix_parameters (bool) – Dummy parameter. This argument does not affect anything.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "mean_subtraction";
mean (need_grad=False) : Moving average. (shape: inp.shape[base_axis:])
t (need_grad=False) : Minibatch counter used in forward pass. (shape: (1,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = mean_subtraction(<args>)
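The interplay between the batch mean, the stored moving average, and the counter t can be sketched in NumPy. The exact update rule is not spelled out on this page, so the cumulative average used here is an assumption, and mean_subtraction_train is a hypothetical name.

```python
import numpy as np


def mean_subtraction_train(x, running_mean, t, base_axis=1):
    """One training step: subtract the batch mean, update the running mean."""
    sample_axes = tuple(range(base_axis))
    mu = x.mean(axis=sample_axes)        # shape: x.shape[base_axis:]
    t = t + 1                            # minibatch counter
    # Assumed rule: cumulative average over all minibatches seen so far.
    running_mean = running_mean + (mu - running_mean) / t
    return x - mu, running_mean, t


rng = np.random.RandomState(0)
running_mean = np.zeros((3, 4, 4))       # inp.shape[base_axis:]
t = 0
batches = [rng.randn(8, 3, 4, 4) + 2.0 for _ in range(2)]

for batch in batches:
    y, running_mean, t = mean_subtraction_train(batch, running_mean, t)

# At testing time the stored running mean replaces the batch mean.
y_test = batches[0] - running_mean
print(running_mean.shape, t)
```

Under this assumed rule, the stored mean after two steps is the average of the two batch means.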
- nnabla.parametric_functions.layer_normalization(inp, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)
Applies Layer Normalization over an input variable, which is defined as:
\[\begin{split}\begin{eqnarray} \mu^l &=& \frac{1}{H} \sum_{i=1}^{H} x_i^l \\ \sigma^l &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^l - \mu^l\right)^2} \\ y &=& \frac{x - \mu^l}{\sigma^l + \epsilon} \gamma + \beta \end{eqnarray}\end{split}\]
where \(x\) and \(y\) are the input and output variables, \(\mu^l\) and \(\sigma^l\) are the mean and std of each layer along the batch axis, and \(\gamma\) and \(\beta\) are trainable parameters.
Note
Unlike other normalizations, which apply a scalar scale and bias for each entire channel/plane, Layer Normalization applies per-element scale and bias.
References
- Parameters
inp (Variable) – An input variable.
batch_axis (int or repeated int) – Axes mean and variance are taken.
eps (float) – Tiny value to avoid zero division by std.
output_stat (bool) – If True, the calculated mean and variance are also returned.
fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma' or 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.
- Returns
Normalized output variable.
Variable: Mean (if output_stat=True).
Variable: Std (if output_stat=True).
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "layer_normalization":
beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = layer_normalization(<args>)
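As a rough illustration of the layer normalization formula above, the following plain-Python sketch normalizes one sample given as a flat list of H features. It is not the library implementation; the per-element gamma and beta lists stand in for the trainable parameters.

```python
import math


def layer_normalization(x, gamma, beta, eps=1e-5):
    """Normalize one sample (a flat list of H features).

    gamma and beta are per-element lists with the same length as x.
    """
    h = len(x)
    mu = sum(x) / h
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / h)
    # y = (x - mu) / (sigma + eps) * gamma + beta, applied element-wise
    return [(v - mu) / (sigma + eps) * g + b for v, g, b in zip(x, gamma, beta)]


x = [1.0, 2.0, 3.0, 4.0]
y = layer_normalization(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

Note that eps is added to the std rather than to the variance, following the formula as written in this entry.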
- nnabla.parametric_functions.instance_normalization(inp, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]¶
Applies Instance Normalization over an input variable, which is defined as:
\[\begin{split}\begin{eqnarray} \mu^i &=& \frac{1}{H} \sum_{i=1}^{H} x_i^i \\ \sigma^i &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^i - \mu^i\right)^2} \\ y &=& \frac{x - \mu^i}{\sigma^i + \epsilon} \gamma + \beta \end{eqnarray}\end{split}\]
where \(x\) and \(y\) are the input and output variables, \(\mu^i\) and \(\sigma^i\) are the mean and std of each instance, calculated separately for each batch and channel, and \(\gamma\) and \(\beta\) are adaptive gains and biases.
If the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the shapes of the calculated mean and std are [B, C, 1, 1].
References
- Parameters
inp (Variable) – An input variable.
channel_axis (int or repeated int) – Channel axes.
batch_axis (int or repeated int) – Batch axes.
eps (float) – Tiny value to avoid zero division by std.
output_stat (bool) – If True, the batch statistics of mean and variance are also returned.
fix_parameters (bool) – If True, the beta and gamma will not be updated.
param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma' or 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.
Returns –
- Parameters to be registered
The following variables are registered in a parameter scope "instance_normalization":
beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = instance_normalization(<args>)
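The shape rule stated for instance normalization (an input of shape [B, C, H, W] yields statistics of shape [B, C, 1, 1]) can be written as a small helper. The function name is hypothetical and the sketch only computes shapes, not the statistics themselves.

```python
def instance_stat_shape(shape, channel_axis=1, batch_axis=0):
    """Shape of the per-instance mean/std for an input of the given shape.

    Batch and channel axes keep their size; every other axis is reduced to 1.
    Both axis arguments may be an int or a list of ints.
    """
    def as_list(axes):
        return [axes] if isinstance(axes, int) else list(axes)

    kept = set(as_list(channel_axis)) | set(as_list(batch_axis))
    return [d if i in kept else 1 for i, d in enumerate(shape)]


stat_shape = instance_stat_shape([8, 3, 32, 32])
```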
- nnabla.parametric_functions.group_normalization(inp, num_groups, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False, fix_parameters=False, param_init=None, no_scale=False, no_bias=False, name=None)[source]¶
Applies Group Normalization over an input tensor, which is defined as:
\[\begin{split}\begin{eqnarray} \mu^g &=& \frac{1}{H} \sum_{i=1}^{H} x_i^g \\ \sigma^g &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^g - \mu^g\right)^2} \\ y &=& \frac{x - \mu^g}{\sigma^g + \epsilon} \gamma + \beta \end{eqnarray}\end{split}\]
where \(x\) and \(y\) are the input and output variables, \(\mu^g\) and \(\sigma^g\) are the mean and std of each group, which contains num_channels / num_groups channels, and \(\gamma\) and \(\beta\) are adaptive gains and biases.
The input channels, specified by channel_axis, are separated into num_groups groups, and the mean and std are calculated over each group. For example, if the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), an input variable is first reshaped to [B, num_groups, C / num_groups, H, W] and standardized by its mean and std, whose shapes are [B, num_groups, 1, 1, 1]. Before returning, the output variable is reshaped back to the original input shape (= [B, C, H, W] in the case above).
References
- Parameters
inp (Variable) – An input variable.
num_groups (int) – The number of groups. The channel dimension of inp must be an integer multiple of num_groups.
channel_axis (int) – Channel axis.
batch_axis (int or repeated int) – Axes mean and variance are taken.
eps (float) – Tiny value to avoid zero division by std.
output_stat (bool) – If True, the batch statistics of mean and variance are also returned.
fix_parameters (bool) – When set to True, the beta and gamma will not be updated.
param_init (dict) – Parameter initializers can be set with a dict. A key of the dict must be 'gamma' or 'beta'. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'gamma': np.ones(...) * 2, 'beta': ConstantInitializer(0)}.
- Returns
Normalized output variable.
Variable: Mean (if output_stat=True).
Variable: Std (if output_stat=True).
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "group_normalization":
beta (need_grad=True) : Trainable bias \(\beta\). (shape: <see above>)
gamma (need_grad=True) : Trainable scaling factor \(\gamma\). (shape: <see above>)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = group_normalization(<args>)
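The grouping step of group normalization can be illustrated on a single sample in plain Python. This is a minimal sketch under simplifying assumptions: the sample is a list of C channels, each flattened to a list, the groups are taken as consecutive channels, and gamma and beta are left out.

```python
import math


def group_normalize(sample, num_groups, eps=1e-5):
    """Standardize one sample given as a list of C channels (each a flat list).

    Channels are split into num_groups consecutive groups; one mean and one
    std are computed per group. Scale and bias (gamma, beta) are omitted.
    """
    c = len(sample)
    if c % num_groups != 0:
        raise ValueError("channel count must be a multiple of num_groups")
    per_group = c // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mu = sum(values) / len(values)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
        out.extend([[(v - mu) / (sigma + eps) for v in channel] for channel in group])
    return out


sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]  # C=4, H*W=2
normalized = group_normalize(sample, num_groups=2)
```

The output keeps the channel layout of the input, mirroring the reshape back to the original shape described above.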
- nnabla.parametric_functions.rnn(x, h, w0_init=None, w_init=None, b_init=None, num_layers=1, nonlinearity='tanh', dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]¶
N-Step RNN (recurrent neural networks).
The N-Step RNN function applies an Elman RNN with a nonlinearity to the input sequence. It is defined as follows:
\[h_t = \tanh(w_{ih}x_t+b_{ih}+w_{hh}h_{(t-1)}).\]
We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
References
Jeffrey L. Elman. “Finding Structure in Time.” Cognitive Science. 1990.
- Parameters
x (Variable) – Input N-D array with shape \((T, B, I)\).
h (Variable) – Input N-D array with shape \((L, D, B, H)\).
w0_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight at the first layer. Shape is \((D, H, I + H)\).
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weights at the second layer and up. Shape is \((L-1, D, H, D*H + H)\).
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. Shape is \((L, D, H)\).
num_layers (int, optional) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. Default is 1.
nonlinearity (str, optional) – Type of nonlinearity applied to the input sequence. Must be either tanh or relu. Default is tanh.
dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.0.
bidirectional (bool, optional) – If True, bidirectional computation will be performed in each layer. Default is False.
training (bool, optional) – Backpropagation will be performed only when it is true. Default is True.
with_bias (bool, optional) – Specify whether to include the bias term.
- Returns
Output \(y\) with shape \((T, B, D * H)\)
Variable: Output \(h_n\) with shape \((L, D, B, H)\)
- Return type
Example
    x = nn.Variable((seq_len, batch_size, input_size))
    h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
    y, hn = PF.rnn(x, h)
- Parameters to be registered
The following variables are registered in a parameter scope "rnn":
weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, H, I + H))
weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, H, D * H + H))
bias (need_grad=True) : Biases. (shape: (L, D, H))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = rnn(<args>)
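The parameter shapes listed for rnn follow directly from the sizes \(I\), \(H\), \(L\) and \(D\). The helper below (a hypothetical name, not part of NNabla) simply evaluates those shape expressions, which can be handy when preparing numpy arrays for w0_init, w_init and b_init.

```python
def rnn_parameter_shapes(input_size, hidden_size, num_layers=1, bidirectional=False):
    """Return the documented shapes of weight_l0, weight and bias."""
    d = 2 if bidirectional else 1          # D: number of directions
    i, h, l = input_size, hidden_size, num_layers
    return {
        "weight_l0": (d, h, i + h),
        "weight": (l - 1, d, h, d * h + h),
        "bias": (l, d, h),
    }


shapes = rnn_parameter_shapes(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)
```

With num_layers=1 the leading dimension of "weight" evaluates to 0, which matches the remark that only the first-layer weights are used in that case.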
- nnabla.parametric_functions.lstm(x, h, c, w0_init=None, w_init=None, b_init=None, num_layers=1, dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]¶
LSTM (long short-term memory).
Long Short-Term Memory, or LSTM, is a building block for recurrent neural network (RNN) layers. An LSTM unit consists of a cell and input, output, and forget gates, whose functions are defined as follows:
\[\begin{split}f_t&=\sigma(W_fx_t+U_fh_{t-1}+b_f) \\ i_t&=\sigma(W_ix_t+U_ih_{t-1}+b_i) \\ o_t&=\sigma(W_ox_t+U_oh_{t-1}+b_o) \\ c_t&=f_t\odot c_{t-1}+i_t\odot\tanh(W_cx_t+U_ch_{t-1}+b_c) \\ h_t&=o_t\odot\tanh(c_t).\end{split}\]
We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
References
S. Hochreiter, and J. Schmidhuber. “Long Short-Term Memory.” Neural Computation. 1997.
- Parameters
x (Variable) – Input N-D array with shape \((T, B, I)\).
h (Variable) – Input N-D array with shape \((L, D, B, H)\).
c (Variable) – Input N-D array with shape \((L, D, B, H)\) .
w0_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight at the first layer. Shape is \((D, 4, H, I + H)\).
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weights at the second layer and up. Shape is \((L-1, D, 4, H, D * H + H)\).
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. Shape is \((L, D, 4, H)\).
num_layers (int, optional) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. Default is 1.
dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.0.
bidirectional (bool, optional) – If True, bidirectional computation will be performed in each layer. Default is False.
training (bool, optional) – Backpropagation will be performed only when it is true. Default is True.
with_bias (bool, optional) – Specify whether to include the bias term.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
- Returns
Output \(y\) with shape \((T, B, D * H)\)
Variable: Output \(h_n\) with shape \((L, D, B, H)\)
Variable: Output \(c_n\) with shape \((L, D, B, H)\)
- Return type
Example
    x = nn.Variable((seq_len, batch_size, input_size))
    h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
    c = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
    y, hn, cn = PF.lstm(x, h, c)
- Parameters to be registered
The following variables are registered in a parameter scope "lstm":
weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, 4, H, I + H))
weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, 4, H, D * H + H))
bias (need_grad=True) : Biases. (shape: (L, D, 4, H))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = lstm(<args>)
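To make the LSTM gate equations above concrete, here is a single time step written for scalar input and scalar hidden state. It is a sketch for reading the formulas, not how PF.lstm is implemented; the dict-based parameter layout (keys 'f', 'i', 'o', 'c') is an assumption made for readability.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for scalar input and scalar hidden state.

    W, U and b are dicts keyed by gate name: 'f', 'i', 'o', 'c'.
    """
    f = sigmoid(W["f"] * x + U["f"] * h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] * x + U["i"] * h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] * x + U["o"] * h_prev + b["o"])   # output gate
    c = f * c_prev + i * math.tanh(W["c"] * x + U["c"] * h_prev + b["c"])
    h = o * math.tanh(c)
    return h, c


zeros = {k: 0.0 for k in "fioc"}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, W=zeros, U=zeros, b=zeros)
```

With all parameters at zero every gate evaluates to 0.5, so the new cell state is half the previous one.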
- nnabla.parametric_functions.gru(x, h, w0_init=None, w_init=None, b_init=None, num_layers=1, dropout=0.0, bidirectional=False, training=True, rng=None, with_bias=True, fix_parameters=False, name=None)[source]¶
GRU (gated recurrent units).
GRU is defined as follows:
\[\begin{split}r_t&=\sigma(W_rx_t+U_rh_{t-1}+b_r) \\ z_t&=\sigma(W_zx_t+U_zh_{t-1}+b_z) \\ n_t&=\tanh(W_nx_t+b_{in}+r_t \odot (U_nh_{t-1}+b_{hn})) \\ h_t&=(1-z_t) \odot n_t+z_t \odot h_{t-1}.\end{split}\]
We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
References
K. Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” Empirical Methods in Natural Language Processing. 2014.
- Parameters
x (Variable) – Input N-D array with shape \((T, B, I)\).
h (Variable) – Input N-D array with shape \((L, D, B, H)\).
w0_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight at the first layer. Shape is \((D, 3, H, I + H)\).
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weights at the second layer and up. Shape is \((L-1, D, 3, H, D * H + H)\).
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. Shape is \((L, D, 4, H)\).
num_layers (int, optional) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. Default is 1.
dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.0.
bidirectional (bool, optional) – If True, bidirectional computation will be performed in each layer. Default is False.
training (bool, optional) – Backpropagation will be performed only when it is true. Default is True.
with_bias (bool, optional) – Specify whether to include the bias term.
- Returns
Output \(y\) with shape \((T, B, D * H)\)
Variable: Output \(h_n\) with shape \((L, D, B, H)\)
- Return type
Example
    x = nn.Variable((seq_len, batch_size, input_size))
    h = nn.Variable((num_layers, num_directions, batch_size, hidden_size))
    y, hn = PF.gru(x, h)
- Parameters to be registered
The following variables are registered in a parameter scope "gru":
weight_l0 (need_grad=True) : Filter weights at 0-th layer. (shape: (D, 3, H, I + H))
weight (need_grad=True) : Filter weights at 1-st layer and above. (shape: (L-1, D, 3, H, D * H + H))
bias (need_grad=True) : Biases. (shape: (L, D, 4, H))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = gru(<args>)
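A single GRU time step for scalar input and hidden state may help when reading the gate equations for gru. The sketch below is illustrative only; the dict keys are hypothetical names chosen to mirror the subscripts in the formulas. The reset gate multiplies the hidden-state term inside the candidate activation, and there are four bias terms (b_r, b_z, b_in, b_hn), which lines up with the size-4 axis in the documented bias shape.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, W, U, b):
    """One GRU time step for scalar input and hidden state.

    W and U are dicts keyed by 'r', 'z', 'n'; b is keyed by 'r', 'z', 'in', 'hn'.
    """
    r = sigmoid(W["r"] * x + U["r"] * h_prev + b["r"])   # reset gate
    z = sigmoid(W["z"] * x + U["z"] * h_prev + b["z"])   # update gate
    # Candidate state: the reset gate scales the recurrent contribution.
    n = math.tanh(W["n"] * x + b["in"] + r * (U["n"] * h_prev + b["hn"]))
    return (1.0 - z) * n + z * h_prev


W = {"r": 0.0, "z": 0.0, "n": 0.0}
b = {"r": 0.0, "z": 0.0, "in": 0.0, "hn": 0.0}
h = gru_step(x=1.0, h_prev=0.8, W=W, U=W, b=b)
```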
- nnabla.parametric_functions.embed(inp, n_inputs, n_features, initializer=None, fix_parameters=False, apply_w=None, name=None)[source]¶
Embed.
Embed slices a matrix/tensor with indexing array/tensor. Weights are initialized with nnabla.initializer.UniformInitializer within the range of \(-\sqrt{3}\) and \(\sqrt{3}\).
- Parameters
inp (Variable) – [Integer] Indices with shape \((I_0, ..., I_N)\)
n_inputs – Number of possible inputs, words or vocabularies
n_features – Number of embedding features
fix_parameters (bool) – When set to True, the embedding weight matrix will not be updated.
apply_w (function) – Lambda, function, or callable object applied to the weights.
- Returns
Output with shape \((I_0, ..., I_N, W_1, ..., W_M)\)
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "embed":
W (need_grad=True) : Embedding matrix. (shape: (n_inputs, n_features))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = embed(<args>)
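Conceptually, embed is a row lookup into the embedding matrix W: each integer index is replaced by the corresponding row, so the output shape is the index shape followed by n_features. A plain-Python sketch of that lookup, using nested lists instead of Variables, might look like this:

```python
def embed(indices, weights):
    """Look up rows of `weights` for a (possibly nested) list of integer indices."""
    if isinstance(indices, int):
        return list(weights[indices])
    return [embed(item, weights) for item in indices]


W = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]   # n_inputs=3, n_features=2
out = embed([[0, 2], [1, 1]], W)            # indices with shape (2, 2)
```

Here the result has shape (2, 2, 2): the (2, 2) index shape with one 2-element feature vector per index.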
- nnabla.parametric_functions.prelu(inp, base_axis=1, shared=True, fix_parameters=False, slope_init=None, name=None)[source]¶
Parametrized Rectified Linear Unit function defined as
\[y_i = \max(0, x_i) + w_i \min(0, x_i)\]where negative slope \(w\) is learned and can vary across channels (an axis specified with base_axis). Weights are initialized with \(-1\).
- Parameters
inp (Variable) – N-D array as input
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
shared (bool) – Use shared weight value or not
fix_parameters (bool) – When set to True, the negative slope values will not be updated.
slope_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer of negative slopes. By default, they are initialized with 0.25.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "prelu":
slope (need_grad=True) : Negative slope. (shape: tuple() if shared else (inp.shape[base_axis],))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = prelu(<args>)
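The PReLU formula can be checked on a flat list of values. In this sketch the slope is either one shared float or a list with one slope per element; the real function varies the slope per channel along base_axis, so the per-element list is a simplification.

```python
def prelu(x, w):
    """y_i = max(0, x_i) + w_i * min(0, x_i) for a flat list x.

    w is either a single float (shared slope) or a list with one slope per element.
    """
    slopes = [w] * len(x) if isinstance(w, (int, float)) else w
    return [max(0.0, v) + s * min(0.0, v) for v, s in zip(x, slopes)]


shared = prelu([-2.0, 0.0, 3.0], 0.25)
per_element = prelu([-2.0, -2.0], [0.5, 0.1])
```

Positive inputs pass through unchanged, while negative inputs are scaled by the slope.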
- nnabla.parametric_functions.svd_affine(inp, n_outmaps, r, base_axis=1, uv_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]¶
SVD affine is a low rank approximation of the affine layer. It can be seen as two consecutive affine layers with a bottleneck. It computes:
\[{\mathbf y} = {\mathbf U} {\mathbf V} {\mathbf x} + {\mathbf b}.\]where \({\mathbf x}, {\mathbf y}\) are the inputs and outputs respectively, and \({\mathbf U}, {\mathbf V}, {\mathbf b}\) are constants.
The weights \({\mathbf U}\) and \({\mathbf V}\) are approximated with singular value decomposition (SVD) of the original weight matrix \({\mathbf W}\) and by selecting the \({R}\) dominant singular values and the corresponding singular vectors. Therefore the low rank \({R}\) is the size of the bottleneck.
If uv_init is a numpy array, \({\mathbf U}\) and \({\mathbf V}\) are computed such that uv_init is approximated by \({\mathbf{UV}}\). If uv_init is None or an initializer, the product of \({\mathbf U}\) and \({\mathbf V}\) approximates the random initialization.
If \({\mathbf U}\) and \({\mathbf V}\) exist in the context, they take precedence over uv_init.
Suppose the weight of the affine is of \({I \times O}\) and the compression rate you want to specify is \({CR}\), then you set \({R}\) as
\[R = \left\lfloor \frac{(1 - CR)OI}{O + I} \right\rfloor.\]
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple) – Number of output neurons per data.
r (int) – rank of the factorized layer (size of the bottleneck)
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
uv_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "svd_affine":
U (need_grad=True) : \({\mathbf U}\). (shape: (inmaps, r))
V (need_grad=True) : \({\mathbf V}\). (shape: (r, outmaps))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = svd_affine(<args>)
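The rank formula for svd_affine is easy to evaluate ahead of time. The helpers below (hypothetical names) compute \(R\) from a target compression rate and count the resulting weights in U and V, so the saving can be checked against the \(I \times O\) weights of the uncompressed layer.

```python
import math


def svd_affine_rank(n_in, n_out, compression_rate):
    """R = floor((1 - CR) * O * I / (O + I))."""
    return math.floor((1.0 - compression_rate) * n_out * n_in / (n_out + n_in))


def svd_affine_param_count(n_in, n_out, r):
    """Weights in U (n_in x r) plus V (r x n_out); the bias is not counted."""
    return r * (n_in + n_out)


r = svd_affine_rank(n_in=1024, n_out=512, compression_rate=0.75)
```

For these sizes the factorized layer keeps at most a quarter of the original 1024 x 512 weights, as requested by CR = 0.75.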
- nnabla.parametric_functions.svd_convolution(inp, outmaps, kernel, r, pad=None, stride=None, dilation=None, uv_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]¶
SVD convolution is a low rank approximation of the convolution layer. It can be seen as a depthwise convolution followed by a 1x1 convolution.
The flattened kernels for the i-th input map are expressed by their low rank approximation. The kernels for the i-th input \({\mathbf W_i}\) are approximated with the singular value decomposition (SVD) and by selecting the \({R}\) dominant singular values and the corresponding singular vectors.
\[{\mathbf W_{:,i,:}} \approx {\mathbf U_i} {\mathbf V_i}.\]
\({\mathbf U}\) contains the weights of the depthwise convolution with multiplier \({R}\) and \({\mathbf V}\) contains the weights of the 1x1 convolution.
If uv_init is a numpy array, \({\mathbf U}\) and \({\mathbf V}\) are computed such that uv_init is approximated by \({\mathbf{UV}}\). If uv_init is None or an initializer, the product of \({\mathbf U}\) and \({\mathbf V}\) approximates the random initialization.
If \({\mathbf U}\) and \({\mathbf V}\) exist in the context, they take precedence over uv_init.
Suppose the kernel tensor of the convolution is of \({O \times I \times K \times K}\) and the compression rate you want to specify is \({CR}\), then you set \({R}\) as
\[R = \left\lfloor \frac{(1 - CR)OIK^2}{I(O + K^2)} \right\rfloor.\]
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3, 5).
r (int) – Rank of the factorized layer.
uv_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "svd_conv":
U (need_grad=True) : Decomposed filter weights \({\mathbf U}\). (shape: (inmaps * r, *kernel))
V (need_grad=True) : Decomposed filter weights \({\mathbf V}\). (shape: (outmaps, inmaps * r, 1, ...))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = svd_convolution(<args>)
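For svd_convolution, the rank can be derived from the compression rate in the same way. The sketch below assumes a square \(K \times K\) kernel and uses hypothetical helper names; the parameter count follows the registered shapes of U and V listed above.

```python
import math


def svd_convolution_rank(outmaps, inmaps, kernel_size, compression_rate):
    """R = floor((1 - CR) * O * I * K^2 / (I * (O + K^2))) for a K x K kernel."""
    o, i, k2 = outmaps, inmaps, kernel_size ** 2
    return math.floor((1.0 - compression_rate) * o * i * k2 / (i * (o + k2)))


def svd_convolution_param_count(outmaps, inmaps, kernel_size, r):
    """Weights in U (inmaps * r, K, K) plus V (outmaps, inmaps * r, 1, 1)."""
    return inmaps * r * kernel_size ** 2 + outmaps * inmaps * r


r = svd_convolution_rank(outmaps=64, inmaps=32, kernel_size=3, compression_rate=0.5)
```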
- nnabla.parametric_functions.cpd3_convolution(inp, outmaps, kernel, r, pad=None, stride=None, dilation=None, oik_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, max_iter=500, stopping_criterion=1e-05, lambda_reg=0.0, name=None)[source]¶
CP convolution is a low rank approximation of a convolution layer. A 3D tensor containing the parameter is built by collapsing the N-D kernels into 1D, then the tensor is decomposed into three matrices. The decomposed layer can be seen as linear combinations of the input feature maps to \({R}\) feature maps followed by a depthwise convolution and followed by linear combinations of the feature maps to compute the output feature maps.
The CP decomposition approximates the kernel tensor by \({R}\) rank-1 tensors of the form:
\[\sum_{r=1}^{R} \lambda_r {\mathbf{o}^{(r)} \otimes \mathbf{i}^{(r)} \otimes \mathbf{k}^{(r)}},\]
where \({\lambda}_r\) is the normalization coefficient and \({\otimes}\) is the outer product.
If oik_init is a numpy array, the factors O, I and K are computed so that oik_init is approximated by their CP decomposition. If oik_init is None or an initializer, the decomposition approximates the randomly initialized array.
If O, I and K exist in context, they are used to initialize the layer and oik_init is not used.
Suppose the kernel tensor of the convolution is of \({O \times I \times K \times K}\) and the compression rate you want to specify is \({CR}\), then you set \({R}\) as
\[R = \left\lfloor \frac{(1 - CR)OIK^2}{O + I + K^2} \right\rfloor.\]
References
Lebedev, Vadim, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky, “Speeding-up convolutional neural networks using fine-tuned cp-decomposition.”, arXiv preprint arXiv:1412.6553 (2014).
Marcella Astrid, Seung-Ik Lee, “CP-decomposition with Tensor Power Method for Convolutional Neural Networks Compression”, BigComp 2017.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
r (int) – Rank of the factorized layer.
oik_init (numpy array or nnabla.initializer.BaseInitializer) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. It is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
max_iter (int) – Max iteration of the ALS.
stopping_criterion (float) – Threshold for stopping the ALS. If the value is negative, the convergence check is ignored; in other words, it may reduce the computation time.
lambda_reg (float) – regularization parameter for the ALS. Larger lambda_reg means larger regularization.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "cpd3_conv":
I (need_grad=True) : Decomposed filter weights \({\mathbf I}\). (shape: (r, inmaps, 1, ...))
K (need_grad=True) : Decomposed filter weights \({\mathbf K}\). (shape: (r, *kernel))
O (need_grad=True) : Decomposed filter weights \({\mathbf O}\). (shape: (outmaps, r, 1, ...))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = cpd3_convolution(<args>)
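The rank formula for cpd3_convolution can be evaluated the same way as for the SVD variants. This sketch assumes a square \(K \times K\) kernel; the helper names are hypothetical, and the parameter count is taken from the registered shapes of I, K and O.

```python
import math


def cpd3_convolution_rank(outmaps, inmaps, kernel_size, compression_rate):
    """R = floor((1 - CR) * O * I * K^2 / (O + I + K^2)) for a K x K kernel."""
    o, i, k2 = outmaps, inmaps, kernel_size ** 2
    return math.floor((1.0 - compression_rate) * o * i * k2 / (o + i + k2))


def cpd3_convolution_param_count(outmaps, inmaps, kernel_size, r):
    """Weights in I (r, inmaps), K (r, K, K) and O (outmaps, r)."""
    return r * (inmaps + kernel_size ** 2 + outmaps)


r = cpd3_convolution_rank(outmaps=64, inmaps=32, kernel_size=3, compression_rate=0.5)
```

Because the three factors share the rank, the CP form admits a much larger \(R\) than the SVD form for the same compression rate.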
- nnabla.parametric_functions.binary_connect_affine(inp, n_outmaps, base_axis=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)[source]¶
Binary Connect Affine, multiplier-less inner-product.
Binary Connect Affine is an affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_j = \sum_{i} sign(w_{i,j}) x_i.\]
Therefore \(sign(w_{i,j})\) is either \(1\) or \(-1\) and the inner product simplifies to addition.
This function should be used together with Batch Normalization.
References
M. Courbariaux, Y. Bengio, and J.-P. David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” Advances in Neural Information Processing Systems. 2015.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).
2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.
3) Quantized values are stored as floating point number for binary_weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
quantize_zero_to (float) – Input value at zero is quantized to this value.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
wb_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for binary weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "bicon_affine":
W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))
Wb (need_grad=False) : Binarized weights. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
    with parameter_scope(name):
        output = binary_connect_affine(<args>)
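The forward computation of Binary Connect Affine can be sketched with nested lists. This is a simplified illustration rather than the library code: the bias is omitted, the weight matrix is laid out as (inmaps, outmaps) to match the registered shape of W, and the handling of exact zeros follows the description of quantize_zero_to.

```python
def binarize(w, quantize_zero_to=1.0):
    """sign(w), with exact zeros mapped to quantize_zero_to."""
    if w > 0:
        return 1.0
    if w < 0:
        return -1.0
    return quantize_zero_to


def binary_connect_affine(x, weights, quantize_zero_to=1.0):
    """x: list of I inputs; weights: I x O float matrix. Returns O outputs."""
    n_out = len(weights[0])
    y = [0.0] * n_out
    for xi, row in zip(x, weights):
        for j, w in enumerate(row):
            # With binarized weights of +1/-1 this product is only a sign flip of x_i.
            y[j] += binarize(w, quantize_zero_to) * xi
    return y


x = [1.0, 2.0, 3.0]
weights = [[0.3, -0.7], [-0.1, 0.0], [2.0, -5.0]]
y = binary_connect_affine(x, weights)
```

Only the signs of the float weights influence the output, which is why the float weights, not the binarized copies, are the ones to share and train.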
- nnabla.parametric_functions.binary_connect_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)[source]¶
Binary Connect Convolution, multiplier-less inner-product.
Binary Connect Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} sign(w_{n, m, i, j}) x_{m, a + i, b + j}.\]Therefore \(sign(w_i)\) is either \(1\) or \(-1\) and the inner product simplifies to addition.
This function should be used together with Batch Normalization.
References
M. Courbariaux, Y. Bengio, and J.-P. David. “BinaryConnect: Training Deep Neural Networks with binary weights during propagations.” Advances in Neural Information Processing Systems. 2015.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).
2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.
3) Quantized values are stored as floating point number for binary_weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels sparser by grouping connections along map direction.
quantize_zero_to (float) – Input value at zero is quantized to this value.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
wb_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for binary weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "bicon_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))
Wb (need_grad=False) : Binarized filter weights. (shape: (outmaps, inmaps, *kernel))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = binary_connect_convolution(<args>)
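The input-output relation of binary_connect_convolution given above can be illustrated with a small pure-Python sketch. It is not the library's implementation: the helper names binarize and binary_connect_conv2d are hypothetical, the sketch assumes a single sample, no padding, stride 1 and no bias, and it assumes that quantize_zero_to only affects weights that are exactly zero.

```python
def binarize(w, quantize_zero_to=1.0):
    """Map a float weight to +1 or -1; an exact zero maps to quantize_zero_to."""
    if w > 0:
        return 1.0
    if w < 0:
        return -1.0
    return quantize_zero_to


def binary_connect_conv2d(x, w, quantize_zero_to=1.0):
    """y[n][a][b] = sum over m, i, j of sign(w[n][m][i][j]) * x[m][a+i][b+j].

    x: nested list with shape (inmaps, H, W)
    w: nested list with shape (outmaps, inmaps, kH, kW)
    """
    inmaps, height, width = len(x), len(x[0]), len(x[0][0])
    k_h, k_w = len(w[0][0]), len(w[0][0][0])
    out = []
    for w_n in w:  # one output map per filter
        plane = []
        for a in range(height - k_h + 1):
            row = []
            for b in range(width - k_w + 1):
                acc = 0.0
                for m in range(inmaps):
                    for i in range(k_h):
                        for j in range(k_w):
                            s = binarize(w_n[m][i][j], quantize_zero_to)
                            acc += s * x[m][a + i][b + j]  # s is +/-1: add or subtract
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]]   # 1 input map, 3x3
w = [[[[0.3, -0.7],
       [0.0, 2.5]]]]      # 1 output map, 1 input map, 2x2 kernel
y = binary_connect_conv2d(x, w)
```

Only the signs of the float weights enter the sum, which is why the float weights (W) and the binarized weights (Wb) are kept as separate parameters.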
- nnabla.parametric_functions.binary_weight_affine(inp, n_outmaps, base_axis=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)
Binary Weight Affine, multiplier-less inner-product with a scale factor.
Binary Weight Affine is the affine function, but the inner product in this function is the following,
\[y_j = \frac{1}{\|\mathbf{w}_j\|_{\ell_1}} \sum_{i} sign(w_{ji}) x_i\]
Therefore \(sign(w_{ji})\) is either \(1\) or \(-1\) and the inner product simplifies to addition followed by scaling by the factor \(\alpha = \frac{1}{\|\mathbf{w}_j\|_{\ell_1}}\). The number of scaling factors \(\alpha\) equals the number of outmaps of the affine function.
References
Rastegari, Mohammad, et al. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv preprint arXiv:1603.05279 (2016).
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).
2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.
3) Quantized values are stored as floating point numbers for binary_weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it was a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
quantize_zero_to (float) – Input value at zero is quantized to this value.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
wb_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the binary weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weight and bias will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "bwn_affine":
W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))
Wb (need_grad=False) : Binarized weights. (shape: (inmaps, outmaps))
alpha (need_grad=False) : Scaling factor \(\alpha\). (shape: (outmaps,))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = binary_weight_affine(<args>)
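As a worked illustration of the binary_weight_affine formula above, the following pure-Python sketch computes one scaling factor per output and applies it to the sign-only inner product. It mirrors the formula exactly as written in this section and is not taken from the library; the function name binary_weight_affine_sketch is hypothetical, the weights are indexed as w[j][i] to match the formula (not the registered (inmaps, outmaps) layout), and the bias is omitted.

```python
def binary_weight_affine_sketch(x, w, quantize_zero_to=1.0):
    """Return (y, alpha) for inputs x[i] and weight rows w[j][i].

    alpha[j] = 1 / ||w_j||_1 and y[j] = alpha[j] * sum_i sign(w[j][i]) * x[i],
    following the formula stated in this section. Every row of w must have a
    nonzero L1 norm.
    """
    def sign(v):
        if v > 0:
            return 1.0
        if v < 0:
            return -1.0
        return quantize_zero_to  # assumed handling of exact zeros

    y, alpha = [], []
    for row in w:
        l1_norm = sum(abs(v) for v in row)
        a_j = 1.0 / l1_norm
        # The inner product needs only additions and subtractions.
        acc = sum(sign(w_ji) * x_i for w_ji, x_i in zip(row, x))
        alpha.append(a_j)
        y.append(a_j * acc)  # a single multiplication per output
    return y, alpha


x = [1.0, 2.0, 3.0]
w = [[0.5, -1.5, 2.0],     # L1 norm 4.0
     [-0.25, 0.25, 0.5]]   # L1 norm 1.0
y, alpha = binary_weight_affine_sketch(x, w)
```

The list alpha has one entry per output, which matches the registered alpha parameter of shape (outmaps,).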
- nnabla.parametric_functions.binary_weight_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, quantize_zero_to=1.0, w_init=None, wb_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)
Binary Weight Convolution, multiplier-less inner-product with a scale factor.
Binary Weight Convolution is the convolution function, but the inner product in this function is the following,
\[y_{n, a, b} = \frac{1}{\|\mathbf{w}_n\|_{\ell_1}} \sum_{m} \sum_{i} \sum_{j} sign(w_{n, m, i, j}) x_{m, a + i, b + j}.\]
Therefore \(sign(w_{n, m, i, j})\) is either \(1\) or \(-1\) and the inner product simplifies to addition followed by scaling by the factor \(\alpha = \frac{1}{\|\mathbf{w}_n\|_{\ell_1}}\). The index \(n\) runs over the outmaps of the convolution function, so there is one \(\alpha\) per outmap.
References
Rastegari, Mohammad, et al. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv preprint arXiv:1603.05279 (2016).
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the binarized weights (binary_weight).
2) The weights and the binary weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the binary weights will not be in sync.
3) Quantized values are stored as floating point numbers for binary_weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels sparser by grouping connections along map direction.
quantize_zero_to (float) – Input value at zero is quantized to this value.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
wb_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for binary weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "bwn_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))
Wb (need_grad=False) : Binarized filter weights. (shape: (outmaps, inmaps, *kernel))
alpha (need_grad=False) : Scaling factor \(\alpha\). (shape: (outmaps,))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = binary_weight_convolution(<args>)
- nnabla.parametric_functions.inq_affine(inp, n_outmaps, base_axis=1, num_bits=4, inq_iterations=(), selection_algorithm='random', seed=-1, w_init=None, i_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, name=None)
Incremental Network Quantization Affine Layer
During training, the weights are sequentially quantized to power-of-two values, which allows the training of a multiplierless network.
Using inq_iterations, one can specify after how many forward passes half of the learnable weights are fixed and quantized to powers-of-two. After reaching the last value in inq_iterations, all weights are fixed.
For more details, please refer to the reference.
Reference: Zhou A, Yao A, Guo Y, Xu L, Chen Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. <https://arxiv.org/abs/1702.03044>
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it was a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
num_bits (int) – Number of bits per weight. Value has to be larger than 1 as one bit is already used to code the value “0”
inq_iterations (tuple of int) – Tuple of iteration numbers at which we fix half of the weights.
selection_algorithm (str) – Chooses algorithm that is used to decide which weights are fixed. (“largest_abs” … fix weights with largest absolute value, “random” … fix weights randomly)
seed (int) – Random seed for INQ algorithm
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
i_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for indicators (0 … learnable, 1 … fixed). By default, it is initialized with zeros.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weight and bias will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "inq_affine":
W (need_grad=True) : Weight matrix in floating type. (shape: (inmaps, outmaps))
I (need_grad=False) : Binary indicator matrix of fixed weights. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = inq_affine(<args>)
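The incremental scheme described for inq_affine (fix half of the learnable weights at each milestone, then fix all of them at the last one) can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the library's algorithm: quantize_pow2 and inq_step are hypothetical helpers, only the "largest_abs" selection is modelled, the exponent range implied by num_bits is ignored, and the retraining that happens between milestones is left out.

```python
import math


def quantize_pow2(w):
    """Round a weight to the nearest power of two in the log2 domain, keeping its sign."""
    if w == 0.0:
        return 0.0
    exponent = round(math.log2(abs(w)))
    return math.copysign(2.0 ** exponent, w)


def inq_step(weights, indicators, fix_all=False):
    """Fix and quantize half of the still-learnable weights, largest |w| first.

    indicators[k] is 0 for a learnable weight and 1 for a fixed one, like the
    registered I parameter. With fix_all=True every remaining weight is fixed.
    """
    learnable = [k for k, flag in enumerate(indicators) if flag == 0]
    learnable.sort(key=lambda k: abs(weights[k]), reverse=True)
    # Half of the learnable weights, rounded up (all of them when fix_all is set).
    n_fix = len(learnable) if fix_all else (len(learnable) + 1) // 2
    new_w, new_i = list(weights), list(indicators)
    for k in learnable[:n_fix]:
        new_w[k] = quantize_pow2(weights[k])
        new_i[k] = 1
    return new_w, new_i


w0, i0 = [0.9, -0.3, 0.05, 0.6], [0, 0, 0, 0]
w1, i1 = inq_step(w0, i0)                 # first milestone
w2, i2 = inq_step(w1, i1)                 # second milestone
w3, i3 = inq_step(w2, i2, fix_all=True)   # last milestone: everything fixed
```

After the last step every entry of the indicator list is 1 and every weight is a signed power of two, which is what makes the resulting layer multiplierless.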
- nnabla.parametric_functions.inq_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, num_bits=4, inq_iterations=(), selection_algorithm='random', seed=-1, w_init=None, i_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, name=None)
Incremental Network Quantization Convolution Layer
During training, the weights are sequentially quantized to power-of-two values, which allows the training of a multiplierless network.
Using inq_iterations, one can specify after how many forward passes half of the learnable weights are fixed and quantized to powers-of-two. After reaching the last value in inq_iterations, all weights are fixed.
For more details, please refer to the reference.
Reference: Zhou A, Yao A, Guo Y, Xu L, Chen Y. Incremental network quantization: Towards lossless CNNs with low-precision weights. <https://arxiv.org/abs/1702.03044>
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
num_bits (int) – Number of bits per weight. Value has to be larger than 1 as one bit is already used to code the value “0”.
inq_iterations (tuple of int) – Tuple of iteration numbers at which we fix half of the weights.
selection_algorithm (str) – Chooses algorithm that is used to decide which weights are fixed. (“largest_abs” … fix weights with largest absolute value, “random” … fix weights randomly)
seed (int) – Random seed for INQ algorithm
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
i_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the indicators (0 … learnable, 1 … fixed). By default, it is initialized with zeros.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weight and bias will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "inq_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps, *kernel))
I (need_grad=False) : Binary indicator matrix of fixed weights. (shape: (outmaps, inmaps, *kernel))
b (need_grad=True) : Bias vector. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = inq_convolution(<args>)
- nnabla.parametric_functions.fixed_point_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, n_w=8, delta_w=0.0625, ste_fine_grained_w=True, quantize_b=True, sign_b=True, n_b=8, delta_b=0.0625, ste_fine_grained_b=True, name=None)
Fixed-Point Quantized Affine.
Fixed-Point Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_j = \sum_{i} Q(w_{ji}) x_i,\]
where \(Q(w_{ji})\) is the fixed-point quantization function.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the quantized weights will not be in sync.
3) CPU and GPU implementations now use float values for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
n_w (int) – Bit width used for weight.
delta_w (float) – Step size for weight.
n_b (int) – Bit width used for bias.
delta_b (float) – Step size for bias.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "fp_quantized_affine":
W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = fixed_point_quantized_affine(<args>)
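To make the roles of n_w and delta_w in fixed_point_quantized_affine more concrete, here is a minimal pure-Python sketch of a fixed-point quantizer Q and the resulting affine map. The rounding and saturation rules are assumptions of this sketch (round to the nearest multiple of the step size, one bit reserved for the sign when sign=True, symmetric saturation); the helper names are hypothetical and the library's quantizer may differ in detail.

```python
import math


def fixed_point_quantize(x, sign=True, n=8, delta=0.0625):
    """Quantize x to a multiple of delta, saturating at the representable range.

    Assumption of this sketch: with sign=True one of the n bits stores the
    sign, so the largest magnitude is (2**(n - 1) - 1) * delta.
    """
    magnitude_bits = n - 1 if sign else n
    q_max = (2 ** magnitude_bits - 1) * delta
    q_min = -q_max if sign else 0.0
    if x > q_max:
        return q_max
    if x < q_min:
        return q_min
    # Round |x| to the nearest multiple of delta, halves away from zero.
    steps = math.floor(abs(x) / delta + 0.5)
    return math.copysign(steps * delta, x)


def quantized_affine(x, w, n_w=8, delta_w=0.0625):
    """y[j] = sum_i Q(w[j][i]) * x[i], with weight rows w[j][i] and no bias."""
    return [sum(fixed_point_quantize(w_ji, n=n_w, delta=delta_w) * x_i
                for w_ji, x_i in zip(row, x))
            for row in w]


q_small = fixed_point_quantize(0.3)    # snapped to the nearest multiple of 0.0625
q_large = fixed_point_quantize(10.0)   # saturates at the largest representable value
y = quantized_affine([1.0, 2.0], [[0.3, -0.3]])
```

With the default n=8 and delta=0.0625, the value 0.3 becomes 0.3125 (five steps) and anything above 7.9375 saturates, so a smaller step size trades range for resolution at a fixed bit width.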
- nnabla.parametric_functions.fixed_point_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, n_w=8, delta_w=0.0625, ste_fine_grained_w=True, quantize_b=True, sign_b=True, n_b=8, delta_b=0.0625, ste_fine_grained_b=True, name=None)
Fixed-Point Quantized Convolution.
Fixed-Point Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]
where \(Q(w_{n, m, i, j})\) is the fixed-point quantization function.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the quantized weights will not be in sync.
3) CPU and GPU implementations now use float values for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
n_w (int) – Bit width used for weight.
delta_w (float) – Step size for weight.
n_b (int) – Bit width used for bias.
delta_b (float) – Step size for bias.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "fp_quantized_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = fixed_point_quantized_convolution(<args>)
- nnabla.parametric_functions.min_max_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, ql_min_w=0, ql_max_w=255, w_min_max=False, qr_min_w_init=None, qr_max_w_init=None, ste_fine_grained_w=True, quantize_b=True, ql_min_b=0, ql_max_b=255, b_min_max=False, qr_min_b_init=None, qr_max_b_init=None, ste_fine_grained_b=True, eps=0.01, name=None)
Min-max Quantized Affine.
Min-max Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_j = \sum_{i} Q(w_{ji}) x_i,\]
where \(Q(w_{ji})\) is the min-max quantization function.
In min_max_quantized_affine, the exponential moving average is not used. The min and max quantization ranges are either the min and max of the weights and bias, or they are trained.
Notice that the min and max values of inputs are always used instead of the exponential moving average.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the quantized weights will not be in sync.
3) CPU and GPU implementations now use float values for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
ql_min_w (int, float, or Variable) – Minimum quantization level for weights. Default is 0.
ql_max_w (int, float, or Variable) – Maximum quantization level for weights. Default is 255.
w_min_max (bool) – Use the min and max of weights to compute quantization ranges. Default is False.
qr_min_w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the minimum quantization range, qr_min. Default is nnabla.initializer.ConstantInitializer(-2.0).
qr_max_w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the maximum quantization range, qr_max. Default is nnabla.initializer.ConstantInitializer(2.0).
ste_fine_grained_w (bool) – If True, the straight-through estimator (STE) is not 1: the {0, 1}-mask computed from the min-max is applied to the gradient in the backward pass. Otherwise, STE is 1.
ql_min_b (int, float, or Variable) – Minimum quantization level for bias. Default is 0.
ql_max_b (int, float, or Variable) – Maximum quantization level for bias. Default is 255.
b_min_max (bool) – Use the min and max of bias to compute quantization ranges. Default is False.
qr_min_b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the minimum quantization range, qr_min. Default is nnabla.initializer.ConstantInitializer(-6.0).
qr_max_b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the maximum quantization range, qr_max. Default is nnabla.initializer.ConstantInitializer(6.0).
ste_fine_grained_b (bool) – If True, the straight-through estimator (STE) is not 1: the {0, 1}-mask computed from the min-max is applied to the gradient in the backward pass. Otherwise, STE is 1.
eps (float) – Small value (epsilon) used to ensure that \(qr_{max} - qr_{min}\) is greater than epsilon, for both weights and bias.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "min_max_quantized_affine":
W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
qr_min (need_grad=False) : Minimum quantization range. Minimum values of inputs or trainable range. (shape: ql_min.shape)
qr_max (need_grad=False) : Maximum quantization range. Maximum values of inputs or trainable range. (shape: ql_max.shape)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = min_max_quantized_affine(<args>)
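The interplay of the quantization range (qr_min, qr_max), the quantization levels (ql_min, ql_max) and eps described for min_max_quantized_affine can be sketched in pure Python as below. This is an illustrative uniform quantizer under stated assumptions, not the library's implementation: the function name min_max_quantize mirrors the concept only, the eps handling shown is one plausible way to keep the range from collapsing, and any further range adjustment the library may perform is not modelled.

```python
def min_max_quantize(x, qr_min=-2.0, qr_max=2.0, ql_min=0, ql_max=255, eps=0.01):
    """Clip x to [qr_min, qr_max] and snap it to one of the quantization levels."""
    # Keep the range from collapsing: enforce qr_max - qr_min >= eps (sketch assumption).
    if qr_max - qr_min < eps:
        qr_max = qr_min + eps
    scale = (qr_max - qr_min) / (ql_max - ql_min)   # width of one quantization step
    clipped = min(max(x, qr_min), qr_max)
    level = round((clipped - qr_min) / scale)       # integer offset from the lowest level
    return level * scale + qr_min


# Fixed range [-2, 2] with five levels, so one step is exactly 1.0.
coarse = [min_max_quantize(v, -2.0, 2.0, 0, 4) for v in (0.4, 0.6, 5.0, -7.0)]

# Analogue of w_min_max=True: take the range from the weights themselves.
w = [-1.0, 0.2, 3.0]
w_q = [min_max_quantize(v, min(w), max(w), 0, 4) for v in w]
```

In the first case, values outside the fixed range are clipped to qr_min or qr_max before rounding. In the second case the extreme weights are represented exactly, because they define the range.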
- nnabla.parametric_functions.min_max_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, ql_min_w=0, ql_max_w=255, w_min_max=False, qr_min_w_init=None, qr_max_w_init=None, ste_fine_grained_w=True, quantize_b=True, ql_min_b=0, ql_max_b=255, b_min_max=False, qr_min_b_init=None, qr_max_b_init=None, ste_fine_grained_b=True, eps=0.01, name=None)
Min-max Quantized Convolution.
Min-max Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]
where \(Q(w_{n, m, i, j})\) is the min-max quantization function.
In min_max_quantized_convolution, the exponential moving average is not used. The min and max quantization ranges are either the min and max of the weights and bias, or they are trained.
Notice that the min and max values of inputs are always used instead of the exponential moving average.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so, otherwise the float weights and the quantized weights will not be in sync.
3) CPU and GPU implementations now use float values for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
channel_last (bool) – If True, the last dimension is considered as channel dimension, a.k.a. NHWC order.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
ql_min_w (int, float, or Variable) – Minimum quantization level for weights. Default is 0.
ql_max_w (int, float, or Variable) – Maximum quantization level for weights. Default is 255.
w_min_max (bool) – Use the min and max of weights to compute quantization ranges. Default is False.
qr_min_w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the minimum quantization range, qr_min. Default is nnabla.initializer.ConstantInitializer(-2.0).
qr_max_w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the maximum quantization range, qr_max. Default is nnabla.initializer.ConstantInitializer(2.0).
ste_fine_grained_w (bool) – If True, the straight-through estimator (STE) is not 1: the {0, 1}-mask computed from the min-max is applied to the gradient in the backward pass. Otherwise, STE is 1.
ql_min_b (int, float, or Variable) – Minimum quantization level for bias. Default is 0.
ql_max_b (int, float, or Variable) – Maximum quantization level for bias. Default is 255.
b_min_max (bool) – Use the min and max of bias to compute quantization ranges. Default is False.
qr_min_b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the minimum quantization range, qr_min. Default is nnabla.initializer.ConstantInitializer(-6.0).
qr_max_b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the maximum quantization range, qr_max. Default is nnabla.initializer.ConstantInitializer(6.0).
ste_fine_grained_b (bool) – If True, the straight-through estimator (STE) is not 1: the {0, 1}-mask computed from the min-max is applied to the gradient in the backward pass. Otherwise, STE is 1.
eps (float) – Epsilon, or small value to ensure \(qr_{max} - qr_{min}\) must be greater than the epsilon for both weights and bias.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "min_max_quantized_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
qr_min (need_grad=False) : Minimum quantization range. Minimum values of inputs or trainable range. (shape: ql_min.shape)
qr_max (need_grad=False) : Maximum quantization range. Maximum values of inputs or trainable range. (shape: ql_max.shape)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = min_max_quantized_convolution(<args>)
- nnabla.parametric_functions.pow2_quantized_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, sign_w=True, with_zero_w=False, n_w=8, m_w=2, ste_fine_grained_w=True, quantize_b=True, sign_b=True, with_zero_b=False, n_b=8, m_b=2, ste_fine_grained_b=True, name=None)[source]¶
Pow2 Quantized Affine.
Pow2 Quantized Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_j = \sum_{i} Q(w_{ji}) x_i,\]where \(Q(w_{ji})\) is the power-of-2 quantization function.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so; otherwise the float weights and the quantized weights will not be in sync.
3) Quantized values are stored as floating point numbers for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
with_zero_w (bool) – Indicate using zero as a quantized value. Default is false.
n_w (int) – Bit width used for weight.
m_w (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for weights. Default is 2.
with_zero_b (bool) – Indicate using zero as a quantized value. Default is false.
n_b (int) – Bit width used for bias.
m_b (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for bias. Default is 2.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "pow2_quantized_affine":
W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = pow2_quantized_affine(<args>)
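As a rough illustration of the power-of-2 quantization function \(Q\) used by pow2_quantized_affine, the plain-Python sketch below rounds each weight to the nearest power of two in the log domain and clips the exponent at the upper bound \(m\). The lower exponent bound `m - (2**(n - 1) - 1)` (one of the `n` bits reserved for the sign), the handling of zero, and the helper names are assumptions made for illustration; the exact rule used by NNabla may differ.

```python
import math

def pow2_quantize_sketch(w, n=8, m=2):
    """Round w to a signed power of two (illustrative, not NNabla's code)."""
    # Assumption: one of the n bits stores the sign, the rest index exponents.
    max_exp = m
    min_exp = m - (2 ** (n - 1) - 1)
    if w == 0.0:
        # Assumption: without a zero level, zero maps to the smallest level.
        return 2.0 ** min_exp
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return math.copysign(2.0 ** exp, w)

def quantized_affine_sketch(x, W):
    """y_j = sum_i Q(w_ji) * x_i, with W given as a list of rows w_j."""
    return [sum(pow2_quantize_sketch(w_ji) * x_i for w_ji, x_i in zip(row, x))
            for row in W]

# 0.3 -> 0.25, -0.9 -> -1.0, 5.0 -> 4.0 (clipped at 2**m), 0.05 -> 0.0625
y = quantized_affine_sketch([1.0, 2.0], [[0.3, -0.9], [5.0, 0.05]])
```

The float weights stay untouched; only the values entering the inner product are quantized, which matches the note that the float and quantized weights are kept as separate parameters.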
- nnabla.parametric_functions.pow2_quantized_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, quantize_w=True, with_zero_w=False, sign_w=True, n_w=8, m_w=2, ste_fine_grained_w=True, quantize_b=True, with_zero_b=False, sign_b=True, n_b=8, m_b=2, ste_fine_grained_b=True, name=None)[source]¶
Pow2 Quantized Convolution.
Pow2 Quantized Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]where \(Q(w_{n, m, i, j})\) is the power-of-2 quantization function.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so; otherwise the float weights and the quantized weights will not be in sync.
3) Quantized values are stored as floating point numbers for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
n_w (int) – Bit width used for weight.
m_w (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for weights. Default is 2.
n_b (int) – Bit width used for bias.
m_b (int) – \(2^m\) is upper bound and \(-2^m\) is lower bound for bias. Default is 2.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "pow2_quantized_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = pow2_quantized_convolution(<args>)
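The input-output relation of pow2_quantized_convolution can be spelled out with explicit loops. The sketch below is a single-sample, stride-1, no-padding, single-group convolution in plain Python that applies a caller-supplied quantizer `Q` to every filter weight. The function names and the toy quantizer are illustrative assumptions, not part of NNabla.

```python
import math

def quantized_conv2d_sketch(x, w, Q):
    """y[n][a][b] = sum_{m,i,j} Q(w[n][m][i][j]) * x[m][a+i][b+j]."""
    in_maps, height, width = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(w[0][0]), len(w[0][0][0])
    out = []
    for w_n in w:  # one output map per filter
        rows = []
        for a in range(height - kh + 1):
            row = []
            for b in range(width - kw + 1):
                acc = 0.0
                for m in range(in_maps):
                    for i in range(kh):
                        for j in range(kw):
                            acc += Q(w_n[m][i][j]) * x[m][a + i][b + j]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out

def nearest_pow2(v):
    # Toy quantizer for nonzero v: nearest power of two in the log domain,
    # sign preserved. Exponent clipping is omitted for brevity.
    return math.copysign(2.0 ** round(math.log2(abs(v))), v)

x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]]          # 1 input map, 3 x 3
w = [[[[0.9, -0.3],
       [2.2, 0.6]]]]             # 1 filter, 1 input map, 2 x 2 kernel
y = quantized_conv2d_sketch(x, w, nearest_pow2)
```

Here the filter is quantized to `[[1.0, -0.25], [2.0, 0.5]]` before the sums are taken, so `y` has shape 1 x 2 x 2.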
- nnabla.parametric_functions.pruned_affine(inp, n_outmaps, base_axis=1, w_init=None, b_init=None, fix_parameters=False, rng=None, with_bias=True, prune_w=True, rate_w=0.9, prune_b=True, rate_b=0.9, name=None)[source]¶
Pruned Affine.
Pruned Affine is the affine function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_j = \sum_{i} Q(w_{ji}) x_i,\]
where \(Q(w_{ji})\) is the pruning function, i.e., F.prune.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so; otherwise the float weights and the quantized weights will not be in sync.
3) CPU and GPU implementations now use float values for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – Input N-D array with shape (\(M_0 \times \ldots \times M_{B-1} \times D_B \times \ldots \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
n_outmaps (int or tuple of int) – Number of output neurons per data.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
rate_w (float) – Pruning rate for weights.
rate_b (float) – Pruning rate for bias.
- Returns
\((B + 1)\)-D array. (\(M_0 \times \ldots \times M_{B-1} \times L\))
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "pruned_affine":
W (need_grad=True) : Weight matrix in float. (shape: (inmaps, outmaps))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (inmaps, outmaps))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = pruned_affine(<args>)
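To make the role of the pruning rate concrete, the following plain-Python sketch zeroes the smallest-magnitude entries of a weight list. The thresholding rule (the magnitude at sorted index `int((size - 1) * rate)`, with entries at or above it kept) and the name `prune_sketch` are assumptions for illustration; consult F.prune for the exact definition.

```python
def prune_sketch(values, rate=0.9):
    """Zero out the smallest-magnitude entries (illustrative thresholding)."""
    magnitudes = sorted(abs(v) for v in values)
    # Assumption: the threshold is the magnitude at index int((size - 1) * rate).
    threshold = magnitudes[int((len(values) - 1) * rate)]
    return [v if abs(v) >= threshold else 0.0 for v in values]

w_row = [0.05, -0.8, 0.3, -0.02, 0.6, 0.1, -0.4, 0.01, 0.2, -0.9]
w_pruned = prune_sketch(w_row, rate=0.9)
```

With ten weights and `rate=0.9`, only the two largest-magnitude entries (-0.8 and -0.9) survive; the rest become 0.0.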
- nnabla.parametric_functions.pruned_convolution(inp, outmaps, kernel, pad=None, stride=None, dilation=None, group=1, channel_last=False, w_init=None, b_init=None, base_axis=1, fix_parameters=False, rng=None, with_bias=True, prune_w=True, rate_w=0.9, prune_b=True, rate_b=0.9, name=None)[source]¶
Pruned Convolution.
Pruned Convolution is the convolution function, except the definition of the inner product is modified. The input-output relation of this function is as follows:
\[y_{n, a, b} = \sum_{m} \sum_{i} \sum_{j} Q(w_{n, m, i, j}) x_{m, a + i, b + j},\]
where \(Q(w_{n, m, i, j})\) is the pruning function, i.e., F.prune.
Note
1) If you would like to share weights between some layers, please make sure to share the standard, floating value weights (weight) and not the quantized weights (quantized weight).
2) The weights and the quantized weights become synced only after forward() is called, and not after a call to backward(). To access the parameters of the network, remember to call forward() once before doing so; otherwise the float weights and the quantized weights will not be in sync.
3) CPU and GPU implementations now use float values for quantized weight, since this function is only for simulation purposes.
- Parameters
inp (Variable) – N-D array.
outmaps (int) – Number of convolution kernels (which is equal to the number of output channels). For example, to apply convolution on an input with 16 types of filters, specify 16.
kernel (tuple of int) – Convolution kernel size. For example, to apply convolution on an image with a 3 (height) by 5 (width) two-dimensional kernel, specify (3,5).
group (int) – Number of groups of channels. This makes connections across channels more sparse by grouping connections along map direction.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for weight.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for bias.
base_axis (int) – Dimensions up to base_axis are treated as the sample dimensions.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
rng (numpy.random.RandomState) – Random generator for Initializer.
with_bias (bool) – Specify whether to include the bias term.
rate_w (float) – Pruning rate for weights.
rate_b (float) – Pruning rate for bias.
- Returns
N-D array.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "pruned_conv":
W (need_grad=True) : Filter weights in float. (shape: (outmaps, inmaps // group, *kernel))
b (need_grad=True) : Bias vector in float. (shape: (outmaps,))
W_q (need_grad=False) : Quantized weights. (shape: (outmaps, inmaps // group, *kernel))
b_q (need_grad=False) : Quantized biases. (shape: (outmaps,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = pruned_convolution(<args>)
- nnabla.parametric_functions.min_max_quantize(x, ql_min=0, ql_max=255, decay=0.999, x_min_max=False, ema=False, ste_fine_grained=True, eps=0.01, qr_min_init=None, qr_max_init=None, fix_parameters=False, outputs=None, name=None)[source]¶
Min-max quantization.
This function uniformly quantizes values in the range of min and max quantization levels.
Min-max quantization is defined by the following equation:
\[y = round \left(\frac{\min(\max(x, m), M) - m}{scale} \right) \times scale + m,\]
where the \(scale\) is defined as
\[scale = \frac{M - m}{M_q - m_q},\]
and
\[\begin{split}m_q = ql_{min}, \\ M_q = ql_{max}, \\ m = qr_{min}, \\ M = qr_{max}.\end{split}\]
In the backward pass when using ste_fine_grained as false,
\[\frac{\partial q_i}{\partial x_i} = 1.\]
In the backward pass when using ste_fine_grained as true,
\[\begin{split} \frac{\partial q_i}{\partial x_i}= \left\{ \begin{array}{ll} 0 & if \ \ \ x_i > M \\ 1 & if \ \ m \le x_i \le M \\ 0 & if \ \ x_i < m \\ \end{array} \right..\end{split}\]
\(qr_{min}\) and \(qr_{max}\) are treated as follows.
x_min_max is True and ema is True: Exponential moving averages are computed for each \(min(x)\) and \(max(x)\), then stored in \(qr_{min}\) and \(qr_{max}\).
x_min_max is True and ema is False: \(min(x)\) and \(max(x)\) are computed, then stored in \(qr_{min}\) and \(qr_{max}\).
x_min_max is False and ema is True: Exponential moving averages stored in \(qr_{min}\) and \(qr_{max}\) are used.
x_min_max is False and ema is False: Gradients of \(qr_{min}\) and \(qr_{max}\) are computed in the backward pass.
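The four combinations of x_min_max and ema can be summarized in a few lines of plain Python. The helper `update_ranges_sketch` and the exact form of the moving average (`decay * old + (1 - decay) * observed`) are assumptions for illustration; only the case analysis itself comes from the description above.

```python
def update_ranges_sketch(x, qr_min, qr_max, x_min_max, ema, decay=0.999):
    """Return the (qr_min, qr_max) used for quantizing the list x."""
    if x_min_max and ema:
        # Assumed EMA form: new = decay * old + (1 - decay) * observed.
        qr_min = decay * qr_min + (1.0 - decay) * min(x)
        qr_max = decay * qr_max + (1.0 - decay) * max(x)
    elif x_min_max:
        # Use the min and max of the current input directly.
        qr_min, qr_max = min(x), max(x)
    # Otherwise the stored values are used unchanged: previously accumulated
    # averages (ema=True) or ranges trained by gradient (ema=False).
    return qr_min, qr_max

batch = [-1.0, 0.5, 3.0]
direct = update_ranges_sketch(batch, -6.0, 6.0, x_min_max=True, ema=False)
averaged = update_ranges_sketch(batch, -6.0, 6.0, x_min_max=True, ema=True, decay=0.5)
stored = update_ranges_sketch(batch, -6.0, 6.0, x_min_max=False, ema=True)
```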
More precisely, in inference of the min-max quantization, one has to consider the zero-point (zp), an integer that corresponds to the real value 0. The zero-point is defined as
\[\begin{split} zp_f &= ql_{min} -\frac{qr_{min}}{scale}, \\ zp &= \left\{ \begin{array}{ll} ql_{max} & if \ \ \ zp_f \ge ql_{max} \\ round(zp_f) & if \ \ otherwise \\ ql_{min} & if \ \ zp_f \le ql_{min} \\ \end{array} \right..\end{split}\]
Accordingly, in order to simulate the quantization effect of the zero-point, during both the forward and backward passes, \(qr_{min}\) and \(qr_{max}\) are adjusted as follows,
\[\begin{split}qr_{min}^{adj} = (ql_{min} - zp) * scale, \\ qr_{max}^{adj} = (ql_{max} - zp) * scale.\end{split}\]
These operations are often called nudge.
Finally, in the formulas of the min-max quantization, \(m\) and \(M\) are replaced by \(qr_{min}^{adj}\) and \(qr_{max}^{adj}\) respectively.
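The forward formulas, including the zero-point nudge, can be traced with a small plain-Python sketch. It takes the nudged range as `(ql_min - zp) * scale` and `(ql_max - zp) * scale`, which is the reading under which an unrounded zero-point reproduces \(qr_{min}\) exactly. The function name and the list-based interface are illustrative assumptions, not NNabla's implementation.

```python
def min_max_quantize_sketch(x, qr_min=-6.0, qr_max=6.0, ql_min=0, ql_max=255):
    """Quantize a list of floats following the min-max formulas (illustrative)."""
    scale = (qr_max - qr_min) / (ql_max - ql_min)
    # Zero-point: the integer level that represents the real value 0.
    zp_f = ql_min - qr_min / scale
    zp = min(max(round(zp_f), ql_min), ql_max)
    # Nudge the range so that the real value 0 lands exactly on level zp.
    m = (ql_min - zp) * scale
    M = (ql_max - zp) * scale
    out = []
    for v in x:
        clipped = min(max(v, m), M)
        out.append(round((clipped - m) / scale) * scale + m)
    return out

# Five levels (0..4) over [-1, 1]: the representable values are -1, -0.5, 0, 0.5, 1.
y = min_max_quantize_sketch([-3.0, -0.3, 0.0, 0.6, 2.0],
                            qr_min=-1.0, qr_max=1.0, ql_min=0, ql_max=4)
# A range of [-0.9, 1.1] is nudged to [-1.0, 1.0], so 0.0 stays exactly 0.0.
y_nudged = min_max_quantize_sketch([0.0], qr_min=-0.9, qr_max=1.1, ql_min=0, ql_max=4)
```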
- Parameters
x (Variable) – Input N-D array.
ql_min (int, float, or Variable) – Minimum quantization level. Default is 0.
ql_max (int, float, or Variable) – Maximum quantization level. Default is 255.
decay (float) – The decay rate for the exponential moving average.
x_min_max (bool) – Use the min and max of x to compute quantization ranges. Default is False.
ema (bool) – Use the exponential moving average for the min and max quantization ranges. Default is False.
ste_fine_grained (bool) – If True, the straight-through estimator (STE) gradient is not fixed to 1; the {0, 1}-mask computed from the min-max range is applied to the gradient in the backward pass. Otherwise, the STE gradient is 1.
eps (float) – Epsilon, a small value used to ensure that \(qr_{max} - qr_{min}\) is greater than epsilon.
qr_min_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the minimum quantization range, qr_min. Default is nnabla.initializer.ConstantInitializer(-6.0).
qr_max_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the maximum quantization range, qr_max. Default is nnabla.initializer.ConstantInitializer(6.0).
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
References
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, https://arxiv.org/abs/1712.05877
- Parameters to be registered
The following variables are registered in a parameter scope "min_max_quantize":
qr_min (need_grad=False) : Minimum quantization range, the exponential moving average of min values of inputs initialized with -6.0 if ema is True. (shape: ql_min.shape)
qr_max (need_grad=False) : Maximum quantization range, the exponential moving average of max values of inputs initialized with 6.0 if ema is True. (shape: ql_max.shape)
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = min_max_quantize(<args>)
- nnabla.parametric_functions.lstm_cell(x, h, c, state_size, w_init=None, b_init=None, fix_parameters=False, name=None)[source]¶
Long Short-Term Memory.
Long Short-Term Memory, or LSTM, is a building block for recurrent neural network (RNN) layers. An LSTM unit consists of a cell and input, output, and forget gates, whose functions are defined as follows:
\[\begin{split}f_t&&=\sigma(W_fx_t+U_fh_{t-1}+b_f) \\ i_t&&=\sigma(W_ix_t+U_ih_{t-1}+b_i) \\ o_t&&=\sigma(W_ox_t+U_oh_{t-1}+b_o) \\ c_t&&=f_t\odot c_{t-1}+i_t\odot\tanh(W_cx_t+U_ch_{t-1}+b_c) \\ h_t&&=o_t\odot\tanh(c_t).\end{split}\]
References
S. Hochreiter, and J. Schmidhuber. “Long Short-Term Memory.” Neural Computation. 1997.
- Parameters
x (Variable) – Input N-D array with shape (batch_size, input_size).
h (Variable) – Input N-D array with shape (batch_size, state_size).
c (Variable) – Input N-D array with shape (batch_size, state_size).
state_size (int) – Internal state size is set to state_size.
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
- Returns
- Parameters to be registered
The following variables are registered in a parameter scope "lstm":
affine/W (need_grad=True) : Stacked weight matrices of LSTM block. (shape: (inmaps, 4, state_size))
affine/b (need_grad=True) : Stacked bias vectors of LSTM block. (shape: (4, state_size,))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = lstm_cell(<args>)
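The gate equations of lstm_cell can be followed step by step in a plain-Python sketch for a single sample. The per-gate `(W, U, b)` layout of `params` and the helper names are hypothetical conveniences; NNabla itself stores the four gates stacked in affine/W and affine/b.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def lstm_cell_sketch(x, h_prev, c_prev, params):
    """One LSTM step; params maps each gate in "fioc" to a (W, U, b) triple."""
    def pre_activation(gate):
        W, U, b = params[gate]
        return [wx + uh + b_k
                for wx, uh, b_k in zip(matvec(W, x), matvec(U, h_prev), b)]

    f = [sigmoid(v) for v in pre_activation("f")]    # forget gate
    i = [sigmoid(v) for v in pre_activation("i")]    # input gate
    o = [sigmoid(v) for v in pre_activation("o")]    # output gate
    g = [math.tanh(v) for v in pre_activation("c")]  # candidate cell value
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# input_size = 3, state_size = 2, all parameters zero: every gate equals 0.5
# and the candidate is 0, so the cell state is simply halved.
zeros = ([[0.0] * 3 for _ in range(2)], [[0.0] * 2 for _ in range(2)], [0.0, 0.0])
params = {gate: zeros for gate in "fioc"}
h, c = lstm_cell_sketch([1.0, 2.0, 3.0], [0.0, 0.0], [2.0, -4.0], params)
```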
- class nnabla.parametric_functions.LSTMCell(batch_size, state_size, h=None, c=None, name=None)[source]¶
- __call__(x, w_init, b_init, fix_parameters)[source]¶
Updates h and c by calling the lstm_cell function.
- Parameters
x (Variable) – Input N-D array with shape (batch_size, input_size).
w_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for weight. By default, it is initialized with nnabla.initializer.UniformInitializer within the range determined by nnabla.initializer.calc_uniform_lim_glorot.
b_init (nnabla.initializer.BaseInitializer or numpy.ndarray, optional) – Initializer for bias. By default, it is initialized with zeros if with_bias is True.
fix_parameters (bool) – When set to True, the weights and biases will not be updated.
- nnabla.parametric_functions.spectral_norm(w, dim=0, itr=1, eps=1e-12, test=False, u_init=None, fix_parameters=True, name=None)[source]¶
Spectral Normalization.
\[W_{sn} = \frac{W}{\sigma(W)},\]
where \(W\) is the input matrix, and \(\sigma(W)\) is the spectral norm of \(W\). The spectral norm is computed approximately by power iteration.
References
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, “Spectral Normalization for Generative Adversarial Networks”, International Conference on Learning Representations. 2018.
- Parameters
w (Variable) – Input N-D array. This is normally a network parameter.
dim (int) – Output dimension. Default is 0. If the dimension is not 0, then the specified dimension becomes the leftmost dimension by transposing.
itr (int) – Number of iterations. Default is 1.
eps (float) – Epsilon for the normalization. Default is 1e-12.
test (bool) – Use test mode. Default is False.
- Returns
Spectrally normalized \(W_{sn}\) with the same shape as \(W\).
- Return type
Example
import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF

b, c, h, w = 4, 64, 32, 32

# Spectrally normalized convolution
# (outmaps=64 and kernel=(3, 3) are illustrative values)
apply_w = lambda w: PF.spectral_norm(w, dim=0)
h = nn.Variable.from_numpy_array(np.random.randn(b, c, h, w))
h = PF.convolution(h, 64, (3, 3), with_bias=False, apply_w=apply_w)

# Spectrally normalized affine (n_outmaps=10 is an illustrative value)
apply_w = lambda w: PF.spectral_norm(w, dim=1)
h = nn.Variable.from_numpy_array(np.random.randn(b, c))
h = PF.affine(h, 10, with_bias=False, apply_w=apply_w)

# Spectrally normalized embed
# (indices in [0, c) and n_features=16 are illustrative values)
apply_w = lambda w: PF.spectral_norm(w, dim=1)
h = nn.Variable.from_numpy_array(np.random.randint(0, c, size=(b,)))
h = PF.embed(h, c, 16, apply_w=apply_w)
- Parameters to be registered
The following variables are registered in a parameter scope "spectral-norm":
u (need_grad=False) : singular vector. (shape: (w.shape[dim], ))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = spectral_norm(<args>)
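The power iteration behind spectral_norm can be sketched for a 2-D weight in plain Python. The start vector of ones, the placement of `eps` in the normalization, and the function name are assumptions for illustration; NNabla keeps the vector `u` as a registered parameter and reuses it across calls.

```python
import math

def spectral_norm_sketch(W, itr=1, eps=1e-12, u=None):
    """Return (W / sigma, sigma, u), with sigma estimated by power iteration."""
    rows, cols = len(W), len(W[0])
    if u is None:
        u = [1.0] * rows  # illustrative start vector

    def normalize(vec):
        norm = math.sqrt(sum(a * a for a in vec))
        return [a / (norm + eps) for a in vec]

    for _ in range(max(1, itr)):
        # v = W^T u / ||W^T u||, then u = W v / ||W v||
        v = normalize([sum(W[r][c] * u[r] for r in range(rows)) for c in range(cols)])
        u = normalize([sum(W[r][c] * v[c] for c in range(cols)) for r in range(rows)])
    # sigma = u^T W v approximates the largest singular value of W.
    sigma = sum(u[r] * W[r][c] * v[c] for r in range(rows) for c in range(cols))
    return [[w_rc / sigma for w_rc in row] for row in W], sigma, u

W_sn, sigma, u = spectral_norm_sketch([[3.0, 0.0], [0.0, 1.0]], itr=50)
```

For this diagonal matrix the estimate converges to the largest singular value, 3, so the normalized matrix has spectral norm close to 1. With the default `itr=1` the estimate is coarser, which is why carrying `u` over between calls matters in practice.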
- nnabla.parametric_functions.weight_normalization(w, dim=0, eps=1e-12, g_init=None, fix_parameters=False, name=None)[source]¶
Weight Normalization.
\[\mathbf{w}_{WN} = g \dfrac{\mathbf{w}}{\|\mathbf{w}\|}\]
where \(\mathbf{w}\) is the input weights to be normalized, and \(g\) is a set of learnable multiplication factors, one for each slice of the input weights along dim. This function is generally used as a callback passed to apply_w for PF.convolution, PF.affine, and so on. According to the author's original implementation, \(v\) should be initialized by \(N(0, 0.05)\). To meet this condition, an initializer should be passed to the convolution to which Weight Normalization is applied, as in the example below.
References
- Parameters
w (Variable) – Input N-D array. This is normally a network parameter.
dim (int) – Output dimension. Default is 0. If the dimension is not 0, then the specified dimension becomes the leftmost dimension by transposing.
eps (float) – Epsilon for the normalization. Default is 1e-12.
g_init (nnabla.initializer.BaseInitializer or numpy.ndarray) – Initializer for the scale. By default, the L2-norm of the weights corresponding to dim is used.
- Returns
\(W\) with the same shape as \(v\).
- Return type
Example
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

# h is nn.Variable.

# convolution
# according to the original implementation, w should be initialized by N(0, 0.05).
h = PF.convolution(h, ..., apply_w=PF.weight_normalization, w_init=I.NormalInitializer(0.05))

# affine
h = PF.affine(h, ..., apply_w=lambda w: PF.weight_normalization(w, dim=1), w_init=I.NormalInitializer(0.05))
Warning
Up to version 1.10.0, this was implemented as a composite of functions.
- Parameters to be registered
The following variables are registered in a parameter scope "wn":
g (need_grad=True) : Weight Normalization adaptive scale scalar. (shape: w.shape[dim])
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = weight_normalization(<args>)
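The weight normalization formula can be checked on a small matrix in plain Python. The sketch treats each row as one output slice (dim=0) and rescales it to have L2 norm `g[k]`. Adding `eps` under the square root and the function name are assumptions for illustration.

```python
import math

def weight_normalization_sketch(w, g, eps=1e-12):
    """Rescale each row of w to have L2 norm g[k] (illustrative, dim=0)."""
    out = []
    for row, g_k in zip(w, g):
        norm = math.sqrt(sum(v * v for v in row) + eps)
        out.append([g_k * v / norm for v in row])
    return out

w = [[3.0, 4.0],   # norm 5
     [0.0, 2.0]]   # norm 2
w_wn = weight_normalization_sketch(w, g=[10.0, 1.0])
```

The first row becomes approximately `[6.0, 8.0]` (norm 10) and the second `[0.0, 1.0]` (norm 1): the direction of each row is preserved while its length is set by `g`.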
- nnabla.parametric_functions.multi_head_attention(query, key, value, num_heads=12, dropout=0.0, k_embed_dim=None, v_embed_dim=None, out_dim=None, rng=None, with_bias=True, add_attn_bias=False, additive_mask=None, key_padding_mask=None, fix_parameters=False, param_init=None, name=None)[source]¶
MultiHeadAttention.
Computes multi-headed attention with query, key, and value. We use the following notations to describe the inputs and outputs below. \(L_T\): target sequence length, \(L_S\): source sequence length, \(B\): batch size, \(D\): input dimension, \(E\): embedding dimension.
References
A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>
Example:
q = nn.Variable((tgt_len, batch_size, q_input_dim))
k = nn.Variable((src_len, batch_size, k_input_dim))
v = nn.Variable((src_len, batch_size, v_input_dim))
out, w = PF.multi_head_attention(q, k, v)
out.forward()
- Parameters
query (Variable) – Input N-D array with shape \((L_T, B, D_q)\).
key (Variable) – Input N-D array with shape \((L_S, B, D_k)\).
value (Variable) – Input N-D array with shape \((L_S, B, D_v)\).
num_heads (int, optional) – Number of attention heads. Note that the embedding dimension E must be divisible by the number of heads. Default is 12, which is conventional.
dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.
k_embed_dim (int, optional) – Embedding dimension for key. If specified, embedding dimensions for both query and key are set to that value. Otherwise, k_embed_dim is set to the same value as the embedding dimension for query.
v_embed_dim (int, optional) – Embedding dimension for value. If not specified, it defaults to the same value as the embedding dimension for query.
out_dim (int, optional) – Embedding dimension for output weight. If not specified, it defaults to the same value as the embedding dimension for value.
rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
with_bias (bool, optional) – Specify whether to include the bias parameters. Default is True.
add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
additive_mask (Variable, optional) – Input N-D array with shape \((L_T, L_S)\). Values will be added to the attention layer to prevent attention to certain positions.
key_padding_mask (Variable, optional) – Input N-D array with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
param_init (dict, optional) – Parameter initializers can be set with a dict. Possible keys of the dict include q_weight, k_weight, v_weight, q_bias, k_bias, v_bias, out_weight, out_bias, attn_bias_k, attn_bias_v. A value of the dict must be an Initializer or a numpy.ndarray. E.g. {'q_bias': ConstantInitializer(0)}.
- Returns
Output \(y\) with shape \((L_T, B, E)\), and output \(h_n\) with shape \((B, L_T, L_S)\). Both are nnabla.Variable.
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "multi_head_attention":
q_weight (need_grad=True) : weights for query. (shape: (E, E))
k_weight (need_grad=True) : weights for key. (shape: (E_k, E))
v_weight (need_grad=True) : weights for value. (shape: (E_v, E))
out_weight (need_grad=True) : weights for out projection. (shape: (E, E))
q_bias (need_grad=True) : bias for query. (shape: (E, ))
k_bias (need_grad=True) : bias for key. (shape: (E, ))
v_bias (need_grad=True) : bias for value. (shape: (E, ))
out_bias (need_grad=True) : bias for out projection. (shape: (E, ))
attn_bias_k (need_grad=True) : attention bias for k. (shape: (E, 1))
attn_bias_v (need_grad=True) : attention bias for v. (shape: (E, 1))
Note
If the name option is passed, the parameters become wrapped inside the parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = multi_head_attention(<args>)
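To show how an additive mask enters the attention weights, the plain-Python sketch below computes scaled dot-product attention, as described in the referenced paper, for a single head and a single batch element. It omits the input and output projections, dropout, and the split into num_heads heads. All names are illustrative assumptions rather than NNabla internals.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head_sketch(q, k, v, additive_mask=None):
    """Scaled dot-product attention for one head and one batch element.

    q: L_T x d, k: L_S x d, v: L_S x d_v, additive_mask: L_T x L_S or None.
    Returns (outputs of shape L_T x d_v, weights of shape L_T x L_S).
    """
    d = len(q[0])
    outputs, weights = [], []
    for t, q_t in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_t, k_s)) / math.sqrt(d) for k_s in k]
        if additive_mask is not None:
            # The mask is added to the scores before the softmax.
            scores = [s + m for s, m in zip(scores, additive_mask[t])]
        w_t = softmax(scores)
        weights.append(w_t)
        outputs.append([sum(w_ts * v_s[j] for w_ts, v_s in zip(w_t, v))
                        for j in range(len(v[0]))])
    return outputs, weights

q = [[1.0, 0.0]]                    # L_T = 1
k = [[1.0, 0.0], [0.0, 1.0]]        # L_S = 2
v = [[1.0], [2.0]]
out, attn = attention_head_sketch(q, k, v, additive_mask=[[0.0, float("-inf")]])
_, attn_unmasked = attention_head_sketch(q, k, v)
```

With the second source position masked by negative infinity, all of the weight falls on the first position; without a mask, each row of weights still sums to 1.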
- nnabla.parametric_functions.transformer(src, tgt, embed_dim=512, num_heads=8, num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048, dropout=0.1, activation=None, src_additive_mask=None, tgt_additive_mask=None, memory_additive_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rng=None, add_attn_bias=False, fix_parameters=False, name=None)[source]¶
Transformer.
We use the following notations to describe the inputs and outputs below. \(L_T\): target sequence length, \(L_S\): source sequence length, \(B\): batch size, \(E\): embedding dimension.
References
A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>
Examples:
src = nn.Variable((src_len, batch_size, embed_dim), need_grad=True)
tgt = nn.Variable((tgt_len, batch_size, embed_dim), need_grad=True)
out = PF.transformer(src, tgt, num_heads=16, num_encoder_layers=12)
out.forward()
- Parameters
src (Variable) – Input source sequence to the encoder with shape \((L_S, B, E)\).
tgt (Variable) – Input target sequence to the decoder with shape \((L_T, B, E)\).
embed_dim (int, optional) – Embedding dimension to be used. Default is 512.
num_heads (int, optional) – Number of attention heads. Default is 8.
num_encoder_layers (int, optional) – Number of encoder layers to stack. Default is 6.
num_decoder_layers (int, optional) – Number of decoder layers to stack. Default is 6.
dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.
dropout (float, optional) – Dropout ratio applied. Default is 0.1.
activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.
src_additive_mask (Variable, optional) – Additive mask for the src sequence with shape \((L_S, L_S)\).
tgt_additive_mask (Variable, optional) – Additive mask for the tgt sequence with shape \((L_T, L_T)\).
memory_additive_mask (Variable, optional) – Additive mask for the encoder output with shape \((L_T, L_S)\).
src_key_padding_mask (Variable, optional) – Key padding mask for src keys per batch with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
tgt_key_padding_mask (Variable, optional) – Key padding mask for tgt keys per batch with shape \((B, L_T)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
memory_key_padding_mask (Variable, optional) – Key padding mask for memory keys per batch with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
- Returns
Output \(y\) with shape \((L_T, B, E)\)
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "transformer":
encoder{layer#} (need_grad=True) : parameters for the n’th encoder layer. (shape: Refer to transformer_encode for details)
decoder{layer#} (need_grad=True) : parameters for the n’th decoder layer. (shape: Refer to transformer_decode for details)
Note
If the name option is passed, the parameters are wrapped inside a parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = transformer(<args>)
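The mask shapes listed above can be illustrated with a small plain-Python sketch. It builds a causal additive mask of shape \((L_T, L_T)\) and a key padding mask of shape \((B, L_S)\) as nested lists. The helper names are hypothetical and not part of NNabla. Two conventions are assumptions, since this section does not state them: that blocked positions in an additive mask carry negative infinity, and that 1 marks a padded position in a key padding mask. In practice the lists would be copied into a Variable of the same shape.

```python
import math


def causal_additive_mask(length):
    """Return a (length, length) additive mask as nested lists.

    Entry [i][j] is 0.0 when position i may attend to position j (j <= i)
    and -inf otherwise. ASSUMPTION: blocked entries use -inf.
    """
    return [[0.0 if j <= i else -math.inf for j in range(length)]
            for i in range(length)]


def key_padding_mask(lengths, max_len):
    """Return a (B, max_len) mask holding only 0 and 1.

    ASSUMPTION: 1 marks a padded (ignored) position, 0 a real token.
    """
    return [[0 if t < n else 1 for t in range(max_len)] for n in lengths]


tgt_mask = causal_additive_mask(3)     # shape (L_T, L_T) with L_T = 3
src_pad = key_padding_mask([4, 2], 4)  # shape (B, L_S) with B = 2, L_S = 4
```

Here the second sequence in the batch has only two real tokens, so its last two key positions are flagged as padding.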
- nnabla.parametric_functions.transformer_encode(src, embed_dim, num_heads, dim_feedforward=2048, dropout=0.1, activation=None, src_additive_mask=None, src_key_padding_mask=None, rng=None, add_attn_bias=False, fix_parameters=False, name=None)[source]¶
Transformer Encoder.
- Parameters
src (Variable) – Input sequence to the encoder layer with shape \((L_S, B, E)\).
embed_dim (int) – Embedding dimension.
num_heads (int) – Number of attention heads.
dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.
dropout (float, optional) – Dropout ratio. Default is 0.1.
activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.
src_additive_mask (Variable, optional) – Additive mask for the source sequence with shape \((L_S, L_S)\)
src_key_padding_mask (Variable, optional) – Padding mask for the source sequence with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
- Returns
Output \(y\) with shape \((L_S, B, E)\)
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "transformer_encode":
src_self_attn (need_grad=True) : self-attention parameters for source sequence. (shape: Refer to multi_head_attention for details)
enc_affine1 (need_grad=True) : first affine used in encoder. (shape: Refer to affine for details)
enc_affine2 (need_grad=True) : second affine used in encoder. (shape: Refer to affine for details)
enc_layer_norm1 (need_grad=True) : first layer normalization used in encoder. (shape: Refer to layer_normalization for details)
enc_layer_norm2 (need_grad=True) : second layer normalization used in encoder. (shape: Refer to layer_normalization for details)
Note
If the name option is passed, the parameters are wrapped inside a parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = transformer_encode(<args>)
- nnabla.parametric_functions.transformer_decode(tgt, memory, embed_dim, num_heads, dim_feedforward=2048, dropout=0.1, activation=None, tgt_additive_mask=None, memory_additive_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rng=None, add_attn_bias=False, fix_parameters=False, name=None)[source]¶
Transformer Decoder.
- Parameters
tgt (Variable) – Input sequence to the decoder layer with shape \((L_T, B, E)\).
memory (Variable) – Output sequence from the last layer of the encoder with shape \((L_S, B, E)\).
embed_dim (int) – Embedding dimension.
num_heads (int) – Number of attention heads.
dim_feedforward (int, optional) – Dimension of the feedforward network model. Default is 2048.
dropout (float, optional) – Dropout ratio. Default is 0.1.
activation (function, optional) – Non-linear activation function to be used. Default is None, which is set as F.relu in the code.
tgt_additive_mask (Variable, optional) – Additive mask for the target sequence with shape \((L_T, L_T)\).
memory_additive_mask (Variable, optional) – Additive mask for the memory sequence with shape \((L_T, L_S)\).
tgt_key_padding_mask (Variable, optional) – Padding mask for the target sequence with shape \((B, L_T)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
memory_key_padding_mask (Variable, optional) – Padding mask for the memory sequence with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
rng (numpy.random.RandomState, optional) – Random generator for Initializer. Default is None.
add_attn_bias (bool, optional) – Specify whether to add attention bias parameters for key and value. Default is False.
fix_parameters (bool, optional) – When set to True, the weights and biases will not be updated. Default is False.
- Returns
Output \(y\) with shape \((L_T, B, E)\)
- Return type
- Parameters to be registered
The following variables are registered in a parameter scope "transformer_decode":
tgt_self_attn (need_grad=True) : self-attention parameters for target sequence. (shape: Refer to multi_head_attention for details)
tgt_memory_attn (need_grad=True) : attention parameters for target sequence with output from encoder as key. (shape: Refer to multi_head_attention for details)
dec_affine1 (need_grad=True) : first affine used in decoder. (shape: Refer to affine for details)
dec_affine2 (need_grad=True) : second affine used in decoder. (shape: Refer to affine for details)
dec_layer_norm1 (need_grad=True) : first layer normalization used in decoder. (shape: Refer to layer_normalization for details)
dec_layer_norm2 (need_grad=True) : second layer normalization used in decoder. (shape: Refer to layer_normalization for details)
dec_layer_norm3 (need_grad=True) : third layer normalization used in decoder. (shape: Refer to layer_normalization for details)
Note
If the name option is passed, the parameters are wrapped inside a parameter scope with the specified name, yielding the same results as the following code. This can be used to simplify the code.
with parameter_scope(name):
    output = transformer_decode(<args>)
Parameter Initializer¶
Some of the parametric functions optionally take one of the parameter initializers listed below.
- class nnabla.initializer.BaseInitializer[source]¶
Base class of the parameter initializer.
- __call__(shape)[source]¶
Generates an array with an initializer.
- Parameters
shape (tuple of int) – Shape of the numpy.ndarray to be created.
- Returns
Array.
- Return type
Note
Subclasses of BaseInitializer must override this method.
- class nnabla.initializer.ConstantInitializer(value=0)[source]¶
Bases: nnabla.initializer.BaseInitializer
Generates a constant valued array.
- Parameters
value (float) – A constant value.
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.ConstantInitializer(0.1)
b = I.ConstantInitializer()  # this generates constant valued array of default value 0
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
- class nnabla.initializer.NormalInitializer(sigma=1.0, rng=None)[source]¶
Bases: nnabla.initializer.BaseInitializer
Generates a random array from a specified normal distribution.
\[\mathbf x \sim {\cal N} (\mathbf 0 | \sigma^2 \mathbf I)\]
- Parameters
sigma (float) – Standard deviation \(\sigma\) of the distribution.
rng (numpy.random.RandomState) – Random number generator.
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.NormalInitializer(5e-5)
b = I.NormalInitializer(0.0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
- class nnabla.initializer.UniformInitializer(lim=(- 1, 1), rng=None)[source]¶
Bases: nnabla.initializer.BaseInitializer
Generates a random array from a specified uniform distribution.
\[\mathbf x \sim {\cal U} (a, b)\]
- Parameters
rng (numpy.random.RandomState) – Random number generator.
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.UniformInitializer()  # this generates uniform distribution within the default range of (-1,1)
b = I.UniformInitializer((-0.5,0.5))
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
- class nnabla.initializer.UniformIntInitializer(lim=(0, 10), rng=None)[source]¶
Bases: nnabla.initializer.BaseInitializer
Generates a random array from a specified integer uniform distribution.
\[\mathbf x \sim {\cal U} ([a, b))\]
- Parameters
rng (numpy.random.RandomState) – Random number generator.
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.UniformIntInitializer()  # this generates uniform integer distribution within the default range of (0,10)
b = I.UniformIntInitializer((-1,1))
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
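The half-open interval \([a, b)\) in the formula above means the upper limit itself is never drawn. The following sketch illustrates this with the standard library only. It is not NNabla's implementation, and the helper name is hypothetical.

```python
import random


def sample_uniform_int(lim, count, seed=0):
    """Draw `count` integers uniformly from the half-open range [a, b)."""
    a, b = lim
    rng = random.Random(seed)  # seeded so repeated runs give the same draws
    return [rng.randrange(a, b) for _ in range(count)]


samples = sample_uniform_int((0, 10), 1000)  # every value lies in 0..9
small = sample_uniform_int((-1, 1), 200)     # only -1 and 0 are possible
```

Under this reading, a limit of (-1, 1) can only produce the values -1 and 0.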
- class nnabla.initializer.RangeInitializer(start=0, step=1)[source]¶
Bases: nnabla.initializer.BaseInitializer
Generates an array with sequence of numbers.
\[\mathbf x[i] = start + step * i\]
Example:
import nnabla as nn
import nnabla.initializer as I

x = nn.Variable([100])
x.d = I.RangeInitializer(0, 1)(x.shape)
- class nnabla.initializer.OrthogonalInitializer(gain=1.0, rng=None)[source]¶
Bases: nnabla.initializer.BaseInitializer
Generates orthogonal matrix weights, as proposed by Saxe et al.
- Parameters
gain (float) – scaling factor which should be decided depending on a type of units.
rng (numpy.random.RandomState) – Random number generator.
Example:
import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
w = I.OrthogonalInitializer(np.sqrt(2.0))
b = I.ConstantInitializer(0.0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
References
- class nnabla.initializer.WeightNormalizationScaleInitializer(w, dim=0, eps=1e-12)[source]¶
Bases: nnabla.initializer.BaseInitializer
Compute the L2-norm for each weight kernel.
This initializer is specific to the weight normalization scale: it keeps the magnitude of the originally initialized weights unchanged when weight normalization is applied, at initialization time only.
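To make "L2-norm for each weight kernel" concrete, the sketch below computes one norm per output kernel of a small 2-D weight, reducing over every axis except dim=0. It is plain Python rather than NNabla code, and the helper name is hypothetical. How eps enters the computation is an assumption here: it is added under the square root.

```python
import math


def l2_norm_per_kernel(w, eps=1e-12):
    """Return one L2 norm per row of a 2-D weight given as nested lists.

    Each row of `w` plays the role of one output kernel (dim=0).
    ASSUMPTION: eps is added to the squared sum before the square root.
    """
    return [math.sqrt(sum(v * v for v in row) + eps) for row in w]


w = [[3.0, 4.0],   # norm is close to 5
     [0.0, 0.0]]   # an all-zero kernel; eps keeps the norm strictly positive
g = l2_norm_per_kernel(w)
```

Using these norms as the initial scale means that dividing each kernel by its norm and multiplying by the scale reproduces the original weights.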
- nnabla.initializer.calc_normal_std_he_forward(inmaps, outmaps, kernel=(1, 1))[source]¶
Calculates the standard deviation proposed by He et al.
\[\sigma = \sqrt{\frac{2}{NK}}\]
- Parameters
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_he_forward(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
References
- nnabla.initializer.calc_normal_std_he_backward(inmaps, outmaps, kernel=(1, 1))[source]¶
Calculates the standard deviation of He et al. (backward case).
\[\sigma = \sqrt{\frac{2}{MK}}\]
- Parameters
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_he_backward(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
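The forward and backward He formulas differ only in which map count appears in the denominator. The sketch below recomputes both in plain Python. It assumes \(N\) is inmaps, \(M\) is outmaps, and \(K\) is the product of the kernel dimensions, since the parameter descriptions are not reproduced in this section. The helper name is hypothetical and not part of NNabla.

```python
import math


def he_std(fan, kernel=(1, 1)):
    """Return sqrt(2 / (fan * K)), where K is the product of kernel dims."""
    k = math.prod(kernel)
    return math.sqrt(2.0 / (fan * k))


inmaps, outmaps, kernel = 1, 64, (3, 3)
sigma_forward = he_std(inmaps, kernel)    # forward case: fan is N = inmaps
sigma_backward = he_std(outmaps, kernel)  # backward case: fan is M = outmaps
```

With the default kernel=(1, 1), \(K = 1\), so the result does not reflect the size of a 3x3 convolution kernel unless the kernel shape is passed explicitly.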
References
- nnabla.initializer.calc_normal_std_glorot(inmaps, outmaps, kernel=(1, 1))[source]¶
Calculates the standard deviation proposed by Glorot et al.
Note
The definition has been updated as follows since v1.2. It may affect the behavior of existing scripts that rely on the default initialization.
\[\sigma = \sqrt{\frac{2}{K(N + M)}}\]
- Parameters
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
s = I.calc_normal_std_glorot(x.shape[1],64)
w = I.NormalInitializer(s)
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
References
- nnabla.initializer.calc_uniform_lim_glorot(inmaps, outmaps, kernel=(1, 1))[source]¶
Calculates the lower bound and the upper bound of the uniform distribution proposed by Glorot et al.
Note
The definition has been updated as follows since v1.3. It may affect the behavior of existing scripts that rely on the default initialization.
\[\begin{split}b &= \sqrt{\frac{6}{K(N + M)}}\\ a &= -b\end{split}\]
- Parameters
Example:
import nnabla as nn
import nnabla.parametric_functions as PF
import nnabla.initializer as I

x = nn.Variable([60,1,28,28])
lb, ub = I.calc_uniform_lim_glorot(x.shape[1],64)
w = I.UniformInitializer((lb,ub))
b = I.ConstantInitializer(0)
h = PF.convolution(x, 64, [3, 3], w_init=w, b_init=b, pad=[1, 1], name='conv')
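As a consistency check between the Glorot normal and Glorot uniform formulas, the sketch below recomputes both in plain Python and compares their variances, using the fact that a uniform distribution on \((a, b)\) has variance \((b - a)^2 / 12\). It assumes \(N\) is inmaps, \(M\) is outmaps, and \(K\) is the product of the kernel dimensions. The helper names are hypothetical and not part of NNabla.

```python
import math


def glorot_std(inmaps, outmaps, kernel=(1, 1)):
    """Return sigma = sqrt(2 / (K * (N + M)))."""
    k = math.prod(kernel)
    return math.sqrt(2.0 / (k * (inmaps + outmaps)))


def glorot_uniform_lim(inmaps, outmaps, kernel=(1, 1)):
    """Return (a, b) with b = sqrt(6 / (K * (N + M))) and a = -b."""
    k = math.prod(kernel)
    b = math.sqrt(6.0 / (k * (inmaps + outmaps)))
    return -b, b


sigma = glorot_std(1, 64, (3, 3))
lower, upper = glorot_uniform_lim(1, 64, (3, 3))
uniform_variance = (upper - lower) ** 2 / 12.0  # variance of U(lower, upper)
```

Both initializations therefore target the same variance, \(2 / (K(N + M))\), and differ only in the shape of the distribution.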
References