Neural Network Libraries¶
Neural Network Libraries is a deep learning framework intended for research, development, and production. We aim for it to run everywhere: desktop PCs, HPC clusters, embedded devices, and production servers.
This document describes how to use the Python API and C++ API, the contribution guide for developers, and the license terms of this software. The Python API is better suited to fast prototyping and experimentation with deep learning systems, while the C++ API is for deploying inference or training algorithms to embedded systems and servers (the C++ API documentation is not available yet; we will make it available soon). The framework is designed with modularity and extensibility in mind. Community contributors can add a new neural network operator or optimizer module, or, as an extension, a specialized implementation of neural network modules for a specific target device.
Python Package¶
The Python API, built on top of our C++11 core, maximizes flexibility in neural network design and encourages fast prototyping and experimentation. NNabla works with Python>=3.5 (>=3.6 is recommended).
Python Package Installation¶
There are three ways to install the NNabla Python package.
Install with pip command¶
The NNabla Python packages are hosted on PyPI for many platforms. If you are familiar with Python and its package manager pip (and optionally CUDA, which is recommended), the following pip installation guide should be sufficient for installing the NNabla Python package. For a more detailed OS-specific setup guide, go to the next section.
NNabla package installation using PIP¶
Note: Refer to the OS specific workflows section for the setup of OS-specific dependencies.
Install NNabla package via pip:
pip install nnabla
Note: To make sure the latest version is installed, uninstall any previously installed version with pip uninstall -y nnabla beforehand.
Then, check if it works by running:
python -c "import nnabla"
2018-06-26 15:20:16,759 [nnabla][INFO]: Initializing CPU extension...
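If you prefer to run the same check from a script rather than the shell, a minimal sketch using only the Python standard library could look like the following. The helper name `is_installed` is hypothetical and not part of NNabla; it only tests whether a package can be located, without importing it.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the nnabla package is visible to this interpreter.
print("nnabla installed:", is_installed("nnabla"))
```

Because the package is located but not imported, this check does not print the extension initialization log shown above.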
NNabla CUDA extension package installation¶
Run an Example¶
Download the examples (and unzip them) or clone the NNabla Examples repository, then go to the MNIST folder.
cd nnabla-examples/mnist-collection/
Run MNIST classification.
python classification.py
Run MNIST classification with CUDA/cuDNN.
python classification.py -c cudnn
OS specific workflows¶
Installation on Linux¶
These instructions describe how to install NNabla using pip on almost any 64-bit Linux system.
The provided binary packages support Python 3.5 (not recommended), 3.6, and 3.7. We recommend Miniconda as the Python distribution. The following is a simple procedure for installing Miniconda Python.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p {installation path e.g. ~/miniconda}
# You have to set an environment variable PATH accordingly
# to enable the installed ``Python`` and the ``conda`` system.
echo 'export PATH=<installation path>/bin:$PATH' >> ~/.bashrc
# Restart your bash or source ~/.bashrc
# Switch the default Python version
conda install -y python={version number e.g. 3.6}
We have also tested other Linux distributions and versions (Ubuntu 14.04, CentOS 6.9 and 7.3, Fedora 23, 25, and 26, and RHEL 7.3) in various environments (bare-metal servers, AWS instances, and Docker machines), so you can install NNabla on them in almost the same way as described here. Detailed how-to-install guides for each are coming soon.
Installation on Windows¶
We tested on 64-bit Windows 8.1 and 64-bit Windows 10.
The following software is required for installation:
Required software.
Python>=3.6: PIP
Microsoft Visual C++ 2015 Redistributable
Recommended.
CUDA Toolkit and cuDNN (if you have CUDA GPUs).
These instructions use Miniconda.
Get and install the Windows binary from here
Then install the required packages from the command prompt.
> conda install scipy scikit-image ipython
If your network uses a proxy and setup fails, configure the proxy server with environment variables and try the installation again.
> SET HTTP_PROXY=http://(enter the address of the http proxy server here)
> SET HTTPS_PROXY=https://(enter the address of the https proxy server here)
Get and install from here
If you are using an NVIDIA GPU, installing the following software will drastically improve execution speed.
To install cuDNN, copy bin, include and lib to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v{CUDA_VERSION}
See a list of compatible cuDNN versions of CUDA extension packages.
Depending on the environment, it will take a long time. Please wait.
Please install scipy using “conda install” before “pip install nnabla”.
Installation on macOS¶
NOTE: Our test coverage of macOS environments and machines is very limited. Please submit an issue if you encounter a problem.
We test the installation on macOS Sierra.
The following software is required for installation:
Python>=3.6 (we recommend setting up Python using Anaconda or Miniconda).
pip (bundled in Conda Python)
wheel (bundled in Conda Python)
setuptools (bundled in Conda Python. You may need to upgrade setuptools with pip install -U --no-deps setuptools.)
See NNabla package installation using PIP (note that binary packages for the CUDA extension are not available for macOS; please build it from source).
Install NNabla package compatible with Multi-GPU execution¶
To enable multi-GPU execution, such as distributed training, on NNabla, you have to install a special edition of the NNabla package. See Installation with Multi-GPU supported for installation instructions.
Install from source¶
The documentation for building from source has been moved to the GitHub repository (build or build_distributed).
Running on Docker¶
Docker images¶
Python API Tutorial¶
The following tutorial documents are automatically generated from the Jupyter notebook files listed in NNabla Tutorial. If you want to run them step by step, follow the link and see the instructions there.
NNabla by Examples¶
This tutorial demonstrates how to write a script that trains a neural network, using a simple handwritten digit classification task.
Note: This tutorial notebook requires scikit-learn and matplotlib to be installed in your Python environment.
First, let us prepare some dependencies.
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla.solvers as S
from nnabla.monitor import tile_images
import numpy as np
import matplotlib.pyplot as plt
import tiny_digits
%matplotlib inline
np.random.seed(0)
imshow_opt = dict(cmap='gray', interpolation='nearest')
2017-06-26 23:09:49,971 [nnabla][INFO]: Initializing CPU extension...
The tiny_digits module is located under this folder. It provides some utilities for loading a small handwritten-digit classification dataset (similar to MNIST, with 8x8 images) available in scikit-learn.
Logistic Regression¶
We will first start by defining a computation graph for logistic regression. (For details on logistic regression, see Appendix A.)
The training will be done by gradient descent, where gradients are calculated using the error backpropagation algorithm (backprop).
Preparing a Toy Dataset¶
This section just prepares a dataset to be used for demonstration of NNabla usage.
digits = tiny_digits.load_digits(n_class=10)
tiny_digits.plot_stats(digits)
Num images: 1797
Image shape: (8, 8)
Labels: [0 1 2 3 4 5 6 7 8 9]
The next block creates a dataset loader, a generator that provides images and labels as minibatches. Note that this dataset is for demonstration only and is not part of NNabla.
data = tiny_digits.data_iterator_tiny_digits(digits, batch_size=64, shuffle=True)
2017-06-26 23:09:50,545 [nnabla][INFO]: DataSource with shuffle(True)
2017-06-26 23:09:50,546 [nnabla][INFO]: Using DataSourceWithMemoryCache
2017-06-26 23:09:50,546 [nnabla][INFO]: DataSource with shuffle(True)
2017-06-26 23:09:50,547 [nnabla][INFO]: On-memory
2017-06-26 23:09:50,547 [nnabla][INFO]: Using DataIterator
A minibatch looks as follows. img and label are numpy.ndarray instances.
img, label = data.next()
plt.imshow(tile_images(img), **imshow_opt)
print("labels: {}".format(label.reshape(8, 8)))
print("Label shape: {}".format(label.shape))
labels: [[ 2. 8. 2. 6. 6. 7. 1. 9.]
[ 8. 5. 2. 8. 6. 6. 6. 6.]
[ 1. 0. 5. 8. 8. 7. 8. 4.]
[ 7. 5. 4. 9. 2. 9. 4. 7.]
[ 6. 8. 9. 4. 3. 1. 0. 1.]
[ 8. 6. 7. 7. 1. 0. 7. 6.]
[ 2. 1. 9. 6. 7. 9. 0. 0.]
[ 5. 1. 6. 3. 0. 2. 3. 4.]]
Label shape: (64, 1)
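The idea of a loader that yields shuffled minibatches can be sketched as a plain Python generator. This is not how tiny_digits or NNabla's data iterator is implemented; the function name `minibatches` and the toy data are assumptions made for illustration only.

```python
import random


def minibatches(images, labels, batch_size, shuffle=True, seed=0):
    """Yield (image_batch, label_batch) pairs forever, one epoch after another."""
    rng = random.Random(seed)
    indices = list(range(len(images)))
    while True:
        if shuffle:
            rng.shuffle(indices)  # New sample order for every epoch
        # Step through the epoch, dropping the final incomplete batch.
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            chosen = indices[start:start + batch_size]
            yield [images[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 10 "images" (plain integers here) with labels 0..9.
loader = minibatches(list(range(100, 110)), list(range(10)), batch_size=4)
batch_images, batch_labels = next(loader)
print(batch_images, batch_labels)
```

Images and labels are indexed with the same shuffled indices, so each image stays paired with its own label.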
Preparing the Computation Graph¶
NNabla provides two different ways to run backprop-based gradient descent optimization: one with a static graph, and another with a dynamic graph. We show the static version first.
# Forward pass
x = nn.Variable(img.shape)  # Define an image variable
with nn.parameter_scope("affine1"):
    y = PF.affine(x, 10)  # Output is 10 class
This code block shows one of the most important features of graph building in NNabla, the parameter scope. The first line defines an input variable x. The second line creates a parameter scope. The third line then applies PF.affine, an affine transform, to x and creates a variable y holding the result. Here, the PF (parametric_function) module provides functions that contain learnable parameters, such as affine transforms (which contain weights), convolution (which contains kernels), and batch normalization (which contains transformation factors and coefficients). We call these functions parametric functions. The parameters are created and randomly initialized at the function call, and registered under the name “affine1” by the parameter_scope context.
# Building a loss graph
t = nn.Variable(label.shape)  # Define a target variable
loss = F.mean(F.softmax_cross_entropy(y, t))  # Softmax cross entropy fits multi-class classification problems
The lines above define a target variable and attach loss functions to the end of the graph. Note that building a static graph doesn’t execute any computation, but the shapes of output variables are inferred. Therefore, we can inspect the shape of each variable at this point:
print("Printing shapes of variables")
print(x.shape)
print(y.shape)
print(t.shape)
print(loss.shape) # empty tuple means scalar
Printing shapes of variables
(64, 1, 8, 8)
(64, 10)
(64, 1)
()
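The inferred shapes can be reproduced by hand. The sketch below assumes that the affine layer keeps the dimensions before `base_axis` (default 1) as batch dimensions and flattens every dimension from `base_axis` onward into features; the helper name `affine_shapes` is hypothetical and not part of NNabla.

```python
from functools import reduce
from operator import mul


def affine_shapes(input_shape, n_outputs, base_axis=1):
    """Return (weight_shape, output_shape) of an affine layer."""
    batch_dims = tuple(input_shape[:base_axis])
    # All remaining dimensions are flattened into one feature dimension.
    n_features = reduce(mul, input_shape[base_axis:], 1)
    return (n_features, n_outputs), batch_dims + (n_outputs,)


weight_shape, output_shape = affine_shapes((64, 1, 8, 8), 10)
print(weight_shape, output_shape)
```

For the (64, 1, 8, 8) input used here, the 1 x 8 x 8 image is flattened into 64 features, which is why y has shape (64, 10).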
Executing a static graph¶
You can execute the computation of the graph by calling the forward() method on a sink variable. Inputs can be set via the .d accessor, which borrows CPU array references as numpy.ndarray.
# Set data
x.d = img
t.d = label
# Execute a forward pass
loss.forward()
# Showing results
print("Prediction score of 0th image: {}".format(y.d[0]))
print("Loss: {}".format(loss.d))
Prediction score of 0th image: [ 9.75851917 6.49118519 16.47323608 1.36296904 0.78583491
4.08872032 7.84134388 2.42956853 3.31485462 3.61868763]
Loss: 10.6016616821
The output doesn’t make sense since the network is just randomly initialized.
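To make the loss value easier to interpret, here is a plain-Python sketch of what a softmax cross entropy averaged over a minibatch computes. It is not NNabla's implementation, and the function names are chosen for illustration only.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    shift = max(scores)  # Subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(scores, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(scores)[label])


def mean_loss(batch_scores, batch_labels):
    """Average the per-sample losses over a minibatch."""
    losses = [softmax_cross_entropy(s, t)
              for s, t in zip(batch_scores, batch_labels)]
    return sum(losses) / len(losses)


# Ten equal scores: the classifier is maximally uncertain.
uniform_loss = mean_loss([[0.0] * 10], [3])
print(uniform_loss)
```

A classifier that scores all 10 classes equally has a loss of ln(10), about 2.30, so a loss far above that value indicates confident but wrong predictions, as expected from random initialization.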
Backward propagation through the graph¶
The parameters registered by the parameter_scope management function can be queried with get_parameters() as a dict.
print(nn.get_parameters())
OrderedDict([('affine1/affine/W', <Variable((64, 10), need_grad=True) at 0x7fa0ba361d50>), ('affine1/affine/b', <Variable((10,), need_grad=True) at 0x7fa0ba361ce8>)])
Before executing backpropagation, we should initialize the gradient buffers of all parameters to zero.
for param in nn.get_parameters().values():
    param.grad.zero()
Then, you can execute backprop by calling the backward() method on the sink variable.
# Compute backward
loss.backward()
# Showing gradients.
for name, param in nn.get_parameters().items():
    print(name, param.shape, param.g.flat[:20])  # Showing first 20.
affine1/affine/W (64, 10) [ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 4.98418584e02 8.72317329e03
4.06671129e02 4.68742661e02 2.52632981e09 7.86017510e04
9.06870365e02 1.56249944e02 1.56217301e02 3.12499963e02]
affine1/affine/b (10,) [ 0.42710391 0.01852455 0.07369987 0.04687012 0.07798236 0.03664626
0.01651323 0.1249291 0.11862005 0.09374455]
The gradient is stored in the grad field of Variable. The .g accessor can be used to access the grad data in numpy.ndarray format.
Optimizing parameters (=Training)¶
To optimize parameters, we provide the solver module (aliased as S here). The solver module contains a number of optimizer implementations, such as SGD, SGD with momentum, and Adam. The block below creates an SGD solver and registers the parameters of the logistic regression with it.
# Create a solver (gradient-based optimizer)
learning_rate = 1e-3
solver = S.Sgd(learning_rate)
solver.set_parameters(nn.get_parameters())  # Set parameter variables to be updated.
In the next block, we demonstrate a single step of the optimization loop. The solver.zero_grad() line is equivalent to calling .grad.zero() for all parameters, as shown above. After the backward computation, we apply weight decay and then gradient descent as implemented in the Sgd solver class:
\[\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta, X_{\mathrm{minibatch}})\]
where \(\eta\) denotes the learning rate.
# One step of training
x.d, t.d = data.next()
loss.forward()
solver.zero_grad() # Initialize gradients of all parameters to zero.
loss.backward()
solver.weight_decay(1e-5)  # Applying weight decay as a regularization
solver.update()
print(loss.d)
12.9438686371
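The arithmetic of one training step can be sketched in plain Python. This sketch assumes that weight decay adds the decay rate times each parameter to its gradient before the SGD update is applied; `sgd_step` and the toy loss are hypothetical and only illustrate the update rule, not NNabla's solver code.

```python
def sgd_step(params, grads, learning_rate, weight_decay=0.0):
    """Return parameters after one SGD update with optional weight decay."""
    return [p - learning_rate * (g + weight_decay * p)
            for p, g in zip(params, grads)]


# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, learning_rate=0.1)
print(w[0])
```

With a non-zero weight_decay, every update additionally shrinks the parameter toward zero, which is what makes it act as a regularizer.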
The next block iterates the optimization steps and shows the loss decreasing.
for i in range(1000):
    x.d, t.d = data.next()
    loss.forward()
    solver.zero_grad()  # Initialize gradients of all parameters to zero.
    loss.backward()
    solver.weight_decay(1e-5)  # Applying weight decay as a regularization
    solver.update()
    if i % 100 == 0:  # Print every 100 iterations
        print(i, loss.d)
0 12.6905069351
100 3.17041015625
200 1.60036706924
300 0.673069953918
400 0.951370298862
500 0.724424362183
600 0.361597299576
700 0.588107347488
800 0.28792989254
900 0.415006935596
Show prediction¶
The following code displays training results.
x.d, t.d = data.next() # Here we predict images from training set although it's useless.
y.forward() # You can execute a sub graph.
plt.imshow(tile_images(x.d), **imshow_opt)
print("prediction:")
print(y.d.argmax(axis=1).reshape(8, 8)) # Taking a class index based on prediction score.
prediction:
[[5 0 1 9 0 1 3 3]
[2 4 1 7 4 5 6 5]
[7 7 9 7 9 0 7 3]
[5 3 7 6 6 8 0 9]
[0 1 3 5 5 5 4 9]
[1 0 0 8 5 1 8 8]
[7 5 0 7 6 9 0 0]
[0 6 2 6 4 4 2 6]]
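Taking the argmax of the scores turns them into class predictions, which can then be compared with the labels. The following plain-Python sketch shows that step on made-up scores; `predict_class`, `accuracy`, and the toy values are illustrative names and data, not part of the tutorial code.

```python
def predict_class(scores):
    """Index of the highest score, i.e. the predicted class."""
    return max(range(len(scores)), key=lambda k: scores[k])


def accuracy(batch_scores, batch_labels):
    """Fraction of samples whose predicted class matches the label."""
    correct = sum(predict_class(s) == t
                  for s, t in zip(batch_scores, batch_labels))
    return correct / len(batch_labels)


# Three samples with three classes each; the last one is misclassified.
toy_scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
toy_labels = [1, 0, 1]
toy_accuracy = accuracy(toy_scores, toy_labels)
print(toy_accuracy)
```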
Dynamic graph construction support¶
This is another way of running a computation graph in NNabla. This example doesn’t show how useful dynamic graphs are, but it gives a bit of the flavor.
The next block just defines computation graph building as functions for later use.
def logreg_forward(x):
    with nn.parameter_scope("affine1"):
        y = PF.affine(x, 10)
    return y
def logreg_loss(y, t):
    loss = F.mean(F.softmax_cross_entropy(y, t))  # Softmax cross entropy fits multi-class classification problems
    return loss
To run a computation graph dynamically during creation, use the nnabla.auto_forward() context, as in the block below. With it, computation is executed immediately when functions are called. (You can also use nnabla.set_auto_forward(auto) to set the auto-forward state globally.)
x = nn.Variable(img.shape)
t = nn.Variable(label.shape)
x.d, t.d = data.next()
with nn.auto_forward():  # The graph is executed as it is built
    y = logreg_forward(x)
    loss = logreg_loss(y, t)
print("Loss: {}".format(loss.d))
plt.imshow(tile_images(x.d), **imshow_opt)
print("prediction:")
print(y.d.argmax(axis=1).reshape(8, 8))
Loss: 0.43071603775
prediction:
[[9 3 5 0 1 9 9 2]
[5 6 6 2 7 5 1 1]
[3 7 7 6 0 8 3 8]
[0 6 4 6 0 6 9 9]
[6 1 2 5 8 3 2 4]
[1 4 4 0 5 7 1 7]
[7 8 9 5 8 3 7 8]
[5 7 5 3 3 0 0 7]]
Backward computation can be done on a dynamically constructed graph.
solver.zero_grad()
loss.backward()
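The difference between the two execution styles can be illustrated without NNabla. The sketch below is a conceptual toy, not NNabla's graph engine: `LazyNode` is a hypothetical class that defers computation until forward() is called, while ordinary function calls stand in for the dynamic style.

```python
class LazyNode:
    """A graph node whose value is computed only when forward() is called."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self.value = None  # Nothing is computed at construction time

    def forward(self):
        # Evaluate upstream nodes first, then this node.
        args = [i.forward() if isinstance(i, LazyNode) else i
                for i in self.inputs]
        self.value = self.func(*args)
        return self.value


def add(a, b):
    return a + b


def mul(a, b):
    return a * b


# Static style: build the whole graph first, execute later.
graph = LazyNode(add, LazyNode(mul, 2, 3), 4)
value_before_forward = graph.value
static_result = graph.forward()

# Dynamic style: every call computes its result immediately.
dynamic_result = add(mul(2, 3), 4)

print(value_before_forward, static_result, dynamic_result)
```

Both styles produce the same result; they differ only in when the computation happens.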
Multi-Layer Perceptron (MLP)¶
In this section, you will see an example of MLP graph building and training.
Before starting, we clear all parameters registered in the logistic regression example.
nn.clear_parameters()  # Clear all parameters
Here is a function that builds an MLP with an arbitrary depth and width for 10-class classification.
def mlp(x, hidden=[16, 32, 16]):
    hs = []
    with nn.parameter_scope("mlp"):  # Parameter scope can be nested
        h = x
        for hid, hsize in enumerate(hidden):
            with nn.parameter_scope("affine{}".format(hid + 1)):
                h = F.tanh(PF.affine(h, hsize))
                hs.append(h)
        with nn.parameter_scope("classifier"):
            y = PF.affine(h, 10)
    return y, hs
# Construct a MLP graph
y, hs = mlp(x)
print("Printing shapes")
print("x: {}".format(x.shape))
for i, h in enumerate(hs):
    print("h{}:".format(i + 1), h.shape)
print("y: {}".format(y.shape))
Printing shapes
x: (64, 1, 8, 8)
h1: (64, 16)
h2: (64, 32)
h3: (64, 16)
y: (64, 10)
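Nested parameter scopes determine both the names and the shapes of the registered parameters. The sketch below lists what one would expect for this MLP, assuming the naming pattern `scope/affine/W` and `scope/affine/b` seen in the get_parameters() output of the logistic regression example; the helper `mlp_parameter_shapes` is hypothetical and does not call NNabla.

```python
def mlp_parameter_shapes(n_inputs, hidden=(16, 32, 16), n_classes=10):
    """Map assumed parameter names to shapes for the MLP defined above."""
    shapes = {}
    fan_in = n_inputs
    for index, width in enumerate(hidden, start=1):
        shapes["mlp/affine{}/affine/W".format(index)] = (fan_in, width)
        shapes["mlp/affine{}/affine/b".format(index)] = (width,)
        fan_in = width  # The next layer consumes this layer's output
    shapes["mlp/classifier/affine/W"] = (fan_in, n_classes)
    shapes["mlp/classifier/affine/b"] = (n_classes,)
    return shapes


shapes = mlp_parameter_shapes(n_inputs=64)
n_parameters = sum(
    rows * (shape[1] if len(shape) > 1 else 1)
    for shape in shapes.values()
    for rows in [shape[0]]
)
for name, shape in shapes.items():
    print(name, shape)
print("Total parameters:", n_parameters)
```

Comparing such a list with the actual output of nn.get_parameters() is a quick way to confirm that scopes are nested as intended.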
# Training
loss = logreg_loss(y, t) # Reuse logreg loss function.
# Copied from the above logreg example.
def training(steps, learning_rate):
    solver = S.Sgd(learning_rate)
    solver.set_parameters(nn.get_parameters())  # Set parameter variables to be updated.
    for i in range(steps):
        x.d, t.d = data.next()
        loss.forward()
        solver.zero_grad()  # Initialize gradients of all parameters to zero.
        loss.backward()
        solver.weight_decay(1e-5)  # Applying weight decay as a regularization
        solver.update()
        if i % 100 == 0:  # Print every 100 iterations
            print(i, loss.d)
# Training
training(1000, 1e-2)
0 2.42193937302
100 1.83251476288
200 1.49943637848
300 1.30751883984
400 1.00974023342
500 0.904026031494
600 0.873289525509
700 0.725554704666
800 0.614291608334
900 0.555113613605
# Showing responses for each layer
num_plot = len(hs) + 2
gid = 1
def scale01(h):
    return (h - h.min()) / (h.max() - h.min())
def imshow(img, title):
    global gid
    plt.subplot(num_plot, 1, gid)
    gid += 1
    plt.title(title)
    plt.imshow(img, **imshow_opt)
    plt.axis('off')
plt.figure(figsize=(2, 5))
imshow(x.d[0, 0], 'x')
for hid, h in enumerate(hs):
    imshow(scale01(h.d[0]).reshape(-1, 8), 'h{}'.format(hid + 1))
imshow(scale01(y.d[0]).reshape(2, 5), 'y')
Convolutional Neural Network with CUDA acceleration¶
Here we demonstrate a CNN with CUDA GPU acceleration.
nn.clear_parameters()
def cnn(x):
    with nn.parameter_scope("cnn"):  # Parameter scope can be nested
        with nn.parameter_scope("conv1"):
            c1 = F.tanh(PF.batch_normalization(
                PF.convolution(x, 4, (3, 3), pad=(1, 1), stride=(2, 2))))
        with nn.parameter_scope("conv2"):
            c2 = F.tanh(PF.batch_normalization(
                PF.convolution(c1, 8, (3, 3), pad=(1, 1))))
            c2 = F.average_pooling(c2, (2, 2))
        with nn.parameter_scope("fc3"):
            fc3 = F.tanh(PF.affine(c2, 32))
        with nn.parameter_scope("classifier"):
            y = PF.affine(fc3, 10)
    return y, [c1, c2, fc3]
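The spatial sizes inside this CNN can be checked with the usual convolution arithmetic. The sketch below is plain Python with hypothetical helper names; it assumes the standard output-size formula for convolution and that the pooling stride equals the pooling kernel.

```python
def conv_output_size(size, kernel, pad=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_output_size(size, kernel):
    """Output size of pooling, assuming the stride equals the kernel size."""
    return size // kernel


conv1_size = conv_output_size(8, kernel=3, pad=1, stride=2)  # 8x8 input image
conv2_size = conv_output_size(conv1_size, kernel=3, pad=1)
pooled_size = pool_output_size(conv2_size, kernel=2)
fc3_inputs = 8 * pooled_size * pooled_size  # 8 output maps from conv2
print(conv1_size, conv2_size, pooled_size, fc3_inputs)
```

Under these assumptions the 8x8 input shrinks to 4x4 after conv1, stays 4x4 after conv2, and becomes 2x2 after pooling, so the fc3 layer receives 8 x 2 x 2 = 32 input features.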
To enable the CUDA extension in NNabla, you have to install the nnabla-ext-cuda package first. See the install guide.
After installing the CUDA extension, you can easily switch to running on CUDA by specifying a context before building a graph. We strongly recommend using a cuDNN context, which is fast. Although the context class can be instantiated with nn.Context(), specifying a context descriptor might be a bit complicated for users. Therefore, we recommend creating a context with the helper function get_extension_context() found in the nnabla.ext_utils module. NNabla officially supports cpu and cudnn as context specifiers passed as the first argument (extension name). NOTE: By setting the cudnn context as the global default context, functions and solvers created afterwards are instantiated in cuDNN (preferred) mode. You can also specify a context using with nn.context_scope(). See the API reference for details.
# Run on CUDA
from nnabla.ext_utils import get_extension_context
cuda_device_id = 0
ctx = get_extension_context('cudnn', device_id=cuda_device_id)
print("Context: {}".format(ctx))
nn.set_default_context(ctx) # Set CUDA as a default context.
y, hs = cnn(x)
loss = logreg_loss(y, t)
2017-06-26 23:09:54,555 [nnabla][INFO]: Initializing CUDA extension...
2017-06-26 23:09:54,731 [nnabla][INFO]: Initializing cuDNN extension...
Context: Context(backend='cpucuda', array_class='CudaCachedArray', device_id='0', compute_backend='defaultcudnn')
training(1000, 1e-1)
0 2.34862923622
100 1.00527024269
200 0.416576713324
300 0.240603536367
400 0.254562884569
500 0.206138283014
600 0.220851421356
700 0.161689639091
800 0.230873346329
900 0.121101222932
# Showing responses for each layer
num_plot = len(hs) + 2
gid = 1
plt.figure(figsize=(2, 8))
imshow(x.d[0, 0], 'x')
imshow(tile_images(hs[0].d[0][:, None]), 'conv1')
imshow(tile_images(hs[1].d[0][:, None]), 'conv2')
imshow(hs[2].d[0].reshape(-1, 8), 'fc3')
imshow(scale01(y.d[0]).reshape(2, 5), 'y')
nn.save_parameters writes the parameters registered in the parameter_scope system in HDF5 format. We use it in a later example.
path_cnn_params = "tmp.params.cnn.h5"
nn.save_parameters(path_cnn_params)
2017-06-26 23:09:56,132 [nnabla][INFO]: Parameter save (hdf5): tmp.params.cnn.h5
Recurrent Neural Network (Elman RNN)¶
This is an example of recurrent neural network training.
nn.clear_parameters()
def rnn(xs, h0, hidden=32):
    hs = []
    with nn.parameter_scope("rnn"):
        h = h0
        # Time step loop
        for x in xs:
            # Note: Parameter scopes are reused over time
            # which means parameters are shared over time.
            with nn.parameter_scope("x2h"):
                x2h = PF.affine(x, hidden, with_bias=False)
            with nn.parameter_scope("h2h"):
                h2h = PF.affine(h, hidden)
            h = F.tanh(x2h + h2h)
            hs.append(h)
        with nn.parameter_scope("classifier"):
            y = PF.affine(h, 10)
    return y, hs
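The recurrence built by the function above is h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), with the same weights reused at every time step. A plain-Python sketch of that recurrence, with hypothetical helper names and toy weights that are not taken from the tutorial, is:

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def elman_step(x, h_prev, w_xh, w_hh, bias):
    """One Elman RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    from_input = matvec(w_xh, x)
    from_state = matvec(w_hh, h_prev)
    return [math.tanh(a + b + c)
            for a, b, c in zip(from_input, from_state, bias)]


def run_rnn(xs, h0, w_xh, w_hh, bias):
    """Apply the same weights at every time step and return the last state."""
    h = h0
    for x in xs:
        h = elman_step(x, h, w_xh, w_hh, bias)
    return h


# Toy sizes: 2 inputs, 1 hidden unit, sequence length 2.
w_xh = [[1.0, 0.0]]
w_hh = [[0.5]]
bias = [0.0]
final_state = run_rnn([[0.5, 9.0], [0.0, 9.0]], [0.0], w_xh, w_hh, bias)
print(final_state)
```

As in the NNabla version, only the hidden-to-hidden transform carries a bias, and the final hidden state is what a classifier layer would consume.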
This example is not meaningful in itself and serves only as a demonstration. We split each image into a 2-by-2 grid and feed the four sub-images sequentially into the RNN.
def split_grid4(x):
    x0 = x[..., :4, :4]
    x1 = x[..., :4, 4:]
    x2 = x[..., 4:, :4]
    x3 = x[..., 4:, 4:]
    return x0, x1, x2, x3
hidden = 32
seq_img = split_grid4(img)
seq_x = [nn.Variable(subimg.shape) for subimg in seq_img]
h0 = nn.Variable((img.shape[0], hidden)) # Initial hidden state.
y, hs = rnn(seq_x, h0, hidden)
loss = logreg_loss(y, t)
# Copied from the above logreg example.
def training_rnn(steps, learning_rate):
    solver = S.Sgd(learning_rate)
    solver.set_parameters(nn.get_parameters())  # Set parameter variables to be updated.
    for i in range(steps):
        minibatch = data.next()
        img, t.d = minibatch
        seq_img = split_grid4(img)
        h0.d = 0  # Initialize as 0
        for x, subimg in zip(seq_x, seq_img):
            x.d = subimg
        loss.forward()
        solver.zero_grad()  # Initialize gradients of all parameters to zero.
        loss.backward()
        solver.weight_decay(1e-5)  # Applying weight decay as a regularization
        solver.update()
        if i % 100 == 0:  # Print every 100 iterations
            print(i, loss.d)
training_rnn(1000, 1e-1)
0 2.62527275085
100 0.780260562897
200 0.486522495747
300 0.289345681667
400 0.249717146158
500 0.538961410522
600 0.276877015829
700 0.159639537334
800 0.249660402536
900 0.0925596579909
# Showing responses for each layer
num_plot = len(hs) + 2
gid = 1
plt.figure(figsize=(2, 8))
imshow(x.d[0, 0], 'x')
for hid, h in enumerate(hs):
    imshow(scale01(h.d[0]).reshape(-1, 8), 'h{}'.format(hid + 1))
imshow(scale01(y.d[0]).reshape(2, 5), 'y')
Siamese Network¶
This example shows how to embed images from a categorical dataset into 2D space using deep learning. It also demonstrates how to reuse a pretrained network.
First, we load parameters learned in the CNN example.
nn.clear_parameters()
# Loading CNN pretrained parameters.
_ = nn.load_parameters(path_cnn_params)
2017-06-26 23:09:57,838 [nnabla][INFO]: Parameter load (<built-in function format>): tmp.params.cnn.h5
We define the embedding function. Note that the network structure and parameter hierarchy are identical to the previous CNN example. That enables you to reuse the saved parameters and fine-tune from them.
def cnn_embed(x, test=False):
    # Note: Identical configuration with the CNN example above.
    # Parameters pretrained in the above CNN example are used.
    with nn.parameter_scope("cnn"):
        with nn.parameter_scope("conv1"):
            c1 = F.tanh(PF.batch_normalization(PF.convolution(x, 4, (3, 3), pad=(1, 1), stride=(2, 2)), batch_stat=not test))
        with nn.parameter_scope("conv2"):
            c2 = F.tanh(PF.batch_normalization(PF.convolution(c1, 8, (3, 3), pad=(1, 1)), batch_stat=not test))
            c2 = F.average_pooling(c2, (2, 2))
        with nn.parameter_scope("fc3"):
            fc3 = PF.affine(c2, 32)
    # Additional affine for map into 2D.
    with nn.parameter_scope("embed2d"):
        embed = PF.affine(c2, 2)
    return embed, [c1, c2, fc3]
def siamese_loss(e0, e1, t, margin=1.0, eps=1e-4):
    dist = F.sum(F.squared_error(e0, e1), axis=1)  # Squared distance
    # Contrastive loss
    sim_cost = t * dist
    dissim_cost = (1 - t) * \
        (F.maximum_scalar(margin - (dist + eps) ** (0.5), 0) ** 2)
    return F.mean(sim_cost + dissim_cost)
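The contrastive loss above can be checked on single pairs with plain Python. The following sketch mirrors the same formula for one pair of embeddings; the function name `contrastive_loss` and the sample points are illustrative, and no NNabla calls are involved.

```python
import math


def contrastive_loss(e0, e1, t, margin=1.0, eps=1e-4):
    """Contrastive loss for one pair; t is 1 for same class, 0 otherwise."""
    squared = sum((a - b) ** 2 for a, b in zip(e0, e1))  # Squared distance
    sim_cost = t * squared
    # Dissimilar pairs are penalized only when closer than the margin.
    gap = max(margin - math.sqrt(squared + eps), 0.0)
    dissim_cost = (1 - t) * gap ** 2
    return sim_cost + dissim_cost


same_and_close = contrastive_loss([0.0, 0.0], [0.0, 0.0], t=1)
different_and_far = contrastive_loss([0.0, 0.0], [2.0, 0.0], t=0)
different_but_close = contrastive_loss([0.0, 0.0], [0.0, 0.0], t=0)
print(same_and_close, different_and_far, different_but_close)
```

Pairs from the same class are pulled together, pairs from different classes are pushed apart until they are at least `margin` apart, and `eps` keeps the square root away from zero.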
We build two CNN streams and compare them with the contrastive loss function defined above. Note that both CNNs have the same parameter hierarchy, which means their parameters are shared.
x0 = nn.Variable(img.shape)
x1 = nn.Variable(img.shape)
t = nn.Variable((img.shape[0],)) # Same class or not
e0, hs0 = cnn_embed(x0)
e1, hs1 = cnn_embed(x1) # NOTE: parameters are shared
loss = siamese_loss(e0, e1, t)
def training_siamese(steps):
    for i in range(steps):
        minibatchs = []
        for _ in range(2):
            minibatch = data.next()
            minibatchs.append((minibatch[0].copy(), minibatch[1].copy()))
        x0.d, label0 = minibatchs[0]
        x1.d, label1 = minibatchs[1]
        t.d = (label0 == label1).astype(int).flat  # np.int was removed from recent NumPy versions
        loss.forward()
        solver.zero_grad()  # Initialize gradients of all parameters to zero.
        loss.backward()
        solver.weight_decay(1e-5)  # Applying weight decay as a regularization
        solver.update()
        if i % 100 == 0:  # Print every 100 iterations
            print(i, loss.d)
learning_rate = 1e-2
solver = S.Sgd(learning_rate)
with nn.parameter_scope("embed2d"):
    # Only 2d embedding affine will be updated.
    solver.set_parameters(nn.get_parameters())
training_siamese(2000)
# Decay learning rate
solver.set_learning_rate(solver.learning_rate() * 0.1)
training_siamese(2000)
0 0.150528043509
100 0.186870157719
200 0.149316266179
300 0.207163512707
400 0.171384960413
500 0.190256178379
600 0.138507723808
700 0.0918073058128
800 0.159692272544
900 0.0833697617054
1000 0.0839115008712
1100 0.104669973254
1200 0.0776312947273
1300 0.114788673818
1400 0.120309025049
1500 0.107732802629
1600 0.070114441216
1700 0.101728007197
1800 0.114350572228
1900 0.118794307113
0 0.0669310241938
100 0.0553173273802
200 0.0829797014594
300 0.0951051414013
400 0.128303915262
500 0.102963000536
600 0.0910559669137
700 0.0898950695992
800 0.119949311018
900 0.0603067912161
1000 0.105748720467
1100 0.108760476112
1200 0.0820947736502
1300 0.0971114039421
1400 0.0836166366935
1500 0.0899554267526
1600 0.109069615602
1700 0.0921652168036
1800 0.0759357959032
1900 0.100669950247
We visualize the embedded training images as follows. You can see that images from the same class are embedded near each other.
all_image = digits.images[:512, None]
all_label = digits.target[:512]
x_all = nn.Variable(all_image.shape)
x_all.d = all_image
with nn.auto_forward():
    embed, _ = cnn_embed(x_all, test=True)
plt.figure(figsize=(16, 9))
for i in range(10):
    c = plt.cm.Set1(i / 10.)  # Maybe it doesn't work in an older version of Matplotlib where color map lies in [0, 256)
    plt.plot(embed.d[all_label == i, 0].flatten(), embed.d[
        all_label == i, 1].flatten(), '.', c=c)
plt.legend(list(map(str, range(10))))
plt.grid()
Appendix¶
A. Logistic Regression¶
Here we demonstrate how to train the simplest neural network, logistic regression (a single-layer perceptron). Logistic regression is a linear classifier \(f : {\cal R}^{D\times 1} \rightarrow {\cal R}^{K\times 1}\)
\[\mathbf f(\mathbf x, \mathbf \Theta) = \mathbf W \mathbf x + \mathbf b\]
where \(\mathbf x \in {\cal R}^{D \times 1}\) is an input image flattened to a vector, \(t \in \{0, 1, \cdots, K\}\) is a target label, \(\mathbf W \in {\cal R}^{K \times D}\) is a weight matrix, \(\mathbf b \in {\cal R}^{K \times 1}\) is a bias vector and \(\mathbf \Theta \equiv \left\{\mathbf W, \mathbf b\right\}\). The loss function is defined as
\[L(\mathbf \Theta, \mathbf X) = \frac{1}{N} \sum_{\mathbf x, t \subset \mathbf X} -\log \left(\left[\sigma\left(\mathbf f(\mathbf x, \mathbf \Theta)\right)\right]_{t}\right)\]
where \(\mathbf X \equiv \left\{\mathbf x_1, t_1, \cdots, \mathbf x_N, t_N\right\}\) denotes the dataset the network is trained on, \(\sigma(\mathbf z)\) is the softmax operation defined as \(\frac{\exp(\mathbf z)}{\sum_{z \subset \mathbf z} \exp(z)}\), and \(\left[\mathbf z\right]_i\) denotes the i-th element of \(\mathbf z\).
NNabla Python API Demonstration Tutorial¶
Let us import nnabla first, and some additional useful tools.
# python2/3 compatibility
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
import nnabla as nn # Abbreviate as nn for convenience.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
2017-09-27 14:00:30,785 [nnabla][INFO]: Initializing CPU extension...
NdArray¶
NdArray is a data container for a multi-dimensional array. NdArray is agnostic to device (e.g. CPU, CUDA) and type (e.g. uint8, float32): both are implicitly cast or transferred when the array is used. Below, you create an NdArray with a shape of (2, 3, 4).
a = nn.NdArray((2, 3, 4))
You can see the values held inside a as follows. The values are not initialized, and are created as float32 by default.
print(a.data)
[[[ 9.42546995e+24 4.56809286e41 8.47690058e38 0.00000000e+00]
[ 7.38056336e+34 7.50334969e+28 1.17078231e32 7.58387310e+31]
[ 7.87001454e12 9.84394250e12 6.85712044e+22 1.81785692e+31]]
[[ 1.84681296e+25 1.84933247e+20 4.85656319e+33 2.06176836e19]
[ 6.80020530e+22 1.69307638e+22 2.11235872e19 1.94316151e19]
[ 1.81805047e+31 3.01289097e+29 2.07004908e19 1.84648795e+25]]]
The accessor .data returns a reference to the values of the NdArray as a numpy.ndarray. You can modify them using the NumPy API as follows.
print('[Substituting random values]')
a.data = np.random.randn(*a.shape)
print(a.data)
print('[Slicing]')
a.data[0, :, ::2] = 0
print(a.data)
[Substituting random values]
[[[ 0.36133638 0.22121875 1.5912329 0.33490974]
[ 1.35962474 0.2165522 0.54483992 0.61813235]
[0.13718799 0.44104072 0.51307833 0.73900551]]
[[0.59464753 2.17738533 0.28626776 0.45654735]
[ 0.73566747 0.87292582 0.41605178 0.04792296]
[0.63856047 0.31966645 0.63974309 0.61385244]]]
[Slicing]
[[[ 0. 0.22121875 0. 0.33490974]
[ 0. 0.2165522 0. 0.61813235]
[ 0. 0.44104072 0. 0.73900551]]
[[0.59464753 2.17738533 0.28626776 0.45654735]
[ 0.73566747 0.87292582 0.41605178 0.04792296]
[0.63856047 0.31966645 0.63974309 0.61385244]]]
Note that the above operations are all done on the host device (CPU). NdArray provides more efficient functions, .zero and .fill, for filling all values with a constant. They are evaluated lazily, when the data is requested (when neural network computation requests the data, or when a NumPy array is requested by Python). The filling operation is executed on a specific device (e.g. CUDA GPU), and is more efficient if you specify the device setting, which we explain later.
a.fill(1) # Filling all values with one.
print(a.data)
[[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]]
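Lazy evaluation of a fill can be pictured with a small toy class. This is only a conceptual sketch in plain Python, not how NdArray is implemented: `LazyFilledArray` and its `materialize_count` counter are hypothetical names introduced to show that the work is deferred until the data is read.

```python
class LazyFilledArray:
    """Toy array that records fill requests and applies them on first read."""

    def __init__(self, size):
        self.size = size
        self._values = [0.0] * size
        self._pending_fill = None
        self.materialize_count = 0  # How many times a fill was really executed

    def fill(self, value):
        self._pending_fill = value  # Only record the request; no work yet

    def zero(self):
        self.fill(0.0)

    @property
    def data(self):
        # The pending fill is executed only when the values are requested.
        if self._pending_fill is not None:
            self._values = [self._pending_fill] * self.size
            self._pending_fill = None
            self.materialize_count += 1
        return self._values


lazy = LazyFilledArray(4)
lazy.zero()
lazy.fill(1.0)  # Overrides the earlier request before any work is done
fills_before_read = lazy.materialize_count
values = lazy.data
print(fills_before_read, values, lazy.materialize_count)
```

In this toy, two fill requests cost a single write because nothing happens until the values are read, which is the benefit that lazy filling aims for.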
You can create an NdArray instance directly from a NumPy array object.
b = nn.NdArray.from_numpy_array(np.ones(a.shape))
print(b.data)
[[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]]
NdArray is used in the Variable class, as well as in NNabla’s imperative computation of neural networks. We describe them in later sections.
Variable¶
The Variable class is used when you construct a neural network. A neural network can be described as a graph in which an edge represents a function (a.k.a. operator or layer) that defines the operation of a minimum unit of computation, and a node represents a variable that holds the input/output values of a function (the Function class is explained later). This graph is called a “Computation Graph”.
In NNabla, a Variable, a node of a computation graph, holds two NdArrays: one for storing the input or output values of a function during forward propagation (executing the computation graph in forward order), and another for storing the backward error signal (gradient) during backward propagation (executing the computation graph in backward order to propagate error signals down to the parameters (weights) of neural networks). In NNabla, the first is called data and the second is called grad.
The following line creates a Variable instance with a shape of (2, 3, 4). It has data and grad as NdArray. The flag need_grad is used to omit unnecessary gradient computation during backprop if set to False.
x = nn.Variable([2, 3, 4], need_grad=True)
print('x.data:', x.data)
print('x.grad:', x.grad)
x.data: <NdArray((2, 3, 4)) at 0x7f575caf4ea0>
x.grad: <NdArray((2, 3, 4)) at 0x7f575caf4ea0>
You can get the shape by:
x.shape
(2, 3, 4)
Since both data and grad are NdArray, you can get a reference to their values as a NumPy array with the NdArray .data accessor (e.g. x.data.data). As a shortcut, the values can also be referred to by the .d or .g property of the Variable, for data and grad respectively.
print('x.data')
print(x.d)
x.d = 1.2345 # To avoid NaN
assert np.all(x.d == x.data.data), 'd: {} != {}'.format(x.d, x.data.data)
print('x.grad')
print(x.g)
x.g = 1.2345 # To avoid NaN
assert np.all(x.g == x.grad.data), 'g: {} != {}'.format(x.g, x.grad.data)
# Zeroing grad values
x.grad.zero()
print('x.grad (after `.zero()`)')
print(x.g)
x.data
[[[ 9.42553452e+24 4.56809286e-41 8.32543479e-38 0.00000000e+00]
[ nan nan 0.00000000e+00 0.00000000e+00]
[ 3.70977305e+25 4.56809286e-41 3.78350585e-44 0.00000000e+00]]
[[ 5.68736600e-38 0.00000000e+00 1.86176378e-13 4.56809286e-41]
[ 4.74367616e+25 4.56809286e-41 5.43829710e+19 4.56809286e-41]
[ 0.00000000e+00 0.00000000e+00 2.93623372e-38 0.00000000e+00]]]
x.grad
[[[ 9.42576510e+24 4.56809286e-41 9.42576510e+24 4.56809286e-41]
[ 9.27127763e-38 0.00000000e+00 9.27127763e-38 0.00000000e+00]
[ 1.69275966e+22 4.80112800e+30 1.21230330e+25 7.22962302e+31]]
[[ 1.10471027e-32 4.63080422e+27 2.44632805e+20 2.87606258e+20]
[ 4.46263300e+30 4.62311881e+30 7.65000750e+28 3.01339003e+29]
[ 2.08627352e-10 1.03961868e+21 7.99576678e+20 1.74441223e+22]]]
x.grad (after `.zero()`)
[[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]]
Like NdArray, a Variable can also be created from NumPy array(s).
x2 = nn.Variable.from_numpy_array(np.ones((3,)), need_grad=True)
print(x2)
print(x2.d)
x3 = nn.Variable.from_numpy_array(np.ones((3,)), np.zeros((3,)), need_grad=True)
print(x3)
print(x3.d)
print(x3.g)
<Variable((3,), need_grad=True) at 0x7f572a5242c8>
[ 1. 1. 1.]
<Variable((3,), need_grad=True) at 0x7f572a5244a8>
[ 1. 1. 1.]
[ 0. 0. 0.]
Besides storing values of a computation graph, a Variable has the important role of pointing to its parent edge (function) so that the computation graph can be traced. Here x doesn’t have any connection. Therefore, the .parent property returns None.
print(x.parent)
None
Function¶
A function defines an operation block of a computation graph as we described above. The module nnabla.functions offers various functions (e.g. Convolution, Affine and ReLU). You can see the list of functions available in the API reference guide.
import nnabla.functions as F
As an example, here you will define a computation graph that computes the element-wise Sigmoid function outputs for the input variable and sums up all values into a scalar. (This is simple enough to explain how it behaves, but it is a meaningless example in the context of neural network training. We will show you a neural network example later.)
sigmoid_output = F.sigmoid(x)
sum_output = F.reduce_sum(sigmoid_output)
The function API in nnabla.functions takes one (or several) Variable(s) and arguments (if any), and returns one (or several) output Variable(s). The .parent of an output points to the function instance which created it. Note that no computation occurs at this time since we just define the graph. (This is the default behavior of the NNabla computation graph API. You can also fire actual computation during graph definition, which we call “Dynamic mode” (explained later).)
print("sigmoid_output.parent.name:", sigmoid_output.parent.name)
print("x:", x)
print("sigmoid_output.parent.inputs refers to x:", sigmoid_output.parent.inputs)
sigmoid_output.parent.name: Sigmoid
x: <Variable((2, 3, 4), need_grad=True) at 0x7f572a51a778>
sigmoid_output.parent.inputs refers to x: [<Variable((2, 3, 4), need_grad=True) at 0x7f572a273a48>]
print("sum_output.parent.name:", sum_output.parent.name)
print("sigmoid_output:", sigmoid_output)
print("sum_output.parent.inputs refers to sigmoid_output:", sum_output.parent.inputs)
sum_output.parent.name: ReduceSum
sigmoid_output: <Variable((2, 3, 4), need_grad=True) at 0x7f572a524638>
sum_output.parent.inputs refers to sigmoid_output: [<Variable((2, 3, 4), need_grad=True) at 0x7f572a273a48>]
Calling .forward() at a leaf Variable executes the forward pass computation in the computation graph.
sum_output.forward()
print("CG output:", sum_output.d)
print("Reference:", np.sum(1.0 / (1.0 + np.exp(-x.d))))
CG output: 18.59052085876465
Reference: 18.5905
The .backward() does the backward propagation through the graph. Here we initialize the grad values as zero before backprop since the NNabla backprop algorithm always accumulates the gradient in the root variables.
x.grad.zero()
sum_output.backward()
print("d sum_o / d sigmoid_o:")
print(sigmoid_output.g)
print("d sum_o / d x:")
print(x.g)
d sum_o / d sigmoid_o:
[[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]]
d sum_o / d x:
[[[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]]
[[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]]]
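The numbers above can be checked by hand. The following is a minimal sketch in plain Python (no NNabla required), assuming every element of x holds 1.2345 as set earlier; the helper name `sigmoid` is illustrative. It evaluates the sum of sigmoids over the 24 elements and the derivative s(x)(1 - s(x)) that backprop writes into each element of x.g.

```python
import math

def sigmoid(v):
    """Logistic sigmoid 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

x_value = 1.2345        # value assigned to every element of x.d earlier
n_elements = 2 * 3 * 4  # x has shape (2, 3, 4)

# Forward: sum of element-wise sigmoid outputs.
forward_sum = n_elements * sigmoid(x_value)

# Backward: d(sum)/dx for one element is sigmoid'(x) = s * (1 - s).
s = sigmoid(x_value)
grad_per_element = s * (1.0 - s)

print("forward sum:", forward_sum)
print("gradient per element:", grad_per_element)
```

Both values agree with the CG output and the d sum_o / d x entries printed above up to float32 precision.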
NNabla is developed with a main focus on neural network training and inference. Neural networks have parameters to be learned, associated with computation blocks such as Convolution and Affine (a.k.a. fully connected, dense etc.). In NNabla, the learnable parameters are also represented as Variable objects. Just like input variables, those parameter variables are used by passing them into Functions. For example, the Affine function takes input, weights and biases as inputs.
x = nn.Variable([5, 2]) # Input
w = nn.Variable([2, 3], need_grad=True) # Weights
b = nn.Variable([3], need_grad=True) # Biases
affine_out = F.affine(x, w, b) # Create a graph including only affine
The above example takes an input with B=5 (batchsize) and D=2 (dimensions) and maps it to D’=3 outputs, i.e. (B, D’) output.
You may also notice that here you set need_grad=True only for the parameter variables (w and b). The x is a non-parameter variable and the root of the computation graph. Therefore, it doesn’t require gradient computation. In this configuration, the gradient computation for x is not executed in the first affine, which omits the computation of unnecessary backpropagation.
The next block sets data and initializes grad, then applies forward and backward computation.
# Set random input and parameters
x.d = np.random.randn(*x.shape)
w.d = np.random.randn(*w.shape)
b.d = np.random.randn(*b.shape)
# Initialize grad
x.grad.zero() # Just for showing gradients are not computed when need_grad=False (default).
w.grad.zero()
b.grad.zero()
# Forward and backward
affine_out.forward()
affine_out.backward()
# Note: Calling backward at a non-scalar Variable propagates 1 as the error signal from all elements of the output.
You can see that affine_out holds an output of Affine.
print('F.affine')
print(affine_out.d)
print('Reference')
print(np.dot(x.d, w.d) + b.d)
F.affine
[[0.17701732 2.86095762 0.82298267]
[0.75544345 1.16702223 2.44841242]
[0.36278027 3.4771595 0.75681627]
[ 0.32743117 0.24258983 1.30944324]
[0.87201929 1.94556415 3.23357344]]
Reference
[[0.1770173 2.86095762 0.82298267]
[0.75544345 1.16702223 2.44841242]
[0.3627803 3.4771595 0.75681627]
[ 0.32743117 0.24258983 1.309443 ]
[0.87201929 1.94556415 3.23357344]]
The resulting gradients of weights and biases are as follows.
print("dw")
print(w.g)
print("db")
print(b.g)
dw
[[ 3.10820675 3.10820675 3.10820675]
[ 0.37446201 0.37446201 0.37446201]]
db
[ 5. 5. 5.]
The gradient of x is not changed because need_grad is set as False.
print(x.g)
[[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]]
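To see why db is all fives and why each row of dw repeats one value, the affine backward pass can be written out by hand. This is a sketch in plain Python under the assumption stated in the code comment above (backward at a non-scalar output seeds every output element with 1); the random inputs and all names are illustrative and are not the values used in the run above.

```python
import random

random.seed(0)
B, D, D_out = 5, 2, 3  # batch size, input dims, output dims, as in the example

# Illustrative random input of shape (B, D).
x = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(B)]

# Error signal seeded at the output: ones of shape (B, D_out).
g_out = [[1.0] * D_out for _ in range(B)]

# For y = x @ w + b: dw = x^T @ g_out and db = column sums of g_out.
dw = [[sum(x[n][i] * g_out[n][j] for n in range(B)) for j in range(D_out)]
      for i in range(D)]
db = [sum(g_out[n][j] for n in range(B)) for j in range(D_out)]

print("dw:", dw)
print("db:", db)  # [5.0, 5.0, 5.0]
```

Each entry of db is the batch size, and row i of dw is the sum of the i-th input column repeated D_out times, which matches the pattern in the printed gradients.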
Parametric Function¶
Considering parameters as inputs of Function enhances the expressiveness and flexibility of computation graphs. However, defining all parameters for each learnable function by hand is tedious when building a neural network. In NNabla, trainable models are usually created by composing functions that have optimizable parameters. These functions are called “Parametric Functions”. The Parametric Function API provides various parametric functions and an interface for composing trainable models.
To use parametric functions, import:
import nnabla.parametric_functions as PF
The function with optimizable parameter can be created as below.
with nn.parameter_scope("affine1"):
    c1 = PF.affine(x, 3)
The first line creates a parameter scope. The second line then applies PF.affine (an affine transform) to x, and creates a variable c1 holding that result. The parameters are created and initialized randomly at function call, and registered under the name “affine1” using the parameter_scope context. The function nnabla.get_parameters() returns the registered parameters.
nn.get_parameters()
OrderedDict([('affine1/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822f0e8>),
('affine1/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822f138>)])
The name= argument of any PF function creates the equivalent parameter space to the above definition of the PF.affine transformation, as below. It can save space in your Python code. The nnabla.parameter_scope context is more useful when you group multiple parametric functions such as Convolution-BatchNormalization found in a typical unit of CNNs.
c1 = PF.affine(x, 3, name='affine1')
nn.get_parameters()
OrderedDict([('affine1/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822f0e8>),
('affine1/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822f138>)])
It is worth noting that the shapes of both the outputs and the parameter variables (as you can see above) are automatically determined by only providing the output size of the affine transformation (in the example above the output size is 3). This makes it easy to create a graph.
c1.shape
(5, 3)
Parameter scope can be nested as follows (although a meaningless example).
with nn.parameter_scope('foo'):
    h = PF.affine(x, 3)
    with nn.parameter_scope('bar'):
        h = PF.affine(h, 4)
This creates the following.
nn.get_parameters()
OrderedDict([('affine1/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822f0e8>),
('affine1/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822f138>),
('foo/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822fa98>),
('foo/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822fae8>),
('foo/bar/affine/W',
<Variable((3, 4), need_grad=True) at 0x7f572822f728>),
('foo/bar/affine/b',
<Variable((4,), need_grad=True) at 0x7f572822fdb8>)])
Also, get_parameters() can be used in parameter_scope. For example:
with nn.parameter_scope("foo"):
    print(nn.get_parameters())
OrderedDict([('affine/W', <Variable((2, 3), need_grad=True) at 0x7f572822fa98>), ('affine/b', <Variable((3,), need_grad=True) at 0x7f572822fae8>), ('bar/affine/W', <Variable((3, 4), need_grad=True) at 0x7f572822f728>), ('bar/affine/b', <Variable((4,), need_grad=True) at 0x7f572822fdb8>)])
nnabla.clear_parameters() can be used to delete registered parameters under the scope.
with nn.parameter_scope("foo"):
    nn.clear_parameters()
print(nn.get_parameters())
OrderedDict([('affine1/affine/W', <Variable((2, 3), need_grad=True) at 0x7f572822f0e8>), ('affine1/affine/b', <Variable((3,), need_grad=True) at 0x7f572822f138>)])
MLP Example For Explanation¶
The following block creates a computation graph to predict a one-dimensional output from two-dimensional inputs with a 2-layer fully connected neural network (multi-layer perceptron).
nn.clear_parameters()
batchsize = 16
x = nn.Variable([batchsize, 2])
with nn.parameter_scope("fc1"):
    h = F.tanh(PF.affine(x, 512))
with nn.parameter_scope("fc2"):
    y = PF.affine(h, 1)
print("Shapes:", h.shape, y.shape)
Shapes: (16, 512) (16, 1)
This will create the following parameter variables.
nn.get_parameters()
OrderedDict([('fc1/affine/W',
<Variable((2, 512), need_grad=True) at 0x7f572822fef8>),
('fc1/affine/b',
<Variable((512,), need_grad=True) at 0x7f572822f9a8>),
('fc2/affine/W',
<Variable((512, 1), need_grad=True) at 0x7f572822f778>),
('fc2/affine/b',
<Variable((1,), need_grad=True) at 0x7f572822ff98>)])
As described above, you can execute the forward pass by calling forward method at the terminal variable.
x.d = np.random.randn(*x.shape) # Set random input
y.forward()
print(y.d)
[[0.05708594]
[ 0.01661986]
[0.34168088]
[ 0.05822293]
[0.16566885]
[0.04867431]
[ 0.2633169 ]
[ 0.10496549]
[0.01291842]
[0.09726256]
[0.05720493]
[0.09691752]
[0.07822668]
[0.17180404]
[ 0.11970415]
[0.08222144]]
Training a neural network needs a loss value to be minimized by gradient descent with backprop. In NNabla, a loss function is also just a function, and is packaged in the functions module.
# Variable for label
label = nn.Variable([batchsize, 1])
# Set loss
loss = F.reduce_mean(F.squared_error(y, label))
# Execute forward pass.
label.d = np.random.randn(*label.shape) # Randomly generate labels
loss.forward()
print(loss.d)
1.9382084608078003
As you’ve seen above, NNabla backward accumulates the gradients at the root variables. You have to initialize the grad of the parameter variables before backprop (we will show you the easiest way with the Solver API).
# Collect all parameter variables and init grad.
for name, param in nn.get_parameters().items():
    param.grad.zero()
# Gradients are accumulated to grad of params.
loss.backward()
Imperative Mode¶
After performing backprop, gradients are held in parameter variable grads. The next block will update the parameters with vanilla gradient descent.
for name, param in nn.get_parameters().items():
    param.data -= param.grad * 0.001 # 0.001 as learning rate
The above computation is an example of NNabla’s “Imperative Mode” for executing neural networks. Normally, NNabla functions (instances of nnabla.functions) take Variables as their input. When at least one NdArray is provided as an input for NNabla functions (instead of Variables), the function computation is fired immediately and returns an NdArray as the output, instead of returning a Variable. In the above example, the NNabla functions F.mul_scalar and F.sub2 are called by the overridden operators * and -=, respectively.
In other words, NNabla’s “Imperative mode” doesn’t create a computation graph, and can be used like NumPy. If device acceleration such as CUDA is enabled, it can be used like NumPy empowered with device acceleration. Parametric functions can also be used with NdArray input(s). The following block demonstrates a simple imperative execution example.
# A simple example of imperative mode.
xi = nn.NdArray.from_numpy_array(np.arange(4).reshape(2, 2))
yi = F.relu(xi - 1)
print(xi.data)
print(yi.data)
[[0 1]
[2 3]]
[[ 0. 0.]
[ 1. 2.]]
Note that in-place substitution from the rhs to the lhs cannot be done by the = operator. For example, when x is an NdArray, writing x = x + 1 will not increment all values of x. Instead, the expression on the rhs creates a new NdArray object that is different from the one originally bound by x, and binds the new NdArray object to the Python variable x on the lhs.
For in-place editing of NdArrays, the in-place assignment operators +=, -=, *=, and /= can be used. The copy_from method can also be used to copy values of an existing NdArray to another. For example, incrementing x, an NdArray, by 1 can be done by x.copy_from(x+1). The copy is performed with device acceleration if a device context is specified by using nnabla.set_default_context or nnabla.context_scope.
# The following doesn't perform substitution but assigns a new NdArray object to `xi`.
# xi = xi + 1
# The following copies the result of `xi + 1` to `xi`.
xi.copy_from(xi + 1)
assert np.all(xi.data == (np.arange(4).reshape(2, 2) + 1))
# Inplace operations like `+=`, `*=` can also be used (more efficient).
xi += 1
assert np.all(xi.data == (np.arange(4).reshape(2, 2) + 2))
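The difference between rebinding a name and editing an object in place is ordinary Python behaviour rather than something specific to NdArray. A minimal sketch with built-in lists follows; note that lists concatenate on + instead of adding element-wise, so only the aliasing behaviour carries over, and all names are illustrative.

```python
# Rebinding: `a + [2]` builds a new list, then the name `a` points to it.
a = [0, 1]
alias_a = a
a = a + [2]
print(alias_a)  # [0, 1]  (the original object is untouched)

# In-place: `+=` mutates the existing list, so every alias sees the change.
b = [0, 1]
alias_b = b
b += [2]
print(alias_b)  # [0, 1, 2]
```

This is why other references to the original NdArray, such as the one held by a Variable, would not observe x = x + 1 but do observe x += 1 or x.copy_from(...).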
Solver¶
NNabla provides stochastic gradient descent algorithms to optimize parameters, listed in the nnabla.solvers module. The parameter updates demonstrated above can be replaced with this Solver API, which is easier and usually faster.
from nnabla import solvers as S
solver = S.Sgd(lr=0.00001)
solver.set_parameters(nn.get_parameters())
# Set random data
x.d = np.random.randn(*x.shape)
label.d = np.random.randn(*label.shape)
# Forward
loss.forward()
Just call the following solver method to fill the grad region with zeros, then backprop.
solver.zero_grad()
loss.backward()
The following block updates parameters with the Vanilla Sgd rule (equivalent to the imperative example above).
solver.update()
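The zero_grad / backward / update cycle can be illustrated on a one-parameter problem. The following plain-Python sketch applies the vanilla gradient descent rule w ← w − lr · g to minimize (w − 3)²; the objective, the learning rate and the names are assumptions chosen for illustration, not part of the NNabla API.

```python
def grad_fn(w):
    """Derivative of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0   # parameter
lr = 0.1  # learning rate

for _ in range(100):
    g = 0.0          # like solver.zero_grad(): reset the accumulated gradient
    g += grad_fn(w)  # like loss.backward(): accumulate the gradient
    w -= lr * g      # like solver.update(): vanilla SGD step

print(w)
```

Skipping the reset would make g grow across iterations, which is the reason gradients are zeroed before each backward pass.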
Toy Problem To Demonstrate Training¶
The following function defines a regression problem which computes the norm of a vector.
def vector2length(x):
    # x : [B, 2] where B is number of samples.
    return np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
We visualize this mapping with the contour plot by matplotlib as follows.
# Data for plotting contour on a grid data.
xs = np.linspace(-1, 1, 100)
ys = np.linspace(-1, 1, 100)
grid = np.meshgrid(xs, ys)
X = grid[0].flatten()
Y = grid[1].flatten()
def plot_true():
    """Plotting contour of true mapping from a grid data created above."""
    plt.contourf(xs, ys, vector2length(np.hstack([X[:, None], Y[:, None]])).reshape(100, 100))
    plt.axis('equal')
    plt.colorbar()
plot_true()
We define a deep prediction neural network.
def length_mlp(x):
    h = x
    for i, hnum in enumerate([4, 8, 4, 2]):
        h = F.tanh(PF.affine(h, hnum, name="fc{}".format(i)))
    y = PF.affine(h, 1, name='fc')
    return y
nn.clear_parameters()
batchsize = 100
x = nn.Variable([batchsize, 2])
y = length_mlp(x)
label = nn.Variable([batchsize, 1])
loss = F.reduce_mean(F.squared_error(y, label))
We created a 5-layer deep MLP using a for-loop. Note that only 3 lines of code can potentially create arbitrarily deep neural networks. The next block adds helper functions to visualize the learned function.
def predict(inp):
    ret = []
    for i in range(0, inp.shape[0], x.shape[0]):
        xx = inp[i:i + x.shape[0]]
        # Imperative execution
        xi = nn.NdArray.from_numpy_array(xx)
        yi = length_mlp(xi)
        ret.append(yi.data.copy())
    return np.vstack(ret)
def plot_prediction():
    plt.contourf(xs, ys, predict(np.hstack([X[:, None], Y[:, None]])).reshape(100, 100))
    plt.colorbar()
    plt.axis('equal')
Next we instantiate a solver object as follows. We use the Adam optimizer, which is one of the most popular SGD algorithms used in the literature.
from nnabla import solvers as S
solver = S.Adam(alpha=0.01)
solver.set_parameters(nn.get_parameters())
The following function generates data from the true system infinitely.
def random_data_provider(n):
    x = np.random.uniform(-1, 1, size=(n, 2))
    y = vector2length(x)
    return x, y
In the next block, we run 2000 training steps (SGD updates).
num_iter = 2000
for i in range(num_iter):
    # Sample data and set them to input variables of training.
    xx, ll = random_data_provider(batchsize)
    x.d = xx
    label.d = ll
    # Forward propagation given inputs.
    loss.forward(clear_no_need_grad=True)
    # Parameter gradients initialization and gradients computation by backprop.
    solver.zero_grad()
    loss.backward(clear_buffer=True)
    # Apply weight decay and update by Adam rule.
    solver.weight_decay(1e-6)
    solver.update()
    # Just print progress.
    if i % 100 == 0 or i == num_iter - 1:
        print("Loss@{:4d}: {}".format(i, loss.d))
Loss@ 0: 0.6976373195648193
Loss@ 100: 0.08075223118066788
Loss@ 200: 0.005213144235312939
Loss@ 300: 0.001955194864422083
Loss@ 400: 0.0011660841992124915
Loss@ 500: 0.0006421314901672304
Loss@ 600: 0.0009330055327154696
Loss@ 700: 0.0008817618945613503
Loss@ 800: 0.0006205961108207703
Loss@ 900: 0.0009072928223758936
Loss@1000: 0.0008160348515957594
Loss@1100: 0.0011569359339773655
Loss@1200: 0.000837412488181144
Loss@1300: 0.0011542742140591145
Loss@1400: 0.0005833200993947685
Loss@1500: 0.0009848927147686481
Loss@1600: 0.0005141657311469316
Loss@1700: 0.0009339841199107468
Loss@1800: 0.000950580753851682
Loss@1900: 0.0005430278833955526
Loss@1999: 0.0007046313839964569
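The weight decay step in the loop above is commonly read as L2 regularization: a multiple of each weight is added to its gradient before the update. The plain-Python sketch below assumes that convention (g ← g + decay_rate · w) and uses a plain SGD step instead of Adam for brevity; it is not taken from NNabla's implementation, and the function names and values are hypothetical.

```python
def apply_weight_decay(grads, weights, decay_rate):
    """Assumed L2 convention: add decay_rate * w to each gradient."""
    return [g + decay_rate * w for g, w in zip(grads, weights)]

def sgd_update(weights, grads, lr):
    """Vanilla gradient descent step."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [2.0, -4.0]
grads = [0.0, 0.0]  # zero data gradient isolates the effect of the decay term

grads = apply_weight_decay(grads, weights, decay_rate=0.5)
weights = sgd_update(weights, grads, lr=0.1)

print(weights)  # both weights move toward zero
```

With no data gradient, the update only shrinks the weights, which is the regularizing effect weight decay is meant to provide.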
Memory usage optimization: You may notice that, in the above updates, .forward() is called with the clear_no_need_grad= option, and .backward() is called with the clear_buffer= option. Training of neural networks in more realistic scenarios usually consumes huge memory due to the nature of the backpropagation algorithm, in which all of the forward variable buffer data should be kept in order to compute the gradient of a function. In a naive implementation, we keep all the variable data and grad alive until the NdArray objects are no longer referenced (i.e. the graph is deleted). The clear_* options in .forward() and .backward() reduce this memory consumption by clearing (erasing) the memory of data and grad when it is not referenced by any subsequent computation. (More precisely, memory is not actually freed. We use our memory pool engine by default to avoid memory alloc/free overhead.) The unreferenced buffers can be reused in subsequent computation. See the document of Variable for more details. Note that the following loss.forward(clear_buffer=True) clears data of any intermediate variables. If you are interested in intermediate variables for some purposes (e.g. debug, log), you can use the .persistent flag to prevent clearing the buffer of a specific Variable like below.
loss.forward(clear_buffer=True)
print("The prediction `y` is cleared because it's an intermediate variable.")
print(y.d.flatten()[:4]) # to save space show only 4 values
y.persistent = True
loss.forward(clear_buffer=True)
print("The prediction `y` is kept by the persistent flag.")
print(y.d.flatten()[:4]) # to save space show only 4 value
The prediction `y` is cleared because it's an intermediate variable.
[ 2.27279830e-04 6.02164946e-05 5.33679675e-04 2.35557582e-05]
The prediction `y` is kept by the persistent flag.
[ 1.0851264 0.87657517 0.79603785 0.40098712]
We can confirm the prediction performs fairly well by looking at the following visualization of the ground truth and prediction function.
plt.subplot(121)
plt.title("Ground truth")
plot_true()
plt.subplot(122)
plt.title("Prediction")
plot_prediction()
You can save learned parameters by nnabla.save_parameters and load them by nnabla.load_parameters.
path_param = "paramvector2length.h5"
nn.save_parameters(path_param)
# Remove all once
nn.clear_parameters()
nn.get_parameters()
20170927 14:00:40,544 [nnabla][INFO]: Parameter save (.h5): paramvector2length.h5
OrderedDict()
# Load again
nn.load_parameters(path_param)
print('\n'.join(map(str, nn.get_parameters().items())))
20170927 14:00:40,564 [nnabla][INFO]: Parameter load (<builtin function format>): paramvector2length.h5
('fc0/affine/W', <Variable((2, 4), need_grad=True) at 0x7f576328df48>)
('fc0/affine/b', <Variable((4,), need_grad=True) at 0x7f57245f2868>)
('fc1/affine/W', <Variable((4, 8), need_grad=True) at 0x7f576328def8>)
('fc1/affine/b', <Variable((8,), need_grad=True) at 0x7f5727ee5c78>)
('fc2/affine/W', <Variable((8, 4), need_grad=True) at 0x7f5763297318>)
('fc2/affine/b', <Variable((4,), need_grad=True) at 0x7f5727d29908>)
('fc3/affine/W', <Variable((4, 2), need_grad=True) at 0x7f57632973b8>)
('fc3/affine/b', <Variable((2,), need_grad=True) at 0x7f57632974a8>)
('fc/affine/W', <Variable((2, 1), need_grad=True) at 0x7f57632974f8>)
('fc/affine/b', <Variable((1,), need_grad=True) at 0x7f5763297598>)
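The shapes in the listing above follow directly from the layer widths passed to length_mlp. The plain-Python sketch below (no NNabla needed; the widths and scope names are copied from the definition above, while `shapes` and `size` are illustrative helpers) reproduces them and counts the scalar parameters.

```python
widths = [2, 4, 8, 4, 2, 1]  # input dim, four hidden widths, output dim
names = ["fc0", "fc1", "fc2", "fc3", "fc"]

# Each affine layer maps d_in -> d_out with W of shape (d_in, d_out) and b of shape (d_out,).
shapes = {}
for name, d_in, d_out in zip(names, widths[:-1], widths[1:]):
    shapes[name + "/affine/W"] = (d_in, d_out)
    shapes[name + "/affine/b"] = (d_out,)

def size(shape):
    """Number of scalars in an array of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

total = sum(size(s) for s in shapes.values())

for key, shape in shapes.items():
    print(key, shape)
print("total parameters:", total)  # 101
```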
Both save and load functions can also be used in a parameter scope.
with nn.parameter_scope('foo'):
    nn.load_parameters(path_param)
print('\n'.join(map(str, nn.get_parameters().items())))
20170927 14:00:40,714 [nnabla][INFO]: Parameter load (<builtin function format>): paramvector2length.h5
('fc0/affine/W', <Variable((2, 4), need_grad=True) at 0x7f576328df48>)
('fc0/affine/b', <Variable((4,), need_grad=True) at 0x7f57245f2868>)
('fc1/affine/W', <Variable((4, 8), need_grad=True) at 0x7f576328def8>)
('fc1/affine/b', <Variable((8,), need_grad=True) at 0x7f5727ee5c78>)
('fc2/affine/W', <Variable((8, 4), need_grad=True) at 0x7f5763297318>)
('fc2/affine/b', <Variable((4,), need_grad=True) at 0x7f5727d29908>)
('fc3/affine/W', <Variable((4, 2), need_grad=True) at 0x7f57632973b8>)
('fc3/affine/b', <Variable((2,), need_grad=True) at 0x7f57632974a8>)
('fc/affine/W', <Variable((2, 1), need_grad=True) at 0x7f57632974f8>)
('fc/affine/b', <Variable((1,), need_grad=True) at 0x7f5763297598>)
('foo/fc0/affine/W', <Variable((2, 4), need_grad=True) at 0x7f5763297958>)
('foo/fc0/affine/b', <Variable((4,), need_grad=True) at 0x7f57632978b8>)
('foo/fc1/affine/W', <Variable((4, 8), need_grad=True) at 0x7f572a51ac78>)
('foo/fc1/affine/b', <Variable((8,), need_grad=True) at 0x7f5763297c78>)
('foo/fc2/affine/W', <Variable((8, 4), need_grad=True) at 0x7f5763297a98>)
('foo/fc2/affine/b', <Variable((4,), need_grad=True) at 0x7f5763297d68>)
('foo/fc3/affine/W', <Variable((4, 2), need_grad=True) at 0x7f5763297e08>)
('foo/fc3/affine/b', <Variable((2,), need_grad=True) at 0x7f5763297ea8>)
('foo/fc/affine/W', <Variable((2, 1), need_grad=True) at 0x7f5763297f48>)
('foo/fc/affine/b', <Variable((1,), need_grad=True) at 0x7f5763297cc8>)
!rm {path_param} # Clean ups
NNabla Models Finetuning Tutorial¶
Here we demonstrate how to perform finetuning using nnabla’s pretrained models.
Load the model¶
Loading the model is very simple. All you need is just 2 lines.
from nnabla.models.imagenet import ResNet18
model = ResNet18()
You can choose other ResNet models such as ResNet34 or ResNet50 by specifying the model’s name as an argument. Of course, you can choose other pretrained models as well. See the Docs.
NOTE: If you use ResNet18 for the first time, nnabla will automatically download the weights from https://nnabla.org and it may take up to a few minutes.
Dataset¶
In this tutorial, we use Caltech101 as the dataset for finetuning. Caltech101 consists of more than 9,000 object images in total and each image belongs to one of 101 distinct categories or “clutter” category. We use images from 101 categories for simple classification.
We have a script named caltech101_data.py which can automatically download the dataset and store it in nnabla_data.
If you have your own dataset and a DataIterator which can load your data, you can use it instead.
run caltech101_data.py
batch_size = 32 # we set batch_size = 32
all_data = data_iterator_caltech101(batch_size)
Since there is no separate data for training and validation in caltech101, we need to split it manually. Here, we split the dataset as follows: 80% for training, and 20% for validation.
num_samples = all_data.size
num_train_samples = int(0.8 * num_samples) # Take 80% for training, and the rest for validation.
num_class = 101
data_iterator_train = all_data.slice(
    rng=None, slice_start=0, slice_end=num_train_samples)
data_iterator_valid = all_data.slice(
    rng=None, slice_start=num_train_samples, slice_end=num_samples)
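The slice boundaries are simple index arithmetic. A small plain-Python sketch follows, assuming a hypothetical dataset of 1000 samples (the real size comes from all_data.size) and that slice_end is exclusive, as in Python slicing.

```python
num_samples = 1000  # hypothetical size, for illustration only
num_train_samples = int(0.8 * num_samples)

train_indices = range(0, num_train_samples)            # first 80%
valid_indices = range(num_train_samples, num_samples)  # remaining 20%

print(len(train_indices), len(valid_indices))  # 800 200
```

Because the two ranges share only the boundary value as end and start, no sample ends up in both subsets.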
Now we have model and data!
Optional: Check the image in the dataset¶
Let’s take a look at what kind of images are included in the dataset. You can get images by DataIterator’s method, next.
import matplotlib.pyplot as plt
%matplotlib inline
images, labels = data_iterator_train.next()
sample_image, sample_label = images[0], labels[0]
plt.imshow(sample_image.transpose(1,2,0))
plt.show()
print("image_shape: {}".format(sample_image.shape))
print("label_id: {}".format(sample_label))
image_shape: (3, 128, 128)
label_id: [94]
Preparing Graph Construction¶
Let’s start with importing basic modules.
import nnabla as nn
# Optional: If you want to use GPU
from nnabla.ext_utils import get_extension_context
ctx = get_extension_context("cudnn")
nn.set_default_context(ctx)
ext = nn.ext_utils.import_extension_module("cudnn")
Create input Variables for the Network¶
Now we are going to create the input variables.
channels, image_height, image_width = sample_image.shape # use info from the image we got
# input variables for the validation network
image_valid = nn.Variable((batch_size, channels, image_height, image_width))
label_valid = nn.Variable((batch_size, 1))
input_image_valid = {"image": image_valid, "label": label_valid}
# input variables for the training network
image_train = nn.Variable((batch_size, channels, image_height, image_width))
label_train = nn.Variable((batch_size, 1))
input_image_train = {"image": image_train, "label": label_train}
Create the training graph using the pretrained model¶
If you take a look at the Model’s API Reference, you can find the use_up_to option. By specifying one of the predefined strings when calling the model, the computation graph will be constructed up to the layer you specify. For example, in case of ResNet18, you can choose one of the following as the last layer of the graph.
‘classifier’ (default): The output of the final affine layer for classification.
‘pool’: The output of the final global average pooling.
‘lastconv’: The input of the final global average pooling without ReLU activation.
‘lastconv+relu’: Network up to ‘lastconv’ followed by ReLU activation.
For finetuning, it is common to replace only the upper layers with new (untrained) ones and reuse the lower layers with their pretrained weights. Also, pretrained models have been trained on a classification task on ImageNet, which has 1000 categories, so the output of the classifier layer has the shape (batch_size, 1000), which wouldn’t fit our current dataset. For this reason, here we construct the graph up to the pool layer, which corresponds to the global average pooling layer in the original graph, and connect it to an additional affine (fully-connected) layer for 101-way classification. For finetuning, it is common to train only the weights for the newly added layers (in this case, the last affine layer), but in this tutorial, we will update the weights for all layers in the graph. Also, when creating a training graph, you need to set training=True.
import nnabla.parametric_functions as PF
y_train = model(image_train, force_global_pooling=True, use_up_to="pool", training=True)
with nn.parameter_scope("finetuning_fc"):
    pred_train = PF.affine(y_train, 101) # adding the affine layer to the graph.
NOTE: You need to specify force_global_pooling=True when the input shape is different from what the model expects. You can check the model’s default input shape by typing model.input_shape.
Create the validation graph using the model¶
Creating the validation graph is almost the same. You simply need to change the training flag to False.
y_valid = model(image_valid,
                force_global_pooling=True, use_up_to="pool", training=False)
with nn.parameter_scope("finetuning_fc"):
    pred_valid = PF.affine(y_valid, 101)
pred_valid.persistent = True # to keep the value when get `forward(clear_buffer=True)`ed.
Define the functions for computing Loss and Categorical Error¶
import nnabla.functions as F
def loss_function(pred, label):
    """
    Compute loss.
    """
    loss = F.mean(F.softmax_cross_entropy(pred, label))
    return loss
loss_valid = loss_function(pred_valid, label_valid)
top_1_error_valid = F.mean(F.top_n_error(pred_valid, label_valid))
loss_train = loss_function(pred_train, label_train)
top_1_error_train = F.mean(F.top_n_error(pred_train, label_train))
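For readers who want to see what these two quantities measure, here is a plain-Python sketch of the standard definitions of softmax cross entropy and top-1 error on a tiny hand-made batch. It illustrates the math only and is not NNabla's implementation; the logits, labels and function names are made up for the example.

```python
import math

def softmax_cross_entropy(logits, label):
    """-log softmax(logits)[label], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def top_1_error(logits, label):
    """1.0 if the highest-scoring class differs from the label, else 0.0."""
    predicted = max(range(len(logits)), key=lambda k: logits[k])
    return 0.0 if predicted == label else 1.0

# Illustrative batch: 2 samples, 3 classes.
batch_logits = [[2.0, 0.5, -1.0],
                [0.1, 0.2, 3.0]]
batch_labels = [0, 1]

losses = [softmax_cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
errors = [top_1_error(l, y) for l, y in zip(batch_logits, batch_labels)]

mean_loss = sum(losses) / len(losses)
mean_error = sum(errors) / len(errors)

print("mean loss:", mean_loss)
print("mean top-1 error:", mean_error)  # 0.5
```

The first sample is classified correctly and the second is not, so the mean top-1 error is 0.5, while the loss also penalizes how confident the wrong prediction was.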
Prepare the solver¶
import nnabla.solvers as S
solver = S.Momentum(0.01) # you can choose others as well
solver.set_parameters(nn.get_parameters())
Some setting for iteration¶
num_epoch = 10 # arbitrary
one_epoch = data_iterator_train.size // batch_size
max_iter = num_epoch * one_epoch
val_iter = data_iterator_valid.size // batch_size
Performance before finetuning¶
Let’s see how well the model works. Note that all the weights are pretrained on ImageNet except for the last affine layer. First, prepare a function to show us the model’s performance.
def run_validation(pred_valid, loss_valid, top_1_error_valid,
                   input_image_valid, data_iterator_valid,
                   with_visualized=False, num_visualized=3):
    assert num_visualized < pred_valid.shape[0], "too many images to plot."
    val_iter = data_iterator_valid.size // pred_valid.shape[0]
    ve = 0.
    vloss = 0.
    for j in range(val_iter):
        v_image, v_label = data_iterator_valid.next()
        input_image_valid["image"].d = v_image
        input_image_valid["label"].d = v_label
        nn.forward_all([loss_valid, top_1_error_valid], clear_no_need_grad=True)
        vloss += loss_valid.d.copy()
        ve += top_1_error_valid.d.copy()
    vloss /= val_iter
    ve /= val_iter
    if with_visualized:
        ind = 1
        random_start = np.random.randint(pred_valid.shape[0] - num_visualized)
        fig = plt.figure(figsize=(12., 12.))
        for n in range(random_start, random_start + num_visualized):
            sample_image, sample_label = v_image[n], v_label[n]
            ax = fig.add_subplot(1, num_visualized, ind)
            ax.imshow(sample_image.transpose(1,2,0))
            with nn.auto_forward():
                predicted_id = np.argmax(F.softmax(pred_valid)[n].d)
            result = "true label_id: {} - predicted as {}".format(str(sample_label[0]), str(predicted_id))
            ax.set_title(result)
            ind += 1
        fig.show()
    return ve, vloss
_, _ = run_validation(pred_valid, loss_valid, top_1_error_valid, input_image_valid, data_iterator_valid, with_visualized=True)
As you can see, the model fails to classify images properly. Now, let’s begin the finetuning and see how performance improves.
Start Finetuning¶
Let’s prepare the monitor for training.
from nnabla.monitor import Monitor, MonitorSeries, MonitorTimeElapsed
monitor = Monitor("tmp.monitor")
monitor_loss = MonitorSeries("Training loss", monitor, interval=200)
monitor_err = MonitorSeries("Training error", monitor, interval=200)
monitor_vloss = MonitorSeries("Test loss", monitor, interval=200)
monitor_verr = MonitorSeries("Test error", monitor, interval=200)
# Training loop
for i in range(max_iter):
    image, label = data_iterator_train.next()
    input_image_train["image"].d = image
    input_image_train["label"].d = label
    nn.forward_all([loss_train, top_1_error_train], clear_no_need_grad=True)
    monitor_loss.add(i, loss_train.d.copy())
    monitor_err.add(i, top_1_error_train.d.copy())
    solver.zero_grad()
    loss_train.backward(clear_buffer=True)
    # update parameters
    solver.weight_decay(3e-4)
    solver.update()
    if i % 200 == 0:
        ve, vloss = run_validation(pred_valid, loss_valid, top_1_error_valid,
                                   input_image_valid, data_iterator_valid,
                                   with_visualized=False, num_visualized=3)
        monitor_vloss.add(i, vloss)
        monitor_verr.add(i, ve)
20190705 14:26:26,885 [nnabla][INFO]: iter=199 {Training loss}=1.5021580457687378
20190705 14:26:26,887 [nnabla][INFO]: iter=199 {Training error}=0.3345312476158142
20190705 14:26:28,756 [nnabla][INFO]: iter=200 {Test loss}=2.975713219355654
20190705 14:26:28,756 [nnabla][INFO]: iter=200 {Test error}=0.5384837962962963
20190705 14:26:50,249 [nnabla][INFO]: iter=399 {Training loss}=0.22022955119609833
20190705 14:26:50,250 [nnabla][INFO]: iter=399 {Training error}=0.053437501192092896
20190705 14:26:52,256 [nnabla][INFO]: iter=400 {Test loss}=0.12045302835327608
20190705 14:26:52,257 [nnabla][INFO]: iter=400 {Test error}=0.029513888888888888
20190705 14:27:14,151 [nnabla][INFO]: iter=599 {Training loss}=0.0659928247332573
20190705 14:27:14,152 [nnabla][INFO]: iter=599 {Training error}=0.012500000186264515
20190705 14:27:16,175 [nnabla][INFO]: iter=600 {Test loss}=0.08744175952893717
20190705 14:27:16,175 [nnabla][INFO]: iter=600 {Test error}=0.02199074074074074
20190705 14:27:38,097 [nnabla][INFO]: iter=799 {Training loss}=0.03324155509471893
20190705 14:27:38,098 [nnabla][INFO]: iter=799 {Training error}=0.0054687499068677425
20190705 14:27:40,120 [nnabla][INFO]: iter=800 {Test loss}=0.07678695395588875
20190705 14:27:40,121 [nnabla][INFO]: iter=800 {Test error}=0.02025462962962963
20190705 14:28:02,041 [nnabla][INFO]: iter=999 {Training loss}=0.019672293215990067
20190705 14:28:02,042 [nnabla][INFO]: iter=999 {Training error}=0.0017187499906867743
20190705 14:28:04,064 [nnabla][INFO]: iter=1000 {Test loss}=0.06333287184437116
20190705 14:28:04,065 [nnabla][INFO]: iter=1000 {Test error}=0.017361111111111112
20190705 14:28:25,984 [nnabla][INFO]: iter=1199 {Training loss}=0.009992362931370735
20190705 14:28:25,985 [nnabla][INFO]: iter=1199 {Training error}=0.0003124999930150807
20190705 14:28:28,008 [nnabla][INFO]: iter=1200 {Test loss}=0.06950318495984431
20190705 14:28:28,008 [nnabla][INFO]: iter=1200 {Test error}=0.015625
20190705 14:28:49,954 [nnabla][INFO]: iter=1399 {Training loss}=0.007941835559904575
20190705 14:28:49,955 [nnabla][INFO]: iter=1399 {Training error}=0.0003124999930150807
20190705 14:28:51,978 [nnabla][INFO]: iter=1400 {Test loss}=0.06711215277512868
20190705 14:28:51,979 [nnabla][INFO]: iter=1400 {Test error}=0.016203703703703703
20190705 14:29:13,898 [nnabla][INFO]: iter=1599 {Training loss}=0.008225565776228905
20190705 14:29:13,899 [nnabla][INFO]: iter=1599 {Training error}=0.0007812500116415322
20190705 14:29:15,923 [nnabla][INFO]: iter=1600 {Test loss}=0.06447940292181792
20190705 14:29:15,923 [nnabla][INFO]: iter=1600 {Test error}=0.016203703703703703
20190705 14:29:37,850 [nnabla][INFO]: iter=1799 {Training loss}=0.005678100511431694
20190705 14:29:37,850 [nnabla][INFO]: iter=1799 {Training error}=0.0
20190705 14:29:39,873 [nnabla][INFO]: iter=1800 {Test loss}=0.06282947226255028
20190705 14:29:39,873 [nnabla][INFO]: iter=1800 {Test error}=0.01678240740740741
20190705 14:30:01,795 [nnabla][INFO]: iter=1999 {Training loss}=0.006834140978753567
20190705 14:30:01,796 [nnabla][INFO]: iter=1999 {Training error}=0.00046874998952262104
20190705 14:30:03,818 [nnabla][INFO]: iter=2000 {Test loss}=0.05948294078310331
20190705 14:30:03,818 [nnabla][INFO]: iter=2000 {Test error}=0.014467592592592593
As you can see, the loss and the error rate decrease as the finetuning progresses. Let’s see the classification result after finetuning.
_, _ = run_validation(pred_valid, loss_valid, top_1_error_valid, input_image_valid, data_iterator_valid, with_visualized=True)
You can see that the model is now able to classify the images properly.
Finetuning more¶
We have a convenient script named finetuning.py. By using it, you can try finetuning with different models, even on your own dataset.
To do this, you need to prepare your dataset and do some preprocessing. We explain how in the following.
Prepare your dataset¶
Suppose you have a lot of images which can be used for image classification. You need to organize your data in a certain manner. Here, we will explain that with another dataset, the Stanford Dogs Dataset. First, visit the official page and download images.tar (here is the direct link). Next, untar the archive and you will see a directory named Images. Inside that directory, there are many subdirectories, and each subdirectory stores the images that belong to one category. For example, the directory n02099712-Labrador_retriever contains only Labrador retriever images. So if you want to use your own dataset, you need to organize your images and directories in the same way, as follows:
parent_directory
├── subdirectory_for_category_A
│ ├── image_0.jpg
│ ├── image_1.jpg
│ ├── image_2.jpg
│ ├── ...
│
├── subdirectory_for_category_B
│ ├── image_0.jpg
│ ├── ...
│
├── subdirectory_for_category_C
│ ├── image_0.jpg
│ ├── ...
│
├── subdirectory_for_category_D
│ ├── image_0.jpg
│ ├── ...
│
...
The number of images in each category can vary; the categories do not have to be the same size. Once you have arranged your dataset, you’re good to go!
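Before moving on, it can help to check that a directory really follows this layout. The following is a minimal sketch using only the standard library; count_images_per_category, the list of image suffixes, and the throwaway directory names are assumptions made for illustration and are not part of NNabla.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of image suffixes

def count_images_per_category(parent_directory):
    """Map each category subdirectory to the number of image files it holds."""
    counts = {}
    for entry in sorted(os.listdir(parent_directory)):
        category_dir = os.path.join(parent_directory, entry)
        if not os.path.isdir(category_dir):
            continue  # stray files next to the category directories are ignored
        counts[entry] = sum(
            1 for name in os.listdir(category_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts

# Build a tiny throwaway tree to demonstrate the expected layout.
with tempfile.TemporaryDirectory() as parent:
    for category, n_images in (("category_A", 3), ("category_B", 1)):
        os.makedirs(os.path.join(parent, category))
        for i in range(n_images):
            path = os.path.join(parent, category, "image_{}.jpg".format(i))
            open(path, "w").close()
    layout = count_images_per_category(parent)

print(layout)
```

A category that reports zero images, or a missing category, usually means the images were placed one directory level too deep or too shallow.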
Create image classification dataset using NNabla CLI¶
Now that you have prepared and organized your dataset, the only thing you have to do is create a .csv file which will be used in finetuning.py. To do so, you can use NNabla’s Python Command Line Interface.
Just type the following.
nnabla_cli create_image_classification_dataset -i <path to parent directory> -o <output directory which contains "preprocessed" images> -c <number of channels> -w <width> -g <height> -m <padding or trimming> -s <whether apply shuffle or not> -f1 <name of the output csv file for training data> -f2 <name of the output csv file for test data> -r2 <ratio(%) of test data to training data>
If you do that on the Stanford Dogs Dataset, the command looks like this:
nnabla_cli create_image_classification_dataset -i Images -o arranged_images -c 3 -w 128 -g 128 -m padding -s true -f1 stanford_dog_train.csv -f2 stanford_dog_test.csv -r2 20
Note that the output .csv files will be stored in the directory you specified with the -o option. For more information, please check the docs.
After executing the command above, you can start finetuning on your dataset.
Run finetuning¶
All you need to do is type one line.
python finetuning.py --model <model name> --train-csv <.csv file containing training data> --test-csv <.csv file containing test data>
It will execute finetuning on your dataset!
run finetuning.py --model ResNet34 --epoch 10 --train-csv ~/nnabla_data/stanford_dog_arranged/stanford_dog_train.csv --test-csv ~/nnabla_data/stanford_dog_arranged/stanford_dog_test.csv --shuffle True
An example of how to use finetuning’s result for inference¶
Once the finetuning has finished, let’s use the result for inference! The script above saves the parameters at the iteration interval you specified. So now build the same model you trained, and this time use the finetuned parameters in the following way.
from nnabla.models.imagenet import ResNet34
import nnabla as nn
import nnabla.parametric_functions as PF  # needed for PF.affine below
param_path = "params_XXX.h5" # specify the path to the saved parameter (.h5)
model = ResNet34()
batch_size = 1 # just for inference
input_shape = (batch_size, ) + model.input_shape
Then define an input Variable and a network for inference. Note that you need to construct the network in exactly the same way as in the finetuning script (layer configuration, parameter names, and so on).
x = nn.Variable(input_shape) # input Variable
pooled = model(x, use_up_to="pool", training=False)
with nn.parameter_scope("finetuning"):
    with nn.parameter_scope("last_fc"):
        pred = PF.affine(pooled, 120)
Load the parameters which you finetuned above. You can use nn.load_parameters() to load them. Once you call this, the parameters stored in the .h5 file are loaded into the global scope. You can check that the parameters differ before and after nn.load_parameters() by using nn.get_parameters().
nn.load_parameters(param_path) # load the finetuned parameters.
pred.forward()
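After the forward pass, the raw scores still have to be turned into a class id. The following pure-Python sketch shows one way to do that with a softmax followed by an arg-max, mirroring what run_validation does with F.softmax and np.argmax; the logits are hypothetical values standing in for one row of pred.d.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes (a real model here would output 120).
logits = [1.0, 3.0, 0.5]
probs = softmax(logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id, probs[predicted_id])
```

Since softmax is monotonic, the arg-max of the raw scores gives the same class id; the softmax is only needed when you also want a probability.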
Debugging¶
Deep neural networks are getting deeper every year and require more and more components. Such complexity often leads us to misconfigure a network in ways that can turn out to be critical. Even if we configure a neural network exactly as intended, we may still want to find its performance bottleneck, e.g., which layer(s) the computational bottleneck comes from.
In this debugging tutorial, we introduce the following ways to deal with such cases:
visit method of a variable
pretty-print
simple graph viewer
profiling utils
value tracer
We will go over each technique, but first prepare the following reference model.
# If you run this notebook on Google Colab, uncomment and run the following to set up dependencies.
# !pip install nnabla-ext-cuda100
# !git clone https://github.com/sony/nnabla.git
# %cd nnabla/tutorial
# Python2/3 compatibility
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
import numpy as np
import nnabla as nn
import nnabla.logger as logger
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla.solvers as S
def block(x, maps, test=False, name="block"):
    h = x
    with nn.parameter_scope(name):
        with nn.parameter_scope("inblock1"):
            h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
            h = F.relu(h)
        with nn.parameter_scope("inblock2"):
            h = PF.convolution(h, maps // 2, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
            h = F.relu(h)
        with nn.parameter_scope("inblock3"):
            h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
        s = x  # identity skip connection when the channel counts already match
        if h.shape[1] != x.shape[1]:
            with nn.parameter_scope("skip"):
                s = PF.convolution(x, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
                s = PF.batch_normalization(s, batch_stat=not test)
    return F.relu(h + s)

def network(x, maps=16, test=False):
    h = x
    h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), name="firstconv", with_bias=False)
    h = PF.batch_normalization(h, batch_stat=not test, name="firstbn")
    h = F.relu(h)
    for l in range(4):
        # forward the test flag so that the blocks use the same batch-stat mode
        h = block(h, maps * 2 ** (l + 1), test=test, name="block{}".format(l))
        h = F.max_pooling(h, (2, 2))
    h = F.average_pooling(h, h.shape[2:])
    pred = PF.affine(h, 100, name="pred")
    return pred
Visit Method¶
The visit method of a variable takes a lambda, a function, or a callable object as an argument and calls it on every NNabla function reachable from the variable, in forward order. It is easier to see the usage than to explain it.
First of all, define the callable class.
class PrintFunc(object):
    def __call__(self, nnabla_func):
        print("==========")
        print(nnabla_func.info.type_name)
        print(nnabla_func.inputs)
        print(nnabla_func.outputs)
        print(nnabla_func.info.args)
This callable object takes a NNabla function, e.g., convolution, relu, etc., so a user can get information of that function.
nn.clear_parameters() # this call is just in case to do the following code again
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
pred = network(x)
pred.visit(PrintFunc())
This is the low-level API for inspecting the graph information by hand, in whatever way you want.
PPrint¶
PPrint is one application of the visit method. It shows the graph structure in topological (forward) order in detail. Here is how to use it to see detailed information about a graph.
nn.clear_parameters() # call this in case you want to run the following code again
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
pred = network(x)
# pprint
from nnabla.utils.inspection import pprint
pprint(pred, summary=True, forward=True, backward=True)
Simple Graph Viewer¶
Visit method is very useful for getting information about each function used in a graph, but it is hard to see the details of the whole network structure, e.g., which variable is connected to which variable. So we have a graph viewer that visually shows the whole structure of network, enabling us to debug more efficiently. Using this graph viewer is straightforward, as shown in the following code:
nn.clear_parameters() # call this in case you want to run the following code again
x = nn.Variable([4, 3, 128, 128])
pred = network(x)
import nnabla.experimental.viewers as V
graph = V.SimpleGraph(verbose=False)
graph.view(pred)
If you would like to see more detailed information, as in the visit method case, change the verbose option to True.
graph = V.SimpleGraph(verbose=True)
graph.view(pred)
Now one can see detailed information!
Note that this viewer is mainly for NNabla users who want to write code in Python. If you would like a more polished network visualization to interact with, please use Neural Network Console; see https://dl.sony.com/.
Profiling Utils¶
This feature is mainly for developers who want to know the overall speed statistics and which functions could be bottlenecks. NNabla provides a simple profiling tool. Once a network is prepared, it is better to also have the other components needed to train it, such as a loss function and a solver.
To create the profiler and see the results, run the following code.
nn.clear_parameters() # call this in case you want to run the following code again
# Context
from nnabla.ext_utils import get_extension_context
device = "cudnn"
ctx = get_extension_context(device)
nn.set_default_context(ctx)
# Network
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
t = nn.Variable([4, 1])
pred = network(x)
loss = F.mean(F.softmax_cross_entropy(pred, t))
# Solver
solver = S.Momentum()
solver.set_parameters(nn.get_parameters())
# Profiler
from nnabla.utils.profiler import GraphProfiler
B = GraphProfiler(loss, solver=solver, device_id=0, ext_name=device, n_run=100)
B.run()
print("Profile finished.")
# Report
from nnabla.utils.profiler import GraphProfilerCsvWriter
with open("./profile.csv", "w") as f:
    writer = GraphProfilerCsvWriter(B, file=f)
    writer.write()
print("Report is prepared.")
You can also use TimeProfiler for profiling; it measures execution time at a finer granularity.
With TimeProfiler, you can attach callback functions to the forward and/or backward method in the training loop.
Value Tracer¶
We sometimes want to check whether NaN/Inf values exist. NanInfTracer is a convenient way to check whether any layer in a graph has a NaN/Inf value.
# Create graph again just in case
nn.clear_parameters() # call this in case you want to run the following code again
# Try to switch these two
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 64, 64]))
#x = nn.Variable([4, 3, 64, 64])
pred = network(x)
# NanInfTracer
from nnabla.utils.inspection import NanInfTracer
nit = NanInfTracer(trace_inf=True, trace_nan=True, need_details=True)
with nit.trace():
    # Try to comment either of these two or both
    pred.forward(function_post_hook=nit.forward_post_hook)
    pred.backward(function_post_hook=nit.backward_post_hook)
print(nit.check())
Static vs Dynamic Neural Networks in NNabla¶
NNabla allows you to define static and dynamic neural networks. Static neural networks have a fixed layer architecture, i.e., a static computation graph. In contrast, dynamic neural networks use a dynamic computation graph, e.g., randomly dropping layers for each minibatch.
This tutorial compares both computation graphs.
%matplotlib inline
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla.solvers as S
import numpy as np
np.random.seed(0)
GPU = 0 # ID of GPU that we will use
20170626 23:10:05,832 [nnabla][INFO]: Initializing CPU extension...
Dataset loading¶
We will first set up the digits dataset from scikit-learn:
from tiny_digits import *
digits = load_digits()
data = data_iterator_tiny_digits(digits, batch_size=16, shuffle=True)
20170626 23:10:06,042 [nnabla][INFO]: DataSource with shuffle(True)
20170626 23:10:06,043 [nnabla][INFO]: Using DataSourceWithMemoryCache
20170626 23:10:06,044 [nnabla][INFO]: DataSource with shuffle(True)
20170626 23:10:06,044 [nnabla][INFO]: Onmemory
20170626 23:10:06,045 [nnabla][INFO]: Using DataIterator
Each sample in this dataset is a grayscale image of size 8x8 and belongs to one of the ten classes 0, 1, …, 9.
img, label = data.next()
print(img.shape, label.shape)
(16, 1, 8, 8) (16, 1)
Network definition¶
As an example, we define a (unnecessarily) deep CNN:
def cnn(x):
    """Unnecessarily Deep CNN.

    Args:
        x : Variable, shape (B, 1, 8, 8)

    Returns:
        y : Variable, shape (B, 10)
    """
    with nn.parameter_scope("cnn"):  # Parameter scope can be nested
        with nn.parameter_scope("conv1"):
            h = F.tanh(PF.batch_normalization(
                PF.convolution(x, 64, (3, 3), pad=(1, 1))))
        for i in range(10):  # unnecessarily deep
            with nn.parameter_scope("conv{}".format(i + 2)):
                h = F.tanh(PF.batch_normalization(
                    PF.convolution(h, 128, (3, 3), pad=(1, 1))))
        with nn.parameter_scope("conv_last"):
            h = F.tanh(PF.batch_normalization(
                PF.convolution(h, 512, (3, 3), pad=(1, 1))))
            h = F.average_pooling(h, (2, 2))
        with nn.parameter_scope("fc"):
            h = F.tanh(PF.affine(h, 1024))
        with nn.parameter_scope("classifier"):
            y = PF.affine(h, 10)
    return y
Static computation graph¶
First, we will look at the case of a static computation graph where the neural network does not change during training.
from nnabla.ext_utils import get_extension_context
# setup cuda extension
ctx_cuda = get_extension_context('cudnn', device_id=GPU) # replace 'cudnn' by 'cpu' if you want to run the example on the CPU
nn.set_default_context(ctx_cuda)
# create variables for network input and label
x = nn.Variable(img.shape)
t = nn.Variable(label.shape)
# create network
static_y = cnn(x)
static_y.persistent = True
# define loss function for training
static_l = F.mean(F.softmax_cross_entropy(static_y, t))
20170626 23:10:06,350 [nnabla][INFO]: Initializing CUDA extension...
20170626 23:10:06,571 [nnabla][INFO]: Initializing cuDNN extension...
Set up the solver for training
solver = S.Adam(alpha=1e-3)
solver.set_parameters(nn.get_parameters())
Create data iterator
loss = []
def epoch_end_callback(epoch):
    global loss
    print("[{} {} {}]".format(epoch, np.mean(loss), itr))
    loss = []
data = data_iterator_tiny_digits(digits, batch_size=16, shuffle=True)
data.register_epoch_end_callback(epoch_end_callback)
20170626 23:10:07,221 [nnabla][INFO]: DataSource with shuffle(True)
20170626 23:10:07,224 [nnabla][INFO]: Using DataSourceWithMemoryCache
20170626 23:10:07,226 [nnabla][INFO]: DataSource with shuffle(True)
20170626 23:10:07,228 [nnabla][INFO]: Onmemory
20170626 23:10:07,230 [nnabla][INFO]: Using DataIterator
Perform training iterations and output training loss:
%%time
for epoch in range(30):
    itr = 0
    while data.epoch == epoch:
        x.d, t.d = data.next()
        static_l.forward(clear_no_need_grad=True)
        solver.zero_grad()
        static_l.backward(clear_buffer=True)
        solver.update()
        loss.append(static_l.d.copy())
        itr += 1
print()
[ 0 0.909297 112 ] [ 1 0.183863 111 ] [ 2 0.0723054 111 ] [ 3 0.0653021 112 ] [ 4 0.0628503 111 ] [ 5 0.0731626 111 ] [ 6 0.0319093 112 ] [ 7 0.0610926 111 ] [ 8 0.0817437 111 ] [ 9 0.0717577 112 ] [ 10 0.0241882 111 ] [ 11 0.0119452 111 ] [ 12 0.00664761 112 ] [ 13 0.00377711 111 ] [ 14 0.000605656 111 ] [ 15 0.000236613 111 ] [ 16 0.000174549 112 ] [ 17 0.000142428 111 ] [ 18 0.000126015 111 ] [ 19 0.000111144 112 ] [ 20 0.000100751 111 ] [ 21 9.03808e-05 111 ] [ 22 8.35904e-05 112 ] [ 23 7.73492e-05 111 ] [ 24 6.91389e-05 111 ] [ 25 6.74929e-05 112 ] [ 26 6.08386e-05 111 ] [ 27 5.62182e-05 111 ] [ 28 5.33428e-05 112 ] [ 29 4.94594e-05 111 ]
CPU times: user 14.3 s, sys: 6.78 s, total: 21.1 s
Wall time: 21.1 s
Dynamic computation graph¶
Now, we will use a dynamic computation graph, where the neural network is set up each time we want to do a forward/backward pass through it. This allows us to, e.g., randomly drop layers or to have network architectures that depend on the input data. In this example, for simplicity, we use the same neural network structure and only create it dynamically. For example, adding an if np.random.rand() > dropout_probability: into cnn() allows us to drop layers.
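To illustrate the idea of dropping layers at random, here is a toy sketch in plain Python. It does not use NNabla: make_layer and forward_with_layer_drop are hypothetical helpers, and each "layer" is just a function that doubles its input, so the only point is to show how the random test decides, per forward pass, which layers take part.

```python
import random

def make_layer(scale):
    """Toy stand-in for a network layer: multiplies its input by a constant."""
    return lambda h: h * scale

def forward_with_layer_drop(x, layers, dropout_probability, rng):
    """Apply each layer only when a random draw exceeds dropout_probability."""
    h = x
    used = []
    for index, layer in enumerate(layers):
        if rng.random() > dropout_probability:
            h = layer(h)
            used.append(index)
    return h, used

layers = [make_layer(2.0) for _ in range(10)]
rng = random.Random(0)  # fixed seed so that runs are repeatable
for step in range(3):
    y, used = forward_with_layer_drop(1.0, layers, dropout_probability=0.5, rng=rng)
    print(step, used, y)
```

Each call builds a different chain of layers, which is exactly the situation a dynamic computation graph handles and a static one cannot.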
First, we set up the solver and the data iterator for the training:
nn.clear_parameters()
solver = S.Adam(alpha=1e-3)
solver.set_parameters(nn.get_parameters())
loss = []
def epoch_end_callback(epoch):
    global loss
    print("[{} {} {}]".format(epoch, np.mean(loss), itr))
    loss = []
data = data_iterator_tiny_digits(digits, batch_size=16, shuffle=True)
data.register_epoch_end_callback(epoch_end_callback)
20170626 23:10:28,449 [nnabla][INFO]: DataSource with shuffle(True)
20170626 23:10:28,450 [nnabla][INFO]: Using DataSourceWithMemoryCache
20170626 23:10:28,450 [nnabla][INFO]: DataSource with shuffle(True)
20170626 23:10:28,451 [nnabla][INFO]: Onmemory
20170626 23:10:28,451 [nnabla][INFO]: Using DataIterator
%%time
for epoch in range(30):
    itr = 0
    while data.epoch == epoch:
        x.d, t.d = data.next()
        with nn.auto_forward():
            dynamic_y = cnn(x)
            dynamic_l = F.mean(F.softmax_cross_entropy(dynamic_y, t))
        solver.set_parameters(nn.get_parameters(), reset=False, retain_state=True)  # this can be done dynamically
        solver.zero_grad()
        dynamic_l.backward(clear_buffer=True)
        solver.update()
        loss.append(dynamic_l.d.copy())
        itr += 1
print()
[ 0 1.04669 112 ] [ 1 0.151949 111 ] [ 2 0.093581 111 ] [ 3 0.129242 112 ] [ 4 0.0452591 111 ] [ 5 0.0343987 111 ] [ 6 0.0315372 112 ] [ 7 0.0336886 111 ] [ 8 0.0194571 111 ] [ 9 0.00923094 112 ] [ 10 0.00536065 111 ] [ 11 0.000669383 111 ] [ 12 0.000294232 112 ] [ 13 0.000245866 111 ] [ 14 0.000201116 111 ] [ 15 0.000164177 111 ] [ 16 0.00014832 112 ] [ 17 0.000131479 111 ] [ 18 0.000115171 111 ] [ 19 0.000101432 112 ] [ 20 9.06228e-05 111 ] [ 21 8.7103e-05 111 ] [ 22 7.79601e-05 112 ] [ 23 7.59678e-05 111 ] [ 24 6.64341e-05 111 ] [ 25 6.22717e-05 112 ] [ 26 5.8643e-05 111 ] [ 27 5.35373e-05 111 ] [ 28 4.96717e-05 112 ] [ 29 4.65124e-05 111 ]
CPU times: user 23.4 s, sys: 5.35 s, total: 28.7 s
Wall time: 28.7 s
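Comparing the two processing times, we can observe that the two schemes (“static” and “dynamic”) take a comparable amount of time: in this run the dynamic scheme needed 28.7 s of wall time versus 21.1 s for the static one. In other words, creating the computation graph dynamically adds some overhead but does not cause a drastic loss of performance.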
Graph Converters¶
As neural networks become more complex and become one component of a larger system, we sometimes want to convert a network into another form. A typical use case is inference: we want to merge or change some layers in a network as a high-level optimization for inference speed. There are other use cases as well: adding new layers to keep track of some statistics, adding quantize/dequantize layers for quantized inference, decomposing a layer into a combination of low-rank ones, changing a network architecture for neural architecture search based on an original architecture, changing the tensor format from channel first to channel last and vice versa, and so on.
Let’s look at two simple cases:
1. batch normalization folding
2. channel last conversion
As a reference network, use the following.
# ResNet50 for inference
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
from nnabla.utils.inspection import pprint
from nnabla.models.imagenet import ResNet50
model = ResNet50()
batch_size = 1
x = nn.Variable((batch_size,) + model.input_shape)
y = model(x, training=False)
Batch Normalization Folding¶
See the resnet architecture.
pprint(y)
Now we can see the batch normalization layers. For inference, we do not need to compute batch normalization explicitly: its parameters can be folded into the preceding layer if there is, e.g., a convolution before the batch normalization.
To fold the batch normalization, use BatchNormalizationFoldingModifier as follows.
import nnabla.experimental.graph_converters as GC
modifiers = [GC.BatchNormalizationFoldingModifier()]
gc = GC.GraphConverter(modifiers)
yy = gc.convert(y)
Again, look at the converted ResNet architecture.
pprint(yy)
You can see that the converted network does not contain batch normalization any more!
In some cases we cannot fold the batch normalization into another layer, but it can still be self-folded, i.e., the four parameters (scale, bias, running mean, and running variance) can be reduced to two, a new scale and a new bias. To do this, use BatchNormalizationSelfFoldingModifier.
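The arithmetic behind both kinds of folding can be checked with scalars. The sketch below is plain Python and is not how the NNabla modifiers are implemented; the helper names, the parameter values, and the eps default are assumptions for illustration. It treats the layer before batch normalization as z = w * x + b and uses the inference-time formula y = gamma * (z - mean) / sqrt(var + eps) + beta.

```python
import math

def self_fold(gamma, beta, mean, var, eps=1e-5):
    """Collapse the four BN parameters into one scale and one bias."""
    scale = gamma / math.sqrt(var + eps)
    bias = beta - mean * scale
    return scale, bias

def fold_into_linear(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BN that follows z = w * x + b into new parameters (w', b')."""
    scale, bias = self_fold(gamma, beta, mean, var, eps)
    return w * scale, b * scale + bias

def batch_norm(z, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    return gamma * (z - mean) / math.sqrt(var + eps) + beta

# Hypothetical parameter values.
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
x = 2.0

reference = batch_norm(w * x + b, gamma, beta, mean, var)
w_folded, b_folded = fold_into_linear(w, b, gamma, beta, mean, var)
folded = w_folded * x + b_folded
print(reference, folded)
```

The two printed values agree up to floating-point rounding, which is why the folded network can drop the batch normalization layer at inference time.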
Channel Last Conversion¶
NVIDIA GPU architectures since Volta support Tensor Cores to accelerate computation. To get the most out of them, we need the channel-last tensor format, also known as NHWC. In NNabla, the default tensor format is channel first, also known as NCHW, so to utilize Tensor Cores we need to change the tensor format to NHWC.
ChannelLastModifier converts a network with the NCHW tensor format to another network with the NHWC tensor format.
import nnabla.experimental.graph_converters as GC
modifiers = [GC.ChannelLastModifier([x])]
gc = GC.GraphConverter(modifiers)
yy = gc.convert(y)
Let’s look at the converted ResNet architecture.
pprint(yy)
We can see that the channel dimension has moved to the last axis!
If we want to access the inputs whose tensor format was converted:
x_cl = modifiers[0].inputs_cl[0]
print(x_cl)
Note that ChannelLastModifier supports the following set of layers: Convolution, Deconvolution, BatchNormalization, MaxPooling, AveragePooling, SumPooling, Unpooling, and Concatenate, and it assumes that the input network is in NCHW format.
There is also ChannelFirstModifier for the opposite conversion.
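To see what the format change means for shapes and memory layout, here is a small plain-Python sketch. The helper functions and the example shape are hypothetical and only illustrate the NCHW and NHWC conventions for contiguous buffers; they are not part of the graph converter API.

```python
def nchw_to_nhwc_shape(shape):
    """Reorder a 4-D shape from (N, C, H, W) to (N, H, W, C)."""
    n, c, h, w = shape
    return (n, h, w, c)

def nchw_offset(n, c, h, w, shape):
    """Flat offset of element (n, c, h, w) in a contiguous NCHW buffer."""
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w

def nhwc_offset(n, c, h, w, shape):
    """Flat offset of the same element in a contiguous NHWC buffer."""
    _, C, H, W = shape
    return ((n * H + h) * W + w) * C + c

shape = (1, 3, 224, 224)  # hypothetical NCHW input shape
print(nchw_to_nhwc_shape(shape))
# Channel 1 of the top-left pixel: one full H*W plane away in NCHW,
# but directly next to channel 0 in NHWC.
print(nchw_offset(0, 1, 0, 0, shape), nhwc_offset(0, 1, 0, 0, shape))
```

In NHWC, all channels of one pixel sit next to each other in memory, which is the layout described above as the one Tensor Cores need.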
Mixed Precision Training¶
Introduction¶
Traditionally, we have used FP32 for weights and activations when training a neural network; however, the computation cost of training has increased rapidly over the years with the success of deep learning and the growing size of neural networks. This means that we need to spend much more time training a huge neural network, while we would like to run many trials before a product launch. To address this problem, companies (e.g., NVIDIA) have introduced accelerators to speed up computation. For example, NVIDIA Volta has Tensor Cores to speed up computation.
However, Tensor Cores use FP16 weights, activations, and gradients, and the range of FP16 is very limited compared to that of FP32, meaning that gradient values sometimes (or often) overflow and/or underflow, which degrades the performance of a neural network or makes it collapse during training.
Mixed precision training is one of the algorithms that circumvent this problem while maintaining the same results that we could obtain with FP32 networks. It is well described in The Training with Mixed Precision User Guide and Mixed Precision Training.
This tutorial explains how to do mixed precision training in NNabla step by step.
Step-by-Step Instruction¶
Basically, mixed precision training is composed of three parts.
Use the accelerator for computation (here we assume Tensor Cores)
Use loss scaling to prevent underflow
Use dynamic loss scaling to prevent overflow/underflow
In NNabla, these correspond to the following.
1. Use Tensor Cores¶
ctx = get_extension_context("cudnn", type_config="half")
2. Use loss scaling to prevent underflow¶
loss_scale = 8
loss.backward(loss_scale)
solver.scale_grad(1. / loss_scale) # do some gradient clipping, etc. after this
solver.update()
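Why scaling the loss helps can be seen with a single number. The sketch below uses the half-precision format of Python's struct module to round values to FP16; to_fp16 and the gradient value are hypothetical and chosen only to show a gradient that vanishes without scaling and survives with it.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

loss_scale = 8
tiny_grad = 2e-8  # hypothetical gradient, too small to be represented in FP16

unscaled = to_fp16(tiny_grad)             # flushed to zero
scaled = to_fp16(tiny_grad * loss_scale)  # survives as a small non-zero value
recovered = scaled / loss_scale           # unscaling done in higher precision
print(unscaled, scaled, recovered)
```

Because backpropagation is linear in the loss, multiplying the loss by loss_scale multiplies every gradient by the same factor, and dividing the gradients by it afterwards restores their magnitude.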
3. Use dynamic loss scaling to prevent overflow/underflow¶
loss_scale = 8
scaling_factor = 2
counter = 0
interval = 2000
...
loss.backward(loss_scale, ...)
...
if solver.check_inf_or_nan_grad():
    loss_scale /= scaling_factor
    counter = 0
else:
    solver.scale_grad(1. / loss_scale) # do some gradient clipping, etc. after this
    solver.update()
    if counter > interval:
        loss_scale *= scaling_factor
        counter = 0
    counter += 1
Note that the 2nd (Use loss scaling to prevent underflow) and 3rd (Use dynamic loss scaling to prevent overflow/underflow) procedures are currently experimental, and we are working on speeding up mixed precision training, so the API might change in the future, especially for the 3rd.
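The bookkeeping of the 3rd step can be followed in isolation with a small simulation. This is a plain-Python sketch of the update rule for loss_scale and counter only; update_loss_scale and the overflow pattern are hypothetical, and the interval is shortened to 3 so that the behavior is visible in a few iterations.

```python
def update_loss_scale(loss_scale, counter, has_inf_or_nan,
                      scaling_factor=2.0, interval=2000):
    """Return (new_loss_scale, new_counter, do_update) for one iteration."""
    if has_inf_or_nan:
        # Overflow detected: shrink the scale and skip the parameter update.
        return loss_scale / scaling_factor, 0, False
    if counter > interval:
        # A long run of clean iterations: try a larger scale.
        return loss_scale * scaling_factor, 1, True
    return loss_scale, counter + 1, True

loss_scale, counter = 8.0, 0
history = []
# Hypothetical run: the gradients overflow at iteration 2 only.
for itr in range(8):
    overflow = (itr == 2)
    loss_scale, counter, do_update = update_loss_scale(
        loss_scale, counter, overflow, interval=3)
    history.append(loss_scale)
print(history)
```

The scale is halved as soon as an overflow is seen and is doubled again only after more than interval clean iterations in a row.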
All-in-one Instruction¶
In the previous step-by-step example, the 3rd step is lengthy inside a training loop, so we can write a wrapper class like the following.
class DynamicLossScalingUpdater(object):
    '''Dynamic Loss Scaling Updater for the mixed precision training.

    Args:
        solver (:obj:`nnabla.solvers.Solver`): Solver object. E.g., Momentum or Adam.
        loss (:obj:`nnabla.Variable`): Loss variable from which the forward and the backward is called.
        data_feeder (callable :obj:`object`, function, or lambda): Data feeder
        scale (:obj:`float`): Loss scale constant. This is dynamically changing during training.
        scaling_factor (:obj:`float`): Scaling factor for the dynamic loss scaling.
        N (:obj:`int`): Interval, the number of iterations in training for increasing `loss scale` by `scaling_factor`.
        clear_buffer (:obj:`bool`): Clears the no longer referenced variables during backpropagation to save memory.
        accum_grad (:obj:`int`): Number of accumulation of gradients. Update method of the `solver` is called after the `accum_grad` number of the forward and backward is called.
        weight_decay (:obj:`float`): Decay constant. Default is `None`, not applying the weight decay.
        comm (:obj:`nnabla.communicators.Communicator`): Communicator when to do distributed training. Default is :obj:`None`.
        grads (:obj:`list` of :obj:`nnabla._nd_array.NdArray`): The list of gradients to be exchanged when to do distributed training. Default is the empty :obj:`list`.

    Attributes:
        solver (:obj:`nnabla.solvers.Solver`): Solver object. E.g., Momentum or Adam.
        loss (:obj:`nnabla.Variable`): Loss variable from which the forward and the backward is called.
        data_feeder (callable :obj:`object`, function, lambda): Data feeder
        scale (:obj:`float`): Loss scale constant. This is dynamically changing during training.
        scaling_factor (:obj:`float`): Scaling factor for the dynamic loss scaling.
        N (:obj:`int`): Interval, the number of iterations in training for increasing `loss scale` by `scaling_factor`.
        clear_buffer (:obj:`bool`): Clears the no longer referenced variables during backpropagation to save memory.
        accum_grad (:obj:`int`): Number of accumulation of gradients. Update method of the `solver` is called after the `accum_grad` number of the forward and backward is called.
        weight_decay (:obj:`float`): Decay constant. Default is `None`, not applying the weight decay.
        comm (:obj:`nnabla.communicators.Communicator`): Communicator when to do distributed training.
        grads (:obj:`list` of :obj:`nnabla._nd_array.NdArray`): The list of gradients to be exchanged when to do distributed training.

    Example:

        .. code-block:: python

            solver = <Solver>
            loss = <Loss Variable of Network>
            data_feeder = <DataFeeder>

            updater = DynamicLossScalingUpdater(solver, loss, data_feeder)

            # Training iteration
            for itr in range(max_iter):
                # Call solver.zero_grad, data_feeder, loss.forward, loss.backward
                # and solver.update with the dynamic loss scaling.
                updater.update()

    Reference:

        https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor

    '''

    def __init__(self, solver, loss, data_feeder=lambda x: x,
                 scale=8.0, scaling_factor=2.0, N=2000, clear_buffer=True,
                 accum_grad=1, weight_decay=None,
                 comm=None,
                 grads=[]):
        self.solver = solver
        self.loss = loss
        self.data_feeder = data_feeder
        self.scale = scale
        self.scaling_factor = scaling_factor
        self.N = N
        self.clear_buffer = clear_buffer
        self.accum_grad = accum_grad
        self.weight_decay = weight_decay
        self.comm = comm
        self.grads = grads
        self._counter = 0
        self._recursive_count = 0
        self._max_recursive_count = 100

    def update(self):
        """Monolithic update method.

        This method calls the following methods with the dynamic loss scaling.

        1. solver.zero_grad
        2. feed data
        3. loss.forward
        4. loss.backward
        5. comm.all_reduce (if it is specified)
        6. solver.update

        """

        # Initialize gradients.
        self.solver.zero_grad()

        # Forward and backward
        for _ in range(self.accum_grad):
            # feed data
            self.data_feeder()

            # forward
            self.loss.forward(clear_no_need_grad=self.clear_buffer)

            # backward with scale
            self.loss.backward(self.scale, clear_buffer=self.clear_buffer)

        # AllReduce
        if self.comm and len(self.grads) != 0:
            self.comm.all_reduce(self.grads, division=False, inplace=False)

        # Check Inf/NaN in grads
        if self.solver.check_inf_or_nan_grad():
            self.scale /= self.scaling_factor
            self._counter = 0

            # Recursively call the update function until there is no inf nor nan.
            self._recursive_count += 1
            if self._recursive_count > self._max_recursive_count:
                self._recursive_count = 0
                return  # skip
            return self.update()
        self._recursive_count = 0

        # Rescale grads
        self.solver.scale_grad(1. / self.scale)

        # Do some gradient clipping, etc.
        if self.weight_decay is not None:
            self.solver.weight_decay(self.weight_decay)

        # Update
        self.solver.update()
        if self._counter > self.N:
            self.scale *= self.scaling_factor
            self._counter = 0
        self._counter += 1
Then, call the update method in a training loop:
from nnabla.experimental.mixed_precision_training import DynamicLossScalingUpdater

solver = <Solver>
loss = <Loss Variable of Network>
data_feeder = <DataFeeder>
updater = DynamicLossScalingUpdater(solver, loss, data_feeder)

# Training iteration
for itr in range(max_iter):
    # Call solver.zero_grad, data_feeder, loss.forward, loss.backward
    # and solver.update with the dynamic loss scaling.
    updater.update()
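To see how the loss scale evolves over a run, the scale bookkeeping of the updater can be isolated in plain Python. This is only a sketch: `LossScaleSchedule` and its `step` method are hypothetical names introduced here, the overflow flags are made up, and no NNabla call is involved. Unlike the real `update()`, it does not retry the iteration after an overflow.

```python
class LossScaleSchedule:
    """Toy model of the scale bookkeeping done by the updater (hypothetical helper)."""

    def __init__(self, scale=8.0, scaling_factor=2.0, N=2000):
        self.scale = scale
        self.scaling_factor = scaling_factor
        self.N = N
        self._counter = 0

    def step(self, overflow):
        """Record one attempted update; return True if the weights would be updated."""
        if overflow:
            # Inf/NaN found: shrink the scale and restart the clean-step counter.
            self.scale /= self.scaling_factor
            self._counter = 0
            return False
        # Clean step: grow the scale once more than N clean steps have accumulated.
        if self._counter > self.N:
            self.scale *= self.scaling_factor
            self._counter = 0
        self._counter += 1
        return True


schedule = LossScaleSchedule(scale=8.0, scaling_factor=2.0, N=3)
history = []
for overflow in [False, True, False, False, False, False, False]:
    schedule.step(overflow)
    history.append(schedule.scale)
print(history)
```

With a small `N` the effect is easy to follow: the scale is halved on the overflowing step and doubled again only after more than `N` clean steps.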
Notice¶
In mixed-precision training, the following are premises:
The solver contains FP16 weights and an FP32 copy of the weights. Solvers in NNabla hold FP32 weights and weight gradients, and cast them to FP16 weights in the forward pass and to FP16 weight gradients in the backward pass if one sets type_config="half".
Reductions should be left in FP32, for example the statistics (mean and variance) computed by batch normalization, Mean, Sum, SoftMax, SoftMaxCrossEntropy, etc. (see The Training with Mixed Precision User Guide). In NNabla, these functions automatically fall back to FP32.
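The two premises above can be motivated with a few lines of standard-library Python. This is an illustrative sketch only: `to_half` is a hypothetical helper that rounds a float to IEEE 754 half precision via the `struct` module, and the numbers are chosen for the demonstration rather than taken from any real training run.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# A small weight update is lost when the weight itself is stored in FP16,
# which is why an FP32 copy of the weights is kept for the update.
weight, update = 1.0, 1e-4
print(to_half(weight + update) == weight)

# A tiny gradient underflows to zero in FP16, but survives once it is
# multiplied by the loss scale (8.0 is the default `scale` of the updater).
tiny_grad, scale = 1e-8, 8.0
print(to_half(tiny_grad), to_half(tiny_grad * scale) > 0.0)
```

The same rounding behavior is the reason loss scaling multiplies the loss before the backward pass and divides the gradients afterwards.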
Data Parallel Distributed Training¶
DataParallelCommunicator enables you to train your neural network using multiple devices. It is normally used for exchanging gradients in data parallel distributed training. Basically, there are two types of distributed training in the neural network literature: Data Parallel and Model Parallel. Here we focus only on the former, Data Parallel Training. Data Parallel Distributed Training is based on the very simple equation used for the optimization of a neural network called (Mini-Batch) Stochastic Gradient Descent.
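In the optimization process, the objective one tries to minimize is
\[ f(\mathbf{w}; X) = \frac{1}{B \times N} \sum_{i=1}^{B \times N} \ell(\mathbf{w}, \mathbf{x}_i), \]
where \(f\) is a neural network, \(B \times N\) is the batch size, \(\ell\) is a loss function for each data point \(\mathbf{x} \in X\), and \(\mathbf{w}\) is the trainable parameter of the neural network.
When taking the derivative of this objective, one gets
\[ \nabla_{\mathbf{w}} f(\mathbf{w}; X) = \frac{1}{B \times N} \sum_{i=1}^{B \times N} \nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{x}_i). \]
Since the derivative is linear, one can rewrite it as a sum of summations, each of which is the sum of derivatives over \(B\) data points:
\[ \nabla_{\mathbf{w}} f(\mathbf{w}; X) = \frac{1}{N} \left( \frac{1}{B} \sum_{i=1}^{B} \nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{x}_i) + \frac{1}{B} \sum_{i=B+1}^{B \times 2} \nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{x}_i) + \ldots + \frac{1}{B} \sum_{i=B \times (N-1)+1}^{B \times N} \nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{x}_i) \right). \]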
In data parallel distributed training, the following steps are performed according to the above equation:
each term, the sum of derivatives (gradients) divided by the batch size \(B\), is computed on a separate device (typically a GPU),
the sum is taken over devices,
the result is divided by the number of devices, \(N\).
That is the underlying foundation of Data Parallel Distributed Training.
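The three steps can be checked numerically without any GPU. The sketch below is an assumption-laden toy: it uses a scalar parameter, a made-up quadratic loss, and plain Python lists in place of devices, and only verifies that averaging the per-device mean gradients reproduces the full-batch mean gradient.

```python
import math

B, N = 4, 3  # per-device batch size and number of devices (illustrative values)
w = 0.5      # a single scalar parameter
data = [0.1 * i for i in range(B * N)]


def grad_loss(w, x):
    """Gradient of the toy loss l(w, x) = 0.5 * (w - x) ** 2 with respect to w."""
    return w - x


# Reference: gradient of the mean loss over the whole batch of B * N points.
full_batch_grad = sum(grad_loss(w, x) for x in data) / (B * N)

# Step 1: each "device" averages the gradients of its own B data points.
shards = [data[d * B:(d + 1) * B] for d in range(N)]
per_device = [sum(grad_loss(w, x) for x in shard) / B for shard in shards]

# Steps 2 and 3: sum over devices, then divide by the number of devices.
data_parallel_grad = sum(per_device) / N

print(math.isclose(full_batch_grad, data_parallel_grad))
```

The per-device values differ from each other, which mirrors the different gradients printed on each rank before `all_reduce` in the tutorial that follows.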
This tutorial shows the usage of Multi Process Data Parallel Communicator for data parallel distributed training with a very simple example.
NOTE¶
This tutorial depends on IPython Cluster. To run the following script excerpts on Jupyter Notebook, follow this to enable mpiexec/mpirun mode, then launch a corresponding IPython Cluster on the IPython Clusters tab.
Launch client¶
This code is only needed for this tutorial via Jupyter Notebook.
import ipyparallel as ipp
rc = ipp.Client(profile='mpi')
Prepare the dependencies¶
%%px
import os
import time
import nnabla as nn
import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context
import nnabla.functions as F
from nnabla.initializer import (
    calc_uniform_lim_glorot,
    UniformInitializer)
import nnabla.parametric_functions as PF
import nnabla.solvers as S
import numpy as np
Define the communicator for gradients exchange.¶
%%px
extension_module = "cudnn"
ctx = get_extension_context(extension_module)
comm = C.MultiProcessCommunicator(ctx)
comm.init()
n_devices = comm.size
mpi_rank = comm.rank
device_id = mpi_rank
ctx = get_extension_context(extension_module, device_id=device_id)
Check different ranks are assigned to different devices
%%px
print("n_devices={}".format(n_devices))
print("mpi_rank={}".format(mpi_rank))
[stdout:0]
n_devices=2
mpi_rank=1
[stdout:1]
n_devices=2
mpi_rank=0
Create data points and a very simple neural network¶
%%px
# Data points setting
n_class = 2
b, c, h, w = 4, 1, 32, 32
# Data points
x_data = np.random.rand(b, c, h, w)
y_data = np.random.choice(n_class, b).reshape((b, 1))
x = nn.Variable(x_data.shape)
y = nn.Variable(y_data.shape)
x.d = x_data
y.d = y_data
# Network setting
C = 1
kernel = (3, 3)
pad = (1, 1)
stride = (1, 1)
%%px
rng = np.random.RandomState(0)
w_init = UniformInitializer(
    calc_uniform_lim_glorot(C, C/2, kernel=(1, 1)),
    rng=rng)
%%px
# Network
with nn.context_scope(ctx):
    h = PF.convolution(x, C, kernel, pad, stride, w_init=w_init)
    pred = PF.affine(h, n_class, w_init=w_init)
    loss = F.mean(F.softmax_cross_entropy(pred, y))
An important point here is that w_init is passed to the parametric functions so that the network on each GPU starts from the same values of the trainable parameters in the optimization process.
Create a solver.¶
%%px
# Solver and add parameters
solver = S.Adam()
solver.set_parameters(nn.get_parameters())
Training¶
Recall the basic usage of the nnabla API for training a neural network:
loss.forward()
solver.zero_grad()
loss.backward()
solver.update()
When using C.MultiProcessCommunicator, these steps are performed on different GPUs, and the only difference from these steps is comm.all_reduce(). Thus, with C.MultiProcessCommunicator, the training steps are as follows:
loss.forward()
solver.zero_grad()
loss.backward()
comm.all_reduce([x.grad for x in nn.get_parameters().values()])
solver.update()
First, forward, zero_grad, and backward,
%%px
# Training steps
loss.forward()
solver.zero_grad()
loss.backward()
Check gradients of weights once,
%%px
for n, v in nn.get_parameters().items():
    print(n, v.g)
[stdout:0]
('conv/W', array([[[[ 5.0180483, 0.457942 , 2.8701296],
[ 2.0715926, 3.0698593, 1.6650047],
[2.5591214, 6.4248834, 9.881935 ]]]], dtype=float32))
('conv/b', array([8.658947], dtype=float32))
('affine/W', array([[0.93160367, 0.9316036 ],
[1.376812 , 1.376812 ],
[1.8957546 , 1.8957543 ],
...,
[0.33000934, 0.33000934],
[0.7211893 , 0.72118926],
[0.25237036, 0.25237036]], dtype=float32))
('affine/b', array([0.48865744, 0.48865741], dtype=float32))
[stdout:1]
('conv/W', array([[[[ 1.2505884 , 0.87151337, 8.685524 ],
[ 10.738419 , 14.676786 , 7.483423 ],
[ 5.612471 , 12.880402 , 19.141157 ]]]], dtype=float32))
('conv/b', array([13.196114], dtype=float32))
('affine/W', array([[1.6865108 , 1.6865108 ],
[0.938529 , 0.938529 ],
[1.028422 , 1.028422 ],
...,
[0.98217344, 0.98217344],
[0.97528917, 0.97528917],
[0.413546 , 0.413546 ]], dtype=float32))
('affine/b', array([0.7447065, 0.7447065], dtype=float32))
You can see different values on each device. Then call all_reduce,
%%px
comm.all_reduce([x.grad for x in nn.get_parameters().values()], division=True)
Commonly, all_reduce means only the sum; however, comm.all_reduce handles both cases: summation, and summation followed by division.
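The difference between the two modes can be illustrated with a single-process toy. The function `all_reduce_sketch` below is hypothetical and is not how `comm.all_reduce` is implemented; it only mimics the result each device would hold, using nested Python lists in place of per-device gradient arrays.

```python
def all_reduce_sketch(per_device_grads, division=False):
    """Return what every device would hold after the reduction.

    per_device_grads: one list of gradient values per device.
    division: if True, divide the element-wise sum by the number of devices.
    """
    n_devices = len(per_device_grads)
    summed = [sum(values) for values in zip(*per_device_grads)]
    if division:
        summed = [value / n_devices for value in summed]
    # After all-reduce, every device sees the same reduced values.
    return [list(summed) for _ in range(n_devices)]


grads = [[1.0, 2.0], [3.0, 6.0]]  # two devices, two gradient entries each
print(all_reduce_sketch(grads))                 # summation only
print(all_reduce_sketch(grads, division=True))  # summation followed by division
```

With `division=True` the result is the average over devices, which is the quantity the data parallel derivation calls for.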
Again, check gradients of weights,
%%px
for n, v in nn.get_parameters().items():
    print(n, v.g)
[stdout:0]
('conv/W', array([[[[ 1.8837299 , 0.20678568, 5.777827 ],
[ 6.4050055 , 8.8733225 , 2.9092093 ],
[ 1.5266749 , 3.2277591 , 14.511546 ]]]], dtype=float32))
('conv/b', array([21.85506], dtype=float32))
('affine/W', array([[2.6181145, 2.6181145],
[2.315341 , 2.315341 ],
[2.9241767, 2.9241762],
...,
[1.3121828, 1.3121828],
[1.6964785, 1.6964784],
[0.6659163, 0.6659163]], dtype=float32))
('affine/b', array([1.233364 , 1.2333639], dtype=float32))
[stdout:1]
('conv/W', array([[[[ 1.8837299 , 0.20678568, 5.777827 ],
[ 6.4050055 , 8.8733225 , 2.9092093 ],
[ 1.5266749 , 3.2277591 , 14.511546 ]]]], dtype=float32))
('conv/b', array([21.85506], dtype=float32))
('affine/W', array([[2.6181145, 2.6181145],
[2.315341 , 2.315341 ],
[2.9241767, 2.9241762],
...,
[1.3121828, 1.3121828],
[1.6964785, 1.6964784],
[0.6659163, 0.6659163]], dtype=float32))
('affine/b', array([1.233364 , 1.2333639], dtype=float32))
You can see the same values on all devices because of all_reduce.
Update weights,
%%px
solver.update()
This concludes the usage of C.MultiProcessCommunicator for Data Parallel Distributed Training.
Now that you understand how to use C.MultiProcessCommunicator, go to the cifar10 example,
multi_device_multi_process_classification.sh
multi_device_multi_process_classification.py
for more details.
Function list and converter¶
nnabla_cli is the command line interface of nnabla. With it, users can check the current NNabla support status and find out whether and how a nnabla model (e.g. *.nnp) can be converted to another model format (e.g. *.onnx).
The subcommand function_info provides a set of functions to output information about implemented functions. With this information, you can build a tailored nnabla-c-runtime library for your model, or skip functions that are unsupported for the target model.
Some simple use cases¶
Here are some simple use cases.
First, suppose you want to know how many (and which) functions nnabla currently supports:
$ nnabla_cli function_info
You get the following list:
2019-06-14 16:16:13,106 [nnabla][INFO]: Initializing CPU extension...
NNabla command line interface (Version:1.0.18.dev1, Build:190531084842)
LSTM
Sub2
Mul2
GreaterEqual
Sigmoid
NotEqual
Unpooling
Log
CategoricalCrossEntropy
...
That is the list of all functions nnabla currently supports. Only function names are shown, without further detail, so the list is only useful for looking up a function by name. For the details of each function, check the online documentation.
As you know, nnabla's model *.nnp can be converted to a compact version with the postfix .nnb, which can be run for inference by the nnabla-c-runtime library. We simply call this format NNB. To know how many functions are supported in this format, use this command:
$ nnabla_cli function_info -f NNB
As above, a function list is shown.
Can we simply list the functions used in a .nnp model? Yes, of course.
$ nnabla_cli function_info my_model.nnp
As above, the functions used in this model are listed.
Then we can find out whether our model can be converted to the nnabla-c-runtime model format or, formally speaking, the intersection of two function sets: the functions used in the .nnp and the functions nnabla-c-runtime supports.
$ nnabla_cli function_info my_model.nnp -f NNB
The output looks like:
2019-06-14 17:01:29,393 [nnabla][INFO]: Initializing CPU extension...
NNabla command line interface (Version:1.0.18.dev1, Build:190531084842)
Importing mnist_nnp/lenet_010000.nnp
Expanding runtime.
nnabla-c-runtime currently support the following functions in model:
Convolution
MulScalar
Affine
MaxPooling
ReLU
...
Unsupported functions are also listed if there are any in this model.
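Conceptually, the check above is a set intersection. The following sketch shows the idea in plain Python; `check_support` is a hypothetical helper, the function names are illustrative (`MyCustomOp` is made up), and the real command reads them from the .nnp file and the target format rather than from hard-coded lists.

```python
def check_support(model_functions, supported_functions):
    """Split the functions used by a model into supported and unsupported ones."""
    supported = set(supported_functions)
    used = list(dict.fromkeys(model_functions))  # de-duplicate, keep first-seen order
    ok = [name for name in used if name in supported]
    missing = [name for name in used if name not in supported]
    return ok, missing


# Hypothetical inputs standing in for the two function sets.
model = ["Convolution", "MaxPooling", "ReLU", "Convolution", "Affine", "MyCustomOp"]
runtime = ["Convolution", "MulScalar", "Affine", "MaxPooling", "ReLU"]
ok, missing = check_support(model, runtime)
print(ok)
print(missing)
```

An empty `missing` list corresponds to a model that can be converted without skipping any function.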
Tailored nnabla-c-runtime library¶
When implementing the nnabla-c-runtime library, we aim to implement as many functions as we can. From the customer's point of view, however, that is sometimes unnecessary. If a user only wants to use nnabla-c-runtime for a fixed set of models, nnabla-c-runtime should be tailored to exactly what those models require. How can this be done?
It can be done with the following steps:
generate the function list
configure your nnabla-c-runtime library
build the nnabla-c-runtime library
1. Generate function list¶
$ nnabla_cli function_info my_model.nnp -f NNB -o functions.txt
This is the same as above, except for the -o parameter, which specifies the file to write to. (The format differs from the one written to stdout; it is more compact.)
2. config your nnabla-c-runtime library¶
The user may manually modify functions.txt. This file is then used as the input to generate the nnabla-c-runtime library's config file:
$ nnabla_cli function_info -c functions.txt -o nnabla-c-runtime/build-tools/code-generator/functions.yaml
If the -c parameter is omitted, the full function set is used to generate this config file, and the library will contain all implemented functions. This is the default behavior.
3. build nnabla-c-runtime library¶
The build process is relatively straightforward:
#> nnabla-c-runtime>mkdir build
#> nnabla-c-runtime>cd build
#> nnabla-c-runtime>cmake ..
#> nnabla-c-runtime>make
The nnabla-c-runtime library libnnablart_functions.a will contain the functions you want.
Skip functions unsupported¶
When you convert *.nnp to *.onnx or *.nnb, some functions may not be supported in the target function list. For example, suppose you want to convert a network to nnabla-c-runtime. The network looks like:
Affine
Softmax
Tanh
Convolution
MaxPooling
ReLU
Suppose you do not want to use the nnabla-c-runtime library's Convolution, and want to split the network into two pieces at the point of Convolution. Two steps are needed to do so:
comment out the function in functions.txt
convert the network with the -c parameter
2. convert the network with -c parameter¶
$ nnabla_cli convert -c functions.txt a.nnp b.nnb
Thus, the network is split into pieces, and the output looks like the following:
...
LeNet_036_0_5.nnb:
  input:
  - name: Input
    shape: (1, 1, 28, 28)
  output:
  - name: Tanh_2
    shape: (1, 30, 4, 4)
LeNet_036_7_7.nnb:
  input:
  - name: Affine
    shape: (1, 150)
  output:
  - name: ReLU_2
    shape: (1, 150)
LeNet_036_9_9.nnb:
  input:
  - name: Affine_2
    shape: (1, 10)
  output:
  - name: Softmax
    shape: (1, 10)
The network is split at the Affine function. Since there are 2 Affine functions in the network, 3 sub-networks are generated.
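The index ranges in the generated file names (0_5, 7_7, 9_9) can be reproduced with a small splitting routine. This is a sketch under assumptions: `split_network` is a hypothetical helper, and the layer order is invented so that Affine sits at positions 6 and 8; it does not reflect how the converter itself is implemented.

```python
def split_network(functions, skipped):
    """Return (start, end) index ranges of the runs that contain no skipped function."""
    segments, start = [], None
    for index, name in enumerate(functions):
        if name in skipped:
            if start is not None:
                segments.append((start, index - 1))
                start = None
        elif start is None:
            start = index
    if start is not None:
        segments.append((start, len(functions) - 1))
    return segments


# Hypothetical layer order, chosen so that Affine sits at positions 6 and 8.
layers = ["Convolution", "MaxPooling", "Tanh", "Convolution", "MaxPooling",
          "Tanh", "Affine", "ReLU", "Affine", "Softmax"]
print(split_network(layers, skipped={"Affine"}))
```

Each returned range corresponds to one sub-network; the skipped functions themselves have to be executed outside the converted pieces.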
Converting to ONNX¶
The following commands do the same as above, but for *.onnx.
List all functions supported:
$ nnabla_cli function_info -f ONNX
List the intersection of the function set used in a model and the function set supported by ONNX:
$ nnabla_cli function_info LeNet_036.nnp -f ONNX
Split the network to skip some functions:
$ nnabla_cli convert -c functions.txt a.nnp a.onnx
Python Command Line Interface¶
NNabla has a command line interface utility that can train, run forward passes (inference), convert parameters and datasets, measure performance, convert file formats, and so on.
usage: nnabla_cli [-h] [-m]
                  {train,infer,forward,encode_param,decode_param,profile,conv_dataset,compare_with_cpu,create_image_classification_dataset,upload,create_tar,function_info,optimize,dump,nnb_template,convert,plot_series,plot_timer,draw_graph,version}
                  ...
Command line interface for NNabla(Version 1.0.11.dev1, Build 181226024531)
positional arguments:
  {train,infer,forward,encode_param,decode_param,profile,conv_dataset,compare_with_cpu,create_image_classification_dataset,upload,create_tar,function_info,optimize,dump,nnb_template,convert,plot_series,plot_timer,draw_graph,version}
    train               Training with NNP.
    infer               Do inference with NNP and binary data file input.
    forward             Do evaluation with NNP and test dataset.
    encode_param        Encode plain text to parameter format.
    decode_param        Decode parameter to plain text.
    profile             Profiling performance with NNP.
    conv_dataset        Convert CSV dataset to cache.
    compare_with_cpu    Compare performance between two nntxt.
    create_image_classification_dataset
                        Create dataset from image files.
    upload              Upload dataset to Neural Network Console.
    create_tar          Create tar file for Neural Network Console.
    function_info       Output function info.
    optimize            Optimize pb model.
    dump                Dump network with supported format.
    nnb_template        Generate NNB config file template.
    convert             File format converter.
    plot_series         Plot *.series.txt files.
    plot_timer          Plot *.timer.txt files.
    draw_graph          Draw a graph in a NNP or nntxt file with graphviz.
    version             Print version and build number.
optional arguments:
  -h, --help            show this help message and exit
  -m, --mpi             exec with mpi.
Work with NNP¶
Training¶
usage: nnabla_cli train [-h] -c CONFIG [-p PARAM] -o OUTDIR
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        path to nntxt
  -p PARAM, --param PARAM
                        path to parameter file
  -o OUTDIR, --outdir OUTDIR
                        output directory
Profile¶
usage: nnabla_cli profile [-h] -c CONFIG -o OUTDIR
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        path to nntxt
  -o OUTDIR, --outdir OUTDIR
                        output directory
Forward¶
usage: nnabla_cli forward [-h] -c CONFIG [-p PARAM] [-d DATASET] -o OUTDIR [-b BATCH_SIZE]
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        path to nntxt
  -p PARAM, --param PARAM
                        path to parameter file
  -d DATASET, --dataset DATASET
                        path to CSV dataset
  -o OUTDIR, --outdir OUTDIR
                        output directory
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size to use batch size in nnp file set -1.
Inference¶
usage: nnabla_cli infer [-h] -c CONFIG [-o OUTPUT] [-p PARAM] [-b BATCH_SIZE] inputs [inputs ...]
positional arguments:
  inputs
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        path to nntxt
  -o OUTPUT, --output OUTPUT
                        output file prefix
  -p PARAM, --param PARAM
                        path to parameter file
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size to use batch size in nnp file set -1.
Compare with CPU¶
usage: nnabla_cli compare_with_cpu [-h] -c CONFIG -c2 CONFIG2 -o OUTDIR
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        path to nntxt
  -c2 CONFIG2, --config2 CONFIG2
                        path to cpu nntxt
  -o OUTDIR, --outdir OUTDIR
                        output directory
Dataset manipulation¶
Encode parameter¶
usage: nnabla_cli encode_param [-h] -i INDIR [-p PARAM]
optional arguments:
  -h, --help            show this help message and exit
  -i INDIR, --indir INDIR
                        input directory
  -p PARAM, --param PARAM
                        path to parameter file
Decode parameter¶
usage: nnabla_cli decode_param [-h] [-p PARAM] -o OUTDIR
optional arguments:
  -h, --help            show this help message and exit
  -p PARAM, --param PARAM
                        path to parameter file
  -o OUTDIR, --outdir OUTDIR
                        output directory
Convert dataset¶
usage: nnabla_cli conv_dataset [-h] [-F] [-S] [-N] source destination
positional arguments:
  source
  destination
optional arguments:
  -h, --help            show this help message and exit
  -F, --force           force overwrite destination
  -S, --shuffle         shuffle data
  -N, --normalize       normalize data range
Create image classification dataset¶
usage: nnabla_cli create_image_classification_dataset [-h] -i SOURCEDIR -o OUTDIR -c CHANNEL -w WIDTH -g HEIGHT -m MODE -s SHUFFLE -f1 FILE1 [-r1 RATIO1] [-f2 FILE2]
                                                      [-r2 RATIO2]
optional arguments:
  -h, --help            show this help message and exit
  -i SOURCEDIR, --sourcedir SOURCEDIR
                        source directory with directories for each class
  -o OUTDIR, --outdir OUTDIR
                        output directory
  -c CHANNEL, --channel CHANNEL
                        number of output color channels
  -w WIDTH, --width WIDTH
                        width of output image
  -g HEIGHT, --height HEIGHT
                        height of output image
  -m MODE, --mode MODE  shaping mode (trimming or padding)
  -s SHUFFLE, --shuffle SHUFFLE
                        shuffle mode (true or false)
  -f1 FILE1, --file1 FILE1
                        output file name 1
  -r1 RATIO1, --ratio1 RATIO1
                        output file ratio(%) 1
  -f2 FILE2, --file2 FILE2
                        output file name 2
  -r2 RATIO2, --ratio2 RATIO2
                        output file ratio(%) 2
Upload dataset to Neural Network Console¶
usage: nnabla_cli upload [-h] [-e ENDPOINT] token filename
positional arguments:
  token                 token for upload
  filename              filename to upload
optional arguments:
  -h, --help            show this help message and exit
  -e ENDPOINT, --endpoint ENDPOINT
                        set endpoint uri
Create dataset archive for Neural Network Console¶
usage: nnabla_cli create_tar [-h] source destination
positional arguments:
  source                CSV dataset
  destination           TAR filename
optional arguments:
  -h, --help            show this help message and exit
File format converter¶
For detailed information please see File format converter.
Dump content of supported format¶
usage: nnabla_cli dump [-h] [-v] [-F] [-V] [--dump-limit DUMP_LIMIT]
                       [-n DUMP_VARIABLE_NAME] [-I IMPORT_FORMAT]
                       [-E NNP_IMPORT_EXECUTOR_INDEX]
                       [--nnp-exclude-preprocess] [--nnp-no-expand-network]
                       FILE [FILE ...]
positional arguments:
  FILE                  File or directory name(s) to convert.
optional arguments:
  -h, --help            show this help message and exit
  -v, --dump-verbose    [dump] verbose output.
  -F, --dump-functions  [dump] dump function list.
  -V, --dump-variables  [dump] dump variable list.
  --dump-limit DUMP_LIMIT
                        [dump] limit num of items.
  -n DUMP_VARIABLE_NAME, --dump-variable-name DUMP_VARIABLE_NAME
                        [dump] Specific variable name to display.
  -I IMPORT_FORMAT, --import-format IMPORT_FORMAT
                        [import] import format. (one of [NNP,ONNX])
  -E NNP_IMPORT_EXECUTOR_INDEX, --nnp-import-executor-index NNP_IMPORT_EXECUTOR_INDEX
                        [import][NNP] import only specified executor.
  --nnp-exclude-preprocess
                        [import][NNP] EXPERIMENTAL exclude preprocess
                        functions when import.
  --nnp-no-expand-network
                        [import][NNP] expand network with repeat or recurrent.
Generate NNB config file template¶
usage: nnabla_cli nnb_template [-h] [-I IMPORT_FORMAT]
                               [--nnp-no-expand-network] [-b BATCH_SIZE]
                               [-T DEFAULT_VARIABLE_TYPE]
                               FILE [FILE ...]
positional arguments:
  FILE                  File or directory name(s) to convert.
optional arguments:
  -h, --help            show this help message and exit
  -I IMPORT_FORMAT, --import-format IMPORT_FORMAT
                        [import] import format. (one of [NNP,ONNX])
  --nnp-no-expand-network
                        [import][NNP] expand network with repeat or recurrent.
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        [export] overwrite batch size.
  -T DEFAULT_VARIABLE_TYPE, --default-variable-type DEFAULT_VARIABLE_TYPE
                        Default type of variable
File format converter¶
usage: nnabla_cli convert [-h] [-I IMPORT_FORMAT] [--nnp-no-expand-network]
                          [-O EXPORT_FORMAT] [-f] [-b BATCH_SIZE]
                          [--nnp-parameter-h5] [--nnp-parameter-nntxt]
                          [--nnp-exclude-parameter] [-T DEFAULT_VARIABLE_TYPE]
                          [-s SETTINGS] [-c CONFIG] [-d DEFINE_VERSION] [--api API]
                          [--enable-optimize-pb] [--outputs OUTPUTS]
                          [--inputs INPUTS] FILE [FILE ...]
positional arguments:
  FILE                  File or directory name(s) to convert.
                        (When convert ckpt format of the tensorflow model,
                        If the version of the checkpoint is V1, need to enter the `.ckpt` file,
                        otherwise need to enter the `.meta` file.)
optional arguments:
  -h, --help            show this help message and exit
  -I IMPORT_FORMAT, --import-format IMPORT_FORMAT
                        [import] import format. (one of [NNP,ONNX,TF_CKPT_V1,TF_CKPT_V2,TF_PB,SAVED_MODEL,TFLITE])
  --nnp-no-expand-network
                        [import][NNP] expand network with repeat or recurrent.
  --outputs OUTPUTS
                        [import][tensorflow] The name(s) of the output nodes, comma separated.
                        Only needed when convert CKPT format.
  --inputs INPUTS
                        [import][tensorflow] The name(s) of the input nodes, comma separated.
                        Only needed when convert CKPT format.
  -O EXPORT_FORMAT, --export-format EXPORT_FORMAT
                        [export] export format. (one of [NNP,NNB,CSRC,ONNX,SAVED_MODEL,TFLITE,TF_PB],
                        the export file format is 'CSRC' or 'SAVED_MODEL' that
                        argument '--export-format' will have to be set!!!)
  -f, --force           [export] overwrite output file.
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        [export] overwrite batch size.
  --nnp-parameter-h5    [export][NNP] store parameter with h5 format
  --nnp-parameter-nntxt
                        [export][NNP] store parameter into nntxt
  --nnp-exclude-parameter
                        [export][NNP] output without parameter
  -T DEFAULT_VARIABLE_TYPE, --default-variable-type DEFAULT_VARIABLE_TYPE
                        Default type of variable
  -s SETTINGS, --settings SETTINGS
                        Settings in YAML format file.
  -c CONFIG, --config CONFIG
                        [export] config target function list.
  -d DEFINE_VERSION, --define_version
                        [export][ONNX] define onnx opset version. e.g. opset_6
                        [export][ONNX] define convert to onnx for SNPE. e.g. opset_snpe
                        [export][ONNX] define convert to onnx for TensorRT. e.g. opset_tensorrt
                        [export][NNB] define binary format version. e.g. nnb_3
  --api API             [export][NNB] Set API Level to convert to, default is highest API Level.
  --enable-optimize-pb  [export][tensorflow] enable optimization when export to pb.
  --channel_last        [export][TFLite] Specify the data_format of the NNP network,
                        data_format default is channel_first.
Optimize pb model¶
usage: nnabla_cli optimize [-h] input_pb_file output_pb_file
positional arguments:
  input_pb_file         Input pre-optimized pb model.
  output_pb_file        Output optimized pb model.
Plot Monitor class output files¶
Note:
Plotting subcommands require the matplotlib package.
By default, the following commands show a plot on your display using a matplotlib rendering backend that depends on your environment. If you want to save a plot as an image or as vector data, use the -o option to specify a file name where the plot is saved.
MonitorSeries¶
usage: nnabla_cli plot_series [-h] [-l LABEL] [-o OUTFILE] [-x XLABEL]
                              [-y YLABEL] [-t TITLE] [-T YLIM_MAX]
                              [-B YLIM_MIN] [-R XLIM_MAX] [-L XLIM_MIN]
                              infile [infile ...]
Plot *.series.txt files produced by nnabla.monitor.MonitorSeries class.
Example:
  nnabla_cli plot_series -x "Epochs" -y "Squared error loss" -T 10 -l "config A" -l "config B" result_a/Training-loss.series.txt result_b/Training-loss.series.txt
positional arguments:
  infile                Path to input file.
optional arguments:
  -h, --help            show this help message and exit
  -l LABEL, --label LABEL
                        Label of each plot.
  -o OUTFILE, --outfile OUTFILE
                        Path to output file.
  -x XLABEL, --xlabel XLABEL
                        X-axis label of plot.
  -y YLABEL, --ylabel YLABEL
                        Y-axis label of plot.
  -t TITLE, --title TITLE
                        Title of plot.
  -T YLIM_MAX, --ylim-max YLIM_MAX
                        Y-axis plot range max.
  -B YLIM_MIN, --ylim-min YLIM_MIN
                        Y-axis plot range min.
  -R XLIM_MAX, --xlim-max XLIM_MAX
                        X-axis plot range max.
  -L XLIM_MIN, --xlim-min XLIM_MIN
                        X-axis plot range min.
MonitorTimeElapsed¶
usage: nnabla_cli plot_timer [-h] [-l LABEL] [-o OUTFILE] [-x XLABEL]
                             [-y YLABEL] [-t TITLE] [-T YLIM_MAX]
                             [-B YLIM_MIN] [-R XLIM_MAX] [-L XLIM_MIN] [-e]
                             [-u TIME_UNIT]
                             infile [infile ...]
Plot *.timer.txt files produced by nnabla.MonitorTimeElapsed class.
Example:
  nnabla_cli plot_timer -x "Epochs" -l "config A" -l "config B" result_a/Epoch-time.timer.txt result_b/Epoch-time.timer.txt
positional arguments:
  infile                Path to input file.
optional arguments:
  -h, --help            show this help message and exit
  -l LABEL, --label LABEL
                        Label of each plot.
  -o OUTFILE, --outfile OUTFILE
                        Path to output file.
  -x XLABEL, --xlabel XLABEL
                        X-axis label of plot.
  -y YLABEL, --ylabel YLABEL
                        Y-axis label of plot.
  -t TITLE, --title TITLE
                        Title of plot.
  -T YLIM_MAX, --ylim-max YLIM_MAX
                        Y-axis plot range max.
  -B YLIM_MIN, --ylim-min YLIM_MIN
                        Y-axis plot range min.
  -R XLIM_MAX, --xlim-max XLIM_MAX
                        X-axis plot range max.
  -L XLIM_MIN, --xlim-min XLIM_MIN
                        X-axis plot range min.
  -e, --elapsed         Plot total elapsed time. By default, it plots elapsed time per iteration.
  -u TIME_UNIT, --time-unit TIME_UNIT
                        Time unit chosen from {s|m|h|d}.
Draw a graph from NNP or .nntxt files¶
Note:
This feature requires graphviz installed as a Python package. The graphviz Python package is an interface to the graphviz library, which is not installed by the pip command. You have to install the library separately, for example using apt on Ubuntu.
usage: nnabla_cli draw_graph [-h] [-o OUTPUT_DIR] [-n NETWORK] [-f FORMAT]
                             input
Draw a graph in a NNP or nntxt file with graphviz.
Example:
  nnabla_cli draw_graph -o output-folder path-to-nnp.nnp
positional arguments:
  input                 Path to input nnp or nntxt.
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory.
  -n NETWORK, --network NETWORK
                        Network names to be drawn.
  -f FORMAT, --format FORMAT
                        Graph saving format compatible with graphviz (`pdf`, `png`, ...).
Development¶
Generate function information¶
usage: nnabla_cli function_info [-h] [-o OUTFILE] [-f FUNC_SET] [-c CONFIG]
                                [-t TARGET] [-q --query] [--nnp-no-expand-network]
                                [--api API] [FILE] [FILE ...]
positional arguments:
  FILE                  Path to nnp file.
optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE, --output OUTFILE
                        output filename, *.txt or *.yaml, the default is stdout.
  -f FUNC_SET, --all_support FUNC_SET
                        select function set: NNB, ONNX, the default is nnabla.
  -c CONFIG, --config CONFIG
                        user config file for target constraint, *.txt file of the
                        function list or the "opset_" args.
  -t, --target
                        output target function list.
  -q, --query
                        query the detail of a function.
  --nnp-no-expand-network
                        [import][NNP] expand network with repeat or recurrent.
  --api API             List up api levels
Display version¶
usage: nnabla_cli version [-h]
optional arguments:
  -h, --help            show this help message and exit
Python API Examples¶
There are a bunch of examples provided in the NNabla examples repository. Please follow [this link](https://github.com/sony/nnabla-examples) to see them.
Python API Reference¶
Common¶
Config¶
Searches for config files and reads config information from them.
The config file search order is described in the following table. Each config value is overwritten by the configs that follow it.
| Type | Posix | Windows |
|---|---|---|
| System wide | /etc/nnabla.conf | c:\ProgramData\NNabla\nnabla.ini |
| User | ~/.nnabla | c:\Users\[USERNAME]\AppData\Roaming\NNabla\nnabla.ini |
| Default | (Same directory with 'config.py')/nnabla.conf | |
| Local | [CURRENT DIRECTORY]/nnabla.conf | |
You can get a config value as follows.
from nnabla.config import nnabla_config
value = nnabla_config.get(CATEGORY, VALUE_NAME)
CATEGORY and VALUE_NAME are not defined in config.py. You can add CATEGORY and VALUE_NAME as you like. See the official documentation for more information.
[CATEGORY]
VALUE_NAME = value
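The override order from the search table can be illustrated with the standard `configparser` module, which reads a list of files in order and lets later files override earlier ones. This is a sketch, not NNabla's own loader: `load_layered_config`, the temporary file names, and the `level` key are all assumptions made for the example.

```python
import configparser
import os
import tempfile


def load_layered_config(paths):
    """Read config files in order; values in later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # files that do not exist are silently skipped
    return parser


with tempfile.TemporaryDirectory() as tmp:
    system_wide = os.path.join(tmp, "system.conf")
    local = os.path.join(tmp, "local.conf")
    missing = os.path.join(tmp, "user.conf")  # never created on purpose
    with open(system_wide, "w") as f:
        f.write("[LOG]\nlog_file_name = /tmp/system.log\nlevel = INFO\n")
    with open(local, "w") as f:
        f.write("[LOG]\nlog_file_name = /tmp/local.log\n")
    config = load_layered_config([system_wide, missing, local])

print(config.get("LOG", "log_file_name"))
print(config.get("LOG", "level"))
```

The value set in the later file wins, while keys that only appear in an earlier file are still available.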
The default values are defined in the nnabla.conf file placed in the same directory as config.py.
Logger¶
Wrapper module for logging.
You can use the logger as follows:
from nnabla.logger import logger
logger.debug('Log message(DEBUG)')
logger.info('Log message(INFO)')
logger.warn('Log message(WARNING)')
logger.error('Log message(ERROR)')
logger.critical('Log message(CRITICAL)')
With the default settings, it should yield the following output:
$ python scripts/logger_test.py
[nnabla][ERROR]: logger_test.py : <module> : 5 : Log message(ERROR)
[nnabla][CRITICAL]: logger_test.py : <module> : 6 : Log message(CRITICAL)
If you want to output the log to a file, you must create an nnabla.conf file and put the following entry in it. See nnabla.config for more information about the config file.
[LOG]
log_file_name = /tmp/nbla.log
After this, you get the following output.
$ python scripts/logger_test.py
[nnabla][ERROR]: logger_test.py : <module> : 5 : Log message(ERROR)
[nnabla][CRITICAL]: logger_test.py : <module> : 6 : Log message(CRITICAL)
$ cat /tmp/nbla.log
2017-01-19 14:41:35,132 [nnabla][DEBUG]: scripts/logger_test.py : <module> : 3 : Log message(DEBUG)
2017-01-19 14:41:35,132 [nnabla][INFO]: scripts/logger_test.py : <module> : 4 : Log message(INFO)
2017-01-19 14:41:35,132 [nnabla][ERROR]: scripts/logger_test.py : <module> : 5 : Log message(ERROR)
2017-01-19 14:41:35,132 [nnabla][CRITICAL]: scripts/logger_test.py : <module> : 6 : Log message(CRITICAL)
 nnabla.logger.logger¶
alias of <Logger nnabla (INFO)>
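The outputs above show the console receiving only ERROR and CRITICAL messages while the file receives everything from DEBUG upward. One way to get that behaviour with Python's standard logging module is to give each handler its own level. The sketch below assumes that arrangement; the logger name nnabla_sketch and the in-memory streams are hypothetical stand-ins, and it is not NNabla's actual logger setup.

```python
import io
import logging

# Hypothetical logger; this is not the nnabla.logger.logger object.
sketch_logger = logging.getLogger("nnabla_sketch")
sketch_logger.setLevel(logging.DEBUG)
sketch_logger.propagate = False

console_stream = io.StringIO()  # stands in for the console
file_stream = io.StringIO()     # stands in for /tmp/nbla.log

# Console handler: only ERROR and above.
console = logging.StreamHandler(console_stream)
console.setLevel(logging.ERROR)
console.setFormatter(logging.Formatter("[%(name)s][%(levelname)s]: %(message)s"))

# File-like handler: DEBUG and above, with a timestamp.
log_file = logging.StreamHandler(file_stream)
log_file.setLevel(logging.DEBUG)
log_file.setFormatter(
    logging.Formatter("%(asctime)s [%(name)s][%(levelname)s]: %(message)s"))

sketch_logger.addHandler(console)
sketch_logger.addHandler(log_file)

sketch_logger.debug("Log message(DEBUG)")
sketch_logger.info("Log message(INFO)")
sketch_logger.warning("Log message(WARNING)")
sketch_logger.error("Log message(ERROR)")
sketch_logger.critical("Log message(CRITICAL)")

console_lines = console_stream.getvalue().splitlines()
file_lines = file_stream.getvalue().splitlines()
print(len(console_lines), len(file_lines))
```

Each handler filters independently, so the same five calls produce two console lines and five file lines.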
Autoforward mode¶
NNabla provides a dynamic computation graph feature, which enables automatic forward propagation during graph construction. This can be enabled using the set_auto_forward() function. Backpropagation must be executed manually on the dynamically constructed graph.
 nnabla.auto_forward(auto=True)[source]¶
Context for dynamic graph execution mode.
 Parameters
auto (bool) – Whether forward computation is executed during a computation graph construction.
Returns: bool
 nnabla.set_auto_forward(auto)[source]¶
Set the default mode for automatic forward propagation.
When it is set to True, forward propagation is invoked immediately when the computation graph is updated.
 Parameters
auto (bool) – Whether forward computation is executed when the computation graph is updated.
Returns: bool
Context¶
 class nnabla.Context(backend=None, array_class='', device_id='0')¶
Context is used to specify the computation engine (cpu, cuda, cudnn etc.) which the function operator modules and optimizer modules run on. The context can be set for each function, as well as set globally with the functions listed in the Context Specifier API section.
Context Specifier API¶
 nnabla.context_scope(ctx)[source]¶
Context as Python context.
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla_ext.cuda

x = nn.Variable([2, 3, 4])
ctx = nnabla_ext.cuda.context('0')
with nn.context_scope(ctx):
    # Inside the with scope, the specified context is used.
    with nn.parameter_scope('w1'):
        l1 = F.relu(PF.affine(x, 64))
    with nn.parameter_scope('w2'):
        l2 = F.relu(PF.affine(x, 64))
 nnabla.set_default_context(ctx)[source]¶
Set the default context.
Note
It cannot be called inside any context_scope.
 Parameters
ctx (Context) – A Context.
 nnabla.get_current_context()[source]¶
Get the current context.
It can be set using nnabla.context_scope() or nnabla.set_default_context().
 Returns
a current context.
 Return type
Context
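How a scoped context and a default context might interact can be sketched with a small stack in plain Python. The functions below reuse the names from this section purely for readability, but they are toy re-implementations: contexts are plain strings, and raising RuntimeError inside a scope is an assumption about how the "cannot be called inside any context_scope" rule could be enforced.

```python
from contextlib import contextmanager

_default_context = "cpu"  # hypothetical string stand-in for a Context object
_scope_stack = []


@contextmanager
def context_scope(ctx):
    """Make ctx the current context inside the with block."""
    _scope_stack.append(ctx)
    try:
        yield ctx
    finally:
        _scope_stack.pop()


def set_default_context(ctx):
    global _default_context
    if _scope_stack:
        raise RuntimeError("cannot be called inside a context_scope")
    _default_context = ctx


def get_current_context():
    # The innermost scope wins; otherwise fall back to the default.
    return _scope_stack[-1] if _scope_stack else _default_context


observed = [get_current_context()]
with context_scope("cudnn:0"):
    observed.append(get_current_context())
    with context_scope("cuda:1"):
        observed.append(get_current_context())
    observed.append(get_current_context())
    try:
        set_default_context("cuda:0")
        rejected = False
    except RuntimeError:
        rejected = True
observed.append(get_current_context())
print(observed, rejected)
```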
NdArray¶
 class nnabla.NdArray(*args, **kwargs)¶
nnabla.NdArray is a device-agnostic data container for multi-dimensional arrays (tensors). nnabla.NdArray can also implicitly handle data transfers across different devices (e.g. CPU to CUDA GPU, CUDA GPU to CPU). See Python API Tutorial for more details.
NdArray overrides some arithmetic operators (+, -, *, /, **). Operands can be either a scalar number, NdArray or Variable. An arithmetic operation containing NdArray returns an NdArray which stores the output of the computation immediately invoked. Also, in-place arithmetic operations (+=, -=, *=, /=, **=) are implemented. Note that = doesn’t perform in-place substitution but just replaces the object reference. Instead, you can use copy_from() for in-place substitution.
 cast(self, dtype, ctx=None)¶
In-place cast of the data type of the NdArray. It returns the reference values as a numpy.ndarray only if the optional parameter ctx is not given, None otherwise.
 Parameters
dtype (numpy.dtype) – Numpy data type.
ctx (nnabla.Context, optional) – Context descriptor.
 Returns
numpy.array if ctx is None, otherwise nothing.
 clear(self)¶
Clears the memory held by this NdArray and returns it to the allocator.
 clear_called¶
Checks whether the array has not been modified since it was cleared. This returns False until clear is called for the first time.
 copy_from(self, NdArray arr, use_current_context=True)¶
Copy values from another NdArray object.
It returns the caller object itself.
 Parameters
arr (NdArray) – Values will be copied to the caller object. The shape of arr must be the same as that of the caller object.
use_current_context (bool) – If True, the copy happens on the device and with the dtype specified in the current context (equivalent to calling F.identity(src, output=[self])). Otherwise, the device and dtype of the source array are used. The default is True.
 Returns
nnabla.NdArray
 data¶
Returns the values held by this array as a numpy.ndarray. Note that only the references are returned, and the values are not copied. Therefore, modifying the returned numpy.ndarray will affect the data contained inside the NNabla array. This method can also be called as a setter where an array is created as the same type as rhs. There is an exception where zero() or fill(rhs) is invoked if a scalar with a float or an integer <= 2^53 (as filling value is maintained as float64) is given.
Note that this may implicitly invoke a data transfer from device arrays to the CPU.
 Parameters
value (numpy.ndarray) –
 Returns
numpy.ndarray
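The 2^53 bound in the setter description follows from float64 arithmetic: every integer up to 2^53 has an exact float64 representation, but 2^53 + 1 does not. The short check below uses only built-in Python floats (which are float64) and does not call NNabla.

```python
# Python floats are IEEE 754 double precision (float64).
limit = 2 ** 53

# 2**53 itself converts to float64 without loss.
exact = float(limit) == limit

# 2**53 + 1 cannot be represented; it rounds to a neighbouring value.
rounded = float(limit + 1) == limit + 1

# After conversion, 2**53 and 2**53 + 1 are indistinguishable.
collapsed = float(limit + 1) == float(limit)

print(exact, rounded, collapsed)
```

An integer fill value above this bound could therefore change when stored as float64.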
 data_ptr(self, dtype, ctx=None)¶
Get array’s pointer.
The behavior is similar to the cast method but returns the data pointer based on the ctx. If the ctx is not specified, the default context obtained by nn.get_current_context is used.
 Parameters
dtype (numpy.dtype) – Numpy data type.
ctx (nnabla.Context, optional) – Context descriptor.
 Returns
The data pointer.
 Return type
 dtype¶
Get dtype.
 Returns
 fill(self, value)¶
Fill all of the elements with the provided scalar value.
Note
This does not fill values in an internal array with the given value immediately. An array is created as a requested data type when this array is used (in forward or backward computation, for example), and is filled with the value.
 Parameters
value (float) – The value to fill the array with.
 static from_numpy_array(nparr)¶
Create a NdArray object from Numpy array data.
The data is initialized with the given Numpy array.
 Parameters
nparr (ndarray) – Numpy multidimensional array.
 Returns
nnabla.NdArray
 get_data(self, str mode='rw', dtype=None)¶
Returns the values held by this array as a numpy.ndarray with a specified mode.
 Parameters
mode (str) – Computation becomes more efficient if the right one is chosen.
* ‘r’: Read-only access.
* ‘w’: Write-only access.
* ‘rw’: You can both read and write.
dtype (numpy.dtype, optional) – Force dtype of a returned array.
See nnabla.NdArray.data for more details.
 modification_count¶
Returns how many times the array has been modified since memory allocation or since the buffer was last cleared.
 ndim¶
Number of dimensions.
 Returns
int
 shape¶
Shape of the Nd array.
 Returns
tuple of int
 size¶
Total size of the Nd array.
 Returns
int
 size_from_axis(self, axis=-1)¶
Gets the size followed by the provided axis.
Example
a = nnabla.NdArray([10,9])
a.size_from_axis() # ==> 90
a.size_from_axis(0) # ==> 90
a.size_from_axis(1) # ==> 9
a.size_from_axis(2) # ==> 1
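The values in the example are consistent with taking the product of the shape entries from the given axis onward, with a non-positive axis meaning the whole array. The helper below is a plain-Python sketch of that reading (the free function and its shape argument are hypothetical), not the library code.

```python
from functools import reduce
from operator import mul


def size_from_axis(shape, axis=-1):
    """Product of shape[axis:]; a non-positive axis gives the total size."""
    if axis <= 0:
        return reduce(mul, shape, 1)
    # An axis past the last dimension leaves an empty slice, whose product is 1.
    return reduce(mul, shape[axis:], 1)


shape = [10, 9]
sizes = [size_from_axis(shape),
         size_from_axis(shape, 0),
         size_from_axis(shape, 1),
         size_from_axis(shape, 2)]
print(sizes)
```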
 strides¶
Strides.
 Returns
tuple of int
 zero(self)¶
Fill all of the elements with 0.
Note
This does not fill values in an internal array with 0 immediately. An array is created as a requested data type when this array is used (in forward or backward computation, for example), and is filled with 0.
Variable¶
 class nnabla.Variable¶
Bases: object
nnabla.Variable is used to construct computation graphs (neural networks) together with functions in Functions and List of Parametric Functions. It also provides a method to execute forward and backward propagation of the network. The nnabla.Variable class holds:
* Reference to the parent function in a computation graph. This provides traceability of all connections in the computation graph.
* Both data and error signal (gradient) containers as nnabla.NdArray s.
* Some additional information of the computation graph.
Variable overrides some arithmetic operators (+, -, *, /, **). Operands can be either a scalar number, NdArray or Variable. If NdArray is given as either the left or right operand, the arithmetic operation returns an NdArray which stores the output of the computation immediately invoked. Otherwise, it returns a Variable that holds the graph connection. The computation is invoked immediately when nnabla.auto_forward or nnabla.set_auto_forward(True) is used.
Note
Relational operators == and != of two Variable s are defined as an address comparison of underlying C++ instances (nbla::Variable). Also, the hash() function, which is often used as a key for set and dict, is based on the address.
 Parameters
shape (Iterable of int) – Shape of variable.
need_grad (bool) – Flag for backprop or not.
 apply(self, **kwargs)¶
Helper for setting property, then return self.
 backward(self, grad=1, bool clear_buffer=False, communicator_callbacks=None, function_pre_hook=None, function_post_hook=None)¶
Performs a backward propagation starting from this variable until the root variable(s) is/are reached in the function graph. The propagation will stop at a variable with need_grad=False.
 Parameters
grad (scalar, numpy.ndarray, nnabla.NdArray, or None) – The gradient signal value(s) of this variable. The default value 1 is used in usual neural network training. This option is useful if you have a gradient computation module outside NNabla and want to use that result as a gradient signal. Note that this doesn’t modify the grad values of this variable; instead, it assigns the received values to its gradient temporarily. Also, if the Variable on which you want to execute nnabla._variable.Variable.backward is an unlinked variable from another, and the corresponding Variable holds the precomputed gradient values, you need to set grad=None; otherwise, for that backward pass (propagated from the unlinked Variable), the precomputed gradient values are ignored.
clear_buffer (bool) – Clears the no longer referenced variables during backpropagation to save memory. Note that all unnecessary intermediate variables will be cleared unless set explicitly as persistent=True.
communicator_callbacks (nnabla.CommunicatorBackwardCallback or list of nnabla.CommunicatorBackwardCallback) – The callback functions invoked when 1) backward computation of each function is finished and 2) all backward computation is finished.
function_pre_hook (callable) – This callable object is called immediately before each function is executed. It must take Function as an input. The default is None.
function_post_hook (callable) – This callable object is called immediately after each function is executed. It must take Function as an input. The default is None.
Example
We first explain simple backward usage.
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
import nnabla.initializer as I

rng = np.random.seed(217)
initializer = I.UniformInitializer((-0.1, 0.1), rng=rng)

x = nn.Variable((8, 3, 32, 32))
x.d = np.random.random(x.shape)  # random input, just for example.

y0 = PF.convolution(x, outmaps=64, kernel=(3, 3), pad=(1, 1), stride=(2, 2), w_init=initializer, name="conv1", with_bias=False)
y1 = F.relu(y0)
y2 = PF.convolution(y1, outmaps=128, kernel=(3, 3), pad=(1, 1), stride=(2, 2), w_init=initializer, name="conv2", with_bias=False)
y3 = F.relu(y2)
y4 = F.average_pooling(y3, kernel=y3.shape[2:])
y5 = PF.affine(y4, 1, w_init=initializer)
loss = F.mean(F.abs(y5 - 1.))
loss.forward()  # Execute forward

# We can check the current gradient of parameter.
print(nn.get_parameters()["conv1/conv/W"].g)
Output :
[[[[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]] ...
Initially all the gradient values should be zero. Then let’s see what happens after calling backward.
loss.backward()
print(nn.get_parameters()["conv1/conv/W"].g)
Output :
[[[[ 0.00539637 0.00770839 0.0090611 ] [ 0.0078223 0.00978992 0.00720569] [ 0.00879023 0.00578172 0.00790895]] ...
Now we know the gradient values are computed and registered by calling backward. Note that calling backward successively accumulates the result. It means that if we execute backward again, we get the doubled result.
loss.backward()  # execute again.
print(nn.get_parameters()["conv1/conv/W"].g)
We can see it’s accumulated.
[[[[ 0.01079273 0.01541678 0.0181222 ] [ 0.01564459 0.01957984 0.01441139] [ 0.01758046 0.01156345 0.0158179 ]] ...
Next is an advanced usage with an unlinked variable (please refer to get_unlinked_variable). We use the same network, but it is separated by the unlinked variable.
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import numpy as np
import nnabla.initializer as I

rng = np.random.seed(217)  # use the same random seed.
initializer = I.UniformInitializer((-0.1, 0.1), rng=rng)

x = nn.Variable((8, 3, 32, 32))
x.d = np.random.random(x.shape)  # random input, just for example.

y0 = PF.convolution(x, outmaps=64, kernel=(3, 3), pad=(1, 1), stride=(2, 2), w_init=initializer, name="conv1", with_bias=False)
y1 = F.relu(y0)
y2 = PF.convolution(y1, outmaps=128, kernel=(3, 3), pad=(1, 1), stride=(2, 2), w_init=initializer, name="conv2", with_bias=False)
y3 = F.relu(y2)
y3_unlinked = y3.get_unlinked_variable()  # the computation graph is cut apart here.
y4 = F.average_pooling(y3_unlinked, kernel=y3_unlinked.shape[2:])
y5 = PF.affine(y4, 1, w_init=initializer)
loss = F.mean(F.abs(y5 - 1.))

# Execute forward.
y3.forward()  # you need to execute forward at the unlinked variable first.
loss.forward()  # Then execute forward at the leaf variable.

# Execute backward.
loss.backward()  # works, but backpropagation stops at y3_unlinked.
print(nn.get_parameters()["conv1/conv/W"].g)  # no gradient registered yet.
Output :
[[[[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]] ...
We can confirm that backpropagation stops at y3_unlinked. Then let’s see how to execute backpropagation to the root variable (x). Since it’s a little bit complicated, let us give you an example of a common pitfall first. Note that this is an incorrect way and is intended just to show the behavior of backward.
y3.backward()  # this works, but computed gradient values are not correct.
print(nn.get_parameters()["conv1/conv/W"].g)
Output :
[[[[ 17.795254 23.960905 25.51168 ] [ 20.661646 28.484127 19.406212 ] [ 26.91042 22.239697 23.395714 ]] ...
Note that this is a wrong result. The gradient held by y3_unlinked has been totally ignored. As described above, when you just call backward, the gradient (of the leaf variable where you call backward) is considered to be 1.
To execute backpropagation over 2 separate graphs correctly, we need to specify grad=None as shown below; then the present gradient held by that variable is used for computation. (y3.backward(grad=y3_unlinked.g) does the same thing.)
# reset all the gradient values.
for v in nn.get_parameters().values():
    v.g = 0.
for v in [y0, y1, y2, y3, y4, y5]:
    v.g = 0.  # need to reset all the gradient values.

loss.backward()  # backpropagation starts from the leaf variable again.
y3.backward(grad=None)  # By this, it can take over the gradient held by y3_unlinked.
print(nn.get_parameters()["conv1/conv/W"].g)  # correct result.
This time you should have the same result.
[[[[ 0.00539637 0.00770839 0.0090611 ] [ 0.0078223 0.00978992 0.00720569] [ 0.00879023 0.00578172 0.00790895]] ...
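The three behaviours in this example (accumulation, the default seed of 1, and grad=None reusing a stored gradient) can be reproduced on a scalar toy problem in plain Python. The Var class and the two backward helpers are hypothetical; they model h = 3 * x followed by loss = h ** 2, with the graph cut at h, and are not NNabla code.

```python
class Var:
    """Toy scalar variable with data d and an accumulated gradient g."""

    def __init__(self, d):
        self.d = d
        self.g = 0.0


def backward_loss(h, grad=1.0):
    # loss = h ** 2, so d(loss)/dh = 2 * h. Gradients accumulate.
    h.g += grad * 2.0 * h.d


def backward_h(h, x, grad=1.0):
    # h = 3 * x, so dh/dx = 3. grad=None means "use the gradient stored in h".
    seed = h.g if grad is None else grad
    x.g += seed * 3.0


x = Var(2.0)
h = Var(3.0 * x.d)  # plays the role of y3 / y3_unlinked (shared buffers)

backward_loss(h)    # backpropagation stops at the cut; h.g is now 12.0

backward_h(h, x)    # pitfall: seeds with 1 and ignores h.g
wrong = x.g

x.g = 0.0           # reset before the correct pass
backward_h(h, x, grad=None)
correct = x.g       # d(9 * x ** 2)/dx = 18 * x

backward_h(h, x, grad=None)
accumulated = x.g   # calling backward again doubles the stored value
print(wrong, correct, accumulated)
```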
 clear_all_graph_links(self)¶
Clear all intermediate functions and variables.
This method clears all intermediate functions and variables up to this variable in the forward pass and is useful for truncated backpropagation through time (truncated BPTT) in a dynamic graph.
 d¶
Returns the values held by this variable, as a numpy.ndarray. Note that the values are referenced (not copied). Therefore, the modification of the returned ndarray will affect the data of the NNabla array. This method can be called as a setter to set the value held by this variable. Refer to the documentation of the setter nnabla.NdArray.data for detailed behaviors of the setter.
 Parameters
value (numpy.ndarray) (optional) –
 Returns
numpy.ndarray
 forward(self, bool clear_buffer=False, bool clear_no_need_grad=False, function_pre_hook=None, function_post_hook=None)¶
Performs a forward propagation from the root node to this variable. The forward propagation is performed on a subset of variables determined by the dependency of this variable. The subset is recursively constructed by tracking variables that the variables in the subset depend on, starting from this variable, until it reaches the root variable(s) in the function graph. See also forward_all, which performs forward computations for all variables within the input graph.
 Parameters
clear_buffer (bool) – Clear the no longer referenced variables during forward propagation to save memory. This is usually set as True in an inference or a validation phase. Default is False. Note that all unnecessary intermediate variables will be cleared unless set explicitly as persistent=True.
clear_no_need_grad (bool) – Clear the unreferenced variables with need_grad=False during forward propagation. True is usually used when calling this during training. This is ignored when clear_buffer=True.
function_pre_hook (callable) – This callable object is called immediately before each function is executed. It must take Function as an input. The default is None.
function_post_hook (callable) – This callable object is called immediately after each function is executed. It must take Function as an input. The default is None.
 static from_numpy_array(data, grad=None, need_grad=None)¶
Create a Variable object from Numpy array(s).
The data is initialized with the given Numpy array, as well as grad if given.
The shape is also determined by the given array.
 function_references¶
Returns a list of functions which take this variable as an input. This method can be called only as a getter.
 Returns
list of nnabla.function.Function
 g¶
Returns the gradient values held by this variable, as a numpy.ndarray. Note that the values are referenced (not copied). Therefore, the modification of the returned ndarray will affect the data of the NNabla array. This method can be called as a setter to set the gradient held by this variable. Refer to the documentation of the setter nnabla.NdArray.data for detailed behaviors of the setter.
 Parameters
value (numpy.ndarray) –
 Returns
numpy.ndarray
 get_unlinked_variable(self, need_grad=None)¶
Gets an unlinked (forgetting parent) variable that shares a Variable buffer instance.
 Parameters
need_grad (bool, optional) – By default, the unlinked variable will have the same need_grad flag as this variable instance. If a boolean value is specified, the unlinked variable gets that need_grad flag instead. It is recommended to specify this option explicitly to avoid unintended behavior.
Returns: Variable
Note
The unlinked Variable behaves equivalently to the original variable in a comparison operator and hash function regardless of whether or not the need_grad attribute is changed. See a note in the Variable class documentation. Also, for backward execution with unlinked variable(s), please refer to backward and its example.
Example
import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF

x = nn.Variable.from_numpy_array(np.array([[1, 2], [3, 4]]))
y = PF.affine(x, 4, name="y")

# Create a new variable of which graph connection is unlinked.
# Recommend to specify need_grad option explicitly.
z = y.get_unlinked_variable(need_grad=False)

print(y.parent)
# Affine
print(z.parent)  # z is unlinked from the parent x but shares the buffers of y.
# None
 info¶
object
Information of the variable.
 Type
info
 ndim¶
Gets the number of dimensions of this variable.
 Returns
int
 need_grad¶
Gets or sets a boolean indicating whether backpropagation is performed at this variable.
 no_grad(self)¶
No gradients for the whole network.
This method is like nnabla.no_grad but can be used for the static network only, and is useful for the case where the network is loaded from NNP format.
Example
x = nn.Variable.from_numpy_array([2, 3])
y = <Network>(x).no_grad()
 parent¶
Returns the parent function of this variable. This method can also be called as a setter.
 Parameters
func (nnabla.function.Function) –
 Returns
nnabla.function.Function
 persistent¶
Returns the persistent flag of this variable. If True, the variable is not cleared even if clear options in nnabla._variable.Variable.forward() and nnabla._variable.Variable.backward() are enabled. This is useful when you debug the variable values, or log them. This method can also be called as a setter.
 Parameters
b (bool) –
 Returns
bool
 recompute¶
Gets or sets a boolean indicating whether its data is cleared during forward propagation and recomputation is performed during backward propagation.
 reset_shape(self, shape, force=False)¶
Resizes the shape of the variable to a specified shape.
 Parameters
shape (Iterable of int) – Target shape.
force (bool) – Flag to force reshape.
Note
This method destructively changes the shape of the target variable. For safety, reshape() should be used instead.
 Returns
None
 reshape(self, shape, unlink=False)¶
Returns a new variable, where this variable is reshaped to a specified shape.
 rewire_on(self, var)¶
Rewire a successor graph of this variable on top of var.
 Parameters
var (nnabla.Variable) – The array elements and the parent function of var are copied to self as references. Note that the parent function of var is removed.
Example
# A. Create a graph A.
xa = nn.Variable((2, 8), need_grad=True)
ya = F.tanh(PF.affine(xa, 10, name='a'))

# B. Create a graph B.
xb = nn.Variable((2, 16), need_grad=True)
yb = F.tanh(PF.affine(
    F.tanh(PF.affine(xb, 8, name='b1')),
    8, name='b2'))

# C. Rewire the graph A on top of B such that
# `xb->B->(yb->)xa->A->ya`. Note `yb` is gone.
xa.rewire_on(yb)

# D. Execute the rewired graph.
xb.d = 1
ya.forward()
ya.backward()
 size_from_axis(self, axis=-1)¶
Gets the size followed by the provided axis.
Example
a = nnabla.Variable([10,9])
a.size_from_axis() # ==> 90
a.size_from_axis(0) # ==> 90
a.size_from_axis(1) # ==> 9
a.size_from_axis(2) # ==> 1
 unlinked(self, need_grad=None)¶
This function is deprecated, use get_unlinked_variable instead.
 visit(self, f)¶
Visit functions recursively in forward order.
 Parameters
f (function) – Function object which takes nnabla._function.Function object as an argument.
 Returns
None
Example
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

# Define a simple network graph
def network_graph(x, maps=16, test=False):
    h = x
    h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), name="firstconv", with_bias=False)
    h = F.average_pooling(h, h.shape[2:])
    pred = PF.affine(h, 10, name="pred")
    return pred

# You can modify this PrintFunc to get other information such as inputs (nnabla_func.inputs), outputs and arguments (nnabla_func.info.args) of nnabla functions.
class PrintFunc(object):
    def __call__(self, nnabla_func):
        print(nnabla_func.info.type_name)

x = nn.Variable([1, 3, 16, 16])
output = network_graph(x)
output.visit(PrintFunc())
Output :
Convolution
AveragePooling
Affine
 visit_check(self, f)¶
Visit functions recursively in forward order.
Note
If any of evaluation of the function object returns True, the visit propagation will stop immediately, and will return True.
 Parameters
f (function) – Function object which takes nnabla._function.Function object as an argument.
 Returns
bool
Returns True if any of the function object calls returns True.
Example
Define a simple network graph where the AveragePooling function can be added explicitly as below:
def network_graph(x, add_avg_pool=False, maps=16, test=False):
    h = x
    h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), name="firstconv", with_bias=False)
    if add_avg_pool:
        h = F.average_pooling(h, h.shape[2:])
    else:
        h = F.relu(h)
    pred = PF.affine(h, 10, name="pred")
    return pred

# Define 'PrintFunc()' to check whether the "AveragePooling" function exists in the network graph
class PrintFunc(object):
    def __call__(self, nnabla_func):
        if nnabla_func.info.type_name == "AveragePooling":
            print("{} exists in the graph".format(nnabla_func.info.type_name))
            return True
        else:
            return False
Create a network graph which has the AveragePooling function and call the visit_check() method:
x = nn.Variable([1, 3, 16, 16])
output = network_graph(x, add_avg_pool=True)  # Adding AveragePooling function to the graph
print("The return value of visit_check() method is : {}".format(output.visit_check(PrintFunc())))
Output :
AveragePooling exists in the graph
The return value of visit_check() method is : True
Create a network graph which doesn’t have the AveragePooling function and call the visit_check() method:
nn.clear_parameters()  # call this in case you want to run the following code again
output = network_graph(x, add_avg_pool=False)  # Exclusion of AveragePooling function in the graph
print("The return value of visit_check() method is : {}".format(output.visit_check(PrintFunc())))
Output :
The return value of visit_check() method is : False
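The traversal order of visit and the early stop of visit_check can be modelled with a small recursive walk in plain Python. The Func class and both functions below are hypothetical stand-ins that only mirror the documented behaviour (forward order, stop on the first True); they are not the NNabla implementation.

```python
class Func:
    """Toy function node: a type name plus the functions feeding its inputs."""

    def __init__(self, type_name, parents=()):
        self.type_name = type_name
        self.parents = list(parents)


def visit(func, f, seen=None):
    seen = set() if seen is None else seen
    if id(func) in seen:
        return
    seen.add(id(func))
    for parent in func.parents:
        visit(parent, f, seen)
    f(func)  # parents are handled first, so calls happen in forward order


def visit_check(func, f, seen=None):
    seen = set() if seen is None else seen
    if id(func) in seen:
        return False
    seen.add(id(func))
    for parent in func.parents:
        if visit_check(parent, f, seen):
            return True  # stop as soon as any call returned True
    return bool(f(func))


conv = Func("Convolution")
pool = Func("AveragePooling", [conv])
affine = Func("Affine", [pool])

order = []
visit(affine, lambda fn: order.append(fn.type_name))

checked = []


def is_pooling(fn):
    checked.append(fn.type_name)
    return fn.type_name == "AveragePooling"


found = visit_check(affine, is_pooling)
print(order, checked, found)
```

In this toy run the Affine node is never passed to is_pooling, because the walk stops at AveragePooling.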
Computation Graph¶
 nnabla.forward_all(variables, bool clear_buffer=False, bool clear_no_need_grad=False, function_pre_hook=None, function_post_hook=None)¶
Performs a forward propagation up to variables specified as the 1st argument. See also forward.
 Parameters
clear_buffer (bool) –
Clear the no longer referenced variables during forward propagation to save memory. This is usually set as True in an inference or a validation phase. Default is False. Note that the starting variable and destination variable of the input graph will not be cleared, regardless of their persistent flag. All intermediate variables will be cleared unless set explicitly as persistent=True. For example,
forward_all([h_i, y], clear_buffer=True)
will clear all intermediate variables between h_i and y unless set explicitly as persistent=True, but h_i and y will not be cleared regardless of their persistent flag.
clear_no_need_grad (bool) – Clear the unreferenced variables with need_grad=False during forward propagation. True is usually used when calling this during training. This is ignored when clear_buffer=True.
function_pre_hook (callable) – This callable object is called immediately before each function is executed. It must take Function as an input. The default is None.
function_post_hook (callable) – This callable object is called immediately after each function is executed. It must take Function as an input. The default is None.
Example
import numpy as np
import nnabla as nn
import nnabla.parametric_functions as PF

# Create a graph which has two outputs
x = nn.Variable.from_numpy_array(np.array([[1, 2], [3, 4]]))
y = PF.affine(x, 4, name="y")
z = PF.affine(x, 8, name="z")

# Execute a forward propagation recursively up to y and z
nn.forward_all([y, z], clear_buffer=True)
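One way to picture a forward pass over several outputs is as a walk over the union of their dependencies. The toy below is plain Python with hypothetical names (Node, a free forward_all taking a log list); it evaluates each node needed by the requested outputs once and leaves unrelated nodes untouched. It illustrates the dependency idea only and makes no claim about NNabla's scheduling.

```python
class Node:
    """Toy graph node: a name, a function, and the nodes it depends on."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = list(inputs)
        self.value = None


def forward_all(outputs, log):
    done = set()

    def run(node):
        if id(node) in done:
            return
        done.add(id(node))
        for parent in node.inputs:
            run(parent)  # dependencies are evaluated first
        node.value = node.fn(*[p.value for p in node.inputs])
        log.append(node.name)

    for out in outputs:
        run(out)


x = Node("x", lambda: 2.0)
h = Node("h", lambda v: v + 1.0, [x])     # shared by y and z
y = Node("y", lambda v: v * 10.0, [h])
z = Node("z", lambda v: v * 100.0, [h])
w = Node("w", lambda v: -v, [x])          # not requested, so never evaluated

log = []
forward_all([y, z], log)
print(log, y.value, z.value, w.value)
```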
 nnabla.no_grad(no_grad_=True)[source]¶
No gradients for the whole network.
Inside this context, no gradients are required when creating a network, so that when the forward pass is executed, all intermediate buffers except for the leaves of the network are released at the same time, which reduces memory usage.
This is useful, for example, when an output of a pre-trained network is used as an input to another network, where the first pre-trained network does not need to be fine-tuned, but the other network is optimized.
 Parameters
no_grad_ (bool) – No gradient flag. Default is True.
Example:
with nn.no_grad():
    output0 = <Network0>(<input0>)

output1 = <Network1>(<input1>, output0)
loss = <Loss>(output1, <ground_truth>)
loss.forward(clear_no_need_grad=True)
This context also works in the dynamic mode.
with nn.auto_forward(), nn.no_grad():
    output0 = <Network0>(<input0>)
Note
When working with the static network, the need_grad property of the input (e.g., input image) must be False, and do not forget to add <root>.forward(clear_no_need_grad=True); otherwise, the intermediate buffers are not released as expected.
Functions¶
All NNabla functions are derived from the nnabla.function.Function class.
Function¶
 class nnabla.function.Function¶
Function interface class.
Instances of nnabla.function.Function are not directly created by users. They are indirectly created by the functions available in nnabla.functions. These functions return nnabla.Variable (s) holding the created function instance as the parent property.
 args¶
Experimental
Get args of the function.
 backward(self, inputs, outputs, accum=None)¶
 forward(self, inputs, outputs)¶
 grad_depends_output_data(self, int i, int o)¶
 info¶
object
 Type
info
 inplace_data(self, int i)¶
 inplace_data_with(self, int i)¶
 min_outputs(self)¶
 need_setup_recompute(self, int o)¶
 recompute(self, inputs, outputs)¶
 setup(self, inputs, outputs)¶
 setup_recompute(self, inputs, outputs)¶
 tags¶
Experimental
Get tags of the function.
 class nnabla.function.PythonFunction(ctx=None)¶
Creates a user-defined custom function in the subclass.
To implement the naive multiplication function of two variables using PythonFunction,
import nnabla as nn
import nnabla.functions as F
from nnabla.function import PythonFunction

class Mul2(PythonFunction):

    def __init__(self, ctx):
        super(Mul2, self).__init__(ctx)

    @property
    def name(self):
        return self.__class__.__name__

    def min_outputs(self):
        return 1

    def setup_impl(self, inputs, outputs):
        i0 = inputs[0]
        i1 = inputs[1]
        assert i0.shape == i1.shape, "Shapes of inputs are different."
        o0 = outputs[0]
        o0.reset_shape(i0.shape, True)

    def forward_impl(self, inputs, outputs):
        x0 = inputs[0].data
        x1 = inputs[1].data
        y = outputs[0].data

        # We can also write like, y.copy_from(x0 * x1)
        y.copy_from(F.mul2(x0, x1))

    def backward_impl(self, inputs, outputs, propagate_down, accum):
        # Data of inputs and outputs
        x0 = inputs[0].data
        x1 = inputs[1].data
        y = outputs[0].data

        # Grads of inputs and outputs
        dx0 = inputs[0].grad
        dx1 = inputs[1].grad
        dy = outputs[0].grad

        # backward w.r.t. x0
        if propagate_down[0]:
            if accum[0]:
                dx0 += F.mul2(dy, x1)
            else:
                dx0.copy_from(F.mul2(dy, x1))

        # backward w.r.t. x1
        if propagate_down[1]:
            if accum[1]:
                dx1 += F.mul2(dy, x0)
            else:
                dx1.copy_from(F.mul2(dy, x0))

    def grad_depends_output_data(self, i, o):
        return False

    def grad_depends_input_data(self, i, j):
        return True

def mul2(x, y, ctx=None):
    func = Mul2(ctx)
    return func(x, y)
 __init__(self, ctx=None)¶
PythonFunction.__init__(self, ctx=None)
 Parameters
ctx (nnabla.Context) – Context used for the forward and backward pass. If not specified, the current context is used.
 backward_impl(self, inputs, outputs, propagate_down, accum)¶
Backward method.
 Parameters
inputs – (list of nnabla.Variable): Inputs to the function.
outputs – (list of nnabla.Variable): Outputs from the function.
 property ctx¶
Context
Returns the context if it was set in the constructor; otherwise returns the global context.
 forward_impl(self, inputs, outputs)¶
Forward method.
 Parameters
inputs – (list of nnabla.Variable): Inputs to the function.
outputs – (list of nnabla.Variable): Outputs from the function.
 grad_depends_input_data(self, i, j)¶
Checks whether the gradient computation of the i-th input requires the data of the j-th input.
 Parameters
i (int) – Input variable index.
j (int) – Input variable index.
 grad_depends_output_data(self, i, o)¶
Checks whether the gradient computation of the i-th input requires the data of the o-th output.
 Parameters
i (int) – Input variable index.
o (int) – Output variable index.
 min_outputs(self)¶
Minimum number of outputs of the function.
 property name¶
Name of the function.
 setup_impl(self, inputs, outputs)¶
Setup method.
 Parameters
inputs – (list of nnabla.Variable): Inputs to the function.
outputs – (list of nnabla.Variable): Outputs from the function.
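The roles of propagate_down and accum in the Mul2 example above can be checked on plain Python lists. The sketch below mirrors the branching in that backward_impl for y = x0 * x1: each gradient is either added to the existing buffer or overwrites it. The function names and list-based buffers are hypothetical and do not use NNabla.

```python
def mul2_forward(x0, x1):
    return [a * b for a, b in zip(x0, x1)]


def mul2_backward(x0, x1, dy, dx0, dx1, propagate_down, accum):
    """Write or accumulate the gradients of y = x0 * x1 into dx0 and dx1."""
    if propagate_down[0]:
        for i in range(len(dx0)):
            g = dy[i] * x1[i]  # dy/dx0 = x1
            dx0[i] = dx0[i] + g if accum[0] else g
    if propagate_down[1]:
        for i in range(len(dx1)):
            g = dy[i] * x0[i]  # dy/dx1 = x0
            dx1[i] = dx1[i] + g if accum[1] else g


x0 = [1.0, 2.0, 3.0]
x1 = [4.0, 5.0, 6.0]
y = mul2_forward(x0, x1)

dy = [1.0, 1.0, 1.0]
dx0 = [10.0, 10.0, 10.0]  # existing gradient, accumulated into
dx1 = [10.0, 10.0, 10.0]  # existing gradient, overwritten
mul2_backward(x0, x1, dy, dx0, dx1,
              propagate_down=[True, True], accum=[True, False])
print(y, dx0, dx1)
```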
List of Functions¶
The nnabla.functions module provides various types of functions listed below. These functions take input nnabla.Variable (s) as their leading argument(s), followed by options specific to each function.
Note
The functions can also take NdArray (s) as inputs instead of Variable (s). They then execute the function operation immediately and return NdArray (s) as output(s) holding the output values of the operation. We call this “Imperative Mode” (NdArray + Functions).
Neural Network Layers¶
 nnabla.functions.affine(x, weight, bias=None, base_axis=1, n_outputs=-1, outputs=None)[source]¶
Affine layer, also called the fully connected layer. It calculates:
\[{\mathbf y} = {\mathbf A} {\mathbf x} + {\mathbf b}.\]
where \({\mathbf x}\) is the input and \({\mathbf y}\) is the output.
 Parameters
x (Variable) – Input N-D array with shape (\(M_0 \times ... \times M_{B-1} \times D_B \times ... \times D_N\)). Dimensions before and after base_axis are flattened as if it is a matrix.
weight (Variable) – Weight matrix with shape (\((D_B \times ... \times D_N) \times L_{0} \times \ldots \times L_{I}\)) [parameter]
bias (Variable) – Bias vector (\(L_{0} \times \ldots \times L_{I}\)) [optional][parameter]
base_axis (int) – Base axis of Affine operation. Dimensions up to base_axis are treated as sample dimensions. [default=1]
 Returns
\((B + 1)\)-D array. (\(M_0 \times ... \times M_{B-1} \times L_{0} \times \ldots \times L_{I}\))
 Return type
Variable
Note
All nnabla functions in nnabla.functions are decorated with the nnabla.function_bases.function_api decorator, which queries the current context and passes it into the first argument of the original function. The original function always takes a context as the first argument.
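The flattening performed by the affine layer can be sketched in NumPy. This is an illustrative reference under the shape conventions listed in the parameters, not NNabla's implementation; `affine_reference` is a hypothetical helper name.

```python
import numpy as np

def affine_reference(x, weight, bias=None, base_axis=1):
    """NumPy sketch of the affine layer: flatten, multiply, restore shape."""
    sample_shape = x.shape[:base_axis]                # M_0 x ... x M_{B-1}
    out_shape = weight.shape[1:]                      # L_0 x ... x L_I
    x2d = x.reshape(int(np.prod(sample_shape)), -1)   # (samples, D_B * ... * D_N)
    w2d = weight.reshape(weight.shape[0], -1)         # (D_B * ... * D_N, L_0 * ... * L_I)
    y = x2d @ w2d
    if bias is not None:
        y = y + bias.reshape(-1)
    return y.reshape(sample_shape + out_shape)

x = np.ones((8, 3, 4))   # 8 samples, each with 3 x 4 features
w = np.ones((12, 5))     # 3 * 4 inputs, 5 outputs
b = np.zeros(5)
y = affine_reference(x, w, b)
print(y.shape)           # (8, 5)
```

With `base_axis=1` only the first dimension is kept as the sample dimension; the remaining dimensions are merged before the matrix product.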
 nnabla.functions.convolution(x, weight, bias=None, base_axis=1, pad=None, stride=None, dilation=None, group=1, channel_last=False, n_outputs=-1, outputs=None)[source]¶
N-D convolution with bias.
See references for dilated convolution (a.k.a. atrous convolution).
References
Note
Convolution is a computationally intensive operation that should preferably be run with the cudnn backend. NNabla then uses CuDNN library functions to determine and cache the fastest algorithm for the given set of convolution parameters. This results in additional memory consumption, which may pose a problem for GPUs with insufficient memory. In that case, the NNABLA_CUDNN_WORKSPACE_LIMIT environment variable can be used to restrict the choice of algorithms to those that fit the given workspace memory limit, expressed in bytes. In some cases it may also be desirable to restrict the automatic search to algorithms that produce deterministic (reproducible) results. This can be requested by setting the environment variable NNABLA_CUDNN_DETERMINISTIC to a non-zero value.
 Parameters
x (Variable) – \((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C \times L_1 \times ... \times L_N\)).
weight (Variable) – \((2 + N)\)-D array (\(C' \times C \times K_1 \times ... \times K_N\)). [parameter]
bias (Variable) – Bias vector (\(C'\)). [optional][parameter]
base_axis (int) – Base axis \(B\). [default=1]
pad (tuple of int) – Padding sizes for dimensions. [default=(0,) * (len(x.shape) - (base_axis+1))]
stride (tuple of int) – Stride sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
dilation (tuple of int) – Dilation sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
group (int) – Number of groups of channels. This makes the connection across channels sparser, by grouping connections along the mapping direction. [default=1]
channel_last (bool) – If True, the last dimension is considered as the channel dimension, a.k.a. NHWC order. [default=False]
 Returns
\((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C' \times L'_1 \times ... \times L'_N\)).
The spatial size of the output is calculated as
\[L'_i = \frac{L_i + 2 p_i - d_i (k_i - 1) - 1}{s_i} + 1,\]
where \(L_i\) is the spatial size, \(p_i\) is the padding, \(d_i\) is the dilation, \(k_i\) is the kernel size, and \(s_i\) is the stride for the \(i\)-th spatial dimension. The same calculation applies to the other spatial dimensions.
 Return type
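The output size formula for convolution can be checked with a small helper. This is a sketch in plain Python; `conv_output_size` is a hypothetical name, and integer (floor) division is assumed when the stride does not divide the numerator evenly.

```python
def conv_output_size(size, kernel, pad=0, stride=1, dilation=1):
    """Spatial output size L'_i of a convolution along one dimension."""
    # Floor division is an assumption for strides that do not divide evenly.
    return (size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

print(conv_output_size(32, kernel=3, pad=1))            # 32
print(conv_output_size(32, kernel=3, pad=1, stride=2))  # 16
print(conv_output_size(32, kernel=3, dilation=2))       # 28
```

A dilation of 2 makes a 3-tap kernel cover 5 input positions, which is why the unpadded output shrinks by 4 instead of 2.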
 nnabla.functions.depthwise_convolution(x, weight, bias=None, base_axis=1, pad=None, stride=None, dilation=None, multiplier=1, n_outputs=-1, outputs=None)[source]¶
N-D depthwise convolution with bias.
References
 Parameters
x (Variable) – \((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C \times L_1 \times ... \times L_N\)).
weight (Variable) – \((1 + N)\)-D array (\(C \times K_1 \times ... \times K_N\)). [parameter]
bias (Variable) – Bias vector (\(C'\)). [optional][parameter]
base_axis (int) – Base axis \(B\). [default=1]
pad (tuple of int) – Padding sizes for dimensions. [default=(0,) * (len(x.shape) - (base_axis+1))]
stride (tuple of int) – Stride sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
dilation (tuple of int) – Dilation sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
multiplier (int) – Number of output feature maps per input feature map. [default=1]
 Returns
\((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C' \times L'_1 \times ... \times L'_N\)).
The output map size \(C'\) is \(C\) multiplied by \(m\):
\[C' = m \times C,\]
where \(m\) is the multiplier.
The spatial size of the output is calculated as
\[L'_i = \frac{L_i + 2 p_i - d_i (k_i - 1) - 1}{s_i} + 1,\]
where \(L_i\) is the spatial size, \(p_i\) is the padding, \(d_i\) is the dilation, \(k_i\) is the kernel size, and \(s_i\) is the stride for the \(i\)-th spatial dimension. The same calculation applies to the other spatial dimensions.
 Return type
 nnabla.functions.deconvolution(x, weight, bias=None, base_axis=1, pad=None, stride=None, dilation=None, group=1, channel_last=False, output_padding=None, n_outputs=-1, outputs=None)[source]¶
N-D deconvolution, also known as transposed convolution, with bias. It performs the backward operation of convolution (the derivative of the output w.r.t. the input) plus a channel-wise learned bias.
The weights are specified in the same manner as convolution(), as if it were an ordinary convolution function. The forward operation of deconvolution() will then be operationally equivalent to the backward pass of convolution(). Therefore, the number of input channels (which can be seen as the output channels of the forward convolution) is specified in the first dimension, and the number of output channels divided by the number of groups is specified in the second dimension.
For stride > 1, a parameter-wise identical deconvolution on the output of a convolution may not produce the same output shape as the input to the convolution if, due to striding, the convolution did not fully cover the input spatial dimension. The output_padding parameter can then be used to appropriately increase the calculated output shape. Note that this is used to find the output shape for the deconvolution operation, but not to add zero-padding to the output.
 Parameters
x (Variable) – \((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C \times L_1 \times ... \times L_N\)).
weight (Variable) – \((2 + N)\)-D array (\(C \times C' \times K_1 \times ... \times K_N\)). [parameter]
bias (Variable) – Bias vector (\(C'\)). [optional][parameter]
base_axis (int) – Base axis \(B\). [default=1]
pad (tuple of int) – Padding sizes for dimensions. [default=(0,) * (len(x.shape) - (base_axis+1))]
stride (tuple of int) – Stride sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
dilation (tuple of int) – Dilation sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
group (int) – Number of groups of channels. This makes the connection across channels sparser, by grouping connections along the mapping direction. [default=1]
channel_last (bool) – If True, the last dimension is considered as the channel dimension, a.k.a. NHWC order. [default=False]
output_padding (tuple of int) – Additional size added to the output shape. [default=(0,) * (len(x.shape) - (base_axis+1))]
 Returns
\((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C' \times L'_1 \times ... \times L'_N\)).
The spatial size of the output is calculated as
\[L'_i = s_i (L_i - 1) - 2 p_i + d_i (k_i - 1) + 1,\]
where \(s_i\) is the stride, \(L_i\) is the spatial size, \(p_i\) is the padding, \(d_i\) is the dilation, and \(k_i\) is the kernel size for the \(i\)-th spatial dimension. The same calculation applies to the other spatial dimensions.
 Return type
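The shape ambiguity for stride > 1 can be illustrated with the two size formulas. The helpers below are hypothetical sketches. The convolution formula is assumed to use floor division, and `output_padding` is assumed to be added directly to the computed size, following its parameter description.

```python
def conv_output_size(size, kernel, pad=0, stride=1, dilation=1):
    """Spatial output size of a convolution (floor division assumed)."""
    return (size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

def deconv_output_size(size, kernel, pad=0, stride=1, dilation=1, output_padding=0):
    """Spatial output size of a deconvolution, plus the assumed output_padding term."""
    return stride * (size - 1) - 2 * pad + dilation * (kernel - 1) + 1 + output_padding

down = conv_output_size(8, kernel=3, pad=1, stride=2)
up = deconv_output_size(down, kernel=3, pad=1, stride=2)
up_fixed = deconv_output_size(down, kernel=3, pad=1, stride=2, output_padding=1)
print(down, up, up_fixed)   # 4 7 8
```

Inputs of size 7 and 8 both convolve down to size 4 with these settings, so the deconvolution alone cannot know which size to restore. Here `output_padding=1` selects 8.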
 nnabla.functions.depthwise_deconvolution(x, weight, bias=None, base_axis=1, pad=None, stride=None, dilation=None, divisor=1, n_outputs=-1, outputs=None)[source]¶
Depthwise deconvolution computes the transposed depthwise convolution with bias for one-dimensional and two-dimensional input data.
 Parameters
x (Variable) – \((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C \times L_1 \times ... \times L_N\)).
weight (Variable) – \((1 + N)\)-D array (\(C \times K_1 \times ... \times K_N\)). [parameter]
bias (Variable) – Bias vector (\(C'\)). [optional][parameter]
base_axis (int) – Base axis \(B\). [default=1]
pad (tuple of int) – Padding sizes for dimensions. [default=(0,) * (len(x.shape) - (base_axis+1))]
stride (tuple of int) – Stride sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
dilation (tuple of int) – Dilation sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
divisor (int) – Number of input feature maps per output feature map. [default=1]
 Returns
\((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C' \times L'_1 \times ... \times L'_N\)).
The output map size \(C'\) is \(C\) divided by \(d\):
\[C' = \frac{C}{d},\]
where \(d\) is the divisor.
The spatial size of the output is calculated as
\[L'_i = s_i (L_i - 1) - 2 p_i + d_i (k_i - 1) + 1,\]
where \(s_i\) is the stride, \(L_i\) is the spatial size, \(p_i\) is the padding, \(d_i\) is the dilation, and \(k_i\) is the kernel size for the \(i\)-th spatial dimension. The same calculation applies to the other spatial dimensions.
 Return type
 nnabla.functions.deformable_convolution(x, weight, offset, mask=None, bias=None, base_axis=1, pad=None, stride=None, dilation=None, group=1, deformable_group=1, channel_last=False, n_outputs=-1, outputs=None)[source]¶
2D deformable convolution with bias. Another convolution with fixed output channels must be passed externally to calculate the offsets and mask. The mask should be normalized to the \([0,1]\) interval.
\[\begin{eqnarray} y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k, \end{eqnarray}\]
where \(x\) and \(y\) are input and output, \(w_k\) is the weight, \(p\) is the pixel location of interest, \(p_k\) is the fixed displacement, e.g., \(p_k \in \{(-1, -1), (-1, 0), \ldots (1, 1)\}\) for the 2D 3x3 receptive field, \(\Delta p_k\) is the learnable displacement, and \(\Delta m_k\) is the learnable scale normalized in \([0, 1]\) by a function like the sigmoid. Note that \(\Delta p_k\) and \(\Delta m_k\) are sample-dependent, location-dependent, and feature-independent.
References
 Parameters
x (Variable) – \((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C \times L_1 \times ... \times L_N\)).
weight (Variable) – \((2 + N)\)-D array (\(C' \times C \times K_1 \times ... \times K_N\)). [parameter]
offset (Variable) – Offsets for deformable convolutions. Shape is fixed to \((N, deformable\_group \times 2 \times Kh \times Kw, H, W)\). Offsets must be calculated externally through a separate convolution layer.
mask (Variable) – Normalized mask for deformable convolutions v2. Shape is fixed to \((N, deformable\_group \times 2 \times Kh \times Kw, H, W)\). Masks must be calculated externally together with the offsets through a separate convolution layer. [optional]
bias (Variable) – Bias vector (\(C'\)). [optional][parameter]
base_axis (int) – Base axis \(B\). [default=1]
pad (tuple of int) – Padding sizes for dimensions. [default=(0,) * (len(x.shape) - (base_axis+1))]
stride (tuple of int) – Stride sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
dilation (tuple of int) – Dilation sizes for dimensions. [default=(1,) * (len(x.shape) - (base_axis+1))]
group (int) – Number of groups of channels. This makes the connection across channels sparser, by grouping connections along the mapping direction. [default=1]
deformable_group (int) – Number of deformable groups of channels. [default=1]
channel_last (bool) – If True, the last dimension is considered as the channel dimension, a.k.a. NHWC order. [default=False]
 Returns
\((B + 1 + N)\)-D array (\(M_1 \times ... \times M_B \times C' \times L'_1 \times ... \times L'_N\)).
The spatial size of the output is calculated as
\[L'_i = \frac{L_i + 2 p_i - d_i (k_i - 1) - 1}{s_i} + 1,\]
where \(L_i\) is the spatial size, \(p_i\) is the padding, \(d_i\) is the dilation, \(k_i\) is the kernel size, and \(s_i\) is the stride for the \(i\)-th spatial dimension. The same calculation applies to the other spatial dimensions.
 Return type
 nnabla.functions.adaptive_separable_convolution(x, vertical_kernel, horizontal_kernel, n_outputs=-1, outputs=None)[source]¶
2D adaptive separable convolution for NCHW (channel-first) tensors. The sample- and pixel-dependent vertical and horizontal kernels are generated dynamically and are used to approximate a feature-independent 2D kernel. Thus, the kernel used in this function depends on samples and pixels but is independent of features.
If padding is needed, apply the pad function to the input \(x\) before calling this function.
Adaptive separable convolution is formulated as
\[\tilde{I}(c, h, w) = \sum_{j, i} K_v(j, h, w) \times K_h(i, h, w) \times I(c, h + j, w + i),\]
where \(I(c, h, w)\) and \(\tilde{I}(c, h, w)\) are the input and output images at the \(c\)-th channel, \(h\)-th height, and \(w\)-th width. \(K_v(:, h, w)\) and \(K_h(:, h, w)\) are the vertical and horizontal 1D kernels at the \(h\)-th height and \(w\)-th width.
References
 Parameters
 Returns
4-D array (\(B \times C \times (H - K_v + 1) \times (W - K_h + 1)\))
 Return type
 nnabla.functions.max_pooling(x, kernel, stride=None, ignore_border=True, pad=None, channel_last=False, n_outputs=-1, outputs=None)[source]¶
Max pooling. It pools the maximum values inside the scanning kernel:
\[y_{i_1, i_2} = \max_{k_1, k_2 \in K} (x_{i_1 + k_1, i_2 + k_2})\]where \(x_{i_1 + k_1, i_2 + k_2}\) is the input and \(y_{i_1, i_2}\) is the output.
 Parameters
x (Variable) – Input variable.
stride (tuple of int) – Subsampling factors for each spatial axis. [default=kernel]
ignore_border (bool) – If False, kernels covering borders are also considered for the output. [default=True]
pad (tuple of int) – Border padding values for each spatial axis. Padding will be added to both sides of the dimension. [default=(0,) * len(kernel)]
channel_last (bool) – If True, the last dimension is considered as the channel dimension, a.k.a. NHWC order. [default=False]
 Returns
Maximum values variable
 Return type
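The max pooling formula can be sketched as a naive NumPy loop. This is an illustration of the scanning-kernel idea only, covering the no-padding case with borders ignored; `max_pooling_2d` is a hypothetical helper, not NNabla's implementation.

```python
import numpy as np

def max_pooling_2d(x, kernel, stride=None):
    """Naive 2D max pooling over the last two axes (no padding, borders ignored)."""
    stride = kernel if stride is None else stride   # default stride equals the kernel
    kh, kw = kernel
    sh, sw = stride
    out_h = (x.shape[-2] - kh) // sh + 1
    out_w = (x.shape[-1] - kw) // sw + 1
    y = np.empty(x.shape[:-2] + (out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[..., i * sh:i * sh + kh, j * sw:j * sw + kw]
            y[..., i, j] = window.max(axis=(-2, -1))
    return y

x = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)
y = max_pooling_2d(x, kernel=(2, 2))
print(y[0, 0])
# [[ 5.  7.]
#  [13. 15.]]
```

Replacing `window.max` with `window.mean` or `window.sum` gives the corresponding sketch for average pooling and sum pooling without padding.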
 nnabla.functions.average_pooling(x, kernel, stride=None, ignore_border=True, pad=None, channel_last=False, including_pad=True, n_outputs=-1, outputs=None)[source]¶
Average pooling. It pools the averaged values inside the scanning kernel:
\[y_{i_1, i_2} = \frac{1}{K_1 K_2} \sum_{k1} \sum_{k2} x_{i_1 + k_1, i_2 + k_2}\]where \(x_{i_1 + k_1, i_2 + k_2}\) is the input and \(y_{i_1, i_2}\) is the output.
 Parameters
x (Variable) – Input variable.
stride (tuple of int) – Subsampling factors for each spatial axis. [default=kernel]
ignore_border (bool) – If False, kernels covering borders are also considered for the output. [default=True]
pad (tuple of int) – Border padding values for each spatial axis. Padding will be added to both sides of the dimension. [default=(0,) * len(kernel)]
channel_last (bool) – If True, the last dimension is considered as the channel dimension, a.k.a. NHWC order. [default=False]
including_pad (bool) – If True, border padding values are considered for the output. [default=True]
 Returns
Average values variable
 Return type
 nnabla.functions.global_average_pooling(x, n_outputs=-1, outputs=None)[source]¶
Warning
Support for this function is experimental, so please avoid relying on it.
Global average pooling. It pools an averaged value from the whole image.
 nnabla.functions.sum_pooling(x, kernel, stride=None, ignore_border=True, pad=None, channel_last=False, n_outputs=-1, outputs=None)[source]¶
Sum pooling. It pools the summed values inside the scanning kernel:
\[y_{i_1, i_2} = \sum_{k1} \sum_{k2} x_{i_1 + k_1, i_2 + k_2}\]where \(x_{i_1 + k_1, i_2 + k_2}\) is the input and \(y_{i_1, i_2}\) is the output.
 Parameters
x (Variable) – Input variable.
stride (tuple of int) – Subsampling factors for each spatial axis. [default=kernel]
ignore_border (bool) – If False, kernels covering borders are also considered for the output. [default=True]
pad (tuple of int) – Border padding values for each spatial axis. Padding will be added to both sides of the dimension. [default=(0,) * len(kernel)]
channel_last (bool) – If True, the last dimension is considered as the channel dimension, a.k.a. NHWC order. [default=False]
 Returns
Summed values variable
 Return type
 nnabla.functions.unpooling(x, kernel, channel_last=False, n_outputs=-1, outputs=None)[source]¶
Inverse operation of pooling. It spreads the input values:
\[y_{k_1 i_1 + j_1, k_2 i_2 + j_2} = x_{i_1, i_2}\]
where \(x_{i_1, i_2}\) is the input and \(y_{k_1 i_1 + j_1, k_2 i_2 + j_2}\) is the output.
 Parameters
 Returns
Spread values variable
 Return type
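The spreading rule of unpooling can be reproduced in NumPy by repeating each input element. This sketch assumes that `kernel` is the pair \((k_1, k_2)\) and that the offsets \(j_1, j_2\) range over \(0, \ldots, k - 1\); `unpooling_2d` is a hypothetical helper name.

```python
import numpy as np

def unpooling_2d(x, kernel):
    """Copy every input value into a k1 x k2 block of the output."""
    k1, k2 = kernel
    return np.repeat(np.repeat(x, k1, axis=-2), k2, axis=-1)

x = np.array([[1, 2],
              [3, 4]])
y = unpooling_2d(x, (2, 2))
print(y)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```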
 nnabla.functions.embed(x0, w, n_outputs=-1, outputs=None)[source]¶
Embed slices of a matrix/tensor with indexing array/tensor.
 Parameters
 Returns
Output with shape \((I_0, ..., I_N, W_1, ..., W_M)\)
 Return type
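The output shape of the embedding lookup matches NumPy integer-array indexing. The sketch below assumes that `x0` holds integer indices with shape \((I_0, ..., I_N)\) and that `w` has shape \((W_0, W_1, ..., W_M)\), with the first axis being the one that is indexed.

```python
import numpy as np

w = np.arange(12).reshape(4, 3)    # 4 rows to pick from, each of size 3
x0 = np.array([[0, 2],
               [3, 3]])            # index array with shape (2, 2)
y = w[x0]                          # one row of w per index
print(y.shape)                     # (2, 2, 3)
print(y[0, 1])                     # [6 7 8]
```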
 nnabla.functions.rnn(x, h, weight_l0, weight=None, bias=None, num_layers=1, nonlinearity='tanh', dropout=None, bidirectional=False, training=True, n_outputs=-1, outputs=None)[source]¶
Implements an Elman RNN that applies a nonlinearity to the input sequence. The RNN function is defined as follows:
\[{\mathbf h_t} = {\mathbf \tanh}( {\mathbf w_{ih}} *{\mathbf x_t} + {\mathbf b_{ih}} + {\mathbf w_{hh}}* {\mathbf h_{(t-1)}} + {\mathbf b_{hh}}).\]
We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
References
 Parameters
x (Variable) – Input N-D array with shape \((T, B, I)\).
h (Variable) – Input N-D array with shape \((L, D, B, H)\).
weight_l0 (Variable) – Input N-D array with shape \((D, H, I + H)\). [parameter]
weight (Variable) – Input N-D array with shape \((L-1, D, H, D * H + H)\). [optional][parameter]
bias (Variable) – Input N-D array with shape \((L, D, H)\). [optional][parameter]
num_layers (int) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. [default=1]
nonlinearity (string) – Type of nonlinearity applied to the input sequence. Must be either tanh or relu. [default='tanh']
dropout (float) – Dropout ratio applied to parameters. [default=0.0]
bidirectional (bool) – If True, bidirectional computation will be performed in each layer. [default=False]
training (bool) – Backpropagation will be performed only when it is True. [default=True]
 Returns
Output \(y\) with shape \((T, B, D * H)\).
Output \(h_n\) with shape \((L, D, B, H)\).
 Return type
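The Elman recurrence can be unrolled over a sequence in NumPy. This is a single-layer, unidirectional sketch with separate `w_ih` and `w_hh` matrices, as in the equation. How NNabla packs these matrices into `weight_l0` is not covered here, and `elman_step` is a hypothetical helper name.

```python
import numpy as np

def elman_step(x_t, h_prev, w_ih, w_hh, b_ih, b_hh):
    """One time step: h_t = tanh(w_ih x_t + b_ih + w_hh h_{t-1} + b_hh)."""
    return np.tanh(x_t @ w_ih.T + b_ih + h_prev @ w_hh.T + b_hh)

T, B, I, H = 5, 2, 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((T, B, I))
w_ih = rng.standard_normal((H, I))
w_hh = rng.standard_normal((H, H))
b_ih = np.zeros(H)
b_hh = np.zeros(H)

h = np.zeros((B, H))              # initial hidden state
outputs = []
for t in range(T):
    h = elman_step(x[t], h, w_ih, w_hh, b_ih, b_hh)
    outputs.append(h)
y = np.stack(outputs)             # hidden state of every time step
print(y.shape, h.shape)           # (5, 2, 4) (2, 4)
```

For this single layer, `y` plays the role of the output sequence with shape \((T, B, H)\) and the final `h` plays the role of \(h_n\).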
 nnabla.functions.lstm(x, h, c, weight_l0, weight=None, bias=None, num_layers=1, dropout=None, bidirectional=False, training=True, n_outputs=-1, outputs=None)[source]¶
N-Step LSTM layer.
\[\begin{split}{\mathbf f_t} &=& {\mathbf \sigma}( {\mathbf W_f} *{\mathbf x_t} + {\mathbf U_f}* {\mathbf h_{(t-1)}} + {\mathbf b_f})\\ {\mathbf i_t} &=& {\mathbf \sigma}( {\mathbf W_i} *{\mathbf x_t} + {\mathbf U_i}* {\mathbf h_{(t-1)}} + {\mathbf b_i})\\ {\mathbf o_t} &=& {\mathbf \sigma}( {\mathbf W_o} *{\mathbf x_t} + {\mathbf U_o}* {\mathbf h_{(t-1)}} + {\mathbf b_o})\\ {\mathbf c_t} &=& {\mathbf f_t}\odot {\mathbf c_{(t-1)}} + {\mathbf i_t}\odot {\mathbf \tanh}({\mathbf W_c}*{\mathbf x_t} + {\mathbf U_c} *{\mathbf h_{(t-1)}} + {\mathbf b_c})\\ {\mathbf h_t} &=& {\mathbf o_t} \odot {\mathbf \tanh}({\mathbf c_t}).\end{split}\]
We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
References
 Parameters
x (Variable) – Input N-D array with shape \((T, B, I)\).
h (Variable) – Input N-D array with shape \((L, D, B, H)\).
c (Variable) – Input N-D array with shape \((L, D, B, H)\).
weight_l0 (Variable) – Weight parameters for the first layer. Shape is \((D, 4, H, I + H)\). [parameter]
weight (Variable) – Weight parameters for the second layer and above. Shape is \((L-1, D, 4, H, D * H + H)\). [optional][parameter]
bias (Variable) – Bias vector (\(L\)). Shape is \((L, D, 4, H)\). [optional][parameter]
num_layers (int) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. [default=1]
dropout (float) – Dropout ratio applied to parameters. [default=0.0]
bidirectional (bool) – If True, bidirectional computation will be performed in each layer. [default=False]
training (bool) – Backpropagation will be performed only when it is True. [default=True]
 Returns
Output \(y\) with shape \((T, B, D * H)\). Its memory layout can be reshaped as \((T, B, D, H)\).
Output \(h_n\) with shape \((L, D, B, H)\).
Output \(c_n\) with shape \((L, D, B, H)\).
 Return type
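A single LSTM time step following the equations above can be written in NumPy. This is a sketch with one weight matrix per gate stored in dictionaries. The packed \((D, 4, H, I + H)\) layout and its gate order are not modeled, and `lstm_step` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b are dicts keyed by gate: 'f', 'i', 'o', 'c'."""
    def pre(gate):
        # Pre-activation shared by all gates: W x_t + U h_{t-1} + b
        return x_t @ W[gate].T + h_prev @ U[gate].T + b[gate]
    f_t = sigmoid(pre('f'))                          # forget gate
    i_t = sigmoid(pre('i'))                          # input gate
    o_t = sigmoid(pre('o'))                          # output gate
    c_t = f_t * c_prev + i_t * np.tanh(pre('c'))     # new cell state
    h_t = o_t * np.tanh(c_t)                         # new hidden state
    return h_t, c_t

B, I, H = 2, 3, 4
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((H, I)) for g in 'fioc'}
U = {g: rng.standard_normal((H, H)) for g in 'fioc'}
b = {g: np.zeros(H) for g in 'fioc'}
h, c = lstm_step(rng.standard_normal((B, I)), np.zeros((B, H)), np.zeros((B, H)), W, U, b)
print(h.shape, c.shape)   # (2, 4) (2, 4)
```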
 nnabla.functions.gru(x, h, weight_l0, weight=None, bias=None, num_layers=1, dropout=None, bidirectional=False, training=True, n_outputs=-1, outputs=None)[source]¶
N-Step GRU layer.
\[\begin{split}{\mathbf r_t} &=& {\mathbf \sigma}( {\mathbf W_r} *{\mathbf x_t} + {\mathbf U_r}* {\mathbf h_{(t-1)}} + {\mathbf b_r})\\ {\mathbf z_t} &=& {\mathbf \sigma}( {\mathbf W_z} *{\mathbf x_t} + {\mathbf U_z}* {\mathbf h_{(t-1)}} + {\mathbf b_z})\\ {\mathbf n_t} &=& {\mathbf \tanh}( {\mathbf W_n}{\mathbf x_t}+ {\mathbf b_{in}}+ {\mathbf r_t}\odot( {\mathbf U_n}{\mathbf h_{t-1}}+ {\mathbf b_{hn}})) \\ {\mathbf h_t} &=& (1 - {\mathbf z_t})\odot {\mathbf n_t} + {\mathbf z_t}\odot {\mathbf h_{t-1}}.\end{split}\]
We use the following notations to describe the inputs and outputs below. \(T\): sequence length, \(B\): batch size, \(I\): input size, \(L\): number of layers, \(D\): number of directions, can be either 1 or 2, \(H\): hidden size.
References
 Parameters
x (Variable) – Input N-D array with shape \((T, B, I)\).
h (Variable) – Input N-D array with shape \((L, D, B, H)\).
weight_l0 (Variable) – Weight parameters for the first layer. Shape is \((D, 3, H, I + H)\). [parameter]
weight (Variable) – Weight parameters for the second layer and above. Shape is \((L-1, D, 3, H, D * H + H)\). [optional][parameter]
bias (Variable) – Bias vector (\(L\)). Shape is \((L, D, 4, H)\). [optional][parameter]
num_layers (int) – Number of layers in the network. If set to 1, only the weights for the first layer will be invoked. [default=1]
dropout (float) – Dropout ratio applied to parameters. [default=0.0]
bidirectional (bool) – If True, bidirectional computation will be performed in each layer. [default=False]
training (bool) – Backpropagation will be performed only when it is True. [default=True]
 Returns
Output \(y\) with shape \((T, B, D * H)\). Its memory layout can be reshaped as \((T, B, D, H)\).
Output \(h_n\) with shape \((L, D, B, H)\).
 Return type
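A single GRU time step following the equations above can be written in NumPy. This sketch keeps one matrix per gate and the four bias vectors (`r`, `z`, `in`, `hn`) in dictionaries. The packed parameter layout of the function is not modeled, and `gru_step` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step. W and U are keyed by 'r', 'z', 'n'; b by 'r', 'z', 'in', 'hn'."""
    r_t = sigmoid(x_t @ W['r'].T + h_prev @ U['r'].T + b['r'])   # reset gate
    z_t = sigmoid(x_t @ W['z'].T + h_prev @ U['z'].T + b['z'])   # update gate
    # The reset gate scales only the hidden-state contribution of the candidate.
    n_t = np.tanh(x_t @ W['n'].T + b['in'] + r_t * (h_prev @ U['n'].T + b['hn']))
    return (1.0 - z_t) * n_t + z_t * h_prev

B, I, H = 2, 3, 4
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((H, I)) for g in 'rzn'}
U = {g: rng.standard_normal((H, H)) for g in 'rzn'}
b = {k: np.zeros(H) for k in ('r', 'z', 'in', 'hn')}
h_next = gru_step(rng.standard_normal((B, I)), np.zeros((B, H)), W, U, b)
print(h_next.shape)   # (2, 4)
```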
 nnabla.functions.multi_head_attention(query, key, value, num_heads, q_weight, k_weight, v_weight, out_weight, q_bias=None, k_bias=None, v_bias=None, out_bias=None, attn_bias_k=None, attn_bias_v=None, dropout=0.0, additive_mask=None, key_padding_mask=None)[source]¶
MultiHeadAttention.
Computes multi-headed attention with query, key, and value. We use the following notations to describe the inputs and outputs below. \(L_T\): target sequence length, \(L_S\): source sequence length, \(B\): batch size, \(D\): input dimension, \(E\): embedding dimension, \(H\): number of attention heads.
References
A. Vaswani et al. “Attention is All You Need.” NIPS. 2017. <https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf>
 Parameters
query (Variable) – Input N-D array with shape \((L_T, B, D_q)\).
key (Variable) – Input N-D array with shape \((L_S, B, D_k)\).
value (Variable) – Input N-D array with shape \((L_S, B, D_v)\).
num_heads (int) – Number of attention heads. Note that the embedding dimension E must be divisible by the number of heads. Default is 12, which is conventional.
q_weight (Variable) – Input N-D array with shape \((D_q, E)\).
k_weight (Variable) – Input N-D array with shape \((D_k, E)\).
v_weight (Variable) – Input N-D array with shape \((D_v, E_v)\).
out_weight (Variable) – Input N-D array with shape \((D_v, E_{out})\).
q_bias (Variable, optional) – Input N-D array with shape \((E, )\).
k_bias (Variable, optional) – Input N-D array with shape \((E, )\).
v_bias (Variable, optional) – Input N-D array with shape \((E_v, )\).
out_bias (Variable, optional) – Input N-D array with shape \((E_{out}, )\).
attn_bias_k (Variable, optional) – Input N-D array with shape \((E, )\).
attn_bias_v (Variable, optional) – Input N-D array with shape \((E_v, )\).
dropout (float, optional) – Dropout ratio applied to parameters. Default is 0.
additive_mask (Variable, optional) – Input N-D array with shape \((L_T, L_S)\). Values will be added to the attention layer to prevent attention to certain positions.
key_padding_mask (Variable, optional) – Input N-D array with shape \((B, L_S)\). Specified padding elements will be ignored by the attention layer. Values must be either 1 or 0.
 Returns
Output \(y\) with shape \((L_T, B, E_{out})\).
Output \(h_n\) with shape \((B, L_T, L_S)\).
 Return type
 nnabla.functions.patch_correlation(x1, x2, patch=(1, 1), shift=(0, 0), patch_step=(1, 1), shift_step=(1, 1), padding=(0, 0, 0, 0), channel_last=False)[source]¶
Multiplicative patch-wise comparison between inputs x1 and x2, which must both be 4-dimensional NCHW (with channel_last=False) or NHWC (with channel_last=True) arrays (where N is the number of samples, H and W are the sample height and width, and C is the number of channels). The function returns a 5-D array with shape \((N, C_y, C_x, H_o, W_o)\), where \(H_o, W_o\) are determined by the possible patch locations within the (optionally padded) input image size and \(C_y, C_x\) are determined by the optionally shifted patch positions.
Mathematically, the patch correlation is formulated as
\[O(s_y, s_x, h_0, w_0) = \sum_{c} \sum_{k_h} \sum_{k_w} I_1(c, h + k_h, w + k_w) \times I_2(c, h + k_h + s_h, w + k_w + s_w),\]where \(I_1(c, h, w)\) and \(I_2(c, h, w)\) are the inputs at \(c\)th channel, \(h\)th height, and \(w\)th width, \(k_h, k_w\) indices for the patch size and \(s_h, s_w\) indices for the shifts.
A single correlation value (per sample) is produced if the patch extends to the image dimensions and all other parameters use the default values.
>>> import numpy as np, nnabla as nn, nnabla.functions as F >>> N, C, H, W = (1, 2, 3, 4) >>> x = nn.Variable.from_numpy_array(np.ones([N, C, H, W])) >>> F.patch_correlation(x, x, patch=(H, W)).d array([[[[[24.]]]]], dtype=float32)
A patch that is smaller than the image size moves horizontal and vertical producing a value per position. The
patch_step
argument may be used to control the position increments.>>> F.patch_correlation(x, x, patch=(H1, W1)).d array([[[[[12., 12.], [12., 12.]]]]], dtype=float32) >>> F.patch_correlation(x, x, patch=(H1, W1), patch_step=(2, 1)).d array([[[[[12., 12.]]]]], dtype=float32)
Multiple correlations may be performed at each position between the patch from
x1
and patches fromx2
at relative offsets striding the maximum vertical and horizontal distance given by theshift
values at increments ofshift_step
. The shifted correlation values can be obtained for the from the second and third output dimension for the vertical and horizontal shifts.>>> F.patch_correlation(x, x, (H, 1), shift=(0, 1)).shape (1, 1, 3, 1, 4) >>> F.patch_correlation(x, x, (H, 1), shift=(0, 1)).d array([[[[[0., 6., 6., 6.]], [[6., 6., 6., 6.]], [[6., 6., 6., 0.]]]]], dtype=float32) >>> F.patch_correlation(x, x, (H, 1), shift=(0, 1), shift_step=(1, 2)).d array([[[[[0., 6., 6., 6.]], [[6., 6., 6., 0.]]]]], dtype=float32)
Padding with zero values may be applied individually to the top, bottom, left and right side of the input image.
>>> F.patch_correlation(x, x, patch=(H, W), padding=(0, 1, W, W)).d array([[[[[ 0., 6., 12., 18., 24., 18., 12., 6., 0.], [ 0., 4., 8., 12., 16., 12., 8., 4., 0.]]]]], dtype=float32)
This function may be used to implement the FlowNetC correlation layer.
>>> N, C, H, W = (1, 256, 44, 60)
>>> x1, x2 = nn.Variable((N, C, H, W)), nn.Variable((N, C, H, W))
>>> F.patch_correlation(x1, x2, shift=20, shift_step=2).shape
(1, 21, 21, 44, 60)
 Parameters
x1 (Variable) – Input ND array with shape \((N, C, H, W)\) or \((N, H, W, C)\).
x2 (Variable) – Input ND array with shape \((N, C, H, W)\) or \((N, H, W, C)\).
patch – A tuple with height and width of the correlation patch. A single integer expands to identical height and width.
shift – A tuple of maximum vertical and horizontal displacement of patches from x2 that are correlated with a single patch from x1. A single integer expands to identical vertical and horizontal displacement.
patch_step – A tuple of vertical and horizontal increments for advancing the position of the correlation patch within the input image shape. A single integer expands to identical vertical and horizontal increments.
shift_step – A tuple of vertical and horizontal increments for advancing the relative offset position within the shift range. A single integer expands to identical vertical and horizontal increments.
padding – A tuple of top, bottom, left and right padding extent. A tuple of two values yields identical top/bottom and left/right padding from the first and second tuple value. A single integer expands to identical padding extent for all sides.
channel_last – Last dimension is the channel (NHWC order) if True.
 Returns
ND array with shape \((N, C_y, C_x, H_o, W_o)\) or \((N, H, W, C_y, C_x)\) if channel_last=True.
The spatial size of the output is calculated as
\[H_o = \frac{H + (top\_pad + bottom\_pad) - patch_v}{patch\_step_v} + 1.\]
The channel size of the output is calculated as
\[C_y = \frac{2 \times shift_v}{shift\_step_v} + 1.\]
\(W_o\) and \(C_x\) are calculated in the same way from the corresponding horizontal components.
 Return type
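The output-shape formulas for patch_correlation can be checked against the doctest shapes with a small helper. This is a plain-Python sketch, not part of NNabla: the name `patch_correlation_shape` is hypothetical, it assumes NCHW input, and it assumes the fractions are evaluated with integer (floor) division, which is consistent with the example shapes shown above.

```python
def _pair(value):
    """Expand a single integer to a (vertical, horizontal) pair."""
    return (value, value) if isinstance(value, int) else tuple(value)


def patch_correlation_shape(in_shape, patch=1, shift=0, patch_step=1,
                            shift_step=1, padding=0):
    """Return (N, C_y, C_x, H_o, W_o) for an NCHW input shape (hypothetical helper)."""
    n, _, h, w = in_shape
    patch_v, patch_h = _pair(patch)
    shift_v, shift_h = _pair(shift)
    pstep_v, pstep_h = _pair(patch_step)
    sstep_v, sstep_h = _pair(shift_step)
    # Padding is (top, bottom, left, right); two values mean (top/bottom, left/right).
    if isinstance(padding, int):
        top = bottom = left = right = padding
    elif len(padding) == 2:
        top = bottom = padding[0]
        left = right = padding[1]
    else:
        top, bottom, left, right = padding
    # Floor division is an assumption made for this sketch.
    h_o = (h + top + bottom - patch_v) // pstep_v + 1
    w_o = (w + left + right - patch_h) // pstep_h + 1
    c_y = (2 * shift_v) // sstep_v + 1
    c_x = (2 * shift_h) // sstep_h + 1
    return (n, c_y, c_x, h_o, w_o)


flownet_shape = patch_correlation_shape((1, 256, 44, 60), shift=20, shift_step=2)
padded_shape = patch_correlation_shape((1, 2, 3, 4), patch=(3, 4), padding=(0, 1, 4, 4))
```

Here `flownet_shape` corresponds to the FlowNetC example and `padded_shape` to the padding example.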
Neural Network Activation¶
 nnabla.functions.sigmoid(x, n_outputs=-1, outputs=None)
Elementwise sigmoid function.
\[f(x) = \frac{1}{1 + \exp(-x)},\]
Note
All nnabla functions in nnabla.functions are decorated with the nnabla.function_bases.function_api decorator, which queries the current context and passes it into the first argument of the original function. The original function always takes a context as the first argument.
 nnabla.functions.swish(x, n_outputs=-1, outputs=None)
Elementwise swish function, by Ramachandran et al. (2017).
\[y_i = \frac{x_i}{1 + \exp(-x_i)},\]
 nnabla.functions.tanh(x, n_outputs=-1, outputs=None)
Elementwise hyperbolic tangent (tanh) function.
\[y_i = \tanh (x_i)\]
 nnabla.functions.relu(x, inplace=False, n_outputs=-1, outputs=None)
Elementwise Rectified Linear Unit (ReLU) function.
\[y_i = \max (0, x_i)\]
 Parameters
 Returns
ND array with the same shape as x
 Return type
 nnabla.functions.softmax(x, axis=None, n_outputs=-1, outputs=None)
Softmax normalization. Calculates
\[y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\]
along the dimension specified by axis, where \(x_i\) is the input and \(y_i\) is the output.
 Parameters
 Returns
ND array with the same shape as x
 Return type
 nnabla.functions.log_softmax(x, axis=None, n_outputs=-1, outputs=None)
Fused operation of Softmax normalization followed by log, which is defined as
\[y_i = \log \frac{\exp(x_i)}{\sum_j \exp(x_j)},\]
where \(x_i\) is the input and \(y_i\) is the output at the i-th channel. An advantage of this fusion is that it reduces the numerical instability caused by applying the log.
The original definition can be rewritten as
\[y_i = x_i - \max_j(x_j) - \log\left(\sum_j \exp(x_j - \max_k(x_k))\right).\]
It is more stable because the log is always applied to a value \(\ge 1\) (the term with \(x_j = \max_k(x_k)\) contributes \(\exp(0) = 1\)), while the log can be evaluated at 0 in the non-fused operation.
Also, the backward gradient computation is more stable than in the original definition, as it does not perform the division introduced by the gradient of log. It is defined as follows.
\[dx_i = dy_i - \exp(y_i) \sum_j dy_j\]
where \(dx_i\) and \(dy_i\) denote the gradients of the loss with respect to \(x_i\) and \(y_i\), respectively.
 Parameters
 Returns
ND array with the same shape as x
 Return type
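The max-shifted form of log-softmax and its backward rule can be illustrated in NumPy. This is a sketch that is independent of NNabla's implementation; the function names are chosen for illustration only, and the backward rule multiplies by the softmax value \(\exp(y_i)\). The analytic gradient is compared with a central finite difference.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Stable log-softmax: subtract the maximum before exponentiating."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))


def log_softmax_backward(y, dy, axis=-1):
    """Gradient w.r.t. the input, given the output y and upstream gradient dy."""
    return dy - np.exp(y) * np.sum(dy, axis=axis, keepdims=True)


def numeric_grad(x, dy, h=1e-6):
    """Central finite difference of sum(dy * log_softmax(x)) w.r.t. x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        plus, minus = x.copy(), x.copy()
        plus[idx] += h
        minus[idx] -= h
        grad[idx] = np.sum(dy * (log_softmax(plus) - log_softmax(minus))) / (2 * h)
    return grad


# Large inputs stay finite because exp() only ever sees values <= 0.
y_large = log_softmax(np.array([[1000.0, 1001.0, 1002.0]]))

x0 = np.array([[0.1, -0.3, 0.7]])
dy0 = np.array([[0.5, -1.0, 2.0]])
analytic = log_softmax_backward(log_softmax(x0), dy0)
numeric = numeric_grad(x0, dy0)
```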
 nnabla.functions.elu(x, alpha=1.0, n_outputs=-1, outputs=None)
Elementwise Exponential Linear Unit (ELU) function.
\[\begin{split}y_i= \left\{ \begin{array}{ll} x_i & (x_i > 0)\\ \alpha (\exp(x_i) - 1) & (x_i \leq 0) \end{array} \right..\end{split}\]
 Parameters
 Returns
ND array with the same shape as x
 Return type
 nnabla.functions.selu(x, scale=1.05070098735548, alpha=1.673263242354377, n_outputs=-1, outputs=None)
Elementwise Scaled Exponential Linear Unit (SELU) function by Klambauer et al. (2017).
\[\begin{split}y_i= \lambda \left\{ \begin{array}{ll} x_i & (x_i > 0)\\ \alpha (\exp(x_i) - 1) & (x_i \leq 0) \end{array} \right..\end{split}\]
The coefficients \(\lambda\) and \(\alpha\) default to the following values \(\lambda_{01}\) and \(\alpha_{01}\), respectively, provided by Klambauer et al. (2017):
\[\begin{split}\begin{array}{lll} \lambda_{01} &=& \left( 1 - \operatorname{erfc}\left( \frac{1}{\sqrt{2}} \right) \sqrt{e} \right) \sqrt{2 \pi} \\ && \left( 2 \operatorname{erfc} \left( \sqrt{2} \right) e^2 + \pi \operatorname{erfc}\left( \frac{1}{\sqrt{2}} \right)^2 e \right. \\ && \left. - 2(2 + \pi) \operatorname{erfc} \left( \frac{1}{\sqrt{2}} \right) \sqrt{e} + \pi + 2 \right)^{-1/2} \\ &\approx& 1.0507 \\ \alpha_{01} &=& - \frac {\sqrt {\frac {2}{\pi}}} {\operatorname{erfc} \left( \frac{1}{\sqrt{2}} \right) \exp \left(\frac {1} {2} \right) - 1} \\ &\approx& 1.67326 \end{array}\end{split}\]
 Parameters
 Returns
ND array with the same shape as x
 Return type
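The closed-form SELU coefficients can be evaluated with the standard library to confirm that they match the default scale and alpha in the signature. This is a sketch using only `math`; the scalar `selu` helper is illustrative and is not the NNabla function.

```python
import math

erfc_a = math.erfc(1.0 / math.sqrt(2.0))
erfc_b = math.erfc(math.sqrt(2.0))
sqrt_e = math.sqrt(math.e)

# alpha_01 = -sqrt(2/pi) / (erfc(1/sqrt(2)) * exp(1/2) - 1)
alpha_01 = -math.sqrt(2.0 / math.pi) / (erfc_a * math.exp(0.5) - 1.0)

# lambda_01: product of the prefactor and the bracketed term raised to -1/2.
lambda_01 = (1.0 - erfc_a * sqrt_e) * math.sqrt(2.0 * math.pi) * (
    2.0 * erfc_b * math.e ** 2
    + math.pi * erfc_a ** 2 * math.e
    - 2.0 * (2.0 + math.pi) * erfc_a * sqrt_e
    + math.pi + 2.0
) ** -0.5


def selu(x, scale=lambda_01, alpha=alpha_01):
    """Scalar SELU following the piecewise definition (illustrative helper)."""
    return scale * (x if x > 0 else alpha * (math.exp(x) - 1.0))
```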
 nnabla.functions.crelu(x, axis=1, n_outputs=-1, outputs=None)
Elementwise Concatenated Rectified Linear Unit (CReLU) function. This function calculates the ReLU of \(x\) and \(-x\), then concatenates the results together at a specified axis, and returns the resulting array.
 Parameters
 Returns
ND array where axis dimension is doubled by concatenating.
 Return type
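The doubling of the axis dimension in CReLU can be seen in a short NumPy sketch. It mirrors the description (ReLU of \(x\) and of \(-x\), concatenated along axis) but is not the NNabla implementation.

```python
import numpy as np


def crelu(x, axis=1):
    """Concatenate ReLU(x) and ReLU(-x) along `axis`."""
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)


x = np.array([[-1.0, 2.0],
              [3.0, -4.0]])
y = crelu(x, axis=1)  # shape (2, 4): the axis-1 size is doubled
```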
 nnabla.functions.celu(x, alpha=1.0, axis=1, n_outputs=-1, outputs=None)
Elementwise Concatenated Exponential Linear Unit (CELU) function. Concatenates ELU outputs of positive and negative inputs together at specified axis.
 Parameters
 Returns
ND array where axis dimension is doubled by concatenating.
 Return type
 nnabla.functions.gelu(x, n_outputs=-1, outputs=None)
Gaussian Error Linear Unit (GELU) function.
\[GELU(x) = xP(X \leq x) = x \Phi (x)\]
which is approximated by
\[GELU(x) = 0.5x \left(1 + \tanh \left( \sqrt{2/\pi}\,(x + 0.044715x^3) \right)\right)\]
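The exact definition \(x \Phi(x)\) and the tanh approximation can be compared numerically. The sketch below uses only the standard library and writes \(\Phi\) through `math.erf`; the function names are illustrative.

```python
import math


def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF expressed via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two forms on a grid over [-5, 5].
max_gap = max(abs(gelu_exact(v / 10.0) - gelu_tanh(v / 10.0))
              for v in range(-50, 51))
```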
 nnabla.functions.mish(x, n_outputs=-1, outputs=None)
Mish activation function.
\[Mish(x_i) = x_i \tanh(\log(1+\exp(x_i)))\]
 nnabla.functions.prelu(x0, x1, base_axis=1, n_outputs=-1, outputs=None)
Elementwise Parametrized Rectified Linear Unit function. Calculates:
\[y_i = \max(0, x_i) + w_i \min(0, x_i)\]
where negative slope \(w\) is learned and can vary across channels (an axis specified with base_axis).
 Parameters
 Returns
ND array.
 Return type
 nnabla.functions.leaky_relu(x, alpha=0.1, inplace=False, n_outputs=-1, outputs=None)
Elementwise Leaky Rectified Linear Unit (ReLU) function.
It is defined as:
\[y_i = \alpha * \min(0, x_i) + \max (0, x_i)\]
 Parameters
 Returns
ND array with the same shape as x
 Return type
 nnabla.functions.relu6(x, n_outputs=-1, outputs=None)
Elementwise ReLU6 function. Capping ReLU activation to 6 is often observed to learn sparse features earlier.
\[ReLU6(x) = \min(\max(0,x),6)\]
 nnabla.functions.hard_sigmoid(x, n_outputs=-1, outputs=None)
Segmentwise linear approximation of sigmoid. Preferable when speed of computation is more important than precision. Returns \(0\) if \(x < -2.5\). Returns \(1\) if \(x > 2.5\). Returns \(0.2x + 0.5\) if \(-2.5 \leq x \leq 2.5\).
 nnabla.functions.hard_tanh(x, n_outputs=-1, outputs=None)
Elementwise HardTanh function. Computationally cheaper than Tanh function. Returns \(1\) if \(x > 1\). Returns \(-1\) if \(x < -1\). Returns \(x\) otherwise.
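Both piecewise-linear activations, hard_sigmoid and hard_tanh, reduce to a single clipping operation. The NumPy sketch below is illustrative only and is not the NNabla implementation.

```python
import numpy as np


def hard_sigmoid(x):
    """Clip the line 0.2 * x + 0.5 to [0, 1]; saturates outside [-2.5, 2.5]."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)


def hard_tanh(x):
    """Clip x to [-1, 1]."""
    return np.clip(x, -1.0, 1.0)


samples = np.array([-3.0, -2.5, 0.0, 2.5, 3.0])
hs = hard_sigmoid(samples)
ht = hard_tanh(samples)
```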
 nnabla.functions.log_sigmoid(x, n_outputs=-1, outputs=None)
Elementwise LogSigmoid function.
\[LogSigmoid(x_i) = \log(1/(1+\exp(-x_i)))\]
 nnabla.functions.softplus(x, beta=1.0, n_outputs=-1, outputs=None)
Elementwise SoftPlus function. Unlike Sigmoid and Tanh, which have upper and lower bounds, SoftPlus is only lower-bounded by 0.
\[SoftPlus(x_i) = \frac{1}{\beta} * \log(1+\exp(\beta * x_i))\]
 Parameters
 Returns
ND array with the same shape as x
 Return type
 nnabla.functions.softsign(x, n_outputs=-1, outputs=None)
Elementwise SoftSign. Can be used in place of Tanh function. While Tanh converges exponentially, SoftSign converges polynomially.
\[SoftSign(x) = x/(1+|x|)\]
 nnabla.functions.tanh_shrink(x, n_outputs=-1, outputs=None)
Elementwise TanhShrink function.
\[TanhShrink(x) = x - \tanh(x)\]
 nnabla.functions.sinc(x, n_outputs=-1, outputs=None)
Elementwise Sinc function. Unlike other popular activation functions, it is not monotonic: it rises and falls. Returns \(1\) if \(x = 0\). Returns \(\sin(x)/x\) otherwise.
Normalization¶
 nnabla.functions.batch_normalization(x, beta, gamma, mean, variance, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, n_outputs=None)
Batch normalization.
\[\begin{split}\begin{eqnarray} \mu &=& \frac{1}{M} \sum x_i \\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2 \\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \\ y_i &=& \hat{x}_i \gamma + \beta. \end{eqnarray}\end{split}\]
At testing time, the mean and variance values used are those that were computed during training by moving average.
 Parameters
x (Variable) – ND array of input.
beta (Variable or None) – ND array of beta which is learned. If None, the bias term is omitted.
gamma (Variable or None) – ND array of gamma which is learned. If None, the scale term is omitted.
mean (Variable or None) – ND array of running mean (modified during forward execution). If None, dummy variable is created and running mean is not updated. mean=None with batch_stat=False is prohibited.
variance (Variable or None) – ND array of running variance (modified during forward execution). If None, dummy variable is created and running variance is not updated. variance=None with batch_stat=False is prohibited.
axes (list of int or int) – Mean and variance are calculated along these axes.
decay_rate (float) – Decay rate of running mean and variance.
eps (float) – Tiny value to avoid zero division by std.
batch_stat (bool) – Use minibatch statistics rather than running ones. If False, mean and variance must be nnabla.Variable. (None is prohibited.)
output_stat (bool) – If true, the batch statistics of mean and variance will be returned as Variables. They are also differentiable.
 Returns
Returns batch normalization output as Variable. If output_stat=True, it also returns the mean and variance of the minibatch.
See also
nnabla.function_bases.batch_normalization.
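The training-time equations for batch normalization (the batch_stat=True case) can be sketched in NumPy. This is not the NNabla implementation: it assumes that `axes` names the channel axis and that statistics are reduced over all remaining axes, so that each channel gets its own mean and variance, and it omits the running-statistics update controlled by decay_rate.

```python
import numpy as np


def batch_norm_forward(x, beta, gamma, axes=(1,), eps=1e-5):
    """Normalize with minibatch statistics and apply scale and bias."""
    # Assumption: reduce over every axis that is NOT listed in `axes`.
    reduce_axes = tuple(i for i in range(x.ndim) if i not in axes)
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)  # population variance (1/M)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * gamma + beta, mean, var


rng = np.random.RandomState(0)
x = rng.randn(8, 3, 4, 4)
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
y, batch_mean, batch_var = batch_norm_forward(x, beta, gamma)
```

With unit gamma and zero beta, each channel of `y` has approximately zero mean and unit variance.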
 nnabla.functions.fused_batch_normalization(x, beta, gamma, mean, variance, z=None, axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, nonlinearity='relu', output_stat=False, n_outputs=None)
Batch normalization fused with an add operation and an activation.
 Parameters
x (Variable) – ND array of input.
beta (Variable or None) – ND array of beta which is learned. If None, the bias term is omitted.
gamma (Variable or None) – ND array of gamma which is learned. If None, the scale term is omitted.
mean (Variable or None) – ND array of running mean (modified during forward execution). If None, dummy variable is created and running mean is never updated. mean=None with batch_stat=False is prohibited.
variance (Variable or None) – ND array of running variance (modified during forward execution). If None, dummy variable is created and running variance is not updated. variance=None with batch_stat=False is prohibited.
z (Variable, optional) – ND array
axes (list of int or int) – Mean and variance are calculated along these axes.
decay_rate (float) – Decay rate of running mean and variance.
eps (float) – Tiny value to avoid zero division by std.
batch_stat (bool) – Use minibatch statistics rather than running ones. If False, mean and variance must be nnabla.Variable. (None is prohibited.)
nonlinearity (str) – Nonlinearity chosen from relu. Default is relu.
output_stat (bool) – If true, the batch statistics of mean and variance will be returned as Variables. They are also differentiable.
 Returns
Returns batch normalization output as Variable. If output_stat=True, it also returns the mean and variance of the minibatch.
See also
nnabla.function_bases.batch_normalization.
 nnabla.functions.sync_batch_normalization(x, beta, gamma, mean, variance, comm, group='world', axes=[1], decay_rate=0.9, eps=1e-05, batch_stat=True, output_stat=False, n_outputs=None)
Synchronized batch normalization.
For some tasks (e.g., semantic segmentation), the batch size per process is too small for the BatchNormalization layer to work well. The SyncBatchNormalization layer addresses this by synchronizing batch statistics (mean and variance) across multiple processes.
\[\begin{split}\begin{eqnarray} \mu &=& \frac{1}{M} \sum x_i \\ \sigma^2 &=& \frac{1}{M} \sum \left(x_i - \mu\right)^2 \\ \hat{x}_i &=& \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \\ y_i &=& \hat{x}_i \gamma + \beta. \end{eqnarray}\end{split}\]
References
Implementing Synchronized Multi-GPU Batch Normalization https://hangzhang.org/PyTorchEncoding/notes/syncbn.html
 Parameters
x (Variable) – ND array of input.
beta (Variable or None) – ND array of beta which is learned. If None, the bias term is omitted.
gamma (Variable or None) – ND array of gamma which is learned. If None, the scale term is omitted.
mean (Variable or None) – ND array of running mean (modified during forward execution). If None, dummy variable is created and running mean is never updated. mean=None with batch_stat=False is prohibited.
variance (Variable or None) – ND array of running variance (modified during forward execution). If None, dummy variable is created and running variance is never updated. variance=None with batch_stat=False is prohibited.
comm (Communicator) – The communicator
group (string) – The name of the communicator group
axes (list of int or int) – Mean and variance are calculated along these axes.
decay_rate (float) – Decay rate of running mean and variance.
eps (float) – Tiny value to avoid zero division by std.
batch_stat (bool) – Use minibatch statistics rather than running ones. If False, mean and variance must be nnabla.Variable. (None is prohibited.)
output_stat (bool) – If true, the batch statistics of mean and variance will be returned as Variables. They are also differentiable.
 Returns
Returns batch normalization output as Variable. If output_stat=True, it also returns the mean and variance of the minibatch.
See also
nnabla.function_bases.batch_normalization.
 nnabla.functions.mean_subtraction(x, mean, t, base_axis=1, update_running_mean=True)
Subtracts the mean of the elements of the input array so that the result has a mean of \(0\). Preprocessing arrays with this function can improve accuracy in tasks such as image classification.
At training time, this function is defined as
\[\begin{split}\begin{eqnarray} \mu &=& \frac{1}{M} \sum x_i \\ y_i &=& x_i - \mu \end{eqnarray}\end{split}\]
At testing time, the mean values used are those that were computed during training by moving average.
Note
The backward performs an approximated differentiation that takes into account only the latest minibatch.
 Parameters
x (Variable) – ND array of input.
mean (Variable) – ND array of running mean (modified during forward execution).
t (Variable) – Scalar holding the number of iterations of the running mean (modified during forward execution).
base_axis (int) – Base axis of Mean Subtraction operation. Dimensions up to base_axis are treated as sample dimensions. [default=1]
update_running_mean (bool) – Update running mean during forward execution. [default=True]
 Returns
ND array.
 Return type
See also
nnabla.function_bases.mean_subtraction.
 nnabla.functions.norm_normalization(x, p=None, axes=None, eps=1e-12)
Norm normalization.
\[y_i = \frac{x_i}{\|x\|_p}\]
 Parameters
x (Variable) – ND array.
p (float) – Order of the norm. [default=2]
axes (repeated int64) – Axes to be reduced. If an empty list is given, all dimensions are reduced. [default=range(x.ndim)]
eps (float) – Epsilon for the normalization. This eps is added before taking the p-th root in the norm computation. [default=1e-12]
 Returns
ND array
 Return type
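A NumPy sketch of norm normalization, with eps added before the p-th root as described for the eps parameter. It is illustrative rather than the NNabla implementation; `axes=None` is taken to mean that all dimensions are reduced.

```python
import numpy as np


def norm_normalization(x, p=2.0, axes=None, eps=1e-12):
    """Divide x by its p-norm computed over `axes` (all axes if None)."""
    norm = (np.sum(np.abs(x) ** p, axis=axes, keepdims=True) + eps) ** (1.0 / p)
    return x / norm


l2_unit = norm_normalization(np.array([3.0, 4.0]))          # L2 norm is 5
l1_unit = norm_normalization(np.array([1.0, -3.0]), p=1.0)  # L1 norm is 4
```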
 nnabla.functions.clip_by_value(x, min, max)
Clip inputs by values.
\[\begin{split}y = \begin{cases} max & (x > max) \\ x & (otherwise) \\ min & (x < min) \end{cases}.\end{split}\]
 Parameters
x (Variable) – An input variable.
min (Variable or float) – A min variable or float value by which x is clipped. Note that if a Variable is given, its shape must be the same as x’s.
max (Variable or float) – A max variable or float value by which x is clipped. Note that if a Variable is given, its shape must be the same as x’s.
 Returns
ND array.
 Return type
 nnabla.functions.clip_grad_by_value(x, min, max, n_outputs=-1, outputs=None)
In the forward pass, the function behaves as the identity.
In the backward pass,
\[\begin{split}g_x = \begin{cases} max & (g_y > max) \\ g_y & (otherwise) \\ min & (g_y < min) \end{cases}.\end{split}\]
A typical use case is to prevent gradient explosion through a whole computational graph. For example, if you want to clip gradient values for each feature map:
import numpy as np
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

x = nn.Variable([16, 3, 32, 32])
# Lower and upper gradient bounds, broadcast to the shape of x.
min = F.broadcast(nn.Variable.from_numpy_array(np.asarray([-1.0]).reshape((1, 1, 1, 1))), (16, 3, 32, 32))
max = F.broadcast(nn.Variable.from_numpy_array(np.asarray([1.0]).reshape((1, 1, 1, 1))), (16, 3, 32, 32))
c = F.clip_grad_by_value(x, min=min, max=max)
h = PF.convolution(c, 64, (3, 3), pad=(1, 1))
 Parameters
x (Variable) – ND array of input.
min (Variable) – ND array of minimum input value by which the gradients of y are clipped. Note that the shape of min must be the same as x’s and the backward to min is not performed.
max (Variable) – ND array of maximum input value by which the gradients of y are clipped. Note that the shape of max must be the same as x’s and the backward to max is not performed.
 Returns
ND array.
 Return type
 nnabla.functions.clip_by_norm(x, clip_norm, axis=None)
Clip inputs by their L2 norm when the L2 norm is larger than the threshold value (defined by clip_norm). If the norm is less than the threshold, the inputs are not modified. When clipping is applied, the operation is represented as
\[y = N \times \frac{x}{\|x\|_2},\]
where \(x\) is the input, \(y\) is the output, and \(N\) is clip_norm. This is the case when axis is not set. When axis is set, the norm is computed over axis.
 Parameters
 Returns
ND array.
 Return type
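The clipping rule of clip_by_norm can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the NNabla implementation: the norm is the L2 norm over `axis` (all elements if `axis` is None), and the small floor on the norm is an added safeguard against division by zero.

```python
import numpy as np


def clip_by_norm(x, clip_norm, axis=None):
    """Rescale x to L2 norm `clip_norm` wherever its norm exceeds that value."""
    norm = np.sqrt(np.sum(x * x, axis=axis, keepdims=True))
    # The 1e-12 floor is an added safeguard, not part of the documented formula.
    scale = np.where(norm > clip_norm, clip_norm / np.maximum(norm, 1e-12), 1.0)
    return x * scale


x = np.array([3.0, 4.0])            # L2 norm is 5
clipped = clip_by_norm(x, 1.0)      # norm exceeds 1, so x is rescaled
unchanged = clip_by_norm(x, 10.0)   # norm is below 10, so x is returned as is
```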
 nnabla.functions.clip_grad_by_norm(x, clip_norm=None, axes=None, n_outputs=-1, outputs=None)
In the forward pass, the function behaves like the identity.
In the backward pass,
\[g_x = N \times \frac{g_y}{\|g_y\|_2},\]
where \(g_x\) is the gradient w.r.t. the input, \(g_y\) is the gradient w.r.t. the output, and \(N\) is clip_norm, the value that the norm of \(g_y\) becomes. This is the case when axes is not set. When axes is set, the norm is computed over axes.
A typical use case is to prevent gradient explosion through a whole computational graph. For example, if you want to normalize gradient values over the feature axis:
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

x = nn.Variable([16, 3, 32, 32])
c = F.clip_grad_by_norm(x, axes=(1, ))
h = PF.convolution(c, 64, (3, 3), pad=(1, 1))
 Parameters
 Returns
ND array.
 Return type
 nnabla.functions.layer_normalization(x, beta, gamma, batch_axis=0, eps=1e-05, output_stat=False)
Applies Layer Normalization over an input tensor, which is defined as:
\[\begin{split}\begin{eqnarray} \mu^l &=& \frac{1}{H} \sum_{i=1}^{H} x_i^l \\ \sigma^l &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^l - \mu^l\right)^2 + \epsilon} \\ y &=& \frac{x - \mu^l}{\sigma^l} \gamma + \beta \end{eqnarray}\end{split}\]
where \(x\) and \(y\) are the input and output variables, \(\mu^l\) and \(\sigma^l\) are the mean and std of each layer, calculated separately for each sample in the batch, and \(\beta\) and \(\gamma\) are adaptive biases and gains.
If the input shape is [B, C, H, W] (= batch_axis=0), the shape of the calculated mean and std is [B, 1, 1, 1].
 Parameters
x (Variable) – An input variable.
beta (Variable or None) – Adaptive biases. If None, the bias term is omitted.
gamma (Variable or None) – Adaptive gains. If None, the scale term is omitted.
batch_axis (int or repeated int) – Batch axes. Mean and variance are calculated separately for each index along these axes.
eps (float) – Tiny value to avoid zero division by std.
output_stat (bool) – If true, calculated mean and variance are also returned.
 Returns
Output variable whose statistics are normalized and which is rescaled by gamma and beta.
Variable: Mean (if output_stat=True).
Variable: Std (if output_stat=True).
 Return type
 nnabla.functions.instance_normalization(x, beta, gamma, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False)
Applies Instance Normalization over an input tensor, which is defined as:
\[\begin{split}\begin{eqnarray} \mu^i &=& \frac{1}{H} \sum_{i=1}^{H} x_i^i \\ \sigma^i &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^i - \mu^i\right)^2 + \epsilon} \\ y &=& \frac{x - \mu^i}{\sigma^i} \gamma + \beta \end{eqnarray}\end{split}\]
where \(x\) and \(y\) are the input and output variables, \(\mu^i\) and \(\sigma^i\) are the mean and std of each instance, calculated separately for each batch and channel, and \(\gamma\) and \(\beta\) are adaptive gains and biases.
If the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the shape of the calculated mean and std is [B, C, 1, 1].
 Parameters
x (Variable) – An input variable.
beta (Variable or None) – Adaptive biases. If None, the bias term is omitted.
gamma (Variable or None) – Adaptive gains. If None, the scale term is omitted.
channel_axis (int) – Channel axis.
batch_axis (int or repeated int) – Batch axes.
eps (float) – Tiny value to avoid zero division by std.
output_stat (bool) – If true, the calculated mean and std are also returned.
 Returns
Normalized output variable.
Variable: Mean (if output_stat=True).
Variable: Std (if output_stat=True).
 Return type
 nnabla.functions.group_normalization(x, beta, gamma, num_groups, channel_axis=1, batch_axis=0, eps=1e-05, output_stat=False)
Applies Group Normalization over an input tensor, which is defined as:
\[\begin{split}\begin{eqnarray} \mu^g &=& \frac{1}{H} \sum_{i=1}^{H} x_i^g \\ \sigma^g &=& \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(x_i^g - \mu^g\right)^2 + \epsilon} \\ y &=& \frac{x - \mu^g}{\sigma^g} \gamma + \beta \end{eqnarray}\end{split}\]
where \(x\) and \(y\) are the input and output variables, \(\mu^g\) and \(\sigma^g\) are the mean and std of each group, which contains num_channels / num_groups channels, and \(\gamma\) and \(\beta\) are adaptive gains and biases.
The input channels, specified by channel_axis, are separated into num_groups groups, and the mean and std are calculated over each group. For example, if the input shape is [B, C, H, W] (= channel_axis=1, batch_axis=0), the input variable is first reshaped to [B, num_groups, C / num_groups, H, W] and standardized by its mean and std, whose shapes are [B, num_groups, 1, 1, 1]. Finally, the output variable is reshaped back to the original input shape (= [B, C, H, W] in the case above).
 Parameters
x (Variable) – An input variable.
beta (Variable or None) – Adaptive biases. If None, the bias term is omitted.
gamma (Variable or None) – Adaptive gains. If None, the scale term is omitted.
num_groups (int) – The number of groups. The channel dimension of x must be an integer multiple of num_groups.
channel_axis (int) – Channel axis.
batch_axis (int or repeated int) – Batch axes.
eps (float) – Tiny value to avoid zero division by std.
output_stat (bool) – If true, the calculated mean and std are also returned.
 Returns
Normalized output variable.
Variable: Mean (if output_stat=True).
Variable: Std (if output_stat=True).
 Return type
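The reshape-based description of group normalization can be followed step by step in NumPy. The sketch below assumes a [B, C, H, W] input with channel_axis=1 and batch_axis=0 and leaves out the gamma and beta terms; it is not the NNabla implementation.

```python
import numpy as np


def group_norm(x, num_groups, eps=1e-5):
    """Group normalization of a [B, C, H, W] array, without gamma and beta."""
    b, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("C must be an integer multiple of num_groups")
    # Reshape to [B, G, C / G, H, W] and standardize within each group.
    grouped = x.reshape(b, num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)            # [B, G, 1, 1, 1]
    std = np.sqrt(grouped.var(axis=(2, 3, 4), keepdims=True) + eps)
    y = (grouped - mean) / std
    return y.reshape(b, c, h, w), mean, std


rng = np.random.RandomState(0)
x = rng.randn(2, 4, 3, 3)
y, mean, std = group_norm(x, num_groups=2)
```

Under the same assumptions, `num_groups=1` yields one statistic per sample and `num_groups=C` one per sample and channel, matching the [B, 1, 1, 1] and [B, C, 1, 1] statistic shapes given for layer and instance normalization.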
 nnabla.functions.weight_standardization(w, channel_axis=0, eps=1e-05, output_stat=False)
Applies Weight Standardization over an input weight, which is defined as:
\[\begin{split}\begin{eqnarray} \mu_{W_i} &=& \frac{1}{I} \sum_{j=1}^{I} W_{ij} \\ \sigma_{W_i} &=& \sqrt{\frac{1}{I} \sum_{j=1}^{I} \left(W_{ij} - \mu_{W_{i}}\right)^2 + \epsilon} \\ \hat{W_{ij}} &=& \frac{W_{ij} - \mu_{W_i}}{\sigma_{W_i}} \\ y &=& \hat{W} \ast x \end{eqnarray}\end{split}\]
Example
import numpy as np
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF

rng = np.random.RandomState(313)
x = nn.Variable.from_numpy_array(rng.randn(*(32, 16, 3, 3)))

# For convolution:
def ws_callback_conv(w):
    return F.weight_standardization(w, channel_axis=0)

y = PF.convolution(x, 10, (2, 2), apply_w=ws_callback_conv)

# For affine:
def ws_callback_affine(w):
    return F.weight_standardization(w, channel_axis=1)

y = PF.affine(x, 10, apply_w=ws_callback_affine)
 nnabla.functions.weight_normalization(w, g, dim=0, eps=1e12, n_outputs= 1, outputs=None)[source]¶
Weight normalization.
\[\mathbf{w}_{WN} = g \dfrac{\mathbf{w}}{\|\mathbf{w}\|}\]
where \(\mathbf{w}\) is the input weights to be normalized, and \(g\) is the learnable multiplication factors, each of which is applied to the corresponding slice of the data along dim.
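A NumPy sketch of the weight normalization formula, assuming the norm is taken over all axes except dim and that eps is added before the square root, as the parameter description states. The function name `weight_normalization_reference` is hypothetical; this is a reference for the arithmetic, not NNabla's implementation.

```python
import numpy as np

def weight_normalization_reference(w, g, dim=0, eps=1e-12):
    # Norms are computed over all axes except `dim`.
    axes = tuple(a for a in range(w.ndim) if a != dim)
    # eps is added before taking the square root.
    norm = np.sqrt(np.sum(w ** 2, axis=axes, keepdims=True) + eps)
    # Reshape the 1D scales g so they broadcast along `dim`.
    shape = [1] * w.ndim
    shape[dim] = w.shape[dim]
    return g.reshape(shape) * w / norm

w = np.random.RandomState(0).randn(4, 3, 2, 2)
g = np.array([1.0, 2.0, 3.0, 4.0])
w_wn = weight_normalization_reference(w, g, dim=0)
```

After normalization, the norm of each slice along dim equals the corresponding entry of g (up to eps).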
 Parameters
w (Variable) – ND array of learnable weights.
g (Variable) – 1D array of learnable scales.
dim (int) – Output dimension. For the other dimensions, the norms are computed. [default=0]
eps (float) – Epsilon for the normalization. This eps is added before taking the sqrt in the norm computation. [default=1e-12]
 Returns
ND array
 Return type
Note
All nnabla functions in nnabla.functions are decorated with the nnabla.function_bases.function_api decorator, which queries the current context and passes it into the first argument of the original function. The original function always takes a context as the first argument.
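The context-injection pattern described in this note can be illustrated with a generic Python decorator. This is only a sketch of the idea: `get_current_context`, `with_current_context`, and the dictionary-based context are hypothetical stand-ins, not NNabla's `function_api` or its context objects.

```python
import functools

# Hypothetical stand-in for a global "current context" (not NNabla's API).
_current_context = {"backend": "cpu"}

def get_current_context():
    return _current_context

def with_current_context(func):
    """Inject the current context as the first argument of `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(get_current_context(), *args, **kwargs)
    return wrapper

@with_current_context
def add2(ctx, x0, x1):
    # The original function always receives a context first.
    return ctx["backend"], x0 + x1

result = add2(1, 2)  # the caller never passes ctx explicitly
```

The caller-facing signature therefore omits the context, which matches how the functions are listed in this reference.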
 nnabla.functions.spectral_norm(w, u, dim=0, itr=1, eps=1e-12, test=False, n_outputs=-1, outputs=None)[source]¶
Spectral Normalization.
\[W_{sn} = \frac{W}{\sigma(W)}\]
where \(W\) is the input matrix and \(\sigma(W)\) is the spectral norm of \(W\). The spectral norm is approximated by power iteration.
References
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida, “Spectral Normalization for Generative Adversarial Networks”, International Conference on Learning Representations. 2018.
 Parameters
w (Variable) – ND array of learnable weights. This is normally a network parameter.
u (Variable) – 1D array of singular vector. When test == False, the data region of u will be updated during forward calculation.
dim (int) – Output dimension. If the dimension is not 0, the specified dimension becomes the leftmost dimension by transposing. [default=0]
itr (int) – Number of power iterations. [default=1]
eps (float) – Epsilon for the normalization. This eps is added before taking the sqrt in the norm computation. [default=1e-12]
test (bool) – When True, u will not be updated. [default=False]
 Returns
Spectrally normalized \(W_{sn}\) with the same shape as \(W\).
 Return type
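The power iteration behind spectral normalization can be sketched in NumPy. This is an illustrative reference under stated assumptions: the weight is flattened to a 2D matrix with dim as the leading axis, and `spectral_norm_reference` is a hypothetical name. It is not NNabla's implementation, and many iterations are used here only to make the estimate tight.

```python
import numpy as np

def spectral_norm_reference(w, u, dim=0, itr=1, eps=1e-12):
    # Move `dim` to the front and flatten the rest into a 2D matrix.
    w_mat = np.moveaxis(w, dim, 0).reshape(w.shape[dim], -1)
    for _ in range(itr):
        v = w_mat.T @ u
        v = v / np.sqrt(np.sum(v ** 2) + eps)  # eps added before the sqrt
        u = w_mat @ v
        u = u / np.sqrt(np.sum(u ** 2) + eps)
    sigma = u @ w_mat @ v  # estimate of the largest singular value
    return w / sigma, u

rng = np.random.RandomState(0)
w = rng.randn(6, 5)
u0 = rng.randn(6)
w_sn, u = spectral_norm_reference(w, u0, itr=200)
```

With the default itr=1 the estimate is coarse on a single call; carrying the updated u across calls, as the description of u suggests, lets the estimate improve over successive forward passes.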
Reduction¶
 nnabla.functions.sum(x, axis=None, keepdims=False)[source]¶
Reduction along axes with sum operation.
 Parameters
 Returns
ND array.
 Return type
 nnabla.functions.mean(x, axis=None, keepdims=False)[source]¶
Reduction along axes with mean operation.
 Parameters
 Returns
ND array.
 Return type
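The axis and keepdims arguments of the reductions above can be illustrated with NumPy, assuming they follow the same conventions described for max in this section (None reduces all axes; keepdims keeps reduced axes with size 1). This is a NumPy sketch, not a call into NNabla.

```python
import numpy as np

x = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

total = np.sum(x)                              # axis=None: reduce over all axes
per_row = np.sum(x, axis=1)                    # reduce one axis -> shape (2, 4)
kept = np.mean(x, axis=(0, 2), keepdims=True)  # reduced axes kept with size 1
```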
 nnabla.functions.max(x, axis=None, keepdims=False, with_index=False, only_index=False)[source]¶
Reduce the input ND array x along the given axis using the max operation. The axis argument may be a single integer to reduce over one axis, a tuple of integers to reduce over multiple axes, or None to reduce over all axes. If keepdims is True, the output will keep all reduced dimensions with size 1. If with_index is True, the result is a tuple (max values, indices), or only indices if only_index is True. Setting only_index to True implies that with_index is also True.
import numpy as np
import nnabla as nn
import nnabla.functions as F

nn.set_auto_forward(True)
x = nn.Variable.from_numpy_array(np.random.rand(2, 3, 4))

maxval = F.max(x, axis=1)
assert np.allclose(maxval.d, np.max(x.d, axis=1))

maxval, indices = F.max(x, axis=1, with_index=True)
assert np.allclose(maxval.d, np.max(x.d, axis=1))
assert np.all(indices.d == np.argmax(x.d, axis=1))

indices = F.max(x, axis=1, only_index=True)
assert np.all(indices.d == np.argmax(x.d, axis=1))
 Parameters
x (Variable) – An input variable.
axis (None, int or tuple of ints) – Axis or axes along which max is calculated. The default value None will reduce all dimensions.
keepdims (bool) – Keep reduced axes as dimension with 1 element.
with_index (bool) – Return tuple of max values and index.
only_index (bool) – Return only the index of max values.
 Returns
ND array.
 Return type
 nnabla.functions.min(x, axis=None, keepdims=False, with_index=False, only_index=False)[source]¶
Reduce the input ND array x along the given axis using the min operation. The axis argument may be a single integer to reduce over one axis, a tuple of integers to reduce over multiple axes, or None to reduce over all axes. If keepdims is True, the output will keep all reduced dimensions with size 1. If with_index is True, the result is a tuple (min values, indices), or only indices if only_index is True. Setting only_index to True implies that with_index is also True.
import numpy as np
import nnabla as nn
import nnabla.functions as F

nn.set_auto_forward(True)
x = nn.Variable.from_numpy_array(np.random.rand(2, 3, 4))

minval = F.min(x, axis=1)
assert np.allclose(minval.d, np.min(x.d, axis=1))

minval, indices = F.min(x, axis=1, with_index=True)
assert np.allclose(minval.d, np.min(x.d, axis=1))
assert np.all(indices.d == np.argmin(x.d, axis=1))

indices = F.min(x, axis=1, only_index=True)
assert np.all(indices.d == np.argmin(x.d, axis=1))
 Parameters
x (Variable) – An input variable.
axis (None, int or tuple of ints) – Axis or axes along which min is calculated. The default value None will reduce all dimensions.
keepdims (bool) – Keep reduced axes as dimension with 1 element.
with_index (bool) – Return tuple of min values and index.
only_index (bool) – Return only the index of min values.
 Returns
ND array.
 Return type
 nnabla.functions.norm(x, p=None, axis=None, keepdims=False)[source]¶
Reduction along axes with norm operation.
\[y = \|x\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}\]
 Parameters
 Returns
ND array.
 Return type
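The p-norm reduction formula can be sketched in NumPy as follows. The function name `p_norm_reference` and its explicit default p=2.0 are assumptions made for illustration; how NNabla interprets p=None is not stated here.

```python
import numpy as np

def p_norm_reference(x, p=2.0, axis=None, keepdims=False):
    # (sum_i |x_i|^p)^(1/p), reduced over the requested axes.
    return np.sum(np.abs(x) ** p, axis=axis, keepdims=keepdims) ** (1.0 / p)

x = np.array([[3.0, -4.0], [6.0, 8.0]])
l2_rows = p_norm_reference(x, p=2.0, axis=1)  # L2 norm of each row
l1_all = p_norm_reference(x, p=1.0)           # L1 norm over all elements
```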
 nnabla.functions.prod(x, axis=None, keepdims=False)[source]¶
Reduction along axes with product operation.
 Parameters
 Returns
ND array.
 Return type
Note
Backward computation is not accurate for zero-valued inputs.
 nnabla.functions.reduce_sum(x, n_outputs=-1, outputs=None)[source]¶
Reduction along an axis with sum operation.
Note
This is deprecated. Use sum instead.
 nnabla.functions.reduce_mean(x, n_outputs=-1, outputs=None)[source]¶
Reduction by mean along an axis.
Note
This is deprecated. Use mean instead.
Arithmetic¶
 nnabla.functions.add2(x0, x1, inplace=False, n_outputs=-1, outputs=None)[source]¶
Element-wise addition.
\[y_i = x^{(0)}_i + x^{(1)}_i\]
 Parameters
 Returns
ND array
 Return type
 nnabla.functions.add_n(*x, **kw)[source]¶
Element-wise addition.
\[y_i = x^{(0)}_i + \ldots + x^{(n-1)}_i\]
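The n-ary sum in the formula above amounts to folding element-wise addition over the inputs. A NumPy sketch, with the hypothetical name `add_n_reference` and equal-shaped inputs assumed:

```python
import functools
import numpy as np

def add_n_reference(*xs):
    # Fold element-wise addition over all inputs, left to right.
    return functools.reduce(np.add, xs)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
c = np.array([5.0, 6.0])
total = add_n_reference(a, b, c)
```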
 nnabla.functions.sub2(x0, x1, inplace=False, n_outputs=-1, outputs=None)[source]¶
Element-wise subtraction.
\[y_i = x^{(0)}_i - x^{(1)}_i\]
 Parameters
 Returns
ND array
 Return type
 nnabla.functions.mul2(x0, x1, inplace=False, n_outputs=-1, outputs=None)[source]¶
Element-wise multiplication.
\[y_i = x^{(0)}_i x^{(1)}_i\]
 Parameters
 Returns
ND array
 Return type
 nnabla.functions.mul_n(*x, **kw)[source]¶
Element-wise multiplication.
\[y_i = x^{(0)}_i \cdots x^{(n-1)}_i\]
 nnabla.functions.div2(x0, x1, inplace=False, n_outputs=-1, outputs=None)[source]¶
Element-wise division.
\[y_i = \frac{x^{(0)}_i} {x^{(1)}_i}\]
 Parameters
 Returns
ND array
 Return type
 nnabla.functions.pow2(x0, x1, inplace=False, n_outputs=-1, outputs=None)[source]¶
Element-wise power function.
\[y_i = {(x^{(0)}_i)} ^ {x^{(1)}_i}\]
 Parameters
 Returns
ND array
 Return type
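The two-input arithmetic functions listed so far each apply their formula per element. The NumPy expressions below give reference values for a small example, assuming equal-shaped inputs; the dictionary keys simply reuse the function names for readability, and nothing here calls into NNabla.

```python
import numpy as np

x0 = np.array([1.0, 4.0, 9.0])
x1 = np.array([2.0, 0.5, 1.0])

# Reference results keyed by the corresponding function name.
reference = {
    "add2": x0 + x1,
    "sub2": x0 - x1,
    "mul2": x0 * x1,
    "div2": x0 / x1,
    "pow2": x0 ** x1,
}
```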
 nnabla.functions.add_scalar(x, val=1, inplace=False, n_outputs=-1, outputs=None)[source]¶
Element-wise scalar addition.
\[y_i = x_i + v\]
 Parameters
 Returns
ND array with the same shape as x
 Return type
 nnabla.functions.mul_scalar(x
1. comment out the function in functions.txt¶