Debugging

Deep neural networks are getting deeper every year, and they require more and more components. Such complexity often leads us to misconfigure a network in ways that turn out to be critical. Even if we correctly configure a neural network as desired, we may still want to find its performance bottleneck, e.g., which layer(s) the computational bottleneck comes from.

In this debugging tutorial, we introduce the following ways to deal with such cases:

  1. visit method of a variable

  2. pretty-print

  3. simple graph viewer

  4. profiling utils

  5. value tracer

We will go over each technique in turn, but first let's prepare the following reference model.

# If you run this notebook on Google Colab, uncomment and run the following to set up dependencies.
# !pip install nnabla-ext-cuda100
# !git clone https://github.com/sony/nnabla.git
# %cd nnabla/tutorial
# Python2/3 compatibility
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
import numpy as np
import nnabla as nn
import nnabla.logger as logger
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla.solvers as S

def block(x, maps, test=False, name="block"):
    h = x
    with nn.parameter_scope(name):
        with nn.parameter_scope("in-block-1"):
            h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
            h = F.relu(h)
        with nn.parameter_scope("in-block-2"):
            h = PF.convolution(h, maps // 2, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
            h = F.relu(h)
        with nn.parameter_scope("in-block-3"):
            h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)

        if h.shape[1] != x.shape[1]:
            with nn.parameter_scope("skip"):
                s = PF.convolution(x, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
                s = PF.batch_normalization(s, batch_stat=not test)
        else:
            s = x  # identity skip connection when the channel counts already match

    return F.relu(h + s)

def network(x, maps=16, test=False):
    h = x
    h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), name="first-conv", with_bias=False)
    h = PF.batch_normalization(h, batch_stat=not test, name="first-bn")
    h = F.relu(h)
    for l in range(4):
        h = block(h, maps * 2 ** (l + 1), name="block-{}".format(l))
        h = F.max_pooling(h, (2, 2))
    h = F.average_pooling(h, h.shape[2:])
    pred = PF.affine(h, 100, name="pred")
    return pred

Visit Method

The visit method of a variable takes a lambda, a function, or a callable object as an argument and calls it on every NNabla function that the variable can traverse, in the forward order. The usage is easier to see than to explain.

First of all, define the callable class.

class PrintFunc(object):
    def __call__(self, nnabla_func):
        print("==========")
        print(nnabla_func.info.type_name)
        print(nnabla_func.inputs)
        print(nnabla_func.outputs)
        print(nnabla_func.info.args)

This callable object takes an NNabla function, e.g., convolution, relu, etc., so a user can get information about that function.

nn.clear_parameters()  # call this in case you want to run the following code again

x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
pred = network(x)
pred.visit(PrintFunc())

This is the low-level API for inspecting graph information by hand, in whatever way you need.
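
Since the visit method accepts any callable, a plain lambda also works for quick, one-off queries. As a small sketch, the following counts how many times each function type appears in the graph (the counts variable is just our own tally):

from collections import Counter

counts = Counter()
pred.visit(lambda f: counts.update([f.info.type_name]))  # tally function types
print(counts)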

PPrint

The pprint method is one instantiation of the visit method. It lets us see the graph structure in topological (forward) order in detail; the summary, forward, and backward options below control how much information is printed. Here is an example of printing detailed information about a graph.

nn.clear_parameters()  # call this in case you want to run the following code again

x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
pred = network(x)

# pprint
from nnabla.utils.inspection import pprint
pprint(pred, summary=True, forward=True, backward=True)

Simple Graph Viewer

The visit method is very useful for getting information about each function used in a graph, but it is hard to see the details of the whole network structure, e.g., which variable is connected to which variable. So we have a graph viewer that visually shows the whole structure of the network, enabling us to debug more efficiently. Using this graph viewer is straightforward, as shown in the following code:

nn.clear_parameters()  # call this in case you want to run the following code again

x = nn.Variable([4, 3, 128, 128])
pred = network(x)
import nnabla.experimental.viewers as V

graph = V.SimpleGraph(verbose=False)
graph.view(pred)

If you would like to see more detailed information, as in the visit method case, set the verbose option to True.

graph = V.SimpleGraph(verbose=True)
graph.view(pred)

Now one can see detailed information!
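
If you want to keep the rendered image as a file rather than opening a viewer window, SimpleGraph can also write the graph out. This is a minimal sketch assuming the save method of SimpleGraph and an example output name "network-graph" (rendering relies on Graphviz being installed):

# Save the graph as an image file ("network-graph" is an example name).
graph.save(pred, "network-graph")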

Note that this viewer is mainly for NNabla users who want to write code in Python. If you would like a more polished visualization of the network and to interact with it, please use Neural Network Console: https://dl.sony.com/.

Profiling Utils

This feature is mainly for developers who want overall speed statistics and to find out which functions could be bottlenecks. NNabla provides a simple profiling tool. Once a network is prepared, you also need the other components used to train the network, such as a loss function and a solver.

To create the profiler and see the results, run the following code.

nn.clear_parameters()  # call this in case you want to run the following code again

# Context
from nnabla.ext_utils import get_extension_context
device = "cudnn"
ctx = get_extension_context(device)
nn.set_default_context(ctx)

# Network
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
t = nn.Variable([4, 1])
t.d = np.random.randint(0, 100, size=t.shape)  # fill with valid class labels
pred = network(x)
loss = F.mean(F.softmax_cross_entropy(pred, t))

# Solver
solver = S.Momentum()
solver.set_parameters(nn.get_parameters())

# Profiler
from nnabla.utils.profiler import GraphProfiler
B = GraphProfiler(loss, solver=solver, device_id=0, ext_name=device, n_run=100)
B.run()
print("Profile finished.")

# Report
from nnabla.utils.profiler import GraphProfilerCsvWriter
with open("./profile.csv", "w") as f:
    writer = GraphProfilerCsvWriter(B, file=f)
    writer.write()
print("Report is prepared.")

You can also find TimeProfiler for profiling; it is more fine-grained in measuring execution time.

With TimeProfiler, you can pass callback functions to the forward and/or backward methods in the training loop, as sketched below.
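
Here is a minimal sketch of that usage. It assumes the TimeProfiler API in nnabla.utils.inspection (scope, pre_hook, post_hook, calc_elapsed_time, and to_csv); please check the API reference of your NNabla version for the exact signatures:

# A minimal sketch of TimeProfiler usage; signatures may differ by version.
from nnabla.utils.inspection import TimeProfiler

tp = TimeProfiler(ext_name=device, device_id=0)

for i in range(10):
    # Executions under the "forward" scope are recorded as forward passes.
    with tp.scope("forward"):
        loss.forward(function_pre_hook=tp.pre_hook, function_post_hook=tp.post_hook)

    # Executions under the "backward" scope are recorded as backward passes.
    with tp.scope("backward"):
        loss.backward(function_pre_hook=tp.pre_hook, function_post_hook=tp.post_hook)

    # Aggregate the elapsed times recorded in this iteration.
    tp.calc_elapsed_time()

# Write the per-function timings out as CSV files.
tp.to_csv("./")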

Value Tracer

We sometimes want to check whether NaN/Inf values occur during computation. NanInfTracer is a convenient way to check whether any layer in a graph produces a NaN/Inf value.

# Create graph again just in case
nn.clear_parameters()  # call this in case you want to run the following code again

# Try to switch these two
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 64, 64]))
#x = nn.Variable([4, 3, 64, 64])
pred = network(x)

# NanInfTracer
from nnabla.utils.inspection import NanInfTracer
nit = NanInfTracer(trace_inf=True, trace_nan=True, need_details=True)

with nit.trace():
    # Try to comment either of these two or both
    pred.forward(function_post_hook=nit.forward_post_hook)
    pred.backward(function_post_hook=nit.backward_post_hook)

print(nit.check())