Debugging

Deep neural networks get deeper every year and require more and more components. This complexity makes it easy to misconfigure a network, and such mistakes can be critical. Even when a network is configured as intended, we may still want to locate its performance bottleneck, e.g., which layer(s) dominate the computation.

In this debugging tutorial, we introduce the following ways to deal with such cases:

visit method of a variable
pretty-print
simple graph viewer
profiling utilities
value tracer

We go over each technique below, but first let us prepare the following reference model.

# If you run this notebook on Google Colab, uncomment and run the following to set up dependencies.
# !pip install nnabla-ext-cuda100
# !git clone https://github.com/sony/nnabla.git
# %cd nnabla/tutorial

# Python2/3 compatibility
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division

import numpy as np
import nnabla as nn
import nnabla.logger as logger
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla.solvers as S

def block(x, maps, test=False, name="block"):
    h = x
    with nn.parameter_scope(name):
        with nn.parameter_scope("in-block-1"):
            h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
            h = F.relu(h)
        with nn.parameter_scope("in-block-2"):
            h = PF.convolution(h, maps // 2, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)
            h = F.relu(h)
        with nn.parameter_scope("in-block-3"):
            h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
            h = PF.batch_normalization(h, batch_stat=not test)

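        # Use the input itself as the shortcut when the channel counts match;
        # otherwise project it with a convolution so that h + s is well defined.
        s = x
        if h.shape[1] != x.shape[1]:
            with nn.parameter_scope("skip"):
                s = PF.convolution(x, maps, kernel=(3, 3), pad=(1, 1), with_bias=False)
                s = PF.batch_normalization(s, batch_stat=not test)

    return F.relu(h + s)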

def network(x, maps=16, test=False):
    h = x
    h = PF.convolution(h, maps, kernel=(3, 3), pad=(1, 1), name="first-conv", with_bias=False)
    h = PF.batch_normalization(h, batch_stat=not test, name="first-bn")
    h = F.relu(h)
    for l in range(4):
        h = block(h, maps * 2 ** (l + 1), test=test, name="block-{}".format(l))
        h = F.max_pooling(h, (2, 2))
    h = F.average_pooling(h, h.shape[2:])
    pred = PF.affine(h, 100, name="pred")
    return pred
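
The shapes that flow through the reference model can be worked out by hand. The sketch below is plain Python, not an NNabla API; the helper name trace_shapes is hypothetical. It assumes that the 3x3 convolutions with pad 1 keep the spatial size and that each (2, 2) max pooling halves it.

```python
# Hypothetical helper (not part of NNabla): reproduces the channel and
# spatial-size arithmetic of network() for a square input.
def trace_shapes(size=128, maps=16, n_blocks=4):
    shapes = [("first-conv", maps, size)]
    for l in range(n_blocks):
        channels = maps * 2 ** (l + 1)  # output channels of block-l
        size = size // 2                # assumed effect of 2x2 max pooling
        shapes.append(("block-{}".format(l), channels, size))
    return shapes

# Each entry is (name, channels, spatial size after the following pooling).
for name, channels, size in trace_shapes():
    print("{:<10} channels={:<4} spatial={}x{}".format(name, channels, size, size))
```

Under these assumptions every block changes the channel count (16 to 32, 32 to 64, and so on), so the "skip" convolution is created in all four blocks.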

Visit Method

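The visit method of a variable takes a lambda, a function, or a callable object as its argument and calls it on each NNabla function that can be reached from the variable, traversing the graph in forward order. It is easier to see how it is used than to explain it.

First, define a callable class.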

class PrintFunc(object):
    def __call__(self, nnabla_func):
        print("==========")
        print(nnabla_func.info.type_name)
        print(nnabla_func.inputs)
        print(nnabla_func.outputs)
        print(nnabla_func.info.args)

This callable object receives an NNabla function, e.g., convolution or ReLU, so the user can get information from that function.

nn.clear_parameters()  # call this in case you want to run the following code again

x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
pred = network(x)
pred.visit(PrintFunc())

visit is a low-level API: it lets you inspect whatever graph information you need by hand.
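
A callable passed to visit can also accumulate state instead of printing. The following is a minimal sketch of such a callable; the class name FunctionCounter is hypothetical, and it relies only on the info.type_name attribute used in PrintFunc above. The stand-in objects built with SimpleNamespace are not real NNabla functions; they only let the sketch run on its own.

```python
from collections import Counter
from types import SimpleNamespace


class FunctionCounter(object):
    """Callable for Variable.visit that tallies how often each function type occurs."""

    def __init__(self):
        self.counts = Counter()

    def __call__(self, nnabla_func):
        self.counts[nnabla_func.info.type_name] += 1


def fake_function(type_name):
    # Stand-in for an NNabla function object (illustration only).
    return SimpleNamespace(info=SimpleNamespace(type_name=type_name))


counter = FunctionCounter()
for name in ["Convolution", "BatchNormalization", "ReLU", "Convolution"]:
    counter(fake_function(name))

print(counter.counts.most_common())
```

With the reference model built, the same object would be used as pred.visit(counter), after which counter.counts holds one entry per function type in the graph.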

PPrint

pprint is one application of the visit method. It prints the graph structure in detail, in topological (forward) order. The following shows how to print detailed information about a graph.

nn.clear_parameters()  # call this in case you want to run the following code again

x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
pred = network(x)

# pprint
from nnabla.utils.inspection import pprint
pprint(pred, summary=True, forward=True, backward=True)

Simple Graph Viewer

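The visit method is very useful for getting information about each function used in a graph, but it is hard to see the details of the overall network structure, e.g., which variable is connected to which. A graph viewer that shows the whole network structure visually lets us debug more efficiently. Using the graph viewer is easy, as the following code shows.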

nn.clear_parameters()  # call this in case you want to run the following code again

x = nn.Variable([4, 3, 128, 128])
pred = network(x)

import nnabla.experimental.viewers as V

graph = V.SimpleGraph(verbose=False)
graph.view(pred)

If you want to see more detailed information, as with the visit method, change the verbose option to True.

graph = V.SimpleGraph(verbose=True)
graph.view(pred)

Now you can see the detailed information!

This viewer is mainly for NNabla users who want to write code in Python. If you would like a more polished view of your network and want to experiment with it, please use Neural Network Console; see https://dl.sony.com/.

Profiling Utils

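This feature is mainly for developers who want overall speed statistics or want to know which functions are bottlenecks. NNabla provides a simple profiling tool. Once a network is prepared, it is recommended to also set up the other components needed to train it, such as a loss function and a solver.

To create the profiler and see the results, run the following code.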

nn.clear_parameters()  # call this in case you want to run the following code again

# Context
from nnabla.ext_utils import get_extension_context
device = "cudnn"
ctx = get_extension_context(device)
nn.set_default_context(ctx)

# Network
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 128, 128]))
t = nn.Variable([4, 1])
pred = network(x)
loss = F.mean(F.softmax_cross_entropy(pred, t))

# Solver
solver = S.Momentum()
solver.set_parameters(nn.get_parameters())

# Profiler
from nnabla.utils.profiler import GraphProfiler
B = GraphProfiler(loss, solver=solver, device_id=0, ext_name=device, n_run=100)
B.run()
print("Profile finished.")

# Report
from nnabla.utils.profiler import GraphProfilerCsvWriter
with open("./profile.csv", "w") as f:
    writer = GraphProfilerCsvWriter(B, file=f)
    writer.write()
print("Report is prepared.")

NNabla also provides TimeProfiler, which measures execution time at a finer granularity.

With TimeProfiler, you pass callback functions to the forward and/or backward method in the training loop.
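
The idea of timing through callbacks can be illustrated without NNabla. The sketch below is not TimeProfiler and does not reproduce its API; HookTimer and run_with_hooks are hypothetical names, and the "layers" are ordinary Python callables. It only shows how a pre-hook and a post-hook placed around each step yield per-step timings.

```python
import time
from collections import defaultdict


class HookTimer(object):
    """Accumulates elapsed time per step from pre/post callbacks."""

    def __init__(self):
        self._start = {}
        self.elapsed = defaultdict(float)
        self.calls = defaultdict(int)

    def pre_hook(self, name):
        self._start[name] = time.perf_counter()

    def post_hook(self, name):
        self.elapsed[name] += time.perf_counter() - self._start.pop(name)
        self.calls[name] += 1

    def summary(self):
        # Slowest step first.
        return sorted(self.elapsed.items(), key=lambda kv: kv[1], reverse=True)


def run_with_hooks(layers, value, timer):
    # Call the hooks immediately before and after each step.
    for name, fn in layers:
        timer.pre_hook(name)
        value = fn(value)
        timer.post_hook(name)
    return value


layers = [("double", lambda v: [2 * e for e in v]), ("sum", sum)]
timer = HookTimer()
for _ in range(10):
    result = run_with_hooks(layers, list(range(1000)), timer)

for name, seconds in timer.summary():
    print("{:<8} calls={} total={:.6f}s".format(name, timer.calls[name], seconds))
```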

Value Tracer

We sometimes want to check whether a graph produces NaN or Inf values. NanInfTracer is a convenient way to check whether any layer in a graph outputs a NaN/Inf value.

# Create graph again just in case
nn.clear_parameters()  # call this in case you want to run the following code again

# Try to switch these two
x = nn.Variable.from_numpy_array(np.random.randn(*[4, 3, 64, 64]))
#x = nn.Variable([4, 3, 64, 64])
pred = network(x)

# NanInfTracer
from nnabla.utils.inspection import NanInfTracer
nit = NanInfTracer(trace_inf=True, trace_nan=True, need_details=True)

with nit.trace():
    # Try to comment either of these two or both
    pred.forward(function_post_hook=nit.forward_post_hook)
    pred.backward(function_post_hook=nit.backward_post_hook)

print(nit.check())
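
Conceptually, such a check scans each layer's output for non-finite values. The following plain-Python sketch shows that idea on hand-written sample data; find_non_finite is a hypothetical helper and is not how NanInfTracer is implemented.

```python
import math


def find_non_finite(named_outputs):
    """Return (layer name, NaN count, Inf count) for every layer with NaN or Inf."""
    report = []
    for name, values in named_outputs:
        n_nan = sum(1 for v in values if math.isnan(v))
        n_inf = sum(1 for v in values if math.isinf(v))
        if n_nan or n_inf:
            report.append((name, n_nan, n_inf))
    return report


# Made-up layer outputs for illustration only.
outputs = [
    ("first-conv", [0.1, -0.3, 2.0]),
    ("block-0", [float("nan"), 1.0, float("inf")]),
    ("pred", [0.5, float("-inf"), 0.0]),
]
print(find_non_finite(outputs))
```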