Out-of-core execution
The nnabla.lms package provides APIs that allow users to execute networks larger than the allotted GPU memory by utilizing an out-of-core algorithm.
An out-of-core algorithm, also called an external memory algorithm, is an algorithm that can process data too large to fit into main memory at once.
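As a minimal, nnabla-independent illustration of the out-of-core idea, the sketch below processes a file in fixed-size chunks so that only one chunk resides in memory at a time (the file name and chunk size are arbitrary for this sketch):

```python
import os
import tempfile

def out_of_core_sum(path, chunk_size=1 << 20):
    """Sum the bytes of a file without loading it into memory at once."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # Only one chunk is in memory at a time.
            if not chunk:
                break
            total += sum(chunk)
    return total

# Usage: write a small file and sum it chunk by chunk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(bytes(range(256)) * 10)
print(out_of_core_sum(path, chunk_size=64))  # 326400, same as loading it whole
os.remove(path)
```

SwapInOutScheduler applies the same principle to network training: tensors are moved between host and GPU memory so that only the working set of the currently executing functions needs to fit on the device.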
SwapInOutScheduler
- class nnabla.lms.SwapInOutScheduler
Interface class for out-of-core execution / training.
This API enables training neural networks whose size is larger than the allotted GPU memory. See https://arxiv.org/abs/2010.14109 for details of the scheduling strategy.
Note
cuda_init.prefer_cuda_virtual_array() used in the following example requires cuda >= 10.2 and cudnn >= 8, since it relies on the virtual memory management introduced in cuda 10.2. Additionally, when we tested virtual memory management with cuda >= 10.2 and cudnn < 8, we found that some cudnn functions computed inaccurate results. Therefore, when your environment has cuda < 10.2 or cudnn < 8, the virtual memory allocator in nnabla is not built and cannot be used. To use SwapInOutScheduler to the fullest extent, please install cuda >= 10.2 and cudnn >= 8 and reinstall the corresponding nnabla-ext-cuda package.

Example:
import nnabla as nn
import nnabla.solvers as S
from nnabla.lms import SwapInOutScheduler

# Change the memory allocators to ones preferable for SwapInOutScheduler.
import nnabla_ext.cuda.init as cuda_init
cuda_init.prefer_cpu_pinned_array()  # Pinned host memory accelerates cpu-gpu memory transfer.

# Only for cuda >= 10.2 and cudnn >= 8. This setting is the best for SwapInOutScheduler.
cuda_init.prefer_cuda_virtual_array()  # The virtual allocator reduces GPU memory fragmentation caused by cpu-gpu memory transfers.

# Create contexts for both host and device.
from nnabla.ext_utils import get_extension_context
host_ctx = get_extension_context("cpu", device_id="", type_config="float")  # device_id is dummy
device_ctx = get_extension_context("cudnn", device_id="0", type_config="float")

scheduler = SwapInOutScheduler(host_ctx, device_ctx, size=max_gpu_memory_size)

# Make sure to call `nn.set_default_context` after calling prefer_xxx_array()
# to activate the change of memory preference.
nn.set_default_context(device_ctx)

x = nn.Variable(...)
loss = build_network(x)
solver = S.Sgd()
solver.set_parameters(nn.get_parameters())

for i in range(iteration):
    # Schedule memory transfers for all tensors appearing under the scheduler context.
    with scheduler:
        x.d = next_data()
        loss.forward(clear_no_need_grad=True)
        solver.zero_grad()
        loss.backward(clear_buffer=True)
        solver.update()
When you get an Out-of-Memory (OOM) error under the SwapInOutScheduler, there are two options to avoid it:
Set a smaller budget of GPU memory for scheduling.
Set a smaller size for the physical memory chunks allocated by the virtual memory allocator.
These are exemplified as follows:
Example:
# 1. Set a smaller budget of GPU memory for scheduling.
# You can reduce the ratio below until your network can be executed.
memsize_for_scheduler = max_gpu_memory_size * 0.8
scheduler = SwapInOutScheduler(..., size=memsize_for_scheduler)

# 2. Set a smaller size for the physical memory chunks allocated by the virtual memory allocator.
# By default, the chunk size is set to 20MB (20 << 20).
from nnabla_ext.cuda.init import set_cuda_virtual_memory_chunk_size
set_cuda_virtual_memory_chunk_size(2 << 20)  # Set 2MB, for example.
- end_scheduling(self)
An interface to specify the end point for scheduling. The range between start_scheduling() and end_scheduling() is the target of a single scheduling. Note that when using the with statement of SwapInOutScheduler, end_scheduling() is called automatically on exiting the with statement. In general, avoid using start_scheduling() and end_scheduling() directly and use the with statement instead (with scheduler:, see the example above).
- function_post_hook(self, func)
A callback executed as function_post_hook in forward and backward. For all forward and backward calls wrapped by the with statement of SwapInOutScheduler, this callback is set automatically. In general, avoid setting this manually and use the with statement of SwapInOutScheduler.
- function_pre_hook(self, func)
A callback executed as function_pre_hook in forward and backward. For all forward and backward calls wrapped by the with statement of SwapInOutScheduler, this callback is set automatically. In general, avoid setting this manually and use the with statement of SwapInOutScheduler.
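How such per-function callbacks interleave with graph execution can be sketched in plain Python (a hypothetical mock, not nnabla's implementation): the pre hook fires before each function and the post hook after it, which is where a scheduler can swap a function's tensors in before it runs and schedule them for swap-out afterwards.

```python
def run_forward(functions, function_pre_hook, function_post_hook):
    """Execute a sequence of graph functions, invoking the hooks
    around each one (a mock of how such callbacks are applied)."""
    for f in functions:
        function_pre_hook(f)   # e.g. swap in the tensors f needs
        f()
        function_post_hook(f)  # e.g. schedule f's tensors for swap-out

trace = []
run_forward([lambda: trace.append("affine"), lambda: trace.append("relu")],
            lambda f: trace.append("pre"),
            lambda f: trace.append("post"))
print(trace)  # ['pre', 'affine', 'post', 'pre', 'relu', 'post']
```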
- start_scheduling(self)
An interface to specify the starting point for scheduling. The range between start_scheduling() and end_scheduling() is the target of a single scheduling. Note that when using the with statement of SwapInOutScheduler, start_scheduling() is called automatically on entering the with statement. In general, avoid using start_scheduling() and end_scheduling() directly and use the with statement instead (with scheduler:, see the example above).
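The relation between the with statement and the explicit calls can be sketched with a plain Python mock (MockScheduler below is hypothetical, showing only that __enter__/__exit__ correspond to start_scheduling()/end_scheduling()):

```python
class MockScheduler:
    """Hypothetical stand-in recording the calls that the with statement
    of SwapInOutScheduler is documented to make automatically."""
    def __init__(self):
        self.calls = []
    def start_scheduling(self):
        self.calls.append("start_scheduling")
    def end_scheduling(self):
        self.calls.append("end_scheduling")
    def __enter__(self):
        self.start_scheduling()  # Called automatically on entering the with block.
        return self
    def __exit__(self, exc_type, exc, tb):
        self.end_scheduling()    # Called automatically on exiting the with block.
        return False

scheduler = MockScheduler()
with scheduler:
    pass  # forward/backward/update would run here
print(scheduler.calls)  # ['start_scheduling', 'end_scheduling']
```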
- update_post_hook(self)
A callback executed as post_hook in all solver functions, e.g. solver.update, solver.weight_decay, solver.clip_grad_by_norm, and so on. For all solver functions wrapped by the with statement of SwapInOutScheduler, this callback is set automatically. In general, avoid setting this manually and use the with statement of SwapInOutScheduler.
- update_pre_hook(self)
A callback executed as pre_hook in all solver functions, e.g. solver.update, solver.weight_decay, solver.clip_grad_by_norm, and so on. For all solver functions wrapped by the with statement of SwapInOutScheduler, this callback is set automatically. In general, avoid setting this manually and use the with statement of SwapInOutScheduler.
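The ordering of these solver hooks can be sketched likewise (a hypothetical mock, not nnabla's API): pre_hook runs just before a solver function and post_hook just after it.

```python
def run_with_hooks(solver_fn, pre_hook, post_hook):
    """Call a solver function wrapped by pre/post hooks, mocking the
    arrangement the scheduler's with statement sets up."""
    pre_hook()    # update_pre_hook: e.g. swap needed tensors into GPU memory
    solver_fn()
    post_hook()   # update_post_hook: e.g. schedule tensors for swap-out

trace = []
run_with_hooks(lambda: trace.append("update"),
               lambda: trace.append("pre"),
               lambda: trace.append("post"))
print(trace)  # ['pre', 'update', 'post']
```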