Data Iterators

NNabla provides various utilities for feeding data into training.


class nnabla.utils.data_source.DataSource(shuffle=False, rng=None)

Bases: object

This class contains various properties and methods for the data source, which are utilized by DataIterator.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.
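As a rough illustration of how shuffle and rng interact, the sketch below builds the order in which examples are visited during one epoch. It is not NNabla code: `epoch_order` is a hypothetical helper, and the standard-library `random.Random` stands in for `numpy.random.RandomState`.

```python
import random

def epoch_order(size, shuffle=False, rng=None):
    """Return the order in which `size` examples are visited in one epoch."""
    indexes = list(range(size))
    if shuffle:
        # A seeded generator makes the shuffled order reproducible.
        rng = rng if rng is not None else random.Random(0)
        rng.shuffle(indexes)
    return indexes

sequential = epoch_order(5)
shuffled_a = epoch_order(5, shuffle=True, rng=random.Random(0))
shuffled_b = epoch_order(5, shuffle=True, rng=random.Random(0))
```

Passing generators created with the same seed gives the same shuffled order, which is the usual reason to supply rng explicitly.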

property position

Data position in current epoch.


Returns: Data position

Return type: int


property shuffle

Whether dataset is shuffled or not.


Returns: Whether dataset is shuffled.

Return type: bool


property variables

Variable names of the data.


Returns: tuple of Variable names

Return type: tuple


class nnabla.utils.data_source.DataSourceWithFileCache(data_source, cache_dir=None, cache_file_name_prefix='cache', shuffle=False, rng=None)

Bases: nnabla.utils.data_source.DataSource

This class contains properties and methods for a data source that can be read from cache files. These are utilized by the data iterator.

  • data_source (DataSource) – Instance of DataSource class which provides data.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates cache files implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • cache_file_name_prefix (str) – Beginning of the filenames of cache files. Default is ‘cache’.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.
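The cache_dir behaviour described above can be pictured with a small standard-library sketch. It is not the NNabla implementation: `build_cache` and the `_0.bin` file suffix are made up for illustration, and the only point is the difference between a temporary cache that is erased and a named directory that is kept.

```python
import os
import tempfile

def build_cache(cache_dir=None, cache_file_name_prefix="cache"):
    """Write one placeholder cache file; return (path, exists_after_finish)."""
    def write(directory):
        # The "_0.bin" suffix is made up for this sketch.
        path = os.path.join(directory, cache_file_name_prefix + "_0.bin")
        with open(path, "wb") as f:
            f.write(b"\x00")
        return path

    if cache_dir is None:
        # No directory given: use a temporary one that is erased on exit.
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = write(tmp_dir)
        return path, os.path.exists(path)

    # Directory given: create it if needed and keep the file afterwards.
    os.makedirs(cache_dir, exist_ok=True)
    path = write(cache_dir)
    return path, os.path.exists(path)

_, kept = build_cache()  # kept is False: the temporary cache is gone
```

With an explicit cache_dir the second element of the result is True, so the files can be reused by a later run.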

property position

Data position in current epoch.


Returns: Data position

Return type: int


property shuffle

Whether dataset is shuffled or not.


Returns: Whether dataset is shuffled.

Return type: bool


property variables

Variable names of the data.


Returns: tuple of Variable names

Return type: tuple


class nnabla.utils.data_source.DataSourceWithMemoryCache(data_source, shuffle=False, rng=None)

Bases: nnabla.utils.data_source.DataSource

This class contains properties and methods for a data source that can be read from a memory cache. These are utilized by the data iterator.

  • data_source (DataSource) – Instance of DataSource class which provides data.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.
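The idea of wrapping a data source with a memory cache can be sketched in plain Python as follows. `SlowSource` and `MemoryCached` are hypothetical classes written for this illustration and do not mirror the internals of DataSourceWithMemoryCache; they only show why a second pass over the data no longer touches the underlying source.

```python
class SlowSource:
    """Hypothetical data source that counts how often it is actually read."""

    def __init__(self, size):
        self.size = size
        self.reads = 0

    def get(self, position):
        self.reads += 1
        return (position * 2,)


class MemoryCached:
    """Keeps every example in a dict after its first read."""

    def __init__(self, source):
        self._source = source
        self._cache = {}

    def get(self, position):
        if position not in self._cache:
            self._cache[position] = self._source.get(position)
        return self._cache[position]


source = SlowSource(3)
cached = MemoryCached(source)
first_epoch = [cached.get(i) for i in range(source.size)]
second_epoch = [cached.get(i) for i in range(source.size)]
```

After both passes `source.reads` is still 3, because the second epoch is served entirely from memory.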

property position

Data position in current epoch.


Returns: Data position

Return type: int


property shuffle

Whether dataset is shuffled or not.


Returns: Whether dataset is shuffled.

Return type: bool


property variables

Variable names of the data.


Returns: tuple of Variable names

Return type: tuple



class nnabla.utils.data_iterator.DataIterator(data_source, batch_size, rng=None, use_thread=True, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Bases: object

Collects data from data_source and yields it in mini-batches.

  • data_source (DataSource) – Instance of DataSource class which provides data for this class.

  • batch_size (int) – Size of data unit.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
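To make the callback contract concrete, here is a minimal stand-alone sketch of a loop that fires both kinds of callbacks. `run_epochs` is a hypothetical function, not part of NNabla, and numbering epochs from 0 is an assumption of the sketch.

```python
def run_epochs(num_epochs, batches_per_epoch,
               epoch_begin_callbacks=(), epoch_end_callbacks=()):
    """Call each callback with the epoch index around every epoch."""
    for epoch in range(num_epochs):
        for callback in epoch_begin_callbacks:
            callback(epoch)
        for _ in range(batches_per_epoch):
            pass  # a mini-batch would be produced here
        for callback in epoch_end_callbacks:
            callback(epoch)


log = []
run_epochs(
    2, 3,
    epoch_begin_callbacks=[lambda epoch: log.append(("begin", epoch))],
    epoch_end_callbacks=[lambda epoch: log.append(("end", epoch))],
)
```

Each callback receives only the epoch index, so anything else it needs (a logger, a learning-rate schedule) has to be captured by the function itself.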

property batch_size

Number of training samples that next() returns.


Returns: Number of training samples.

Return type: int


property epoch

The number of times position returns to zero.

Return type: int



next()

Generates a tuple of data.

For example, if self._variables == ('x', 'y'), this method returns ( [[X] * batch_size], [[Y] * batch_size] ).


Returns: tuple of data for mini-batch in numpy.ndarray.

Return type: tuple
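The shape of the returned tuple can be illustrated without NNabla. In this sketch, `make_batch` and `load_example` are hypothetical names, and plain lists stand in for the numpy.ndarray objects the real iterator returns; the point is only that per-example tuples (x, y) are regrouped into one entry per variable.

```python
def make_batch(load_example, positions):
    """Stack per-example tuples into one tuple with an entry per variable."""
    examples = [load_example(p) for p in positions]
    # zip(*examples) regroups [(x0, y0), (x1, y1), ...] into (xs, ys).
    return tuple(list(column) for column in zip(*examples))


def load_example(position):
    # One example for variables ('x', 'y').
    return ([position, position], position % 2)


x_batch, y_batch = make_batch(load_example, range(3))
```

Here `x_batch` holds three x values and `y_batch` three y values, matching the ( [[X] * batch_size], [[Y] * batch_size] ) layout for batch_size 3.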


property position

Data position in current epoch.


Returns: Data position

Return type: int



Register epoch begin callback.


callback (function) – A function that takes an epoch index as an argument.


Register epoch end callback.


callback (function) – A function that takes an epoch index as an argument.

property size

Data size that DataIterator will generate. This is the largest integer multiple of batch_size not exceeding self._data_source.size().


Returns: Data size

Return type: int
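The size rule stated above is plain integer arithmetic. The helper below, `iterator_size`, is a hypothetical name used only to work through the formula.

```python
def iterator_size(data_source_size, batch_size):
    """Largest integer multiple of batch_size not exceeding data_source_size."""
    return (data_source_size // batch_size) * batch_size


full = iterator_size(1000, 3)   # 333 batches of 3 -> 999
exact = iterator_size(10, 5)    # divides evenly -> 10
```

With 1000 examples and a batch size of 3, one example per epoch does not fit into a full batch, so the size is 999.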


slice(rng, num_of_slices=None, slice_pos=None, slice_start=None, slice_end=None, cache_dir=None, use_cache=False)

Slices the data iterator so that the newly generated data iterator has access to a limited portion of the original data.

  • rng (numpy.random.RandomState) – Random generator for Initializer.

  • num_of_slices (int) – Total number of slices to be made. Must be used together with slice_pos.

  • slice_pos (int) – Position of the slice to be assigned to the new data iterator. Must be used together with num_of_slices.

  • slice_start (int) – Starting position of the range to be sliced into new data iterator. Must be used together with slice_end.

  • slice_end (int) – End position of the range to be sliced into new data iterator. Must be used together with slice_start.

  • cache_dir (str) – Directory to save cache files. If cache_dir is None and use_cache is True, a memory cache is used.

  • use_cache (bool) – Whether to use a cache for data_source.


from nnabla.utils.data_iterator import data_iterator_simple
import numpy as np

def load_func1(index):
    d = np.ones((2, 2)) * index
    return d

di = data_iterator_simple(load_func1, 1000, batch_size=3)

di_s1 = di.slice(None, num_of_slices=10, slice_pos=0)
di_s2 = di.slice(None, num_of_slices=10, slice_pos=1)

di_s3 = di.slice(None, slice_start=100, slice_end=200)
di_s4 = di.slice(None, slice_start=300, slice_end=400)

property variables

Variable names of the data.


Returns: tuple of Variable names

Return type: tuple



nnabla.utils.data_iterator.data_iterator(data_source, batch_size, rng=None, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Helper method to use DataSource.

You can use DataIterator with your own DataSource for easy implementation of data sources.

For example,

ds = YourOwnImplementationOfDataSource()
batch = data_iterator(ds, batch_size)

  • data_source (DataSource) – Instance of DataSource class which provides data.

  • batch_size (int) – Batch size.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates cache files implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.


Returns: Instance of DataIterator.

Return type: DataIterator


nnabla.utils.data_iterator.data_iterator_simple(load_func, num_examples, batch_size, shuffle=False, rng=None, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

A generator that yields mini-batch data as a tuple, as defined in load_func. It can yield mini-batches indefinitely on request, drawing them from the provided data.

  • load_func (function) – Takes a single argument i, an index of an example in your dataset to be loaded, and returns a tuple of data. Every call by any index i must return a tuple of arrays with the same shape.

  • num_examples (int) – Number of examples in your dataset. Random sequence of indexes is generated according to this number.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates cache files implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.


Returns: Instance of DataIterator.

Return type: DataIterator


Here is an example of load_func which returns an image and a label of a classification dataset.

import numpy as np
from nnabla.utils.image_utils import imread

# load_image_paths() and load_labels() stand for your own dataset loaders.
image_paths = load_image_paths()
labels = load_labels()

def my_load_func(i):
    '''
    Returns:
        image: c x h x w array
        label: 0-shape array
    '''
    img = imread(image_paths[i]).astype('float32')
    return np.rollaxis(img, 2), np.array(labels[i])

nnabla.utils.data_iterator.data_iterator_csv_dataset(uri, batch_size, shuffle=False, rng=None, normalize=True, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Get data directly from a dataset provided as a CSV file.

You can read files located on the local file system, http(s) servers or Amazon AWS S3 storage.

For example,

batch = data_iterator_csv_dataset('CSV_FILE.csv', batch_size, shuffle=True)

  • uri (str) – Location of dataset CSV file.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates cache files implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
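The normalize option can be read as a simple rescaling. The sketch below assumes that "normalized by a factor of 255" means dividing each value by 255, which maps 8-bit pixel values into the range 0 to 1; `normalize_sample` is a hypothetical helper and not an NNabla function.

```python
def normalize_sample(values, normalize=True):
    """Divide every value by 255 when normalize is True (assumed meaning)."""
    if not normalize:
        return list(values)
    return [v / 255.0 for v in values]


pixels = [0, 51, 255]
scaled = normalize_sample(pixels)
unscaled = normalize_sample(pixels, normalize=False)
```

If the CSV data is not 8-bit image data, leaving the values unscaled with normalize=False is likely the safer choice.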


Returns: Instance of DataIterator

Return type: DataIterator


nnabla.utils.data_iterator.data_iterator_cache(uri, batch_size, shuffle=False, rng=None, normalize=True, with_memory_cache=True, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Get data from the cache directory.

Cache files are read from the local file system.

For example,

batch = data_iterator_cache('CACHE_DIR', batch_size, shuffle=True)

  • uri (str) – Location of directory with cache files.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.


Returns: Instance of DataIterator

Return type: DataIterator


nnabla.utils.data_iterator.data_iterator_concat_datasets(data_source_list, batch_size, shuffle=False, rng=None, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Get data from multiple datasets.

For example,

batch = data_iterator_concat_datasets([DataSource0, DataSource1, ...], batch_size)

  • data_source_list (list of DataSource) – list of datasets.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates cache files implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
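One way to picture reading from several datasets as if they were one is to map a global position onto a dataset and a local position inside it. The sketch below does this with cumulative sizes; `locate` is a hypothetical helper, and NNabla's own concatenation logic may be organised differently.

```python
import bisect
from itertools import accumulate


def locate(position, sizes):
    """Map a global position to (dataset index, position inside that dataset)."""
    ends = list(accumulate(sizes))  # cumulative end offset of each dataset
    dataset_index = bisect.bisect_right(ends, position)
    start = ends[dataset_index - 1] if dataset_index > 0 else 0
    return dataset_index, position - start


sizes = [3, 2]  # two datasets with 3 and 2 examples
mapping = [locate(p, sizes) for p in range(sum(sizes))]
```

For sizes 3 and 2, global positions 0 to 2 fall in the first dataset and positions 3 and 4 in the second.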


Returns: Instance of DataIterator

Return type: DataIterator
