Data Iterators

NNabla provides various utilities for using data for training.

DataSource

class nnabla.utils.data_source.DataSource(shuffle=False, rng=None)[source]

Bases: object

This class contains various properties and methods for the data source, which are utilized by DataIterator.

Parameters
  • shuffle (bool) – Indicates whether the dataset is shuffled or not.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

property position

Data position in current epoch.

Returns

Data position

Return type

int

property shuffle

Whether dataset is shuffled or not.

Returns

whether dataset is shuffled.

Return type

bool

property variables

Variable names of the data.

Returns

tuple of Variable names

Return type

tuple

class nnabla.utils.data_source.DataSourceWithFileCache(data_source, cache_dir=None, cache_file_name_prefix='cache', shuffle=False, rng=None)[source]

Bases: nnabla.utils.data_source.DataSource

This class contains properties and methods for data source that can be read from cache files, which are utilized by data iterator.

Parameters
  • data_source (DataSource) – Instance of DataSource class which provides data.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • cache_file_name_prefix (str) – Beginning of the filenames of cache files. Default is ‘cache’.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

property position

Data position in current epoch.

Returns

Data position

Return type

int

property shuffle

Whether dataset is shuffled or not.

Returns

whether dataset is shuffled.

Return type

bool

property variables

Variable names of the data.

Returns

tuple of Variable names

Return type

tuple
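
The cache_dir behaviour described above (a temporary cache that is erased versus an explicit directory that is kept) can be pictured with a minimal standard-library sketch. ToyFileCache, its close() method and the pickle file layout are hypothetical illustrations, not NNabla's implementation or cache format.

```python
import os
import pickle
import shutil
import tempfile


class ToyFileCache:
    """Hypothetical stand-in for a file-backed cache; not an NNabla class."""

    def __init__(self, load_func, num_examples, cache_dir=None, prefix="cache"):
        # With cache_dir=None the cache lives in a temporary directory that
        # close() deletes; an explicit cache_dir is left on disk.
        self._owns_dir = cache_dir is None
        self.cache_dir = tempfile.mkdtemp() if self._owns_dir else cache_dir
        os.makedirs(self.cache_dir, exist_ok=True)
        self._paths = []
        for i in range(num_examples):
            path = os.path.join(self.cache_dir, "%s_%08d.pkl" % (prefix, i))
            with open(path, "wb") as f:
                pickle.dump(load_func(i), f)
            self._paths.append(path)

    def get(self, i):
        # Read one example back from its cache file.
        with open(self._paths[i], "rb") as f:
            return pickle.load(f)

    def close(self):
        # Only remove directories this object created itself.
        if self._owns_dir:
            shutil.rmtree(self.cache_dir, ignore_errors=True)


# Temporary cache: removed when the cache is closed.
temp_cache = ToyFileCache(lambda i: (i, i * i), num_examples=4)
cached_item = temp_cache.get(3)
temp_dir = temp_cache.cache_dir
temp_cache.close()
temp_dir_removed = not os.path.exists(temp_dir)

# Explicit cache_dir: files survive close() and could be reused later.
keep_dir = tempfile.mkdtemp()
kept_cache = ToyFileCache(lambda i: (i,), num_examples=2, cache_dir=keep_dir)
kept_cache.close()
kept_dir_exists = os.path.exists(keep_dir)
shutil.rmtree(keep_dir)  # clean up after the demonstration
```

In this sketch the caller decides the cache lifetime simply by passing or omitting cache_dir, which mirrors the documented default of None.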

class nnabla.utils.data_source.DataSourceWithMemoryCache(data_source, shuffle=False, rng=None)[source]

Bases: nnabla.utils.data_source.DataSource

This class contains properties and methods for data source that can be read from memory cache, which is utilized by data iterator.

Parameters
  • data_source (DataSource) – Instance of DataSource class which provides data.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

property position

Data position in current epoch.

Returns

Data position

Return type

int

property shuffle

Whether dataset is shuffled or not.

Returns

whether dataset is shuffled.

Return type

bool

property variables

Variable names of the data.

Returns

tuple of Variable names

Return type

tuple
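
As a rough illustration of why a memory cache helps with a slow data source, the following sketch memoizes a loader by index. ToyMemoryCache and slow_load are hypothetical names, and the real DataSourceWithMemoryCache may buffer data differently.

```python
class ToyMemoryCache:
    """Hypothetical in-memory cache around a per-index loader."""

    def __init__(self, load_func):
        self._load_func = load_func
        self._cache = {}
        self.misses = 0  # counts how often the underlying loader ran

    def get(self, i):
        # Load each index at most once; later requests are served from memory.
        if i not in self._cache:
            self.misses += 1
            self._cache[i] = self._load_func(i)
        return self._cache[i]


def slow_load(i):
    # Placeholder for an expensive read (disk, network, decoding).
    return (i * 2,)


cache = ToyMemoryCache(slow_load)
first = cache.get(5)
second = cache.get(5)
```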

DataIterator

class nnabla.utils.data_iterator.DataIterator(data_source, batch_size, rng=None, use_thread=True, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Bases: object

Collects data from data_source and yields mini-batches of data.

Parameters
  • data_source (DataSource) – Instance of DataSource class which provides data for this class.

  • batch_size (int) – Size of data unit.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.

  • stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
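
The interplay of the epoch callbacks and stop_exhausted described above can be modelled with a small pure-Python iterator. ToyIterator is a hypothetical stand-in, and details such as the exact moment each callback fires or the zero-based epoch index are assumptions of this sketch rather than guarantees about DataIterator.

```python
class ToyIterator:
    """Hypothetical model of the documented options; not NNabla's DataIterator."""

    def __init__(self, data, batch_size, epoch_begin_callbacks=(),
                 epoch_end_callbacks=(), stop_exhausted=False):
        self._data = list(data)
        self._batch_size = batch_size
        self._begin = list(epoch_begin_callbacks)
        self._end = list(epoch_end_callbacks)
        self._stop_exhausted = stop_exhausted
        # Only whole batches are served in each epoch.
        self._size = (len(self._data) // batch_size) * batch_size
        self._finished = False
        self.epoch = 0
        self.position = 0

    def __iter__(self):
        return self

    def __next__(self):
        # stop_exhausted=True: end the loop once an epoch has completed.
        if self._stop_exhausted and self._finished:
            self._finished = False
            raise StopIteration
        if self.position == 0:
            for callback in self._begin:
                callback(self.epoch)
        batch = self._data[self.position:self.position + self._batch_size]
        self.position += self._batch_size
        if self.position >= self._size:
            for callback in self._end:
                callback(self.epoch)
            # stop_exhausted=False: reset so that iteration simply continues.
            self.epoch += 1
            self.position = 0
            self._finished = True
        return batch


events = []
stopping = ToyIterator(
    range(7), 3,
    epoch_begin_callbacks=[lambda e: events.append(("begin", e))],
    epoch_end_callbacks=[lambda e: events.append(("end", e))],
    stop_exhausted=True)
one_epoch = list(stopping)  # the for-loop protocol ends after one epoch

endless = ToyIterator(range(4), 2)  # stop_exhausted defaults to False
four_batches = [next(endless) for _ in range(4)]  # wraps into a second epoch
```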

property batch_size

Number of training samples that next() returns.

Returns

Number of training samples.

Return type

int

property epoch

The number of times position has returned to zero.

Returns

epoch

Return type

int

next()[source]

Generates a tuple of data.

For example, if self._variables == ('x', 'y'), this method returns ( [[X] * batch_size], [[Y] * batch_size] ).

Returns

tuple of data for mini-batch in numpy.ndarray.

Return type

tuple
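
To make the returned layout concrete, the sketch below regroups per-sample tuples into one sequence per variable. make_batch is a hypothetical helper, and plain lists stand in for the numpy.ndarray objects that next() is documented to return.

```python
def make_batch(samples):
    """Turn [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...])."""
    return tuple(list(column) for column in zip(*samples))


variables = ("x", "y")
# Three samples, each holding one value per variable.
samples = [([0.0, 0.1], 0), ([1.0, 1.1], 1), ([2.0, 2.1], 0)]

batch = make_batch(samples)
x_batch, y_batch = batch  # one entry per variable, each of length batch_size
```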

property position

Data position in current epoch.

Returns

Data position

Return type

int

register_epoch_begin_callback(callback)[source]

Register epoch begin callback.

Parameters

callback (function) – A function that takes an epoch index as an argument.

register_epoch_end_callback(callback)[source]

Register epoch end callback.

Parameters

callback (function) – A function that takes an epoch index as an argument.

property size

Data size that DataIterator will generate. This is the largest integer multiple of batch_size not exceeding self._data_source.size().

Returns

Data size

Return type

int
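
The size rule stated above is plain integer arithmetic. The helper name iterator_size is hypothetical and serves only to show the calculation.

```python
def iterator_size(data_source_size, batch_size):
    """Largest integer multiple of batch_size not exceeding data_source_size."""
    return (data_source_size // batch_size) * batch_size


size_uneven = iterator_size(1000, 3)   # one sample is left out per epoch
size_even = iterator_size(1000, 10)    # divides evenly, nothing is dropped
size_too_small = iterator_size(2, 3)   # fewer samples than a single batch
```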

slice(rng, num_of_slices=None, slice_pos=None, slice_start=None, slice_end=None, cache_dir=None, use_cache=False)[source]

Slices the data iterator so that the newly generated data iterator has access to a limited portion of the original data.

Parameters
  • rng (numpy.random.RandomState) – Random generator for Initializer.

  • num_of_slices (int) – Total number of slices to be made. Must be used together with slice_pos.

  • slice_pos (int) – Position of the slice to be assigned to the new data iterator. Must be used together with num_of_slices.

  • slice_start (int) – Starting position of the range to be sliced into new data iterator. Must be used together with slice_end.

  • slice_end (int) – End position of the range to be sliced into new data iterator. Must be used together with slice_start.

  • cache_dir (str) – Directory to save cache files. If cache_dir is None and use_cache is True, a memory cache is used.

  • use_cache (bool) – Whether to use a cache for data_source.

Example:

from nnabla.utils.data_iterator import data_iterator_simple
import numpy as np

def load_func1(index):
    # load_func is expected to return a tuple of arrays, one per variable.
    d = np.ones((2, 2)) * index
    return (d,)

di = data_iterator_simple(load_func1, 1000, batch_size=3)

di_s1 = di.slice(None, num_of_slices=10, slice_pos=0)
di_s2 = di.slice(None, num_of_slices=10, slice_pos=1)

di_s3 = di.slice(None, slice_start=100, slice_end=200)
di_s4 = di.slice(None, slice_start=300, slice_end=400)
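
The two ways of selecting a slice can be sketched as a mapping from arguments to an index range. slice_range is a hypothetical helper, and the equal-width partitioning used for num_of_slices and slice_pos is an assumption; NNabla may distribute remainders differently.

```python
def slice_range(size, num_of_slices=None, slice_pos=None,
                slice_start=None, slice_end=None):
    """Return the (start, end) index range that a slice would cover."""
    if num_of_slices is not None and slice_pos is not None:
        if slice_start is not None or slice_end is not None:
            raise ValueError("use either num_of_slices/slice_pos or slice_start/slice_end")
        if not 0 <= slice_pos < num_of_slices:
            raise ValueError("slice_pos must lie in [0, num_of_slices)")
        width = size // num_of_slices  # assumed: equal-width contiguous slices
        return slice_pos * width, (slice_pos + 1) * width
    if slice_start is not None and slice_end is not None:
        if not 0 <= slice_start < slice_end <= size:
            raise ValueError("invalid slice_start/slice_end")
        return slice_start, slice_end
    raise ValueError("arguments must be given in matching pairs")


second_of_ten = slice_range(1000, num_of_slices=10, slice_pos=1)
explicit = slice_range(1000, slice_start=300, slice_end=400)
```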

property variables

Variable names of the data.

Returns

tuple of Variable names

Return type

tuple

Utilities

nnabla.utils.data_iterator.data_iterator(data_source, batch_size, rng=None, use_thread=True, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Helper method to use DataSource.

You can use DataIterator with your own DataSource for easy implementation of data sources.

For example,

ds = YourOwnImplementationOfDataSource()
batch = data_iterator(ds, batch_size)

Parameters
  • data_source (DataSource) – Instance of DataSource class which provides data.

  • batch_size (int) – Batch size.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source already provides in-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.

  • stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns

Instance of DataIterator.

Return type

DataIterator

nnabla.utils.data_iterator.data_iterator_simple(load_func, num_examples, batch_size, shuffle=False, rng=None, use_thread=True, with_memory_cache=False, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

A generator that yields minibatch data as a tuple, as defined by load_func. It can yield minibatches without limit on request, drawn from the provided data.

Parameters
  • load_func (function) – Takes a single argument i, an index of an example in your dataset to be loaded, and returns a tuple of data. Every call by any index i must return a tuple of arrays with the same shape.

  • num_examples (int) – Number of examples in your dataset. Random sequence of indexes is generated according to this number.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source already provides in-memory data. Default value is False.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.

  • stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns

Instance of DataIterator.

Return type

DataIterator

Here is an example of load_func which returns an image and a label of a classification dataset.

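import numpy as np
from nnabla.utils.image_utils import imread

# load_image_paths() and load_labels() stand for user-defined helpers that
# return a list of image file paths and a list of labels, respectively.
image_paths = load_image_paths()
labels = load_labels()

def my_load_func(i):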
    '''
    Returns:
        image: c x h x w array
        label: 0-shape array
    '''
    img = imread(image_paths[i]).astype('float32')
    return np.rollaxis(img, 2), np.array(labels[i])
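
The roles of the shuffle and rng parameters of data_iterator_simple can be sketched as the choice of index order for one epoch. epoch_order is a hypothetical helper, and the standard-library random.Random is used here as a stand-in for numpy.random.RandomState so that the sketch has no dependencies.

```python
import random


def epoch_order(num_examples, shuffle=False, rng=None):
    """Return the order in which example indexes would be visited in one epoch."""
    order = list(range(num_examples))
    if shuffle:
        if rng is None:
            rng = random.Random()  # unseeded: a different order on each run
        rng.shuffle(order)
    return order


sequential = epoch_order(5)
# Passing generators with the same seed reproduces the same shuffled order.
shuffled_a = epoch_order(10, shuffle=True, rng=random.Random(0))
shuffled_b = epoch_order(10, shuffle=True, rng=random.Random(0))
```

Each index would then be passed to load_func, so a fixed seed makes the sequence of loaded examples repeatable in this model.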

nnabla.utils.data_iterator.data_iterator_csv_dataset(uri, batch_size, shuffle=False, rng=None, use_thread=True, normalize=True, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Get data directly from a dataset provided as a CSV file.

You can read files located on the local file system, http(s) servers or Amazon AWS S3 storage.

For example,

batch = data_iterator_csv_dataset('CSV_FILE.csv', batch_size, shuffle=True)

Parameters
  • uri (str) – Location of dataset CSV file.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.

  • normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source already provides in-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is True.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.

  • stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns

Instance of DataIterator

Return type

DataIterator
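
One plausible reading of the normalize option is that raw 8-bit values are divided by 255 so that they fall in the range 0 to 1. The helper below is a hypothetical illustration of that reading, not NNabla code.

```python
def normalize_sample(values, factor=255.0):
    """Scale raw values by 1/factor (assumed meaning of normalize=True)."""
    return [v / factor for v in values]


raw_pixels = [0, 51, 255]
scaled_pixels = normalize_sample(raw_pixels)
```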

nnabla.utils.data_iterator.data_iterator_cache(uri, batch_size, shuffle=False, rng=None, use_thread=True, normalize=True, with_memory_cache=True, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Get data from the cache directory.

Cache files are read from the local file system.

For example,

batch = data_iterator_cache('CACHE_DIR', batch_size, shuffle=True)

Parameters
  • uri (str) – Location of directory with cache files.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.

  • normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source already provides in-memory data. Default value is True.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.

  • stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns

Instance of DataIterator

Return type

DataIterator

nnabla.utils.data_iterator.data_iterator_concat_datasets(data_source_list, batch_size, shuffle=False, rng=None, use_thread=True, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Get data from multiple datasets.

For example,

batch = data_iterator_concat_datasets([DataSource0, DataSource1, ...], batch_size)

Parameters
  • data_source_list (list of DataSource) – list of datasets.

  • batch_size (int) – Size of data unit.

  • shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.

  • rng (None or numpy.random.RandomState) – Numpy random number generator.

  • use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.

  • with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source already provides in-memory data. Default value is True.

  • with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.

  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.

  • epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.

  • epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.

  • stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns

Instance of DataIterator

Return type

DataIterator
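
One way to picture reading from several datasets as a single one is to map a global example index to a dataset and a local index. The locate helper below is hypothetical, and back-to-back concatenation in list order is an assumption of this sketch; the document does not specify how NNabla combines the sources internally.

```python
import bisect
from itertools import accumulate


def locate(global_index, sizes):
    """Map a global index to (dataset_number, local_index) for datasets laid end to end."""
    offsets = list(accumulate(sizes))  # cumulative end position of each dataset
    if not 0 <= global_index < offsets[-1]:
        raise IndexError("global_index out of range")
    dataset = bisect.bisect_right(offsets, global_index)
    start = offsets[dataset - 1] if dataset else 0
    return dataset, global_index - start


sizes = [3, 5, 2]                  # three datasets with 3, 5 and 2 examples
first_example = locate(0, sizes)
boundary_example = locate(3, sizes)
last_example = locate(9, sizes)
```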