Data Iterators

NNabla provides several utilities for supplying data to training.

DataSource

class nnabla.utils.data_source.DataSource(shuffle=False, rng=None)

Bases: object

Detailed documentation is available in data_source_design.

class nnabla.utils.data_source.DataSourceWithFileCache(data_source, cache_dir=None, shuffle=False, rng=None)

Bases: nnabla.utils.data_source.DataSource

Detailed documentation is available in data_source_with_file_cache_design.

class nnabla.utils.data_source.DataSourceWithMemoryCache(data_source, shuffle=False, rng=None)

Bases: nnabla.utils.data_source.DataSource

Detailed documentation is available in data_source_with_memory_cache_design.

DataIterator

class nnabla.utils.data_iterator.DataIterator(data_source, batch_size, rng=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Bases: object

Collects data from a data source (see data_source_design) and yields it in batches.

Detailed documentation is available in data_iterator_design.

Parameters:
  • data_source (DataSource) – Instance of the DataSource class which provides data for this class.
  • batch_size (int) – Number of data items in each batch.
  • rng (None or numpy.random.RandomState) – NumPy random number generator.
  • epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
  • epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
batch_size

Number of data items returned by each call to next().

Returns: Number of data items.
Return type: int
epoch

Number of times position has returned to zero.

Returns: Epoch
Return type: int
next()

Generates a tuple of data for one mini-batch.

For example, if self._variables == ('x', 'y'), this method returns ( [[X] * batch_size], [[Y] * batch_size] ).

Returns: Tuple of data for a mini-batch, each element a numpy.ndarray.
Return type: tuple
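
The layout of the returned tuple can be illustrated without NNabla. The sketch below is a plain-Python stand-in, not the library's implementation: collate is a hypothetical helper that regroups per-example tuples into one sequence per variable, which is the layout described for next(). It uses lists for simplicity, whereas next() is documented to return numpy.ndarray objects.

```python
def collate(examples):
    """Regroup [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...])."""
    return tuple(list(column) for column in zip(*examples))


# Three made-up examples for the variables ('x', 'y').
examples = [([0.0, 0.1], 0), ([1.0, 1.1], 1), ([2.0, 2.1], 0)]
batch_x, batch_y = collate(examples)
print(batch_x)  # [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
print(batch_y)  # [0, 1, 0]
```

Each element of the result has batch_size entries along its first axis, one per example.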
position

Data position in current epoch.

Returns: Data position
Return type: int
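
The relationship between position and epoch can be sketched with a toy counter. This is a simplified model under stated assumptions, not NNabla code: PositionCounter and advance are hypothetical names, and the data size is assumed to be an exact multiple of the batch size.

```python
class PositionCounter:
    """Toy model of position/epoch bookkeeping (not NNabla code)."""

    def __init__(self, size, batch_size):
        self.size = size
        self.batch_size = batch_size
        self.position = 0
        self.epoch = 0

    def advance(self):
        """Consume one batch; wrap to zero and count an epoch at the end."""
        self.position += self.batch_size
        if self.position >= self.size:
            self.position = 0
            self.epoch += 1


counter = PositionCounter(size=6, batch_size=2)
history = []
for _ in range(4):
    counter.advance()
    history.append((counter.position, counter.epoch))
print(history)  # [(2, 0), (4, 0), (0, 1), (2, 1)]
```

In this model the epoch count increases exactly when the position wraps back to zero.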
register_epoch_begin_callback(callback)

Registers a callback to be called at the beginning of an epoch.

Parameters: callback (function) – A function that takes an epoch index as an argument.
register_epoch_end_callback(callback)

Registers a callback to be called at the end of an epoch.

Parameters: callback (function) – A function that takes an epoch index as an argument.
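
The callback contract (a function that receives the epoch index) can be illustrated with a small stand-in registry. EpochCallbacks and run_epoch are hypothetical names introduced for this sketch. Only the two register_* method names come from the documentation above, and the order in which NNabla invokes multiple callbacks is an assumption here.

```python
class EpochCallbacks:
    """Toy registry mirroring the two register_* methods (not NNabla code)."""

    def __init__(self):
        self._begin = []
        self._end = []

    def register_epoch_begin_callback(self, callback):
        self._begin.append(callback)

    def register_epoch_end_callback(self, callback):
        self._end.append(callback)

    def run_epoch(self, epoch):
        """Call begin callbacks, then end callbacks, passing the epoch index."""
        for callback in self._begin:
            callback(epoch)
        # Batches would be produced here in a real iterator.
        for callback in self._end:
            callback(epoch)


log = []
registry = EpochCallbacks()
registry.register_epoch_begin_callback(lambda epoch: log.append(("begin", epoch)))
registry.register_epoch_end_callback(lambda epoch: log.append(("end", epoch)))
for epoch in range(2):
    registry.run_epoch(epoch)
print(log)  # [('begin', 0), ('end', 0), ('begin', 1), ('end', 1)]
```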
size

Number of data items that DataIterator will generate. This is the largest integer multiple of batch_size not exceeding self._data_source.size.

Returns: Data size
Return type: int
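
The definition of size reduces to integer arithmetic. The helper below is a hypothetical illustration of that formula, not a function provided by NNabla.

```python
def iterator_size(data_source_size, batch_size):
    """Largest integer multiple of batch_size not exceeding data_source_size."""
    return (data_source_size // batch_size) * batch_size


print(iterator_size(1000, 64))  # 960
print(iterator_size(128, 64))   # 128
print(iterator_size(50, 64))    # 0
```

With 1000 items and a batch size of 64, the value is 960, which is 40 less than the data source size.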
variables

Variable names of the data.

Returns: Tuple of variable names
Return type: tuple

Utilities

nnabla.utils.data_iterator.data_iterator(data_source, batch_size, rng=None, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Helper function that creates a DataIterator from a DataSource.

You can use it to combine DataIterator with your own DataSource implementation.

For example,

ds = YourOwnImplementOfDataSource()

with data_iterator(ds, batch_size) as di:
    for data in di:
        pass  # Replace with code that uses data.
Parameters:
  • data_source (DataSource) – Instance of the DataSource class which provides data.
  • batch_size (int) – Batch size.
  • rng (None or numpy.random.RandomState) – NumPy random number generator.
  • with_memory_cache (bool) – If True, data_source.DataSourceWithMemoryCache is used to wrap data_source. It is a good idea to always set this to True unless data_source already provides in-memory data. Default value is True.
  • with_file_cache (bool) – If True, data_source.DataSourceWithFileCache is used to wrap data_source. Enabling this option is a good idea if data_source is very slow. Default value is False.
  • cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates cache files implicitly in a temporary directory and erases them all when data_iterator finishes. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
  • epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
  • epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
Returns: Instance of DataIterator.
Return type: DataIterator
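
The reason for wrapping a slow data source in a memory cache can be shown with a conceptual sketch. The classes below are hypothetical stand-ins with an invented get(position) interface. They do not reproduce the interface or internals of DataSourceWithMemoryCache, and only illustrate the general idea of loading each item once and serving later requests from memory.

```python
class CountingSource:
    """Hypothetical slow source that counts how often items are loaded."""

    def __init__(self, size):
        self.size = size
        self.loads = 0

    def get(self, position):
        self.loads += 1
        return (position * 10, position % 2)


class MemoryCachedSource:
    """Toy wrapper that memoizes items by position (not NNabla code)."""

    def __init__(self, source):
        self._source = source
        self._cache = {}
        self.size = source.size

    def get(self, position):
        # Load from the wrapped source only on the first request.
        if position not in self._cache:
            self._cache[position] = self._source.get(position)
        return self._cache[position]


slow = CountingSource(size=4)
cached = MemoryCachedSource(slow)
for _ in range(3):  # three passes over the data
    items = [cached.get(i) for i in range(cached.size)]
print(slow.loads)  # 4
```

Three passes over four items trigger only four loads from the underlying source, which is the benefit a cache offers when the source is slow.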

nnabla.utils.data_iterator.data_iterator_simple(load_func, num_examples, batch_size, shuffle=False, rng=None, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

A generator that yields mini-batch data as a tuple, as defined in load_func. It can yield mini-batches indefinitely at your request, drawn from the provided data.

Parameters:
  • load_func (function) – Takes a single argument i, the index of an example in your dataset to be loaded, and returns a tuple of data. Every call, for any index i, must return a tuple of arrays with the same shapes.
  • num_examples (int) – Number of examples in your dataset. A random sequence of indexes is generated according to this number.
Returns: Instance of DataIterator.
Return type: DataIterator

Here is an example of load_func which returns an image and a label of a classification dataset.

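import numpy as np
# scipy.misc.imread was removed in SciPy 1.2; imageio.imread is the
# replacement suggested by the SciPy deprecation notice.
from imageio import imread

# load_image_paths() and load_labels() are placeholders for your own code.
image_paths = load_image_paths()
labels = load_labels()


def my_load_func(i):
    '''
    Returns:
        image: c x h x w array
        label: 0-dimensional array
    '''
    # Read an h x w x c image and convert it to float32.
    img = imread(image_paths[i]).astype('float32')
    # Move the channel axis to the front: h x w x c -> c x h x w.
    return np.rollaxis(img, 2), np.array(labels[i])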
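
The requirement that load_func returns same-shaped arrays for every index can be checked before training. The sketch below is a hypothetical helper, not part of NNabla. To stay free of dependencies it uses 1-D Python lists in place of arrays and compares only their lengths, and synthetic_load_func is a made-up loader used for illustration.

```python
def shapes_of(example):
    """Return the length of each array in an example tuple (1-D lists here)."""
    return tuple(len(array) for array in example)


def check_load_func(load_func, num_examples):
    """Raise ValueError if any index yields shapes differing from index 0."""
    expected = shapes_of(load_func(0))
    for i in range(1, num_examples):
        if shapes_of(load_func(i)) != expected:
            raise ValueError("example %d has inconsistent shapes" % i)
    return expected


def synthetic_load_func(i):
    """Hypothetical loader: a 4-value feature vector and a 1-value label."""
    return [float(i)] * 4, [i % 3]


print(check_load_func(synthetic_load_func, num_examples=10))  # (4, 1)
```

With real arrays, the same check could compare the shape attribute of each element instead of len().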
nnabla.utils.data_iterator.data_iterator_csv_dataset(uri, batch_size, shuffle, rng=None, normalize=True, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Get data directly from a dataset provided as a CSV file.

You can read files located on the local file system, on http(s) servers, or in Amazon S3 storage.

For example,

with data_iterator_csv_dataset('CSV_FILE.csv', batch_size, shuffle=True) as di:
    for data in di:
        pass  # Replace with code that uses data.
Parameters: uri (str) – Location of the dataset CSV file.
Returns: Instance of DataIterator
Return type: DataIterator
nnabla.utils.data_iterator.data_iterator_cache(uri, batch_size, shuffle, rng=None, normalize=True, with_memory_cache=True, epoch_begin_callbacks=[], epoch_end_callbacks=[])

Get data from the cache directory.

Cache files are read from the local file system.

For example,

with data_iterator_cache('CACHE_DIR', batch_size, shuffle=True) as di:
    for data in di:
        pass  # Replace with code that uses data.
Parameters: uri (str) – Location of the directory containing cache files.
Returns: Instance of DataIterator
Return type: DataIterator