Data Iterators

NNabla provides various utilities for using data for training.

DataSource

class nnabla.utils.data_source.DataSource(shuffle=False, rng=None)[source]

Bases: object

This class contains various properties and methods for the data source, which are utilized by DataIterator.

Parameters:

shuffle (bool) – Indicates whether the dataset is shuffled or not.
rng (None or numpy.random.RandomState) – Numpy random number generator.

apply_order()[source]: This function is called when a batch is finished and an epoch is finished. It is time for update the order if shuffle is true.

property position

Data position in current epoch.

Returns:: Data position
Return type:: int

property shuffle

Whether dataset is shuffled or not.

Returns:: whether dataset is shuffled.
Return type:: bool

property variables

Variable names of the data.

Returns:: tuple of Variable names
Return type:: tuple
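
The interplay of shuffle, rng, position, and apply_order() can be pictured with a small stand-in class. This is a plain-Python sketch and not NNabla's implementation: ToyOrder and its attributes are hypothetical, and random.Random stands in for numpy.random.RandomState.

```python
import random


class ToyOrder:
    """Hypothetical stand-in that tracks a visiting order over `size` samples."""

    def __init__(self, size, shuffle=False, rng=None):
        self.size = size
        self.shuffle = shuffle
        # Fall back to a fixed seed so the sketch is reproducible.
        self.rng = rng if rng is not None else random.Random(0)
        self.order = list(range(size))
        self.position = 0

    def apply_order(self):
        """Start a new pass; reshuffle only when shuffle is enabled."""
        if self.shuffle:
            self.rng.shuffle(self.order)
        self.position = 0


plain = ToyOrder(5, shuffle=False)
plain.apply_order()  # order stays sequential

mixed = ToyOrder(5, shuffle=True, rng=random.Random(42))
mixed.apply_order()  # order becomes a permutation of 0..4
```

Passing an explicit generator, as with the rng parameter above, is what makes a shuffled order repeatable between runs.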

class nnabla.utils.data_source.DataSourceWithFileCache(data_source, cache_dir=None, cache_file_name_prefix='cache', shuffle=False, rng=None)[source]

Bases: DataSource

This class contains properties and methods for a data source that can be read from cache files, which are utilized by the data iterator.

Parameters:

data_source (DataSource) – Instance of DataSource class which provides data.
cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
cache_file_name_prefix (str) – Beginning of the filenames of cache files. Default is ‘cache’.
shuffle (bool) – Indicates whether the dataset is shuffled or not.
rng (None or numpy.random.RandomState) – Numpy random number generator.
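
The two cache_dir behaviours described above (an implicit temporary cache that is erased, versus an explicit directory that is kept) can be sketched with the standard library alone. The helpers open_cache_dir and close_cache_dir are hypothetical names used only for illustration; they are not part of NNabla.

```python
import os
import shutil
import tempfile


def open_cache_dir(cache_dir=None):
    """Return (path, is_temporary) for a cache location.

    Hypothetical helper: a temporary directory is created when no
    cache_dir is given; otherwise the given directory is created if needed.
    """
    if cache_dir is None:
        return tempfile.mkdtemp(prefix="cache_"), True
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir, False


def close_cache_dir(path, is_temporary):
    """Erase an implicit cache; leave an explicitly requested one in place."""
    if is_temporary:
        shutil.rmtree(path, ignore_errors=True)


# Implicit cache: created in a temporary location and removed afterwards.
tmp_path, tmp_flag = open_cache_dir()
close_cache_dir(tmp_path, tmp_flag)
tmp_removed = not os.path.exists(tmp_path)

# Explicit cache: the directory survives the "finish" step.
keep_root = tempfile.mkdtemp()
kept_path, kept_flag = open_cache_dir(os.path.join(keep_root, "my_cache"))
close_cache_dir(kept_path, kept_flag)
kept_exists = os.path.isdir(kept_path)
shutil.rmtree(keep_root)  # clean up after the demonstration
```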

apply_order(): This function is called when a batch is finished and an epoch is finished. It is time for update the order if shuffle is true.

property position

Data position in current epoch.

Returns:: Data position
Return type:: int

property shuffle

Whether dataset is shuffled or not.

Returns:: whether dataset is shuffled.
Return type:: bool

property variables

Variable names of the data.

Returns:: tuple of Variable names
Return type:: tuple

class nnabla.utils.data_source.DataSourceWithMemoryCache(data_source, shuffle=False, rng=None)[source]

Bases: DataSource

This class contains properties and methods for a data source that can be read from a memory cache, which is utilized by the data iterator.

Parameters:

data_source (DataSource) – Instance of DataSource class which provides data.
shuffle (bool) – Indicates whether the dataset is shuffled or not.
rng (None or numpy.random.RandomState) – Numpy random number generator.

apply_order(): This function is called when a batch is finished and an epoch is finished. It is time for update the order if shuffle is true.

property position

Data position in current epoch.

Returns:: Data position
Return type:: int

property shuffle

Whether dataset is shuffled or not.

Returns:: whether dataset is shuffled.
Return type:: bool

property variables

Variable names of the data.

Returns:: tuple of Variable names
Return type:: tuple

DataIterator

class nnabla.utils.data_iterator.DataIterator(data_source, batch_size, rng=None, use_thread=True, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Bases: object

Collects data from data_source and yields batches of data.

Parameters:

data_source (DataSource) – Instance of DataSource class which provides data for this class.
batch_size (int) – Size of data unit.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
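
The effect of stop_exhausted can be illustrated with a toy iterator. This is a minimal sketch under stated assumptions, not the DataIterator implementation: ToyBatchIterator is a hypothetical class that yields index lists instead of data, and it only produces whole batches.

```python
class ToyBatchIterator:
    """Hypothetical iterator over range(size) in fixed-size batches."""

    def __init__(self, size, batch_size, stop_exhausted=False):
        # Only whole batches are produced in this sketch.
        self.size = (size // batch_size) * batch_size
        self.batch_size = batch_size
        self.stop_exhausted = stop_exhausted
        self.position = 0
        self.epoch = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.position >= self.size:
            # One full pass is done: count it and rewind.
            self.epoch += 1
            self.position = 0
            if self.stop_exhausted:
                raise StopIteration
        start = self.position
        self.position += self.batch_size
        return list(range(start, start + self.batch_size))


# stop_exhausted=True: a for loop (or list()) ends after one pass.
finite = list(ToyBatchIterator(6, 2, stop_exhausted=True))

# stop_exhausted=False: the iterator rewinds and keeps producing batches.
endless = ToyBatchIterator(6, 2, stop_exhausted=False)
first_four = [next(endless) for _ in range(4)]
```

With stop_exhausted=False the caller decides when to stop, typically by counting iterations or epochs.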

property batch_size

Number of training samples that next() returns.

Returns:: Number of training samples.
Return type:: int

property epoch

The number of times position() returns to zero.

Returns:: epoch
Return type:: int

next()[source]

Generates a tuple of data.

For example, if self._variables == ('x', 'y'), this method returns ( [[X] * batch_size], [[Y] * batch_size] ).

Returns:: tuple of data for mini-batch in numpy.ndarray.
Return type:: tuple
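
The shape of the value returned by next() (one entry per variable, each holding batch_size items) can be sketched by collating per-sample tuples. This is an illustration only: collate is a hypothetical helper, and plain lists stand in for numpy.ndarray.

```python
def collate(samples):
    """Turn a list of per-sample tuples into a tuple of per-variable lists."""
    return tuple(list(column) for column in zip(*samples))


# Three samples, each an (x, y) pair: x is a 2-element feature, y a label.
samples = [([0.0, 0.1], 0), ([1.0, 1.1], 1), ([2.0, 2.1], 0)]
x_batch, y_batch = collate(samples)
```

Here x_batch has one row per sample and y_batch has one label per sample, mirroring the ( [[X] * batch_size], [[Y] * batch_size] ) layout described above.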

property position

Data position in current epoch.

Returns:: Data position
Return type:: int

register_epoch_begin_callback(callback)[source]

Register epoch begin callback.

Parameters:: callback (function) – A function that takes an epoch index as an argument.

register_epoch_end_callback(callback)[source]

Register epoch end callback.

Parameters:: callback (function) – A function that takes an epoch index as an argument.
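
The callback contract (each registered function receives the epoch index, at the beginning or at the end of an epoch) can be sketched as follows. ToyEpochNotifier and run_epoch are hypothetical names for illustration; only the two register method names are taken from this page.

```python
class ToyEpochNotifier:
    """Hypothetical holder for epoch callbacks, each called with the epoch index."""

    def __init__(self):
        self._begin_callbacks = []
        self._end_callbacks = []

    def register_epoch_begin_callback(self, callback):
        self._begin_callbacks.append(callback)

    def register_epoch_end_callback(self, callback):
        self._end_callbacks.append(callback)

    def run_epoch(self, epoch):
        for callback in self._begin_callbacks:
            callback(epoch)
        # ... the batches of this epoch would be produced here ...
        for callback in self._end_callbacks:
            callback(epoch)


log = []
notifier = ToyEpochNotifier()
notifier.register_epoch_begin_callback(lambda epoch: log.append(("begin", epoch)))
notifier.register_epoch_end_callback(lambda epoch: log.append(("end", epoch)))
for epoch in range(2):
    notifier.run_epoch(epoch)
```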

property size

Data size that DataIterator will generate. This is the largest integer multiple of batch_size not exceeding self._data_source.size().

Returns:: Data size
Return type:: int
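
The size rule stated above (the largest integer multiple of batch_size not exceeding the data source size) is plain integer arithmetic. The function name iterator_size is hypothetical; the sketch only restates the rule.

```python
def iterator_size(source_size, batch_size):
    """Largest integer multiple of batch_size that does not exceed source_size."""
    return (source_size // batch_size) * batch_size


size_a = iterator_size(1000, 3)   # 333 full batches, one sample left over
size_b = iterator_size(1000, 10)  # divides evenly
size_c = iterator_size(5, 8)      # source smaller than one batch
```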

slice(rng, num_of_slices=None, slice_pos=None, slice_start=None, slice_end=None, cache_dir=None, use_cache=False, drop_last=False)[source]

Slices the data iterator so that the newly generated data iterator has access to a limited portion of the original data.

Parameters:

rng (numpy.random.RandomState) – Random generator for Initializer.
num_of_slices (int) – Total number of slices to be made. Must be used together with slice_pos.
slice_pos (int) – Position of the slice to be assigned to the new data iterator. Must be used together with num_of_slices.
slice_start (int) – Starting position of the range to be sliced into new data iterator. Must be used together with slice_end.
slice_end (int) – End position of the range to be sliced into new data iterator. Must be used together with slice_start.
cache_dir (str) – Directory to save cache files. If cache_dir is None and use_cache is True, a memory cache is used.
use_cache (bool) – Whether to use a cache for data_source.
drop_last (bool) – If True, the leftover samples are dropped when the number of samples is not evenly divisible. If False, samples are duplicated so that the number is evenly divisible.

Example:

from nnabla.utils.data_iterator import data_iterator_simple
import numpy as np

def load_func1(index):
    d = np.ones((2, 2)) * index
    return d

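# 1000 examples served in batches of 3.
di = data_iterator_simple(load_func1, 1000, batch_size=3)

# Split the data into 10 slices and take the first two.
di_s1 = di.slice(None, num_of_slices=10, slice_pos=0)
di_s2 = di.slice(None, num_of_slices=10, slice_pos=1)

# Alternatively, select explicit index ranges.
di_s3 = di.slice(None, slice_start=100, slice_end=200)
di_s4 = di.slice(None, slice_start=300, slice_end=400)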
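
The arithmetic behind num_of_slices, slice_pos, and drop_last can be sketched on bare indices. This is an illustration under assumptions, not NNabla's implementation: slice_indices is a hypothetical helper, and the choice of which samples get duplicated when drop_last is False (here, samples from the start of the data) is an assumption.

```python
def slice_indices(num_samples, num_of_slices, slice_pos, drop_last=False):
    """Return the sample indices assigned to slice number slice_pos."""
    indices = list(range(num_samples))
    remainder = num_samples % num_of_slices
    if remainder:
        if drop_last:
            # Drop the leftover samples so the total divides evenly.
            indices = indices[:num_samples - remainder]
        else:
            # Assumption: pad by repeating samples from the start of the data.
            pad = num_of_slices - remainder
            indices += [indices[i % num_samples] for i in range(pad)]
    per_slice = len(indices) // num_of_slices
    start = slice_pos * per_slice
    return indices[start:start + per_slice]


dropped_first = slice_indices(10, 3, 0, drop_last=True)
dropped_last = slice_indices(10, 3, 2, drop_last=True)
padded_last = slice_indices(10, 3, 2, drop_last=False)
```

Either way, every slice ends up with the same number of samples, which is the property both drop_last settings are described as providing.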

property variables

Variable names of the data.

Returns:: tuple of Variable names
Return type:: tuple

Utilities

nnabla.utils.data_iterator.data_iterator(data_source, batch_size, rng=None, use_thread=True, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Helper method to use DataSource.

You can use DataIterator with your own DataSource for easy implementation of data sources.

For example,

ds = YourOwnImplementationOfDataSource()
batch = data_iterator(ds, batch_size)

Parameters:

data_source (DataSource) – Instance of DataSource class which provides data.
batch_size (int) – Batch size.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this as true unless data_source provides on-memory data. Default value is True.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns:

Instance of DataIterator.

Return type:

DataIterator

nnabla.utils.data_iterator.data_iterator_simple(load_func, num_examples, batch_size, shuffle=False, rng=None, use_thread=True, with_memory_cache=False, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

A generator that yields minibatch data as a tuple, as defined in load_func. It can yield minibatches indefinitely at your request, queried from the provided data.

Parameters:

load_func (function) – Takes a single argument i, an index of an example in your dataset to be loaded, and returns a tuple of data. Every call by any index i must return a tuple of arrays with the same shape.
num_examples (int) – Number of examples in your dataset. Random sequence of indexes is generated according to this number.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this as true unless data_source provides on-memory data. Default value is False.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns:

Instance of DataIterator.

Return type:

DataIterator

Here is an example of load_func which returns an image and a label of a classification dataset.

import numpy as np
from nnabla.utils.image_utils import imread
# load_image_paths() and load_labels() are placeholders for your own loading code.
image_paths = load_image_paths()
labels = load_labels()
def my_load_func(i):
    '''
    Returns:
        image: c x h x w array
        label: 0-shape array
    '''
    img = imread(image_paths[i]).astype('float32')
    return np.rollaxis(img, 2), np.array(labels[i])
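
The way a load_func drives minibatch generation can be sketched as a plain generator: indices are visited in order (or shuffled), load_func is called once per index, and the per-sample tuples are grouped per variable. This is a simplified stand-in and not NNabla's data_iterator_simple: simple_batches is a hypothetical name, lists replace arrays, random.Random replaces numpy.random.RandomState, and leftover samples that do not fill a batch are skipped.

```python
import itertools
import random


def simple_batches(load_func, num_examples, batch_size, shuffle=False, rng=None):
    """Yield minibatches forever, built from load_func(i) for each index i."""
    if rng is None:
        rng = random.Random(0)
    order = list(range(num_examples))
    while True:
        if shuffle:
            rng.shuffle(order)
        # Visit only whole batches in each pass.
        for start in range(0, num_examples - batch_size + 1, batch_size):
            samples = [load_func(i) for i in order[start:start + batch_size]]
            # Group the per-sample tuples into one list per variable.
            yield tuple(list(column) for column in zip(*samples))


def toy_load_func(i):
    """Hypothetical loader returning an (input, target) pair for index i."""
    return i, i * i


# Take three batches; the third one comes from the second pass over the data.
batches = list(itertools.islice(simple_batches(toy_load_func, 5, 2), 3))
```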

nnabla.utils.data_iterator.data_iterator_csv_dataset(uri, batch_size, shuffle=False, rng=None, use_thread=True, normalize=True, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Get data directly from a dataset provided as a CSV file.

You can read files located on the local file system, http(s) servers or Amazon AWS S3 storage.

For example,

batch = data_iterator_csv_dataset('CSV_FILE.csv', batch_size, shuffle=True)

Parameters:

uri (str) – Location of dataset CSV file.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this as true unless data_source provides on-memory data. Default value is True.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is True.
cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns:

Instance of DataIterator

Return type:

DataIterator
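
The normalize option can be pictured as a per-value scaling. This sketch assumes that "normalized by a factor of 255" means each value is divided by 255, which maps 8-bit pixel values into the range 0 to 1; normalize_sample is a hypothetical helper, not an NNabla function.

```python
def normalize_sample(values, factor=255.0):
    """Divide every value by the given factor (assumed meaning of normalize)."""
    return [value / factor for value in values]


pixels = [0, 51, 255]  # hypothetical 8-bit pixel values
scaled = normalize_sample(pixels)
```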

nnabla.utils.data_iterator.data_iterator_cache(uri, batch_size, shuffle=False, rng=None, use_thread=True, normalize=True, with_memory_cache=True, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Get data from the cache directory.

Cache files are read from the local file system.

For example,

batch = data_iterator_cache('CACHE_DIR', batch_size, shuffle=True)

Parameters:

uri (str) – Location of directory with cache files.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this as true unless data_source provides on-memory data. Default value is True.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns:

Instance of DataIterator

Return type:

DataIterator

nnabla.utils.data_iterator.data_iterator_concat_datasets(data_source_list, batch_size, shuffle=False, rng=None, use_thread=True, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)[source]

Get data from multiple datasets.

For example,

batch = data_iterator_concat_datasets([DataSource0, DataSource1, ...], batch_size)

Parameters:

data_source_list (list of DataSource) – list of datasets.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this as true unless data_source provides on-memory data. Default value is True.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.

Returns:

Instance of DataIterator

Return type:

DataIterator
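
One way to picture reading from several datasets as a single sequence is to map a global sample index to a dataset number and a local index. This sketch assumes the datasets are laid end to end, as the function name suggests; it is not taken from NNabla's implementation, and locate and the sizes used are hypothetical.

```python
import bisect
from itertools import accumulate


def locate(global_index, sizes):
    """Map an index into the concatenation to (dataset_number, local_index)."""
    ends = list(accumulate(sizes))  # cumulative end offset of each dataset
    if not 0 <= global_index < ends[-1]:
        raise IndexError(global_index)
    dataset = bisect.bisect_right(ends, global_index)
    start = ends[dataset - 1] if dataset else 0
    return dataset, global_index - start


sizes = [3, 5, 2]  # hypothetical sizes of three data sources
first = locate(0, sizes)     # first sample of the first dataset
boundary = locate(3, sizes)  # first sample of the second dataset
last = locate(9, sizes)      # last sample of the third dataset
```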