Data Iterators
NNabla provides various utilities for using data for training.
DataSource
- class nnabla.utils.data_source.DataSource(shuffle=False, rng=None)
Bases:
object
This class contains various properties and methods for the data source, which are utilized by DataIterator.
- Parameters
shuffle (bool) – Indicates whether the dataset is shuffled or not.
rng (None or numpy.random.RandomState) – Numpy random number generator.
- property shuffle
Whether the dataset is shuffled or not.
- Returns
Whether the dataset is shuffled.
- Return type
bool
- class nnabla.utils.data_source.DataSourceWithFileCache(data_source, cache_dir=None, cache_file_name_prefix='cache', shuffle=False, rng=None)
Bases:
DataSource
This class contains properties and methods for data source that can be read from cache files, which are utilized by data iterator.
- Parameters
data_source (DataSource) – Instance of DataSource class which provides data.
cache_dir (str) – Location of file_cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
cache_file_name_prefix (str) – Beginning of the filenames of cache files. Default is 'cache'.
shuffle (bool) – Indicates whether the dataset is shuffled or not.
rng (None or numpy.random.RandomState) – Numpy random number generator.
- property shuffle
Whether the dataset is shuffled or not.
- Returns
Whether the dataset is shuffled.
- Return type
bool
- class nnabla.utils.data_source.DataSourceWithMemoryCache(data_source, shuffle=False, rng=None)
Bases:
DataSource
This class contains properties and methods for data source that can be read from memory cache, which is utilized by data iterator.
- Parameters
data_source (DataSource) – Instance of DataSource class which provides data.
shuffle (bool) – Indicates whether the dataset is shuffled or not.
rng (None or numpy.random.RandomState) – Numpy random number generator.
- property shuffle
Whether the dataset is shuffled or not.
- Returns
Whether the dataset is shuffled.
- Return type
bool
DataIterator
- class nnabla.utils.data_iterator.DataIterator(data_source, batch_size, rng=None, use_thread=True, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)
Bases:
object
Collects data from data_source and yields batches of data.
- Parameters
data_source (DataSource) – Instance of DataSource class which provides data for this class.
batch_size (int) – Size of data unit.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
- property batch_size
Number of training samples that next() returns.
- Returns
Number of training samples.
- Return type
int
- property epoch
The number of times position() returns to zero.
- Returns
epoch
- Return type
int
- next()
Generates a tuple of data.
For example, if self._variables == ('x', 'y'), this method returns ([[X] * batch_size], [[Y] * batch_size]).
- Returns
tuple of data for mini-batch in numpy.ndarray.
- Return type
tuple
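The tuple layout described for next() can be illustrated without NNabla. The following is a minimal sketch, assuming a hypothetical helper named collate_batch (not an NNabla function) that regroups per-example (x, y) tuples into one sequence per variable. Plain Python lists stand in for numpy.ndarray here.

```python
def collate_batch(examples):
    """Regroup a list of per-example tuples into one list per variable.

    Hypothetical helper for illustration only; it is not part of NNabla.
    """
    # zip(*examples) transposes [(x0, y0), (x1, y1), ...]
    # into (x0, x1, ...), (y0, y1, ...).
    return tuple(list(column) for column in zip(*examples))


# Three examples, each an (x, y) pair, i.e. batch_size == 3.
examples = [([0.0, 0.1], 0), ([1.0, 1.1], 1), ([2.0, 2.1], 0)]
xs, ys = collate_batch(examples)
```

Each element of the returned tuple has batch_size entries, one per example, in the same order as the variables.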
- register_epoch_begin_callback(callback)
Register epoch begin callback.
- Parameters
callback (function) – A function that takes an epoch index as an argument.
- register_epoch_end_callback(callback)
Register epoch end callback.
- Parameters
callback (function) – A function that takes an epoch index as an argument.
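An epoch callback is any callable that accepts the epoch index. The following is a minimal sketch, assuming a hypothetical run_epochs driver that only mimics the calling order described above. With a real DataIterator, the iterator itself invokes the registered callbacks.

```python
def make_logger(tag, log):
    """Return a callback that records (tag, epoch) each time it is called."""
    def callback(epoch):
        log.append((tag, epoch))
    return callback


def run_epochs(num_epochs, begin_callbacks, end_callbacks):
    """Hypothetical stand-in for an iterator loop; this is not NNabla code."""
    for epoch in range(num_epochs):
        for cb in begin_callbacks:
            cb(epoch)  # called at the beginning of an epoch
        # ... the batches of this epoch would be consumed here ...
        for cb in end_callbacks:
            cb(epoch)  # called at the end of an epoch


log = []
run_epochs(2, [make_logger("begin", log)], [make_logger("end", log)])
```

A callback built this way could be passed to register_epoch_begin_callback or register_epoch_end_callback, or supplied in the epoch_begin_callbacks and epoch_end_callbacks lists.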
- property size
Data size that DataIterator will generate. This is the largest integer multiple of batch_size not exceeding self._data_source.size().
- Returns
Data size
- Return type
int
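As a worked instance of this rule, here is a minimal sketch in plain Python. The helper name iterator_size is hypothetical and the numbers are chosen only for illustration.

```python
def iterator_size(num_samples, batch_size):
    """Largest integer multiple of batch_size not exceeding num_samples."""
    return (num_samples // batch_size) * batch_size


# 1000 samples with batch_size 3: 333 full batches fit, one sample is left over.
size = iterator_size(1000, 3)
```

Here size evaluates to 999, so the one leftover sample does not count toward the reported data size.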
- slice(rng, num_of_slices=None, slice_pos=None, slice_start=None, slice_end=None, cache_dir=None, use_cache=False, drop_last=False)
Slices the data iterator so that the newly generated data iterator has access to a limited portion of the original data.
- Parameters
rng (numpy.random.RandomState) – Random generator for Initializer.
num_of_slices (int) – Total number of slices to be made. Must be used together with slice_pos.
slice_pos (int) – Position of the slice to be assigned to the new data iterator. Must be used together with num_of_slices.
slice_start (int) – Starting position of the range to be sliced into new data iterator. Must be used together with slice_end.
slice_end (int) – End position of the range to be sliced into new data iterator. Must be used together with slice_start.
cache_dir (str) – Directory to save cache files. If cache_dir is None and use_cache is True, the memory cache is used.
use_cache (bool) – Whether to use a cache for data_source.
drop_last (bool) – If True, the leftover samples are dropped when the number of samples is not evenly divisible. If False, samples are duplicated so that the number is evenly divisible.
Example:
    from nnabla.utils.data_iterator import data_iterator_simple
    import numpy as np

    def load_func1(index):
        d = np.ones((2, 2)) * index
        return d

    di = data_iterator_simple(load_func1, 1000, batch_size=3)

    di_s1 = di.slice(None, num_of_slices=10, slice_pos=0)
    di_s2 = di.slice(None, num_of_slices=10, slice_pos=1)

    di_s3 = di.slice(None, slice_start=100, slice_end=200)
    di_s4 = di.slice(None, slice_start=300, slice_end=400)
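The effect of drop_last on slice size can be worked out from the parameter description alone. The following is a minimal sketch of that arithmetic, assuming a hypothetical helper named samples_per_slice. It reflects one reading of the description and is not taken from the NNabla implementation, which may differ in which samples are dropped or duplicated.

```python
def samples_per_slice(num_samples, num_of_slices, drop_last=False):
    """Per-slice sample count implied by the drop_last description.

    Hypothetical helper for illustration only; not part of NNabla.
    """
    if drop_last:
        # Leftover samples are dropped: round down.
        return num_samples // num_of_slices
    # Samples are duplicated until the split is even: round up.
    return -(-num_samples // num_of_slices)


even = samples_per_slice(1000, 10)                    # divides evenly
dropped = samples_per_slice(1003, 10, drop_last=True)
padded = samples_per_slice(1003, 10, drop_last=False)
```

When the sample count divides evenly, as with 1000 samples and 10 slices, both settings give the same slice size.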
Utilities
- nnabla.utils.data_iterator.data_iterator(data_source, batch_size, rng=None, use_thread=True, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)
Helper method to use DataSource.
You can use DataIterator with your own DataSource for easy implementation of data sources.
For example,
    ds = YourOwnImplementationOfDataSource()
    batch = data_iterator(ds, batch_size)
- Parameters
data_source (DataSource) – Instance of DataSource class which provides data.
batch_size (int) – Batch size.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
cache_dir (str) – Location of file_cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
- Returns
Instance of DataIterator.
- Return type
DataIterator
- nnabla.utils.data_iterator.data_iterator_simple(load_func, num_examples, batch_size, shuffle=False, rng=None, use_thread=True, with_memory_cache=False, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)
A generator that yields minibatch data as a tuple, as defined in load_func. It can yield minibatches indefinitely on request, drawn from the provided data.
- Parameters
load_func (function) – Takes a single argument i, an index of an example in your dataset to be loaded, and returns a tuple of data. Every call with any index i must return a tuple of arrays with the same shape.
num_examples (int) – Number of examples in your dataset. Random sequence of indexes is generated according to this number.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is False.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
cache_dir (str) – Location of file_cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
- Returns
Instance of DataIterator.
- Return type
DataIterator
Here is an example of load_func which returns an image and a label of a classification dataset.
    import numpy as np
    from nnabla.utils.image_utils import imread

    image_paths = load_image_paths()
    labels = load_labels()

    def my_load_func(i):
        '''
        Returns:
            image: c x h x w array
            label: 0-shape array
        '''
        img = imread(image_paths[i]).astype('float32')
        return np.rollaxis(img, 2), np.array(labels[i])
- nnabla.utils.data_iterator.data_iterator_csv_dataset(uri, batch_size, shuffle=False, rng=None, use_thread=True, normalize=True, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)
Get data directly from a dataset provided as a CSV file.
You can read files located on the local file system, http(s) servers or Amazon AWS S3 storage.
For example,
batch = data_iterator_csv_dataset('CSV_FILE.csv', batch_size, shuffle=True)
- Parameters
uri (str) – Location of dataset CSV file.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is True.
cache_dir (str) – Location of file_cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
- Returns
Instance of DataIterator
- Return type
DataIterator
- nnabla.utils.data_iterator.data_iterator_cache(uri, batch_size, shuffle=False, rng=None, use_thread=True, normalize=True, with_memory_cache=True, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)
Get data from the cache directory.
Cache files are read from the local file system.
For example,
batch = data_iterator_cache('CACHE_DIR', batch_size, shuffle=True)
- Parameters
uri (str) – Location of directory with cache files.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
- Returns
Instance of DataIterator
- Return type
DataIterator
- nnabla.utils.data_iterator.data_iterator_concat_datasets(data_source_list, batch_size, shuffle=False, rng=None, use_thread=True, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[], stop_exhausted=False)
Get data from multiple datasets.
For example,
batch = data_iterator_concat_datasets([DataSource0, DataSource1, ...], batch_size)
- Parameters
data_source_list (list of DataSource) – list of datasets.
batch_size (int) – Size of data unit.
shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
rng (None or numpy.random.RandomState) – Numpy random number generator.
use_thread (bool) – If use_thread is set to True, iterator will use another thread to fetch data. If use_thread is set to False, iterator will use current thread to fetch data.
with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
cache_dir (str) – Location of file_cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
epoch_begin_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
epoch_end_callbacks (list of functions) – An item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
stop_exhausted (bool) – If stop_exhausted is set to False, iterator will be reset so that iteration can be continued. If stop_exhausted is set to True, iterator will raise StopIteration to stop the loop.
- Returns
Instance of DataIterator
- Return type
DataIterator