Data Iterators
NNabla provides various utilities for using data for training.
DataSource
class nnabla.utils.data_source.DataSource(shuffle=False, rng=None)
Bases: object
This class contains various properties and methods for the data source, which are utilized by DataIterator.
Parameters:
- shuffle (bool) – Indicates whether the dataset is shuffled or not.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
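The interplay of shuffle and rng can be illustrated with a small pure-Python sketch. It does not call NNabla: index_order is a hypothetical helper, and random.Random stands in for numpy.random.RandomState purely so the example runs with the standard library alone.

```python
import random


def index_order(size, shuffle=False, rng=None):
    """Return the order in which sample indexes would be visited (toy model)."""
    indexes = list(range(size))
    if shuffle:
        # Assumption: a fresh generator is created when no rng is supplied.
        if rng is None:
            rng = random.Random()
        rng.shuffle(indexes)
    return indexes


sequential = index_order(5)
shuffled_a = index_order(5, shuffle=True, rng=random.Random(0))
shuffled_b = index_order(5, shuffle=True, rng=random.Random(0))
```

Passing a seeded generator makes the shuffled order reproducible, which is the usual reason for exposing an rng argument.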
class nnabla.utils.data_source.DataSourceWithFileCache(data_source, cache_dir=None, cache_file_name_prefix='cache', shuffle=False, rng=None)
Bases: nnabla.utils.data_source.DataSource
This class contains properties and methods for a data source that can be read from cache files, which are utilized by the data iterator.
Parameters:
- data_source (DataSource) – Instance of DataSource class which provides data.
- cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
- cache_file_name_prefix (str) – Beginning of the filenames of cache files. Default is 'cache'.
- shuffle (bool) – Indicates whether the dataset is shuffled or not.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
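The cache_dir rule described above (implicit caches are erased, caches in a user-supplied directory are kept) can be sketched in pure Python. This is a toy model under stated assumptions, not the NNabla implementation: ToyFileCache, its methods, and the cache file naming scheme are all hypothetical.

```python
import os
import shutil
import tempfile


class ToyFileCache:
    """Toy stand-in for the cache_dir rule; not the NNabla implementation."""

    def __init__(self, cache_dir=None, prefix="cache"):
        # No cache_dir given: create a temporary directory that this object owns.
        self._owns_dir = cache_dir is None
        self.cache_dir = tempfile.mkdtemp() if self._owns_dir else cache_dir
        os.makedirs(self.cache_dir, exist_ok=True)
        self._prefix = prefix

    def write(self, index, text):
        # Hypothetical naming scheme: "<prefix>_<zero-padded index>.txt".
        name = "{}_{:08d}.txt".format(self._prefix, index)
        path = os.path.join(self.cache_dir, name)
        with open(path, "w") as f:
            f.write(text)
        return path

    def close(self):
        # Implicit caches are erased; a user-supplied directory is left alone.
        if self._owns_dir:
            shutil.rmtree(self.cache_dir, ignore_errors=True)


implicit = ToyFileCache()
implicit_path = implicit.write(0, "sample")
implicit.close()

kept_dir = tempfile.mkdtemp()
explicit = ToyFileCache(cache_dir=kept_dir)
explicit_path = explicit.write(0, "sample")
explicit.close()
```

After close(), the implicitly created file is gone while the file under kept_dir remains and could be reused by a later run.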
class nnabla.utils.data_source.DataSourceWithMemoryCache(data_source, shuffle=False, rng=None)
Bases: nnabla.utils.data_source.DataSource
This class contains properties and methods for a data source that can be read from a memory cache, which is utilized by the data iterator.
Parameters:
- data_source (DataSource) – Instance of DataSource class which provides data.
- shuffle (bool) – Indicates whether the dataset is shuffled or not.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
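The idea behind a memory cache can be illustrated with a minimal pure-Python sketch: each sample is loaded from the wrapped source once and served from memory afterwards. ToyMemoryCache and slow_load are hypothetical names, and the real class may organize its cache differently.

```python
class ToyMemoryCache:
    """Toy wrapper that remembers every sample it has loaded once."""

    def __init__(self, load):
        self._load = load      # function mapping a position to a sample
        self._cache = {}       # position -> sample
        self.loads = 0         # number of calls that reached the wrapped source

    def get(self, position):
        if position not in self._cache:
            self._cache[position] = self._load(position)
            self.loads += 1
        return self._cache[position]


def slow_load(i):
    # Stand-in for an expensive read (disk, network, decoding).
    return (i, i * i)


cached = ToyMemoryCache(slow_load)
first_epoch = [cached.get(i) for i in range(4)]
second_epoch = [cached.get(i) for i in range(4)]
```

The second pass over the data never touches slow_load, which is why wrapping a slow source in a memory cache pays off from the second epoch on.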
DataIterator
class nnabla.utils.data_iterator.DataIterator(data_source, batch_size, rng=None, use_thread=True, epoch_begin_callbacks=[], epoch_end_callbacks=[])
Bases: object
Collects data from data_source and yields batches of data.
Parameters:
- data_source (DataSource) – Instance of DataSource class which provides data for this class.
- batch_size (int) – Size of data unit.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
- epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
- epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
batch_size
Number of training samples that next() returns.
Returns: Number of training samples.
Return type: int
epoch
The number of times position() returns to zero.
Returns: epoch
Return type: int
next()
Generates a tuple of data.
For example, if self._variables == ('x', 'y'), this method returns ( [[X] * batch_size], [[Y] * batch_size] ).
Returns: tuple of data for mini-batch in numpy.ndarray.
Return type: tuple
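The layout of the returned tuple (one batched entry per variable) can be sketched in pure Python. collate is a hypothetical helper that uses plain lists where the real method returns numpy.ndarray objects.

```python
def collate(samples):
    """Turn a list of per-sample tuples into one batched sequence per variable."""
    # zip(*samples) transposes [(x0, y0), (x1, y1), ...] into (x0, x1, ...), (y0, y1, ...).
    return tuple(list(column) for column in zip(*samples))


# Three samples of the variables ('x', 'y').
samples = [("x0", 0), ("x1", 1), ("x2", 0)]
batch = collate(samples)
x_batch, y_batch = batch
```

Each element of the tuple has batch_size entries, so the result can be unpacked directly into one name per variable.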
register_epoch_begin_callback(callback)
Register epoch begin callback.
Parameters: callback (function) – A function that takes an epoch index as an argument.
register_epoch_end_callback(callback)
Register epoch end callback.
Parameters: callback (function) – A function that takes an epoch index as an argument.
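How epoch callbacks fire can be modeled with a toy iterator in pure Python. This is only a sketch: ToyIterator is hypothetical, and the exact moment at which the real DataIterator invokes its callbacks is an assumption here.

```python
class ToyIterator:
    """Toy iterator that reports epoch boundaries through callbacks."""

    def __init__(self, size, batch_size):
        self._size = size
        self._batch_size = batch_size
        self._position = 0
        self.epoch = 0
        self._begin_callbacks = []
        self._end_callbacks = []

    def register_epoch_begin_callback(self, callback):
        self._begin_callbacks.append(callback)

    def register_epoch_end_callback(self, callback):
        self._end_callbacks.append(callback)

    def next(self):
        # Assumption: begin callbacks run just before the first batch of an epoch.
        if self._position == 0:
            for callback in self._begin_callbacks:
                callback(self.epoch)
        batch = list(range(self._position, self._position + self._batch_size))
        self._position += self._batch_size
        # Assumption: end callbacks run once the last batch has been produced.
        if self._position >= self._size:
            for callback in self._end_callbacks:
                callback(self.epoch)
            self._position = 0
            self.epoch += 1
        return batch


log = []
it = ToyIterator(size=4, batch_size=2)
it.register_epoch_begin_callback(lambda epoch: log.append(("begin", epoch)))
it.register_epoch_end_callback(lambda epoch: log.append(("end", epoch)))
for _ in range(4):
    it.next()
```

Four calls with two batches per epoch cover two full epochs, so each callback runs twice and receives the epoch indexes 0 and 1.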
size
Data size that DataIterator will generate. This is the largest integer multiple of batch_size not exceeding self._data_source.size().
Returns: Data size
Return type: int
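The size rule is plain integer arithmetic and can be worked through in a few lines. iterator_size is a hypothetical helper written only for this illustration.

```python
def iterator_size(source_size, batch_size):
    """Largest integer multiple of batch_size not exceeding source_size."""
    return (source_size // batch_size) * batch_size


full = iterator_size(1000, 3)   # 333 complete batches
exact = iterator_size(10, 5)    # source size already a multiple of batch_size
tiny = iterator_size(2, 3)      # source smaller than a single batch
```

With 1000 samples and a batch size of 3, one sample per pass does not fit into a complete batch, so the reported size is 999.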
slice(rng, num_of_slices=None, slice_pos=None, slice_start=None, slice_end=None, cache_dir=None)
Slices the data iterator so that the newly generated data iterator has access to a limited portion of the original data.
Parameters:
- rng (numpy.random.RandomState) – Random generator for Initializer.
- num_of_slices (int) – Total number of slices to be made. Must be used together with slice_pos.
- slice_pos (int) – Position of the slice to be assigned to the new data iterator. Must be used together with num_of_slices.
- slice_start (int) – Starting position of the range to be sliced into the new data iterator. Must be used together with slice_end.
- slice_end (int) – End position of the range to be sliced into the new data iterator. Must be used together with slice_start.
- cache_dir (str) – Directory to save cache files.
Example:
    from nnabla.utils.data_iterator import data_iterator_simple
    import numpy as np

    def load_func1(index):
        d = np.ones((2, 2)) * index
        return d

    di = data_iterator_simple(load_func1, 1000, batch_size=3)

    di_s1 = di.slice(None, num_of_slices=10, slice_pos=0)
    di_s2 = di.slice(None, num_of_slices=10, slice_pos=1)

    di_s3 = di.slice(None, slice_start=100, slice_end=200)
    di_s4 = di.slice(None, slice_start=300, slice_end=400)
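The two ways of selecting a slice can be compared with a small pure-Python sketch. It rests on assumptions that the text above does not confirm: slices are taken to be contiguous and equally sized, any remainder is ignored, and the end position is treated as exclusive. slice_range is a hypothetical helper, not part of NNabla.

```python
def slice_range(size, num_of_slices=None, slice_pos=None,
                slice_start=None, slice_end=None):
    """Return the (start, end) index range a slice would cover (toy model)."""
    if num_of_slices is not None and slice_pos is not None:
        if not 0 <= slice_pos < num_of_slices:
            raise ValueError("slice_pos must be in [0, num_of_slices)")
        # Assumption: equal-sized contiguous slices, remainder dropped.
        slice_size = size // num_of_slices
        start = slice_pos * slice_size
        return start, start + slice_size
    if slice_start is not None and slice_end is not None:
        if not 0 <= slice_start < slice_end <= size:
            raise ValueError("need 0 <= slice_start < slice_end <= size")
        return slice_start, slice_end
    raise ValueError(
        "pass either (num_of_slices, slice_pos) or (slice_start, slice_end)")


first_of_ten = slice_range(1000, num_of_slices=10, slice_pos=0)
second_of_ten = slice_range(1000, num_of_slices=10, slice_pos=1)
explicit = slice_range(1000, slice_start=100, slice_end=200)
```

Under these assumptions, the second of ten slices over 1000 samples covers the same range as an explicit slice_start=100, slice_end=200.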
Utilities
nnabla.utils.data_iterator.data_iterator(data_source, batch_size, rng=None, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])
Helper method to use DataSource.
You can use DataIterator with your own DataSource for easy implementation of data sources.
For example,
    ds = YourOwnImplementationOfDataSource()
    batch = data_iterator(ds, batch_size)
Parameters:
- data_source (DataSource) – Instance of DataSource class which provides data.
- batch_size (int) – Batch size.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
- with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
- with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
- cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
- epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
- epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
Returns: Instance of DataIterator.
Return type: DataIterator
nnabla.utils.data_iterator.data_iterator_simple(load_func, num_examples, batch_size, shuffle=False, rng=None, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])
A generator that yields minibatch data as a tuple, as defined in load_func. It can yield an unlimited number of minibatches at your request, queried from the provided data.
Parameters:
- load_func (function) – Takes a single argument i, an index of an example in your dataset to be loaded, and returns a tuple of data. Every call with any index i must return a tuple of arrays with the same shape.
- num_examples (int) – Number of examples in your dataset. A random sequence of indexes is generated according to this number.
- batch_size (int) – Size of data unit.
- shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
- with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
- with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
- cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
- epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
- epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
Returns: Instance of DataIterator.
Return type: DataIterator
Here is an example of load_func which returns an image and a label of a classification dataset.
    import numpy as np
    from nnabla.utils.image_utils import imread

    image_paths = load_image_paths()
    labels = load_labels()

    def my_load_func(i):
        '''
        Returns:
            image: c x h x w array
            label: 0-shape array
        '''
        img = imread(image_paths[i]).astype('float32')
        return np.rollaxis(img, 2), np.array(labels[i])
nnabla.utils.data_iterator.data_iterator_csv_dataset(uri, batch_size, shuffle=False, rng=None, normalize=True, with_memory_cache=True, with_file_cache=True, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])
Get data directly from a dataset provided as a CSV file.
You can read files located on the local file system, http(s) servers or Amazon AWS S3 storage.
For example,
    batch = data_iterator_csv_dataset('CSV_FILE.csv', batch_size, shuffle=True)
Parameters:
- uri (str) – Location of dataset CSV file.
- batch_size (int) – Size of data unit.
- shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
- normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.
- with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
- with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
- cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
- epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
- epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
Returns: Instance of DataIterator
Return type: DataIterator
nnabla.utils.data_iterator.data_iterator_cache(uri, batch_size, shuffle=False, rng=None, normalize=True, with_memory_cache=True, epoch_begin_callbacks=[], epoch_end_callbacks=[])
Get data from the cache directory.
Cache files are read from the local file system.
For example,
    batch = data_iterator_cache('CACHE_DIR', batch_size, shuffle=True)
Parameters:
- uri (str) – Location of directory with cache files.
- batch_size (int) – Size of data unit.
- shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
- normalize (bool) – If True, each sample in the data gets normalized by a factor of 255. Default is True.
- with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
- epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
- epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
Returns: Instance of DataIterator
Return type: DataIterator
nnabla.utils.data_iterator.data_iterator_concat_datasets(data_source_list, batch_size, shuffle=False, rng=None, with_memory_cache=True, with_file_cache=False, cache_dir=None, epoch_begin_callbacks=[], epoch_end_callbacks=[])
Get data from multiple datasets.
For example,
    batch = data_iterator_concat_datasets([DataSource0, DataSource1, ...], batch_size)
Parameters:
- data_source_list (list of DataSource) – List of datasets.
- batch_size (int) – Size of data unit.
- shuffle (bool) – Indicates whether the dataset is shuffled or not. Default value is False.
- rng (None or numpy.random.RandomState) – Numpy random number generator.
- with_memory_cache (bool) – If True, use data_source.DataSourceWithMemoryCache to wrap data_source. It is a good idea to set this to True unless data_source provides on-memory data. Default value is True.
- with_file_cache (bool) – If True, use data_source.DataSourceWithFileCache to wrap data_source. If data_source is slow, enabling this option is a good idea. Default value is False.
- cache_dir (str) – Location of the file cache. If this value is None, data_source.DataSourceWithFileCache creates file caches implicitly in a temporary directory and erases them all when data_iterator is finished. Otherwise, data_source.DataSourceWithFileCache keeps the created cache. Default is None.
- epoch_begin_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the beginning of an epoch.
- epoch_end_callbacks (list of functions) – Each item is a function which takes an epoch index as an argument. These are called at the end of an epoch.
Returns: Instance of DataIterator
Return type: DataIterator
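One way to picture reading from several datasets as if they were one is to map a global position to a (dataset, local position) pair. The pure-Python sketch below does this with cumulative sizes. It is a toy model: ToyConcat is hypothetical, plain lists stand in for DataSource objects, and NNabla may combine sources differently.

```python
import bisect
from itertools import accumulate


class ToyConcat:
    """Toy view over several sequences that behaves like one long sequence."""

    def __init__(self, sources):
        self._sources = sources
        # Cumulative end offsets, e.g. sizes [2, 3] give offsets [2, 5].
        self._offsets = list(accumulate(len(s) for s in sources))

    def __len__(self):
        return self._offsets[-1] if self._offsets else 0

    def __getitem__(self, position):
        if not 0 <= position < len(self):
            raise IndexError(position)
        # First source whose end offset is strictly greater than position.
        source_index = bisect.bisect_right(self._offsets, position)
        start = self._offsets[source_index - 1] if source_index else 0
        return self._sources[source_index][position - start]


combined = ToyConcat([["a0", "a1"], ["b0", "b1", "b2"]])
all_items = [combined[i] for i in range(len(combined))]
```

Positions 0 and 1 resolve to the first list and positions 2 to 4 to the second, so iterating over the combined view visits every sample of every dataset exactly once.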