DataAbstract

class dabstract.abstract.abstract.DataAbstract(data: Iterable, output_datatype: str = 'auto', workers: int = 0, buffer_len: int = 3, load_memory: bool = False)

Bases: dabstract.abstract.abstract.Abstract

DataAbstract builds on the function parallel_op to combine parallel processing with arbitrary access to the data. parallel_op only gives you a Generator, but you might want more flexibility, such as indexing your data over a particular range while still having the work parallelized. Additionally, all classes in abstract follow the convention that they can only process one example at a time (i.e., data[0] is possible but data[0:5] is not), so this class is also used simply to add multi-indexing to any iterable and to stack the retrieved data automatically into either a list or, if possible, an np.ndarray.

This class is reused multiple times throughout the Dataset class and is key to the lazy/eager processing flow in this framework.

Consider the following case where you have created a ProcessingChain and you use MapAbstract to have a lazy processor of your data:

$   processor = ProcessingChain().add(some_function).add(another_function)
$   lazy_processed_data = MapAbstract(data, processor)

In this situation you can index lazy_processed_data only one by one. Using DataAbstract, multi-indexing is provided:

$   lazy_processed_data = DataAbstract(lazy_processed_data)
$   lazy_processed_data_subset = lazy_processed_data[0:5]
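The idea of adding multi-indexing on top of a one-at-a-time source can be sketched as follows. This is a minimal illustration using only the standard library, not the dabstract implementation: the names MultiIndexSketch and Squares are hypothetical, and the sketch always returns a list rather than trying to stack into an np.ndarray.

```python
class MultiIndexSketch:
    """Hypothetical wrapper: adds slice and list indexing to a one-at-a-time source."""

    def __init__(self, data):
        self._data = data  # must support len() and integer __getitem__

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Resolve the slice against the length, then fetch item by item.
            return [self._data[i] for i in range(*index.indices(len(self)))]
        if isinstance(index, (list, tuple)):
            return [self._data[i] for i in index]
        # A single integer is passed straight through to the wrapped source.
        return self._data[index]


class Squares:
    """Stand-in for a lazy source that only supports integer indexing."""

    def __len__(self):
        return 10

    def __getitem__(self, i):
        if not isinstance(i, int):
            raise TypeError("only integer indexing is supported")
        return i * i  # computed on access, nothing is stored


wrapped = MultiIndexSketch(Squares())
subset = wrapped[0:5]  # five single-item lookups collected into one list
```

Each element is still computed only when it is requested, so the wrapped source stays lazy.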

By default no multiprocessing is active. You can keep the same workflow as before and add arguments such as the number of workers (default = 0), e.g.,

$   lazy_multiprocessed_data = DataAbstract(lazy_processed_data, workers=5)
$   lazy_multiprocessed_data_subset = lazy_multiprocessed_data[0:5]
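What a workers argument could do when several indices are requested at once can be sketched with the standard library. This is an assumption-laden illustration, not how dabstract or parallel_op is implemented: fetch_many is a hypothetical helper, and a thread pool is used here purely for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(data, indices, workers=0):
    """Hypothetical helper: fetch data[i] for each index, optionally in parallel."""
    if workers <= 0:
        # No pool: plain sequential access in the calling thread.
        return [data[i] for i in indices]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        return list(pool.map(data.__getitem__, indices))


source = [x * 10 for x in range(8)]
sequential = fetch_many(source, range(5))
parallel = fetch_many(source, range(5), workers=5)
```

Both calls return the same examples in the same order; only the way the individual lookups are scheduled differs.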

Similarly, you can use it as a Generator:

$   for example in lazy_multiprocessed_data:
$       do_something(example)

Like the other abstract classes, this class has a .get() method, which lets you pass additional args and kwargs to your abstract function and choose whether or not to return propagated information. See the docstring of that method for more information.

Parameters

data (Iterable)

Iterable object to be parallelised and multi-indexed

output_datatype (str, one of 'auto', 'numpy', 'list')

When multi-indexing (e.g., data[0:5]) it can be handy to automatically stack the examples into an np.ndarray or a list. With 'auto' it always tries to stack them into an np.ndarray first; if that is not feasible because the examples differ in size, it returns a list. With 'numpy' or 'list' it only attempts the former or the latter, respectively.

type (str, one of 'threadpool', 'processpool')

String to select either 'threadpool' or 'processpool'

workers (int)

Number of parallel workers

buffer_len (int)

The length of the buffer in case of a generator:

$   for data in dataset:
$       do_something(data)

This will queue up buffer_len instances of data while do_something() is busy.
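The buffering behaviour described for buffer_len can be sketched as a generator that keeps a bounded number of loads queued ahead of the consumer. This is a minimal sketch under stated assumptions, not the dabstract implementation: buffered_iter is a hypothetical name and a thread pool stands in for whichever pool type is selected.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def buffered_iter(data, workers=1, buffer_len=3):
    """Hypothetical generator: keep at most buffer_len loads queued ahead of the consumer."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = deque()
        for i in range(len(data)):
            pending.append(pool.submit(data.__getitem__, i))
            if len(pending) >= buffer_len:
                # Buffer is full: hand over the oldest example before queuing more.
                yield pending.popleft().result()
        # Drain whatever is still queued once every index has been submitted.
        while pending:
            yield pending.popleft().result()


results = list(buffered_iter(list(range(6)), workers=2, buffer_len=3))
```

Because results are taken from the front of the queue, examples come out in index order even when a later load finishes first.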

Returns

data (Generator): The generator will return Union[Generator, Tuple[Generator, Dict]]. When return_info is True, it returns a tuple of the example and a dictionary containing propagated information. When return_info is False, it returns the example.

get(index: Iterable = None, return_info: bool = False, workers: int = 0, buffer_len: int = 3, return_generator: bool = False, verbose: bool = False, *args: list, **kwargs: Dict) → Any

Parameters

index (Iterable): Indices to retrieve data from
return_info (bool): Return information that has been propagated through a chain of processors and abstracts. For example, if one has used WavDataReader from dabstract.dataprocessor, this will retrieve the sampling frequency ('fs')
workers (int): Number of workers used for loading the data (default = 0)
buffer_len (int): Buffer length of the pool (default = 3)
return_generator (bool): If True, return a generator object with the data; otherwise return the tuple (data, info) if return_info is True, else data (default = False)
verbose (bool): If True, show progress (default = False)
args (List): additional parameters to provide to the function if needed
kwargs (Dict): additional parameters to provide to the function if needed

Returns

data (Any)

When iterating, it will return a Generator. For each sample, the generator will return Union[Generator, Tuple[Generator, Dict]]:

When return_info is True, it returns a tuple of the data and a dictionary containing propagated information. When return_info is False, it returns a Generator.

When indexing the dataset, it will return the following:

When return_info is True, it returns a tuple of a List or np.ndarray and a dictionary containing propagated information. When return_info is False, it returns a List or np.ndarray.
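Whether indexing yields a List or an np.ndarray depends on whether the retrieved examples can be stacked. The rule can be sketched as below, assuming NumPy is installed. stack_examples is a hypothetical helper written for illustration, not a dabstract function, and its output_datatype argument only mirrors the 'auto', 'numpy' and 'list' options of the class.

```python
import numpy as np


def stack_examples(examples, output_datatype="auto"):
    """Hypothetical helper: combine retrieved examples into an array or a list."""
    if output_datatype == "list":
        return list(examples)
    try:
        # np.stack raises ValueError when the examples differ in shape.
        return np.stack([np.asarray(e) for e in examples])
    except ValueError:
        if output_datatype == "numpy":
            raise  # the caller insisted on an array, so surface the failure
        return list(examples)  # 'auto' falls back to a plain list


same_shape = stack_examples([[1, 2], [3, 4]])
ragged = stack_examples([[1, 2], [3]])
as_list = stack_examples([[1, 2], [3, 4]], output_datatype="list")
```

Here same_shape is a 2 x 2 array, while ragged stays a list because its two examples have different lengths.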