Dataset class¶

class dabstract.dataset.dataset.Dataset(paths: list = None, test_only: Optional[bool] = False, **kwargs)¶

Bases: object

Dataset base class

This is the dataset base class. It essentially is a DictSeqAbstract with additional functionality, such as management for: crossvalidation, feature extraction, example splitting and sample selection.

This class should not be used on its own. It is a base class for other datasets. When using this class as a base for your own dataset, use the following structure:

$ from dabstract.dataset.dataset import Dataset
$
$ class EXAMPLE(Dataset):
$     def __init__(self,
$                  paths=None,
$                  test_only=0,
$                  other=...,
$                  **kwargs):
$         # init dict abstract
$         super().__init__(name=self.__class__.__name__,
$                          paths=paths,
$                          test_only=test_only)
$         # init other variables
$
$     # Data: get data
$     def set_data(self, paths):
$         # set up dataset containing the data and optional lazy mapping and so on
$         # the dataset is essentially a wrapped DictSeqAbstract. All your data is
$         # accessible through self, e.g. len(self), self.add, self.concat, ...
$         self.add('data', ... )
$         self.add('label', ... )
$         return self
$
$     def prepare(self, paths):
$         # prepare data here, e.g. download
$         pass

One is advised to check the examples in dabstract/examples/introduction on how to work with datasets before reading the rest of this help.

To initialise this dataset the only mandatory field is paths and paths[‘feat’] specifically. Paths should be provided as such:

$   paths={'data': path_to_data,
$          'meta': path_to_meta,
$          'feat': path_to_feat}
$   dataset = EXAMPLE(paths={...})

The other entries for ‘data’ and ‘meta’ are just a suggestion and one can add as many as they like. However, it is advised to keep this convention if possible.

The class offers the following key functionality on top of your dataset definition, which can be called by the following methods:

.add - Add another key to the dataset
.add_dict - Add the keys and fields of an existing dataset or DictSeqAbstract to this one
.concat - concat dataset with dataset
.remove - remove key from dataset
.add_map - add mapping to a key
.add_split - add a splitting operation to your dataset
.add_select - apply a selection to your dataset
.add_alias - add an alias to another key
.keys - show the set of keys
.set_active_keys - set an active key
.reset_active_keys - reset the active keys
.unpack - unpack DictSeq to a list representation
.set_data - overwrite this method with yours to set your data
.load_memory - load a particular key into memory
.summary - show a summary of the dataset
.prepare_feat - compute the features and save to disk
.set_xval - set crossvalidation folds
.get_xval_set - get a subdataset given the folds

The full explanation for each method is provided as a docstring at each method.

Parameters

paths (dict or str):

Path configuration in the form of a dictionary. For example:

$   paths={ 'data': path_to_data,
$           'meta': path_to_meta,
$           'feat': path_to_feat}

test_only (bool):

Specifies whether this dataset should be used for testing only or for both training and testing. This is only relevant if multiple datasets are combined and set_xval() is used. For example:

test_only = 0 -> use for both train and test
test_only = 1 -> use only for test

Returns

dataset class

add(key: str, data: Any, info: List[dict] = None, lazy: bool = True, **kwargs) → None¶

Add key to dataset. Requirement: data should be as long as len(self)

Parameters

key (str): key to add
data (seq/dictseq/np/list): data to add
info (list): additional information that will be propagated along with the data
lazy (bool): apply lazily or not

add_alias(key: str, new_key: str) → None¶: Add an alias to a particular key. Handy if you would like to use a dataset and add e.g. data/target referring to something.

add_dict(data: dict, lazy: bool = True, **kwargs) → None¶

Add the keys of a dictionary to the existing dataset. Requirement: each item in the dict should be as long as len(self)

Parameters

lazy (bool): let this dict be lazy or not
data (dictseq/dict): dict to add

add_map(key: str, map_fct: Callable, lazy: bool = None) → None¶

Add a mapping to a key

Parameters

lazy (bool): apply lazily or not
key (str): key to apply the mapping to
map_fct (Callable): fct which performs y = f(x)
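
The difference between a lazy and a non-lazy mapping can be illustrated with a minimal sketch. It does not use dabstract itself: `LazyMap` is a hypothetical stand-in, written only to show that with a lazy mapping y = f(x) is evaluated when an example is indexed rather than when the mapping is added.

```python
from typing import Any, Callable, Sequence


class LazyMap:
    """Hypothetical stand-in for a lazily mapped key (not a dabstract class)."""

    def __init__(self, data: Sequence[Any], map_fct: Callable[[Any], Any]):
        self._data = data
        self._map_fct = map_fct

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, index: int) -> Any:
        # The mapping y = f(x) runs only when an example is requested.
        return self._map_fct(self._data[index])


raw = [1.0, 2.0, 3.0]
lazy = LazyMap(raw, lambda x: 2 * x)  # nothing is computed yet
eager = [2 * x for x in raw]          # non-lazy variant: computed up front
```

Both variants give the same values; the lazy one trades repeated computation for lower memory use, which matters when the source data is read from disk.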

add_select(selector: Any, *arg, parameters: Optional[dict] = <class 'dict'>, eval_data: Any = None, **kwargs) → None¶

Add a selection to the dataset

This function adds a selector to the dataset. The input to this function can either be a function that does the selection or a name/parameter pair that is used to search for that function in dabstract.dataset.select AND in the specified os.environ[“dabstract_CUSTOM_DIR”]. When defining custom selector functions, one can either provide the function directly OR place it in os.environ[“dabstract_CUSTOM_DIR”] / dataset / select.py. Custom functions use the same directory structure as dabstract.

Besides a function one can also directly provide indices.

dabstract already has a set of built-in selectors in dabstract.dataset.select such that one can simply do:

$  self.add_select(random_subsample, parameters=dict(ratio=0.5))

for random subsampling, and:

$  self.add_select(subsample_by_str, parameters=dict(key=..., keep=...))

for selecting based on a key and a particular value. One can also use a lambda function such as:

$  self.add_select((lambda x,k: x['data']['subdb'][k]))

Or directly use indices such as:

$  indices = np.array([0,1,2,3,4])
$  self.add_select(indices)

Parameters

selector (Callable/str/List[int]/np.ndarray): selector defined as a str (translated to fct internally) or function or indices
parameters (dict): additional parameters in case name is a str to init the function/class
eval_data (Any): data on which the selector can be evaluated in case a function is used instead of indices. Note that if no eval_data is provided, the dataset itself is used for the evaluation.
arg/kwargs: additional parameters to provide to the function if needed
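
The two selector styles above (a function evaluated per example, or explicit indices) can be sketched without dabstract. The helpers `resolve_indices` and `apply_selection` are hypothetical names used only for this illustration, and the plain dict of lists stands in for a dataset; how dabstract resolves selectors internally is not shown here.

```python
from typing import Callable, Dict, List, Sequence, Union

Selector = Union[Callable[[Dict[str, list], int], bool], Sequence[int]]


def resolve_indices(data: Dict[str, list], selector: Selector) -> List[int]:
    """Turn a selector (function or explicit indices) into a list of kept indices."""
    n_examples = len(next(iter(data.values())))
    if callable(selector):
        # Evaluate the function for every example index k.
        return [k for k in range(n_examples) if selector(data, k)]
    return [int(k) for k in selector]


def apply_selection(data: Dict[str, list], indices: List[int]) -> Dict[str, list]:
    """Keep only the selected examples for every key."""
    return {key: [values[k] for k in indices] for key, values in data.items()}


data = {"audio": ["a", "b", "c", "d"], "subdb": ["x", "y", "x", "y"]}
by_function = apply_selection(
    data, resolve_indices(data, lambda x, k: x["subdb"][k] == "x")
)
by_indices = apply_selection(data, resolve_indices(data, [0, 1]))
```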

add_split(split_size: Union[float, int] = None, constraint: Optional[str] = None, type: str = 'seconds', reference_key: str = None, **kwargs) → None¶

Add a splitting operation to the dataset

This is handy if, for example, your dataset consists of 60 s chunks while you want 1 s examples, but you do not want to reformat the entire dataset. The splitting is lazy, i.e. it is only performed when needed. For this it needs a priori information on the output_shape of each example and the sampling frequency. This is automatically available IF you use the FolderDictSeqAbstract data structure, as this creates a DictSeq for your dataset containing filepath, filename, .. and info. The info entry contains the output_shape and sampling rate of your data. This works for folders containing .wav files AND for extracted features in the numpy format when these were extracted using self.prepare_feat in this class. This function basically uses SplitAbstract and SampleReplicateAbstract. Keys that include this information will be split, while keys that only include data will be replicated depending on the splitting rate.

split_size (float/int): split size in seconds/samples depending on ‘type’
constraint (None/str): Option ‘power2’ creates sizes that are a power of 2 (used for autoencoders)
type (str): split_size type (‘seconds’,’samples’)
reference_key (str): if samples is set as a size, one needs to provide a key reference to acquire time_step information from.
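
The lazy splitting idea can be sketched as an index table that is built from durations only, with the actual slicing deferred until an example is requested. This is an illustration under stated assumptions, not the SplitAbstract implementation: `build_split_index`, `get_split`, the 4 Hz sampling frequency and the choice to drop an incomplete last chunk are all hypothetical.

```python
from typing import List, Tuple


def build_split_index(durations_s: List[float], split_size_s: float) -> List[Tuple[int, int]]:
    """Map every split example to (source example, chunk number)."""
    index = []
    for source_id, duration in enumerate(durations_s):
        n_chunks = int(duration // split_size_s)  # assumption: drop an incomplete last chunk
        index.extend((source_id, chunk) for chunk in range(n_chunks))
    return index


def get_split(signal: List[float], fs: int, split_size_s: float, chunk: int) -> List[float]:
    """Slice one chunk out of a signal only when it is requested."""
    samples_per_chunk = int(round(split_size_s * fs))
    start = chunk * samples_per_chunk
    return signal[start:start + samples_per_chunk]


fs = 4  # hypothetical sampling frequency in Hz
signals = [list(range(12)), list(range(100, 108))]  # 3 s and 2 s of data
durations = [len(s) / fs for s in signals]
split_index = build_split_index(durations, split_size_s=1.0)

# Keys without timing information (e.g. labels) are replicated per chunk.
labels = ["dog", "cat"]
split_labels = [labels[source_id] for source_id, _ in split_index]

source_id, chunk = split_index[3]
example = get_split(signals[source_id], fs, 1.0, chunk)
```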

concat(data: Dataset, intersect: bool = False, adjust_base: bool = True) → Dataset ¶

Concatenate another dataset with the existing dataset.

Parameters

data (dictseq/dict): dict to add
intersect (bool): keep intersection of the two dicts based on the keys
adjust_base (bool): protect the original dataset from adjusting.

Returns

dataset (Dataset class)

get_folds() → int¶: get the number of folds after .set_xval() is done

get_unique(key: str, fold: int = None, set: str = None, return_idx=False) → List[Any]¶

Returns the unique values and corresponding ids of the examples that belong to a unique group for a particular key/item.

If no fold/set is specified, it will return the unique values and ids for all data. If both are specified, e.g. fold = 1 and set = ‘test’, it will return those associated with that subset. Note that this only works if xval is initialised with set_xval().

While get_unique(.., return_idx=False) returns the unique values of a dataset, e.g.:

$   print(data['example'])
        [1,2,3,1]
$   print(data.get_unique('example'))
        [1,2,3]

get_unique(.., return_idx=True) also returns the associated indices:

$   print(data.get_unique('example', return_idx=True))
        [[1,2,3], [[0,3],[1],[2]], [[1,2],[3],[4]]]

This is primarily useful for plotting data based on a particular separating variable.

Parameters

key (str): key to get unique values from
fold (int): fold to get unique content of
set (str): set to get unique content of
return_idx (bool): returns the idx corresponding to the unique values or not

Returns

unique_values (List[np.ndarray]): Unique value ids of that key corresponding to the output of .get_unique(…)
data_ids (List[np.ndarray]): idx of data matching a particular unique_value (only returned if return_idx = True)
plot_ids (List[np.ndarray]): sequential plot idx for a particular unique_value (only returned if return_idx = True)
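
The relation between the unique values and their data ids can be reproduced with a short stand-alone sketch. `unique_with_ids` is a hypothetical helper, not the dabstract implementation, and it leaves out plot_ids because their exact definition is not given here.

```python
from typing import Any, Dict, List, Tuple


def unique_with_ids(values: List[Any]) -> Tuple[List[Any], List[List[int]]]:
    """Return the sorted unique values and, per value, the indices where it occurs."""
    groups: Dict[Any, List[int]] = {}
    for idx, value in enumerate(values):
        groups.setdefault(value, []).append(idx)
    unique_values = sorted(groups)
    data_ids = [groups[value] for value in unique_values]
    return unique_values, data_ids


unique_values, data_ids = unique_with_ids([1, 2, 3, 1])
```

For the example data `[1,2,3,1]` this groups the first and last example under the value 1, which is the kind of grouping one would use to plot each class separately.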

get_xval_set(set: str = None, fold: int = None, keys: str = 'all', lazy: bool = True, workers: int = 1, buffer_len: int = 3) → dabstract.abstract.abstract.Select ¶

Get a crossvalidation subset of your dataset

This function returns a subdataset of the original one based on the set and the fold you request.

Parameters

set (str): set should be in (‘train’,’test’,’val’) depending on what the crossvalidation fct returned
fold (int): get a particular fold
keys (str): get a subset of the keys, e.g. only input and target
lazy (bool): apply lazily
workers (int): number of workers in case lazy is false
buffer_len (int): used buffer length for multiprocessing in case lazy is false

keys() → None¶: Show the keys in the dataset

load_memory(key: str, workers: int = 2, buffer_len: int = 2, keep_structure: bool = False, verbose: bool = True) → None¶

Load data of a particular key into memory

If you want to load some data in memory in advance, as this might be the faster option, you can use this function.

Parameters

key (str): key to be loaded in memory
workers (int): number of workers used for loading the data
buffer_len (int): buffer_len of the pool
keep_structure (bool): keep structure up another class than DictSeqAbstract
verbose (bool): provide print feedback

pop(key: str = None) → Any¶

prepare(paths: Dict[str, str]) → None¶: Placeholder for the dataset. You can add dataset download ops here.

prepare_feat(key: str, fe_name: str, fe_dp: dabstract.dataprocessor.processing_chain.ProcessingChain, new_key: str = None, overwrite: bool = False, allow_data_pop: bool = True, verbose: bool = True, workers: int = 2, buffer_len: int = 2) → None¶

Utility function to manage feature saving and loading.

This function manages the feature extraction and loading for you. When you provide a particular feature extraction, it processes the data, saves the result and keeps some information. The next time it is called, it does not compute the features again, but initialises the dataset lazily such that the features are read from disk. If you want the features to be loaded in memory, you can use self.load_memory(key,…). Additionally, it offers multi-processing when extracting the features. The features are added as an additional key to your dataset OR replace the key containing the source data. Files are written in the following structure:

self[‘path’][‘feat’] / key / fe_name / …

The files inside that folder have the same structure as the original files, except that they are now written as npy files.

It is required that ‘key’ contains a dictionary containing filepath, example, subdb and info in order to make this functionality work. This means that you should use self.add_subdict_from_folder() for the raw data.

Parameters

key (str): key to extract features from.
fe_name (str): the name of the feature extraction, which will be used to define the foldername
fe_dp (ProcessingChain): processing_chain applied to the data
new_key (str/None): If None, then key will be overwritten with the data. If a string, then a new key is added to the dataset.
overwrite (bool): overwrite the features that are already saved
workers (int): number of workers used for loading data and extracting features
buffer_len (int): buffer_len of the pool
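
The compute-once, read-afterwards behaviour can be sketched with the standard library only. This is an illustration, not the prepare_feat implementation: `cached_feature` and `extract` are hypothetical names, and JSON files are used instead of npy files purely to keep the sketch free of dependencies. The folder layout follows the feat / key / fe_name / … structure described above.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List

calls = 0


def extract(x: List[float]) -> List[float]:
    """Hypothetical feature extractor; counts how often it really runs."""
    global calls
    calls += 1
    return [v * v for v in x]


def cached_feature(feat_root: Path, key: str, fe_name: str, example: str,
                   data: List[float], fe: Callable, overwrite: bool = False) -> List[float]:
    """Compute a feature once and read it back from disk afterwards."""
    target = feat_root / key / fe_name / f"{example}.json"
    if target.exists() and not overwrite:
        return json.loads(target.read_text())
    features = fe(data)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(features))
    return features


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = cached_feature(root, "audio", "squared", "file_0", [1.0, 2.0], extract)
    second = cached_feature(root, "audio", "squared", "file_0", [1.0, 2.0], extract)
```

The second call finds the saved file and skips the extractor, which is the effect the `overwrite` flag is meant to control.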

remove(key: str) → None¶: Remove a particular key in the dataset

reset_active_key() → None¶: Reset active keys (DEPRECATED)

reset_active_keys() → None¶: Reset active keys

set_active_keys(keys: Union[List[str], str]) → None¶: Set an active key. An active key simply lets a DictSeq mimic a Seq. When a dataset is indexed with an integer, it returns a dictionary. In some cases it is desired that it only returns the data from one particular key OR a set of keys.
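
A minimal sketch of the idea behind active keys, assuming a plain dict of lists as storage. `TinyDictSeq` is a hypothetical miniature and not dabstract code; in particular, returning a restricted dict when several keys are active is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional, Union


class TinyDictSeq:
    """Hypothetical miniature of a dict-of-sequences container."""

    def __init__(self, data: Dict[str, List[Any]]):
        self._data = data
        self._active: Optional[List[str]] = None

    def set_active_keys(self, keys: Union[List[str], str]) -> None:
        self._active = [keys] if isinstance(keys, str) else list(keys)

    def reset_active_keys(self) -> None:
        self._active = None

    def __getitem__(self, index: int) -> Any:
        if self._active is None:
            # No active key: integer indexing returns a dictionary.
            return {key: values[index] for key, values in self._data.items()}
        if len(self._active) == 1:
            # One active key: behave like a plain sequence.
            return self._data[self._active[0]][index]
        # Assumption: several active keys yield a dict restricted to those keys.
        return {key: self._data[key][index] for key in self._active}


ds = TinyDictSeq({"data": [0.1, 0.2], "label": ["a", "b"], "group": [7, 8]})
full = ds[0]
ds.set_active_keys("data")
single = ds[0]
ds.set_active_keys(["data", "label"])
subset = ds[1]
ds.reset_active_keys()
restored = ds[1]
```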

set_data(paths: Dict[str, str]) → None¶: Placeholder that should be used to set your data in your own database class, e.g. self.add(..) and so on

set_xval(name: Union[str, function, List[int], numpy.ndarray], parameters: Dict = {}, save_path: str = None, overwrite: bool = True) → None¶

Set the cross-validation folds

This function sets the crossvalidation folds. It works similarly to self.add_select(). You can provide a name/parameters pair, where name is a string that refers to a function available in either dabstract.dataset.xval OR os.environ[“dabstract_CUSTOM_DIR”] / dataset / xval.py. The former is a built-in xval, while the latter lets you add a custom function, which might be added to dabstract later on if validated. Another option is to provide the function directly through ‘name’. Finally, you can save your xval configuration so that it is identical to the last experiment OR, depending on where you save it, use the same xval for different experiments.

dabstract already has a set of built-in crossvalidation functions in dabstract.dataset.xval such that one can simply do:

$  self.set_xval(group_random_kfold, parameters=dict(folds=4, val_frac=1/3, group_key='group'))

for random crossvalidation with a group constraint, and:

$  self.set_xval(sequential_kfold, parameters=dict(folds=4, val_frac=1/3, group_key='group'))

for sequential crossvalidation with a group constraint, and:

$  self.set_xval(stratified_kfold, parameters=dict(folds=4, val_frac=1/3))

for stratified crossvalidation, and:

$  self.set_xval(random_kfold, parameters=dict(folds=4, val_frac=1/3))

for random crossvalidation.

Parameters

name (Callable/xval_func/str/List[int]/np.ndarray): xval defined as a str (translated to fct internally) or function
parameters (dict): additional parameters in case name is a str to init the function/class
save_path (str): filepath to where to pickle the xval folds
overwrite (bool): overwrite the saved file
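
What a group constraint means for the folds can be shown with a small stand-alone sketch. `group_random_kfold_sketch` is a hypothetical function written for this illustration and is not the dabstract built-in of a similar name; it only produces train and test indices and leaves out the validation split controlled by val_frac.

```python
import random
from typing import Dict, List


def group_random_kfold_sketch(groups: List[str], folds: int,
                              seed: int = 0) -> List[Dict[str, List[int]]]:
    """Random k-fold where all examples of a group end up in the same set."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    # Deal the shuffled groups round-robin over the folds.
    test_groups = [set(unique_groups[f::folds]) for f in range(folds)]
    result = []
    for f in range(folds):
        test = [i for i, g in enumerate(groups) if g in test_groups[f]]
        train = [i for i, g in enumerate(groups) if g not in test_groups[f]]
        result.append({"train": train, "test": test})
    return result


groups = ["a", "a", "b", "b", "c", "c", "d", "d"]
xval = group_random_kfold_sketch(groups, folds=4)
```

Each fold keeps a group entirely in train or entirely in test, so examples recorded under the same condition never leak across the split.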

summary() → None¶: Print a dataset summary

unpack(keys: List[str]) → dabstract.abstract.abstract.UnpackAbstract ¶: Unpack the dictionary into a sequence. This function returns a dataset that, when indexed, returns a list containing the items of ‘keys’ in that order.
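
The unpacking behaviour can be sketched as a thin wrapper that returns the requested keys as a list, in the order given. `UnpackSketch` is a hypothetical class for illustration and not the UnpackAbstract implementation.

```python
from typing import Any, Dict, List


class UnpackSketch:
    """Hypothetical wrapper that turns a dict of sequences into a sequence of lists."""

    def __init__(self, data: Dict[str, List[Any]], keys: List[str]):
        self._data = data
        self._keys = list(keys)

    def __len__(self) -> int:
        return len(self._data[self._keys[0]])

    def __getitem__(self, index: int) -> List[Any]:
        # Items are returned in the order in which the keys were given.
        return [self._data[key][index] for key in self._keys]


data = {"data": ["a.wav", "b.wav"], "label": [0, 1], "group": ["g1", "g2"]}
unpacked = UnpackSketch(data, keys=["data", "label"])
x, y = unpacked[1]  # handy for (input, target) pairs
```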