Dataset class

class dabstract.dataset.dataset.Dataset(paths: list = None, test_only: Optional[bool] = False, **kwargs)

Bases: object

Dataset base class

This is the dataset base class. It essentially is a DictSeqAbstract with additional functionality, such as management for: crossvalidation, feature extraction, example splitting and sample selection.

This class should not be used on its own. It is a base class for other datasets. When using this class as a base for your own dataset, use the following structure:

$ from dabstract.dataset.dataset import Dataset
$
$ class EXAMPLE(Dataset):
$     def __init__(self,
$                  paths=None,
$                  test_only=0,
$                  other=...,
$                  **kwargs):
$         # init dict abstract
$         super().__init__(paths=paths,
$                          test_only=test_only,
$                          **kwargs)
$         # init other variables
$
$     # Data: get data
$     def set_data(self, paths):
$         # set up dataset containing the data and optional lazy mapping and so on
$         # the dataset is essentially a wrapped DictSeqAbstract. All your data
$         # is accessible through self, e.g. len(self), self.add, self.concat, ...
$         self.add('data', ... )
$         self.add('label', ... )
$         return self
$
$     def prepare(self, paths):
$         # prepare data here, i.e. download
$         pass

One is advised to check the examples in dabstract/examples/introduction on how to work with datasets before reading the rest of this help.

To initialise this dataset, the only mandatory argument is paths, and within it paths['feat'] specifically. Paths should be provided as follows:

$   paths={'data': path_to_data,
$          'meta': path_to_meta,
$          'feat': path_to_feat}
$   dataset = EXAMPLE(paths=paths)

The other entries, 'data' and 'meta', are only a suggestion and one can add as many entries as needed. However, it is advised to keep this convention if possible.
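As a minimal sketch of the convention above, the paths dictionary can be assembled from a single root folder. The helper build_paths, the root location and the sub-folder names are assumptions made for illustration; only the keys 'data', 'meta' and 'feat' come from the text above.

```python
import os


def build_paths(root):
    """Return a paths dict following the 'data'/'meta'/'feat' convention."""
    return {
        'data': os.path.join(root, 'data'),      # raw data (hypothetical folder name)
        'meta': os.path.join(root, 'meta'),      # annotations (hypothetical folder name)
        'feat': os.path.join(root, 'features'),  # extracted features (hypothetical folder name)
    }


paths = build_paths(os.path.join('datasets', 'example'))
# paths can then be handed to a Dataset subclass, e.g. EXAMPLE(paths=paths)
```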

The class offers the following key functionality on top of your dataset definition, which can be called by the following methods:

.add - Add another key to the dataset
.add_dict - Add the keys and fields of an existing dataset or DictSeqAbstract to this one
.concat - concat dataset with dataset
.remove - remove key from dataset
.add_map - add mapping to a key
.add_split - add a splitting operation to your dataset
.add_select - apply a selection to your dataset
.add_alias - add an alias to another key
.keys - show the set of keys
.set_active_keys - set an active key
.reset_active_keys - reset the active keys
.unpack - unpack DictSeq to a list representation
.set_data - overwrite this method with yours to set your data
.load_memory - load a particular key into memory
.summary - show a summary of the dataset
.prepare_feat - compute the features and save to disk
.set_xval - set crossvalidation folds
.get_xval_set - get a subdataset given the folds

The full explanation for each method is provided as a docstring at each method.

Parameters
paths : dict or str

Path configuration in the form of a dictionary. For example:

$   paths={ 'data': path_to_data,
$           'meta': path_to_meta,
$           'feat': path_to_feat}
test_only : bool

Specifies whether this dataset is used only for testing or for both training and testing. This is only relevant if multiple datasets are combined and set_xval() is used. For example:

test_only = 0 -> use for both train and test
test_only = 1 -> use only for test
Returns
dataset class
add(key: str, data: Any, info: List[dict] = None, lazy: bool = True, **kwargs) → None

Add key to dataset. Requirement: data should be as long as len(self)

Parameters
key : str

key to add

data : seq/dictseq/np/list

data to add

info : list

additional information that is propagated along with the data

lazy : bool

apply lazily or not

add_alias(key: str, new_key: str) → None

Add an alias to a particular key. This is handy when you want generic names, e.g. data or target, to refer to existing keys of a dataset.

add_dict(data: dict, lazy: bool = True, **kwargs) → None

Add the keys of a dictionary to the existing dataset. Requirement: each item in the dict should be as long as len(self)

Parameters
lazy : bool

let this dict be lazy or not

data : dictseq/dict

dict to add

add_map(key: str, map_fct: Callable, lazy: bool = None) → None

Add a mapping to a key

Parameters
lazy : bool

apply lazily or not

key : str

key to apply the mapping to

map_fct : Callable

function which performs y = f(x)
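To illustrate what a lazy mapping means here, the following is a minimal sketch in plain Python. LazyMapped is a hypothetical stand-in and not the dabstract implementation; it only shows that the function y = f(x) runs when an item is read, not when the mapping is added.

```python
class LazyMapped:
    """Sequence wrapper that applies map_fct only when an item is read."""

    def __init__(self, data, map_fct):
        self._data = data
        self._map_fct = map_fct

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # The mapping is evaluated here, for the requested item only.
        return self._map_fct(self._data[index])


calls = []  # records which inputs the mapping has actually processed


def double(x):
    calls.append(x)
    return 2 * x


mapped = LazyMapped([1, 2, 3], double)
first = mapped[0]  # triggers double(1); items 2 and 3 are untouched
```

A non-lazy mapping would instead apply the function to every item up front and store the results.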

add_select(selector: Any, *arg, parameters: Optional[dict] = <class 'dict'>, eval_data: Any = None, **kwargs) → None

Add a selection to the dataset

This function adds a selector to the dataset. The input can either be a function that performs the selection, or a name/parameters pair that is used to look up that function in dabstract.dataset.select and in the directory specified by os.environ["dabstract_CUSTOM_DIR"]. Custom selector functions can either be passed directly or placed in os.environ["dabstract_CUSTOM_DIR"] / dataset / select.py. Custom functions use the same directory structure as dabstract.

Besides a function one can also directly provide indices.

dabstract already has a set of built-in selectors in dabstract.dataset.select such that one can simply do:

$  self.add_select(random_subsample, parameters=dict(ratio=0.5))

for random subsampling, and:

$  self.add_select(subsample_by_str, parameters=dict(key=..., keep=...))

for selecting based on a key and a particular value. One can also use a lambda function such as:

$  self.add_select((lambda x,k: x['data']['subdb'][k]))

Or directly use indices such as:

$  indices = np.array([0,1,2,3,4])
$  self.add_select(indices)
Parameters
selector : Callable/str/List[int]/np.ndarray

selector defined as a str (translated to a function internally), a function or indices

parameters : dict

additional parameters used to initialise the function/class in case the selector is a str

eval_data : Any

data on which the selector is evaluated in case a function rather than indices is provided. If no eval_data is given, the selector is evaluated on the dataset itself.

arg/kwargs :

additional parameters to provide to the function if needed
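The relation between a selector function and plain indices can be sketched as follows. This is an illustration in plain Python under stated assumptions: selector_to_indices is a hypothetical helper, and a nested dict of lists stands in for the dataset. It assumes the selector takes the data and an index k, as in the lambda example above.

```python
def selector_to_indices(selector, data, length):
    """Evaluate selector(data, k) for every index k and keep the hits."""
    return [k for k in range(length) if selector(data, k)]


# A nested dict of lists stands in for a dataset (assumption for illustration).
data = {'data': {'subdb': ['a', 'b', 'a', 'c']}}

# Keep only the examples whose 'subdb' entry equals 'a'.
keep_a = lambda x, k: x['data']['subdb'][k] == 'a'

indices = selector_to_indices(keep_a, data, 4)
selected = [data['data']['subdb'][k] for k in indices]
```

Passing indices directly skips the evaluation step and is the cheaper option when the selection is already known.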

add_split(split_size: Union[float, int] = None, constraint: Optional[str] = None, type: str = 'seconds', reference_key: str = None, **kwargs) → None

Add a splitting operation to the dataset

This functionality is handy if, for example, you have a dataset with chunks of 60 s while you want examples of 1 s, but you do not want to reformat your entire dataset. The splitting is done lazily, i.e. it is only performed when needed. For this it needs a priori information on the output_shape of each example and the sampling frequency. This is automatically available if you use the FolderDictSeqAbstract data structure, as this adds a DictSeq to your dataset containing filepath, filename, ... and info. The info entry contains the output_shape and sampling rate of your data. This works for folders containing .wav files and for extracted features in numpy format when these were extracted using self.prepare_feat of this class. Internally, this method uses SplitAbstract and SampleReplicateAbstract: keys that include this information are split, while keys that contain only data are replicated according to the splitting rate.

Parameters
split_size : float/int

split size in seconds/samples depending on 'type'

constraint : None/str

Option 'power2' creates sizes that are a power of 2 (used for autoencoders)

type : str

split_size type ('seconds', 'samples')

reference_key : str

if type is 'samples', one needs to provide a reference key to acquire time_step information from.
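The bookkeeping behind lazy splitting can be sketched in plain Python. This is not how SplitAbstract is implemented; splits_per_example and locate are hypothetical helpers, and the sketch assumes that only whole windows are kept. It shows that the number of windows per example and the position of any window can be derived from durations alone, without reading the data.

```python
import bisect
import itertools


def splits_per_example(durations, split_size):
    """Number of whole windows of split_size seconds in each example."""
    return [int(d // split_size) for d in durations]


def locate(split_index, counts):
    """Map a global split index to (example index, window index)."""
    offsets = list(itertools.accumulate(counts))  # cumulative window counts
    example = bisect.bisect_right(offsets, split_index)
    start = offsets[example - 1] if example > 0 else 0
    return example, split_index - start


# Three recordings of 60 s, 30 s and 2.5 s, split into 1 s windows.
counts = splits_per_example([60.0, 30.0, 2.5], split_size=1.0)
total = sum(counts)             # length of the dataset after splitting
position = locate(61, counts)   # which recording and window item 61 maps to
```

Only when an item is actually indexed does the corresponding window need to be read, which is what makes the operation cheap to add.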

concat(data: Dataset, intersect: bool = False, adjust_base: bool = True) → Dataset

Concatenate another dataset with this dataset.

Parameters
data : Dataset

dataset to concatenate

intersect : bool

keep intersection of the two dicts based on the keys

adjust_base : bool

protect the original dataset from adjusting.

Returns
dataset : Dataset class
get_folds() → int

Get the number of folds after .set_xval() has been called

get_unique(key: str, fold: int = None, set: str = None, return_idx=False) → List[Any]

Returns the unique values and the corresponding indices of the examples that belong to each unique group for a particular key/item.

If no fold/set is specified, it returns the unique values and indices for all data. If both are specified, e.g. fold = 1 and set = 'test', it returns those associated with that subset. Note that this only works if xval is initialised with set_xval().

While get_unique(.., return_idx=False) returns the unique values of a dataset, e.g.:

$   print(data['example'])
        [1,2,3,1]
$   print(data.get_unique('example'))
        [1,2,3]

get_unique(.., return_idx=True) also returns the associated indices:

$   print(data.get_unique('example', return_idx=True))
        [[1,2,3], [[0,3],[1],[2]], [[1,2],[3],[4]]]

This is primarily useful for plotting data based on a particular separating variable.
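The grouping idea can be sketched in plain Python. unique_with_indices is a hypothetical helper, not the dabstract implementation, and it assumes zero-based positions. It only covers the unique values and the data indices, not the plot indices.

```python
def unique_with_indices(values):
    """Return sorted unique values and, per value, the indices where it occurs."""
    unique_values = sorted(set(values))
    data_ids = [[i for i, v in enumerate(values) if v == u]
                for u in unique_values]
    return unique_values, data_ids


unique_values, data_ids = unique_with_indices([1, 2, 3, 1])
# Each entry of data_ids can be used to pick the examples of one group,
# e.g. to plot every group in its own colour.
```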

Parameters
key : str

key to get unique values from

fold : int

fold to get unique content of

set : str

set to get unique content of

return_idx : bool

whether to also return the indices corresponding to the unique values

Returns
unique_values : List[np.ndarray]

Unique value ids of that key corresponding to the output of .get_unique(…)

data_ids : List[np.ndarray]

idx of data matching a particular unique_value (only returned if return_idx = True)

plot_ids : List[np.ndarray]

sequential plot idx for a particular unique_value (only returned if return_idx = True)

get_xval_set(set: str = None, fold: int = None, keys: str = 'all', lazy: bool = True, workers: int = 1, buffer_len: int = 3) → dabstract.abstract.abstract.Select

Get a crossvalidation subset of your dataset

This function returns a subdataset of the original one for the requested set and fold

Parameters
set : str

set should be in ('train', 'test', 'val') depending on what the crossvalidation function returned

fold : int

get a particular fold

keys : str

get a subset of the keys, e.g. only input and target

lazy : bool

apply lazily

workers : int

number of workers in case lazy is false

buffer_len : int

buffer length used for multiprocessing in case lazy is false

keys() → None

Show the keys in the dataset

load_memory(key: str, workers: int = 2, buffer_len: int = 2, keep_structure: bool = False, verbose: bool = True) → None

Load data of a particular key into memory

If you want to load some data into memory up front, as this might be the faster option, you can use this function.

Parameters
key : str

key to be loaded in memory

workers : int

number of workers used for loading the data

buffer_len : int

buffer_len of the pool

keep_structure : bool

keep the structure up to a class other than DictSeqAbstract

verbose : bool

provide print feedback

pop(key: str = None) → Any
prepare(paths: Dict[str, str]) → None

Placeholder for the dataset. You can add dataset download ops here.

prepare_feat(key: str, fe_name: str, fe_dp: dabstract.dataprocessor.processing_chain.ProcessingChain, new_key: str = None, overwrite: bool = False, allow_data_pop: bool = True, verbose: bool = True, workers: int = 2, buffer_len: int = 2) → None

Utility function to manage feature saving and loading.

This function manages the feature extraction and loading for you. When you provide a particular feature extraction, it processes the data, saves the result and keeps some information. The next time it is called, it does not compute the features again, but initialises the dataset lazily such that the features are read from disk. If you want them to be read into memory, you can use self.load_memory(key,…). Additionally, it offers multiprocessing when extracting the features. The features are either added as an additional key to your dataset or replace the key containing the source data. Files are written to the following location:

self['path']['feat'] / key / fe_name / …

The files inside that folder have the same structure as the original files, except that they are written as npy files.

It is required that ‘key’ contains a dictionary containing filepath, example, subdb and info in order to make this functionality work. This means that you should use self.add_subdict_from_folder() for the raw data.
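The folder layout described above can be sketched with pathlib. feature_path is a hypothetical helper and the file names are made up for illustration; the sketch assumes that the relative structure of the source folder is mirrored under feat / key / fe_name and that only the file extension changes.

```python
from pathlib import PurePosixPath


def feature_path(feat_root, key, fe_name, data_root, source_file):
    """Mirror source_file (relative to data_root) under feat_root/key/fe_name as .npy."""
    relative = PurePosixPath(source_file).relative_to(data_root)
    return PurePosixPath(feat_root) / key / fe_name / relative.with_suffix('.npy')


# Hypothetical layout: a .wav file in data/room1 and a feature set named 'mel'.
target = feature_path('feat', 'audio', 'mel', 'data', 'data/room1/rec_001.wav')
```

Because the target path is fully determined by key, fe_name and the source file, checking whether it already exists is enough to decide whether a feature still has to be computed.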

Parameters
key : str

key to extract features from.

fe_name : str

the name of the feature extraction, which will be used to define the folder name

fe_dp : ProcessingChain

processing chain applied to the data

new_key : str/None

If None, then key will be overwritten with the data. If a string, then a new key is added to the dataset.

overwrite : bool

overwrite the features that are already saved

workers : int

number of workers used for loading data and extracting features

buffer_len : int

buffer_len of the pool

remove(key: str) → None

Remove a particular key in the dataset

reset_active_key() → None

Reset active keys (DEPRECATED)

reset_active_keys() → None

Reset active keys

set_active_keys(keys: Union[List[str], str]) → None

Set an active key. An active key simply lets a DictSeq mimic a Seq. When indexing a dataset with an integer, it returns a dictionary. In some cases it is desired that it only returns the data from one particular key or a set of keys.
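The effect of an active key can be sketched with a small stand-in class. TinyDictSeq is hypothetical and not the dabstract implementation. In particular, returning a list when several keys are active is an assumption made for this sketch.

```python
class TinyDictSeq:
    """Dictionary of equal-length lists with optional active keys."""

    def __init__(self, **columns):
        self._columns = columns
        self._active = None

    def set_active_keys(self, keys):
        self._active = [keys] if isinstance(keys, str) else list(keys)

    def reset_active_keys(self):
        self._active = None

    def __getitem__(self, index):
        if self._active is None:
            # No active key: behave like a dictionary of all keys.
            return {k: v[index] for k, v in self._columns.items()}
        if len(self._active) == 1:
            # One active key: behave like a plain sequence of that key.
            return self._columns[self._active[0]][index]
        # Several active keys: return their values in order (assumption).
        return [self._columns[k][index] for k in self._active]


seq = TinyDictSeq(data=[10, 20], label=['a', 'b'])
as_dict = seq[0]             # all keys
seq.set_active_keys('data')
as_value = seq[0]            # only the 'data' entry
seq.reset_active_keys()
as_dict_again = seq[0]       # back to all keys
```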

set_data(paths: Dict[str, str]) → None

Placeholder that should be used to set your data in your own dataset class, e.g. using self.add(..) and so on

set_xval(name: Union[str, function, List[int], numpy.ndarray], parameters: Dict = {}, save_path: str = None, overwrite: bool = True) → None

Set the cross-validation folds

This function sets the crossvalidation folds. It works similarly to self.add_select(). You can provide a name/parameters pair where name is a string that refers to a function available in either dabstract.dataset.xval or os.environ["dabstract_CUSTOM_DIR"] / dataset / xval.py. The former is a built-in xval function while the latter lets you add a custom function, which might be added to dabstract later on if validated. Another option is to provide the function directly through 'name'. Finally, it can also save your xval configuration such that it is identical to that of the last experiment or, depending on where you save it, such that the same xval is used for different experiments.

dabstract already has a set of built-in crossvalidation functions in dabstract.dataset.xval such that one can simply do:

$  self.set_xval(group_random_kfold, parameters=dict(folds=4, val_frac=1/3, group_key='group'))

for random crossvalidation with a group constraint, and:

$  self.set_xval(sequential_kfold, parameters=dict(folds=4, val_frac=1/3, group_key='group'))

for sequential crossvalidation with a group constraint, and:

$  self.set_xval(stratified_kfold, parameters=dict(folds=4, val_frac=1/3))

for stratified crossvalidation, and:

$  self.set_xval(random_kfold, parameters=dict(folds=4, val_frac=1/3))

for random crossvalidation.

Parameters
name : Callable/xval_func/str/List[int]/np.ndarray

xval defined as a str (translated to a function internally) or function

parameters : dict

additional parameters used to initialise the function/class in case name is a str

save_path : str

filepath to where to pickle the xval folds

overwrite : bool

overwrite the saved file
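What a crossvalidation function produces can be sketched in plain Python. random_kfold_indices is a hypothetical helper and does not reproduce the functions in dabstract.dataset.xval. It assumes that each fold is used once as the test set and that val_frac is the fraction of the remaining examples that is held out for validation.

```python
import random


def random_kfold_indices(n_examples, folds=4, val_frac=1/3, seed=0):
    """Split indices into folds; each fold serves once as the test set."""
    rng = random.Random(seed)
    order = list(range(n_examples))
    rng.shuffle(order)
    # Deal the shuffled indices round-robin into `folds` chunks.
    chunks = [order[f::folds] for f in range(folds)]
    xval = []
    for f in range(folds):
        test = sorted(chunks[f])
        rest = [i for g, chunk in enumerate(chunks) if g != f for i in chunk]
        n_val = int(len(rest) * val_frac)  # assumption: val_frac applies to non-test data
        xval.append({'test': test,
                     'val': sorted(rest[:n_val]),
                     'train': sorted(rest[n_val:])})
    return xval


xval = random_kfold_indices(12, folds=4, val_frac=1/3)
train_fold0 = xval[0]['train']  # indices a 'train' subset of fold 0 would select
```

Storing such index lists is also what makes it possible to reuse identical folds across experiments.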

summary() → None

Print a dataset summary

unpack(keys: List[str]) → dabstract.abstract.abstract.UnpackAbstract

Unpack the dictionary into a sequence. This function returns a dataset that, when indexed, returns a list containing the items of 'keys' in that order.
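The behaviour described above can be sketched with a small view class. Unpacked is a hypothetical stand-in for UnpackAbstract, and a dict of lists stands in for the dataset.

```python
class Unpacked:
    """View that returns the values of the chosen keys as a list."""

    def __init__(self, columns, keys):
        self._columns = columns
        self._keys = keys

    def __len__(self):
        return len(self._columns[self._keys[0]])

    def __getitem__(self, index):
        # Values come back in the order in which the keys were given.
        return [self._columns[k][index] for k in self._keys]


columns = {'data': [10, 20], 'label': ['a', 'b'], 'group': [0, 1]}
pairs = Unpacked(columns, ['data', 'label'])
first = pairs[0]
```

This form is convenient for training loops that expect (input, target) pairs rather than dictionaries.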