Dataset class
-
class dabstract.dataset.dataset.Dataset(paths: list = None, test_only: Optional[bool] = False, **kwargs)
Bases: object
Dataset base class
This is the dataset base class. It essentially is a DictSeqAbstract with additional functionality, such as management for: crossvalidation, feature extraction, example splitting and sample selection.
This class should not be used on its own. It is a base class for other datasets. When using this class as a base for your own dataset, use the following structure:
$ class EXAMPLE(Dataset):
$     def __init__(self,
$                  paths=None,
$                  test_only=0,
$                  other=...,
$                  **kwargs):
$         # init dict abstract
$         super().__init__(name=self.__class__.__name__,
$                          paths=paths,
$                          test_only=test_only,
$                          **kwargs)
$         # init other variables
$
$     # Data: get data
$     def set_data(self, paths):
$         # set up dataset containing the data and optional lazy mapping and so on
$         # the dataset is essentially a wrapped DictSeqAbstract. All your data
$         # is accessible through self, e.g. len(self), self.add, self.concat, ...
$         self.add('data', ... )
$         self.add('label', ... )
$         return self
$
$     def prepare(self, paths):
$         # prepare data here, e.g. download
$         pass
One is advised to check the examples in dabstract/examples/introduction on how to work with datasets before reading the rest of this help.
To initialise this dataset, the only mandatory field is paths, and specifically paths[‘feat’]. Paths should be provided as follows:
$ paths={'data': path_to_data,
$        'meta': path_to_meta,
$        'feat': path_to_feat}
$ dataset = EXAMPLE(paths={...})
The other entries, ‘data’ and ‘meta’, are just a suggestion, and one can add as many entries as needed. However, it is advised to keep this convention if possible.
The class offers the following key functionality on top of your dataset definition, which can be called by the following methods:
.add - add another key to the dataset
.add_dict - add the keys and fields of an existing dataset or DictSeqAbstract to this one
.concat - concatenate the dataset with another dataset
.remove - remove a key from the dataset
.add_map - add a mapping to a key
.add_split - add a splitting operation to your dataset
.add_select - apply a selection to your dataset
.add_alias - add an alias to another key
.keys - show the set of keys
.set_active_keys - set an active key
.reset_active_keys - reset the active keys
.unpack - unpack DictSeq to a list representation
.set_data - overwrite this method with yours to set your data
.load_memory - load a particular key into memory
.summary - show a summary of the dataset
.prepare_feat - compute the features and save them to disk
.set_xval - set crossvalidation folds
.get_xval_set - get a subdataset given the folds
The full explanation for each method is provided as a docstring at each method.
- Parameters
- paths: dict or str
Path configuration in the form of a dictionary. For example:
$ paths={'data': path_to_data,
$        'meta': path_to_meta,
$        'feat': path_to_feat}
- test_only: bool
Specifies whether this dataset is used only for testing or for both training and testing. This is only relevant if multiple datasets are combined and set_xval() is used. For example:
test_only = 0 -> use for both train and test
test_only = 1 -> use only for test
- Returns
- dataset class
-
add(key: str, data: Any, info: List[dict] = None, lazy: bool = True, **kwargs) → None
Add a key to the dataset. Requirement: data should be as long as len(self).
- Parameters
- key: str
key to add
- data: seq/dictseq/np/list
data to add
- info: list
additional information that is propagated along with the data
- lazy: bool
apply lazily or not
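The length requirement can be pictured with a small stand-in container. The sketch below is not dabstract code: `TinyDictSeq` is a hypothetical class that only mimics the behaviour described above, namely that every key holds a sequence of the same length and that integer indexing returns a dictionary.

```python
class TinyDictSeq:
    """Hypothetical stand-in for a dictionary of equally long sequences."""

    def __init__(self):
        self._data = {}

    def __len__(self):
        # The dataset length is the length of any of its keys.
        return len(next(iter(self._data.values()))) if self._data else 0

    def add(self, key, data):
        # Enforce the requirement: new data must be as long as len(self).
        if self._data and len(data) != len(self):
            raise ValueError(f"'{key}' has length {len(data)}, expected {len(self)}")
        self._data[key] = list(data)

    def __getitem__(self, index):
        # Integer indexing returns one example as a dictionary.
        return {key: seq[index] for key, seq in self._data.items()}


ds = TinyDictSeq()
ds.add("data", [0.1, 0.2, 0.3])
ds.add("label", ["a", "b", "a"])
print(len(ds))   # 3
print(ds[1])     # {'data': 0.2, 'label': 'b'}
```

Adding a key whose length differs from `len(ds)` raises a `ValueError` in this sketch; how the real class reacts is not specified here.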
-
add_alias(key: str, new_key: str) → None
Add an alias to a particular key. Handy if you would like to use a dataset and add e.g. data/target referring to something.
-
add_dict(data: dict, lazy: bool = True, **kwargs) → None
Add the keys of a dictionary to the existing dataset. Requirement: the length of each item in the dict should be equal to len(self).
- Parameters
- lazy: bool
let this dict be lazy or not
- data: dictseq/dict
dict to add
-
add_map(key: str, map_fct: Callable, lazy: bool = None) → None
Add a mapping to a key
- Parameters
- lazy: bool
apply lazily or not
- key: str
key to apply the mapping to
- map_fct: Callable
function which performs y = f(x)
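What "lazy" means for a mapping can be shown with a minimal sketch. `LazyMap` below is a hypothetical helper, not part of dabstract; it only illustrates the idea that y = f(x) is evaluated when an item is read rather than when the mapping is added.

```python
class LazyMap:
    """Hypothetical lazy view: applies map_fct only when an item is read."""

    def __init__(self, data, map_fct):
        self._data = data
        self._map_fct = map_fct

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # y = f(x) is computed on access, not up front.
        return self._map_fct(self._data[index])


calls = []

def square(x):
    calls.append(x)   # record each evaluation to make the laziness visible
    return x * x

lazy = LazyMap([1, 2, 3, 4], square)
print(calls)     # [] : nothing has been computed yet
print(lazy[2])   # 9
print(calls)     # [3] : only the requested item was mapped

# Non-lazy equivalent: every item is mapped immediately.
eager = [square(x) for x in [1, 2, 3, 4]]
```

The lazy variant trades repeated computation on access for a low up-front cost, which is useful when only part of the data is ever read.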
-
add_select(selector: Any, *arg, parameters: Optional[dict] = <class 'dict'>, eval_data: Any = None, **kwargs) → None
Add a selection to the dataset
This function adds a selector to the dataset. The input can either be a function that performs the selection, or a name/parameter pair that is used to look up that function in dabstract.dataset.select and in the directory specified by os.environ[“dabstract_CUSTOM_DIR”]. Custom selector functions can either be provided directly or placed in os.environ[“dabstract_CUSTOM_DIR”] / dataset / select.py. Custom functions use the same directory structure as dabstract.
Besides a function, one can also directly provide indices.
dabstract already has a set of built-in selectors in dabstract.dataset.select, such that one can simply do:
$ self.add_select(random_subsample, parameters=dict(ratio=0.5))
for random subsampling, and:
$ self.add_select(subsample_by_str, parameters=dict(key=..., keep=...))
for selecting based on a key and a particular value. One can also use a lambda function such as:
$ self.add_select((lambda x,k: x['data']['subdb'][k]))
Or directly use indices such as:
$ indices = np.array([0,1,2,3,4])
$ self.add_select(indices)
- Parameters
- selector: Callable/str/List[int]/np.ndarray
selector defined as a str (translated to a function internally), a function, or indices
- parameters: dict
additional parameters to initialise the function/class in case selector is a str
- eval_data: Any
data on which the selector is evaluated in case a function is used instead of indices. If no eval_data is provided, the dataset itself is used.
- arg/kwargs:
additional parameters to provide to the function if needed
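The two selector forms, a function and explicit indices, can be sketched without dabstract. The `select` helper and the toy dictionary are hypothetical; the `(x, k)` signature is taken from the lambda example above, and the assumption is that the function returns a truthy value for examples that should be kept.

```python
def select(dataset, selector):
    """Return the kept indices for a function selector or for explicit indices."""
    if callable(selector):
        # The function receives the data and an index and returns True to keep it.
        n_examples = len(dataset["label"])
        return [k for k in range(n_examples) if selector(dataset, k)]
    # Otherwise the selector is assumed to already be a sequence of indices.
    return list(selector)


toy = {"data": [10, 11, 12, 13, 14], "label": ["a", "b", "a", "b", "a"]}

keep_a = select(toy, lambda x, k: x["label"][k] == "a")
print(keep_a)                             # [0, 2, 4]
print(select(toy, [0, 1]))                # [0, 1]
print([toy["data"][k] for k in keep_a])   # [10, 12, 14]
```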
-
add_split(split_size: Union[float, int] = None, constraint: Optional[str] = None, type: str = 'seconds', reference_key: str = None, **kwargs) → None
Add a splitting operation to the dataset
This functionality is handy if, for example, you have a dataset with chunks of 60s while you want examples of 1s, but you do not want to reformat your entire dataset. Splitting is done lazily, i.e. it is only performed when needed. For this it needs a priori information on the output_shape of each example and the sampling frequency. This is automatically available if you use the FolderDictSeqAbstract data structure, as this creates a DictSeq for your dataset containing filepath, filename, .. and info. The info entry contains the output_shape and sampling rate of your data. This works for folders containing .wav files and for extracted features in the numpy format when these were extracted using self.prepare_feat in this class. This method basically uses SplitAbstract and SampleReplicateAbstract. Keys that include information are split, while keys that include only data are replicated depending on the splitting rate.
- Parameters
- split_size: float/int
split size in seconds or samples, depending on ‘type’
- constraint: None/str
the option ‘power2’ creates sizes that are a power of 2 (used for autoencoders)
- type: str
split_size type (‘seconds’, ‘samples’)
- reference_key: str
if the size is given in samples, one needs to provide a reference key from which to acquire the time_step information.
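The bookkeeping behind lazy splitting can be sketched as an index table that is built from lengths alone, so no audio has to be read. This is an illustration under stated assumptions, not the SplitAbstract implementation: `split_index` is a hypothetical helper, the sampling rate is a toy value, and trailing remainders shorter than one window are dropped.

```python
def split_index(lengths, split_len):
    """Map each split example to (source index, start, stop) in samples.

    Only whole windows are kept; a trailing remainder is dropped
    (an assumption of this sketch).
    """
    table = []
    for src, n_samples in enumerate(lengths):
        for w in range(n_samples // split_len):
            table.append((src, w * split_len, (w + 1) * split_len))
    return table


fs = 4                         # toy sampling rate in Hz
lengths = [60 * fs, 30 * fs]   # two recordings of 60 s and 30 s
split_len = 1 * fs             # split_size of 1 second, converted to samples

table = split_index(lengths, split_len)
print(len(table))   # 90
print(table[0])     # (0, 0, 4)
print(table[60])    # (1, 0, 4)

# A label stored once per recording is replicated once per window.
labels = ["car", "bus"]
split_labels = [labels[src] for src, _, _ in table]
print(split_labels[59], split_labels[60])   # car bus
```

Reading example `i` then only requires loading samples `start:stop` of recording `src`, which is why the output shape and sampling rate must be known beforehand.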
-
concat(data: Dataset, intersect: bool = False, adjust_base: bool = True) → Dataset
Concatenate another dataset with this dataset.
- Parameters
- data: Dataset
dataset to concatenate
- intersect: bool
keep only the intersection of the keys of the two datasets
- adjust_base: bool
protect the original dataset from adjusting.
- Returns
- dataset: Dataset class
-
get_folds() → int
Get the number of folds after .set_xval() has been called
-
get_unique(key: str, fold: int = None, set: str = None, return_idx=False) → List[Any]
Returns the unique values, and optionally the ids of the examples that belong to each unique group, for a particular key/item.
If no fold/set is specified, it returns the unique values and ids for all data. If both are specified, e.g. fold = 1 and set = ‘test’, it returns those associated with that subset. Note that this only works if xval is initialised with set_xval().
get_unique(.., return_idx=False) returns the unique values of a dataset, e.g.:
$ print(data['example'])
[1,2,3,1]
$ print(data.get_unique('example'))
[1,2,3]
get_unique(.., return_idx=True) also returns the associated indices:
$ print(data.get_unique('example', return_idx=True))
[[1,2,3], [[0,3],[1],[2]], [[1,2],[3],[4]]]
This is primarily useful for plotting data based on a particular separating variable.
- Parameters
- key: str
key to get unique values from
- fold: int
fold to get unique content of
- set: str
set to get unique content of
- return_idx: bool
returns the idx corresponding to the unique values or not
- Returns
- unique_values: List[np.ndarray]
Unique value ids of that key corresponding to the output of .get_unique(…)
- data_ids: List[np.ndarray]
idx of data matching a particular unique_value (optional if return_idx = True)
- plot_ids: List[np.ndarray]
sequential plot idx for a particular unique_value (optional if return_idx = True)
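The relation between unique_values and data_ids can be reproduced with plain Python. `unique_with_idx` is a hypothetical helper that follows the definitions above (unique values, plus the indices of the examples matching each value); it leaves out plot_ids and the fold/set filtering.

```python
def unique_with_idx(values):
    """Return sorted unique values and, per value, the indices where it occurs."""
    unique_values = sorted(set(values))
    data_ids = [[i for i, v in enumerate(values) if v == u] for u in unique_values]
    return unique_values, data_ids


values = [1, 2, 3, 1]
unique_values, data_ids = unique_with_idx(values)
print(unique_values)   # [1, 2, 3]
print(data_ids)        # [[0, 3], [1], [2]]
```

Each inner list of `data_ids` can be used directly to gather the examples of one group, for instance to plot every group in its own colour.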
-
get_xval_set(set: str = None, fold: int = None, keys: str = 'all', lazy: bool = True, workers: int = 1, buffer_len: int = 3) → dabstract.abstract.abstract.Select
Get a crossvalidation subset of your dataset
This function returns a subdataset of the original one, based on the requested set and fold.
- Parameters
- set: str
set should be in (‘train’, ‘test’, ‘val’), depending on what the crossvalidation function returned
- fold: int
get a particular fold
- keys: str
get a subset of the keys, e.g. only input and target
- lazy: bool
apply lazily
- workers: int
number of workers in case lazy is false
- buffer_len: int
buffer length used for multiprocessing in case lazy is false
-
keys() → None
Show the keys in the dataset
-
load_memory(key: str, workers: int = 2, buffer_len: int = 2, keep_structure: bool = False, verbose: bool = True) → None
Load the data of a particular key into memory
Use this function if you want to load some data into memory up front, as this might be the faster option.
- Parameters
- key: str
key to be loaded in memory
- workers: int
number of workers used for loading the data
- buffer_len: int
buffer_len of the pool
- keep_structure: bool
keep structure up another class than DictSeqAbstract
- verbose: bool
provide print feedback
-
pop(key: str = None) → Any
-
prepare(paths: Dict[str, str]) → None
Placeholder for the dataset. You can add dataset download ops here.
-
prepare_feat(key: str, fe_name: str, fe_dp: dabstract.dataprocessor.processing_chain.ProcessingChain, new_key: str = None, overwrite: bool = False, allow_data_pop: bool = True, verbose: bool = True, workers: int = 2, buffer_len: int = 2) → None
Utility function to manage feature saving and loading.
This function manages the feature extraction and loading for you. When you provide a particular feature extraction, it processes the data, saves the features and keeps some information. The next time it is called, it does not compute the features again, but initialises the dataset lazily such that the features are read from disk. If you want the features to be read into memory, you can use self.load_memory(key,…). Additionally, it offers multi-processing when extracting the features. The features are added as an additional key to your dataset or replace the key containing the source data. Files are written to the following location:
self[‘path’][‘feat’] / key / fe_name / …
The files inside that folder have the same structure as the original files, except that they are written as npy files.
It is required that ‘key’ contains a dictionary containing filepath, example, subdb and info in order to make this functionality work. This means that you should use self.add_subdict_from_folder() for the raw data.
- Parameters
- key: str
key to extract features from.
- fe_name: str
the name of the feature extraction, which will be used to define the folder name
- fe_dp: ProcessingChain
processing chain applied to the data
- new_key: str/None
If None, then key will be overwritten with the data. If a string, then a new key is added to the dataset.
- overwrite: bool
overwrite the features that are already saved
- workers: int
number of workers used for loading data and extracting features
- buffer_len: int
buffer_len of the pool
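The compute-once, read-later pattern described for prepare_feat can be sketched with the standard library only. Everything in this block is hypothetical: `cached_features` is not a dabstract function, JSON files stand in for the npy files the class writes, and only the feat / key / fe_name folder layout and the `overwrite` flag are taken from the description above.

```python
import json
import tempfile
from pathlib import Path


def cached_features(feat_root, key, fe_name, examples, extract, overwrite=False):
    """Compute features once, store them under feat_root/key/fe_name, reuse later."""
    out_dir = Path(feat_root) / key / fe_name
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, x in examples.items():
        path = out_dir / f"{name}.json"   # stand-in for an .npy file
        if overwrite or not path.exists():
            path.write_text(json.dumps(extract(x)))
        paths.append(path)
    return paths


extract_calls = []

def mean_feature(x):
    extract_calls.append(1)   # count how often extraction really runs
    return sum(x) / len(x)


with tempfile.TemporaryDirectory() as root:
    raw = {"rec_a": [1, 2, 3], "rec_b": [4, 6]}
    first = cached_features(root, "audio", "mean", raw, mean_feature)
    second = cached_features(root, "audio", "mean", raw, mean_feature)
    loaded = [json.loads(p.read_text()) for p in second]

print(len(extract_calls))   # 2 : the second call reused the files on disk
print(loaded)               # [2.0, 5.0]
```

Keying the cache on both the data key and the extraction name lets several feature sets for the same raw data live side by side.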
-
remove(key: str) → None
Remove a particular key in the dataset
-
reset_active_key() → None
Reset active keys (DEPRECATED)
-
reset_active_keys() → None
Reset active keys
-
set_active_keys(keys: Union[List[str], str]) → None
Set an active key. An active key simply lets a DictSeq mimic a Seq. When integer indexing a dataset, it returns a dictionary. In some cases it is desired that it only returns the data from one particular key or a set of keys.
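The effect of active keys on indexing can be pictured as follows. `ActiveKeyView` is a hypothetical class written for this illustration; in particular, returning a restricted dictionary when several keys are active is an assumption of the sketch and may differ from what dabstract returns.

```python
class ActiveKeyView:
    """Hypothetical dict-of-sequences with an optional set of active keys."""

    def __init__(self, data):
        self._data = data
        self._active = None   # None means: all keys are active

    def set_active_keys(self, keys):
        self._active = keys

    def reset_active_keys(self):
        self._active = None

    def __getitem__(self, index):
        if self._active is None:
            # Default: one example as a dictionary over all keys.
            return {k: v[index] for k, v in self._data.items()}
        if isinstance(self._active, str):
            # Single active key: behave like a plain sequence.
            return self._data[self._active][index]
        # Several active keys: dictionary restricted to those keys (assumption).
        return {k: self._data[k][index] for k in self._active}


view = ActiveKeyView({"data": [1.0, 2.0], "label": ["a", "b"], "group": [7, 8]})
print(view[0])   # {'data': 1.0, 'label': 'a', 'group': 7}
view.set_active_keys("data")
print(view[0])   # 1.0
view.set_active_keys(["data", "label"])
print(view[1])   # {'data': 2.0, 'label': 'b'}
view.reset_active_keys()
```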
-
set_data(paths: Dict[str, str]) → None
Placeholder that should be used to set your data in your own dataset class, e.g. using self.add(..) and so on.
-
set_xval(name: Union[str, function, List[int], numpy.ndarray], parameters: Dict = {}, save_path: str = None, overwrite: bool = True) → None
Set the cross-validation folds
This function sets the crossvalidation folds. It works similarly to self.add_select(). You can provide a name/parameters pair, where name is a string that refers to a function available in either dabstract.dataset.xval or os.environ[“dabstract_CUSTOM_DIR”] / dataset / xval.py. The former is a built-in xval function, while the latter lets you add a custom function, which might be added to dabstract later on if validated. Another option is to provide the function directly through ‘name’. Finally, it can also save your xval configuration, such that it is identical to the last experiment or, depending on where you save it, shared between different experiments.
dabstract already has a set of built-in crossvalidation functions in dabstract.dataset.xval, such that one can simply do:
$ self.set_xval(group_random_kfold, parameters=dict(folds=4, val_frac=1/3, group_key='group'))
for random crossvalidation with a group constraint, and:
$ self.set_xval(sequential_kfold, parameters=dict(folds=4, val_frac=1/3, group_key='group'))
for sequential crossvalidation with a group constraint, and:
$ self.set_xval(stratified_kfold, parameters=dict(folds=4, val_frac=1/3))
for stratified crossvalidation, and:
$ self.set_xval(stratified_kfold, parameters=dict(folds=4, val_frac=1/3))
for random crossvalidation.
- Parameters
- name: Callable/xval_func/str/List[int]/np.ndarray
xval defined as a str (translated to a function internally) or as a function
- parameters: dict
additional parameters to initialise the function/class in case name is a str
- save_path: str
filepath to where to pickle the xval folds
- overwrite: bool
overwrite the saved file
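To make the idea of folds concrete, the sketch below builds contiguous train/test splits by hand. `toy_sequential_folds` is a hypothetical function and not the built-in sequential_kfold: it ignores val_frac and group constraints, and the dictionary layout it returns is an assumption chosen for readability.

```python
def toy_sequential_folds(n_examples, folds):
    """Split indices 0..n_examples-1 into contiguous test blocks, one per fold."""
    bounds = [f * n_examples // folds for f in range(folds + 1)]
    xval = {"train": [], "test": []}
    for f in range(folds):
        test = list(range(bounds[f], bounds[f + 1]))
        held_out = set(test)
        train = [i for i in range(n_examples) if i not in held_out]
        xval["test"].append(test)
        xval["train"].append(train)
    return xval


xval = toy_sequential_folds(n_examples=8, folds=4)
print(len(xval["test"]))   # 4
print(xval["test"][1])     # [2, 3]
print(xval["train"][1])    # [0, 1, 4, 5, 6, 7]

# Picking the examples of one fold, similar in spirit to
# get_xval_set(set='test', fold=1).
data = ["a", "b", "c", "d", "e", "f", "g", "h"]
test_subset = [data[i] for i in xval["test"][1]]
print(test_subset)         # ['c', 'd']
```

Storing the folds as index lists is what makes it possible to pickle them and reuse the exact same split in a later experiment.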
-
summary() → None
Print a dataset summary
-
unpack(keys: List[str]) → dabstract.abstract.abstract.UnpackAbstract
Unpack the dictionary into a sequence. This function returns a dataset that, when indexed, returns a list containing the items of ‘keys’ in that order.
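The behaviour of an unpacked dataset can be imitated with a few lines. `UnpackedView` is a hypothetical stand-in, not UnpackAbstract itself; it only shows that indexing yields a list whose order follows the requested keys.

```python
class UnpackedView:
    """Hypothetical view that returns the items of `keys` as a list, in order."""

    def __init__(self, data, keys):
        self._data = data
        self._keys = list(keys)

    def __len__(self):
        return len(self._data[self._keys[0]])

    def __getitem__(self, index):
        return [self._data[k][index] for k in self._keys]


toy = {"data": [0.5, 1.5, 2.5], "label": [0, 1, 0], "group": ["x", "y", "x"]}
pairs = UnpackedView(toy, ["data", "label"])
print(pairs[1])   # [1.5, 1]

collected = []
for i in range(len(pairs)):
    x, y = pairs[i]            # convenient (input, target) unpacking in a loop
    collected.append((x, y))
```

This list form is what training loops that expect (input, target) pairs typically consume, whereas the dictionary form is easier to inspect.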