Description
IO read/write functions typically follow this signature:

```python
from pathlib import Path
from typing import Any, Union

def load(path: Union[Path, str], **kwargs) -> Any:
    ...

def save(obj: Any, path: Union[Path, str], **kwargs) -> None:
    ...
```
Creating custom datasets should ideally be as easy as supplying `load`/`save` function(s) that follow this signature.
Context
`kedro` supports an extensive range of datasets, but the list is not exhaustive. Popular libraries used by relatively niche communities, like `xarray` and `arviz`, aren't currently supported.
Beyond these examples, this feature would make it easier to unofficially add support for more obscure datasets.
Initially, I was looking to implement something like this and asked in the Discord chat if this pattern made sense.
Then, @datajoely suggested I open a feature request.
Possible Implementation
We can consider a `Dataset` class factory, maybe a `GenericDataset` with a class method `.create_custom_dataset(name: str, load: Callable, save: Callable)`.
Usage would look something like `xarrayNetCDF = GenericDataset("xarrayNetCDF", xarray.open_dataset, lambda x, path, **kwargs: x.to_netcdf(path, **kwargs))`.
Entries can be added to the data catalog yaml just as with any other custom dataset implementation.
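For illustration, assuming the generated class is importable from a project module (the module path, entry name, and file path here are hypothetical), a catalog entry might look like:

```yaml
weather:
  type: my_project.datasets.xarrayNetCDF
  filepath: data/01_raw/weather.nc
  load_args:
    engine: netcdf4
```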
Possible Alternatives
- `LambdaDataset` is very similar, but the `load` and `save` callables are hard-coded in the implementation and cannot be parameterized in the data catalog, as far as I'm aware (see the sketch after this list)
- Subclassing `AbstractDataset` is an option, but this feature request seeks to reduce the boilerplate involved in defining new datasets
- Adding xarray support #1346 officially requires implementing nuances like cloud file storage, partitioned datasets, lazy loading, etc.