Commit 06fe7f9
Provide default options for chunking of datasets (issue #635) (#636)
* Add initial chunk parameters and a function to read them from file
* First draft of applying chunk configurations
* Minor fixes
* Create listDatasetsOfNeurodataType.m
* Add new template for dataset configuration json
* Update applyChunkConfiguration and dependent functions to work with new template
* Remove unused condition in applyChunkConfiguration
* Update getParentType.m
* Add different dataset configuration profiles
* Consistently name functions and code using datasetConfiguration instead of chunkConfiguration
* Test-related fixes
* Simplify readDatasetConfiguration: replace switch block with formatted string
* Create applyCustomMatNWBPropertyNames.m: function that ensures the dataset configuration conforms with MatNWB-specific implementation details
* Update configuration/archive_dataset_configuration.json (Co-authored-by: Ben Dichter <[email protected]>)
* Add docstring for function applyCustomMatNWBPropertyNames.m
* Update listDatasetsOfNeurodataType.m: resolve name for dataset if the name field is missing
* Fix chunk size computation: rename flexible dimension to "flex"; use product of fixed dimensions to compute size of flex dimension
* Update readDatasetConfiguration.m: add function to update dataset configuration to conform with the MatNWB-specific implementation (i.e., Dataset types, like VectorData, having a data property)
* Update resolveDatasetConfigForDataType.m
* Change dataset configuration json, plus necessary changes to functions
* Clean up functions and add tests
* Add utility method for retrieving relative path from classname
* Add utility function for finding the directory where neurodata type classes are generated
* Update loadNamespace.m: make generatedTypesDirectory optional; if not provided, it is inferred from MATLAB's search path. Improve docstring, variable names, and error message
* Update findRootDirectoryForGeneratedTypes.m: throw error from caller instead
* Update Namespace.m: add property type constraints because name (and version?) are expected to be character types
* Workaround for MATLAB bug with `which ... -all`
* Add unit tests for schemes.utility.findRootDirectoryForGeneratedTypes
* Update listDatasetsOfNeurodataType.m: use updated version of schemes.loadNamespace. Instead of assuming types are located in the matnwb root directory, the function finds the location of types from MATLAB's search path
* Update DatasetConfigurationTest.m
* Remove DatasetConfigurationTest and clean up ApplyDatasetConfigurationTest
* Update reconfigureDataPipe.m
* More test cleanup
* Update ecephys livescript to conform with NWBInspector (@nwbinspector-PR575)
* Create dataset_configuration_schema.json
* Fix dataset configuration schema bugs and modify instances to pass validation
* Update readDatasetConfiguration to extract datasetSpecifications after schema/instance update
* Update cloud_dataset_configuration.json
* Remove comments
* Remove debug statements and unreachable cases
* Update ApplyDatasetConfigurationTest.m: check that keys for dataset configuration of Dataset-based neurodata types are renamed by appending _data, because MatNWB adds a data property to all Dataset-based classes
* Update applyCustomMatNWBPropertyNames.m: remove unused code and an unreachable error
* Add values for storageFormat and schemaVersion in the configuration instances
* Update dataset_configuration_schema.json: reorder properties
* Update schema and code to support setting target_chunk_size_unit to bytes, kiB, MiB or GiB
* Add tag to test that is not supported in older releases (<R2022a)
* Make mustHaveField a namespaced function, as it is now used by multiple functions
* Add unit test for "target_chunk_size_unit" config property
* Remove level from compression and keep parameters: specify "level" as a property in the parameters object
* Rename (compression) "algorithm" to "method"
* Update computeChunkSizeFromConfig.m: add warning if chunk target size is exceeded due to conflicting chunk size specifications
* Add unit tests verifying that specific parameters are applied to the DataPipe
* Update ApplyDatasetConfigurationTest.m: suppress newly added warning that will be triggered by some tests in this class
* Clean trailing whitespace
* Update ComputeChunkSizeFromConfigTest.m: suppress newly added warning that will be triggered by some tests in this class
* Remove redundant test: testing of chunkDimensionConstraints is handled in ComputeChunkSizeFromConfigTest
* Update ApplyDatasetConfigurationTest.m: rename and move test

---------

Co-authored-by: Ben Dichter <[email protected]>
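For orientation, the per-dataset configuration the functions below consume is JSON. The sketch below is a hypothetical instance assembled from the field names referenced in the diffs (a chunking object with target_chunk_size, target_chunk_size_unit, and strategy_by_rank; a compression object with method, parameters, and prefilters); the numeric values are invented, and the actual profiles shipped in the configuration folder may differ:

    {
      "chunking": {
        "target_chunk_size": 1,
        "target_chunk_size_unit": "MiB",
        "strategy_by_rank": {
          "1": ["flex"],
          "2": ["flex", "max"],
          "3": ["flex", "flex", "max"]
        }
      },
      "compression": {
        "method": "deflate",
        "parameters": { "level": 3 },
        "prefilters": ["shuffle"]
      }
    }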
1 parent (54679f6) · commit 06fe7f9

22 files changed: +1699 −0 lines
File: applyCustomMatNWBPropertyNames.m (new): 136 additions & 0 deletions
function datasetConfiguration = applyCustomMatNWBPropertyNames(datasetConfiguration)
% applyCustomMatNWBPropertyNames - Processes a dataset configuration structure to apply custom MatNWB property names.
%
%   datasetConfiguration = applyCustomMatNWBPropertyNames(datasetConfiguration)
%
%   This function iterates through each field of the input structure and checks
%   if the field corresponds to a known NWB type (using a mapping from short
%   names to fully qualified class names). For each recognized field:
%
%     - It retrieves the full class name and determines its superclasses.
%     - If the class is a subclass of "types.untyped.MetaClass":
%         * If it is also a "types.untyped.GroupClass", the function recursively
%           processes the subgroup configuration.
%         * If it is a "types.untyped.DatasetClass", it wraps the existing
%           configuration in a structure with a "data" property.
%     - If the field is not associated with a recognized NWB type, it remains
%       unchanged.
%
%   Input:
%     datasetConfiguration - A 1x1 struct containing dataset configuration data.
%
%   Output:
%     datasetConfiguration - The updated configuration structure with custom
%                            property names.

    arguments
        datasetConfiguration (1,1) struct
    end

    classNameMap = getNwbTypesClassnameMap();

    fields = fieldnames(datasetConfiguration);

    for i = 1:numel(fields)
        thisField = fields{i};

        % Split off the last part if the field name is "nested"
        if contains(thisField, '_')
            shortName = extractAfter(thisField, '_');
        else
            shortName = thisField;
        end

        if ~isKey(classNameMap, shortName)
            continue % Not a neurodata / nwb type
        end

        fullClassName = classNameMap(shortName);
        superclassNames = superclasses(fullClassName);

        if any(strcmp(superclassNames, "types.untyped.MetaClass"))
            thisSubConfig = datasetConfiguration.(thisField);
            if any(strcmp(superclassNames, "types.untyped.GroupClass"))
                % Todo: Remove this? Nested specs are currently not supported.
            elseif any(strcmp(superclassNames, "types.untyped.DatasetClass"))
                % Rename the field to include the _data suffix
                newFieldName = sprintf('%s_data', thisField);
                datasetConfiguration.(newFieldName) = thisSubConfig;
                datasetConfiguration = rmfield(datasetConfiguration, thisField);
            end
        else
            % For non-NWB types, leave the field unmodified.
        end
    end
end

function ancestorPath = getAncestorPath(initialPath, numSteps)
% getAncestorPath - Get an ancestor directory path.
%
%   ancestorPath = GETANCESTORPATH(initialPath, numSteps)
%
%   Input:
%     initialPath - A string representing the starting file or directory path.
%     numSteps    - A positive integer indicating the number of directory
%                   levels to move up.
%
%   Output:
%     ancestorPath - A string representing the ancestor directory path.

    arguments
        initialPath (1,1) string
        numSteps (1,1) double
    end
    splitPath = split(initialPath, filesep);

    ancestorPath = fullfile(splitPath{1:end-numSteps}); % char output

    % Ensure the path starts with a file separator on Unix systems.
    if isunix && ~startsWith(ancestorPath, filesep)
        ancestorPath = [filesep ancestorPath];
    end
end

function map = getNwbTypesClassnameMap()
% getNwbTypesClassnameMap - Constructs a mapping between NWB type short names
% and their fully qualified class names.
%
%   map = GETNWBTYPESCLASSNAMEMAP()
%
%   The function locates the directory containing NWB type definitions
%   (using the location of 'types.core.NWBFile' as a reference) and searches
%   recursively for all MATLAB class definition files (*.m). It then filters
%   out files in the '+types/+untyped' and '+types/+util' folders.
%
%   Output:
%     map - A mapping object (either a dictionary or containers.Map) where:
%       * Keys   : Short class names (derived from file names without the .m extension).
%       * Values : Fully qualified class names in the format "types.namespace.ClassName".

    typesClassDirectory = getAncestorPath( which('types.core.NWBFile'), 2 );

    % Find all MATLAB class files recursively within the directory.
    L = dir(fullfile(typesClassDirectory, '**', '*.m'));

    % Exclude files from the '+types/+untyped' and '+types/+util' directories.
    ignore = contains({L.folder}, fullfile('+types', '+untyped')) | ...
             contains({L.folder}, fullfile('+types', '+util'));
    L(ignore) = [];

    % Extract namespace and class names from the file paths.
    [~, namespaceNames] = fileparts({L.folder});
    namespaceNames = string( strrep(namespaceNames, '+', '') );
    classNames = string( strrep( {L.name}, '.m', '') );

    % Compose fully qualified class names using the namespace and class name.
    fullClassNames = matnwb.common.composeFullClassName(namespaceNames, classNames);

    % Create a mapping from the short class names to the fully qualified class names.
    try
        map = dictionary(classNames, fullClassNames);
    catch % Fallback for older versions of MATLAB.
        map = containers.Map(classNames, fullClassNames);
    end
end
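Usage sketch (not part of the commit): only fields named after Dataset-based types are renamed. The field names below are invented for illustration; VectorData is a DatasetClass, so its configuration moves under VectorData_data, while a GroupClass-based name such as TimeSeries is currently left untouched.

    % Hypothetical input; requires generated NWB types (types.core.*) on the path
    conf = struct();
    conf.VectorData = struct('compression', struct('method', 'deflate'));
    conf.TimeSeries = struct();  % GroupClass-based: left as-is for now
    conf = applyCustomMatNWBPropertyNames(conf);
    % fieldnames(conf) is now {'TimeSeries'; 'VectorData_data'}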
File: computeChunkSizeFromConfig.m, in the io.config.internal package (new): 118 additions & 0 deletions
function chunkSize = computeChunkSizeFromConfig(A, configuration)
% computeChunkSizeFromConfig - Compute the chunk size for a dataset using the provided configuration.
%   This function determines the chunk size for a dataset based on the chunk
%   constraints/strategies provided in the configuration structure. It adjusts
%   dimensions according to rules: 'max' uses the dataset size, fixed numbers
%   use their value, and 'flex' calculates the dimension size to approximate
%   the target chunk size in bytes.
%
%   Inputs:
%     A - A numeric dataset whose chunk size is to be computed.
%     configuration (1,1) struct - Struct defining the chunking strategy for
%       different ranks of a dataset.
%
%   Output:
%     chunkSize - A vector specifying the chunk size for each dimension of A.

    arguments
        A {mustBeNumeric}
        configuration (1,1) struct ...
            {matnwb.common.mustHaveField(configuration, "strategy_by_rank", "target_chunk_size", "target_chunk_size_unit")}
    end

    % Get dataset size
    dataSize = size(A);
    numDimensions = numel(dataSize);

    % NWB / H5 supports true 1D vectors. If the data is a vector, represent
    % dataSize as a scalar for computation of chunkSize.
    if numDimensions == 2 && any(dataSize == 1)
        numDimensions = 1;
        originalDataSize = dataSize;
        dataSize(dataSize == 1) = [];
    end

    % Retrieve constraints for current rank.
    strategy = configuration.strategy_by_rank;
    rankFieldName = sprintf('x%d', numDimensions); % Adjust for quirk in MATLAB where the fieldname of a numeric value is prefixed with "x" when reading from json
    if ~isfield(strategy, rankFieldName)
        error('NWB:ComputeChunkSizeFromConfig:MatchingRankNotFound', ...
            'Configuration for %d dimensions is missing.', numDimensions)
    end
    constraints = strategy.(rankFieldName);
    assert(iscell(constraints), ...
        'Expected constraints for dimensions to be provided as a cell array, got %s.', class(constraints))

    % Determine the target number of array elements per chunk.
    targetChunkSizeBytes = io.config.internal.getTargetChunkSizeInBytes(configuration);
    elementSizeBytes = io.config.internal.getDataByteSize(A) / numel(A); % bytes per element
    targetNumElements = targetChunkSizeBytes / elementSizeBytes; % Per chunk

    % Preallocate arrays.
    chunkSize = zeros(1, numDimensions);
    isFlexDim = false(1, numDimensions);

    isFlex = @(x) ischar(x) && strcmp(x, 'flex');
    isMax = @(x) ischar(x) && strcmp(x, 'max');

    % Calculate chunk size for each dimension
    for dim = 1:numDimensions
        if dim > numel(constraints)
            % Use full size for dimensions beyond the specification
            chunkSize(dim) = dataSize(dim);
        else
            thisDimensionConstraint = constraints{dim};
            if isFlex(thisDimensionConstraint)
                isFlexDim(dim) = true;
                % Leave chunkSize(dim) to be determined.
            elseif isMax(thisDimensionConstraint)
                chunkSize(dim) = dataSize(dim);
            elseif isnumeric(thisDimensionConstraint)
                % thisDimensionConstraint is an upper bound
                chunkSize(dim) = min([thisDimensionConstraint, dataSize(dim)]);
            else
                error('NWB:ComputeChunkSizeFromConfig:InvalidConstraint', ...
                    'Invalid chunk constraint for dimension %d.', dim);
            end
        end
    end

    % Compute the product of fixed dimensions (number of elements per chunk).
    if any(~isFlexDim)
        fixedProduct = prod(chunkSize(~isFlexDim));
    else
        fixedProduct = 1;
    end

    % For flex dimensions, compute the remaining number of elements
    % and allocate them equally in the exponent space.
    nFlex = sum(isFlexDim);
    if nFlex > 0
        remainingElements = targetNumElements / fixedProduct;
        % Ensure remainingElements is at least 1.
        remainingElements = max(remainingElements, 1);
        % Compute an equal allocation factor for each flex dimension.
        elementsPerFlexDimension = nthroot(remainingElements, nFlex);
        % Assign computed chunk size for each flex dimension.
        for dim = find(isFlexDim)
            proposedSize = max(1, round(elementsPerFlexDimension));
            % Do not exceed the full dimension size.
            chunkSize(dim) = min(proposedSize, dataSize(dim));
        end
    end

    % Ensure chunk size does not exceed dataset size in any dimension
    chunkSize = min(chunkSize, dataSize);

    if numDimensions == 1
        originalDataSize(originalDataSize ~= 1) = chunkSize;
        chunkSize = originalDataSize;
    end

    actualBytesPerChunk = prod(chunkSize) * elementSizeBytes;
    if actualBytesPerChunk > targetChunkSizeBytes
        warning('NWB:ComputeChunkSizeFromConfig:TargetSizeExceeded', ...
            ['The provided dataset configuration produces chunks that have a ', ...
            'larger bytesize than the specified target chunk size.'])
    end
end
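Worked sketch (not part of the commit) of how the rank lookup and the 'flex' computation interact, assuming a configuration decoded from JSON so the rank key carries MATLAB's 'x' prefix; the numbers are invented:

    % Hypothetical configuration: 1 MiB target, rank-2 strategy ['flex', 'max']
    cfg = struct( ...
        'target_chunk_size', 1, ...
        'target_chunk_size_unit', 'MiB', ...
        'strategy_by_rank', struct('x2', {{'flex', 'max'}}));

    A = zeros(10000, 64);  % double, 8 bytes per element
    chunkSize = io.config.internal.computeChunkSizeFromConfig(A, cfg);
    % 'max' fixes dim 2 at 64; 'flex' then targets
    % (1 MiB / 8 bytes) / 64 = 2048 rows, so chunkSize is roughly [2048 64].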
File: configureDataPipeFromData.m (new): 56 additions & 0 deletions
function dataPipe = configureDataPipeFromData(numericData, datasetConfig)
% configureDataPipeFromData - Configure a DataPipe from numeric data and dataset configuration

    import io.config.internal.computeChunkSizeFromConfig
    import types.untyped.datapipe.properties.DynamicFilter

    chunkSize = computeChunkSizeFromConfig(numericData, datasetConfig.chunking);
    maxSize = size(numericData);

    dataPipeArgs = {...
        "data", numericData, ...
        "maxSize", maxSize, ...
        "chunkSize", chunkSize };

    hasShuffle = ~isempty(datasetConfig.compression.prefilters) ...
        && contains(datasetConfig.compression.prefilters, 'shuffle');

    % Check if the configured compression method is DEFLATE (gzip)
    if strcmpi(datasetConfig.compression.method, "deflate") ...
            || strcmpi(datasetConfig.compression.method, "gzip")
        if isempty(datasetConfig.compression.parameters) ...
                || ~isfield(datasetConfig.compression.parameters, 'level')
            defaultCompressionLevel = 3;
            warning('NWB:DataPipeConfiguration:LevelParameterNotSet', ...
                ['The dataset configuration does not contain a value for ', ...
                'the "level" parameter of the Deflate filter. The default ', ...
                'value %d will be used.'], defaultCompressionLevel)
            compressionLevel = defaultCompressionLevel;
        else
            compressionLevel = datasetConfig.compression.parameters.level;
        end
        % Use standard compression filters
        dataPipeArgs = [ dataPipeArgs, ...
            {'hasShuffle', hasShuffle, ...
            'compressionLevel', compressionLevel} ...
            ];
    else
        % Create property list of custom filters for dataset creation
        parameters = struct2cell(datasetConfig.compression.parameters);
        compressionFilter = DynamicFilter( ...
            datasetConfig.compression.method, ...
            parameters{:} );

        if hasShuffle
            shuffleFilter = types.untyped.datapipe.properties.Shuffle();
            filters = [shuffleFilter compressionFilter];
        else
            filters = compressionFilter;
        end
        dataPipeArgs = [ dataPipeArgs, ...
            {'filters', filters} ];
    end

    % Create the datapipe.
    dataPipe = types.untyped.DataPipe( dataPipeArgs{:} );
end
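Usage sketch (not part of the commit). The datasetConfig shape mirrors the JSON instance sketched near the top of this page; the package location of this function is an assumption based on its helpers living in io.config.internal, and all values are invented:

    % Hypothetical configuration for the standard deflate + shuffle path
    datasetConfig = struct();
    datasetConfig.chunking = struct( ...
        'target_chunk_size', 1, 'target_chunk_size_unit', 'MiB', ...
        'strategy_by_rank', struct('x2', {{'flex', 'max'}}));
    datasetConfig.compression = struct( ...
        'method', 'deflate', ...
        'parameters', struct('level', 3), ...
        'prefilters', {{'shuffle'}});

    data = rand(10000, 64);
    % Assumed package path (same package as the helpers above):
    dataPipe = io.config.internal.configureDataPipeFromData(data, datasetConfig);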
File: flipChunkDimensions.m, in the io.config.internal package (new): 40 additions & 0 deletions
function configuration = flipChunkDimensions(configuration)
%FLIPCHUNKDIMENSIONS Reverses the chunk dimension arrays in a structure.
%
%   configuration = flipChunkDimensions(configuration) locates the
%   strategy_by_rank substructure in a configuration structure and reverses
%   the constraint array for each rank field.
%
%   This is needed because MatNWB dimensions are flipped upon export to
%   HDF5 files, while the specification is defined based on the dimension
%   ordering in NWB schemas / HDF5.

    if isstruct(configuration)
        fields = fieldnames(configuration);
        for i = 1:length(fields)
            fieldName = fields{i};
            if strcmp(fieldName, 'strategy_by_rank')
                % Process the chunk dimensions field
                configuration.(fieldName) = ...
                    processChunkDimensions(configuration.(fieldName));
            else
                % Otherwise, recursively process the field
                configuration.(fieldName) = ...
                    io.config.internal.flipChunkDimensions(configuration.(fieldName));
            end
        end
    else
        % Pass: non-struct values are left unchanged.
    end
end

function cd = processChunkDimensions(cd)
% Process the chunk dimensions field.
    rankFieldNames = fieldnames(cd);

    for i = 1:numel(rankFieldNames)
        thisRank = rankFieldNames{i};
        cd.(thisRank) = flipud(cd.(thisRank));
    end
end
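Sketch (not part of the commit): jsondecode returns JSON arrays as column cell arrays, which flipud reverses. Values are invented for illustration:

    cfg.chunking.strategy_by_rank = struct('x2', {{'flex'; 'max'}});
    cfg = io.config.internal.flipChunkDimensions(cfg);
    % cfg.chunking.strategy_by_rank.x2 is now {'max'; 'flex'}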
File: getDataByteSize.m, in the io.config.internal package (new): 7 additions & 0 deletions
function byteSize = getDataByteSize(data)
% getDataByteSize - Get the byte size of a numeric array
    dataType = class(data);
    bytesPerDataPoint = io.getMatTypeSize(dataType);

    byteSize = numel(data) .* bytesPerDataPoint;
end
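Quick arithmetic check (not part of the commit): a double occupies 8 bytes per element, so:

    % 100*3 elements * 8 bytes each = 2400 bytes
    byteSize = io.config.internal.getDataByteSize(zeros(100, 3));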
