Provide default options for chunking of datasets (issue #635) #636


Merged

merged 79 commits into main on Apr 30, 2025

Conversation


@ehennestad (Collaborator) commented Nov 23, 2024

Fix #635

Motivation

Provide better chunking options for files in cloud storage

How to test the behavior?

To be determined

Todo

  • Add version and storage format to instances
  • Support kilobytes, megabytes, and gigabytes in the target_chunk_size_unit property (see the unit-conversion sketch after this list)
  • Remove level from compression, and keep parameters
  • Improve testing to verify that all chunking and compression parameters are set correctly
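
For illustration, the unit conversion could look roughly like the sketch below (hypothetical helper; the actual +io/+config/+internal/getTargetChunkSizeInBytes.m may differ, e.g. in whether units are decimal or binary multiples):

    function sizeInBytes = targetChunkSizeInBytes(value, unit)
        % Convert a target chunk size given in a named unit to bytes.
        % Sketch only: assumes decimal (SI) multipliers.
        switch lower(unit)
            case 'bytes',     multiplier = 1;
            case 'kilobytes', multiplier = 1e3;
            case 'megabytes', multiplier = 1e6;
            case 'gigabytes', multiplier = 1e9;
            otherwise, error('Unknown unit: %s', unit)
        end
        sizeInBytes = value * multiplier;
    end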

Checklist

  • Have you ensured the PR description clearly describes the problem and solutions?
  • Have you checked to ensure that there aren't other open or previously closed Pull Requests for the same change?
  • If this PR fixes an issue, is the first line of the PR description fix #XX where XX is the issue number?

@ehennestad (Collaborator, Author) commented Nov 23, 2024

Some open questions:

  • Should each possible property / dataset name be present in the configuration (chunk_params.json)? Or should there be one default that applies to all datasets, with overrides for specific datasets?
  • Should each dataset be a candidate for chunking, or only datasets like "data" and "timestamps"? Any others?
  • Should chunking be specified for datasets that are smaller than the chunk size, i.e., when chunkSize == maxSize?
  • Should chunk_dimensions be specified for each set of dimension options, similar to how it is done in the nwb schema?

See e.g. https://github.com/NeurodataWithoutBorders/nwb-schema/blob/473fcc41e871288767cfb37d83315cca7469b9d1/core/nwb.base.yaml#L100-L110

dims:
    - - num_times
    - - num_times
      - num_DIM2
    - - num_times
      - num_DIM2
      - num_DIM3
    - - num_times
      - num_DIM2
      - num_DIM3
      - num_DIM4

@bendichter

@ehennestad ehennestad changed the title 635 Provide default options for chunking of datasets Provide default options for chunking of datasets (Issue #635) Nov 23, 2024
@ehennestad ehennestad changed the title Provide default options for chunking of datasets (Issue #635) Provide default options for chunking of datasets (issue #635) Nov 23, 2024
@ehennestad (Collaborator, Author) commented

Current implementation for schema/definition:

    "default_dataset_configuration": {
        "layout": "chunked",
        "target_chunk_size": {
            "value": 10000000,
            "unit": "bytes"
        },
        "chunk_dimensions": [null],
        "compression": {
            "algorithm": "gzip",
            "level": 6,
            "parameters": {},
            "prefilters": ["shuffle"]
        }
    },
    specific_dataset_overrides...
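
For intuition, concrete chunk dimensions could be derived from target_chunk_size roughly as in the sketch below (a minimal sketch assuming the first dimension is the flexible one; the real +io/+config/+internal/computeChunkSizeFromConfig.m also handles chunk_dimensions constraints):

    function chunkSize = computeChunkSize(dataSize, elementSizeBytes, targetBytes)
        % Sketch: spread the target chunk size over the first (flexible)
        % dimension, keeping all trailing dimensions whole.
        bytesPerSlice = prod(dataSize(2:end)) * elementSizeBytes;
        numRows = max(1, floor(targetBytes / bytesPerSlice));
        chunkSize = [min(numRows, dataSize(1)), dataSize(2:end)];
    end

E.g. a double-precision (8-byte) dataset of size [1e6, 32] with a 10 MB target would give chunks of [39062, 32].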

Open questions (@bendichter): Should chunk_dimensions be specified for each set of dimension options, similar to how it is done in the nwb schema (see the dims example above)?

What would be the syntax?
Example (json):

"chunk_dimensions": [ [null], [null, 32], [null, 32, max] ]

@ehennestad (Collaborator, Author) commented

Need to determine what to do with nested data types. For example, a RoiResponseSeries can be part of a Fluorescence or a DfOverF group. Should the spec support defining configuration for nested neurodata types depending on where they are located, e.g.:

"Fluorescence": {
        "RoiResponseSeries": {
            "data": {
                "chunk_dimensions": [null, 16]}
        }
    },
    "DfOverF": {
        "RoiResponseSeries": {
            "data": {
                "chunk_dimensions": [null, 32]}
        }
    }
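
If that kind of location-dependent configuration is supported, lookup could fall back from the most specific key to the default, as in this sketch (flattened key names are illustrative, not the implemented format):

    % Try configuration keys from most to least specific.
    keysToTry = { ...
        'Fluorescence_RoiResponseSeries_data', ...
        'RoiResponseSeries_data', ...
        'default_dataset_configuration'};
    for i = 1:numel(keysToTry)
        if isfield(config, keysToTry{i})
            datasetConfig = config.(keysToTry{i});
            break
        end
    end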

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 97.43590% with 7 lines in your changes missing coverage. Please review.

Project coverage is 94.99%. Comparing base (54679f6) to head (0f04455).
Report is 1 commit behind head on main.

Files with missing lines                               | Patch % | Lines
...+config/+internal/applyCustomMatNWBPropertyNames.m | 94.28%  | 2 Missing ⚠️
+schemes/listDatasetsOfNeurodataType.m                 | 92.59%  | 2 Missing ⚠️
+io/+config/+internal/computeChunkSizeFromConfig.m     | 98.18%  | 1 Missing ⚠️
+io/+config/+internal/getTargetChunkSizeInBytes.m      | 92.30%  | 1 Missing ⚠️
+io/+config/applyDatasetConfiguration.m                | 97.50%  | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #636      +/-   ##
==========================================
+ Coverage   94.86%   94.99%   +0.13%     
==========================================
  Files         146      160      +14     
  Lines        5565     5838     +273     
==========================================
+ Hits         5279     5546     +267     
- Misses        286      292       +6     

☔ View full report in Codecov by Sentry.
Add function that ensures the dataset configuration conforms to MatNWB-specific implementation details
Update test to check that keys for dataset configuration of Dataset-based neurodata types are renamed by appending _data, because MatNWB adds a data property to all Dataset-based classes
Remove unused code and unreachable error
@ehennestad ehennestad marked this pull request as ready for review March 18, 2025 12:58
Specify "level" as a property in the parameters object
Add warning if chunk target size is exceeded due to conflicting chunk size specifications
Suppress warning that has been added and will be triggered by some tests in this class
Suppress warning that has been added and will be triggered by some tests in this class
Testing of chunkDimensionConstraints is handled in ComputeChunkSizeFromConfigTest
@ehennestad ehennestad merged commit 06fe7f9 into main Apr 30, 2025
15 checks passed
Development

Successfully merging this pull request may close these issues:

[Feature]: Add default and customizable configuration for dataset chunking (#635)