Provide default options for chunking of datasets (issue #635) #636


Merged

merged 79 commits into main on Apr 30, 2025

Conversation


@ehennestad (Collaborator) commented Nov 23, 2024

Fix #635

Motivation

Provide better chunking options for files in cloud storage

How to test the behavior?

To be determined

Todo

  • Add version and storage format to instances
  • Support kilobytes, megabytes, and gigabytes in the target_chunk_size_unit property (see the unit-conversion sketch after this list)
  • Remove level from compression, and keep parameters
  • Improve testing to verify that all chunking and compression parameters are set correctly
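
For illustration, the unit conversion could look roughly like the sketch below (hypothetical helper; the actual +io/+config/+internal/getTargetChunkSizeInBytes.m may differ, e.g. in whether units are decimal or binary multiples):

    function sizeInBytes = targetChunkSizeInBytes(value, unit)
        % Convert a target chunk size given in a named unit to bytes.
        % Sketch only: assumes decimal (SI) multipliers.
        switch lower(unit)
            case 'bytes',     multiplier = 1;
            case 'kilobytes', multiplier = 1e3;
            case 'megabytes', multiplier = 1e6;
            case 'gigabytes', multiplier = 1e9;
            otherwise, error('Unknown unit: %s', unit)
        end
        sizeInBytes = value * multiplier;
    end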

Checklist

  • Have you ensured the PR description clearly describes the problem and solutions?
  • Have you checked to ensure that there aren't other open or previously closed Pull Requests for the same change?
  • If this PR fixes an issue, is the first line of the PR description fix #XX where XX is the issue number?

@ehennestad (Collaborator, Author) commented Nov 23, 2024

Some open questions:

  • Should each possible property / dataset name be present in the configuration (chunk_params.json)? Or should there be one default that applies to all datasets, with overrides for specific datasets?
  • Should each dataset be a candidate for chunking, or only datasets like "data" and "timestamps"? Any others?
  • Should chunking be specified for datasets that are smaller than the chunk size, i.e., when chunkSize == maxSize?
  • Should chunk_dimensions be specified for each set of dimension options, similar to how it is done in the nwb schema?

See e.g. https://github.com/NeurodataWithoutBorders/nwb-schema/blob/473fcc41e871288767cfb37d83315cca7469b9d1/core/nwb.base.yaml#L100-L110

dims:
    - - num_times
    - - num_times
      - num_DIM2
    - - num_times
      - num_DIM2
      - num_DIM3
    - - num_times
      - num_DIM2
      - num_DIM3
      - num_DIM4

@bendichter

@ehennestad ehennestad changed the title 635 Provide default options for chunking of datasets Provide default options for chunking of datasets (Issue #635) Nov 23, 2024
@ehennestad ehennestad changed the title Provide default options for chunking of datasets (Issue #635) Provide default options for chunking of datasets (issue #635) Nov 23, 2024
@ehennestad (Collaborator, Author) commented

Current implementation for schema/definition:

    "default_dataset_configuration": {
        "layout": "chunked",
        "target_chunk_size": {
            "value": 10000000,
            "unit": "bytes"
        },
        "chunk_dimensions": [null],
        "compression": {
            "algorithm": "gzip",
            "level": 6,
            "parameters": {},
            "prefilters": ["shuffle"]
        }
    },
    specific_dataset_overrides...
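
For intuition, concrete chunk dimensions could be derived from target_chunk_size roughly as in the sketch below (a minimal sketch assuming the first dimension is the flexible one; the real +io/+config/+internal/computeChunkSizeFromConfig.m also handles chunk_dimensions constraints):

    function chunkSize = computeChunkSize(dataSize, elementSizeBytes, targetBytes)
        % Sketch: spread the target chunk size over the first (flexible)
        % dimension, keeping all trailing dimensions whole.
        bytesPerSlice = prod(dataSize(2:end)) * elementSizeBytes;
        numRows = max(1, floor(targetBytes / bytesPerSlice));
        chunkSize = [min(numRows, dataSize(1)), dataSize(2:end)];
    end

E.g. a double-precision (8-byte) dataset of size [1e6, 32] with a 10 MB target would give chunks of [39062, 32].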

Open questions (@bendichter): Should chunk_dimensions be specified for each set of dimension options, similar to how it is done in the nwb schema (see the dims example above)?

What would be the syntax?
Example (json):

"chunk_dimensions": [ [null], [null, 32], [null, 32, max] ]

@ehennestad (Collaborator, Author) commented

Need to determine what to do with nested data types. For example, a RoiResponseSeries can be part of a Fluorescence or a DfOverF group. Should the spec support defining configuration for nested neurodata types depending on where they are located, e.g.:

"Fluorescence": {
        "RoiResponseSeries": {
            "data": {
                "chunk_dimensions": [null, 16]}
        }
    },
    "DfOverF": {
        "RoiResponseSeries": {
            "data": {
                "chunk_dimensions": [null, 32]}
        }
    }
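
If that kind of location-dependent configuration is supported, lookup could fall back from the most specific key to the default, as in this sketch (flattened key names are illustrative, not the implemented format):

    % Try configuration keys from most to least specific.
    keysToTry = { ...
        'Fluorescence_RoiResponseSeries_data', ...
        'RoiResponseSeries_data', ...
        'default_dataset_configuration'};
    for i = 1:numel(keysToTry)
        if isfield(config, keysToTry{i})
            datasetConfig = config.(keysToTry{i});
            break
        end
    end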

codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 97.43590% with 7 lines in your changes missing coverage. Please review.

Project coverage is 94.99%. Comparing base (54679f6) to head (0f04455).
Report is 1 commit behind head on main.

Files with missing lines                               | Patch % | Lines
...+config/+internal/applyCustomMatNWBPropertyNames.m | 94.28%  | 2 Missing ⚠️
+schemes/listDatasetsOfNeurodataType.m                 | 92.59%  | 2 Missing ⚠️
+io/+config/+internal/computeChunkSizeFromConfig.m     | 98.18%  | 1 Missing ⚠️
+io/+config/+internal/getTargetChunkSizeInBytes.m      | 92.30%  | 1 Missing ⚠️
+io/+config/applyDatasetConfiguration.m                | 97.50%  | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #636      +/-   ##
==========================================
+ Coverage   94.86%   94.99%   +0.13%     
==========================================
  Files         146      160      +14     
  Lines        5565     5838     +273     
==========================================
+ Hits         5279     5546     +267     
- Misses        286      292       +6     

☔ View full report in Codecov by Sentry.
Add function that ensures the dataset configuration conforms to MatNWB-specific implementation details
Update test to check that keys for dataset configuration of Dataset-based neurodata types are renamed by appending _data, because MatNWB adds a data property to all Dataset-based classes
Remove unused code and unreachable error
@ehennestad ehennestad marked this pull request as ready for review March 18, 2025 12:58
Specify "level" as a property in the parameters object
Add warning if chunk target size is exceeded due to conflicting chunk size specifications
Suppress warning that has been added and will be triggered by some tests in this class
Suppress warning that has been added and will be triggered by some tests in this class
Testing of chunkDimensionConstraints is handled in ComputeChunkSizeFromConfigTest
@ehennestad ehennestad merged commit 06fe7f9 into main Apr 30, 2025
15 checks passed
Development

Successfully merging this pull request may close these issues:

[Feature]: Add default and customizable configuration for dataset chunking (#635)