Improve write performance of shards #2977
Conversation
…fer concatenation
@balbasty thank you so much for this work. I think your detective work here will be very much appreciated. general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of shards can be known prior to fetching anything.

I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table). Otherwise compressed chunks would take more space than needed. The format is either …

What I mean is that, when we get the index table, we also get the size of each compressed chunk. And when we are fetching chunks from a shard, we always know in advance which chunks we need. So it seems like the combination of the shard index + the set of requested chunks is sufficient to specify the required memory for compressed chunks exactly. Does this check out?
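If that reasoning holds, the required allocation could be computed straight from the decoded shard index before fetching any chunk data. A minimal sketch, assuming the index is a uint64 array of shape `chunks_per_shard + (2,)` holding `(offset, nbytes)` pairs with `2**64 - 1` marking missing chunks; the helper name is hypothetical:

```python
import numpy as np

def required_nbytes(shard_index: np.ndarray, requested: list[tuple[int, ...]]) -> int:
    """Sum the compressed sizes of the requested inner chunks of one shard.

    ``shard_index`` is assumed to have shape ``chunks_per_shard + (2,)`` and
    dtype uint64, holding ``(offset, nbytes)`` per chunk; ``2**64 - 1`` marks
    a chunk that was never written.
    """
    missing = 2**64 - 1
    total = 0
    for coords in requested:
        offset, nbytes = shard_index[coords]
        if offset != missing:  # skip chunks that are absent from the shard
            total += int(nbytes)
    return total
```

With the total known up front, a single buffer of exactly that size could in principle be allocated before any chunk bytes arrive.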
The poor write performance of sharded zarrs in the zarr-python implementation is currently a major limiting factor to its adoption by our group. We found that writing shard-by-shard into an empty sharded array is an order of magnitude slower than writing to an unsharded zarr. This is surprising, as writing full shards should only be marginally slower than writing unsharded chunks.

While this 2023 discussion suggests that the latency is caused by the re-generation of the index table, this does not seem to be the case in the latest implementation, which saves all chunks in memory and (properly) waits for all chunks to be available before generating the index table (see `_encode_partial_single`).

Instead, I found that a major cause of the slowdown comes from the implementation of the `Buffer` class, which calls `np.concatenate` every time bytes are added to the buffer, so the already-accumulated bytes are copied on every append and the total copying cost grows quadratically with the number of chunks. As a proof of concept, I have implemented an alternative `DelayedBuffer` class that keeps individual byte chunks in a list, and only concatenates them when needed. On a simple benchmark that uses `512**3` shards, `128**3` chunks and a local store, it reduces the time to write one shard from ~10 sec to ~1 sec, which is on par with the time taken to write the same `512**3` array in an unsharded zarr (~0.9 sec). A sketch reproducing this benchmark is included after the TODO list below.

I am keeping this as a draft for now as it is a hacky proof-of-concept implementation, but I am happy to clean it up if this is found to be a good solution (with guidance on how to implement the delayed buffer in a way that is compatible with the buffer prototype logic, which I don't fully understand). All tests pass except one that checks whether a store receives a `TestBuffer` (as it instead receives a `DelayedBuffer`).
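For context, the gist of the change is roughly the following (a simplified sketch, not the actual `Buffer`/`DelayedBuffer` code in this PR): appending stores a reference to each chunk's bytes, and the single concatenation is deferred until a contiguous buffer is actually required.

```python
import numpy as np

class DelayedBufferSketch:
    """Accumulate byte chunks cheaply; concatenate once, on demand."""

    def __init__(self) -> None:
        self._parts: list[np.ndarray] = []

    def append(self, part: np.ndarray) -> None:
        # O(1): only a reference is stored; previously appended bytes are not copied.
        self._parts.append(part)

    def materialize(self) -> np.ndarray:
        # Single O(total size) copy, instead of one full copy per append.
        if not self._parts:
            return np.empty(0, dtype="uint8")
        return np.concatenate(self._parts)
```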
TODO:
- `docs/user-guide/*.rst`
- `changes/`
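A rough reproduction of the benchmark described above might look like the sketch below. This assumes the zarr-python 3 `create_array` API with `shards`/`chunks` arguments and a local store; the array shape, dtype, and store path are placeholders.

```python
import time

import numpy as np
import zarr

# Assumed setup: zarr-python 3, sharding configured via the `shards=` argument.
store = zarr.storage.LocalStore("bench.zarr")
z = zarr.create_array(
    store=store,
    shape=(1024, 1024, 1024),
    shards=(512, 512, 512),
    chunks=(128, 128, 128),
    dtype="uint8",
)

data = np.random.randint(0, 255, size=(512, 512, 512), dtype="uint8")

t0 = time.perf_counter()
z[:512, :512, :512] = data  # writes exactly one full shard
print(f"one shard written in {time.perf_counter() - t0:.2f} s")
```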