Improve write performance of shards #2977
Conversation
…fer concatenation
@balbasty thank you so much for this work. I think your detective work here will be very much appreciated. general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of shards can be known prior to fetching anything.

I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table). Otherwise compressed chunks would take more space than needed. The format is either …

What I mean is that, when we get the index table, we also get the size of each compressed chunk. And when we are fetching chunks from a shard, we always know in advance which chunks we need. So it seems like the combination of the shard index + the set of requested chunks is sufficient to specify the required memory for compressed chunks exactly. Does this check out?
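If that reasoning holds, the required allocation could be computed straight from the decoded shard index before fetching any chunk data. A minimal sketch, assuming the index is a uint64 array of shape `chunks_per_shard + (2,)` holding `(offset, nbytes)` pairs with `2**64 - 1` marking missing chunks; the helper name is hypothetical:

```python
import numpy as np

def required_nbytes(shard_index: np.ndarray, requested: list[tuple[int, ...]]) -> int:
    """Sum the compressed sizes of the requested inner chunks of one shard.

    ``shard_index`` is assumed to have shape ``chunks_per_shard + (2,)`` and
    dtype uint64, holding ``(offset, nbytes)`` per chunk; ``2**64 - 1`` marks
    a chunk that was never written.
    """
    missing = 2**64 - 1
    total = 0
    for coords in requested:
        offset, nbytes = shard_index[coords]
        if offset != missing:  # skip chunks that are absent from the shard
            total += int(nbytes)
    return total
```

With the total known up front, a single buffer of exactly that size could in principle be allocated before any chunk bytes arrive.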
The poor write performance of sharded zarrs in the zarr-python implementation is currently a major limiting factor to its adoption by our group. We found that writing shard-by-shard into an empty sharded array is an order of magnitude slower than writing to an unsharded zarr. This is surprising, as writing full shards should only be marginally slower than writing unsharded chunks.

While this 2023 discussion suggests that the latency is caused by the re-generation of the index table, this does not seem to be the case in the latest implementation, which saves all chunks in memory and (properly) waits for all chunks to be available before generating the index table (see `_encode_partial_single`).

Instead, I found that a major cause of the slowdown comes from the implementation of the `Buffer` class, which calls `np.concatenate` every time bytes are added to the buffer, so the already-accumulated bytes are copied on every append and the total copying cost grows quadratically with the number of chunks. As a proof of concept, I have implemented an alternative `DelayedBuffer` class that keeps individual byte chunks in a list, and only concatenates them when needed. On a simple benchmark that uses `512**3` shards, `128**3` chunks and a local store, it reduces the time to write one shard from ~10 sec to ~1 sec, which is on par with the time taken to write the same `512**3` array in an unsharded zarr (~0.9 sec). A sketch reproducing this benchmark is included after the TODO list below.

I am keeping this as a draft for now as it is a hacky proof-of-concept implementation, but I am happy to clean it up if this is found to be a good solution (with guidance on how to implement the delayed buffer in a way that is compatible with the buffer prototype logic, which I don't fully understand). All tests pass except one that checks whether a store receives a `TestBuffer` (as it instead receives a `DelayedBuffer`).
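For context, the gist of the change is roughly the following (a simplified sketch, not the actual `Buffer`/`DelayedBuffer` code in this PR): appending stores a reference to each chunk's bytes, and the single concatenation is deferred until a contiguous buffer is actually required.

```python
import numpy as np

class DelayedBufferSketch:
    """Accumulate byte chunks cheaply; concatenate once, on demand."""

    def __init__(self) -> None:
        self._parts: list[np.ndarray] = []

    def append(self, part: np.ndarray) -> None:
        # O(1): only a reference is stored; previously appended bytes are not copied.
        self._parts.append(part)

    def materialize(self) -> np.ndarray:
        # Single O(total size) copy, instead of one full copy per append.
        if not self._parts:
            return np.empty(0, dtype="uint8")
        return np.concatenate(self._parts)
```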
TODO:
- `docs/user-guide/*.rst`
- `changes/`
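A rough reproduction of the benchmark described above might look like the sketch below. This assumes the zarr-python 3 `create_array` API with `shards`/`chunks` arguments and a local store; the array shape, dtype, and store path are placeholders.

```python
import time

import numpy as np
import zarr

# Assumed setup: zarr-python 3, sharding configured via the `shards=` argument.
store = zarr.storage.LocalStore("bench.zarr")
z = zarr.create_array(
    store=store,
    shape=(1024, 1024, 1024),
    shards=(512, 512, 512),
    chunks=(128, 128, 128),
    dtype="uint8",
)

data = np.random.randint(0, 255, size=(512, 512, 512), dtype="uint8")

t0 = time.perf_counter()
z[:512, :512, :512] = data  # writes exactly one full shard
print(f"one shard written in {time.perf_counter() - t0:.2f} s")
```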