No blob reuse with zstd:chunked partial pull, even when `ConcurrentBlobCopiesSemaphore` concurrency is 1 (#2798)
Comments
Thanks! In terms of expectations, writes to c/storage currently have two parts. (This is not an API promise, it can be reasonably considered suboptimal, and some structural changes might be coming.) One part is ~the download, the other is the commit. Downloads can happen in an ~arbitrary order; that only costs disk space. Commits must happen in layer order.

The semaphore really only controls the download part. If it were set to allow 5 concurrent layers, you should see 5 downloads start concurrently, and at the moment the first layer download is finished, commits will start. That can be immediately after layer 1 is downloaded and before layer 2 finishes downloading, or it might be after all 5 layers are downloaded (e.g. if layer 1 is much larger than layer 5).

So, when pulling 2 images concurrently this way, it can easily happen that both pulls start 5 concurrent layer downloads, and only after 2x5 layers are downloaded is the first layer actually committed — but all of these downloads have already happened, twice.

There’s a long-standing RFE containers/podman#1911 to improve this (“starting N applications sharing a base image triggers a thundering herd of redundant base layer downloads”), but it is fairly difficult to do in the no-daemon design of Podman. I wouldn’t want to promise that this is going to reliably improve any time soon.

Limiting the concurrency to exactly one helps, in that the download of the first layer is the only one that is started, and after the layer is downloaded, it can always be committed; that makes it visible to other layer downloads and other concurrent pulls. And the implementation happens to work in a way where the commit happens while holding … but there seems to be a straightforward bug in c/image, where …
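A rough model of that split (illustrative only; not the actual c/image code, and the names are invented): downloads acquire the shared semaphore and can finish in any order, while commits are applied strictly in layer order.

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

// pullImage is an invented stand-in for one image pull: every layer download
// is gated by the shared semaphore and may finish in any order, while the
// commit loop below applies layers strictly in order.
func pullImage(ctx context.Context, sem *semaphore.Weighted, layers []string) {
	downloaded := make([]chan struct{}, len(layers))
	for i := range downloaded {
		downloaded[i] = make(chan struct{})
	}

	var wg sync.WaitGroup
	for i, layer := range layers {
		wg.Add(1)
		go func(i int, layer string) {
			defer wg.Done()
			_ = sem.Acquire(ctx, 1) // the semaphore limits only this download part
			fmt.Println("downloaded", layer)
			sem.Release(1)
			close(downloaded[i])
		}(i, layer)
	}

	// Commits happen in layer order: layer i+1 is committed only after layer i,
	// no matter which download finished first.
	for i, layer := range layers {
		<-downloaded[i]
		fmt.Println("committed", layer)
	}
	wg.Wait()
}

func main() {
	sem := semaphore.NewWeighted(2)
	pullImage(context.Background(), sem, []string{"layer1", "layer2", "layer3"})
}
```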
Yeah. Generally my idea was: suppose you're pulling N layers across multiple images and you limit concurrency to C << N. You set C just high enough to consume all available network bandwidth, but not higher. For any pair of randomly selected layers there's a nontrivial chance that they will share chunks or even whole blobs. Then, since the small concurrency limit C establishes happens-before relationships between certain layers (layer C+1 will see layers 1..C, etc.), they will reuse each other's chunks. containers/podman#1911 is not required for this simple behavior. So it's desirable to commit layers as early as possible (to make them visible for later-pulled layers to reuse).
BTW, I thought that is only true for the old implementation (non-partial pulls), which untars the layers sequentially. Which is kinda bad for performance. (And also there's the global lock...) I guess it's still true to a certain degree, but whether it affects performance or not depends on what "commit" entails. In the partial pull implementation, does commit also have to perform some heavy long-running operations (like the old implementation's untarring of layers)? Or does it only do quick metadata updates?
In case it helps, I think a fix would be #2799. I’m afraid I can’t fully dedicate time to this now, but maybe it is useful as is — if this works at all, but you still see that reuse doesn’t happen, please collect debug-level logs.
It’s not true for the underlying file storage, but the c/storage layer metadata is built around the concept of a layer DAG where a child can only be created after a parent exists (and image layers are immutable, so it isn’t possible to just allocate IDs in advance and populate them in random order). This is something we are thinking about changing (and one of the reasons I wouldn’t want to commit too strongly to existing behaviors), because most modern backends don’t strictly need parents to exist, because this reduces layer reuse (updating the base layer breaks sharing of an unmodified child layer), and because we have removed the device-mapper driver, which really needed this structure. But it’s not clear we can completely eliminate this model, and anyway it will take quite some time to change.
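As a loose illustration of that constraint (a made-up model, not the c/storage API), a child layer record can only be created once its parent record exists, which forces commits into parent-before-child order:

```go
package main

import "fmt"

// layerStore is an invented toy model of a layer DAG, not c/storage.
type layerStore struct {
	layers map[string]string // layer ID -> parent ID ("" for a base layer)
}

// createLayer refuses to record a child before its parent exists, which is
// why commits cannot happen in arbitrary order even if the file contents
// are already staged on disk.
func (s *layerStore) createLayer(id, parent string) error {
	if parent != "" {
		if _, ok := s.layers[parent]; !ok {
			return fmt.Errorf("parent layer %q does not exist yet", parent)
		}
	}
	s.layers[id] = parent
	return nil
}

func main() {
	s := &layerStore{layers: map[string]string{}}
	fmt.Println(s.createLayer("child", "base")) // error: parent does not exist yet
	fmt.Println(s.createLayer("base", ""))      // nil
	fmt.Println(s.createLayer("child", "base")) // nil
}
```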
I'll take a look tomorrow, thanks!
Yep, #2799 appears to fix the issue. I can see now that much less data is being transferred in my experiments and it's a bit faster. In particular … Thanks @mtrmac! BTW I'm still wondering about this:
From what I understood, in the partial pull implementation …
I think it’s only a rename, plus metadata operations (the c/storage locks are a bit costly, but O(1)), for the layer contents. We do write “tar-split” metadata on commit, and if composefs is enabled, generate composefs metadata. Both could probably be staged to disk before commit, but that would probably be a bit more work than the simple fix above (and it would mostly be in the c/storage repo), so I’d prefer to track that separately.
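As a concrete mental model (purely invented paths and helper names; this is not c/storage code), committing an already-staged layer amounts to roughly this kind of work:

```go
package main

import (
	"os"
	"path/filepath"
)

// commitStagedLayer is an invented illustration of "commit is roughly a
// rename plus metadata writes"; the real c/storage code differs and also
// takes its store locks.
func commitStagedLayer(storeRoot, stagingDir, layerID string, tarSplit []byte) error {
	finalDir := filepath.Join(storeRoot, layerID)
	// The layer contents were already written into stagingDir during the
	// download, so publishing them is essentially a single rename...
	if err := os.Rename(stagingDir, finalDir); err != nil {
		return err
	}
	// ...plus writing per-layer metadata (tar-split here; composefs metadata
	// would be similar) and updating the store's records.
	return os.WriteFile(filepath.Join(storeRoot, layerID+".tar-split"), tarSplit, 0o600)
}
```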
tl;dr: If layer X is pulled before layer Y and they share chunks or even the full blob, pulling Y should skip fetching from the network whatever is already in local storage. But this doesn't seem to happen in partial-pulls mode if we use `ConcurrentBlobCopiesSemaphore` to sequence the layer pulls. Reuse does happen with gzip. It also happens if Y is from another image and we start its `copy.Image` only after X finished pulling. It seems that the semaphore doesn't establish a clear happens-before relationship between the layer pulls; perhaps some metadata updates are still being done concurrently.

I have the following golang script to test performance of image pulling with various concurrency settings and in different compression formats: I create a semaphore to limit concurrency which I pass as `ConcurrentBlobCopiesSemaphore`.
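The script itself isn't reproduced here, so the following is a minimal sketch of what it does (assuming a registry at localhost:5000 and containers-storage as the destination; error handling kept crude):

```go
package main

import (
	"context"
	"log"
	"os"
	"sync"

	"github.com/containers/image/v5/copy"
	"github.com/containers/image/v5/signature"
	"github.com/containers/image/v5/transports/alltransports"
	"golang.org/x/sync/semaphore"
)

func main() {
	ctx := context.Background()

	policyCtx, err := signature.NewPolicyContext(&signature.Policy{
		Default: []signature.PolicyRequirement{signature.NewPRInsecureAcceptAnything()},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer policyCtx.Destroy()

	// One semaphore shared by both copy.Image calls; capacity 1 is intended
	// to serialize the per-layer copies across the two pulls.
	sem := semaphore.NewWeighted(1)

	images := []string{"ubuntu1:zstd", "ubuntu2:zstd"} // two tags of the same image

	var wg sync.WaitGroup
	for _, name := range images {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			src, err := alltransports.ParseImageName("docker://localhost:5000/" + name)
			if err != nil {
				log.Fatal(err)
			}
			dst, err := alltransports.ParseImageName("containers-storage:" + name)
			if err != nil {
				log.Fatal(err)
			}
			if _, err := copy.Image(ctx, policyCtx, dst, src, &copy.Options{
				ConcurrentBlobCopiesSemaphore: sem,
				ReportWriter:                  os.Stderr,
			}); err != nil {
				log.Fatalf("copying %s: %v", name, err)
			}
		}(name)
	}
	wg.Wait()
}
```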
I have a local docker registry which stores the ubuntu:24.10 image in zstd:chunked format and in gzip format: `ubuntu1:zstd` is also tagged as `ubuntu2:zstd` in the registry, and `ubuntu1:gzip` is tagged as `ubuntu2:gzip`.

Baseline is pulling a single image.
I also measure how many bytes are transferred from the docker registry. Here 0.030 GB was transferred, which is the size of this single-layer image. Makes sense.

Next experiment is to pull both `ubuntu1:gzip` and `ubuntu2:gzip`. I expect only one of them to actually pull the data over the network, and the other to reuse the blob, since we're limiting `imageCopier.copyLayer` concurrency to 1. Indeed, we see "already exists" in the logs, and my metrics show that 0.030 GB was transferred. Great.

Now let's do the same with `zstd`. Baseline: 0.031 GB transferred. A bit worse than gzip, but that's ok; maybe this small layer just didn't compress that well with zstd:chunked.
But now the bad part: as the logs show, the blob was pulled twice! And my metrics also show 0.062 GB was transferred over the interface. I also did this experiment on a large (~2 GB compressed) image with ~10 layers. Every single layer was pulled twice when using the zstd version, even with concurrency 1.

It's not that zstd:chunked is not capable of reusing layers or even chunks. It is, which we can see if we serialize the operations at the user level, i.e. pull ubuntu1:zstd and then ubuntu2:zstd without clearing the local store in between.
But perhaps when we rely on `ConcurrentBlobCopiesSemaphore` for serialization, something goes wrong underneath with the metadata updates.