No blob reuse with zstd:chunked partial pull, even when ConcurrentBlobCopiesSemaphore concurrency is 1 #2798

Open · kbr- opened this issue Mar 20, 2025 · 8 comments · May be fixed by #2799
Labels: kind/bug (A defect in an existing functionality or a PR fixing it)

kbr- commented Mar 20, 2025

tl;dr: If layer X is pulled before layer Y and they share chunks or even the full blob, pulling Y should skip fetching from the network whatever is already in local storage. But this doesn't seem to happen in partial-pull mode when we use ConcurrentBlobCopiesSemaphore to sequence the layer pulls. Reuse does happen with gzip, and it also happens if Y is from another image and we start its copy.Image only after X has finished pulling. It seems the semaphore doesn't establish a clear happens-before relationship between the layer pulls; perhaps some metadata updates are still being done concurrently.


I have the following Go program to test the performance of image pulling with various concurrency settings and compression formats:

package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"strings"
	"sync"
	"time"

	"github.com/containers/image/v5/copy"
	"github.com/containers/image/v5/docker"
	"github.com/containers/image/v5/signature"
	"github.com/containers/image/v5/transports/alltransports"
	"github.com/containers/image/v5/types"
	"github.com/containers/storage"
	"github.com/containers/storage/pkg/reexec"
	"github.com/containers/storage/pkg/unshare"
	"github.com/sirupsen/logrus"
	"golang.org/x/sync/semaphore"
)

func main() {
	if reexec.Init() {
		return
	}

	unshare.MaybeReexecUsingUserNamespace(false)

	registryURL := flag.String("registry", "", "URL of the Docker registry")
	imageList := flag.String("images", "", "Comma-separated list of tagged image names")
	concurrency := flag.Int("concurrency", 3, "Concurrency factor for downloads")
	enablePartialImages := flag.Bool("enable-partial-images", false, "Whether to enable partial images")
	debugLogging := flag.Bool("debug", false, "Enable logrus debug logging")
	flag.Parse()

	if *debugLogging {
		logrus.SetLevel(logrus.DebugLevel)
	}
	logrus.SetFormatter(&logrus.TextFormatter{
		FullTimestamp: true,
	})

	if *registryURL == "" || *imageList == "" {
		fmt.Println("Registry URL and image list are required")
		flag.Usage()
		os.Exit(1)
	}

	images := strings.Split(*imageList, ",")
	for i, img := range images {
		images[i] = strings.TrimSpace(img)
	}

	storeOptions, err := storage.DefaultStoreOptions()
	if err != nil {
		logrus.Fatalf("Failed to get default store options: %v", err)
	}

	storeOptions.GraphDriverName = "overlay"
	storeOptions.PullOptions = map[string]string{
		"enable_partial_images": fmt.Sprintf("%t", *enablePartialImages),
	}

	store, err := storage.GetStore(storeOptions)
	if err != nil {
		logrus.Fatalf("Failed to initialize storage: %v", err)
	}

	defer func() {
		if layers, err := store.Shutdown(false); err != nil {
			logrus.Errorf("Error shutting down storage: %v, still mounted layers: %v", err, layers)
		}
	}()

	policy := &signature.Policy{
		Default: []signature.PolicyRequirement{
			signature.NewPRInsecureAcceptAnything(),
		},
	}
	policyContext, err := signature.NewPolicyContext(policy)
	if err != nil {
		logrus.Fatalf("Failed to create policy context: %v", err)
	}

	defer func() {
		if err := policyContext.Destroy(); err != nil {
			logrus.Errorf("Error destroying policy context: %v", err)
		}
	}()

	sem := semaphore.NewWeighted(int64(*concurrency))
	ctx := context.Background()

	sysCtx := &types.SystemContext{
		DockerInsecureSkipTLSVerify:       types.NewOptionalBool(true),
		OCIInsecureSkipTLSVerify:          true,
		DockerDaemonInsecureSkipTLSVerify: true,
	}

	startTime := time.Now()

	var wg sync.WaitGroup

	for _, imageName := range images {
		wg.Add(1)
		go func(img string) {
			defer wg.Done()

			srcRef, err := docker.ParseReference(fmt.Sprintf("//%s/%s", *registryURL, img))
			if err != nil {
				logrus.Errorf("Failed to parse reference for %s: %v", img, err)
				return
			}

			destRef, err := alltransports.ParseImageName(fmt.Sprintf("containers-storage:%s", img))
			if err != nil {
				logrus.Errorf("Failed to parse destination reference for %s: %v", img, err)
				return
			}

			options := &copy.Options{
				ReportWriter:                  os.Stdout,
				DestinationCtx:                sysCtx,
				SourceCtx:                     sysCtx,
				ConcurrentBlobCopiesSemaphore: sem,
			}

			logrus.Infof("Pulling %s from %s", img, *registryURL)
			_, err = copy.Image(ctx, policyContext, destRef, srcRef, options)

			if err != nil {
				logrus.Errorf("Failed to pull %s: %v", img, err)
				return
			}

			logrus.Infof("Successfully pulled %s", img)
		}(imageName)
	}

	wg.Wait()

	duration := time.Since(startTime)
	fmt.Printf("\nTotal pull operation completed in %s\n", duration)
}

I create a semaphore to limit concurrency and pass it as ConcurrentBlobCopiesSemaphore.

I have a local Docker registry that stores the ubuntu:24.10 image in both zstd:chunked and gzip formats:

$ skopeo inspect --tls-verify=false docker://localhost:5000/ubuntu1:zstd --raw | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:e40b6e31bd8cba664c267ee7a21689b1f98585818942761ff1cb8cd3c2df323d",
    "size": 2296
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+zstd",
      "digest": "sha256:0cc74fd753a0d2d2e8ebc140549a660831a99429f5b44275cc4dda88a90c7257",
      "size": 33180938,
      "annotations": {
        "io.github.containers.zstd-chunked.manifest-checksum": "sha256:a906ece44060de0be1de012cd50bcaceec3ac6bb680509c7ef8a1106a01365a4",
        "io.github.containers.zstd-chunked.manifest-position": "32649294:380035:1797807:1",
        "io.github.containers.zstd-chunked.tarsplit-position": "33029337:151529:3784936"
      }
    }
  ]
}
$ skopeo inspect --tls-verify=false docker://localhost:5000/ubuntu1:gzip --raw | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:e40b6e31bd8cba664c267ee7a21689b1f98585818942761ff1cb8cd3c2df323d",
    "size": 2296
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:d14e9fd344e3d4bb8ae535902ba39e40cca2ec9ab404e4cc1bf911e088b5361c",
      "size": 32585328
    }
  ]
}

ubuntu1:zstd is also tagged as ubuntu2:zstd in the registry, and ubuntu1:gzip is tagged as ubuntu2:gzip.

The baseline is pulling a single image:

$ go run ./ -registry localhost:5000 -images ubuntu1:gzip -enable-partial-images -concurrency 1
INFO[2025-03-20T12:04:20+01:00] Pulling ubuntu1:gzip from localhost:5000     
Getting image source signatures
Copying blob d14e9fd344e3 done   | 
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:04:21+01:00] Successfully pulled ubuntu1:gzip             

Total pull operation completed in 774.178588ms

I also measure how many bytes are transferred from the Docker registry. Here 0.030 GB was transferred, which is the size of this single-layer image. Makes sense.

The next experiment is to pull both ubuntu1:gzip and ubuntu2:gzip. I expect only one of them to actually pull the data over the network and the other to reuse the blob, since we're limiting imageCopier.copyLayer concurrency to 1.

$ go run ./ -registry localhost:5000 -images ubuntu1:gzip,ubuntu2:gzip -enable-partial-images -concurrency 1
INFO[2025-03-20T12:06:44+01:00] Pulling ubuntu2:gzip from localhost:5000     
INFO[2025-03-20T12:06:44+01:00] Pulling ubuntu1:gzip from localhost:5000     
Getting image source signatures
Getting image source signatures
Copying blob d14e9fd344e3 done   | 
Copying blob d14e9fd344e3 skipped: already exists  
Copying config e40b6e31bd done   | 
Writing manifest to image destination
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:06:45+01:00] Successfully pulled ubuntu1:gzip             
INFO[2025-03-20T12:06:45+01:00] Successfully pulled ubuntu2:gzip             

Total pull operation completed in 798.510689ms

Indeed, we see "already exists" in the logs, and my metrics show that 0.030 GB was transferred. Great.

Now let's do the same with zstd.
Baseline:

$ go run ./ -registry localhost:5000 -images ubuntu1:zstd -enable-partial-images -concurrency 1                                                                                          
INFO[2025-03-20T12:07:35+01:00] Pulling ubuntu1:zstd from localhost:5000     
Getting image source signatures
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:07:36+01:00] Successfully pulled ubuntu1:zstd             

Total pull operation completed in 844.650036ms

0.031 GB transferred.

A bit worse than gzip, but that's OK; maybe this small layer just doesn't compress that well with zstd:chunked.

But now the bad part:

$ go run ./ -registry localhost:5000 -images ubuntu1:zstd,ubuntu2:zstd -enable-partial-images -concurrency 1
INFO[2025-03-20T12:08:37+01:00] Pulling ubuntu2:zstd from localhost:5000     
INFO[2025-03-20T12:08:37+01:00] Pulling ubuntu1:zstd from localhost:5000     
Getting image source signatures
Getting image source signatures
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying config e40b6e31bd done   | 
Writing manifest to image destination
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:08:38+01:00] Successfully pulled ubuntu2:zstd             
INFO[2025-03-20T12:08:38+01:00] Successfully pulled ubuntu1:zstd             

Total pull operation completed in 1.632340319s

As the logs show, the blob was pulled twice! And my metrics also show that 0.062 GB was transferred over the interface.

I also did this experiment on a large (~2 GB compressed) image with ~10 layers. Every single layer was pulled twice when using the zstd version, even with concurrency 1.

It's not that zstd:chunked is incapable of reusing layers or even chunks. It is, as we can see when we serialize the operations at the user level, i.e. pull ubuntu1:zstd and then ubuntu2:zstd without clearing the local store in between:

$ go run ./ -registry localhost:5000 -images ubuntu1:zstd -enable-partial-images -concurrency 1
INFO[2025-03-20T12:10:51+01:00] Pulling ubuntu1:zstd from localhost:5000     
Getting image source signatures
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:10:52+01:00] Successfully pulled ubuntu1:zstd             

Total pull operation completed in 816.480472ms
$ go run ./ -registry localhost:5000 -images ubuntu2:zstd -enable-partial-images -concurrency 1
INFO[2025-03-20T12:10:54+01:00] Pulling ubuntu2:zstd from localhost:5000     
Getting image source signatures
Copying blob 0cc74fd753a0 skipped: already exists  
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:10:54+01:00] Successfully pulled ubuntu2:zstd             

Total pull operation completed in 84.912156ms

But perhaps, when we rely on ConcurrentBlobCopiesSemaphore for serialization, something goes wrong underneath with the metadata updates.


mtrmac commented Mar 20, 2025

Thanks!

In terms of expectations, writes to c/storage currently have two parts (this is not an API promise, it can reasonably be considered suboptimal, and some structural changes might be coming): one part is ~the download (putBlobToPendingFile / PutBlobPartial), the other is creating a layer in c/storage (commitLayer), so that the layer is actually visible outside of that one pull operation.

Downloads can happen in an ~arbitrary order, that only costs disk space. Commits must happen in layer order.

The semaphore really only controls the download part. If it were set to allow 5 concurrent layers, you should see 5 downloads start concurrently, and at the moment the first layer download is finished, commits will start. That can be immediately after layer 1 is downloaded and before layer 2 finishes downloading, or it might be after all 5 layers are downloaded (e.g. if layer 1 is much larger than layer 5).

So, when pulling 2 images concurrently this way, it can easily happen that both pulls start 5 concurrent layer downloads, and the first layer is actually committed only after 2×5 layers are downloaded — but by then all of these downloads have already happened, twice.
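
To make the two parts concrete, here is a rough sketch of the per-layer flow as described above (illustrative only, not the actual c/image code; downloadLayer and queueCommit are hypothetical callbacks):

package sketch

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// pullLayer sketches the two-part write described above: the weighted
// semaphore bounds only the download phase (putBlobToPendingFile /
// PutBlobPartial in c/image terms); the commit phase, which makes the layer
// visible to other pulls, is queued separately and must run in layer order.
func pullLayer(ctx context.Context, sem *semaphore.Weighted, layerIndex int,
	downloadLayer func(context.Context) error, queueCommit func(int) error) error {

	if err := sem.Acquire(ctx, 1); err != nil {
		return err
	}
	err := downloadLayer(ctx)
	sem.Release(1)
	if err != nil {
		return err
	}
	// Only once the commit completes can other pulls reuse this layer.
	return queueCommit(layerIndex)
}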

There’s a long-standing RFE containers/podman#1911 to improve this (“starting N applications sharing a base image triggers a thundering herd of redundant base layer downloads”), but it is fairly difficult to do in the no-daemon design of Podman. I wouldn’t want to promise that this is going to reliably improve any time soon.


Limiting the concurrency to exactly one helps, in that the download of the first layer is the only one that is started, and after the layer is downloaded, it can always be committed; and that makes it visible to other layer downloads and other concurrent pulls. And the implementation happens to work in a way where the commit happens while holding ConcurrentBlobCopiesSemaphore — it’s not clear that this is intentional, but it works that way now.

… but there seems to be a straightforward bug in c/image, where PutBlobPartial does not call queueOrCommit; so, layer commits only happen after all layers of that image have been downloaded, and in this test scenario, both parallel downloads can download layer 1 before the commit happens.
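
The shape of the needed change, very roughly (a minimal sketch, not the actual #2799 diff; dest and applyPartial are stand-in names):

package sketch

import "context"

// dest is a stand-in for c/image's storage destination; the function fields
// are placeholders for its real internals.
type dest struct {
	applyPartial  func(context.Context) error      // the partial pull itself
	queueOrCommit func(context.Context, int) error // per-image, in-order commit queue
}

// After a successful partial pull, queue the layer for an in-order commit,
// like the non-partial path does, instead of leaving all commits until the
// whole image has been downloaded.
func (d *dest) putBlobPartial(ctx context.Context, layerIndex int) error {
	if err := d.applyPartial(ctx); err != nil {
		return err
	}
	return d.queueOrCommit(ctx, layerIndex)
}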

mtrmac added the kind/bug label Mar 20, 2025

kbr- commented Mar 20, 2025

> Limiting the concurrency to exactly one helps, in that the download of the first layer is the only one that is started, and after the layer is downloaded, it can always be committed; and that makes it visible to other layer downloads and other concurrent pulls. And the implementation happens to work in a way where the commit happens while holding ConcurrentBlobCopiesSemaphore — it’s not clear that this is intentional, but it works that way now.

Yeah. Generally my idea was: suppose you're pulling N layers across multiple images and you limit concurrency to C << N. You set C just high enough to consume all available network bandwidth, but not higher. For any pair of randomly selected layers there's a nontrivial chance that they will share chunks or even whole blobs. Then, since the small concurrency limit C establishes happens-before relationships between certain layers (layer C+1 will see layers 1..C, etc.), they will reuse each other's chunks.

containers/podman#1911 is not required for this simple behavior.

So it's desirable to commit layers as early as possible (so as to make them visible for later-pulled layers to reuse).
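
As a toy illustration of that happens-before argument (assuming, as noted above, that the commit happens while the semaphore is still held; the layer name and helpers here are made up, this is not c/image code):

package sketch

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

// With a weight-1 semaphore, the second pull can only acquire after the first
// has released, so a layer committed before the release is visible to it and
// the download is skipped.
func demo() {
	ctx := context.Background()
	sem := semaphore.NewWeighted(1)
	committed := map[string]bool{} // stands in for layers visible in local storage

	pull := func(layer string) {
		_ = sem.Acquire(ctx, 1)
		defer sem.Release(1)
		if committed[layer] {
			fmt.Println("reusing already committed layer", layer)
			return
		}
		// ... download over the network ...
		committed[layer] = true // commit while still holding the semaphore
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			pull("0cc74fd753a0")
		}()
	}
	wg.Wait()
}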


kbr- commented Mar 20, 2025

> Commits must happen in layer order.

BTW, I thought that was only true for the old implementation (non-partial pulls), which untars the layers sequentially. Which is kinda bad for performance. (And also there's the global lock...)

I guess it's still true to a certain degree, but whether it affects performance or not depends on what "commit" entails.

In the partial-pull implementation, does commit also have to perform some heavy long-running operations (like the old implementation's untarring of layers)? Or does it only do quick metadata updates?


mtrmac commented Mar 20, 2025

In case it helps, I think a fix would be #2799. I’m afraid I can’t fully dedicate time to this now, but maybe it is useful as is — if this works at all but you still see that reuse doesn’t happen, please collect debug-level logs.


mtrmac commented Mar 20, 2025

> Commits must happen in layer order.
>
> BTW, I thought that was only true for the old implementation (non-partial pulls), which untars the layers sequentially.

It’s not true for the underlying file storage, but the c/storage layer metadata is built around the concept of a layer DAG where a child can only be created after a parent exists (and image layers are immutable, so it isn’t possible to just allocate IDs in advance and populate them in random order).
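
A minimal model of that constraint (illustrative only, not the c/storage API):

package sketch

import "fmt"

// layerStore models the metadata constraint described above: a child layer
// can only be created once its parent already exists.
type layerStore struct {
	parents map[string]string // layer ID -> parent ID ("" for a base layer)
}

func (s *layerStore) createLayer(id, parent string) error {
	if parent != "" {
		if _, ok := s.parents[parent]; !ok {
			return fmt.Errorf("parent layer %q does not exist yet", parent)
		}
	}
	s.parents[id] = parent
	return nil
}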

This is something we are thinking about changing (and one of the reasons I wouldn’t want to commit too strongly to existing behaviors), because most modern backends don’t strictly need parents to exist, because this reduces layer reuse (updating the base layer breaks sharing of an unmodified child layer), and because we have removed the device-mapper driver which really needed this structure.

But it’s not clear we can completely eliminate this model, and anyway it will take quite some time to change.


kbr- commented Mar 20, 2025

> In case it helps, I think a fix would be #2799. I’m afraid I can’t fully dedicate time to this now, but maybe it is useful as is — if this works at all but you still see that reuse doesn’t happen, please collect debug-level logs.

I'll take a look tomorrow, thanks!


kbr- commented Mar 22, 2025

Yep, #2799 appears to fix the issue. I can see now that much less data is being transferred in my experiments, and it's a bit faster. In particular, the ubuntu1:zstd,ubuntu2:zstd example from the report transfers 0.031 GB with concurrency 1, as it should.

Thanks @mtrmac!


BTW I'm still wondering about this:

> In the partial-pull implementation, does commit also have to perform some heavy long-running operations (like the old implementation's untarring of layers)? Or does it only do quick metadata updates?

From what I understood, in the partial-pull implementation commitLayer is only supposed to perform a quick move/rename, but I'm still seeing quite a bit of time being spent there (I haven't investigated where exactly, though). And this part is still completely sequential, so if it does some heavy CPU work, that's quite bad.


mtrmac commented Mar 24, 2025

For the layer contents, I think it’s only a rename, plus metadata operations (the c/storage locks are a bit costly, but O(1)). We do write “tar-split” metadata on commit, and if composefs is enabled, we generate composefs metadata.

Both could probably be staged to disk before commit, but that would probably be a bit more work than the simple fix above (and it would mostly be in the c/storage repo), so I’d prefer to track that separately.
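
For reference, a very rough sketch of what such a commit amounts to under this description (paths and helper names are illustrative; composefs metadata generation is omitted; this is not the c/storage implementation):

package sketch

import (
	"os"
	"path/filepath"
)

// commitLayer sketches the commit step as described above: rename the
// already-populated staging directory into place, then write per-layer
// metadata (tar-split here; composefs metadata would be generated similarly
// when enabled).
func commitLayer(stagingDir, layersRoot, layerID string, tarSplit []byte) error {
	if err := os.Rename(stagingDir, filepath.Join(layersRoot, layerID)); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(layersRoot, layerID+".tar-split"), tarSplit, 0o600)
}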
