
No blob reuse with zstd:chunked partial pull, even when ConcurrentBlobCopiesSemaphore concurrency is 1 #2798

Closed
@kbr-

Description

tl;dr: If layer X is pulled before layer Y and they share chunks (or even the whole blob), pulling Y should skip fetching from the network whatever is already in local storage. This doesn't seem to happen in partial-pull mode when ConcurrentBlobCopiesSemaphore is used to sequence the layer pulls. Reuse does work with gzip, and it also works with zstd:chunked if Y belongs to another image whose copy.Image is started only after X has finished pulling. It looks like the semaphore doesn't establish a clear happens-before relationship between the layer pulls; perhaps some metadata updates still happen concurrently.


I use the following Go program to test the performance of image pulls with various concurrency settings and compression formats:

package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"strings"
	"sync"
	"time"

	"github.com/containers/image/v5/copy"
	"github.com/containers/image/v5/docker"
	"github.com/containers/image/v5/signature"
	"github.com/containers/image/v5/transports/alltransports"
	"github.com/containers/image/v5/types"
	"github.com/containers/storage"
	"github.com/containers/storage/pkg/reexec"
	"github.com/containers/storage/pkg/unshare"
	"github.com/sirupsen/logrus"
	"golang.org/x/sync/semaphore"
)

func main() {
	if reexec.Init() {
		return
	}

	unshare.MaybeReexecUsingUserNamespace(false)

	registryURL := flag.String("registry", "", "URL of the Docker registry")
	imageList := flag.String("images", "", "Comma-separated list of tagged image names")
	concurrency := flag.Int("concurrency", 3, "Concurrency factor for downloads")
	enablePartialImages := flag.Bool("enable-partial-images", false, "Whether to enable partial images")
	debugLogging := flag.Bool("debug", false, "Enable logrus debug logging")
	flag.Parse()

	if *debugLogging {
		logrus.SetLevel(logrus.DebugLevel)
	}
	logrus.SetFormatter(&logrus.TextFormatter{
		FullTimestamp: true,
	})

	if *registryURL == "" || *imageList == "" {
		fmt.Println("Registry URL and image list are required")
		flag.Usage()
		os.Exit(1)
	}

	images := strings.Split(*imageList, ",")
	for i, img := range images {
		images[i] = strings.TrimSpace(img)
	}

	storeOptions, err := storage.DefaultStoreOptions()
	if err != nil {
		logrus.Fatalf("Failed to get default store options: %v", err)
	}

	storeOptions.GraphDriverName = "overlay"
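	// "enable_partial_images" is the containers/storage pull option that turns zstd:chunked partial pulls on or off.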
	storeOptions.PullOptions = map[string]string{
		"enable_partial_images": fmt.Sprintf("%t", *enablePartialImages),
	}

	store, err := storage.GetStore(storeOptions)
	if err != nil {
		logrus.Fatalf("Failed to initialize storage: %v", err)
	}

	defer func() {
		if layers, err := store.Shutdown(false); err != nil {
			logrus.Errorf("Error shutting down storage: %v, still mounted layers: %v", err, layers)
		}
	}()

	policy := &signature.Policy{
		Default: []signature.PolicyRequirement{
			signature.NewPRInsecureAcceptAnything(),
		},
	}
	policyContext, err := signature.NewPolicyContext(policy)
	if err != nil {
		logrus.Fatalf("Failed to create policy context: %v", err)
	}

	defer func() {
		if err := policyContext.Destroy(); err != nil {
			logrus.Errorf("Error destroying policy context: %v", err)
		}
	}()

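	// A single weighted semaphore shared by every copy.Image call below; with -concurrency 1,
	// at most one blob copy should be in flight at a time across all images.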
	sem := semaphore.NewWeighted(int64(*concurrency))
	ctx := context.Background()

	sysCtx := &types.SystemContext{
		DockerInsecureSkipTLSVerify:       types.NewOptionalBool(true),
		OCIInsecureSkipTLSVerify:          true,
		DockerDaemonInsecureSkipTLSVerify: true,
	}

	startTime := time.Now()

	var wg sync.WaitGroup

	for _, imageName := range images {
		wg.Add(1)
		go func(img string) {
			defer wg.Done()

			srcRef, err := docker.ParseReference(fmt.Sprintf("//%s/%s", *registryURL, img))
			if err != nil {
				logrus.Errorf("Failed to parse reference for %s: %v", img, err)
				return
			}

			destRef, err := alltransports.ParseImageName(fmt.Sprintf("containers-storage:%s", img))
			if err != nil {
				logrus.Errorf("Failed to parse destination reference for %s: %v", img, err)
				return
			}

			options := &copy.Options{
				ReportWriter:                  os.Stdout,
				DestinationCtx:                sysCtx,
				SourceCtx:                     sysCtx,
				ConcurrentBlobCopiesSemaphore: sem,
			}

			logrus.Infof("Pulling %s from %s", img, *registryURL)
			_, err = copy.Image(ctx, policyContext, destRef, srcRef, options)

			if err != nil {
				logrus.Errorf("Failed to pull %s: %v", img, err)
				return
			}

			logrus.Infof("Successfully pulled %s", img)
		}(imageName)
	}

	wg.Wait()

	duration := time.Since(startTime)
	fmt.Printf("\nTotal pull operation completed in %s\n", duration)
}

I create a single semaphore to limit concurrency and pass it to every copy.Image call as ConcurrentBlobCopiesSemaphore.

I have a local Docker registry that stores the ubuntu:24.10 image both in zstd:chunked format and in gzip format:

$ skopeo inspect --tls-verify=false docker://localhost:5000/ubuntu1:zstd --raw | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:e40b6e31bd8cba664c267ee7a21689b1f98585818942761ff1cb8cd3c2df323d",
    "size": 2296
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+zstd",
      "digest": "sha256:0cc74fd753a0d2d2e8ebc140549a660831a99429f5b44275cc4dda88a90c7257",
      "size": 33180938,
      "annotations": {
        "io.github.containers.zstd-chunked.manifest-checksum": "sha256:a906ece44060de0be1de012cd50bcaceec3ac6bb680509c7ef8a1106a01365a4",
        "io.github.containers.zstd-chunked.manifest-position": "32649294:380035:1797807:1",
        "io.github.containers.zstd-chunked.tarsplit-position": "33029337:151529:3784936"
      }
    }
  ]
}
$ skopeo inspect --tls-verify=false docker://localhost:5000/ubuntu1:gzip --raw | jq
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:e40b6e31bd8cba664c267ee7a21689b1f98585818942761ff1cb8cd3c2df323d",
    "size": 2296
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:d14e9fd344e3d4bb8ae535902ba39e40cca2ec9ab404e4cc1bf911e088b5361c",
      "size": 32585328
    }
  ]
}

ubuntu1:zstd is also tagged as ubuntu2:zstd in the registry, and ubuntu1:gzip is tagged as ubuntu2:gzip.
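
(For anyone reproducing this: a registry with equivalent contents can be set up with skopeo along the following lines. These commands are an approximation of the setup, not copied from my shell history, and the exact compression flags may vary between skopeo versions.)

$ skopeo copy --format oci --dest-tls-verify=false --dest-compress-format zstd:chunked docker://docker.io/library/ubuntu:24.10 docker://localhost:5000/ubuntu1:zstd
$ skopeo copy --format oci --dest-tls-verify=false --dest-compress-format gzip docker://docker.io/library/ubuntu:24.10 docker://localhost:5000/ubuntu1:gzip
$ skopeo copy --src-tls-verify=false --dest-tls-verify=false docker://localhost:5000/ubuntu1:zstd docker://localhost:5000/ubuntu2:zstd
$ skopeo copy --src-tls-verify=false --dest-tls-verify=false docker://localhost:5000/ubuntu1:gzip docker://localhost:5000/ubuntu2:gzip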

Baseline is pulling a single image:

$ go run ./ -registry localhost:5000 -images ubuntu1:gzip -enable-partial-images -concurrency 1
INFO[2025-03-20T12:04:20+01:00] Pulling ubuntu1:gzip from localhost:5000     
Getting image source signatures
Copying blob d14e9fd344e3 done   | 
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:04:21+01:00] Successfully pulled ubuntu1:gzip             

Total pull operation completed in 774.178588ms

I also measure how many bytes are transferred from the Docker registry. Here 0.030 GB was transferred, which matches the size of this single-layer image. Makes sense.
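
(One simple way to get such numbers, since the registry is on localhost and the traffic goes over loopback, is to read the interface counters before and after a run; `lo` here is just the loopback interface on a typical Linux machine:)

$ cat /sys/class/net/lo/statistics/rx_bytes    # before
$ go run ./ -registry localhost:5000 -images ubuntu1:gzip -enable-partial-images -concurrency 1
$ cat /sys/class/net/lo/statistics/rx_bytes    # after; the difference is the bytes received during the pull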

The next experiment is to pull both ubuntu1:gzip and ubuntu2:gzip. I expect only one of them to actually pull the data over the network and the other to reuse the blob, since we're limiting imageCopier.copyLayer concurrency to 1.

$ go run ./ -registry localhost:5000 -images ubuntu1:gzip,ubuntu2:gzip -enable-partial-images -concurrency 1
INFO[2025-03-20T12:06:44+01:00] Pulling ubuntu2:gzip from localhost:5000     
INFO[2025-03-20T12:06:44+01:00] Pulling ubuntu1:gzip from localhost:5000     
Getting image source signatures
Getting image source signatures
Copying blob d14e9fd344e3 done   | 
Copying blob d14e9fd344e3 skipped: already exists  
Copying config e40b6e31bd done   | 
Writing manifest to image destination
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:06:45+01:00] Successfully pulled ubuntu1:gzip             
INFO[2025-03-20T12:06:45+01:00] Successfully pulled ubuntu2:gzip             

Total pull operation completed in 798.510689ms

Indeed, we see "already exists" in the logs, and my metrics show that 0.030 GB was transferred. Great.

Now let's do the same with zstd.
Baseline:

$ go run ./ -registry localhost:5000 -images ubuntu1:zstd -enable-partial-images -concurrency 1                                                                                          
INFO[2025-03-20T12:07:35+01:00] Pulling ubuntu1:zstd from localhost:5000     
Getting image source signatures
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:07:36+01:00] Successfully pulled ubuntu1:zstd             

Total pull operation completed in 844.650036ms

0.031 GB transferred.

A bit worse than gzip, but that's OK; maybe this small layer just doesn't compress that well with zstd:chunked.

But now the bad part:

$ go run ./ -registry localhost:5000 -images ubuntu1:zstd,ubuntu2:zstd -enable-partial-images -concurrency 1
INFO[2025-03-20T12:08:37+01:00] Pulling ubuntu2:zstd from localhost:5000     
INFO[2025-03-20T12:08:37+01:00] Pulling ubuntu1:zstd from localhost:5000     
Getting image source signatures
Getting image source signatures
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying config e40b6e31bd done   | 
Writing manifest to image destination
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:08:38+01:00] Successfully pulled ubuntu2:zstd             
INFO[2025-03-20T12:08:38+01:00] Successfully pulled ubuntu1:zstd             

Total pull operation completed in 1.632340319s

As the logs show, the blob was pulled twice! My metrics also show that 0.062 GB was transferred over the interface.

I also ran this experiment on a large (~2 GB compressed) image with ~10 layers. Every single layer was pulled twice with the zstd:chunked version, even with concurrency 1.

It's not that zstd:chunked is incapable of reusing layers or even chunks. It is capable, as we can see if we serialize the operations at the user level, i.e. pull ubuntu1:zstd and then ubuntu2:zstd without clearing the local store in between:

$ go run ./ -registry localhost:5000 -images ubuntu1:zstd -enable-partial-images -concurrency 1
INFO[2025-03-20T12:10:51+01:00] Pulling ubuntu1:zstd from localhost:5000     
Getting image source signatures
Copying blob 0cc74fd753a0 done  31.6MiB / 31.6MiB (skipped: 540.0b = 0.00%)
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:10:52+01:00] Successfully pulled ubuntu1:zstd             

Total pull operation completed in 816.480472ms
$ go run ./ -registry localhost:5000 -images ubuntu2:zstd -enable-partial-images -concurrency 1
INFO[2025-03-20T12:10:54+01:00] Pulling ubuntu2:zstd from localhost:5000     
Getting image source signatures
Copying blob 0cc74fd753a0 skipped: already exists  
Copying config e40b6e31bd done   | 
Writing manifest to image destination
INFO[2025-03-20T12:10:54+01:00] Successfully pulled ubuntu2:zstd             

Total pull operation completed in 84.912156ms

But perhaps when we rely on ConcurrentBlobCopiesSemaphore for serialization, something goes wrong underneath with the metadata updates.
