
When downloading a zstd:chunked image from a docker registry backed by an s3 bucket, downloading the manifest causes the entire layer to be downloaded (effectively the full layer blob is downloaded twice) #2792


Description

  • AWS S3 (and also minio) doesn't support multi-part range requests
  • the PutBlobPartial function (storage/storage_dest.go in containers/image) relies on HTTP multi-part range requests to obtain some metadata from the end of the blob. It performs a single range request with 2 ranges (perhaps to get the tar-split data and the chunked manifest data):
func (s *dockerImageSource) GetBlobAt(ctx context.Context, info types.BlobInfo, chunks []private.ImageSourceChunk) (chan io.ReadCloser, chan error, error) {
	headers := make(map[string][]string)

	rangeVals := make([]string, 0, len(chunks))
	lastFound := false
	for _, c := range chunks {
		if lastFound {
			return nil, nil, fmt.Errorf("internal error: another chunk requested after an until-EOF chunk")
		}
		// If the Length is set to -1, then request anything after the specified offset.
		if c.Length == math.MaxUint64 {
			lastFound = true
			rangeVals = append(rangeVals, fmt.Sprintf("%d-", c.Offset))
		} else {
			rangeVals = append(rangeVals, fmt.Sprintf("%d-%d", c.Offset, c.Offset+c.Length-1))
		}
	}

	headers["Range"] = []string{fmt.Sprintf("bytes=%s", strings.Join(rangeVals, ","))}
  • a docker registry started against s3/minio returns a 200 response instead of 206
  • this causes the library to download the full layer blob (silently, without any warning 👌), just to discard 99.9% of it in order to obtain the metadata (a quick way to observe the 200 behavior with curl is sketched right after this list):
	switch res.StatusCode {
	case http.StatusOK:
		// if the server replied with a 200 status code, convert the full body response to a series of
		// streams as it would have been done with 206.
		streams := make(chan io.ReadCloser)
		errs := make(chan error)
		go splitHTTP200ResponseToPartial(streams, errs, res.Body, chunks)
  • then it calculates the missing chunks and downloads a large part of the blob again
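
To observe the 200-instead-of-206 behavior by hand, you can issue a multi-part range request against a layer blob once the reproduction setup below is running. This is only a sketch: the repository name img and tag latest match the push command used later, the byte ranges are arbitrary, and jq is used to pick the first layer digest out of the manifest.

# Grab a layer digest from the manifest, then request two byte ranges of that blob.
# The last status line printed shows whether the registry honored the multi-part
# range request (206 Partial Content) or fell back to 200 OK and streamed the whole
# blob (here discarded into /dev/null). -L follows a possible redirect to the backend.
digest=$(curl -sS \
  -H 'Accept: application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
  http://localhost:5000/v2/img/manifests/latest | jq -r '.layers[0].digest')
curl -sSL -o /dev/null -D - \
  -H 'Range: bytes=0-99,1000-1099' \
  "http://localhost:5000/v2/img/blobs/$digest"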

Possibly related: containers/storage#1928


To reproduce, you can use the following docker compose setup.
docker-compose.yml:

networks:
  registry_net:
    driver: bridge

services:
  minio:
    image: minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - ./minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    networks:
      registry_net:
        aliases:
          - minio.localhost

  registry:
    image: registry:2
    container_name: minio-registry
    ports:
      - "5000:5000"
    volumes:
      - ./registry-config.yml:/etc/docker/registry/config.yml
    command: serve /etc/docker/registry/config.yml
    depends_on:
      - minio
    networks:
      - registry_net

registry-config.yml:

version: 0.1
storage:
  s3:
    accesskey: minioadmin
    secretkey: minioadmin
    region: us-east-1
    regionendpoint: http://minio.localhost:9000
    bucket: docker-registry
    secure: false
    rootdirectory: /
  delete:
    enabled: true
http:
  addr: :5000

Create the directory ./minio-data and run docker compose up. Then use the minio client mc to create the bucket docker-registry inside minio
(after setting up the mc alias reg):
mc mb reg/docker-registry
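
The alias step isn't spelled out above; with the endpoint and credentials from docker-compose.yml it can be set roughly like this (reg is just the alias name used in this writeup):

mc alias set reg http://localhost:9000 minioadmin minioadmin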

Pick any image and pull it to local storage with buildah. Then push it to the registry, compressing with zstd:chunked:

buildah push --tls-verify=false --compression-format zstd:chunked <image-id> docker://localhost:5000/img

Then remove it from local storage, since we're going to pull it next:

buildah rmi <image-id>
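
Putting the preparation steps together, here is a sketch (docker.io/library/ubuntu:24.04 is only an example reference; any sufficiently large image works, and buildah pull -q prints the pulled image ID):

# pull a test image, push it re-compressed as zstd:chunked, then drop the local copy
img_id=$(buildah pull -q docker.io/library/ubuntu:24.04)
buildah push --tls-verify=false --compression-format zstd:chunked "$img_id" docker://localhost:5000/img
buildah rmi "$img_id"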

Now, I was testing this with bandwidth limited using tc, so the behavior is visible to the naked eye: when pulling the image, at the end buildah gets "stuck" at 0% for a while (but a download is happening behind the scenes, visible e.g. in slurm). That's most likely the part where it's downloading the manifests (actually downloading the whole blob 🥲).

Here's the script to limit bandwidth; run it with sudo in the same directory where docker-compose.yml is:

#!/bin/bash

# Need to create and start virtual interface first:
# sudo ip link add name ifb0 type ifb
# sudo ip link set dev ifb0 up
#
# virtual interface used to turn $interface ingress traffic into egress traffic for ifb0

interface="$(docker compose config --format json | jq .networks.registry_net.name | xargs -I {} docker network ls -q --filter name={} --format br-\{\{.ID\}\})"

echo "Interface used by docker compose: $interface"

download_limit=100mbps

tc qdisc del dev "$interface" ingress 2>/dev/null
tc qdisc del dev "ifb0" root 2>/dev/null

# It's not possible to shape ingress, only police it
# So instead we reroute all ingress traffic into virtual interface "ifb0" on which it becomes egress
tc qdisc add dev "$interface" ingress
tc filter add dev "$interface" parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0

tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:1 htb rate "$download_limit"
tc class add dev ifb0 parent 1:1 classid 1:10 htb rate "$download_limit"

echo To remove: tc qdisc del dev "$interface" ingress
echo To remove: tc qdisc del dev "ifb0" root

(Note that the script requires the ifb0 virtual interface to exist; see the comment inside the script.)
For my tests I was limiting to 100mbps and using a ~700MB image (compressed size in the registry):

$ skopeo inspect --tls-verify=false docker://localhost:5000/img --raw | jq -r '([.layers.[].size] | add)/1024/1024'
691.4367246627808

Adjust the limits appropriately based on the size of the image you're using for testing.
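
One more prerequisite for partial pulls: enable_partial_images must be set to "true". It lives in the pull_options table of containers-storage.conf (typically /etc/containers/storage.conf, or ~/.config/containers/storage.conf for rootless use); a minimal fragment, assuming the stock file layout, looks roughly like this:

[storage.options]
# partial ("chunked") pulls are what exercise the code path described above
pull_options = {enable_partial_images = "true"}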

Now pull (with enable_partial_images enabled as described above):

time ~/go/bin/buildah --log-level trace --tls-verify=false pull docker://localhost:5000/img

With vanilla buildah this took for me:

real    0m17.138s
user    0m12.764s
sys     0m9.426s

Now we can apply a workaround that translates a multi-part range request into a sequence of single-part range requests (courtesy of @psarna):

diff --git a/vendor/github.com/containers/image/v5/storage/storage_dest.go b/vendor/github.com/containers/image/v5/storage/storage_dest.go
index 0af4523ff..360273633 100644
--- a/vendor/github.com/containers/image/v5/storage/storage_dest.go
+++ b/vendor/github.com/containers/image/v5/storage/storage_dest.go
@@ -323,20 +323,32 @@ type zstdFetcher struct {
 
 // GetBlobAt converts from chunked.GetBlobAt to BlobChunkAccessor.GetBlobAt.
 func (f *zstdFetcher) GetBlobAt(chunks []chunked.ImageSourceChunk) (chan io.ReadCloser, chan error, error) {
-       newChunks := make([]private.ImageSourceChunk, 0, len(chunks))
-       for _, v := range chunks {
-               i := private.ImageSourceChunk{
-                       Offset: v.Offset,
-                       Length: v.Length,
+       errs := make(chan error)
+       rc := make(chan io.ReadCloser)
+       logrus.Debug("chunks: ", chunks)
+       go func() {
+               for _, v := range chunks {
+                       i := private.ImageSourceChunk{
+                               Offset: v.Offset,
+                               Length: v.Length,
+                       }
+                       logrus.Debug("special careful single GetBlobAt to make docker registry work: ", i, " ", f.blobInfo)
+                       blobRc, blobErrs, blobErr := f.chunkAccessor.GetBlobAt(f.ctx, f.blobInfo, []private.ImageSourceChunk{i})
+                       if blobErr != nil {
+                               errs <- blobErr
+                       }
+                       // channels transplant
+                       for r := range blobRc {
+                               rc <- r
+                       }
+                       for e := range blobErrs {
+                               errs <- e
+                       }
                }
-               newChunks = append(newChunks, i)
-       }
-       rc, errs, err := f.chunkAccessor.GetBlobAt(f.ctx, f.blobInfo, newChunks)
-       if _, ok := err.(private.BadPartialRequestError); ok {
-               err = chunked.ErrBadRequest{}
-       }
-       return rc, errs, err
-
+               close(rc)
+               close(errs)
+       }()
+       return rc, errs, nil
 }
 
 // PutBlobPartial attempts to create a blob using the data that is already present

Redo the experiment. I saw:

real    0m9.759s
user    0m12.078s
sys     0m8.066s

Yay, ~2x speedup.
