Description
- AWS S3 (and also minio) doesn't support multi-part HTTP range queries
- The PutBlobPartial function (in storage/storage_dest.go in containers/image) relies on HTTP multi-part range queries to obtain some metadata from the end of the blob. It performs a multi-part range query with 2 ranges (perhaps to get the tar-split data and the chunk manifest data):
func (s *dockerImageSource) GetBlobAt(ctx context.Context, info types.BlobInfo, chunks []private.ImageSourceChunk) (chan io.ReadCloser, chan error, error) {
    headers := make(map[string][]string)
    rangeVals := make([]string, 0, len(chunks))
    lastFound := false
    for _, c := range chunks {
        if lastFound {
            return nil, nil, fmt.Errorf("internal error: another chunk requested after an until-EOF chunk")
        }
        // If the Length is set to -1, then request anything after the specified offset.
        if c.Length == math.MaxUint64 {
            lastFound = true
            rangeVals = append(rangeVals, fmt.Sprintf("%d-", c.Offset))
        } else {
            rangeVals = append(rangeVals, fmt.Sprintf("%d-%d", c.Offset, c.Offset+c.Length-1))
        }
    }
    headers["Range"] = []string{fmt.Sprintf("bytes=%s", strings.Join(rangeVals, ","))}
- a docker registry backed by s3/minio returns a 200 response instead of 206
- this causes the library to download the full layer blob (silently, without any warning 👌), just to discard 99.9% of it in order to obtain the metadata
switch res.StatusCode {
case http.StatusOK:
    // if the server replied with a 200 status code, convert the full body response to a series of
    // streams as it would have been done with 206.
    streams := make(chan io.ReadCloser)
    errs := make(chan error)
    go splitHTTP200ResponseToPartial(streams, errs, res.Body, chunks)
- then it calculates the missing chunks and downloads a large part of the blob again
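To make the cost concrete, here is a rough sketch of what converting a 200 full-body response into per-chunk data has to do (my own illustration of the idea, not the actual splitHTTP200ResponseToPartial implementation): read the body sequentially, discard everything between the requested ranges, and keep only the requested bytes. The whole blob still crosses the wire:

package main

import (
    "fmt"
    "io"
    "strings"
)

type chunk struct {
    Offset uint64
    Length uint64
}

// splitFullBody yields the requested ranges from a full-body reader,
// discarding the gaps in between. Chunks must be sorted and non-overlapping.
func splitFullBody(body io.Reader, chunks []chunk) ([][]byte, error) {
    var out [][]byte
    var pos uint64
    for _, c := range chunks {
        // Discard the bytes between the current position and the chunk offset.
        if _, err := io.CopyN(io.Discard, body, int64(c.Offset-pos)); err != nil {
            return nil, err
        }
        buf := make([]byte, c.Length)
        if _, err := io.ReadFull(body, buf); err != nil {
            return nil, err
        }
        out = append(out, buf)
        pos = c.Offset + c.Length
    }
    return out, nil
}

func main() {
    blob := strings.NewReader("0123456789abcdefghij") // stand-in for a layer blob
    parts, err := splitFullBody(blob, []chunk{{Offset: 2, Length: 3}, {Offset: 15, Length: 4}})
    if err != nil {
        panic(err)
    }
    fmt.Printf("%q\n", parts) // ["234" "fghi"]
}

In the real code the chunks are presumably handed out as streams over a channel rather than collected into byte slices, but the discard-then-read pattern is the same idea.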
Possibly related: containers/storage#1928
To reproduce, you can use the following docker compose setup.

docker-compose.yml:
networks:
  registry_net:
    driver: bridge

services:
  minio:
    image: minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - ./minio-data:/data
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    networks:
      registry_net:
        aliases:
          - minio.localhost

  registry:
    image: registry:2
    container_name: minio-registry
    ports:
      - "5000:5000"
    volumes:
      - ./registry-config.yml:/etc/docker/registry/config.yml
    command: serve /etc/docker/registry/config.yml
    depends_on:
      - minio
    networks:
      - registry_net
registry-config.yml:
version: 0.1
storage:
  s3:
    accesskey: minioadmin
    secretkey: minioadmin
    region: us-east-1
    regionendpoint: http://minio.localhost:9000
    bucket: docker-registry
    secure: false
    rootdirectory: /
  delete:
    enabled: true
http:
  addr: :5000
Create the directory ./minio-data and run docker compose up, then use the minio client mc to create the bucket docker-registry inside minio (after setting an mc alias reg):

mc mb reg/docker-registry
Pick any image and pull it to local storage with buildah. Then push it to the registry while compressing with zstd:chunked:

buildah push --tls-verify=false --compression-format zstd:chunked <image-id> docker://localhost:5000/img
Then remove it from local storage, since we're going to pull it next:

buildah rmi <image-id>
I was testing this with limited bandwidth using tc, which makes the behavior visible to the naked eye: when pulling the image, buildah gets "stuck" at 0% for a while at the end, but the download is happening behind the scenes (e.g. visible in slurm); that's most likely the part where it's downloading the manifests (actually downloading the whole blob 🥲).

Here's the script to limit bandwidth; run it with sudo in the same directory where docker-compose.yml is:
#!/bin/bash
# Need to create and start virtual interface first:
# sudo ip link add name ifb0 type ifb
# sudo ip link set dev ifb0 up
#
# virtual interface used to turn $interface ingress traffic into egress traffic for ifb0
interface="$(docker compose config --format json | jq .networks.registry_net.name | xargs -I {} docker network ls -q --filter name={} --format br-\{\{.ID\}\})"
echo "Interface used by docker compose: $interface"
download_limit=100mbps
tc qdisc del dev "$interface" ingress 2>/dev/null
tc qdisc del dev "ifb0" root 2>/dev/null
# It's not possible to shape ingress, only police it
# So instead we reroute all ingress traffic into virtual interface "ifb0" on which it becomes egress
tc qdisc add dev "$interface" ingress
tc filter add dev "$interface" parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:1 htb rate "$download_limit"
tc class add dev ifb0 parent 1:1 classid 1:10 htb rate "$download_limit"
echo To remove: tc qdisc del dev "$interface" ingress
echo To remove: tc qdisc del dev "ifb0" root
(Note that the script requires the ifb0 virtual interface to exist; see the comment inside the script.)

For my tests I was limiting the bandwidth to 100mbps and using a ~700MB image (compressed size in the registry):
$ skopeo inspect --tls-verify=false docker://localhost:5000/img --raw | jq -r '([.layers.[].size] | add)/1024/1024'
691.4367246627808
Adjust the limits appropriately based on the size of the image you're using for testing.

Now pull (also remember to set enable_partial_images to true):
time ~/go/bin/buildah --log-level trace --tls-verify=false pull docker://localhost:5000/img
With vanilla buildah this took for me:
real 0m17.138s
user 0m12.764s
sys 0m9.426s
Now we can apply a workaround which translates multi-part range requests into a sequence of single-part range requests (courtesy of @psarna):
diff --git a/vendor/github.com/containers/image/v5/storage/storage_dest.go b/vendor/github.com/containers/image/v5/storage/storage_dest.go
index 0af4523ff..360273633 100644
--- a/vendor/github.com/containers/image/v5/storage/storage_dest.go
+++ b/vendor/github.com/containers/image/v5/storage/storage_dest.go
@@ -323,20 +323,32 @@ type zstdFetcher struct {
 // GetBlobAt converts from chunked.GetBlobAt to BlobChunkAccessor.GetBlobAt.
 func (f *zstdFetcher) GetBlobAt(chunks []chunked.ImageSourceChunk) (chan io.ReadCloser, chan error, error) {
-    newChunks := make([]private.ImageSourceChunk, 0, len(chunks))
-    for _, v := range chunks {
-        i := private.ImageSourceChunk{
-            Offset: v.Offset,
-            Length: v.Length,
+    errs := make(chan error)
+    rc := make(chan io.ReadCloser)
+    logrus.Debug("chunks: ", chunks)
+    go func() {
+        for _, v := range chunks {
+            i := private.ImageSourceChunk{
+                Offset: v.Offset,
+                Length: v.Length,
+            }
+            logrus.Debug("special careful single GetBlobAt to make docker registry work: ", i, " ", f.blobInfo)
+            blobRc, blobErrs, blobErr := f.chunkAccessor.GetBlobAt(f.ctx, f.blobInfo, []private.ImageSourceChunk{i})
+            if blobErr != nil {
+                errs <- blobErr
+            }
+            // channels transplant
+            for r := range blobRc {
+                rc <- r
+            }
+            for e := range blobErrs {
+                errs <- e
+            }
         }
-        newChunks = append(newChunks, i)
-    }
-    rc, errs, err := f.chunkAccessor.GetBlobAt(f.ctx, f.blobInfo, newChunks)
-    if _, ok := err.(private.BadPartialRequestError); ok {
-        err = chunked.ErrBadRequest{}
-    }
-    return rc, errs, err
-
+        close(rc)
+        close(errs)
+    }()
+    return rc, errs, nil
 }

 // PutBlobPartial attempts to create a blob using the data that is already present
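For context on why the "channels transplant" above is safe: as far as I can tell, the chunked consumer just expects one ReadCloser per requested chunk, in request order, followed by a drained error channel, so forwarding each single-range response into the shared channel pair preserves that contract (at the cost of fully serializing the requests). A minimal sketch of that consumption pattern (illustrative only, not the actual containers/storage consumer):

package main

import (
    "fmt"
    "io"
    "strings"
)

// consumeChunks reads one stream per requested chunk from the first channel,
// then drains the error channel, mirroring the GetBlobAt contract.
func consumeChunks(streams chan io.ReadCloser, errs chan error) ([][]byte, error) {
    var parts [][]byte
    for stream := range streams {
        data, err := io.ReadAll(stream)
        stream.Close()
        if err != nil {
            return nil, err
        }
        parts = append(parts, data)
    }
    for err := range errs {
        if err != nil {
            return nil, err
        }
    }
    return parts, nil
}

func main() {
    streams := make(chan io.ReadCloser)
    errs := make(chan error)
    go func() {
        // Pretend each single-range request produced its own stream.
        streams <- io.NopCloser(strings.NewReader("tar-split data"))
        streams <- io.NopCloser(strings.NewReader("chunk manifest"))
        close(streams)
        close(errs)
    }()
    parts, err := consumeChunks(streams, errs)
    fmt.Println(len(parts), err) // 2 <nil>
}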
Redo the experiment. I saw:
real 0m9.759s
user 0m12.078s
sys 0m8.066s
Yay, ~2x speedup.