Skip to content

S3 Downloader potentially downloads corrupted object when the it is split into multiple parts #4986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kevinjqiu opened this issue Sep 12, 2023 · 7 comments · Fixed by #5345
Labels
bug This issue is a bug.

Comments

@kevinjqiu
Copy link

Describe the bug

  • We have a system that's constantly updating an object stored in an S3 bucket
  • The object is downloaded by multiple clients concurrently
  • In our particular case, the stored object is gzipped but the issue applies to all sorts of files
  • When the object grows to over 5MB (which causes the downloader to download the object in multiple parts), we encountered an issue where the second part is downloaded from a different version of the object, therefore corrupting the download

Expected Behavior

The latest version of the object is downloaded

Current Behavior

An error is encountered from time to time, e.g.,

2023/09/12 14:54:39
panic: flate: corrupt input before offset 1089400

goroutine 1 [running]:
main.fetch(0x14000146640)
        /Users/kevinqiu/src/tmp/s3bugrepro/main.go:36 +0x2a0
main.main()
        /Users/kevinqiu/src/tmp/s3bugrepro/main.go:57 +0xd0
exit status 2

or

panic: gzip: invalid checksum

With logging turned on, it's observed that when a later part is being downloaded and when the object is updated before that, the later chunk from a different version is downloaded and therefore corrupting the output.

Reproduction Steps

Minimally reproducible example:

Producer

The producer is simply a script that uploads a gzipped file (greater than 5MB) to a bucket constantly

#! /bin/bash
while true; do
    files="0 1 2 3"
    for i in $files; do
        aws s3 cp $i s3://$BUCKET/TEST
        sleep 1
    done
done

Consumer

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func fetch(downloader *s3manager.Downloader) {
	buff := &aws.WriteAtBuffer{}
	goi := &s3.GetObjectInput{
		Bucket: aws.String("BUCKET"),  // replace with the real bucket name
		Key:    aws.String("TEST"),
	}
	_, err := downloader.Download(buff, goi)
	if err != nil {
		panic(err)
	}

	b := buff.Bytes()
	br := bytes.NewReader(b)
	gzr, err := gzip.NewReader(br)
	if err != nil {
		panic(err)
	}

	uzb, err := io.ReadAll(gzr)
	if err != nil && err != io.EOF {
		panic(err)
	}

	fmt.Printf("Size=%v\n", len(uzb))
}

func main() {
	s, err := session.NewSession(&aws.Config{
		Region:   aws.String("us-east-1"),
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody),
	})

	if err != nil {
		panic(err)
	}

	downloader := s3manager.NewDownloader(s, func(downloader *s3manager.Downloader) {
		downloader.PartSize = 1024 * 512  // set to a small chunk size so the problem can be reproduced sooner
	})

	for {
		fetch(downloader)
		time.Sleep(2 * time.Second)
	}
}

Possible Solution

When GetObjectInput.versionId is not provided by the user (which means getting the latest object version), send a request to first figure out the latest version of the object, and then set the versionId in the subsequent downloadChunk method.

Additional Information/Context

No response

SDK version used

1.44

Environment details (Version of Go (go version)? OS name and version, etc.)

1.20

@kevinjqiu kevinjqiu added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 12, 2023
@kevinjqiu kevinjqiu changed the title S3 Downloader potentially downloads corrupted object when the object is downloaded with multiple parts S3 Downloader potentially downloads corrupted object when the object is split into multiple parts Sep 12, 2023
@kevinjqiu kevinjqiu changed the title S3 Downloader potentially downloads corrupted object when the object is split into multiple parts S3 Downloader potentially downloads corrupted object when the it is split into multiple parts Sep 12, 2023
@RanVaknin RanVaknin self-assigned this Sep 25, 2023
@RanVaknin RanVaknin removed the needs-triage This issue or PR still needs to be triaged. label Sep 29, 2023
@RanVaknin
Copy link
Contributor

Hi @kevinjqiu,

Thanks for reaching out. What you are describing is not a bug with the SDK, but rather a data intensive application concept called "read skew" and one solution for that is exactly what you described here:

When GetObjectInput.versionId is not provided by the user (which means getting the latest object version), send a request to first figure out the latest version of the object, and then set the versionId in the subsequent downloadChunk method.

The SDK team cannot cover this use case in our current implementation mainly because adding an extra API call would come with an additional cost both in terms of $ value, and performance hit, and would not accomodate the majority of customers. Not to mention it might not be backwards compatible.

The solution to this is implementing this mechanism on the application level. You need to implement you own version checking ("optimistic locking") mechanism to make sure you have the appropriate version before you run GetObject. The GetObjectInput interface takes in a VersionId that makes this easier:

latestObjectVersion := getLatestVersion("BUCKET", "TEST") // you need to implement this function.

goi := &s3.GetObjectInput{
    Bucket:    aws.String("BUCKET"),
    Key:       aws.String("TEST"),
    VersionId: aws.String(latestObjectVersion),
}

I'm not sure what is the business case for your application, but if the above approach proves to be too difficult, you might want to look into event driven approach with SQS for example.
instead of consumers independently polling S3 for the latest version of the object, they would be notified when a new version is available and then proceed to download it.

I understand that this must be frustrating to read, but as it is, its not really actionable by the SDK team so Im inclined to close this.
If you have any additional questions, please consider opening a discussion in the discussion tab on our repo.

Thanks again,
Ran~

@RanVaknin RanVaknin closed this as not planned Won't fix, can't repro, duplicate, stale Sep 29, 2023
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@kevinjqiu
Copy link
Author

@RanVaknin Thanks for the reply. I think at the very least, the risk of read skew should be called out in the documentation.

From the user's perspective, we call downloader.Download() to get the latest version of the object. We don't really know (or should we know) that the object is split and stored in multiple chunks and there's a possibility where different version of the chunks are downloaded if the underlying object has been updated in the process. The API should abstract this away but I understand the concerns you laid out, so I think at least the documentation should be updated to let the user know the risk exists and consider the mitigation on the client side. In our case, we found out the hard way.

@RanVaknin
Copy link
Contributor

RanVaknin commented Sep 29, 2023

Hi @kevinjqiu

Thanks for the follow up.

We don't really know (or should we know) that the object is split and stored in multiple chunks

You are using the Downloader package which is the multipart download utility for the SDK.
If you do not need to download the object in parts you can use S3.GetObject directly.

In terms of the documenting this, this is not an edge case unique to the SDK. Read Skews can happen in many many frameworks, and since you are designing a data intensive application, you need to be aware of the various strategies to mitigate their inherent caveats and risks. The same way we don't document that you could run into issues with idempotent writes when interacting with an RDS table, we don't document general programming best practices. These are concepts that apply at an infrastructural level and are not unique to client-side tools like the SDK

I appreciate your input and understand the concerns you've raised. Our goal is to keep the SDK documentation focused on its core functionality while providing clear guidance. If there are specific gaps or ambiguities you've noticed in relation to the SDK itself, feel free to open a separate documentation issue with actionable feedback.

Thanks again,
Ran~

@rittneje
Copy link
Contributor

rittneje commented Oct 7, 2023

@RanVaknin To be frank, I find it wholly unacceptable that AWS's stance here is to do absolutely nothing to help its customers. Unless you explicitly happen to test this race condition, and get the timing right, you won't notice until you already have an issue in production.

At the very least, the SDK MUST return an error if it detects that two chunks came from different versions during a multipart download.

Copy link

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@bkaws
Copy link

bkaws commented Apr 24, 2025

Thank you for your candid feedback. We have re-evaluated our previous position on this issue and agree that the better customer experience is to address it. We implemented a change in the S3 transfer manager client for the Go v2 SDK that addresses this issue, which we backported to the Go v1 SDK.

Starting with version v1.55.7, Downloader will now perform an ETag check from the initial GET response to confirm object parts are downloaded from the same version. If the object is modified during the multi-part download and no VersionId is provided in the request, the download will fail. Alternatively, users can still multipart download a specific version of a large object by passing VersionId in the request input. Upgrading to the latest version will resolve your issue.

As a reminder, this SDK is in maintenance mode and we're only releasing critical changes, so we encourage you to migrate to v2 if you can, but we continue to monitor this repository for comments and issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants