S3 Downloader potentially downloads corrupted object when it is split into multiple parts #4986
Comments
Hi @kevinjqiu, Thanks for reaching out. What you are describing is not a bug with the SDK, but rather a data intensive application concept called "read skew" and one solution for that is exactly what you described here:
The SDK team cannot cover this use case in our current implementation, mainly because adding an extra API call would come with additional cost, both in dollars and in performance, and would not accommodate the majority of customers. It also might not be backwards compatible. The solution is to implement this mechanism at the application level: you need to implement your own version-checking ("optimistic locking") mechanism to make sure you have the appropriate version before you run GetObject. GetObjectInput takes a VersionId:

latestObjectVersion := getLatestVersion("BUCKET", "TEST") // you need to implement this function.
goi := &s3.GetObjectInput{
    Bucket:    aws.String("BUCKET"),
    Key:       aws.String("TEST"),
    VersionId: aws.String(latestObjectVersion),
}

I'm not sure what the business case for your application is, but if the above approach proves too difficult, you might want to look into an event-driven approach, with SQS for example. I understand that this must be frustrating to read, but as it stands it's not really actionable by the SDK team, so I'm inclined to close this.

Thanks again,
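The version-pinning idea above can be sketched without the SDK. This is a minimal in-memory illustration, not SDK code: `memStore`, `LatestVersion`, `GetRange`, `downloadPinned`, and the `onRead` hook are all hypothetical names standing in for a versioned bucket and the ranged GETs a multipart download issues. The point it shows is that the version is resolved once and reused for every part, so a concurrent overwrite cannot leak into later parts.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("key or version not found")

// version is one immutable revision of an object (hypothetical type).
type version struct {
	id   string
	data []byte
}

// memStore is a hypothetical in-memory stand-in for a versioned bucket.
type memStore struct {
	versions map[string][]version
	onRead   func() // hook run after each ranged read, used to simulate a concurrent writer
}

func newMemStore() *memStore {
	return &memStore{versions: map[string][]version{}}
}

// Put appends a new version of key and returns its version ID.
func (s *memStore) Put(key string, data []byte) string {
	id := fmt.Sprintf("v%d", len(s.versions[key])+1)
	s.versions[key] = append(s.versions[key], version{id, append([]byte(nil), data...)})
	return id
}

// LatestVersion returns the newest version ID of key and that version's size.
func (s *memStore) LatestVersion(key string) (string, int, error) {
	vs := s.versions[key]
	if len(vs) == 0 {
		return "", 0, errNotFound
	}
	v := vs[len(vs)-1]
	return v.id, len(v.data), nil
}

// GetRange returns bytes [start, end) of one specific version of key.
func (s *memStore) GetRange(key, versionID string, start, end int) ([]byte, error) {
	for _, v := range s.versions[key] {
		if v.id != versionID {
			continue
		}
		if start < 0 || start > len(v.data) {
			return nil, errors.New("range out of bounds")
		}
		if end > len(v.data) {
			end = len(v.data)
		}
		out := append([]byte(nil), v.data[start:end]...)
		if s.onRead != nil {
			s.onRead()
		}
		return out, nil
	}
	return nil, errNotFound
}

// downloadPinned resolves the version once, then reads every part from that version.
func downloadPinned(s *memStore, key string, partSize int) ([]byte, string, error) {
	if partSize <= 0 {
		return nil, "", errors.New("partSize must be positive")
	}
	id, size, err := s.LatestVersion(key)
	if err != nil {
		return nil, "", err
	}
	buf := make([]byte, 0, size)
	for off := 0; off < size; off += partSize {
		part, err := s.GetRange(key, id, off, off+partSize)
		if err != nil {
			return nil, "", err
		}
		buf = append(buf, part...)
	}
	return buf, id, nil
}

func main() {
	s := newMemStore()
	s.Put("TEST", []byte("AAAAAAAA"))
	wrote := false
	s.onRead = func() { // overwrite the object right after the first part is read
		if !wrote {
			wrote = true
			s.Put("TEST", []byte("BBBBBBBB"))
		}
	}
	got, id, err := downloadPinned(s, "TEST", 4)
	fmt.Println(string(got), id, err)
}
```

With a real bucket, this approach assumes versioning is enabled; the lookup of the latest version is the extra API call whose cost is discussed above.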
@RanVaknin Thanks for the reply. I think at the very least, the risk of read skew should be called out in the documentation. From the user's perspective, we call
Hi @kevinjqiu Thanks for the follow up.
You are using the Downloader package, which is the multipart download utility for the SDK. As for documenting this, it is not an edge case unique to the SDK. Read skew can happen in many frameworks, and since you are designing a data-intensive application, you need to be aware of the various strategies to mitigate the inherent caveats and risks. In the same way that we don't document that you could run into issues with idempotent writes when interacting with an RDS table, we don't document general programming best practices. These are concepts that apply at an infrastructural level and are not unique to client-side tools like the SDK.

I appreciate your input and understand the concerns you've raised. Our goal is to keep the SDK documentation focused on its core functionality while providing clear guidance. If there are specific gaps or ambiguities you've noticed in relation to the SDK itself, feel free to open a separate documentation issue with actionable feedback.

Thanks again,
@RanVaknin To be frank, I find it wholly unacceptable that AWS's stance here is to do absolutely nothing to help its customers. Unless you explicitly happen to test this race condition, and get the timing right, you won't notice until you already have an issue in production. At the very least, the SDK MUST return an error if it detects that two chunks came from different versions during a multipart download.
Thank you for your candid feedback. We have re-evaluated our previous position on this issue and agree that the better customer experience is to address it. We implemented a change in the S3 transfer manager client for the Go v2 SDK that addresses this issue, and backported it to the Go v1 SDK.

Starting with version v1.55.7, Downloader uses the ETag from the initial GET response to confirm that all object parts are downloaded from the same version. If the object is modified during the multipart download and no VersionId is provided in the request, the download will fail. Alternatively, users can still multipart download a specific version of a large object by passing VersionId in the request input. Upgrading to the latest version will resolve your issue.

As a reminder, this SDK is in maintenance mode and we're only releasing critical changes, so we encourage you to migrate to v2 if you can, but we continue to monitor this repository for comments and issues.
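The kind of ETag consistency check described above can be illustrated with a short standalone sketch. This is not the SDK's actual implementation: the `part` type, `assemble`, and `errObjectChanged` are hypothetical names, and the sketch simply assumes each ranged response carries the ETag of the object version it was served from.

```go
package main

import (
	"errors"
	"fmt"
)

// part is a hypothetical stand-in for one ranged GET response.
type part struct {
	etag string
	data []byte
}

var errObjectChanged = errors.New("object changed during multipart download")

// assemble concatenates parts, failing if any part's ETag differs from the
// ETag of the first part (the "initial GET response").
func assemble(parts []part) ([]byte, error) {
	if len(parts) == 0 {
		return nil, nil
	}
	want := parts[0].etag
	var out []byte
	for i, p := range parts {
		if p.etag != want {
			return nil, fmt.Errorf("part %d: etag %q, want %q: %w", i+1, p.etag, want, errObjectChanged)
		}
		out = append(out, p.data...)
	}
	return out, nil
}

func main() {
	// Second part was served from a newer version of the object.
	_, err := assemble([]part{
		{"etag-1", []byte("AAAA")},
		{"etag-2", []byte("BBBB")},
	})
	fmt.Println(errors.Is(err, errObjectChanged))
}
```

Failing the whole download, rather than returning mixed bytes, lets the caller retry or pin a VersionId explicitly.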
Describe the bug
Expected Behavior
The latest version of the object is downloaded
Current Behavior
An error is encountered from time to time, e.g.,
or
With logging turned on, we observed that if the object is updated before a later part is downloaded, that part is fetched from a different version of the object, which corrupts the output.
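The behavior described above can be reproduced deterministically in a small simulation. This sketch does not use the SDK: `bucket`, `getRange`, `downloadLatest`, and the `betweenParts` hook are hypothetical names, and the model assumes every ranged read is served from whatever the latest object content is at that moment.

```go
package main

import "fmt"

// bucket is a hypothetical unversioned store holding only the latest content.
type bucket struct {
	latest []byte
}

// getRange returns bytes [start, end) of whatever the latest content is right now.
func (b *bucket) getRange(start, end int) []byte {
	if start > len(b.latest) {
		return nil
	}
	if end > len(b.latest) {
		end = len(b.latest)
	}
	return append([]byte(nil), b.latest[start:end]...)
}

// downloadLatest reads the object part by part without pinning a version.
// betweenParts runs after each part and simulates a concurrent writer.
func downloadLatest(b *bucket, partSize int, betweenParts func(part int)) []byte {
	size := len(b.latest) // total size as learned at the start of the download
	var out []byte
	for i, off := 0, 0; off < size; i, off = i+1, off+partSize {
		out = append(out, b.getRange(off, off+partSize)...)
		if betweenParts != nil {
			betweenParts(i)
		}
	}
	return out
}

func main() {
	b := &bucket{latest: []byte("AAAAAAAA")}
	got := downloadLatest(b, 4, func(part int) {
		if part == 0 {
			b.latest = []byte("BBBBBBBB") // producer overwrites the object mid-download
		}
	})
	fmt.Println(string(got)) // prints "AAAABBBB": two versions mixed in one output
}
```

With a gzipped payload, as in the reproduction, such a mix of two versions would typically surface as a decompression or checksum error rather than visibly wrong bytes.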
Reproduction Steps
Minimally reproducible example:
Producer
The producer is simply a script that continuously uploads a gzipped file (larger than 5 MB) to a bucket.
Consumer
Possible Solution
When GetObjectInput.versionId is not provided by the user (which means getting the latest object version), first send a request to figure out the latest version of the object, and then set the versionId in the subsequent downloadChunk method.
Additional Information/Context
No response
SDK version used
1.44
Environment details (Version of Go (go version)? OS name and version, etc.)
1.20