Add Retries to Sync Blocks #3975
Conversation
Signed-off-by: Levi Harrison <[email protected]>
pkg/storegateway/bucket_stores.go
Outdated
```go
lastErr = s.SyncBlocks(ctx)
if lastErr != nil {
```
This will return an error on the first failure and will not retry. A better version would be:

```go
for retries.Ongoing() {
	err = s.SyncBlocks(ctx)
	if err == nil {
		return nil
	}
	// log error otherwise
	retries.Wait()
}
return retries.Err()
```
pkg/storegateway/bucket_stores.go
Outdated
```diff
-	return u.syncUsersBlocks(ctx, func(ctx context.Context, s *store.BucketStore) error {
-		return s.SyncBlocks(ctx)
+	return u.syncUsersBlocksWithRetries(ctx, s)
```
I'm not sure about this design. We pass our own function to `u.syncUsersBlocks()`. I was expecting something like this:

```go
return u.syncUsersBlocksWithRetries(ctx, func(ctx context.Context, s *store.BucketStore) error {
	return s.SyncBlocks(ctx)
})
```

And then having `syncUsersBlocksWithRetries()` call `syncUsersBlocks()`.
Am I missing anything?
Nope. I'll get on it!
Signed-off-by: Levi Harrison <[email protected]>
CI seems to be having some issues:

```
/usr/bin/docker pull quay.io/cortexproject/build-image:upgrade-build-image-debian-491e60715-WIP
Error response from daemon: Head https://quay.io/v2/cortexproject/build-image/manifests/upgrade-build-image-debian-491e60715-WIP: received unexpected HTTP status: 502 Bad Gateway
Warning: Docker pull failed with exit code 1, back off 9.622 seconds before retry.
```

This might have something to do with it: https://status.quay.io/incidents/vfs19hmq660h
Thanks for addressing my previous comment! The new design looks way more clear. I just left a comment about the returned error and then we're good to go 🚀
pkg/storegateway/bucket_stores.go
Outdated
```go
for retries.Ongoing() {
	err := u.syncUsersBlocks(ctx, f)
	if err == nil {
		return err
```
[nit] it's the same but clearer:

```diff
-		return err
+		return nil
```
pkg/storegateway/bucket_stores.go
Outdated
```go
	retries.Wait()
}

return retries.Err()
```
This would suppress the actual error. I would save the last error returned by `u.syncUsersBlocks(ctx, f)` and return it once all retries expired (eg. see `joinMembersOnStartup()` at `pkg/ring/kv/memberlist/memberlist_client.go` for reference).
Signed-off-by: Levi Harrison <[email protected]>
I think this is probably a flaky test, given that I changed almost nothing in the latest commit.
LGTM, thanks!
```go
MinBackoff: 100 * time.Millisecond,
MaxBackoff: 10 * time.Second,
MaxRetries: 3,
```
Nit: Given that `MaxRetries` is 3, we can never reach the max backoff of 10 seconds. :)
Ah ok, I originally stole that from here and just modified it along the way. Is a 1-second `MaxBackoff` ok?
It's ok to keep it as is. If you choose to use 1s as MaxBackoff, that's fine too. I think it's more important to do retry here, than how much time it waits.
Signed-off-by: Levi Harrison <[email protected]>
Thank you!
What this PR does:
Adds retries to `BucketStore.SyncBlocks` with `BucketStores.syncUsersBlocksWithRetries`.

Although this fixes one part of #3802, I think that the bigger issue may be the initial sync (`BucketStores.InitialSync`), which also ends up calling `BucketStore.SyncBlocks`, but after going through `BucketStore.InitialSync` in Thanos. Maybe retries should be added there in some fashion.

cortex/pkg/storegateway/bucket_stores.go (lines 138 to 143 in c28d326)
cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go (lines 463 to 466 in c28d326)
Which issue(s) this PR fixes:
Fixes #3802
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`