Add Retries to Sync Blocks #3975

Merged
merged 5 commits into from
Mar 30, 2021

Conversation

LeviHarrison
Contributor

What this PR does:

Adds retries to BucketStore.SyncBlocks with BucketStores.syncUsersBlocksWithRetries.

Although this fixes one part of #3802, I think that the bigger issue may be the initial sync (BucketStores.InitialSync), which also ends up calling BucketStore.SyncBlocks, but after going through BucketStore.InitialSync in Thanos. Maybe retries should be added there in some fashion.

func (u *BucketStores) InitialSync(ctx context.Context) error {
    level.Info(u.logger).Log("msg", "synchronizing TSDB blocks for all users")
    if err := u.syncUsersBlocks(ctx, func(ctx context.Context, s *store.BucketStore) error {
        return s.InitialSync(ctx)
    }); err != nil {

func (s *BucketStore) InitialSync(ctx context.Context) error {
    if err := s.SyncBlocks(ctx); err != nil {
        return errors.Wrap(err, "sync block")
    }

Which issue(s) this PR fixes:
Fixes #3802

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Levi Harrison <[email protected]>
Signed-off-by: Levi Harrison <[email protected]>
Comment on lines 170 to 171
lastErr = s.SyncBlocks(ctx)
if lastErr != nil {
Contributor

This returns the error on the first failure and never retries. A better version would be:

for retries.Ongoing() {
    err = s.SyncBlocks(ctx)
    if err == nil {
        return nil
    }
    // log error otherwise
    retries.Wait()
}
return retries.Err()
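
For readers without the Cortex code at hand, the loop shape suggested above can be sketched in isolation. This is a minimal sketch, not the code from the PR: `retrier` is a hypothetical stand-in for the `retries` value (only the method names `Ongoing`, `Wait`, and `Err` are taken from the comment), and `syncWithRetries` and `errIntermittent` are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errIntermittent is a hypothetical failure used only for this sketch.
var errIntermittent = errors.New("intermittent bucket error")

// retrier is a hypothetical stand-in for the `retries` value in the review
// comment. It is not the Cortex backoff type.
type retrier struct {
	attempts, max int
	delay         time.Duration
}

// Ongoing reports whether another attempt is still allowed.
func (r *retrier) Ongoing() bool { return r.attempts < r.max }

// Wait records the failed attempt and sleeps if another attempt will follow.
func (r *retrier) Wait() {
	r.attempts++
	if r.Ongoing() {
		time.Sleep(r.delay)
	}
}

// Err returns a generic error once all attempts are used up.
func (r *retrier) Err() error {
	if r.Ongoing() {
		return nil
	}
	return fmt.Errorf("terminated after %d attempts", r.attempts)
}

// syncWithRetries calls sync until it succeeds or the attempts run out.
func syncWithRetries(r *retrier, sync func() error) error {
	for r.Ongoing() {
		err := sync()
		if err == nil {
			return nil
		}
		fmt.Println("sync failed, will retry:", err)
		r.Wait()
	}
	return r.Err()
}

func main() {
	calls := 0
	// flaky fails twice, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errIntermittent
		}
		return nil
	}
	err := syncWithRetries(&retrier{max: 3, delay: time.Millisecond}, flaky)
	fmt.Println("calls:", calls, "err:", err)
}
```

The key difference from the reviewed lines is that a failure leads to `Wait()` and another pass through the loop rather than an immediate return.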

Comment on lines 155 to 156
return u.syncUsersBlocks(ctx, func(ctx context.Context, s *store.BucketStore) error {
return s.SyncBlocks(ctx)
return u.syncUsersBlocksWithRetries(ctx, s)
Contributor

I'm not sure about this design. We pass our own function to u.syncUsersBlocks(). I was expecting something like this:

return u.syncUsersBlocksWithRetries(ctx, func(ctx context.Context, s *store.BucketStore) error {
    return s.SyncBlocks(ctx)
})

And then having syncUsersBlocksWithRetries() calling syncUsersBlocks().

Am I missing anything?

Contributor Author

Nope. I'll get on it!
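
The design agreed on in this thread, where `syncUsersBlocksWithRetries()` takes the per-store function and delegates to `syncUsersBlocks()`, might look roughly like the sketch below. Everything here is an assumption made for illustration: `bucketStore` and `bucketStores` are hypothetical stand-ins for `store.BucketStore` and `BucketStores`, the stores are visited sequentially, and backoff waits are left out to keep the composition visible.

```go
package main

import (
	"context"
	"fmt"
)

// bucketStore is a hypothetical stand-in for store.BucketStore.
type bucketStore struct {
	user     string
	failures int // how many times SyncBlocks fails before it succeeds
}

// SyncBlocks fails `failures` times, then succeeds.
func (s *bucketStore) SyncBlocks(ctx context.Context) error {
	if s.failures > 0 {
		s.failures--
		return fmt.Errorf("sync blocks for %s: intermittent error", s.user)
	}
	return nil
}

// bucketStores is a hypothetical stand-in for BucketStores.
type bucketStores struct {
	stores     []*bucketStore
	maxRetries int
}

// syncFunc is the per-store operation passed in by the caller.
type syncFunc func(ctx context.Context, s *bucketStore) error

// syncUsersBlocks applies f to every store and stops at the first error.
func (u *bucketStores) syncUsersBlocks(ctx context.Context, f syncFunc) error {
	for _, s := range u.stores {
		if err := f(ctx, s); err != nil {
			return err
		}
	}
	return nil
}

// syncUsersBlocksWithRetries wraps syncUsersBlocks and repeats the whole
// pass until it succeeds or maxRetries passes have failed.
func (u *bucketStores) syncUsersBlocksWithRetries(ctx context.Context, f syncFunc) error {
	var lastErr error
	for attempt := 0; attempt < u.maxRetries; attempt++ {
		if lastErr = u.syncUsersBlocks(ctx, f); lastErr == nil {
			return nil
		}
	}
	return lastErr
}

// runDemo syncs two stores, the second of which fails `failures` times.
func runDemo(failures, maxRetries int) error {
	u := &bucketStores{
		stores:     []*bucketStore{{user: "user-1"}, {user: "user-2", failures: failures}},
		maxRetries: maxRetries,
	}
	return u.syncUsersBlocksWithRetries(context.Background(), func(ctx context.Context, s *bucketStore) error {
		return s.SyncBlocks(ctx)
	})
}

func main() {
	fmt.Println("two failures, three passes:  ", runDemo(2, 3))
	fmt.Println("three failures, three passes:", runDemo(3, 3))
}
```

Because the wrapper accepts any function with the same signature, a caller could in principle pass a different per-store operation through the same retry path without changing the wrapper.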

Contributor Author

LeviHarrison commented Mar 22, 2021

CI seems to be having some issues:

/usr/bin/docker pull quay.io/cortexproject/build-image:upgrade-build-image-debian-491e60715-WIP
  Error response from daemon: Head https://quay.io/v2/cortexproject/build-image/manifests/upgrade-build-image-debian-491e60715-WIP: received unexpected HTTP status: 502 Bad Gateway
  Warning: Docker pull failed with exit code 1, back off 9.622 seconds before retry.

This might have something to do with it: https://status.quay.io/incidents/vfs19hmq660h

Image pulls from quay.io are currently failing with the code "502 Bad Gateway".

Contributor

@pracucci pracucci left a comment

Thanks for addressing my previous comment! The new design looks way more clear. I just left a comment about the returned error and then we're good to go 🚀

for retries.Ongoing() {
    err := u.syncUsersBlocks(ctx, f)
    if err == nil {
        return err
Contributor

[nit] It's equivalent, but clearer:

Suggested change
-        return err
+        return nil

    retries.Wait()
}

return retries.Err()
Contributor

This would suppress the actual error. I would save the last error returned by u.syncUsersBlocks(ctx, f) and return it once all retries have expired (e.g., see joinMembersOnStartup() at pkg/ring/kv/memberlist/memberlist_client.go for reference).
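
To make the concern concrete, the sketch below contrasts a loop that returns only a generic "retries exhausted" error with one that keeps the last real failure. It is an illustration under assumptions, not the code from the PR or from memberlist_client.go: `retryGeneric`, `retryLastErr`, `errBucketUnavailable`, and `errRetriesExhausted` are all hypothetical names, and waits between attempts are omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// errBucketUnavailable is a hypothetical failure cause used for illustration.
var errBucketUnavailable = errors.New("bucket unavailable")

// errRetriesExhausted plays the role of the generic error from retries.Err().
var errRetriesExhausted = errors.New("terminated after max retries")

// retryGeneric discards the cause and reports only that retries ran out.
func retryGeneric(maxRetries int, f func() error) error {
	for i := 0; i < maxRetries; i++ {
		if err := f(); err == nil {
			return nil
		}
	}
	return errRetriesExhausted
}

// retryLastErr remembers the most recent failure and returns it once all
// attempts are used up.
func retryLastErr(maxRetries int, f func() error) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		if lastErr = f(); lastErr == nil {
			return nil
		}
	}
	if lastErr == nil {
		// No attempt ran (maxRetries <= 0), so fall back to the generic error.
		return errRetriesExhausted
	}
	return lastErr
}

func main() {
	alwaysFail := func() error { return errBucketUnavailable }
	fmt.Println("generic: ", retryGeneric(3, alwaysFail))
	fmt.Println("last err:", retryLastErr(3, alwaysFail))
}
```

With the second variant, whoever reads the log or inspects the returned error sees why the sync failed, not just that it gave up.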

@LeviHarrison LeviHarrison requested a review from pracucci March 25, 2021 14:36
Contributor Author

LeviHarrison commented Mar 25, 2021

I think this is probably a flaky test, given that I changed almost nothing in the latest commit.

alertmanager_test.go:337: 
         	Error Trace:	alertmanager_test.go:337
         	Error:      	        Received unexpected error:
         	            	        metric not found
                                        ...
                Test:       	        TestAlertmanagerSharding/legacy_alertstore

Contributor

@pracucci pracucci left a comment

LGTM, thanks!

Comment on lines +162 to +164
MinBackoff: 100 * time.Millisecond,
MaxBackoff: 10 * time.Second,
MaxRetries: 3,
Contributor

Nit: Given that MaxRetries is 3, we can never reach the max backoff of 10 seconds. :)

Contributor Author

Ah ok I originally stole that from here and just modified it along the way. Is a 1-second MaxBackoff ok?

Contributor

It's ok to keep it as is. If you choose to use 1s as MaxBackoff, that's fine too. I think it's more important to do retry here, than how much time it waits.
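
The nit about MaxBackoff can be checked with a little arithmetic. The sketch below assumes the delay simply doubles from MinBackoff on each retry and is capped at MaxBackoff, with no jitter; the real backoff helper may behave differently in detail, so treat the numbers as an upper-bound illustration. `backoffDelays` and `doublingsUntilCap` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// Values from the reviewed lines.
const (
	MinBackoff = 100 * time.Millisecond
	MaxBackoff = 10 * time.Second
	MaxRetries = 3
)

// backoffDelays returns the delay before each retry, assuming plain doubling
// from minBackoff capped at maxBackoff (no jitter).
func backoffDelays(minBackoff, maxBackoff time.Duration, maxRetries int) []time.Duration {
	delays := make([]time.Duration, 0, maxRetries)
	d := minBackoff
	for i := 0; i < maxRetries; i++ {
		if d > maxBackoff {
			d = maxBackoff
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// doublingsUntilCap counts how many doublings it takes for the delay to
// reach or exceed maxBackoff.
func doublingsUntilCap(minBackoff, maxBackoff time.Duration) int {
	n := 0
	for d := minBackoff; d < maxBackoff; d *= 2 {
		n++
	}
	return n
}

func main() {
	fmt.Println(backoffDelays(MinBackoff, MaxBackoff, MaxRetries)) // [100ms 200ms 400ms]
	fmt.Println(doublingsUntilCap(MinBackoff, MaxBackoff))         // 7
}
```

Under this assumption the longest wait with three retries is 400ms, and the 10-second cap would only come into play after seven doublings, which is consistent with the reviewer's remark that the cap is unreachable here.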

Signed-off-by: Levi Harrison <[email protected]>
@LeviHarrison LeviHarrison requested a review from pstibrany March 29, 2021 20:54
Contributor

@pstibrany pstibrany left a comment

Thank you!

@pracucci pracucci merged commit 5ec6f50 into cortexproject:master Mar 30, 2021
@LeviHarrison LeviHarrison deleted the sync-blocks-retry branch March 30, 2021 10:01
Development

Successfully merging this pull request may close these issues.

store-gateways crash on startup if they encounter intermittent error
3 participants