store-gateways crash on startup if they encounter intermittent error #3802

Closed
gouthamve opened this issue Feb 9, 2021 · 3 comments · Fixed by #3975
Comments

@gouthamve
Contributor

Describe the bug
store-gateways do a LIST operation to get all the users and then load the bucket index for each user. These operations might hit a temporary timeout, which currently causes the store-gateway to crash and restart. We should be resilient to transient issues like this.

level=warn ts=2021-02-09T10:53:50.301491118Z caller=bucket_stores.go:128 msg="failed to synchronize TSDB blocks" err="failed to synchronize TSDB blocks for user 198923: sync block: read bucket index: read bucket index: Get \"https://storage.googleapis.com/<bucket>/198923/bucket-index.json.gz\": net/http: TLS handshake timeout"
level=info ts=2021-02-09T10:53:50.301685441Z caller=basic_lifecycler.go:197 msg="ring lifecycler is shutting down" ring=store-gateway
level=error ts=2021-02-09T10:53:50.301711626Z caller=client.go:233 msg="error getting path" key=store-gateway err="Get \"http://consul.cortex-blocks-prod.svc.cluster.local:8500/v1/kv/store-gateway?index=73476708&stale=&wait=10000ms\": context canceled"
level=info ts=2021-02-09T10:53:50.310302637Z caller=basic_lifecycler.go:363 msg="unregistering instance from ring" ring=store-gateway
level=info ts=2021-02-09T10:53:50.320575066Z caller=basic_lifecycler.go:234 msg="instance removed from the ring" ring=store-gateway
level=error ts=2021-02-09T10:53:50.320726791Z caller=cortex.go:402 msg="module failed" module=store-gateway err="invalid service state: Failed, expected: Running, failure: initial blocks synchronization: failed to synchronize TSDB blocks for user 198923: sync block: read bucket index: read bucket index: Get \"https://storage.googleapis.com/<bucket>/198923/bucket-index.json.gz\": net/http: TLS handshake timeout"
level=info ts=2021-02-09T10:53:50.320787529Z caller=module_service.go:90 msg="module stopped" module=runtime-config
level=info ts=2021-02-09T10:53:50.320854193Z caller=module_service.go:90 msg="module stopped" module=memberlist-kv
level=info ts=2021-02-09T10:53:50.321051245Z caller=server_service.go:50 msg="server stopped"
level=info ts=2021-02-09T10:53:50.321073044Z caller=module_service.go:90 msg="module stopped" module=server
level=info ts=2021-02-09T10:53:50.321088676Z caller=cortex.go:391 msg="Cortex stopped"
level=error ts=2021-02-09T10:53:50.321159052Z caller=log.go:27 msg="error running cortex" err="failed services\ngithub.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run...."

Storage Engine

  • Blocks
  • Chunks
@pracucci
Contributor

An easy improvement would be to add retries to the per-user blocks synchronization. We did something similar in the compactor:

func (c *Compactor) compactUserWithRetries(ctx context.Context, userID string) error {
	var lastErr error

	retries := util.NewBackoff(ctx, util.BackoffConfig{
		MinBackoff: c.compactorCfg.retryMinBackoff,
		MaxBackoff: c.compactorCfg.retryMaxBackoff,
		MaxRetries: c.compactorCfg.CompactionRetries,
	})

One option may be:

  1. Add BucketStores.syncUserBlocksWithRetries(), similar to the compactor's compactUserWithRetries()
  2. Call syncUserBlocksWithRetries() like this:
func (u *BucketStores) SyncBlocks(ctx context.Context) error {
	return u.syncUsersBlocks(ctx, func(ctx context.Context, s *store.BucketStore) error {
		return s.syncUserBlocksWithRetries(ctx, s)
	})
}

@LeviHarrison
Contributor

@pracucci Hi! I'd like to take a shot at implementing this. I'm assuming I should just add those fields to the tsdb.BucketStoreConfig, or should they go somewhere else?

@pracucci
Contributor

pracucci commented Mar 18, 2021

> @pracucci Hi! I'd like to take a shot at implementing this. I'm assuming I should just add those fields to the tsdb.BucketStoreConfig, or should they go somewhere else?

Thanks @LeviHarrison. I would start with a hardcoded backoff config like we did in the compactor (or other places too). 3 retries should be fine.
