Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "wait ring stability" to store-gateway and fix cold start issue #4271

Merged

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Jun 9, 2021

What this PR does:
The store-gateway suffers a cold start issue which this PR should fix.

Let's say we have a disruption across all store-gateways and we restart all of them. At startup, each store-gateway is in JOINING state in the ring and begin synching blocks. Due to how the BlocksSync ring operation works, if all store-gateways are in JOINING state (and they are during a cold start), each store-gateway replica syncs all the blocks, not just the blocks belonging to their own shard because we extend the replication set for instances in JOINING state.

This looks an apparently easy problem to solve, because we could just not extend the replication set for JOINING state but then this open to another issue. Let's say we have a pool of running store-gateways and we scale up. If the JOINING state doesn't extend the replication set, we can end up in a situation where the previous store-gateways holding a block all offload the block while new replicas are still loading it (JOINING state) so no store-gateway has that block and queries fail.

This PR tries to solve this problem changing the logic as follow:

  • Load a block when it belongs to store-gateway shard
  • Unload a block when it doesn't belong to store-gateway shard anymore but at least 1 owner is ACTIVE in the ring (otherwise wait before offloading)
  • If checking the ring fails, the store-gateway keeps the previously loaded blocks instead of offloading them

This logic change, in conjunction with the "wait ring stability" at startup, should solve the cold start issue. I've added unit tests to show it (they fail in master) and I've also manually tested in a dev cluster (my tests show working as expected).

Which issue(s) this PR fixes:
Fixes #2827 #3570

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci marked this pull request as draft June 9, 2021 16:56
}
}

// Filter implements block.MetadataFilter.
// This function is NOT safe for use by multiple goroutines concurrently.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not used concurrently. Not even Thanos filters are thread-safe. I've added the comment just to make it clear.

Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo a changelog entry

@@ -30,6 +33,13 @@ func (s *bucketStoreSeriesServer) Send(r *storepb.SeriesResponse) error {
s.Warnings = append(s.Warnings, errors.New(r.GetWarning()))
}

if rawHints := r.GetHints(); rawHints != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's just used by tests. I've added the ability to read hints too.

@bboreham
Copy link
Contributor

Does this help #3570 ?

@pracucci
Copy link
Contributor Author

Does this help #3570 ?

Yes, I think so, because it doesn't expand the replication set anymore for instances in JOINING state. However the replication set would be expanded anyway for unhealthy store-gateways, but if they restart quick enough (before they're detected as unhealthy) then no extra pressure should be added to other (healthy) store-gateways.

@pracucci pracucci force-pushed the improve-store-gateway-sync-logic branch from ee59803 to 883668a Compare June 10, 2021 16:06
@pracucci pracucci marked this pull request as ready for review June 11, 2021 08:32
pracucci and others added 14 commits June 11, 2021 10:32
Signed-off-by: Marco Pracucci <[email protected]>
…oreGateway_InitialSyncWithWaitRingStability

Signed-off-by: Marco Pracucci <[email protected]>
…itialised and logged on a per test case basis, cause tests ordering is unstable

Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Jacob Lisi <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci force-pushed the improve-store-gateway-sync-logic branch from 84b59ef to 05c4265 Compare June 11, 2021 08:32
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Change makes sense to me, and PR looks great!

Copy link
Contributor

@stevesg stevesg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert on the logic in this area but the code all looks good to me.

@bboreham
Copy link
Contributor

Is the "Improved filterBlocksByRingSharding() logic" commit covered by the PR description? If not please add it.
Consider moving to a separate PR, because I can't see the relationship, and once all the commits are squashed it will be harder to look.

@pracucci
Copy link
Contributor Author

Is the "Improved filterBlocksByRingSharding() logic" commit covered by the PR description?

Good point. I've added this one to the PR description:

  • If checking the ring fails, the store-gateway keeps the previously loaded blocks instead of offloading them

@pracucci pracucci merged commit 868898a into cortexproject:master Jun 14, 2021
@@ -81,6 +81,14 @@ The store-gateway replication optionally supports [zone-awareness](../guides/zon
2. Enable blocks zone-aware replication via the `-store-gateway.sharding-ring.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to store-gateways, queriers and rulers.
3. Rollout store-gateways, queriers and rulers to apply the new configuration

### Waiting for stable ring at startup

In the event of a cluster cold start or scale up of 2+ store-gateway instances at the same time we may end up in a situation where each new store-gateway instance starts at a slightly different time and thus each one runs the initial blocks sync based on a different state of the ring. For example, in case of a cold start, the first store-gateway joining the ring may load all blocks since the sharding logic runs based on the current state of the ring, which is 1 single store-gateway.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, shouldn't this be "greater than or equal to replication_factor store-gateway instances at the same time"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. For a cold start, yes (because if you have a number of replicas <= RF then all replicas load all blocks). For the scale up case you may have a RF=3 and scale up by +2 and this PR still improve it cause the 2 new replicas will not load extra blocks they will not need anymore once they will be both ACTIVE in the ring (after the initial sync is completed).

@@ -81,6 +81,14 @@ The store-gateway replication optionally supports [zone-awareness](../guides/zon
2. Enable blocks zone-aware replication via the `-store-gateway.sharding-ring.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to store-gateways, queriers and rulers.
3. Rollout store-gateways, queriers and rulers to apply the new configuration

### Waiting for stable ring at startup

In the event of a cluster cold start or scale up of 2+ store-gateway instances at the same time we may end up in a situation where each new store-gateway instance starts at a slightly different time and thus each one runs the initial blocks sync based on a different state of the ring. For example, in case of a cold start, the first store-gateway joining the ring may load all blocks since the sharding logic runs based on the current state of the ring, which is 1 single store-gateway.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wait stable store-gateway ring state before running the initial blocks sync
6 participants