-
Notifications
You must be signed in to change notification settings - Fork 812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "wait ring stability" to store-gateway and fix cold start issue #4271
Add "wait ring stability" to store-gateway and fix cold start issue #4271
Conversation
} | ||
} | ||
|
||
// Filter implements block.MetadataFilter. | ||
// This function is NOT safe for use by multiple goroutines concurrently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not used concurrently. Not even Thanos filters are thread-safe. I've added the comment just to make it clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, modulo a changelog entry
@@ -30,6 +33,13 @@ func (s *bucketStoreSeriesServer) Send(r *storepb.SeriesResponse) error { | |||
s.Warnings = append(s.Warnings, errors.New(r.GetWarning())) | |||
} | |||
|
|||
if rawHints := r.GetHints(); rawHints != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change related?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's just used by tests. I've added the ability to read hints too.
Does this help #3570 ? |
Yes, I think so, because it doesn't expand the replication set anymore for instances in JOINING state. However the replication set would be expanded anyway for unhealthy store-gateways, but if they restart quick enough (before they're detected as unhealthy) then no extra pressure should be added to other (healthy) store-gateways. |
ee59803
to
883668a
Compare
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
…oreGateway_InitialSyncWithWaitRingStability Signed-off-by: Marco Pracucci <[email protected]>
…itialised and logged on a per test case basis, cause tests ordering is unstable Signed-off-by: Marco Pracucci <[email protected]>
…gedAfterScaleUp Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Jacob Lisi <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
84b59ef
to
05c4265
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Change makes sense to me, and PR looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an expert on the logic in this area but the code all looks good to me.
Is the "Improved filterBlocksByRingSharding() logic" commit covered by the PR description? If not please add it. |
Good point. I've added this one to the PR description:
|
@@ -81,6 +81,14 @@ The store-gateway replication optionally supports [zone-awareness](../guides/zon | |||
2. Enable blocks zone-aware replication via the `-store-gateway.sharding-ring.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to store-gateways, queriers and rulers. | |||
3. Rollout store-gateways, queriers and rulers to apply the new configuration | |||
|
|||
### Waiting for stable ring at startup | |||
|
|||
In the event of a cluster cold start or scale up of 2+ store-gateway instances at the same time we may end up in a situation where each new store-gateway instance starts at a slightly different time and thus each one runs the initial blocks sync based on a different state of the ring. For example, in case of a cold start, the first store-gateway joining the ring may load all blocks since the sharding logic runs based on the current state of the ring, which is 1 single store-gateway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, shouldn't this be "greater than or equal to replication_factor
store-gateway instances at the same time"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends. For a cold start, yes (because if you have a number of replicas <= RF then all replicas load all blocks). For the scale up case you may have a RF=3 and scale up by +2 and this PR still improve it cause the 2 new replicas will not load extra blocks they will not need anymore once they will be both ACTIVE in the ring (after the initial sync is completed).
@@ -81,6 +81,14 @@ The store-gateway replication optionally supports [zone-awareness](../guides/zon | |||
2. Enable blocks zone-aware replication via the `-store-gateway.sharding-ring.zone-awareness-enabled` CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to store-gateways, queriers and rulers. | |||
3. Rollout store-gateways, queriers and rulers to apply the new configuration | |||
|
|||
### Waiting for stable ring at startup | |||
|
|||
In the event of a cluster cold start or scale up of 2+ store-gateway instances at the same time we may end up in a situation where each new store-gateway instance starts at a slightly different time and thus each one runs the initial blocks sync based on a different state of the ring. For example, in case of a cold start, the first store-gateway joining the ring may load all blocks since the sharding logic runs based on the current state of the ring, which is 1 single store-gateway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
What this PR does:
The store-gateway suffers a cold start issue which this PR should fix.
Let's say we have a disruption across all store-gateways and we restart all of them. At startup, each store-gateway is in
JOINING
state in the ring and begin synching blocks. Due to how theBlocksSync
ring operation works, if all store-gateways are inJOINING
state (and they are during a cold start), each store-gateway replica syncs all the blocks, not just the blocks belonging to their own shard because we extend the replication set for instances inJOINING
state.This looks an apparently easy problem to solve, because we could just not extend the replication set for
JOINING
state but then this open to another issue. Let's say we have a pool of running store-gateways and we scale up. If theJOINING
state doesn't extend the replication set, we can end up in a situation where the previous store-gateways holding a block all offload the block while new replicas are still loading it (JOINING
state) so no store-gateway has that block and queries fail.This PR tries to solve this problem changing the logic as follow:
This logic change, in conjunction with the "wait ring stability" at startup, should solve the cold start issue. I've added unit tests to show it (they fail in
master
) and I've also manually tested in a dev cluster (my tests show working as expected).Which issue(s) this PR fixes:
Fixes #2827 #3570
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]