-
Notifications
You must be signed in to change notification settings - Fork 812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StoreGateway: fix store gateway shuffle sharding consistency issue when shard size change by 1 #5489
StoreGateway: fix store gateway shuffle sharding consistency issue when shard size change by 1 #5489
Conversation
|
// It doesn't round up shard size to be divisible to number of zones and make sure when scaling up/down one | ||
// shard size at a time, at most 1 instance can be changed. | ||
// It is only used in Store Gateway for now. | ||
ShuffleShardWithZoneStability(identifier string, size int) ReadRing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to put this behind a feature flag? Would it break some customers of Cortex who're using zone-awareness in store-gateways with shuffle-sharding?
By default, we use the old algorithm. Cortex has to be explicitly configured to use the new one.
We can remove this feature flag after two releases. ;)
25961f4
to
c811ad0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Thanks for the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! But I thought this is more of a bug fix in ring.ShuffleShard
rather than a new feature. I guess you created new function and flag just to avoid any regression for now, but is there a plan to migrate other components to use this func as well?
Yes, it is not a new feature but a bug fix specifically for Store Gateway due to consistency check. I don't know if it is necessary for other components. But because there is no consistency check to throw 5XXs due to this issue in other components, I hope it is not as urgent as this one. So no plan for migrations of other components at the moment. We probably need to assess the impact first for each component and see if this is really causing any issue for them. For example, I don't see a big issue of using the existing shuffle sharding algorithm for compactor. We can open another issue to track the assessment and migration. |
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
2052fac
to
45467d4
Compare
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
a338024
to
04641be
Compare
What this PR does:
Add a new method
ShuffleShardWithZoneStability
inReadRing
interface. It does the same thing asShuffleShard
but makes a slight change:numZones
when zone awareness is enabled.If shardSize is divisible by
numZones
, the behavior should be the same.If shardSize is not divisible by
numZones
, for example, shard size 8 and num zones is 3. Then 2 zones are assigned to have 3 instances and 1 zone has 2 instances. Because we iterate over zones using the same order so we can ensure the consistency.Which issue(s) this PR fixes:
Fixes #5467
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]