Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StoreGateway: fix store gateway shuffle sharding consistency issue when shard size change by 1 #5489

Merged
merged 10 commits into from
Aug 3, 2023

Conversation

yeya24
Copy link
Contributor

@yeya24 yeya24 commented Jul 27, 2023

What this PR does:

Add a new method ShuffleShardWithZoneStability in ReadRing interface. It does the same thing as ShuffleShard but makes a slight change:

  1. It doesn't round up shard size to be multiple of numZones when zone awareness is enabled.

If shardSize is divisible by numZones, the behavior should be the same.
If shardSize is not divisible by numZones, for example, shard size 8 and num zones is 3. Then 2 zones are assigned to have 3 instances and 1 zone has 2 instances. Because we iterate over zones using the same order so we can ensure the consistency.

Which issue(s) this PR fixes:
Fixes #5467

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@yeya24 yeya24 changed the title use shuffle sharding with zone stability StoreGateway: fix store gateway shuffle sharding consistency issue when shard size change by 1 Jul 27, 2023
@yeya24
Copy link
Contributor Author

yeya24 commented Jul 27, 2023

ShuffleShardWithZoneStability might not be a good name and tbh I am not sure how to name the new shuffle sharding algorithm/change. Any suggestions are welcomed.
For compatibility reason I think we might still want to keep the existing ShuffleShard method.

// It doesn't round up shard size to be divisible to number of zones and make sure when scaling up/down one
// shard size at a time, at most 1 instance can be changed.
// It is only used in Store Gateway for now.
ShuffleShardWithZoneStability(identifier string, size int) ReadRing
Copy link
Contributor

@harry671003 harry671003 Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to put this behind a feature flag? Would it break some customers of Cortex who're using zone-awareness in store-gateways with shuffle-sharding?
By default, we use the old algorithm. Cortex has to be explicitly configured to use the new one.

We can remove this feature flag after two releases. ;)

@yeya24 yeya24 force-pushed the shuffle-sharding-change branch from 25961f4 to c811ad0 Compare July 28, 2023 08:48
Copy link
Contributor

@harry671003 harry671003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks for the PR.

Copy link
Contributor

@justinjung04 justinjung04 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! But I thought this is more of a bug fix in ring.ShuffleShard rather than a new feature. I guess you created new function and flag just to avoid any regression for now, but is there a plan to migrate other components to use this func as well?

@yeya24
Copy link
Contributor Author

yeya24 commented Jul 29, 2023

But I thought this is more of a bug fix in ring.ShuffleShard rather than a new feature. I guess you created new function and flag just to avoid any regression for now, but is there a plan to migrate other components to use this func as well?

Yes, it is not a new feature but a bug fix specifically for Store Gateway due to consistency check.

I don't know if it is necessary for other components. But because there is no consistency check to throw 5XXs due to this issue in other components, I hope it is not as urgent as this one.

So no plan for migrations of other components at the moment. We probably need to assess the impact first for each component and see if this is really causing any issue for them. For example, I don't see a big issue of using the existing shuffle sharding algorithm for compactor. We can open another issue to track the assessment and migration.

yeya24 added 6 commits August 3, 2023 00:29
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
Signed-off-by: Ben Ye <[email protected]>
@yeya24 yeya24 force-pushed the shuffle-sharding-change branch from 2052fac to 45467d4 Compare August 3, 2023 07:29
@yeya24 yeya24 force-pushed the shuffle-sharding-change branch from a338024 to 04641be Compare August 3, 2023 21:26
@alanprot alanprot enabled auto-merge (squash) August 3, 2023 21:44
@alanprot alanprot merged commit d84394d into cortexproject:master Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in ring.shuffleShard() with Zone Aware Replication
5 participants