Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure queries return correctly during rolling upgrades of stateful cluster with RF 3 and only 3 nodes. #2503

Merged
merged 10 commits into from
Apr 22, 2020

Conversation

tomwilkie
Copy link
Contributor

@tomwilkie tomwilkie commented Apr 22, 2020

Signed-off-by: Tom Wilkie [email protected]

What this PR does:

  • Use a real ring with mock KV when testing distributor. This is to tease out errors in the replication logic.
  • Extend the distributor tests to cover the case of RF=3 with 2 ingesters.
  • Fix ring.GetAll to correctly calculate maxErrors when RF=3 and 2 ingesters.

Which issue(s) this PR fixes:
Fixes #2504

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

This is to teast out errors in the replication logic.

Signed-off-by: Tom Wilkie <[email protected]>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 22, 2020
Copy link
Contributor

@gouthamve gouthamve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow tricky bug, but the fix makes sense! Thanks for adding the test!

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ring changes LGTM. Due to the change in the distributor tests prepare() we need to also fix the tests TestDistributor_PushIngestionRateLimiter, TestDistributor_Push_LabelRemoval, TestDistributor_Push_ShouldGuaranteeShardingTokenConsistencyOverTheTime. I've also left a couple of comments in the code.

@tomwilkie tomwilkie marked this pull request as ready for review April 22, 2020 19:00
Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I added a minor suggestion but it's not consequential at all.

Co-Authored-By: Jacob Lisi <[email protected]>
Signed-off-by: Tom Wilkie <[email protected]>
@tomwilkie tomwilkie force-pushed the rf-three-with-two-ingesters branch from 3eb03c0 to 2619f25 Compare April 22, 2020 19:54
@tomwilkie tomwilkie merged commit fa0fc64 into master Apr 22, 2020
@tomwilkie tomwilkie deleted the rf-three-with-two-ingesters branch April 22, 2020 21:00
@bboreham
Copy link
Contributor

Does this fix #1290 ?

gouthamve pushed a commit to gouthamve/cortex that referenced this pull request Apr 23, 2020
…luster with RF 3 and only 3 nodes. (cortexproject#2503)

* Use a real ring with mock KV when testing distributor.

This is to teast out errors in the replication logic.

Signed-off-by: Tom Wilkie <[email protected]>

* Extend distributor test to cover the case RF=3 with 2 ingesters.

Signed-off-by: Tom Wilkie <[email protected]>

* Ensure ring correctly calculates the number of allowed failures when RF=3 and #ingesters=2.

Signed-off-by: Tom Wilkie <[email protected]>

* Add changelog and review feedback.

Signed-off-by: Tom Wilkie <[email protected]>

* Refactor some distributor tests to try and get them to pass.

Signed-off-by: Tom Wilkie <[email protected]>

* Speed up tests but polling more frequently.

Signed-off-by: Tom Wilkie <[email protected]>

* Fix same bug on the write path.

Signed-off-by: Tom Wilkie <[email protected]>

* Tidy up the distributor tests.

Signed-off-by: Tom Wilkie <[email protected]>

* Make test correctly handle RF3 and 2 ingesters - previously was succeeding when it shouldn't

Signed-off-by: Tom Wilkie <[email protected]>

* Update pkg/ring/ring.go

Co-Authored-By: Jacob Lisi <[email protected]>
Signed-off-by: Tom Wilkie <[email protected]>

Co-authored-by: Jacob Lisi <[email protected]>
Signed-off-by: Goutham Veeramachaneni <[email protected]>
gouthamve added a commit that referenced this pull request Apr 23, 2020
…luster with RF 3 and only 3 nodes. (#2503) (#2512)

* Use a real ring with mock KV when testing distributor.

This is to teast out errors in the replication logic.

Signed-off-by: Tom Wilkie <[email protected]>

* Extend distributor test to cover the case RF=3 with 2 ingesters.

Signed-off-by: Tom Wilkie <[email protected]>

* Ensure ring correctly calculates the number of allowed failures when RF=3 and #ingesters=2.

Signed-off-by: Tom Wilkie <[email protected]>

* Add changelog and review feedback.

Signed-off-by: Tom Wilkie <[email protected]>

* Refactor some distributor tests to try and get them to pass.

Signed-off-by: Tom Wilkie <[email protected]>

* Speed up tests but polling more frequently.

Signed-off-by: Tom Wilkie <[email protected]>

* Fix same bug on the write path.

Signed-off-by: Tom Wilkie <[email protected]>

* Tidy up the distributor tests.

Signed-off-by: Tom Wilkie <[email protected]>

* Make test correctly handle RF3 and 2 ingesters - previously was succeeding when it shouldn't

Signed-off-by: Tom Wilkie <[email protected]>

* Update pkg/ring/ring.go

Co-Authored-By: Jacob Lisi <[email protected]>
Signed-off-by: Tom Wilkie <[email protected]>

Co-authored-by: Jacob Lisi <[email protected]>
Signed-off-by: Goutham Veeramachaneni <[email protected]>

Co-authored-by: Tom Wilkie <[email protected]>
Co-authored-by: Jacob Lisi <[email protected]>
@bboreham
Copy link
Contributor

bboreham commented Apr 23, 2020

I can't see that this fixes #1290.

AFAICS r.strategy.ShouldExtendReplicaSet() during a rolling update will still end up with 4 ingesters being passed in to Filter() which then computes minSuccess=3, so we cannot tolerate a single failure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gaps when querying ingesters with replication factor = 3 and 2 healthy ingesters in the cluster
5 participants