Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix queriers shuffle-sharding blast radius containment #3901

Merged
merged 27 commits into from
Mar 9, 2021

Conversation

pracucci
Copy link
Contributor

@pracucci pracucci commented Mar 3, 2021

What this PR does:
As described in the #3571, when shuffle-sharding is enabled in the query-frontend/scheduler and a querier crashes, tenants are immediately resharded across the remaining queriers. This practically invalidates the assumption that shuffle-sharding can be used to contain the blast radius in case of a "poisoned query" on the read path: if a tenant repeatedly send a poisoned query over and over it has the ability to crash all queriers, and not just its shard.

In this PR I propose a solution to mitigate it, introducing a delay ("forget delay") between when a querier disconnects because of a crash and when a tenant's shard changes because of that.

To do it, a query-frontend/scheduler needs to know when a querier disconnects because of crash. I've introduced a "graceful shutdown notification" from the querier to query-frontend/scheduler: when a querier disconnects from the query-frontend/scheduler without sending such notification it means the querier crashed / abruptly terminated.

I've done some manual testing and looks working as expected.

Which issue(s) this PR fixes:
Fixes #3571

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci requested a review from pstibrany March 3, 2021 13:49
@pracucci pracucci changed the title Fix querier shuffle sharding Fix querier shuffle-sharding blast radius containment Mar 3, 2021
@pracucci pracucci changed the title Fix querier shuffle-sharding blast radius containment Fix queriers shuffle-sharding blast radius containment Mar 3, 2021
@pracucci pracucci marked this pull request as draft March 3, 2021 13:50
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, looking forward to see the tests. My comments are mostly some grammar suggestions... if you find these annoying, let me know and I will stop. (Some may also be wrong, so... :))

querierConnections map[string]int
// How long to wait before removing a querier which has got disconnected
// but hasn't notified a graceful shutdown.
forgetTimeout time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd suggest to move forgetTimeout to user_queues.go, and pass only threshold to forgetDisconnectQueriers. removeQuerierConnection is also using this field, but only to know whether to keep disconnected querier or not. We could pass that information as a bool to removeQuerierConnection. This would simplify queues a bit, and allow for easier testing of forgetDisconnectQueriers. I would also suggest passing time.Now() as parameter to removeQuerierConnection, to avoid having dependency on current time in queues. WDYT?

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 2ccc0d8 to 03cb76f Compare March 4, 2021 10:48
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Mar 4, 2021
Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me functionally. I had a question about alternative naming of the configuration value being added, but otherwise LGTM.

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 2732880 to 4ec324d Compare March 5, 2021 09:21
@pracucci pracucci marked this pull request as ready for review March 5, 2021 09:49
@pracucci pracucci requested a review from pstibrany March 5, 2021 09:50
Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this and the changes.

CHANGELOG.md Outdated
@@ -91,6 +91,7 @@
* `cortex_bucket_store_chunk_pool_returned_bytes_total`
* [ENHANCEMENT] Alertmanager: load alertmanager configurations from object storage concurrently, and only load necessary configurations, speeding configuration synchronization process and executing fewer "GET object" operations to the storage when sharding is enabled. #3898
* [ENHANCEMENT] Blocks storage: Ingester can now stream entire chunks instead of individual samples to the querier. At the moment this feature must be explicitly enabled either by using `-ingester.stream-chunks-when-using-blocks` flag or `ingester_stream_chunks_when_using_blocks` (boolean) field in runtime config file, but these configuration options are temporary and will be removed when feature is stable. #3889
* [ENHANCEMENT] Query-frontend/scheduler: added querier forget delay (`-query-frontend.querier-forget-delay` and `-query-scheduler.querier-forget-delay`) to mitigate the blast radius in the event queriers crash because of a repeatedly sent "query of death" when shuffle-sharding is enabled. #3901
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cortex release 1.8.0 is now in progress. Could you please rebase master and move the CHANGELOG entry under the master / unreleased section?

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 68d9eca to 8b915a5 Compare March 8, 2021 08:31
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work.

querierConnections map[string]int
// How long to wait before removing a querier which has got disconnected
// but hasn't notified about a graceful shutdown.
forgetDelay time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I would prefer to move forgetDelay out of queues struct, and pass forgetTime (when to forget given querier, or zero time if immediately) to removeQuerierConnection and only threshold to forgetDisconnectedQueriers. It's a small change, but helps to minimize change to the queues struct, which is already tricky.

@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 04220de to 1a2582d Compare March 8, 2021 16:11
pracucci and others added 12 commits March 9, 2021 08:30
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
pracucci and others added 14 commits March 9, 2021 08:30
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
@pracucci pracucci force-pushed the fix-querier-shuffle-sharding branch from 1a2582d to 1ca9855 Compare March 9, 2021 07:30
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci merged commit 1ddb423 into cortexproject:master Mar 9, 2021
@pracucci pracucci deleted the fix-querier-shuffle-sharding branch March 9, 2021 08:43
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
…#3901)

* Added forget timeout support to queues

Signed-off-by: Marco Pracucci <[email protected]>

* Added notify shutdown rpc to query-frontend and query-scheduler proto

Signed-off-by: Marco Pracucci <[email protected]>

* Querier worker notifies shutdown to query-frontend/scheduler

Signed-off-by: Marco Pracucci <[email protected]>

* Log when query-frontend/scheduler receives a shutdown notification

Signed-off-by: Marco Pracucci <[email protected]>

* Added config option to configure the forget timeout

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed re-connect while in forget waiting period

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed unit tests

Signed-off-by: Marco Pracucci <[email protected]>

* Fixed GetNextRequestForQuerier() when a resharding happen after fogetting a querier

Signed-off-by: Marco Pracucci <[email protected]>

* Update pkg/frontend/v1/frontend.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/scheduler/queue/user_queues.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/scheduler/queue/user_queues.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/scheduler/scheduler.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/querier/worker/frontend_processor.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Updated comment based on review feedback

Signed-off-by: Marco Pracucci <[email protected]>

* Updated comment based on review feedback

Signed-off-by: Marco Pracucci <[email protected]>

* Updated generated doc

Signed-off-by: Marco Pracucci <[email protected]>

* Added name to services

Signed-off-by: Marco Pracucci <[email protected]>

* Moved forgetCheckPeriod where it's used

Signed-off-by: Marco Pracucci <[email protected]>

* Added queues forget timeout unit tests

Signed-off-by: Marco Pracucci <[email protected]>

* Added RequestQueue unit test

Signed-off-by: Marco Pracucci <[email protected]>

* Renamed querier forget timeout into delay

Signed-off-by: Marco Pracucci <[email protected]>

* Added timeout to the notify shutdown notification

Signed-off-by: Marco Pracucci <[email protected]>

* Updated doc

Signed-off-by: Marco Pracucci <[email protected]>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <[email protected]>

* Update pkg/scheduler/scheduler.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Update pkg/frontend/v1/frontend.go

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>

* Updated doc

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Peter Štibraný <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shuffle-sharding of queriers may not work as intended
3 participants