* Added forget timeout support to queues
Signed-off-by: Marco Pracucci <[email protected]>
* Added notify shutdown rpc to query-frontend and query-scheduler proto
Signed-off-by: Marco Pracucci <[email protected]>
* Querier worker notifies shutdown to query-frontend/scheduler
Signed-off-by: Marco Pracucci <[email protected]>
* Log when query-frontend/scheduler receives a shutdown notification
Signed-off-by: Marco Pracucci <[email protected]>
* Added config option to configure the forget timeout
Signed-off-by: Marco Pracucci <[email protected]>
* Fixed re-connect while in forget waiting period
Signed-off-by: Marco Pracucci <[email protected]>
* Fixed unit tests
Signed-off-by: Marco Pracucci <[email protected]>
* Fixed GetNextRequestForQuerier() when a resharding happens after forgetting a querier
Signed-off-by: Marco Pracucci <[email protected]>
* Update pkg/frontend/v1/frontend.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Update pkg/scheduler/queue/user_queues.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Update pkg/scheduler/queue/user_queues.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Update pkg/scheduler/scheduler.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Update pkg/querier/worker/frontend_processor.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Updated comment based on review feedback
Signed-off-by: Marco Pracucci <[email protected]>
* Updated comment based on review feedback
Signed-off-by: Marco Pracucci <[email protected]>
* Updated generated doc
Signed-off-by: Marco Pracucci <[email protected]>
* Added name to services
Signed-off-by: Marco Pracucci <[email protected]>
* Moved forgetCheckPeriod where it's used
Signed-off-by: Marco Pracucci <[email protected]>
* Added queues forget timeout unit tests
Signed-off-by: Marco Pracucci <[email protected]>
* Added RequestQueue unit test
Signed-off-by: Marco Pracucci <[email protected]>
* Renamed querier forget timeout into delay
Signed-off-by: Marco Pracucci <[email protected]>
* Added timeout to the notify shutdown notification
Signed-off-by: Marco Pracucci <[email protected]>
* Updated doc
Signed-off-by: Marco Pracucci <[email protected]>
* Added CHANGELOG entry
Signed-off-by: Marco Pracucci <[email protected]>
* Update pkg/scheduler/scheduler.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Update pkg/frontend/v1/frontend.go
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
* Updated doc
Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Peter Štibraný <[email protected]>
CHANGELOG.md
@@ -6,6 +6,7 @@
 * [ENHANCEMENT] Ruler: added the following metrics when ruler sharding is enabled: #3916
   * `cortex_ruler_clients`
   * `cortex_ruler_client_request_duration_seconds`
+* [ENHANCEMENT] Query-frontend/scheduler: added querier forget delay (`-query-frontend.querier-forget-delay` and `-query-scheduler.querier-forget-delay`) to mitigate the blast radius in the event queriers crash because of a repeatedly sent "query of death" when shuffle-sharding is enabled. #3901
docs/guides/shuffle-sharding.md
@@ -125,6 +125,15 @@ Note that this distribution happens in query-frontend, or query-scheduler if use
 
 _The maximum number of queriers can be overridden on a per-tenant basis in the limits overrides configuration._
 
+#### The impact of "query of death"
+
+In the event a tenant is repeatedly sending a "query of death" which causes the querier to crash or get killed because of out-of-memory, the crashed querier will get disconnected from the query-frontend or query-scheduler and a new querier will be immediately assigned to the tenant's shard. This practically invalidates the assumption that shuffle-sharding can be used to contain the blast radius in case of a query of death.
+
+To mitigate it, Cortex allows you to configure a delay between when a querier disconnects because of a crash and when the crashed querier is actually removed from the tenant's shard (and another healthy querier is added as replacement). A delay of 1 minute may be a reasonable trade-off:
The Cortex store-gateway -- used by the [blocks storage](../blocks-storage/_index.md) -- by default spreads each tenant's blocks across all running store-gateways.
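The forget delay described in the guide change above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `shard` and `querierState` types, their methods, and `disconnectScenario` are hypothetical names invented for illustration and are not the identifiers used in Cortex. It shows the three behaviours the commit describes: a graceful shutdown removes the querier immediately, a crash keeps it in the shard until the delay has passed, and a re-connect during the waiting period cancels the pending removal.

```go
package main

import "time"

// querierState tracks one querier as seen by the frontend or scheduler.
// Hypothetical type, for illustration only.
type querierState struct {
	connections    int       // open connections from this querier
	disconnectedAt time.Time // zero while at least one connection is open
}

// shard is a toy model of the set of queriers assigned to one tenant.
type shard struct {
	forgetDelay time.Duration
	queriers    map[string]*querierState
}

func newShard(forgetDelay time.Duration) *shard {
	return &shard{forgetDelay: forgetDelay, queriers: map[string]*querierState{}}
}

// connect registers a connection. Re-connecting during the waiting period
// clears the disconnection time, so the querier is no longer due to be forgotten.
func (s *shard) connect(id string) {
	q, ok := s.queriers[id]
	if !ok {
		q = &querierState{}
		s.queriers[id] = q
	}
	q.connections++
	q.disconnectedAt = time.Time{}
}

// disconnect drops one connection. After a graceful shutdown notification, or
// when no delay is configured, the querier is removed right away. Otherwise it
// stays in the shard and the disconnection time is recorded.
func (s *shard) disconnect(id string, graceful bool, now time.Time) {
	q, ok := s.queriers[id]
	if !ok {
		return
	}
	if q.connections > 0 {
		q.connections--
	}
	if q.connections > 0 {
		return
	}
	if graceful || s.forgetDelay == 0 {
		delete(s.queriers, id)
		return
	}
	q.disconnectedAt = now
}

// forgetExpired removes queriers whose forget delay has elapsed and returns
// how many were removed.
func (s *shard) forgetExpired(now time.Time) int {
	removed := 0
	for id, q := range s.queriers {
		if q.connections == 0 && !q.disconnectedAt.IsZero() && now.Sub(q.disconnectedAt) >= s.forgetDelay {
			delete(s.queriers, id)
			removed++
		}
	}
	return removed
}

// disconnectScenario replays one disconnection with a 1 minute delay and
// reports the shard size right after it, 30 seconds later, and 60 seconds later.
func disconnectScenario(graceful bool) (afterDisconnect, after30s, after60s int) {
	start := time.Unix(0, 0)
	s := newShard(time.Minute)
	s.connect("querier-1")
	s.disconnect("querier-1", graceful, start)
	afterDisconnect = len(s.queriers)
	s.forgetExpired(start.Add(30 * time.Second))
	after30s = len(s.queriers)
	s.forgetExpired(start.Add(time.Minute))
	after60s = len(s.queriers)
	return
}
```

In this model `disconnectScenario(false)` reports shard sizes 1, 1, 0 (the crashed querier keeps its slot for the full minute), while `disconnectScenario(true)` reports 0, 0, 0. A real implementation would presumably call something like `forgetExpired` from a periodic check; the commit message mentions a `forgetCheckPeriod`, but how it is wired is not shown in this excerpt.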
 // RegisterFlags adds the flags required to config this to the given FlagSet.
 func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
 	f.IntVar(&cfg.MaxOutstandingPerTenant, "querier.max-outstanding-requests-per-tenant", 100, "Maximum number of outstanding requests per tenant per frontend; requests beyond this error with HTTP 429.")
+	f.DurationVar(&cfg.QuerierForgetDelay, "query-frontend.querier-forget-delay", 0, "If a querier disconnects without sending notification about graceful shutdown, the query-frontend will keep the querier in the tenant's shard until the forget delay has passed. This feature is useful to reduce the blast radius when shuffle-sharding is enabled.")
 }
 
 type Limits interface {
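To see how the new flag behaves when parsed, the registration above can be exercised with the standard library `flag` package. This is a minimal sketch, not the Cortex code: the `Config` struct is pared down to the single new field, the help text is shortened, and `parseForgetDelaySeconds` is a hypothetical helper added only for illustration.

```go
package main

import (
	"flag"
	"time"
)

// Config is a pared-down stand-in for the frontend config; only the field
// added by this change is modelled.
type Config struct {
	QuerierForgetDelay time.Duration
}

// RegisterFlags registers the forget-delay flag with the same name and
// default (0, meaning disabled) as in the diff; the help text is shortened.
func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
	f.DurationVar(&cfg.QuerierForgetDelay, "query-frontend.querier-forget-delay", 0, "Forget delay for queriers that disconnect without a graceful shutdown notification.")
}

// parseForgetDelaySeconds (hypothetical helper) parses args with a fresh
// FlagSet and returns the configured delay in seconds.
func parseForgetDelaySeconds(args []string) (float64, error) {
	var cfg Config
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	cfg.RegisterFlags(fs)
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return cfg.QuerierForgetDelay.Seconds(), nil
}
```

Passing `-query-frontend.querier-forget-delay=1m` yields 60 seconds, and omitting the flag leaves the delay at 0. The value uses Go duration syntax, so forms such as `30s` or `1m30s` are accepted as well.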
@@ -56,6 +58,10 @@ type Frontend struct {
 	requestQueue *queue.RequestQueue
 	activeUsers  *util.ActiveUsersCleanupService
 
+	// Subservices manager.
+	subservices        *services.Manager
+	subservicesWatcher *services.FailureWatcher
+
 	// Metrics.
 	queueLength       *prometheus.GaugeVec
 	discardedRequests *prometheus.CounterVec
@@ -74,8 +80,7 @@ type request struct {
 }
 
 // New creates a new frontend. Frontend implements service, and must be started and stopped.