Ruler: optimised <prefix>/api/v1/rules and <prefix>/api/v1/alerts #3916
Conversation
LGTM! I've also found a bug today that I believe is worth fixing as part of this.
}
conn, err := grpc.DialContext(ctx, rlr.Addr, dialOpts...)

newGrps, err := grpcClient.(RulerClient).Rules(ctx, &RulesRequest{})
I believe there's one more bug we need to fix, and I think we can test it by making the gRPC method Rules fail. It seems like gRPC methods don't like it when you return nil as the pointer to the struct they expect as a response (e.g. *RulesResponse for Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error)), because they call new() on it. So, we need to do:
func (r *Ruler) Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error) {
	userID, err := tenant.TenantID(ctx)
	if err != nil {
		// Return an empty (non-nil) response so the proto codec never marshals nil.
		return &RulesResponse{}, fmt.Errorf("no user id found in context")
	}

	groupDescs, err := r.getLocalRules(userID)
	if err != nil {
		return &RulesResponse{}, err
	}

	return &RulesResponse{Groups: groupDescs}, nil
}
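A rough sketch of how that could be exercised in a test, assuming a hypothetical newTestRuler helper that builds a *Ruler for tests (the actual test setup in Cortex may differ), using context, testing and testify's require:

func TestRuler_Rules_ReturnsNonNilResponseOnError(t *testing.T) {
	r := newTestRuler(t) // hypothetical helper

	// No tenant ID in the context, so Rules() is expected to fail.
	resp, err := r.Rules(context.Background(), &RulesRequest{})

	require.Error(t, err)
	// The response must still be a non-nil pointer, otherwise marshaling it
	// would fail with "proto: Marshal called with nil".
	require.NotNil(t, resp)
}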
An example error is:
msg="GET /api/prom/api/v1/rules (500) 4.012183ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"data\\\":null,\\\"errorType\\\":\\\"server_error\\\",\\\"error\\\":\\\"unable to retrieve rules from other rulers, rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil\\\"}\"
I don't think the problem is the return. We do return nil, err in every other gRPC function, so why should it be different here?
error while marshaling: proto: Marshal called with nil

I've been able to reproduce it on the request side, calling grpcClient.(RulerClient).Rules(ctx, nil) instead of the actual grpcClient.(RulerClient).Rules(ctx, &RulesRequest{}). In master we're calling .Rules(ctx, nil), and it's something we have broken with the gRPC upgrade.
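A minimal sketch of the fixed call site under those assumptions (NewRulerClient is taken to be the client constructor generated from the ruler proto, and GroupStateDesc the group type held by RulesResponse; the real code goes through the clients pool rather than dialing directly):

// fetchRules always passes a non-nil (possibly empty) request message so the
// proto codec never has to marshal nil.
func fetchRules(ctx context.Context, addr string, dialOpts []grpc.DialOption) ([]*GroupStateDesc, error) {
	conn, err := grpc.DialContext(ctx, addr, dialOpts...)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := NewRulerClient(conn)

	// Previously: client.Rules(ctx, nil), which fails after the gRPC upgrade
	// with "proto: Marshal called with nil".
	resp, err := client.Rules(ctx, &RulesRequest{})
	if err != nil {
		return nil, err
	}
	return resp.Groups, nil
}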
See here:
#3917
Good catch. I originally thought the problem was intermittent, so my hunch was on the return side.
Force-pushed from c00d1df to c8951cf
LGTM, thanks!
CHANGELOG.md (Outdated)
@@ -91,6 +91,10 @@
* `cortex_bucket_store_chunk_pool_returned_bytes_total`
* [ENHANCEMENT] Alertmanager: load alertmanager configurations from object storage concurrently, and only load necessary configurations, speeding configuration synchronization process and executing fewer "GET object" operations to the storage when sharding is enabled. #3898
* [ENHANCEMENT] Blocks storage: Ingester can now stream entire chunks instead of individual samples to the querier. At the moment this feature must be explicitly enabled either by using `-ingester.stream-chunks-when-using-blocks` flag or `ingester_stream_chunks_when_using_blocks` (boolean) field in runtime config file, but these configuration options are temporary and will be removed when feature is stable. #3889
* [ENHANCEMENT] Ruler: optimized `<prefix>/api/v1/rules` and `<prefix>/api/v1/alerts` when ruler sharding is enabled. #3916
Cortex release 1.8.0 is now in progress. Could you please rebase onto master and move the CHANGELOG entry under the master / unreleased section?
Force-pushed from c8951cf to c1f634a
…rtexproject#3916)

* Use a grpc clients pool in the ruler
* Concurrently fetch rules from all rulers
* Added subservices manager
* Fixed Rules() grpc call
* Added integration test
* Added CHANGELOG entry
* Addressed review comments
* Fixed CHANGELOG

Signed-off-by: Marco Pracucci <[email protected]>
What this PR does:

The ruler's /api/prom/api/v1/rules endpoint is pretty slow in our clusters. The endpoint is handled by the ruler's API.PrometheusRules(). When sharding is enabled, rules are fetched sequentially from the rulers, so in this PR I'm proposing to fetch them concurrently from all rulers. Along the way, I've also introduced the usage of a clients pool (like we already have in other places) and an integration test.

Which issue(s) this PR fixes:
N/A
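As a rough illustration of the concurrent fan-out described in the summary above (a sketch only, not the actual Cortex implementation: getClientFor stands in for the clients pool introduced by this PR, and the merge logic is simplified):

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

// fetchRulesFromAllRulers issues the Rules() calls to all rulers in parallel
// instead of sequentially, and merges the returned rule groups.
func fetchRulesFromAllRulers(ctx context.Context, addrs []string) ([]*GroupStateDesc, error) {
	var (
		mu     sync.Mutex
		merged []*GroupStateDesc
	)

	g, gCtx := errgroup.WithContext(ctx)
	for _, addr := range addrs {
		addr := addr // capture the loop variable for the goroutine
		g.Go(func() error {
			client, err := getClientFor(addr) // hypothetical pool lookup
			if err != nil {
				return err
			}

			resp, err := client.Rules(gCtx, &RulesRequest{})
			if err != nil {
				return err
			}

			mu.Lock()
			merged = append(merged, resp.Groups...)
			mu.Unlock()
			return nil
		})
	}

	if err := g.Wait(); err != nil {
		return nil, err
	}
	return merged, nil
}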
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]