Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate internal store errors to prometheus API errors. #2571

Merged
merged 4 commits into from
May 26, 2020
Merged

Translate internal store errors to prometheus API errors. #2571

merged 4 commits into from
May 26, 2020

Conversation

pstibrany
Copy link
Contributor

This PR modifies how errors are reported from chunk store. Instead of using grpc errors with status codes, this PR introduces new QueryError type, which is then recognized by chunkStoreQuerier and translated to the appropriate error returned to promQL engine and API wrapper.

Prometheus API wrapper doesn't recognize grpc errors with status codes, but it does recognize some errors and adjusts status code based on them. Result of this PR is that QueryErrors are translated to 422 (best that Prometheus has to offer).

Previously, our httpgrpc.Errorf(http.StatusBadRequest, ...) were actually translated to status code 500, despite the attempt to use StatusBadRequest.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pstibrany for working on it! It's a bit tricky this change.

@@ -286,12 +289,12 @@ func (c *baseStore) validateQueryTimeRange(ctx context.Context, userID string, f
defer log.Span.Finish()

if *through < *from {
return false, httpgrpc.Errorf(http.StatusBadRequest, "invalid query, through < from (%s < %s)", through, from)
return false, QueryError(fmt.Sprintf("invalid query, through < from (%s < %s)", through, from))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using httpgrpc.Errorf() in validateQueryTimeRange() changes the behaviour for LabelValuesForMetricName() and LabelNamesForMetricName() where we want to return an httpgrpc error. An option could be always wrapping the error there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LabelValuesForMetricName and LabelNamesForMetricName are not even used in Cortex :) (but they are in Loki).

I think Loki should recognize QueryError and convert it to appropriate error in the layer it is using. Store itself should not return gRPC errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make sure that it is handled correctly in Loki, if this PR is merged.

Copy link
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (with a nit)

@@ -145,6 +145,14 @@ func runQueryFrontendTest(t *testing.T, setup queryFrontendSetup) {
c, err := e2ecortex.NewClient("", queryFrontend.HTTPEndpoint(), "", "", fmt.Sprintf("user-%d", userID))
require.NoError(t, err)

// No need to repeat this test for each user.
if userID == 0 && testMissingMetricName {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[thinking loudly] I'm a bit dubious about adding this further test here. The query-frontend integration test is become the place where everyone adds further cases and we risk to make it just more and more complicated. This case is already covered by the assertion you added to TestQuerierWithChunksStorage, so do we really need this too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I suspected that status code is mangled by query-frontend and wanted to make sure that it is propagated properly. Repeating some of the tests between querier and query-frontend makes sense imho, and we should perhaps refactor the code to avoid duplicated code, but repeat those tests.

@pstibrany pstibrany merged commit e5cab27 into cortexproject:master May 26, 2020
@pstibrany pstibrany deleted the prometheus-errors branch May 26, 2020 13:10
@bboreham
Copy link
Contributor

If users will get different error codes, doesn't this need a CHANGELOG entry?

@pstibrany
Copy link
Contributor Author

If users will get different error codes, doesn't this need a CHANGELOG entry?

Yes. I've sent #2773 to fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants