Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[querier] honor querier mint,maxt if no SelectHints are passed to Select #4413

Merged

Conversation

aallawala
Copy link
Contributor

@aallawala aallawala commented Aug 10, 2021

What this PR does:
This PR attempts to address an issue our org is seeing where each user's Rules Manager is unable to restore the FOR state, as intended via this Prometheus PR prometheus/prometheus#4061

What happens instead is, since the Rules Manager initiates a Select query with no SelectHints (even though the querier that is initialized via https://github.com/prometheus/prometheus/blob/b1ed4a0a663d/rules/manager.go#L710 bounds itself down to just the OutageTolerance window set within the ManagerOpts), the Cortex querier treats this as a series search instead of a metrics search. The current Cortex querier implementation looks for a nil SelectHints OR a SelectHints.Func == "series" to apply a Labels search instead of a time series search. This Labels search then strips out any samples via series.MetricsToSeriesSet before sending the result back to the caller.

The Ruler's RestoreForState expects a time series search as the samples are later used to restore the FOR state even though a nil is passed for SelectHints. Since the querier will return the MetricSeries without samples, the RestoreForState restores to the current timestamp, effectively resetting all progress that was done by the prior owning Ruler instance.

I believe this change is alright given that the time bounds of the Select query are checked further down in the querier's Select block. https://github.com/cortexproject/cortex/blob/master/pkg/querier/querier.go#L320-L323

I'm not entirely familiar with this code base, so there will probably be something I'm missing. All feedback is appreciated!

cc'ing the maintainers whom I've seen commits from WRT this, @pracucci @gouthamve @codesome

Which issue(s) this PR fixes:
Fixes # I didn't find a corresponding issue but please do update if there is one.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch from c0acb4f to 7c4676e Compare August 10, 2021 01:26
@bboreham
Copy link
Contributor

Thanks for this. What you say seems plausible, but can we go a bit further and remove all assumptions about what the caller meant?
Comments previously talked about a "kludge" and "later versions of prometheus", but we know exactly which version of prometheus this code is linked against, so we ought to be able to check, no?

@pull-request-size pull-request-size bot added size/L and removed size/S labels Aug 15, 2021
@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch 2 times, most recently from c1aee58 to 1ce0169 Compare August 15, 2021 07:33
@aallawala
Copy link
Contributor Author

aallawala commented Aug 16, 2021

Thanks for this. What you say seems plausible, but can we go a bit further and remove all assumptions about what the caller meant?
Comments previously talked about a "kludge" and "later versions of prometheus", but we know exactly which version of prometheus this code is linked against, so we ought to be able to check, no?

thank you for taking a look, @bboreham.

I dug into this a little bit and saw that the "kludge" comment was first introduced via a prometheus upgrade done via PR: #802 . Looking at the Gopkg dep list, that PR brought in the prometheus commit, prometheus/prometheus@c47fbcb, which was right before the prom 2.3.0 release.

The next comment, "later versions of prometheus", was added via PR: to specifically bring in the change where prometheus now submits a select hint with func="series" when initiating a series search versus passing a nil.

The prometheus version that's pinned against Cortex now does inject a SelectHint.func="series" when submitting a series search. I am unable to find any other instances where prometheus submits a nil for a series search.

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch from 3fa3565 to da6cc2e Compare August 16, 2021 22:17
@bboreham
Copy link
Contributor

Sorry, I think I have to retract my previous point - looking at prometheus/prometheus#8050 it talks about remote endpoints, so probably we can receive the request and its hints from different versions of Prometheus via remote-read.

That said, should the Rules Manager be changed to pass the series hint?

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch from c30ef07 to 74d0fdb Compare August 17, 2021 18:46
@aallawala
Copy link
Contributor Author

so probably we can receive the request and its hints from different versions of Prometheus via remote-read

Correct me if I'm wrong - Even if cortex can receive a read request and its hints from different versions of Prometheus via remote-read, isn't the entrypoint still gated via the api/vi/api.go? All series requests that go through the v1 API now have a SelectHint object initialized with a start/stop and func=series, versus having it be set to nil.

That said, should the Rules Manager be changed to pass the series hint?

I don't disagree that the Rules Manager should pass a series hint as well - but in this instance, rather than manipulating the core Rules Manager logic, this PR attempts to match the behavior of the querier implementations in Prometheus itself. The two main Prometheus querier implementations, blockQuerier and blockChunkQuerier, both will set the minT and maxT to the querier itself before utilizing the selectHints, should they exist.

Meanwhile, currently there are three Cortex querier implementations (with the fourth being the top-level querier that will fan out to the three queriers), the blocksStoreQuerier, the chunkStoreQuerier, and the distributorQuerier. Each handles the time range of the query in their own way.

The blocksStoreQuerier will honor the Prometheus behavior where it will first set the minT,maxT to the querier and utilize the SelectHints, should they exist.

The chunkStoreQuerier will only utilize the time range in the SelectHints. Should they not exist (or if the func="series"), the querier will return an empty set.

The distributorQuerier is similar to the chunkStoreQuerier in that it will only utilize the time range in the SelectHints. However, should they not exist (or if func="series"), the querier will process the query as a labels search instead of a time series search.

What do you think, @bboreham?

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch 2 times, most recently from 7db9fe2 to d1f80a3 Compare August 20, 2021 19:30
@pracucci
Copy link
Contributor

Very interesting!

I tend to agree with @aallawala here. I think we should match the current Prometheus logic and so when SelectHints is nil we should consider it as a normal query and not a "series" query. It's a series query only when hints are specified and hints.Func is "series".

That being said, the place where you changed the logic is not the only place where we check if hints == nil, so I think there are places to be changed. I suggest you to look for all references of storage.SelectHints.

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch 2 times, most recently from eb6815d to b91f37e Compare August 30, 2021 03:41
@aallawala
Copy link
Contributor Author

Very interesting!

I tend to agree with @aallawala here. I think we should match the current Prometheus logic and so when SelectHints is nil we should consider it as a normal query and not a "series" query. It's a series query only when hints are specified and hints.Func is "series".

That being said, the place where you changed the logic is not the only place where we check if hints == nil, so I think there are places to be changed. I suggest you to look for all references of storage.SelectHints.

Thank you for taking a look, @pracucci.

I went ahead and updated the remaining two base querier implementations, chunksStoreQuerier and distributorQueryable. The blocksStoreQuerier already matches the Prometheus behavior where it will first set the minT,maxT to the querier and utilize the SelectHints, should they exist.

Would there happen to be any other places that should be updated?

Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Note there are a couple of merge conflicts, which I would typically rebase for but you can try doing a merge if you like. If you do it in the GitHub UI it may lack the DCO sign-off.

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch from b91f37e to aacbfd4 Compare September 2, 2021 17:01
@aallawala
Copy link
Contributor Author

LGTM

Note there are a couple of merge conflicts, which I would typically rebase for but you can try doing a merge if you like. If you do it in the GitHub UI it may lack the DCO sign-off.

Thank you @bboreham for taking another look! I rebased to address the conflicts upstream. PTAL and let me know if there are other changes to be made

@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch from aacbfd4 to 2282c4f Compare September 3, 2021 18:00
aallawala and others added 8 commits September 15, 2021 09:56
Signed-off-by: Abdurrahman J. Allawala <[email protected]>
Signed-off-by: Abdurrahman J. Allawala <[email protected]>
Signed-off-by: Abdurrahman J. Allawala <[email protected]>
Signed-off-by: Abdurrahman J. Allawala <[email protected]>
Signed-off-by: Abdurrahman J. Allawala <[email protected]>
@aallawala aallawala force-pushed the aja_ruler_querier_timerange branch from 2282c4f to 347ba9a Compare September 15, 2021 16:57
@aallawala
Copy link
Contributor Author

Hi, could I get some eyes on this please?

@alvinlin123
Copy link
Contributor

Thanks! LGTM.

@alvinlin123 alvinlin123 merged commit e65da18 into cortexproject:master Sep 17, 2021
@aallawala aallawala deleted the aja_ruler_querier_timerange branch September 17, 2021 19:41
@alvinlin123 alvinlin123 mentioned this pull request Sep 17, 2021
2 tasks
srijan55 pushed a commit to srijan55/cortex that referenced this pull request Nov 26, 2021
…ect (cortexproject#4413)

* [querier] honor querier mint,maxt if no SelectHints are passed to Select

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* export mockDistributor for later use in the ruler tests

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* add test for restoring FOR state in the ruler via distributors

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* test for always active alert

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* move querier/testutils to querier pkg

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* add empty chunk store to ruler test

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* add changelog entry

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* lint

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* honor mint,maxt if sp is null for other querier implementations

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* missed addition in rebase

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

Co-authored-by: Alvin Lin <[email protected]>
Signed-off-by: Manish Kumar Gupta <[email protected]>
alvinlin123 added a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
…ect (cortexproject#4413)

* [querier] honor querier mint,maxt if no SelectHints are passed to Select

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* export mockDistributor for later use in the ruler tests

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* add test for restoring FOR state in the ruler via distributors

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* test for always active alert

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* move querier/testutils to querier pkg

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* add empty chunk store to ruler test

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* add changelog entry

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* lint

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* honor mint,maxt if sp is null for other querier implementations

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

* missed addition in rebase

Signed-off-by: Abdurrahman J. Allawala <[email protected]>

Co-authored-by: Alvin Lin <[email protected]>
Signed-off-by: Alvin Lin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants