Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new query stats metrics to track prometheus querystats #6228

Conversation

SungJin1212
Copy link
Member

@SungJin1212 SungJin1212 commented Sep 23, 2024

What this PR does:
Add new query stats metrics cortex_query_samples_scanned_total and cortex_query_peak_samples.
These track scannedSamples and PeakSamples for each in https://github.com/prometheus/prometheus/blob/main/util/stats/query_stats.go

Which issue(s) this PR fixes:
Fixes

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch 4 times, most recently from 0a571de to d4ddae7 Compare September 23, 2024 12:20

h.queryPeakSamples = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
Name: "cortex_query_peak_samples_total",
Help: "Highest count of samples considered to execute a query.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain the usecase of tracking peak samples using a counter metric?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think like promQL: sum by (user) (irate(cortex_query_peak_samples_total[5m])), the cortex user can track the rate of how many samples are loaded in memory for each tenant's overall queries.

One usecase is:

  • The cortex user can track query memory usage (used by loaded samples) for each tenant and turn promQL to reduce query memory usage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. For a counter metric I feel it is more useful to just track total samples, not peak samples.
Would it be more useful to use native histogram instead?

Copy link
Member Author

@SungJin1212 SungJin1212 Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the histogram is the best.

@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch 3 times, most recently from afea4c7 to 1e37e70 Compare September 27, 2024 01:47
@SungJin1212
Copy link
Member Author

@yeya24
Could you please take a look?

@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch 2 times, most recently from 0cfa84b to 797b6ee Compare October 3, 2024 00:34
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think I am ok. Would be good to have another approval from maintainer before merging this.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 7, 2024
@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch from 797b6ee to 1ac4463 Compare October 10, 2024 01:09
@SungJin1212
Copy link
Member Author

@alanprot @friedrichg
could you please take a look? :)

Copy link
Member

@friedrichg friedrichg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome tests. Small questions related to naming

CHANGELOG.md Outdated
@@ -8,6 +8,7 @@
* [FEATURE] Ruler: Minimize chances of missed rule group evaluations that can occur due to OOM kills, bad underlying nodes, or due to an unhealthy ruler that appears in the ring as healthy. This feature is enabled via `-ruler.enable-ha-evaluation` flag. #6129
* [FEATURE] Store Gateway: Add an in-memory chunk cache. #6245
* [FEATURE] Chunk Cache: Support multi level cache and add metrics. #6249
* [ENHANCEMENT] Query Frontend: Add new query stats metrics `cortex_query_total_queryable_samples_total` and `cortex_query_peak_samples` to track totalQueryableSamples and peakSample per user. #6228
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the metric cortex_query_samples_processed_total or cortex_query_total_queryable_samples_total?

Can you double check this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching it. cortex_query_samples_processed_total is right.

@@ -39,4 +39,10 @@ message Stats {
// The total size of postings touched in store gateway for a specific query, in bytes.
// Only successful requests from querier to store gateway are included.
uint64 store_gateway_touched_posting_bytes = 12;
// The total number of samples scanned while evaluating a query.
// Equal to TotalSamples in https://github.com/prometheus/prometheus/blob/main/util/stats/query_stats.go
uint64 processed_samples = 13;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand where the word "processed" comes from. Can you explain?

Maybe scanned_samples sounds better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Processed means the total number of considered samples when the query is executed.
But now, I feel like scanned is more intuitive to understand. Thanks.

@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch 3 times, most recently from 2853868 to 593582c Compare October 21, 2024 09:46
querySeconds *prometheus.CounterVec
querySeries *prometheus.CounterVec
queryFetchedSamples *prometheus.CounterVec
queryTotalQueryableSamples *prometheus.CounterVec
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
queryTotalQueryableSamples *prometheus.CounterVec
queryScannedSamples *prometheus.CounterVec

wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good for consistency. Thanks you!

@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch from 593582c to a097955 Compare October 22, 2024 01:14
@friedrichg friedrichg requested a review from yeya24 October 22, 2024 06:35
@SungJin1212 SungJin1212 force-pushed the Add-metrics-for-queryable-samples-and-peak-samples branch from a097955 to 6c77531 Compare October 23, 2024 07:42
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@yeya24 yeya24 merged commit 3b5011b into cortexproject:master Oct 23, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants