Support ingesting exemplars into TSDB when blocks storage is enabled #4124
Conversation
Just left a few comments that are mostly questions to other reviewers.
As for Marty's comments, those could also use some input:
- This is enabled with a new -blocks-storage.tsdb.max-exemplars= command line argument. This is available only to the ingester, but ideally there is a way to have the distributor be aware and skip validation of exemplars (currently it always validates any exemplars even if discarded by the ingester). Is there a recommended config location to have the param shared between both distributors and ingesters?
I know we just pass all config flags to all components when we deploy them, but I think that's strictly the single binary + our deployment strategy? Probably not something all users would want to do. Any recommendations here from other maintainers? AFAICT there aren't many common config flags used across services unless they're part of a common component/package like the KV store, but exemplars are not their own package yet. If someone can point me to an example of such a config flag I can make the changes here.
- Exemplars are counted in the rate limiting in the distributor. This seems good since exemplars have processing overhead, but wanted to double check if there is something else that should be done.
Exemplars should probably be their own rate limit, 1/10 or 1/100 of the sample rate limit IMO.
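For illustration, a minimal sketch of what a dedicated per-tenant exemplar rate limit could look like, assuming a limits struct along the lines of Cortex's validation.Limits; the field names, flag names and defaults below are placeholders, not the actual configuration:

```go
// A minimal sketch, not the actual Cortex configuration: a dedicated exemplar
// rate limit alongside the sample limit. All names and defaults are illustrative.
package validation

import "flag"

type Limits struct {
	IngestionRate         float64 `yaml:"ingestion_rate"`
	IngestionBurstSize    int     `yaml:"ingestion_burst_size"`
	ExemplarIngestionRate float64 `yaml:"exemplar_ingestion_rate"`
}

func (l *Limits) RegisterFlags(f *flag.FlagSet) {
	f.Float64Var(&l.IngestionRate, "distributor.ingestion-rate-limit", 25000, "Per-tenant sample ingestion rate limit (samples/s).")
	f.IntVar(&l.IngestionBurstSize, "distributor.ingestion-burst-size", 50000, "Per-tenant allowed ingestion burst size.")
	// Hypothetical: exemplars get their own, much smaller limit (roughly 1/100 of the sample limit).
	f.Float64Var(&l.ExemplarIngestionRate, "distributor.exemplar-ingestion-rate-limit", 250, "Per-tenant exemplar ingestion rate limit (exemplars/s).")
}
```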
d.receivedMetadata.DeleteLabelValues(userID)
d.incomingSamples.DeleteLabelValues(userID)
d.incomingExemplars.DeleteLabelValues(userID)
d.incomingMetadata.DeleteLabelValues(userID)
d.nonHASamples.DeleteLabelValues(userID)
should we potentially have nonHAExemplars as well?
I'm wondering if it's useful information. I'm a bit dubious about it.
@@ -1479,6 +1516,7 @@ func (i *Ingester) createTSDB(userID string) (*userTSDB, error) {
WALSegmentSize: i.cfg.BlocksStorageConfig.TSDB.WALSegmentSizeBytes,
SeriesLifecycleCallback: userDB,
BlocksToDelete: userDB.blocksToDelete,
MaxExemplars: i.cfg.BlocksStorageConfig.TSDB.MaxExemplars,
This is the only value we need to pass to enable the in-memory storage within TSDB, which we already have via vendoring. (nothing to do here, just pointing this out to other reviewers)
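For other reviewers, a rough self-contained sketch of what that single option does: a non-zero MaxExemplars in the TSDB options enables the in-memory exemplar storage. This approximates the Prometheus TSDB API of the time rather than the actual Cortex code, and exact option/function signatures may differ by vendored version:

```go
// Rough sketch, not actual Cortex code. tsdb.Options/Open signatures may differ
// slightly depending on the vendored Prometheus version.
package main

import (
	"github.com/go-kit/kit/log"
	"github.com/prometheus/prometheus/tsdb"
)

// openUserTSDB opens a per-tenant TSDB with in-memory exemplar storage enabled.
// A maxExemplars value of 0 leaves exemplar ingestion disabled.
func openUserTSDB(dir string, maxExemplars int64) (*tsdb.DB, error) {
	opts := tsdb.DefaultOptions()
	opts.MaxExemplars = maxExemplars // the only knob needed to enable the feature
	return tsdb.Open(dir, log.NewNopLogger(), nil, opts)
}
```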
pkg/ingester/ingester_v2.go (outdated)
// already exist. If it does not then drop.
if ref == 0 && len(ts.Exemplars) > 0 {
updateFirstPartial(func() error {
return wrappedTSDBIngestExemplarErr(errors.New("exemplars not ingested because series not already present"),
I think not ingesting the exemplar if we don't already know about the series is the correct thing to do, but I'm not sure what kind of error message we want to return to users here.
I think not ingesting the exemplar if we don't already know about the series is the correct thing to do
Agree.
but I'm not sure what kind of error message we want to return to users here
Do we want to return an error at all? Metadata ingestion is best-effort and we never return any error about it. Should exemplars ingestion be a best-effort as well?
Do we want to return an error at all? Metadata ingestion is best-effort and we never return any error about it. Should exemplars ingestion be a best-effort as well?
We should IMO, exemplars are more useful/important than metadata. Exemplar ingestion could be best-effort right now for the in-memory storage, though it shouldn't stay that way once we eventually have a better storage implementation. FWIW what Marty has done here is essentially exactly what would happen in Prometheus' exemplar storage if you tried to store an exemplar for a series that isn't already present: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1326-L1345
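To make the best-effort idea concrete, here is a small illustrative sketch (an assumed helper, not the actual ingester code) of soft-failing exemplar appends while letting the rest of the write succeed; the import paths and AppendExemplar signature follow the Prometheus packages of that era and may differ in the vendored copy:

```go
// Minimal sketch of best-effort exemplar ingestion, not the actual Cortex code.
package ingester

import (
	"errors"

	"github.com/prometheus/prometheus/pkg/exemplar"
	"github.com/prometheus/prometheus/pkg/labels"
	"github.com/prometheus/prometheus/storage"
)

var errExemplarSeriesMissing = errors.New("exemplars not ingested because series not already present")

// appendExemplarsBestEffort appends exemplars for a series whose reference was
// obtained from the sample append path. ref == 0 means the series is not in the
// TSDB head yet, so its exemplars are dropped and recorded as a soft error; the
// overall write request still succeeds.
func appendExemplarsBestEffort(app storage.Appender, ref uint64, lset labels.Labels, es []exemplar.Exemplar, softErrs *[]error) {
	if ref == 0 && len(es) > 0 {
		*softErrs = append(*softErrs, errExemplarSeriesMissing)
		return
	}
	for _, e := range es {
		// Out-of-order or duplicate exemplars also become soft errors.
		if _, err := app.AppendExemplar(ref, lset, e); err != nil {
			*softErrs = append(*softErrs, err)
		}
	}
}
```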
Can you please rebase against master to pull in #4137 and make sure to put the changelog entry at the top?
Very good job @mdisibio! This is a very high quality PR considering it's one of your first Cortex contributions 👏 I left a few comments I'd be glad if you could take a look at.
Some other notes:
- It's OK not having a CHANGELOG entry yet, given this is a WIP
- Please mark it experimental in docs/configuration/v1-guarantees.md
This is enabled with a new -blocks-storage.tsdb.max-exemplars= command line argument
Let me think a bit more on the config option.
Exemplars are counted in the rate limiting in the distributor.
LGTM. They're also counted towards the samples/s ingestion rate limit, which is what we do for metadata, and that LGTM too.
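For concreteness, a tiny sketch of that counting using a plain token-bucket limiter instead of Cortex's own limiter type; the real distributor code differs, this only illustrates that exemplars consume the same per-tenant budget as samples and metadata:

```go
// Illustrative only: exemplars consume the same per-tenant ingestion budget as
// samples and metadata. Cortex's real limiter type differs from x/time/rate.
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func checkIngestionRate(lim *rate.Limiter, samples, exemplars, metadata int) error {
	total := samples + exemplars + metadata
	if !lim.AllowN(time.Now(), total) {
		return fmt.Errorf("ingestion rate limit (%v/s) exceeded while adding %d samples, %d exemplars and %d metadata",
			lim.Limit(), samples, exemplars, metadata)
	}
	return nil
}
```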
// Exemplar labels, different than series labels
repeated LabelPair labels = 1 [(gogoproto.nullable) = false, (gogoproto.customtype) = "LabelAdapter"];
double value = 2;
int64 timestamp_ms = 3;
I would name this timestamp to keep it consistent with Prometheus.
We already use timestamp_ms for Sample in this file, and there's been discussion about the name of the field in the Prometheus Exemplar remote write PR: https://github.com/prometheus/prometheus/pull/8296/files#r611218904
My take is that the message should be equal to the Prometheus one (unless there's a good reason not to). I've seen in the Prometheus PR it's named timestamp, so I was wondering if there's a good reason to name it differently here.
Definitely not a blocker.
Samples: make([]Sample, 0, expectedSamplesPerSeries),
Labels: make([]LabelAdapter, 0, expectedLabels),
Samples: make([]Sample, 0, expectedSamplesPerSeries),
Exemplars: make([]Exemplar, 0, expectedExemplarsPerSeries),
Do we actually expect 1 exemplar per series? If a client has exemplars tracking enabled, do we actually expect an exemplar for each series in each remote write request? Could you share more thoughts about this?
No, each series will contain either samples or exemplars. The current Prometheus remote write implementation only populates 1 exemplar per series. Data on exemplars is limited, but we estimate that exemplars will typically be 1% of samples. It seems the decision is whether 1% is worth prealloc/pooling. Additionally, exemplars are larger objects than samples as they contain additional labels.
It seems the decision is whether 1% is worth prealloc/pooling
Exactly. I was wondering if it's worth preallocating. I'm wondering if it's actually better to initialise Exemplars to nil, adding a comment with the reason. Then we can reconsider it when profiling a prod cluster.
Thoughts?
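If we go the nil route, the pooled struct literal could look roughly like this (a sketch mirroring the quoted code above, with the reasoning captured in a comment so it can be revisited after profiling):

```go
// Sketch mirroring the quoted pool code; not the final implementation.
Labels:  make([]LabelAdapter, 0, expectedLabels),
Samples: make([]Sample, 0, expectedSamplesPerSeries),
// Exemplars intentionally left nil: exemplars are expected to be roughly 1% of
// samples and each carries its own labels, so preallocating for every series
// would mostly waste memory. Reconsider once a production cluster is profiled.
Exemplars: nil,
```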
Fixed a lot of the review comments from @pracucci's review, though for the life of me I can't figure out the per-user metrics registry. I added the new metrics to the TSDB metrics test, and the new metrics have the user labels attached properly, but the expected output says the user labels should not be attached.
@@ -496,6 +512,29 @@ func newTSDBMetrics(r prometheus.Registerer) *tsdbMetrics {
"Total number of TSDB checkpoint creations attempted.",
nil, nil),

tsdbExemplarsTotal: prometheus.NewDesc(
I see your point and I agree the more granularity the better from an observability perspective. We also have clusters with a large number of tenants per ingester (several thousand tenants / ingester), so this can easily end up adding a significant number of metrics exported by Cortex.
That's why I was asking if all of them are required. We typically start small and split by user later only if it turns out a metric is not useful unless split by user.
I think the following could be global to have a sense of the rate:
cortex_ingester_tsdb_exemplar_exemplars_appended_total
cortex_ingester_tsdb_exemplar_exemplars_in_storage
cortex_ingester_tsdb_exemplar_series_with_exemplars_in_storage
cortex_ingester_tsdb_exemplar_out_of_order_exemplars_total
That being said, up to you. Not a blocker!
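As a point of reference for the cardinality concern: the usual aggregation pattern is to gather every per-tenant registry and sum the selected series into a single global value, so keeping a metric per-user multiplies its series count by the tenant count. A generic sketch of that summing, not the actual tsdbMetrics collector:

```go
// Generic sketch, not the actual Cortex tsdbMetrics code: sum one metric across
// all per-tenant registries so it can be exported once, without a user label.
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

func sumAcrossTenants(regs map[string]*prometheus.Registry, metricName string) (float64, error) {
	var total float64
	for _, reg := range regs {
		mfs, err := reg.Gather()
		if err != nil {
			return 0, err
		}
		for _, mf := range mfs {
			if mf.GetName() != metricName {
				continue
			}
			for _, m := range mf.GetMetric() {
				// Counters and gauges both handled; the getters are nil-safe.
				total += m.GetCounter().GetValue() + m.GetGauge().GetValue()
			}
		}
	}
	return total, nil
}
```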
Thanks a lot for addressing my feedback. LGTM! 🤘 🚀 👏
LGTM, thank you!
Should this PR have a changelog entry, or do you plan to add it after query-path is updated too?
Why is cortex_ingester_tsdb_exemplar_last_exemplars_timestamp_seconds useful to have per-user? (Asking to understand better.)
Compare e46bdf9 to f22d2d0
Thank you for addressing my feedback.
Commits:
- …to tsdb when blocks storage is enabled. Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- …rded per reason Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- …lars Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Callum Styan <[email protected]>
- Signed-off-by: Callum Styan <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- …. Count runes in exemplar labels instead of bytes Signed-off-by: Martin Disibio <[email protected]>
- …feedback about limiting cardinality. Update other code formatting and logic based on review feedback Signed-off-by: Martin Disibio <[email protected]>
- Simplify string formatting Co-authored-by: Peter Štibraný <[email protected]> Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
- Signed-off-by: Martin Disibio <[email protected]>
Head branch was pushed to by a user without write access
What this PR does:
Support for an in-memory buffer of exemplars was added to TSDB recently. This PR takes the first steps to supporting the same in cortex's ingest path by enabling the feature in TSDB and storing exemplars from the remote write data. Future PRs will add query support and integration with per-tenant limits.
This PR is marked WIP because there are several parts that could use some consideration:
- This is enabled with a new -blocks-storage.tsdb.max-exemplars=<n> command line argument (a flag-registration sketch follows this list). This is available only to the ingester, but ideally there is a way to have the distributor be aware and skip validation of exemplars (currently it always validates any exemplars even if discarded by the ingester). Is there a recommended config location to have the param shared between both distributors and ingesters?
- Exemplars are counted in the rate limiting in the distributor. This seems good since exemplars have processing overhead, but wanted to double check if there is something else that should be done.
- There are 5 exemplar metrics in TSDB and they are exposed per-tenant. However, a new cortex_ingester_ingested_exemplars_total global metric was also added which follows the pattern for samples, so although partially redundant it seems worthwhile.
- The PR for remote write of exemplars in Prometheus (Add Exemplar Remote Write support prometheus/prometheus#8296) is not merged yet, so the proto is still subject to change.
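As referenced in the first item above, a sketch of how the option might be surfaced as a CLI flag following the usual RegisterFlags pattern; the flag name comes from this PR, while the struct and field names here are illustrative:

```go
// Sketch of how the option is surfaced; struct and field names are illustrative,
// not the actual Cortex blocks-storage config.
package tsdb

import "flag"

type TSDBConfig struct {
	MaxExemplars int `yaml:"max_exemplars"`
}

func (cfg *TSDBConfig) RegisterFlags(f *flag.FlagSet) {
	f.IntVar(&cfg.MaxExemplars, "blocks-storage.tsdb.max-exemplars", 0,
		"Maximum number of exemplars held in each per-tenant TSDB's in-memory exemplar storage. 0 disables exemplar ingestion (experimental).")
}
```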
Note - Source branch was moved to the grafana/cortex repo so a new PR was created. Old PR is here: #4104