
Support ingesting exemplars into TSDB when blocks storage is enabled #4124

Merged · 17 commits · May 6, 2021

Conversation

mdisibio (Contributor):

What this PR does:
Support for an in-memory buffer of exemplars was recently added to TSDB. This PR takes the first steps toward supporting the same in Cortex's ingest path by enabling the feature in TSDB and storing exemplars from remote-write data. Future PRs will add query support and integration with per-tenant limits.
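As a rough illustration of the ingest-path additions (see point 1 below on distributor-side validation), here is a sketch of what per-exemplar validation could look like; the Exemplar struct, helper, and limit are stand-ins rather than the actual Cortex code:

```go
// Sketch of the kind of per-exemplar validation a distributor could run before
// forwarding data to ingesters. Names, the limit, and the Exemplar struct are
// illustrative stand-ins, not the actual Cortex types.
package distributor

import (
	"errors"
	"unicode/utf8"
)

type LabelPair struct{ Name, Value string }

type Exemplar struct {
	Labels      []LabelPair
	Value       float64
	TimestampMs int64
}

const maxExemplarLabelSetLength = 128 // measured in runes, not bytes (illustrative limit)

func validateExemplar(e Exemplar) error {
	if len(e.Labels) == 0 {
		return errors.New("exemplar missing labels")
	}
	if e.TimestampMs == 0 {
		return errors.New("exemplar missing timestamp")
	}
	// Count runes rather than bytes so multi-byte UTF-8 label values are not over-counted.
	runes := 0
	for _, l := range e.Labels {
		runes += utf8.RuneCountInString(l.Name) + utf8.RuneCountInString(l.Value)
	}
	if runes > maxExemplarLabelSetLength {
		return errors.New("exemplar label set too long")
	}
	return nil
}
```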

This PR is marked WIP because there are several parts that could use some consideration:

  1. This is enabled with a new -blocks-storage.tsdb.max-exemplars=<n> command line argument. It is available only to the ingester, but ideally there would be a way for the distributor to be aware and skip validation of exemplars (currently it always validates any exemplars, even if they are discarded by the ingester). Is there a recommended config location for sharing the param between distributors and ingesters?

  2. Exemplars are counted in the rate limiting in the distributor. This seems good since exemplars have processing overhead, but I wanted to double-check whether something else should be done.

  3. There are 5 exemplar metrics in TSDB and they are exposed per-tenant. However, a new cortex_ingester_ingested_exemplars_total global metric was also added, which follows the pattern for samples, so although partially redundant it seems worthwhile.

  4. The PR for remote write of exemplars in Prometheus (Add Exemplar Remote Write support prometheus/prometheus#8296) is not merged yet, so the proto is still subject to change.

Note: the source branch was moved to the grafana/cortex repo, so a new PR was created. The old PR is here: #4104

@cstyan (Contributor) left a comment:

Just left a few comments that are mostly questions to other reviewers.

As for Marty's comments, those could also use some input:

  1. This is enabled with a new -blocks-storage.tsdb.max-exemplars= command line argument. This is available only to the ingester, but ideally there is a way to have the distributor be aware and skip validation of exemplars (currently it always validates any exemplars even if discarded by the ingester). Is there a recommended config location to have the param shared between both distributors and ingesters?

I know we just pass all config flags to all components when we deploy them, but I think that's strictly the single binary + our deployment strategy? Probably not something all users would want to do. Any recommendations here from other maintainers? AFAICT there aren't many common config flags used across services unless they're part of a common component/package like the KV store, but exemplars are not their own package yet. If someone can point me to an example of such a config flag I can make the changes here.

  1. Exemplars are counted in the rate limiting in the distributor. This seems good since exemplars have processing overhead, but wanted to double check if there is something else that should be done.

Exemplars should probably be their own rate limit, 1/10 or 1/100 of the sample rate limit IMO.
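A rough sketch of what such a derived limit could look like is below; this is not Cortex's actual limiter, and the struct, function name, and fraction are illustrative only:

```go
// Rough sketch of a separate per-tenant exemplar rate limit derived from the
// sample limit, as suggested above. Not Cortex's actual limiter implementation.
package distributor

import "golang.org/x/time/rate"

type tenantLimits struct {
	sampleRateLimit float64 // samples/s allowed for this tenant
}

// newExemplarLimiter returns a token-bucket limiter that allows exemplars at a
// fraction (e.g. 0.1 or 0.01) of the tenant's sample rate limit.
func newExemplarLimiter(l tenantLimits, fraction float64) *rate.Limiter {
	exemplarRate := l.sampleRateLimit * fraction
	burst := int(exemplarRate) // roughly one second's worth of exemplars
	if burst < 1 {
		burst = 1
	}
	return rate.NewLimiter(rate.Limit(exemplarRate), burst)
}

// Usage (illustrative): reject or drop the exemplars in a push request when
// the limiter has no budget left:
//
//	if !limiter.AllowN(time.Now(), len(ts.Exemplars)) { ... }
```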

d.receivedMetadata.DeleteLabelValues(userID)
d.incomingSamples.DeleteLabelValues(userID)
d.incomingExemplars.DeleteLabelValues(userID)
d.incomingMetadata.DeleteLabelValues(userID)
d.nonHASamples.DeleteLabelValues(userID)
Contributor:

should we potentially have nonHAExemplars as well?

Contributor:

I'm wondering whether that's useful information. I'm a bit dubious about it.

@@ -1479,6 +1516,7 @@ func (i *Ingester) createTSDB(userID string) (*userTSDB, error) {
WALSegmentSize: i.cfg.BlocksStorageConfig.TSDB.WALSegmentSizeBytes,
SeriesLifecycleCallback: userDB,
BlocksToDelete: userDB.blocksToDelete,
MaxExemplars: i.cfg.BlocksStorageConfig.TSDB.MaxExemplars,
@cstyan (Contributor), Apr 28, 2021:

This is the only value we need to pass to enable the in-memory storage within TSDB, which we already have via vendoring. (Nothing to do here; just pointing this out for other reviewers.)
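A rough sketch of the plumbing this refers to: register the flag on the ingester config and copy the value into the TSDB options when opening a per-tenant TSDB. Only the flag name and the MaxExemplars field follow the diff; the surrounding structs and function are simplified stand-ins:

```go
// Simplified sketch of the flag-to-TSDB plumbing; the config struct and
// function names here are illustrative, not the Cortex code.
package ingester

import "flag"

type TSDBConfig struct {
	// MaxExemplars > 0 enables the in-memory exemplar storage inside TSDB.
	MaxExemplars int
}

func (c *TSDBConfig) RegisterFlags(f *flag.FlagSet) {
	f.IntVar(&c.MaxExemplars, "blocks-storage.tsdb.max-exemplars", 0,
		"Maximum number of exemplars held in memory per tenant TSDB; 0 disables exemplar ingestion.")
}

// When the ingester creates a tenant TSDB it forwards the value, roughly as
// the diff above does:
//
//	tsdb.Open(dir, logger, reg, &tsdb.Options{
//		// ...existing options...
//		MaxExemplars: cfg.MaxExemplars,
//	})
```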

// already exist. If it does not then drop.
if ref == 0 && len(ts.Exemplars) > 0 {
updateFirstPartial(func() error {
return wrappedTSDBIngestExemplarErr(errors.New("exemplars not ingested because series not already present"),
Contributor:

I think not ingesting the exemplar if we don't already know about the series is the correct thing to do, but I'm not sure what kind of error message we want to return to users here.

Contributor:

I think not ingesting the exemplar if we don't already know about the series is the correct thing to do

Agree.

but I'm not sure what kind of error message we want to return to users here

Do we want to return an error at all? Metadata ingestion is best-effort and we never return any error about it. Should exemplar ingestion be best-effort as well?

Contributor:

Do we want to return an error at all? Metadata ingestion is best-effort and we never return any error about it. Should exemplar ingestion be best-effort as well?

We should, IMO; exemplars are more useful and important than metadata. Exemplar ingestion could be best-effort right now for the in-memory storage, though it shouldn't stay that way once we eventually have a better storage implementation. FWIW, what Marty has done here is essentially exactly what would happen in Prometheus' exemplar storage if you tried to store an exemplar for a series: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1326-L1345
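To make the trade-off concrete, here is a rough sketch of the "soft error" variant being discussed: skip and count exemplars whose series is not yet in the head, while still ingesting the rest of the request. The types, appender interface, and metric name are stand-ins, not the PR's actual code:

```go
// Illustrative sketch of best-effort exemplar ingestion: exemplars for unknown
// series are skipped and counted, without failing the whole push request.
package ingester

import "github.com/prometheus/client_golang/prometheus"

type exemplar struct {
	labels      map[string]string
	value       float64
	timestampMs int64
}

type seriesAppender interface {
	// seriesRef returns 0 when the series has not been created in the head yet.
	seriesRef(seriesLabels map[string]string) uint64
	appendExemplar(ref uint64, e exemplar) error
}

var droppedExemplars = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "cortex_ingester_dropped_exemplars_total", // hypothetical metric name
	Help: "Exemplars dropped because their series was not present in the TSDB head.",
})

func appendExemplars(app seriesAppender, seriesLabels map[string]string, exemplars []exemplar) {
	ref := app.seriesRef(seriesLabels)
	if ref == 0 {
		// Series not present yet: drop the exemplars but keep ingesting the request.
		droppedExemplars.Add(float64(len(exemplars)))
		return
	}
	for _, e := range exemplars {
		if err := app.appendExemplar(ref, e); err != nil {
			droppedExemplars.Inc() // best-effort: record and continue
		}
	}
}
```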

@gouthamve (Contributor):

Can you please rebase against master to pull in #4137, and make sure to put the changelog entry at the top?

@pracucci (Contributor) left a comment:

Very good job @mdisibio! This is a very high-quality PR considering it's one of your first Cortex contributions 👏 I left a few comments that I would be glad if you could take a look at.

Some other notes:

  1. It's OK not having a CHANGELOG entry yet, given this is a WIP
  2. Please mark it experimental in docs/configuration/v1-guarantees.md

This is enabled with a new -blocks-storage.tsdb.max-exemplars= command line argument

Let me think a bit more on the config option.

Exemplars are counted in the rate limiting in the distributor.

LGTM. They're also counted towards the samples/s ingestion rate limit, which is what we do for metadata, and that LGTM too.

// Exemplar labels, different than series labels
repeated LabelPair labels = 1 [(gogoproto.nullable) = false, (gogoproto.customtype) = "LabelAdapter"];
double value = 2;
int64 timestamp_ms = 3;
Contributor:

I would name this timestamp to keep it consistent with Prometheus.

Contributor:

We already use timestamp_ms for Sample in this file, and there's been discussion about the name of the field in the Prometheus Exemplar remote write PR: https://github.com/prometheus/prometheus/pull/8296/files#r611218904

Contributor:

My take is that the message should be equal to the Prometheus one (unless there's a good reason not to). I've seen that in the Prometheus PR it's named timestamp, so I was wondering if there's a good reason to name it differently here.

Definitely not a blocker.

Samples: make([]Sample, 0, expectedSamplesPerSeries),
Labels: make([]LabelAdapter, 0, expectedLabels),
Samples: make([]Sample, 0, expectedSamplesPerSeries),
Exemplars: make([]Exemplar, 0, expectedExemplarsPerSeries),
Contributor:

Do we actually expect 1 exemplar per series? If a client has exemplar tracking enabled, do we actually expect an exemplar for each series in each remote write request? Could you share more thoughts about this?

Contributor (author):

No, each series will contain either samples or exemplars. The current Prometheus remote write implementation only populates 1 exemplar per series. Data on exemplars is limited, but we estimate that exemplars will typically be 1% of samples. It seems the decision is whether 1% is worth prealloc/pooling. Additionally, exemplars are larger objects than samples, as they contain additional labels.

Contributor:

It seems the decision is whether 1% is worth prealloc/pooling

Exactly, I was wondering if it's worth preallocating. I'm wondering if it's actually better to initialise Exemplars to nil, adding a comment explaining the reason. Then we can reconsider it when profiling a prod cluster.

Thoughts?
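A tiny sketch of the nil-initialisation option proposed here; the types are simplified stand-ins for the pooled Cortex structs, and the 1% figure comes from the estimate above:

```go
// Sketch of the option discussed above: keep preallocating Samples, but leave
// Exemplars nil and let append allocate only for the ~1% of series that carry
// them. Types and constants are simplified stand-ins.
package cortexpbsketch

const (
	expectedLabels           = 20
	expectedSamplesPerSeries = 10
)

type LabelAdapter struct{ Name, Value string }

type Sample struct {
	Value       float64
	TimestampMs int64
}

type Exemplar struct {
	Labels      []LabelAdapter
	Value       float64
	TimestampMs int64
}

type PreallocTimeseries struct {
	Labels    []LabelAdapter
	Samples   []Sample
	Exemplars []Exemplar
}

func newPreallocTimeseries() PreallocTimeseries {
	return PreallocTimeseries{
		Labels:  make([]LabelAdapter, 0, expectedLabels),
		Samples: make([]Sample, 0, expectedSamplesPerSeries),
		// Exemplars deliberately left nil: only ~1% of series are expected to
		// carry exemplars, so preallocating for every series would waste memory.
	}
}
```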


@pull-request-size pull-request-size bot added size/XL and removed size/L labels May 3, 2021
@cstyan (Contributor) commented on May 3, 2021:

Fixed a lot of the review comments from @pracucci's review, though for the life of me I can't figure out the per-user metrics registry. I added the new metrics to the TSDB metrics test, and the new metrics have the user labels attached properly, but the expected output says the user labels should not be attached.


@@ -496,6 +512,29 @@ func newTSDBMetrics(r prometheus.Registerer) *tsdbMetrics {
"Total number of TSDB checkpoint creations attempted.",
nil, nil),

tsdbExemplarsTotal: prometheus.NewDesc(
Contributor:

I see your point and I agree that more granularity is better from an observability perspective. We also have clusters with a large number of tenants per ingester (several thousand tenants per ingester), so this can easily end up in a significant number of additional metrics exported by Cortex.

That's why I was asking if all of them are required. We typically start small and only split by user if it turns out the metric is not useful unless split by user.

I think the following could be global to have a sense of the rate:

  • cortex_ingester_tsdb_exemplar_exemplars_appended_total
  • cortex_ingester_tsdb_exemplar_exemplars_in_storage
  • cortex_ingester_tsdb_exemplar_series_with_exemplars_in_storage
  • cortex_ingester_tsdb_exemplar_out_of_order_exemplars_total

That being said, up to you. Not a blocker!
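As an illustration of what a global (non-per-user) exposition of one of these could look like, here is a condensed collector sketch; the per-tenant data source and constructor are hypothetical stand-ins, not the tsdbMetrics implementation:

```go
// Sketch of exposing an exemplar metric globally rather than per user: the
// collector sums per-tenant values from their TSDB registries into a single
// series. The data source (perUserAppended) is a hypothetical stand-in.
package ingester

import "github.com/prometheus/client_golang/prometheus"

type exemplarMetrics struct {
	// perUserAppended returns the current appended-exemplars counter per tenant.
	perUserAppended func() map[string]float64

	appendedTotal *prometheus.Desc
}

func newExemplarMetrics(perUser func() map[string]float64) *exemplarMetrics {
	return &exemplarMetrics{
		perUserAppended: perUser,
		appendedTotal: prometheus.NewDesc(
			"cortex_ingester_tsdb_exemplar_exemplars_appended_total",
			"Total number of TSDB exemplars appended, summed across tenants.",
			nil, nil, // no "user" label: one global series instead of thousands
		),
	}
}

func (m *exemplarMetrics) Describe(ch chan<- *prometheus.Desc) {
	ch <- m.appendedTotal
}

func (m *exemplarMetrics) Collect(ch chan<- prometheus.Metric) {
	var sum float64
	for _, v := range m.perUserAppended() {
		sum += v
	}
	ch <- prometheus.MustNewConstMetric(m.appendedTotal, prometheus.CounterValue, sum)
}
```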

@pracucci (Contributor) left a comment:

Thanks @mdisibio and @cstyan for addressing my feedback and replying to my questions. I don't have any concerns, but I would be glad if you could take a look at my last comments. Thanks!

@pracucci (Contributor) left a comment:

Thanks a lot for addressing my feedback. LGTM! 🤘 🚀 👏

@pracucci pracucci changed the title WIP: Support ingesting exemplars into TSDB when blocks storage is enabled Support ingesting exemplars into TSDB when blocks storage is enabled May 5, 2021
@pstibrany (Contributor) left a comment:

LGTM, thank you!

Should this PR have a changelog entry, or do you plan to add it after query-path is updated too?

@@ -496,6 +512,29 @@ func newTSDBMetrics(r prometheus.Registerer) *tsdbMetrics {
"Total number of TSDB checkpoint creations attempted.",
nil, nil),

tsdbExemplarsTotal: prometheus.NewDesc(
Contributor:

Why is cortex_ingester_tsdb_exemplar_last_exemplars_timestamp_seconds useful to have per-user? (Asking to understand better.)

@mdisibio mdisibio force-pushed the exemplar-ingest branch 2 times, most recently from e46bdf9 to f22d2d0 on May 5, 2021 18:06
@pstibrany (Contributor):

Thank you for addressing my feedback.

@pstibrany pstibrany enabled auto-merge (squash) May 6, 2021 07:31
mdisibio and others added 6 commits May 6, 2021 08:26
Signed-off-by: Martin Disibio <[email protected]>
…. Count runes in exemplar labels instead of bytes

Signed-off-by: Martin Disibio <[email protected]>
…feedback about limiting cardinality. Update other code formatting and logic based on review feedback

Signed-off-by: Martin Disibio <[email protected]>
Simplify string formatting

Co-authored-by: Peter Štibraný <[email protected]>
Signed-off-by: Martin Disibio <[email protected]>
Signed-off-by: Martin Disibio <[email protected]>
auto-merge was automatically disabled May 6, 2021 12:47

Head branch was pushed to by a user without write access

@mdisibio mdisibio force-pushed the exemplar-ingest branch from f22d2d0 to 038adf8 on May 6, 2021 12:47
@pstibrany pstibrany enabled auto-merge (squash) May 6, 2021 13:09
@pstibrany pstibrany merged commit 926a691 into cortexproject:master May 6, 2021