
Optimize ingester Push() on blocks storage when the per-user or per-metric series limit is reached #3971

Merged · 4 commits into cortexproject:master · Mar 18, 2021

Conversation

@pracucci (Contributor) commented Mar 18, 2021

What this PR does:
Following up on PR #3969, this PR optimizes v2Push() when a high volume of per-user or per-metric series limit errors is hit.

In the profile I've seen that a lot of CPU and memory is consumed formatting the Limiter error. The idea in this PR is to return static errors from Limiter, offer a FormatError() function, and then format only the first partial error in v2Push().

Metadata and chunks storage have not been changed.

Benchmark:

The benchmarks were run on top of #3969 (#3969 was merged first, then this PR was built on it):

name                                                                                             old time/op    new time/op    delta
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_no_concurrency-12        1.19ms ± 0%    0.63ms ± 1%  -46.91%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_low_concurrency-12       29.9ms ± 8%    16.4ms ±11%  -45.24%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_high_concurrency-12       360ms ±14%     194ms ± 9%  -46.12%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_high_concurrency-12     1.04s ± 2%     0.33s ± 4%  -68.81%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_no_concurrency-12      1.31ms ± 1%    0.69ms ± 2%  -47.38%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_low_concurrency-12      105ms ±11%      30ms ± 7%  -71.19%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_no_concurrency-12                  362µs ± 3%     298µs ± 1%  -17.65%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_low_concurrency-12                8.54ms ± 6%    8.54ms ±15%     ~     (p=0.690 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_high_concurrency-12               86.1ms ± 2%    87.9ms ±13%     ~     (p=0.841 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_no_concurrency-12                  351µs ±11%     304µs ± 7%  -13.45%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_low_concurrency-12                6.70ms ±10%    7.53ms ±15%     ~     (p=0.095 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_high_concurrency-12               74.3ms ±12%    72.9ms ±11%     ~     (p=0.690 n=5+5)

name                                                                                             old alloc/op   new alloc/op   delta
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_no_concurrency-12         655kB ± 0%     407kB ± 0%  -37.91%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_low_concurrency-12       67.3MB ± 0%    41.6MB ± 0%  -38.17%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_high_concurrency-12       861MB ± 8%     495MB ± 9%  -42.51%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_high_concurrency-12     686MB ± 2%     430MB ± 6%  -37.33%  (p=0.016 n=4+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_no_concurrency-12       655kB ± 0%     406kB ± 0%  -38.06%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_low_concurrency-12     67.2MB ± 3%    41.2MB ± 2%  -38.61%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_no_concurrency-12                 99.0kB ± 0%    99.0kB ± 0%     ~     (p=0.103 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_low_concurrency-12                10.4MB ± 1%    10.4MB ± 1%     ~     (p=0.841 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_high_concurrency-12                106MB ± 2%     133MB ±20%     ~     (p=0.063 n=4+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_no_concurrency-12                 2.15kB ± 1%    2.15kB ± 1%     ~     (p=1.000 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_low_concurrency-12                 447kB ±52%     433kB ±50%     ~     (p=1.000 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_high_concurrency-12               29.5MB ±86%   26.9MB ±103%     ~     (p=1.000 n=5+5)

name                                                                                             old allocs/op  new allocs/op  delta
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_no_concurrency-12         9.03k ± 0%     5.04k ± 0%  -44.26%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_low_concurrency-12         910k ± 0%      507k ± 0%  -44.24%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-user_series_limit_reached,_scenario:_high_concurrency-12       9.73M ± 2%     5.33M ± 3%  -45.18%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_high_concurrency-12     9.05M ± 0%     5.10M ± 2%  -43.69%  (p=0.016 n=4+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_no_concurrency-12       9.04k ± 0%     5.04k ± 0%  -44.23%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_per-metric_series_limit_reached,_scenario:_low_concurrency-12       909k ± 1%      507k ± 1%  -44.28%  (p=0.008 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_no_concurrency-12                  2.04k ± 0%     2.04k ± 0%     ~     (all equal)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_low_concurrency-12                  206k ± 0%      206k ± 0%     ~     (p=0.841 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_bound_samples,_scenario:_high_concurrency-12                2.06M ± 0%     2.15M ± 4%     ~     (p=0.063 n=4+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_no_concurrency-12                   40.0 ± 0%      40.0 ± 0%     ~     (all equal)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_low_concurrency-12                 4.74k ±16%     4.67k ±16%     ~     (p=1.000 n=5+5)
_Ingester_v2PushOnError/failure:_out_of_order_samples,_scenario:_high_concurrency-12                 128k ±69%      119k ±81%     ~     (p=1.000 n=5+5)

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <[email protected]>
@pstibrany (Contributor) left a comment


LGTM. FormatError and global errors are not the idiomatic way of doing it, but using global error values helps avoid allocations unless they are actually needed. Good job!

```go
// Caused by limits.
if firstPartialErr == nil {
	firstPartialErr = ve
if cause == errMaxSeriesPerUserLimitExceeded || cause == errMaxSeriesPerMetricLimitExceeded {
```
@pstibrany (Contributor) commented:

We can simplify the logic on lines 787–820 with:

```go
updateFirstPartial := func(errFn func() error) {
	if firstPartialErr == nil {
		firstPartialErr = errFn()
	}
}

switch cause {
case storage.ErrOutOfBounds:
	sampleOutOfBoundsCount++
	updateFirstPartial(func() error { return wrappedTSDBIngestErr(err, model.Time(s.TimestampMs), ts.Labels) })
	continue
case storage.ErrOutOfOrderSample:
	sampleOutOfOrderCount++
	updateFirstPartial(func() error { return wrappedTSDBIngestErr(err, model.Time(s.TimestampMs), ts.Labels) })
	continue
case storage.ErrDuplicateSampleForTimestamp:
	newValueForTimestampCount++
	updateFirstPartial(func() error { return wrappedTSDBIngestErr(err, model.Time(s.TimestampMs), ts.Labels) })
	continue
case errMaxSeriesPerUserLimitExceeded:
	perUserSeriesLimitCount++
	updateFirstPartial(func() error { return makeLimitError(perUserSeriesLimit, i.limiter.FormatError(userID, cause)) })
	continue
case errMaxSeriesPerMetricLimitExceeded:
	perMetricSeriesLimitCount++
	updateFirstPartial(func() error { return makeMetricLimitError(perMetricSeriesLimit, copiedLabels, i.limiter.FormatError(userID, cause)) })
	continue
}
```

@pracucci (Author) replied:

Good idea, done.

```go
}

var ve *validationError
if errors.As(cause, &ve) {
```
@pracucci (Author) replied:

Removed because, following the code flow, I think the only errors that can be returned here are the per-user and per-metric series limit errors.

Signed-off-by: Marco Pracucci <[email protected]>
@pstibrany (Contributor) left a comment


LGTM, great job!

@pracucci pracucci merged commit c28d326 into cortexproject:master Mar 18, 2021
@pracucci pracucci deleted the optimize-limit-reached branch March 18, 2021 15:33
@pracucci pracucci mentioned this pull request Mar 22, 2021