Alertmanager alerts limits #4253
Conversation
Marking as draft until #4237 is merged.
Can you assign this to me? I'll give it a first pass.
Unfortunately, I can only assign cortex-project members :( But don't let that stop you. Thank you!
Rebased on top of master now; it's ready for review.
Will let Josh do a first pass and come back to this one.
🎖️ Fantastic job @pstibrany! Most of my comments are nits and questions (for the benefit of my understanding).
pkg/alertmanager/alertmanager.go
Outdated
a.mx.Lock()
defer a.mx.Unlock()

if !existing {
An alert is considered unique based on the hash of its labels (its fingerprint). Technically, if an alert comes in with an acceptable size, and its annotations later change so that its size is no longer acceptable, we would let it through, because it was already there and we'd consider its size to be OK.
I believe we need to check that, if the alert does exist, its new size would not tip us over the limit. Unless you want the semantics to be: if we had the alert before, we always accept it regardless of whether it tips us over the limit.
If that's the case, can we add a comment?
If we had the alert before, we'll always accept it regardless of whether it tips us over the limit
That was indeed my intention, mostly to make sure that we don't accidentally reject "resolved" alerts. But if the size has changed, it may make sense to reject it if it's over the limit. I don't have a strong opinion on this.
Also... when a new alert arrives and "merges" with an existing one, it cannot update labels (the fingerprint would change) or annotations (right now, at least).
See the merging code:
func (a *Alert) Merge(o *Alert) *Alert {
or annotations (right now, at least)
I see that is true for the labels, which is expected but is it also for the annotations? I can't seem to see it in the code - am I missing something?
I see that is true for the labels, which is expected but is it also for the annotations? I can't seem to see it in the code - am I missing something?
No, you’re right. I misread the code I linked… line 372 is copying everything over, including annotations.
Makes sense. In this case, I think rejection would be the best course of action - don't you think?
The alert that existed before should have a short enough endsAt that it would resolve itself eventually, so there's little to no risk in just leaving the existing one to resolve by itself. On the contrary, accepting an alert that might tip us over the limit risks our availability.
I think you're right, and I will make this change before merging.
I've changed the logic so that an existing alert that grows and no longer fits within the limit is rejected.
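The final behaviour can be sketched roughly as below. This is a hypothetical, simplified limiter (the `alertLimiter` type, its method names, and its fields are all illustrative, not the actual Cortex implementation): new alerts are rejected when count or size limits would be exceeded, and an existing alert whose replacement would push total size over the limit is rejected too.

```go
package main

import (
	"errors"
	"fmt"
)

type fingerprint uint64

// alertLimiter tracks alert count and total size for one tenant.
// Illustrative sketch only; not the actual Cortex code.
type alertLimiter struct {
	maxCount  int
	maxSize   int
	sizes     map[fingerprint]int // size of each stored alert, by fingerprint
	totalSize int
}

func newAlertLimiter(maxCount, maxSize int) *alertLimiter {
	return &alertLimiter{maxCount: maxCount, maxSize: maxSize, sizes: map[fingerprint]int{}}
}

// preStore checks whether an alert with the given fingerprint and size
// may be stored. An existing alert that grows past the size limit is
// rejected too, matching the behaviour agreed above.
func (l *alertLimiter) preStore(fp fingerprint, newSize int) error {
	oldSize, existing := l.sizes[fp]

	if !existing && len(l.sizes) >= l.maxCount {
		return errors.New("too many alerts")
	}
	// For an existing alert, only the size delta counts against the limit.
	if l.totalSize-oldSize+newSize > l.maxSize {
		return errors.New("alerts size limit exceeded")
	}
	return nil
}

// postStore records the alert after it has been accepted.
func (l *alertLimiter) postStore(fp fingerprint, newSize int) {
	l.totalSize += newSize - l.sizes[fp]
	l.sizes[fp] = newSize
}

func main() {
	l := newAlertLimiter(100, 1000)
	fmt.Println(l.preStore(1, 600)) // <nil>: fits
	l.postStore(1, 600)
	fmt.Println(l.preStore(2, 600)) // error: 600+600 would exceed 1000
	fmt.Println(l.preStore(1, 900)) // <nil>: replaces the existing 600-byte alert
	l.postStore(1, 900)
	fmt.Println(l.preStore(1, 1100)) // error: existing alert grew past the limit
}
```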
Good job! I left a couple of comments; I would be glad if you could take a look 🙏 🙇
Thank you for the reviews, I've addressed all your comments. Please take a look again.
LGTM (modulo a nit). Thanks a lot for addressing my feedback 🙏
Signed-off-by: Peter Štibraný <[email protected]>
What this PR does: This PR adds new limits in Alertmanager:

- `-alertmanager.max-alerts-count`
- `-alertmanager.max-alerts-size-bytes`

These are overridable per tenant. When the limits are reached, Alertmanager rejects further alerts, produces a log message, and increments the `cortex_alertmanager_insert_alert_failures_total` metric. Additional metrics that track the behaviour of the alerts limiter are `cortex_alertmanager_alerts_limiter_current_alerts_count` and `cortex_alertmanager_alerts_limiter_current_alerts_size_bytes`.

This PR builds on top of #4237, and will be rebased once #4237 is merged.
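For illustration, per-tenant overrides in Cortex typically go through the runtime configuration's overrides file. The snippet below is a hypothetical sketch: the tenant name and the `overrides` field names mapped from the CLI flags are assumptions, not taken from this PR.

```yaml
# Hypothetical runtime-config sketch; field names are assumed
# from the CLI flag names, not confirmed by this PR.
overrides:
  tenant-a:
    alertmanager_max_alerts_count: 1000
    alertmanager_max_alerts_size_bytes: 5242880  # 5 MiB
```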
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`