Alertmanager alerts limits #4253
Conversation
Marking as draft until #4237 is merged.
Can you assign this to me? I'll give it a first pass.
Unfortunately, I can only assign cortex-project members :( But don't let that stop you. Thank you!
Rebased on top of master now; it's ready for review.
Will let Josh do a first pass and come back to this one.
🎖️ Fantastic job @pstibrany! Most of my comments are nits and questions (for the benefit of my understanding).
pkg/alertmanager/alertmanager.go
Outdated
a.mx.Lock()
defer a.mx.Unlock()

if !existing {
An alert is considered unique based on the hash of its labels (its fingerprint). Technically, if an alert comes in with an acceptable size, and its annotations later change so that its size is no longer acceptable, we would let it through, because it was already there and we'd consider its size to be OK.
I believe we need to check that, if the alert does exist, its new size would not tip us over the limit. Unless you want the semantics to be: if we had the alert before, we always accept it regardless of whether it tips us over the limit.
If that's the case, can we add a comment?
If we had the alert before, we'll always accept it regardless of whether it tips us over the limit
That was indeed my intention, mostly to make sure that we don't accidentally reject "resolved" alerts. But if the size has changed, it may make sense to reject it if it's over the limit. I don't have a strong opinion on this.
Also... when a new alert arrives and "merges" with an existing one, it cannot update labels (the fingerprint would change) or annotations (right now, at least).
See the merging code:
func (a *Alert) Merge(o *Alert) *Alert {
or annotations (right now, at least)
I see that is true for the labels, which is expected but is it also for the annotations? I can't seem to see it in the code - am I missing something?
I see that is true for the labels, which is expected but is it also for the annotations? I can't seem to see it in the code - am I missing something?
No, you’re right. I misread the code I linked… line 372 is copying everything over, including annotations.
Makes sense. In this case, I think rejection would be the best course of action - don't you think?
The alert that existed before should have a short enough endsAt that it would resolve itself eventually, so there's little to no risk in just leaving the existing one to resolve by itself. On the contrary, accepting an alert that might tip us over the limit risks our availability.
I think you're right, and I will make this change before merging.
I've changed the logic so that an existing alert that grows and no longer fits within the limit is rejected.
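The final behaviour can be sketched roughly as below. This is a hypothetical, simplified limiter (the `alertLimiter` type, its method names, and its fields are all illustrative, not the actual Cortex implementation): new alerts are rejected when count or size limits would be exceeded, and an existing alert whose replacement would push total size over the limit is rejected too.

```go
package main

import (
	"errors"
	"fmt"
)

type fingerprint uint64

// alertLimiter tracks alert count and total size for one tenant.
// Illustrative sketch only; not the actual Cortex code.
type alertLimiter struct {
	maxCount  int
	maxSize   int
	sizes     map[fingerprint]int // size of each stored alert, by fingerprint
	totalSize int
}

func newAlertLimiter(maxCount, maxSize int) *alertLimiter {
	return &alertLimiter{maxCount: maxCount, maxSize: maxSize, sizes: map[fingerprint]int{}}
}

// preStore checks whether an alert with the given fingerprint and size
// may be stored. An existing alert that grows past the size limit is
// rejected too, matching the behaviour agreed above.
func (l *alertLimiter) preStore(fp fingerprint, newSize int) error {
	oldSize, existing := l.sizes[fp]

	if !existing && len(l.sizes) >= l.maxCount {
		return errors.New("too many alerts")
	}
	// For an existing alert, only the size delta counts against the limit.
	if l.totalSize-oldSize+newSize > l.maxSize {
		return errors.New("alerts size limit exceeded")
	}
	return nil
}

// postStore records the alert after it has been accepted.
func (l *alertLimiter) postStore(fp fingerprint, newSize int) {
	l.totalSize += newSize - l.sizes[fp]
	l.sizes[fp] = newSize
}

func main() {
	l := newAlertLimiter(100, 1000)
	fmt.Println(l.preStore(1, 600)) // <nil>: fits
	l.postStore(1, 600)
	fmt.Println(l.preStore(2, 600)) // error: 600+600 would exceed 1000
	fmt.Println(l.preStore(1, 900)) // <nil>: replaces the existing 600-byte alert
	l.postStore(1, 900)
	fmt.Println(l.preStore(1, 1100)) // error: existing alert grew past the limit
}
```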
Good job! I left a couple of comments; I would be glad if you could take a look 🙏 🙇
Thank you for the reviews, I've addressed all your comments. Please take a look again.
LGTM (modulo a nit). Thanks a lot for addressing my feedback 🙏
Signed-off-by: Peter Štibraný <[email protected]>
What this PR does: This PR adds new limits in Alertmanager:

- `-alertmanager.max-alerts-count`
- `-alertmanager.max-alerts-size-bytes`

These are overridable per tenant. When the limits are reached, Alertmanager rejects further alerts, produces a log message, and increments the `cortex_alertmanager_insert_alert_failures_total` metric. Additional metrics that track the behaviour of the alerts limiter are `cortex_alertmanager_alerts_limiter_current_alerts_count` and `cortex_alertmanager_alerts_limiter_current_alerts_size_bytes`.

This PR builds on top of #4237, and will be rebased once #4237 is merged.
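For illustration, per-tenant overrides in Cortex typically go through the runtime configuration's overrides file. The snippet below is a hypothetical sketch: the tenant name and the `overrides` field names mapped from the CLI flags are assumptions, not taken from this PR.

```yaml
# Hypothetical runtime-config sketch; field names are assumed
# from the CLI flag names, not confirmed by this PR.
overrides:
  tenant-a:
    alertmanager_max_alerts_count: 1000
    alertmanager_max_alerts_size_bytes: 5242880  # 5 MiB
```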
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`