-
Notifications
You must be signed in to change notification settings - Fork 812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alertmanager: Rate limit email notifier #4135
Alertmanager: Rate limit email notifier #4135
Conversation
Can you rebase the changelog against master to pull in #4137? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job! LGTM modulo a couple of comments.
CHANGELOG.md
Outdated
@@ -32,6 +32,7 @@ | |||
* `-memberlist.tls-insecure-skip-verify` | |||
* [CHANGE] Cortex now fast fails on startup if unable to connect to the ring backend. #4068 | |||
* [FEATURE] Ruler: added `local` backend support to the ruler storage configuration under the `-ruler-storage.` flag prefix. #3932 | |||
* [FEATURE] Alertmanager: Added rate-limits to email notifier. Rate limits can be configured using `-alertmanager.email-notification-rate-limit` and `-alertmanager.email-notification-burst-size`. These limits are applied on individual alertmanagers. Rate-limited email notifications are failed notifications. It is possible to monitor rate-limited notifications via new `cortex_alertmanager_notification_rate_limited_total` metric. #4135 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to rebase and move to the top.
return false, errRateLimited | ||
} | ||
|
||
return r.upstream.Notify(ctx, alerts...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should cancel the limiter reservation if the upstream returns error. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm not sure. These limits will be already quite high. If user is hitting them, it's likely that user is doing something wrong or bad.
Imagine someone abusing alertmanager notifications for sending lot of requests to a website, which starts crashing and return 500. This can look like failed notification – if we cancel the reservation, it allows bad actor to keep sending more requests. (This PR is only dealing with emails, but we will reuse this for other integrations).
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point and your example makes sense. I'm also thinking about the opposite case: a legit receiver backend server is down, retries hit the rate limit even if no notification has been successfully delivered. However, since we can't distinguish it, it's probably safer to keep the current logic as you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @pstibrany we should count against the limit whether success or fail, because opens some possibilities for abuse/broken things to make it through if done the other way.
pkg/util/validation/limits.go
Outdated
EmailNotificationRateLimit float64 `yaml:"email_notification_rate_limit" json:"email_notification_rate_limit"` | ||
EmailNotificationBurstSize int `yaml:"email_notification_burst_size" json:"email_notification_burst_size"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To have YAML config names in limits specular to CLI flags we need to also add alertmanager_
. This also helps clarifying which component it applies to.
EmailNotificationRateLimit float64 `yaml:"email_notification_rate_limit" json:"email_notification_rate_limit"` | |
EmailNotificationBurstSize int `yaml:"email_notification_burst_size" json:"email_notification_burst_size"` | |
EmailNotificationRateLimit float64 `yaml:"alertmanager_email_notification_rate_limit" json:"alertmanager_email_notification_rate_limit"` | |
EmailNotificationBurstSize int `yaml:"alertmanager_email_notification_burst_size" json:"alertmanager_email_notification_burst_size"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, other than a question about config file plans for related work.
return false, errRateLimited | ||
} | ||
|
||
return r.upstream.Notify(ctx, alerts...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @pstibrany we should count against the limit whether success or fail, because opens some possibilities for abuse/broken things to make it through if done the other way.
|
||
# Per-user rate limit for sending email notifications from Alertmanager in | ||
# emails/sec. 0 = rate limit disabled. Negative value = no emails are allowed. | ||
# CLI flag: -alertmanager.email-notification-rate-limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to have separate configuration for each notification type? I think there are some advantages to that in flexibility, but might make the config a bit complicated if this is extended to all the receivers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to have separate configuration for each notification type?
Yes, that was my plan. Do you suggest to use single rate-limit configuration for all notification types? We can also have one generic rate-limit config for all integrations, with per-integration-type overrides in case they are needed. WDYT?
Would you also suggest to use single rate-limiter shared across all notifiers? (I guess not, because eg. too many webhook notifications could stop email notifications).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also have one generic rate-limit config for all integrations, with per-integration-type overrides in case they are needed.
Ah... we cannot easily distinguish between default values and missing values, so this would be difficult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... we cannot easily distinguish between default values and missing values, so this would be difficult.
Can we use pointer values in the config? If pointer is nil then it has not been set. I haven't checked if it's doable, just asking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use pointer values in the config? If pointer is nil then it has not been set. I haven't checked if it's doable, just asking.
This is doable, but could lead to confusing rulers when trying to setup overrides correctly.
Eg.:
global limits:
- no defined shared (for all integrations) rate limit (a)
- defined email rate limit (b)
user A:
- defined shared (for all integrations) rate limit (c)
- undefined email rate limit (d)
Now when computing rate limit for email integration for "user A", we can either use his shared limits (c), or global email rate limits (b), but it's unclear what is a better option, and no matter what we choose, we will confuse some people.
I would leave that to a separate discussion and PR for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave that to a separate discussion and PR for now.
I agree on this. @ranton256 What do you think? Some feedback on this would be great. Thanks!
86c711b
to
88b885f
Compare
efadfa8
to
f7aeb71
Compare
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
f7aeb71
to
223de7c
Compare
What this PR does: This PR adds ability to rate-limit email notifications sent by Alertmanager. Rate-limited notifications are dropped. When running multiple alertmanagers, rate-limits are applied on each alertmanager individually. Rate limits are configurable per tenant.
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]