Alertmanager: Rate limit email notifier #4135

pstibrany · 2021-04-28T12:25:31Z

What this PR does: This PR adds the ability to rate-limit email notifications sent by Alertmanager. Rate-limited notifications are dropped. When running multiple Alertmanagers, rate limits are applied on each Alertmanager individually. Rate limits are configurable per tenant.

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

gouthamve · 2021-04-28T15:06:20Z

Can you rebase the changelog against master to pull in #4137?

pracucci

Good job! LGTM modulo a couple of comments.

pracucci · 2021-04-28T15:28:03Z

CHANGELOG.md

@@ -32,6 +32,7 @@
    * `-memberlist.tls-insecure-skip-verify`
 * [CHANGE] Cortex now fast fails on startup if unable to connect to the ring backend. #4068
 * [FEATURE] Ruler: added `local` backend support to the ruler storage configuration under the `-ruler-storage.` flag prefix. #3932
+* [FEATURE] Alertmanager: Added rate-limits to email notifier. Rate limits can be configured using `-alertmanager.email-notification-rate-limit` and `-alertmanager.email-notification-burst-size`. These limits are applied on individual alertmanagers. Rate-limited email notifications are failed notifications. It is possible to monitor rate-limited notifications via new `cortex_alertmanager_notification_rate_limited_total` metric. #4135


Remember to rebase and move to the top.

pracucci · 2021-04-28T15:35:27Z

pkg/alertmanager/rate_limited_notifier.go

+		return false, errRateLimited
+	}
+
+	return r.upstream.Notify(ctx, alerts...)


I think we should cancel the limiter reservation if the upstream returns error. WDYT?

Hmm, I'm not sure. These limits will already be quite high. If a user is hitting them, it's likely that the user is doing something wrong or bad.

Imagine someone abusing Alertmanager notifications to send a lot of requests to a website, which starts crashing and returning 500s. This can look like a failed notification; if we cancel the reservation, it allows the bad actor to keep sending more requests. (This PR only deals with emails, but we will reuse this for other integrations.)

WDYT?

I see your point and your example makes sense. I'm also thinking about the opposite case: a legit receiver backend server is down, and retries hit the rate limit even though no notification has been successfully delivered. However, since we can't distinguish between the two cases, it's probably safer to keep the current logic as you suggest.

I agree with @pstibrany: we should count against the limit whether the notification succeeds or fails, because doing it the other way opens up possibilities for abuse or broken integrations to get through.

pracucci · 2021-04-28T15:37:00Z

pkg/util/validation/limits.go

+	EmailNotificationRateLimit float64 `yaml:"email_notification_rate_limit" json:"email_notification_rate_limit"`
+	EmailNotificationBurstSize int     `yaml:"email_notification_burst_size" json:"email_notification_burst_size"`


To have the YAML config names in limits mirror the CLI flags, we also need to add the `alertmanager_` prefix. This also helps clarify which component the setting applies to.

Suggested change

-	EmailNotificationRateLimit float64 `yaml:"email_notification_rate_limit" json:"email_notification_rate_limit"`
-	EmailNotificationBurstSize int     `yaml:"email_notification_burst_size" json:"email_notification_burst_size"`
+	EmailNotificationRateLimit float64 `yaml:"alertmanager_email_notification_rate_limit" json:"alertmanager_email_notification_rate_limit"`
+	EmailNotificationBurstSize int     `yaml:"alertmanager_email_notification_burst_size" json:"alertmanager_email_notification_burst_size"`
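
To illustrate why the field names should mirror the flags, here is a small sketch that registers the two CLI flags named in the changelog and then applies a per-tenant override using the prefixed keys. It uses `encoding/json` instead of YAML only to stay within the standard library. The `parseLimits` helper, the flag help strings, and the default values of 0 are assumptions for illustration, not taken from the PR.

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
)

// Limits holds only the two fields discussed in this thread.
type Limits struct {
	EmailNotificationRateLimit float64 `yaml:"alertmanager_email_notification_rate_limit" json:"alertmanager_email_notification_rate_limit"`
	EmailNotificationBurstSize int     `yaml:"alertmanager_email_notification_burst_size" json:"alertmanager_email_notification_burst_size"`
}

// RegisterFlags wires the fields to the CLI flags. Defaults are hypothetical.
func (l *Limits) RegisterFlags(f *flag.FlagSet) {
	f.Float64Var(&l.EmailNotificationRateLimit, "alertmanager.email-notification-rate-limit", 0, "Per-user email rate limit in emails/sec.")
	f.IntVar(&l.EmailNotificationBurstSize, "alertmanager.email-notification-burst-size", 0, "Per-user email burst size.")
}

// parseLimits (hypothetical helper) applies flags first, then lets a
// JSON override document replace only the keys it actually contains.
func parseLimits(args []string, overrides []byte) (Limits, error) {
	var l Limits
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	l.RegisterFlags(fs)
	if err := fs.Parse(args); err != nil {
		return l, err
	}
	if len(overrides) > 0 {
		if err := json.Unmarshal(overrides, &l); err != nil {
			return l, err
		}
	}
	return l, nil
}

func main() {
	l, err := parseLimits(
		[]string{"-alertmanager.email-notification-rate-limit=0.5"},
		[]byte(`{"alertmanager_email_notification_burst_size": 10}`),
	)
	fmt.Println(l, err)
}
```

A key without the prefix, such as `email_notification_burst_size`, would be silently ignored by this decoder, which is one practical reason to settle the names before release.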

ranton256

LGTM, other than a question about config file plans for related work.

ranton256 · 2021-04-28T23:54:33Z

pkg/alertmanager/rate_limited_notifier.go

+		return false, errRateLimited
+	}
+
+	return r.upstream.Notify(ctx, alerts...)


I agree with @pstibrany: we should count against the limit whether the notification succeeds or fails, because doing it the other way opens up possibilities for abuse or broken integrations to get through.

ranton256 · 2021-04-28T23:55:42Z

docs/configuration/config-file-reference.md

+
+# Per-user rate limit for sending email notifications from Alertmanager in
+# emails/sec. 0 = rate limit disabled. Negative value = no emails are allowed.
+# CLI flag: -alertmanager.email-notification-rate-limit


Are you planning to have separate configuration for each notification type? I think there are some advantages to that in flexibility, but might make the config a bit complicated if this is extended to all the receivers.

> Are you planning to have separate configuration for each notification type?

Yes, that was my plan. Are you suggesting a single rate-limit configuration for all notification types? We could also have one generic rate-limit config for all integrations, with per-integration-type overrides in case they are needed. WDYT?

Would you also suggest using a single rate limiter shared across all notifiers? (I guess not, because e.g. too many webhook notifications could stop email notifications.)

> We can also have one generic rate-limit config for all integrations, with per-integration-type overrides in case they are needed.

Ah... we cannot easily distinguish between default values and missing values, so this would be difficult.

> Ah... we cannot easily distinguish between default values and missing values, so this would be difficult.

Can we use pointer values in the config? If pointer is nil then it has not been set. I haven't checked if it's doable, just asking.

> Can we use pointer values in the config? If pointer is nil then it has not been set. I haven't checked if it's doable, just asking.

This is doable, but it could lead to confusing rules when trying to set up overrides correctly.

E.g.:

global limits:

* no defined shared (for all integrations) rate limit (a)
* defined email rate limit (b)

user A:

* defined shared (for all integrations) rate limit (c)
* undefined email rate limit (d)

Now, when computing the rate limit for the email integration for "user A", we can either use their shared limit (c) or the global email rate limit (b), but it's unclear which is the better option, and no matter what we choose, we will confuse some people.
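
The ambiguity in this example can be made concrete with a short sketch. Pointer fields model "not set" as `nil`, and two equally plausible resolution orders give different answers for user A. All type names, function names, and numeric values here are hypothetical and chosen only to illustrate the problem; nothing like this exists in the PR.

```go
package main

import "fmt"

// integrationLimits uses pointers so that nil means "not set".
type integrationLimits struct {
	Shared *float64 // rate limit shared by all integrations
	Email  *float64 // email-specific rate limit
}

func ptr(v float64) *float64 { return &v }

// firstSet returns the first non-nil candidate, in order.
func firstSet(candidates ...*float64) (float64, bool) {
	for _, p := range candidates {
		if p != nil {
			return *p, true
		}
	}
	return 0, false
}

// resolveTenantFirst prefers anything the tenant defined over global values.
func resolveTenantFirst(global, tenant integrationLimits) (float64, bool) {
	return firstSet(tenant.Email, tenant.Shared, global.Email, global.Shared)
}

// resolveSpecificFirst prefers an email-specific value over a shared one,
// regardless of whether it comes from the tenant or the global config.
func resolveSpecificFirst(global, tenant integrationLimits) (float64, bool) {
	return firstSet(tenant.Email, global.Email, tenant.Shared, global.Shared)
}

func main() {
	global := integrationLimits{Email: ptr(5)} // (a) unset, (b) = 5
	userA := integrationLimits{Shared: ptr(1)} // (c) = 1, (d) unset

	tf, _ := resolveTenantFirst(global, userA)
	sf, _ := resolveSpecificFirst(global, userA)
	fmt.Println("tenant-first:", tf)   // picks (c)
	fmt.Println("specific-first:", sf) // picks (b)
}
```

Both orders are defensible, which is the source of the confusion described above: the same configuration yields either the tenant's shared limit or the global email limit depending on a rule the operator has to remember.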

I would leave that to a separate discussion and PR for now.

> I would leave that to a separate discussion and PR for now.

I agree on this. @ranton256 What do you think? Some feedback on this would be great. Thanks!

Signed-off-by: Peter Štibraný <[email protected]>

pull-request-size bot added the size/L label Apr 28, 2021

pracucci approved these changes Apr 28, 2021


ranton256 approved these changes Apr 28, 2021


pstibrany force-pushed the rate-limited-notifier branch from 86c711b to 88b885f Compare April 29, 2021 07:12

pracucci mentioned this pull request Apr 29, 2021

Allow to override Alertmanager receivers firewall settings on a per-tenant basis #4143


pstibrany force-pushed the rate-limited-notifier branch from efadfa8 to f7aeb71 Compare May 3, 2021 07:06

pstibrany added 8 commits May 6, 2021 13:43

Introduce rate-limit for sending email notifications from alertmanager.

9b20c00

Signed-off-by: Peter Štibraný <[email protected]>

Don't retry failed rate-limited notifications.

be3a1f1

Signed-off-by: Peter Štibraný <[email protected]>

Added test to verify that email notifier is used with rate limits.

2ba3dce

Signed-off-by: Peter Štibraný <[email protected]>

Improve documentation.

0946c25

Signed-off-by: Peter Štibraný <[email protected]>

CHANGELOG.md

f85e1e1

Signed-off-by: Peter Štibraný <[email protected]>

Rename yaml fields, and add unit test.

9255abb

Signed-off-by: Peter Štibraný <[email protected]>

Fix documentation.

1838de9

Signed-off-by: Peter Štibraný <[email protected]>

Moved changelog entry.

223de7c

Signed-off-by: Peter Štibraný <[email protected]>

pstibrany force-pushed the rate-limited-notifier branch from f7aeb71 to 223de7c Compare May 6, 2021 11:43

Merge branch 'master' into rate-limited-notifier

e096720

pstibrany enabled auto-merge (squash) May 6, 2021 14:02

pstibrany merged commit 0595579 into cortexproject:master May 6, 2021


