Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alertmanager: Rate limit email notifier #4135

Merged
merged 9 commits into from
May 6, 2021

Conversation

pstibrany
Copy link
Contributor

What this PR does: This PR adds ability to rate-limit email notifications sent by Alertmanager. Rate-limited notifications are dropped. When running multiple alertmanagers, rate-limits are applied on each alertmanager individually. Rate limits are configurable per tenant.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@gouthamve
Copy link
Contributor

Can you rebase the changelog against master to pull in #4137?

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! LGTM modulo a couple of comments.

CHANGELOG.md Outdated
@@ -32,6 +32,7 @@
* `-memberlist.tls-insecure-skip-verify`
* [CHANGE] Cortex now fast fails on startup if unable to connect to the ring backend. #4068
* [FEATURE] Ruler: added `local` backend support to the ruler storage configuration under the `-ruler-storage.` flag prefix. #3932
* [FEATURE] Alertmanager: Added rate-limits to email notifier. Rate limits can be configured using `-alertmanager.email-notification-rate-limit` and `-alertmanager.email-notification-burst-size`. These limits are applied on individual alertmanagers. Rate-limited email notifications are failed notifications. It is possible to monitor rate-limited notifications via new `cortex_alertmanager_notification_rate_limited_total` metric. #4135
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to rebase and move to the top.

return false, errRateLimited
}

return r.upstream.Notify(ctx, alerts...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should cancel the limiter reservation if the upstream returns error. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure. These limits will be already quite high. If user is hitting them, it's likely that user is doing something wrong or bad.

Imagine someone abusing alertmanager notifications for sending lot of requests to a website, which starts crashing and return 500. This can look like failed notification – if we cancel the reservation, it allows bad actor to keep sending more requests. (This PR is only dealing with emails, but we will reuse this for other integrations).

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point and your example makes sense. I'm also thinking about the opposite case: a legit receiver backend server is down, retries hit the rate limit even if no notification has been successfully delivered. However, since we can't distinguish it, it's probably safer to keep the current logic as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @pstibrany we should count against the limit whether success or fail, because opens some possibilities for abuse/broken things to make it through if done the other way.

Comment on lines 100 to 101
EmailNotificationRateLimit float64 `yaml:"email_notification_rate_limit" json:"email_notification_rate_limit"`
EmailNotificationBurstSize int `yaml:"email_notification_burst_size" json:"email_notification_burst_size"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have YAML config names in limits specular to CLI flags we need to also add alertmanager_. This also helps clarifying which component it applies to.

Suggested change
EmailNotificationRateLimit float64 `yaml:"email_notification_rate_limit" json:"email_notification_rate_limit"`
EmailNotificationBurstSize int `yaml:"email_notification_burst_size" json:"email_notification_burst_size"`
EmailNotificationRateLimit float64 `yaml:"alertmanager_email_notification_rate_limit" json:"alertmanager_email_notification_rate_limit"`
EmailNotificationBurstSize int `yaml:"alertmanager_email_notification_burst_size" json:"alertmanager_email_notification_burst_size"`

Copy link
Contributor

@ranton256 ranton256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, other than a question about config file plans for related work.

return false, errRateLimited
}

return r.upstream.Notify(ctx, alerts...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @pstibrany we should count against the limit whether success or fail, because opens some possibilities for abuse/broken things to make it through if done the other way.


# Per-user rate limit for sending email notifications from Alertmanager in
# emails/sec. 0 = rate limit disabled. Negative value = no emails are allowed.
# CLI flag: -alertmanager.email-notification-rate-limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to have separate configuration for each notification type? I think there are some advantages to that in flexibility, but might make the config a bit complicated if this is extended to all the receivers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to have separate configuration for each notification type?

Yes, that was my plan. Do you suggest to use single rate-limit configuration for all notification types? We can also have one generic rate-limit config for all integrations, with per-integration-type overrides in case they are needed. WDYT?

Would you also suggest to use single rate-limiter shared across all notifiers? (I guess not, because eg. too many webhook notifications could stop email notifications).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also have one generic rate-limit config for all integrations, with per-integration-type overrides in case they are needed.

Ah... we cannot easily distinguish between default values and missing values, so this would be difficult.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... we cannot easily distinguish between default values and missing values, so this would be difficult.

Can we use pointer values in the config? If pointer is nil then it has not been set. I haven't checked if it's doable, just asking.

Copy link
Contributor Author

@pstibrany pstibrany May 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use pointer values in the config? If pointer is nil then it has not been set. I haven't checked if it's doable, just asking.

This is doable, but could lead to confusing rulers when trying to setup overrides correctly.

Eg.:

global limits:

  • no defined shared (for all integrations) rate limit (a)
  • defined email rate limit (b)

user A:

  • defined shared (for all integrations) rate limit (c)
  • undefined email rate limit (d)

Now when computing rate limit for email integration for "user A", we can either use his shared limits (c), or global email rate limits (b), but it's unclear what is a better option, and no matter what we choose, we will confuse some people.

I would leave that to a separate discussion and PR for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave that to a separate discussion and PR for now.

I agree on this. @ranton256 What do you think? Some feedback on this would be great. Thanks!

pstibrany added 8 commits May 6, 2021 13:43
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
Signed-off-by: Peter Štibraný <[email protected]>
@pstibrany pstibrany force-pushed the rate-limited-notifier branch from f7aeb71 to 223de7c Compare May 6, 2021 11:43
@pstibrany pstibrany enabled auto-merge (squash) May 6, 2021 14:02
@pstibrany pstibrany merged commit 0595579 into cortexproject:master May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants