Delete alertmanager state objects from remote storage on user deletion. #4167
Conversation
The same pattern as for cleaning up local state is used - checking for state objects to clean up whenever the users are synced, and deleting them if the user has no configuration. This means that we perform an additional listing operation, as opposed to doing it on demand when the user configuration is deleted. On the other hand, it makes the deletion more robust against e.g. the process dying before cleaning up, or transient storage errors causing deletion to fail.

Note that a different mechanism is used to check whether remote user state should be deleted, compared to local state, because not every user may be currently active in every alertmanager instance due to sharding. Therefore, we only delete state for users which also have no configuration. The user listing is re-used from the user sync.

Signed-off-by: Steve Simpson <[email protected]>
Signed-off-by: Steve Simpson <[email protected]>
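For context, the following is a minimal, self-contained sketch of the cleanup flow described above. It is not the actual Cortex implementation: the standalone function and the stateStore interface are simplifications, and the method names ListUsersWithState (referenced later in this review) and DeleteFullState are assumed for illustration.

package cleanup

import (
	"context"
	"log"
)

// stateStore captures only the two operations the cleanup needs. The method names
// are illustrative stand-ins for the real alert store API.
type stateStore interface {
	ListUsersWithState(ctx context.Context) ([]string, error)
	DeleteFullState(ctx context.Context, user string) error
}

// deleteUnusedRemoteUserState deletes remote state for every user that has state
// objects but no configuration. allUsers is the user listing re-used from the sync.
func deleteUnusedRemoteUserState(ctx context.Context, store stateStore, allUsers []string) {
	// Build the set of users that still have a configuration.
	usersWithConfig := make(map[string]struct{}, len(allUsers))
	for _, user := range allUsers {
		usersWithConfig[user] = struct{}{}
	}

	// List every user with state objects in remote storage.
	usersWithState, err := store.ListUsersWithState(ctx)
	if err != nil {
		log.Printf("failed to list users with state: %v", err)
		return
	}

	// Delete state for users without a configuration. Failures are only logged:
	// the same check runs again on the next sync, so transient storage errors or
	// the process dying mid-cleanup are retried later.
	for _, user := range usersWithState {
		if _, ok := usersWithConfig[user]; ok {
			continue
		}
		if err := store.DeleteFullState(ctx, user); err != nil {
			log.Printf("failed to delete remote state for user %s: %v", user, err)
			continue
		}
		log.Printf("deleted remote state for user %s", user)
	}
}

The key design point from the description is that failures are not fatal: because the check runs on every sync rather than once at deletion time, missed deletions are picked up on a later pass.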
Very good work! The logic LGTM. I just left a few nits.
pkg/alertmanager/multitenant.go (outdated)

@@ -697,34 +704,33 @@ func (am *MultitenantAlertmanager) stopping(_ error) error {
}

// loadAlertmanagerConfigs Loads (and filters) the alertmanagers configuration from object storage, taking into consideration the sharding strategy.
// Returns the list of all users with a configuration, and the set of users which should be configured in this instance.
func (am *MultitenantAlertmanager) loadAlertmanagerConfigs(ctx context.Context) (map[string]alertspb.AlertConfigDesc, error) {
and the set of users which should be configured in this instance
It returns all discovered users, not the ones that should be configured in this instance.
It returns both, but I can improve the wording a little.
I actually misread the comment initially. It's good.
@@ -1147,6 +1153,35 @@ func (am *MultitenantAlertmanager) UpdateState(ctx context.Context, part *cluste
	return &alertmanagerpb.UpdateStateResponse{Status: alertmanagerpb.OK}, nil
}

// deleteUnusedRemoteUserState deletes state objects in remote storage for users that are no longer configured.
func (am *MultitenantAlertmanager) deleteUnusedRemoteUserState(ctx context.Context, allUsers []string) {
[food for thought] This will be executed on every alertmanager replica, so the number of ListUsersWithState() operations will scale linearly with the number of replicas. I'm wondering whether this cleanup would be enough if done only by the alertmanager replica holding a hardcoded token (e.g. token 0).
@pstibrany @stevesg What's your take on this? I'm dubious, so asking.
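To make the suggestion concrete, here is a rough sketch of how a replica could gate the cleanup on ownership of a hardcoded token. The tokenOwnershipRing interface and the OwnsToken helper are hypothetical simplifications for illustration, not the actual Cortex ring API.

package cleanup

// tokenOwnershipRing is a simplified stand-in for the real ring; OwnsToken is a
// hypothetical helper that reports whether this instance owns the given token.
type tokenOwnershipRing interface {
	OwnsToken(token uint32) (bool, error)
}

// cleanupToken is the hardcoded token from the suggestion (e.g. token 0).
const cleanupToken uint32 = 0

// shouldRunCleanup returns true only on the replica that currently owns the cleanup
// token, so the listing operation runs roughly once per poll interval cluster-wide.
func shouldRunCleanup(r tokenOwnershipRing) bool {
	owns, err := r.OwnsToken(cleanupToken)
	if err != nil {
		// On a ring lookup error, skip this cycle; the next sync will try again.
		return false
	}
	return owns
}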
It only needs to be done by one/any instance; in fact it's a completely independent process. I don't have a good answer as to whether the saving is worth the added complexity though. One bucket iteration per PollInterval per replica doesn't seem excessive. Reducing the frequency of the cleanup would also have a good effect (e.g. an interval of 4x PollInterval).
Let's wait for @pstibrany's feedback. We need a 2nd reviewer anyway, so I wouldn't be able to merge the PR right now.
I don't have a strong opinion on whether to do it from a single instance or from all of them... doing it from all is certainly easier. To reduce the API operations, we can do it less often than on every sync, and add some jitter to the process.
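As a rough illustration of running the cleanup less often and with jitter, the sketch below derives the delay between cleanup passes from the configured poll interval. The 4x multiplier matches the "4x PollInterval" example above; the 25% jitter fraction is an arbitrary choice for illustration.

package cleanup

import (
	"math/rand"
	"time"
)

// Illustrative values: run the cleanup every 4 poll intervals, with +/-25% jitter.
const (
	cleanupIntervalMultiplier = 4
	jitterFraction            = 0.25
)

// nextCleanupDelay returns how long to wait before the next cleanup pass, so that
// replicas spread their bucket listings out instead of all hitting the object store
// at the same moment.
func nextCleanupDelay(pollInterval time.Duration) time.Duration {
	base := time.Duration(cleanupIntervalMultiplier) * pollInterval
	jitter := time.Duration((rand.Float64()*2 - 1) * jitterFraction * float64(base))
	return base + jitter
}

A timer driven by this delay could trigger the cleanup independently of the per-sync path, keeping the listing cost decoupled from the sync frequency.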
Signed-off-by: Steve Simpson <[email protected]>
LGTM, thanks!
What this PR does:
The same pattern as for cleaning up local state is used - checking for
state objects to clean up whenever the users are synced, and deleting
them if the user has no configuration. This means that we perform an
additional listing operation, as opposed to doing it on demand when the
user configuration is deleted. On the other hand, it makes the deletion
more robust against e.g. the process dying before cleaning up, or
transient storage errors causing deletion to fail.
Note that a different mechanism is used to check whether remote user state
should be deleted, compared to local state, because not every user may
be currently active in every alertmanager instance due to sharding.
Therefore, we only delete state for users which also have no
configuration. The user listing is re-used from the user sync.
Checklist
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]