Delete alertmanager state objects from remote storage on user deletion. #4167
Conversation
The same pattern as for cleaning up local state is used - checking for state objects to clean up whenever the users are synced, and deleting them if the user has no configuration. This means that we perform an additional listing operation, as opposed to doing it on demand when the user configuration is deleted. On the other hand, it makes the deletion more robust against e.g. the process dying before cleaning up, or transient storage errors causing deletion to fail.

Note that a different mechanism is used to check whether remote user state should be deleted, compared to local state, because not every user may be currently active in every alertmanager instance due to sharding. Therefore, we only delete state for users which also have no configuration. The user listing is re-used from the user sync.

Signed-off-by: Steve Simpson <[email protected]>
Signed-off-by: Steve Simpson <[email protected]>
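For context, the following is a minimal, self-contained sketch of the cleanup flow described above. It is not the actual Cortex implementation: the standalone function and the stateStore interface are simplifications, and the method names ListUsersWithState (referenced later in this review) and DeleteFullState are assumed for illustration.

package cleanup

import (
	"context"
	"log"
)

// stateStore captures only the two operations the cleanup needs. The method names
// are illustrative stand-ins for the real alert store API.
type stateStore interface {
	ListUsersWithState(ctx context.Context) ([]string, error)
	DeleteFullState(ctx context.Context, user string) error
}

// deleteUnusedRemoteUserState deletes remote state for every user that has state
// objects but no configuration. allUsers is the user listing re-used from the sync.
func deleteUnusedRemoteUserState(ctx context.Context, store stateStore, allUsers []string) {
	// Build the set of users that still have a configuration.
	usersWithConfig := make(map[string]struct{}, len(allUsers))
	for _, user := range allUsers {
		usersWithConfig[user] = struct{}{}
	}

	// List every user with state objects in remote storage.
	usersWithState, err := store.ListUsersWithState(ctx)
	if err != nil {
		log.Printf("failed to list users with state: %v", err)
		return
	}

	// Delete state for users without a configuration. Failures are only logged:
	// the same check runs again on the next sync, so transient storage errors or
	// the process dying mid-cleanup are retried later.
	for _, user := range usersWithState {
		if _, ok := usersWithConfig[user]; ok {
			continue
		}
		if err := store.DeleteFullState(ctx, user); err != nil {
			log.Printf("failed to delete remote state for user %s: %v", user, err)
			continue
		}
		log.Printf("deleted remote state for user %s", user)
	}
}

The key design point from the description is that failures are not fatal: because the check runs on every sync rather than once at deletion time, missed deletions are picked up on a later pass.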
Very good work! The logic LGTM. I just left a few nits.
pkg/alertmanager/multitenant.go (outdated)

@@ -697,34 +704,33 @@ func (am *MultitenantAlertmanager) stopping(_ error) error {
}

// loadAlertmanagerConfigs Loads (and filters) the alertmanagers configuration from object storage, taking into consideration the sharding strategy.
// Returns the list of all users with a configuration, and the set of users which should be configured in this instance.
func (am *MultitenantAlertmanager) loadAlertmanagerConfigs(ctx context.Context) (map[string]alertspb.AlertConfigDesc, error) {
and the set of users which should be configured in this instance
It returns all discovered users, not the ones that should be configured in this instance.
It returns both, but I can improve the wording a little.
I actually misread the comment initially. It's good.
@@ -1147,6 +1153,35 @@ func (am *MultitenantAlertmanager) UpdateState(ctx context.Context, part *cluste
	return &alertmanagerpb.UpdateStateResponse{Status: alertmanagerpb.OK}, nil
}

// deleteUnusedRemoteUserState deletes state objects in remote storage for users that are no longer configured.
func (am *MultitenantAlertmanager) deleteUnusedRemoteUserState(ctx context.Context, allUsers []string) {
[food for thought] This will be executed on every alertmanager replica, so the number of ListUsersWithState() operations will scale linearly with the number of replicas. I'm wondering whether this cleanup would be enough if done only by the alertmanager replica holding a hardcoded token (e.g. token 0).
@pstibrany @stevesg What's your take on this? I'm dubious, so asking.
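To make the suggestion concrete, here is a rough sketch of how a replica could gate the cleanup on ownership of a hardcoded token. The tokenOwnershipRing interface and the OwnsToken helper are hypothetical simplifications for illustration, not the actual Cortex ring API.

package cleanup

// tokenOwnershipRing is a simplified stand-in for the real ring; OwnsToken is a
// hypothetical helper that reports whether this instance owns the given token.
type tokenOwnershipRing interface {
	OwnsToken(token uint32) (bool, error)
}

// cleanupToken is the hardcoded token from the suggestion (e.g. token 0).
const cleanupToken uint32 = 0

// shouldRunCleanup returns true only on the replica that currently owns the cleanup
// token, so the listing operation runs roughly once per poll interval cluster-wide.
func shouldRunCleanup(r tokenOwnershipRing) bool {
	owns, err := r.OwnsToken(cleanupToken)
	if err != nil {
		// On a ring lookup error, skip this cycle; the next sync will try again.
		return false
	}
	return owns
}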
It only needs to be done by one/any instance; in fact it's a completely independent process. I don't have a good answer as to whether the saving is worth the added complexity though. One bucket iteration per PollInterval per replica doesn't seem excessive. Reducing the frequency of the cleanup would also have a good effect (e.g. an interval of 4x PollInterval).
Let's wait for @pstibrany's feedback. We need a 2nd reviewer anyway, so I wouldn't be able to merge the PR right now.
I don't have a strong opinion on whether to do it from a single instance or from all of them... doing it from all is certainly easier. To reduce the API operations, we can do it less often than on every sync, and add some jitter to the process.
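As a rough illustration of running the cleanup less often and with jitter, the sketch below derives the delay between cleanup passes from the configured poll interval. The 4x multiplier matches the "4x PollInterval" example above; the 25% jitter fraction is an arbitrary choice for illustration.

package cleanup

import (
	"math/rand"
	"time"
)

// Illustrative values: run the cleanup every 4 poll intervals, with +/-25% jitter.
const (
	cleanupIntervalMultiplier = 4
	jitterFraction            = 0.25
)

// nextCleanupDelay returns how long to wait before the next cleanup pass, so that
// replicas spread their bucket listings out instead of all hitting the object store
// at the same moment.
func nextCleanupDelay(pollInterval time.Duration) time.Duration {
	base := time.Duration(cleanupIntervalMultiplier) * pollInterval
	jitter := time.Duration((rand.Float64()*2 - 1) * jitterFraction * float64(base))
	return base + jitter
}

A timer driven by this delay could trigger the cleanup independently of the per-sync path, keeping the listing cost decoupled from the sync frequency.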
Signed-off-by: Steve Simpson <[email protected]>
LGTM, thanks!
What this PR does:
The same pattern as for cleaning up local state is used - checking for
state objects to clean up whenever the users are synced, and deleting
them if the user has no configuration. This means that we perform an
additional listing operation, as opposed to doing it on demand when the
user configuration is deleted. On the other hand, it makes the deletion
more robust against e.g. the process dying before cleaning up, or
transient storage errors causing deletion to fail.
Note that a different mechanism is used to check whether remote user state
should be deleted, compared to local state, because not every user may
be currently active in every alertmanager instance due to sharding.
Therefore, we only delete state for users which also have no
configuration. The user listing is re-used from the user sync.
Checklist
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]