
Delete alertmanager state objects from remote storage on user deletion. #4167

Merged
merged 3 commits into cortexproject:master from am-delete-state on May 18, 2021

Conversation

stevesg

@stevesg stevesg commented May 10, 2021

What this PR does:
The same pattern as for cleaning up local state is used: we check for
state objects to clean up whenever the users are synced, and delete
them if the user has no configuration. This means we perform an
additional listing operation, as opposed to deleting on demand when the
user configuration is deleted. In return, the deletion is more robust
against e.g. the process dying before cleaning up, or transient storage
errors causing the deletion to fail.

Note that a different mechanism is used to decide whether a remote user
state should be deleted, compared to local state, because due to
sharding not every user is necessarily active in every alertmanager
instance. Therefore, we only delete state for users which also have no
configuration. The user listing is re-used from the user sync.
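The cleanup pattern described above can be sketched roughly as follows. This is a simplified illustration, not the actual Cortex code: `deleteOrphanedState` and its parameters are hypothetical names standing in for the logic of the PR's `deleteUnusedRemoteUserState`, which diffs the set of users holding remote state against the set of users with a configuration.

```go
package main

import "fmt"

// deleteOrphanedState is a simplified sketch of the cleanup pattern:
// given the full list of users that have a configuration and the list of
// users that have state objects in remote storage, delete state for any
// user that has state but no configuration. It returns the users whose
// state was deleted.
func deleteOrphanedState(usersWithConfig, usersWithState []string, deleteFn func(user string) error) []string {
	configured := make(map[string]struct{}, len(usersWithConfig))
	for _, u := range usersWithConfig {
		configured[u] = struct{}{}
	}
	var deleted []string
	for _, u := range usersWithState {
		if _, ok := configured[u]; ok {
			continue // user still has a configuration; keep its state
		}
		// Transient storage errors are tolerated: the state object will
		// still be there on the next sync, which retries the deletion.
		if err := deleteFn(u); err != nil {
			continue
		}
		deleted = append(deleted, u)
	}
	return deleted
}

func main() {
	store := map[string]bool{"user-1": true, "user-2": true, "user-3": true}
	deleted := deleteOrphanedState(
		[]string{"user-1", "user-3"},           // users with a configuration
		[]string{"user-1", "user-2", "user-3"}, // users with state objects
		func(u string) error { delete(store, u); return nil },
	)
	fmt.Println(deleted) // prints [user-2]
}
```

Because the check runs on every sync, a deletion that fails (or a process that dies mid-cleanup) is simply retried the next time around, which is what makes this approach more robust than on-demand deletion.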

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


Signed-off-by: Steve Simpson <[email protected]>
@stevesg stevesg marked this pull request as ready for review May 10, 2021 11:30
@stevesg stevesg force-pushed the am-delete-state branch from fcef8cb to 9f44086 Compare May 11, 2021 11:36
@stevesg stevesg requested a review from pstibrany May 11, 2021 13:43

@pracucci pracucci left a comment


Very good work! The logic LGTM. I just left a few nits.

@@ -697,34 +704,33 @@ func (am *MultitenantAlertmanager) stopping(_ error) error {
}

// loadAlertmanagerConfigs Loads (and filters) the alertmanagers configuration from object storage, taking into consideration the sharding strategy.
func (am *MultitenantAlertmanager) loadAlertmanagerConfigs(ctx context.Context) (map[string]alertspb.AlertConfigDesc, error) {
// Returns the list of all users with a configuration, and the set of users which should be configured in this instance.
Contributor


and the set of users which should be configured in this instance

It returns all discovered users, not the ones that should be configured in this instance.

Contributor Author


It returns both, but I can improve the wording a little.

Contributor


I actually misread the comment initially. It's good.

@@ -1147,6 +1153,35 @@ func (am *MultitenantAlertmanager) UpdateState(ctx context.Context, part *cluste
return &alertmanagerpb.UpdateStateResponse{Status: alertmanagerpb.OK}, nil
}

// deleteUnusedRemoteUserState deletes state objects in remote storage for users that are no longer configured.
func (am *MultitenantAlertmanager) deleteUnusedRemoteUserState(ctx context.Context, allUsers []string) {
Contributor


[food for thought] This will be executed on every alertmanager replica, so the number of ListUsersWithState() operations will scale linearly with the number of replicas. I'm wondering whether this cleanup would be sufficient if done only by the alertmanager replica holding a hardcoded token (e.g. token 0).

@pstibrany @stevesg What's your take on this? I'm dubious, so asking.

Contributor Author

@stevesg stevesg May 11, 2021


It only needs to be done by one/any instance, in fact it's a completely independent process. I don't have a good answer as to whether the saving is worth the added complexity though. One bucket iteration per PollInterval per replica doesn't seem excessive. Reducing the frequency of the cleanup would also have a good effect (e.g. interval of 4x PollInterval).

Contributor


Let's wait for @pstibrany's feedback. We need a 2nd reviewer anyway, so I wouldn't be able to merge the PR right now.


@pstibrany pstibrany May 18, 2021


I don't have a strong opinion on whether to do it in a single instance or in all of them... doing it in all is certainly easier. To reduce the API operations, we can do it less often than on every sync, and add some jitter to the process.

Signed-off-by: Steve Simpson <[email protected]>

@pstibrany pstibrany left a comment


LGTM, thanks!

@pstibrany pstibrany merged commit 22d7182 into cortexproject:master May 18, 2021