Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore alertmanager state from storage as fallback #2293

Merged
merged 2 commits into from
Jul 6, 2022

Conversation

56quarters
Copy link
Contributor

@56quarters 56quarters commented Jun 30, 2022

Signed-off-by: Nick Pillitteri [email protected]

What this PR does

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Which issue(s) this PR fixes or relates to

Fixes #2245

Checklist

  • Tests updated
  • [na] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@callumj
Copy link
Contributor

callumj commented Jun 30, 2022

Out of curiosity what would be the license implications of backporting this to Cortex?

@56quarters
Copy link
Contributor Author

Out of curiosity what would be the license implications of backporting this to Cortex?

I'm not a lawyer so take this with a giant grain of salt:

  • I did the fix as a Grafana employee and signed a CLA so they own the copyright.
  • The DCO for Cortex requires me to say I have the right to contribute this code and license it under Apache 2 which I don't.
  • Cortex could independently fix this based on a bug report similar to the one you submitted to Mimir (excellent report, BTW).

@56quarters 56quarters force-pushed the 56quarters/am-state branch from 821e146 to c5a66a0 Compare June 30, 2022 19:27
@56quarters 56quarters marked this pull request as ready for review June 30, 2022 19:38
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@pracucci pracucci self-requested a review July 4, 2022 15:24
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nick for working on this! The fix LGTM, but I left few comments I would like you to take a look at, before merging. Thanks!

assert.NoError(t, testutil.GatherAndCompare(reg, strings.NewReader(`
# HELP alertmanager_state_fetch_replica_state_failed_total Number of times we have failed to read and merge the full state from another replica.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have been removed? It shouldn't change anything if RF>1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this because it was noisy and the most important metric is alertmanager_state_initial_sync_completed_total since it more directly represents what this method is testing: what the result of trying to load initial state is.

Comment on lines +116 to +119
replicationResults map[string]clusterpb.Part
storeResults map[string]clusterpb.Part
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these maps? Below in the test execution there's user-1 hardcoded. I think we can simplify it removing the map, given we always test with user-1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this change actually makes it a bit more involved due to needing to make them pointers (to represent "no data") and then needing to convert back to values. I'd rather just leave it in this PR if that's alright.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm fine keeping it as is, given the inconsistency wasn't introduced in this PR. Was nice to apply the boy scout rule, but will be for the next time 😉

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <[email protected]>
Signed-off-by: Nick Pillitteri <[email protected]>
@56quarters 56quarters force-pushed the 56quarters/am-state branch from c5a66a0 to b584b82 Compare July 5, 2022 16:01
@pracucci pracucci merged commit 7777802 into main Jul 6, 2022
@pracucci pracucci deleted the 56quarters/am-state branch July 6, 2022 07:19
masonmei pushed a commit to udmire/mimir that referenced this pull request Jul 11, 2022
* Restore alertmanager state from storage as fallback

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes grafana#2245

Signed-off-by: Nick Pillitteri <[email protected]>

* Code review changes

Signed-off-by: Nick Pillitteri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alertmanager state is not restored from remote storage if replicationFactor == 1
4 participants