-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore alertmanager state from storage as fallback #2293
Conversation
Out of curiosity what would be the license implications of backporting this to Cortex? |
I'm not a lawyer so take this with a giant grain of salt:
|
821e146
to
c5a66a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Nick for working on this! The fix LGTM, but I left few comments I would like you to take a look at, before merging. Thanks!
assert.NoError(t, testutil.GatherAndCompare(reg, strings.NewReader(` | ||
# HELP alertmanager_state_fetch_replica_state_failed_total Number of times we have failed to read and merge the full state from another replica. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have been removed? It shouldn't change anything if RF>1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this because it was noisy and the most important metric is alertmanager_state_initial_sync_completed_total
since it more directly represents what this method is testing: what the result of trying to load initial state is.
replicationResults map[string]clusterpb.Part | ||
storeResults map[string]clusterpb.Part |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these maps? Below in the test execution there's user-1
hardcoded. I think we can simplify it removing the map, given we always test with user-1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making this change actually makes it a bit more involved due to needing to make them pointers (to represent "no data") and then needing to convert back to values. I'd rather just leave it in this PR if that's alright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'm fine keeping it as is, given the inconsistency wasn't introduced in this PR. Was nice to apply the boy scout rule, but will be for the next time 😉
In cortexproject/cortex#3925 the ability to restore alertmanager state from peer alertmanagers was added, short-circuiting if there is only a single replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read state from storage was added in case reading from peers failed. However, the short-circuiting if there is only a single peer was not removed. This has the effect of never restoring state in an alertmanager if only running a single replica. Fixes #2245 Signed-off-by: Nick Pillitteri <[email protected]>
Signed-off-by: Nick Pillitteri <[email protected]>
c5a66a0
to
b584b82
Compare
* Restore alertmanager state from storage as fallback In cortexproject/cortex#3925 the ability to restore alertmanager state from peer alertmanagers was added, short-circuiting if there is only a single replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read state from storage was added in case reading from peers failed. However, the short-circuiting if there is only a single peer was not removed. This has the effect of never restoring state in an alertmanager if only running a single replica. Fixes grafana#2245 Signed-off-by: Nick Pillitteri <[email protected]> * Code review changes Signed-off-by: Nick Pillitteri <[email protected]>
Signed-off-by: Nick Pillitteri [email protected]
What this PR does
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.
Which issue(s) this PR fixes or relates to
Fixes #2245
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]