Restore alertmanager state from storage as fallback #2293

56quarters · 2022-06-30T19:02:49Z

Signed-off-by: Nick Pillitteri [email protected]

What this PR does

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Which issue(s) this PR fixes or relates to

Fixes #2245

Checklist

Tests updated
[na] Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

callumj · 2022-06-30T19:09:11Z

Out of curiosity what would be the license implications of backporting this to Cortex?

56quarters · 2022-06-30T19:24:53Z

> Out of curiosity what would be the license implications of backporting this to Cortex?

I'm not a lawyer so take this with a giant grain of salt:

I did the fix as a Grafana employee and signed a CLA so they own the copyright.
The DCO for Cortex requires me to say I have the right to contribute this code and license it under Apache 2, which I don't.
Cortex could independently fix this based on a bug report similar to the one you submitted to Mimir (excellent report, BTW).

jhesketh

lgtm 👍

pracucci

Thanks Nick for working on this! The fix LGTM, but I left a few comments I would like you to take a look at before merging. Thanks!

pkg/alertmanager/state_replication.go

pracucci · 2022-07-04T15:39:23Z

pkg/alertmanager/state_replication_test.go

 				assert.NoError(t, testutil.GatherAndCompare(reg, strings.NewReader(`
-# HELP alertmanager_state_fetch_replica_state_failed_total Number of times we have failed to read and merge the full state from another replica.


Why has this been removed? It shouldn't change anything if RF>1.

56quarters · 2022-07-05

I removed this because it was noisy, and the most important metric is alertmanager_state_initial_sync_completed_total since it more directly represents what this method is testing: the result of trying to load the initial state.

pracucci · 2022-07-04T15:41:12Z

pkg/alertmanager/state_replication_test.go

+		replicationResults map[string]clusterpb.Part
+		storeResults       map[string]clusterpb.Part


Why are these maps? Below in the test execution, user-1 is hardcoded. I think we can simplify this by removing the map, given we always test with user-1.

56quarters · 2022-07-05

Making this change actually makes it a bit more involved due to needing to make them pointers (to represent "no data") and then needing to convert back to values. I'd rather just leave it in this PR if that's alright.

pracucci · 2022-07-06

Ok. I'm fine keeping it as is, given the inconsistency wasn't introduced in this PR. It would have been nice to apply the boy scout rule, but that can wait for next time 😉
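The trade-off in this thread can be sketched as follows. This is only an illustration: `Part` is a hypothetical stand-in for clusterpb.Part, and `lookupFromMap` and `lookupFromPointer` are made-up helpers, not functions from the test file.

```go
package main

import "fmt"

// Part is a hypothetical stand-in for clusterpb.Part.
type Part struct {
	Key  string
	Data []byte
}

// lookupFromMap uses the comma-ok idiom: a missing key means "no data".
func lookupFromMap(results map[string]Part, user string) (Part, bool) {
	p, ok := results[user]
	return p, ok
}

// lookupFromPointer uses nil to mean "no data" and has to dereference
// the pointer to hand a value back to the caller.
func lookupFromPointer(result *Part) (Part, bool) {
	if result == nil {
		return Part{}, false
	}
	return *result, true
}

func main() {
	results := map[string]Part{"user-1": {Key: "nflog", Data: []byte("x")}}

	_, ok := lookupFromMap(results, "user-1")
	fmt.Println("map, user-1 present:", ok)

	_, ok = lookupFromMap(map[string]Part{}, "user-1")
	fmt.Println("empty map, user-1 present:", ok)

	_, ok = lookupFromPointer(nil)
	fmt.Println("nil pointer present:", ok)
}
```

An empty map expresses "no data" without any pointer handling, which is the extra conversion step the reply above refers to.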

56quarters force-pushed the 56quarters/am-state branch from 821e146 to c5a66a0 June 30, 2022 19:27

56quarters marked this pull request as ready for review June 30, 2022 19:38

jhesketh approved these changes Jul 4, 2022

pracucci self-requested a review July 4, 2022 15:24

pracucci approved these changes Jul 4, 2022

56quarters added 2 commits July 5, 2022 12:01

Code review changes

b584b82

Signed-off-by: Nick Pillitteri <[email protected]>

56quarters force-pushed the 56quarters/am-state branch from c5a66a0 to b584b82 July 5, 2022 16:01

pracucci merged commit 7777802 into main Jul 6, 2022

pracucci deleted the 56quarters/am-state branch July 6, 2022 07:19
