Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alertmanager state is not restored from remote storage if replicationFactor == 1 #2245

Closed
callumj opened this issue Jun 27, 2022 · 3 comments · Fixed by #2293
Closed

Alertmanager state is not restored from remote storage if replicationFactor == 1 #2245

callumj opened this issue Jun 27, 2022 · 3 comments · Fixed by #2293
Assignees
Labels
bug Something isn't working component/alertmanager

Comments

@callumj
Copy link
Contributor

callumj commented Jun 27, 2022

Describe the bug

Alertmanager state is not restored from remote storage if replicationFactor == 1: https://github.com/grafana/mimir/blob/main/pkg/alertmanager/state_replication.go#L205

I understand that running replicationFactor at 1 is a bad idea but for dev or staging environments it usually makes sense to keep things simple or cut down on costs.

I was wondering if this is intentional that we skip reading from storage in these cases?

To Reproduce

Steps to reproduce the behavior:

  1. Run Alertmanager with replicationFactor of 1
  2. Post an alert to AM
  3. Kill the AM process
  4. Observe state is not restored from storage

Expected behavior

State should be read from storage if no replication option is available

Environment

  • Infrastructure: k8s
  • Deployment tool: manual

Additional Context

@56quarters
Copy link
Contributor

Looking at the change that introduced this, it seems like a bug to me too: 7896c43

@56quarters 56quarters added bug Something isn't working component/alertmanager labels Jun 27, 2022
@56quarters 56quarters self-assigned this Jun 30, 2022
@56quarters
Copy link
Contributor

Actually, looks like it was a combination of the following PRs:

I'll have something that should fix this up for review shortly.

@stevesg
Copy link
Contributor

stevesg commented Jun 30, 2022

Good spot - the bug is that we should jump here if RF==1.

56quarters added a commit that referenced this issue Jun 30, 2022
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <[email protected]>
56quarters added a commit that referenced this issue Jul 5, 2022
In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <[email protected]>
pracucci pushed a commit that referenced this issue Jul 6, 2022
* Restore alertmanager state from storage as fallback

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes #2245

Signed-off-by: Nick Pillitteri <[email protected]>

* Code review changes

Signed-off-by: Nick Pillitteri <[email protected]>
masonmei pushed a commit to udmire/mimir that referenced this issue Jul 11, 2022
* Restore alertmanager state from storage as fallback

In cortexproject/cortex#3925 the ability to restore alertmanager state from
peer alertmanagers was added, short-circuiting if there is only a single
replica of the alertmanager. In cortexproject/cortex#4021 a fallback to read
state from storage was added in case reading from peers failed. However, the
short-circuiting if there is only a single peer was not removed. This has the
effect of never restoring state in an alertmanager if only running a single
replica.

Fixes grafana#2245

Signed-off-by: Nick Pillitteri <[email protected]>

* Code review changes

Signed-off-by: Nick Pillitteri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working component/alertmanager
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants