TestAlertmanagerSharding is flaky due to a logic issue #3927

@pracucci

Description

Describe the bug
The TestAlertmanagerSharding test, which was updated in #3839, is flaky. As an example, you can see it here and here.

To Reproduce
I've reproduced it locally with debug logs. This is a snippet of the logs:

# we create the silence here

17:57:39 alertmanager-3: level=debug ts=2021-03-08T16:57:39.9726694Z caller=multitenant.go:929 component=MultiTenantAlertmanager msg="user does not have an alertmanager in this instance" user=user-5
17:57:39 alertmanager-1: level=debug ts=2021-03-08T16:57:39.9740552Z caller=multitenant.go:907 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5

# the periodic sync realises the shard has changed and loads user-5 too, but it's "too late"

17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1726301Z caller=multitenant.go:673 component=MultiTenantAlertmanager msg="setting config" user=user-5
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1730083Z caller=multitenant.go:726 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1732274Z caller=alertmanager.go:139 user=user-5 msg="starting tenant alertmanager with ring-based replication"

The problem is that if a silence is created soon after a resharding, the state replication may fail: the instance that just gained the tenant receives the replicated state before its periodic sync has initialized the per-tenant alertmanager, so the replication request is rejected with "user not found".

Expected behavior
The replication should not fail when it happens right after a resharding.
