Description
Describe the bug
The TestAlertmanagerSharding test, which was updated in #3839, is flaky. As an example, you can see it here and here.
To Reproduce
I've reproduced it locally with debug logs. This is a snippet of the logs:
# we create the silence here
17:57:39 alertmanager-3: level=debug ts=2021-03-08T16:57:39.9726694Z caller=multitenant.go:929 component=MultiTenantAlertmanager msg="user does not have an alertmanager in this instance" user=user-5
17:57:39 alertmanager-1: level=debug ts=2021-03-08T16:57:39.9740552Z caller=multitenant.go:907 component=MultiTenantAlertmanager msg="user not found while trying to replicate state" user=user-5 key=sil:user-5
# the periodic sync realises the shard has changed and loads user-5 too, but it's "too late"
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1726301Z caller=multitenant.go:673 component=MultiTenantAlertmanager msg="setting config" user=user-5
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1730083Z caller=multitenant.go:726 component=MultiTenantAlertmanager msg="initializing new per-tenant alertmanager" user=user-5
17:57:40 alertmanager-3: level=debug ts=2021-03-08T16:57:40.1732274Z caller=alertmanager.go:139 user=user-5 msg="starting tenant alertmanager with ring-based replication"
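The problem is that if a silence is created soon after a resharding, the replication may fail, because an instance that should hold the tenant has not loaded it yet (in the logs above, alertmanager-3 only initializes user-5 after the replication attempt).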
Expected behavior
Replication should not fail when it happens right after a resharding.