[BUG] Remote Cluster State Diff Download Failures while performing IndicesAliases Action #18045
Comments
The issue appears to stem from the read path of the publication on the nodes where we create the new cluster state update from the incoming manifest. Reference - OpenSearch/server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java Lines 1583 to 1694 in caf5d71
The manifest contains two pieces of information, which the cluster manager sets before sending it to the nodes:
While reading the new cluster state, we do the following:
OpenSearch/server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java Line 1419 in caf5d71
OpenSearch/server/src/main/java/org/opensearch/gateway/remote/RemoteClusterStateService.java Lines 1469 to 1472 in caf5d71
As part of the IndicesAlias Action, we can remove an index and assign its name as an alias to another index in the same request. For example:
In this case the new cluster state to be achieved would have the following diff -
In this case, since we only update the Metadata with the newest IndexMetadata fetched from remote (as mentioned in Step 3 above), the Metadata ends up in an inconsistent state, which leads to the issue.
Q: Why do we only see this on domains with Remote Publication enabled?
When a cluster state update is applied as a diff received over transport, we apply a map diff in which the index metadata to be removed is deleted first, before any update actions are performed. We maintain this in the Metadata as a map of <String, IndexMetadata>.
We apply by deleting first - OpenSearch/server/src/main/java/org/opensearch/cluster/DiffableUtils.java Lines 164 to 181 in caf5d71
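The ordering difference described above can be illustrated with a minimal, self-contained sketch. It is not the OpenSearch implementation: the class `AliasDiffSketch`, its methods, the index names `logs-old` and `logs-new`, and the simplified consistency rule (an alias may not share a name with an index still present in the map) are all assumptions made for illustration, and index metadata is reduced to a map from index name to its set of aliases.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AliasDiffSketch {

    // Hypothetical stand-in for Metadata validation: an alias must not
    // share its name with an index that is still present in the map.
    static boolean isConsistent(Map<String, Set<String>> indices) {
        for (Set<String> aliases : indices.values()) {
            for (String alias : aliases) {
                if (indices.containsKey(alias)) {
                    return false;
                }
            }
        }
        return true;
    }

    // Upserts applied while the removed index is still in the map
    // (the situation this comment attributes to the remote read path).
    static Map<String, Set<String>> applyUpsertsOnly(
            Map<String, Set<String>> current,
            Map<String, Set<String>> upserts) {
        Map<String, Set<String>> next = new HashMap<>(current);
        next.putAll(upserts);
        return next;
    }

    // Delete-first ordering, as in the transport map-diff path.
    static Map<String, Set<String>> applyDeletesThenUpserts(
            Map<String, Set<String>> current,
            List<String> deletes,
            Map<String, Set<String>> upserts) {
        Map<String, Set<String>> next = new HashMap<>(current);
        for (String name : deletes) {
            next.remove(name);
        }
        next.putAll(upserts);
        return next;
    }

    public static void main(String[] args) {
        // State before the request: two indices, no aliases (hypothetical names).
        Map<String, Set<String>> current = Map.of(
                "logs-old", Set.<String>of(),
                "logs-new", Set.<String>of());

        // Effect of one request that removes index "logs-old" and adds
        // "logs-old" as an alias of "logs-new".
        List<String> deletes = List.of("logs-old");
        Map<String, Set<String>> upserts = Map.of("logs-new", Set.of("logs-old"));

        System.out.println("upserts only:  consistent = "
                + isConsistent(applyUpsertsOnly(current, upserts)));
        System.out.println("deletes first: consistent = "
                + isConsistent(applyDeletesThenUpserts(current, deletes, upserts)));
    }
}
```

In this sketch the upsert-only result still contains the index `logs-old` alongside an alias of the same name, so the check fails, while the delete-first result passes.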
Describe the bug
Issue Overview
When Remote State and Publication are enabled in an OpenSearch cluster, executing a PUT _aliases request (IndicesAliasAction) that removes an index and assigns its name as an alias to another index in the same request causes a cluster state update failure, specifically a diff download failure. This results in the cluster manager stepping down and a new node stepping up as cluster manager. All nodes then perform a full cluster state download as a consequence of the node joins.
The request goes through; however, it internally leads to a cluster manager election (because the previous cluster manager steps down after the publication failure). This has the following consequence:
Related component
Cluster Manager
To Reproduce
Expected behavior
The _aliases request should go through without any download failures (publication should succeed).
Additional Details
Plugins
NA, Core issue
Screenshots
None
Host/Environment (please complete the following information):
Additional context
None