Describe the bug
The test testOngoingRecoveryAndClusterManagerFailOver fails when async shard batch fetch is enabled. Context on the test:
- Create an index with one primary shard.
- Block the transport thread once for the clean-files phase during peer recovery.
- Increase the replica count to 1. Shard state: [1p - STARTED, 1r - INITIALIZING (recovery blocked)].
- Switch the active cluster-manager node.
- Increase the replica count to 2. Shard state: [1p - STARTED, 1r - INITIALIZING (recovery blocked), 1r - UNASSIGNED].
- Validate via cluster health that there are 2 active shards within the default timeout window (30 sec).
Root cause:
When ShardsBatchGatewayAllocator creates batches in batch mode, it maintains only one shard routing per shard ID within a batch. When the replica count is increased to two, there are two replica shards for the same shard ID, one UNASSIGNED and one INITIALIZING. During batch creation, the batch entry for the UNASSIGNED shard gets overwritten by the INITIALIZING shard, and ReplicaShardBatchAllocator then finds no eligible shard because the surviving routing is in the INITIALIZING state. Since there are no eligible shards, shard info is never fetched, and the second replica remains stuck in the awaiting_fetch state.
The cluster stays stuck in this state until recovery of the first replica fails on the 60 sec timeout, which moves that replica back to UNASSIGNED.
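The overwrite can be modeled in a few lines. This is a minimal sketch, not the actual ShardsBatchGatewayAllocator internals: the State enum, ShardRouting record, and buildBatch helper are illustrative names, but the map-keyed-by-shard-ID behavior is the one described above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the overwrite: a batch keyed by shard ID can hold only one
// routing per shard, so the later put wins. Names here are illustrative,
// not the actual ShardsBatchGatewayAllocator code.
public class BatchOverwriteSketch {

    enum State { UNASSIGNED, INITIALIZING }

    record ShardRouting(String shardId, State state) {}

    // Mirrors the log order: the UNASSIGNED routing is added to the batch
    // first, then the INITIALIZING routing overwrites it.
    static Map<String, ShardRouting> buildBatch() {
        Map<String, ShardRouting> batch = new HashMap<>();
        batch.put("[test][0]", new ShardRouting("[test][0]", State.UNASSIGNED));
        batch.put("[test][0]", new ShardRouting("[test][0]", State.INITIALIZING));
        return batch;
    }

    public static void main(String[] args) {
        // ReplicaShardBatchAllocator only fetches for UNASSIGNED shards, so the
        // surviving INITIALIZING entry leaves nothing eligible to fetch.
        long eligible = buildBatch().values().stream()
                .filter(s -> s.state() == State.UNASSIGNED)
                .count();
        System.out.println("eligible shards: " + eligible); // prints "eligible shards: 0"
    }
}
```

With zero eligible shards, the async fetch is never triggered, which matches the repeated "awaiting_info" / "Ineligible replica shard" lines in the logs below.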
Logs
[2025-04-28T12:00:18,892][INFO ][o.o.c.m.MetadataUpdateSettingsService] [node_t0] updating number_of_replicas to [2] for indices [test]
[2025-04-28T12:00:18,893][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing reroute for reason: settings update
[2025-04-28T12:00:18,894][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing batch mode for unassigned shards: [[test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:18,894][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {}
[2025-04-28T12:00:18,894][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {}
[2025-04-28T12:00:18,894][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] New shard [test][0] to batch with shard routing [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:18,895][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Starting shard allocation execution for unassigned replica shards: [[test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:18,897][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Executing decision awaiting_info for unassigned shard [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:18,898][INFO ][o.o.c.r.a.AllocationService] [node_t0] Completed batch mode execution, unassigned shards [], initialising shards [[test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]], relocating shards [], started shards [[test][0], node[JALHgQ1lTyWDZTdB4cqe1A], [P], s[STARTED], a[id=5AdKLtJkTQeEc3LNvmJXIA]]
[2025-04-28T12:00:18,900][INFO ][o.o.p.PluginsService ] [node_t0] PluginService:onIndexModule index:[test/LG5RgaVDQCOwoP52__ejDA]
[2025-04-28T12:00:19,325][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing reroute for reason: post-join reroute
[2025-04-28T12:00:19,325][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing batch mode for unassigned shards: [[test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:19,325][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {}
[2025-04-28T12:00:19,326][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {[test][0]=PcIye5YBoGs1Jp2XV8lP}
[2025-04-28T12:00:19,326][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Updating batch PcIye5YBoGs1Jp2XV8lP for shard ID [test][0] with shard [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,326][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Updating batch PcIye5YBoGs1Jp2XV8lP for shard ID [test][0] with shard [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,327][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Starting shard allocation execution for unassigned replica shards: [[test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:19,327][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Shard is not in unassigned state [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,327][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Ineligible replica shard [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,328][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Executing decision awaiting_info for unassigned shard [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,328][INFO ][o.o.c.r.a.AllocationService] [node_t0] Completed batch mode execution, unassigned shards [], initialising shards [[test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]], relocating shards [], started shards [[test][0], node[JALHgQ1lTyWDZTdB4cqe1A], [P], s[STARTED], a[id=5AdKLtJkTQeEc3LNvmJXIA]]
[2025-04-28T12:00:19,343][INFO ][o.o.i.r.IndexRecoveryIT ] [testOngoingRecoveryAndClusterManagerFailOver] Routing table: IndexRoutingTable{shards={0=IndexShardRoutingTable([test][0]){[test][0], node[JALHgQ1lTyWDZTdB4cqe1A], [P], s[STARTED], a[id=5AdKLtJkTQeEc3LNvmJXIA], [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]], [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]}}, index=[test/LG5RgaVDQCOwoP52__ejDA]}
[2025-04-28T12:01:16,946][WARN ][o.o.i.c.IndicesClusterStateService] [node_t3] [test][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t2}{JALHgQ1lTyWDZTdB4cqe1A}{hsSnbbJmTWyRzwGq2SXnGQ}{127.0.0.1}{127.0.0.1:61768}{d}{shard_indexing_pressure_enabled=true} into {node_t3}{LAnB2nPrT6KDDN1zoUS5Fw}{3KRUV8TTTs-oJkICn_cY9w}{127.0.0.1}{127.0.0.1:61784}{d}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t2}{JALHgQ1lTyWDZTdB4cqe1A}{hsSnbbJmTWyRzwGq2SXnGQ}{127.0.0.1}{127.0.0.1:61768}{d}{shard_indexing_pressure_enabled=true} into {node_t3}{LAnB2nPrT6KDDN1zoUS5Fw}{3KRUV8TTTs-oJkICn_cY9w}{127.0.0.1}{127.0.0.1:61784}{d}{shard_indexing_pressure_enabled=true})
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) ~[main/:?]
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) ~[main/:?]
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) ~[main/:?]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:760) ~[main/:?]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:690) ~[main/:?]
at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) ~[main/:?]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1505) ~[main/:?]
at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:454) ~[main/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[main/:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t2}{JALHgQ1lTyWDZTdB4cqe1A}{hsSnbbJmTWyRzwGq2SXnGQ}{127.0.0.1}{127.0.0.1:61768}{d}{shard_indexing_pressure_enabled=true} into {node_t3}{LAnB2nPrT6KDDN1zoUS5Fw}{3KRUV8TTTs-oJkICn_cY9w}{127.0.0.1}{127.0.0.1:61784}{d}{shard_indexing_pressure_enabled=true}
... 9 more
Caused by: org.opensearch.transport.RemoteTransportException: [node_t2][127.0.0.1:61768][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [node_t3][127.0.0.1:61784][internal:index/shard/recovery/clean_files] request_id [22] timed out after [60000ms]
at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1399) ~[main/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1583) [?:?]
Related component
Cluster Manager
To Reproduce
- Enable the batch mode setting.
- Run the test testOngoingRecoveryAndClusterManagerFailOver.
Expected behavior
Unassigned replica shards should get assigned even when another replica of the same shard ID is in the INITIALIZING state.
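One possible direction, sketched under the assumption that the batch stays keyed by shard ID (the addToBatch helper is hypothetical, not actual OpenSearch code): when two routings collide on the same shard ID, prefer keeping the UNASSIGNED one so the replica stays eligible for fetch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical merge rule: never let a non-UNASSIGNED routing replace an
// UNASSIGNED one for the same shard ID. A sketch, not the actual fix.
public class BatchMergeSketch {

    enum State { UNASSIGNED, INITIALIZING }

    record ShardRouting(String shardId, State state) {}

    // Keep an existing UNASSIGNED routing so ReplicaShardBatchAllocator
    // still sees an eligible shard and triggers the async fetch.
    static void addToBatch(Map<String, ShardRouting> batch, ShardRouting incoming) {
        batch.merge(incoming.shardId(), incoming, (existing, candidate) ->
                existing.state() == State.UNASSIGNED ? existing : candidate);
    }

    public static void main(String[] args) {
        Map<String, ShardRouting> batch = new HashMap<>();
        addToBatch(batch, new ShardRouting("[test][0]", State.UNASSIGNED));
        addToBatch(batch, new ShardRouting("[test][0]", State.INITIALIZING));
        System.out.println(batch.get("[test][0]").state()); // prints "UNASSIGNED"
    }
}
```

An alternative is to keep all routings per shard ID (a collection rather than a single entry), which would preserve both replicas' states; either way the UNASSIGNED replica must remain visible to the allocator.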