Describe the bug
The test testOngoingRecoveryAndClusterManagerFailOver fails when async shard batch fetch is enabled. Context on the test:
- Create an index with one primary shard.
- Block the transport thread once for the clean-files phase during peer recovery.
- Increase the replica count to 1. Shard state: [1p - STARTED, 1r - INITIALIZING (recovery blocked)].
- Switch the active cluster-manager node.
- Increase the replica count to 2. Shard state: [1p - STARTED, 1r - INITIALIZING (recovery blocked), 1r - UNASSIGNED].
- Validate via cluster health that there are 2 active shards within the default timeout window (30 sec).
Root cause:
When ShardsBatchGatewayAllocator creates batches in batch mode, it maintains only one shard routing per shard ID within a batch. When the replica count is increased to two, there are two replica shards for the same shard ID, one UNASSIGNED and one INITIALIZING. During batch creation, the batch entry for the UNASSIGNED shard gets overwritten by the INITIALIZING shard, and ReplicaShardBatchAllocator then finds no eligible shard because the surviving routing is in the INITIALIZING state. Since there are no eligible shards, shard info is never fetched, and the second replica remains stuck in the awaiting_fetch state.
The cluster stays stuck in this state until recovery of the first replica fails on the 60 sec timeout, which moves that replica back to UNASSIGNED.
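The overwrite can be modeled in a few lines. This is a minimal sketch, not the actual ShardsBatchGatewayAllocator internals: the State enum, ShardRouting record, and buildBatch helper are illustrative names, but the map-keyed-by-shard-ID behavior is the one described above.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the overwrite: a batch keyed by shard ID can hold only one
// routing per shard, so the later put wins. Names here are illustrative,
// not the actual ShardsBatchGatewayAllocator code.
public class BatchOverwriteSketch {

    enum State { UNASSIGNED, INITIALIZING }

    record ShardRouting(String shardId, State state) {}

    // Mirrors the log order: the UNASSIGNED routing is added to the batch
    // first, then the INITIALIZING routing overwrites it.
    static Map<String, ShardRouting> buildBatch() {
        Map<String, ShardRouting> batch = new HashMap<>();
        batch.put("[test][0]", new ShardRouting("[test][0]", State.UNASSIGNED));
        batch.put("[test][0]", new ShardRouting("[test][0]", State.INITIALIZING));
        return batch;
    }

    public static void main(String[] args) {
        // ReplicaShardBatchAllocator only fetches for UNASSIGNED shards, so the
        // surviving INITIALIZING entry leaves nothing eligible to fetch.
        long eligible = buildBatch().values().stream()
                .filter(s -> s.state() == State.UNASSIGNED)
                .count();
        System.out.println("eligible shards: " + eligible); // prints "eligible shards: 0"
    }
}
```

With zero eligible shards, the async fetch is never triggered, which matches the repeated "awaiting_info" / "Ineligible replica shard" lines in the logs below.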
Logs
[2025-04-28T12:00:18,892][INFO ][o.o.c.m.MetadataUpdateSettingsService] [node_t0] updating number_of_replicas to [2] for indices [test]
[2025-04-28T12:00:18,893][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing reroute for reason: settings update
[2025-04-28T12:00:18,894][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing batch mode for unassigned shards: [[test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:18,894][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {}
[2025-04-28T12:00:18,894][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {}
[2025-04-28T12:00:18,894][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] New shard [test][0] to batch with shard routing [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:18,895][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Starting shard allocation execution for unassigned replica shards: [[test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:18,897][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Executing decision awaiting_info for unassigned shard [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:18,898][INFO ][o.o.c.r.a.AllocationService] [node_t0] Completed batch mode execution, unassigned shards [], initialising shards [[test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]], relocating shards [], started shards [[test][0], node[JALHgQ1lTyWDZTdB4cqe1A], [P], s[STARTED], a[id=5AdKLtJkTQeEc3LNvmJXIA]]
[2025-04-28T12:00:18,900][INFO ][o.o.p.PluginsService ] [node_t0] PluginService:onIndexModule index:[test/LG5RgaVDQCOwoP52__ejDA]
[2025-04-28T12:00:19,325][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing reroute for reason: post-join reroute
[2025-04-28T12:00:19,325][INFO ][o.o.c.r.a.AllocationService] [node_t0] Executing batch mode for unassigned shards: [[test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:19,325][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {}
[2025-04-28T12:00:19,326][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Current Batched shards: {[test][0]=PcIye5YBoGs1Jp2XV8lP}
[2025-04-28T12:00:19,326][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Updating batch PcIye5YBoGs1Jp2XV8lP for shard ID [test][0] with shard [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,326][INFO ][o.o.g.ShardsBatchGatewayAllocator] [node_t0] Updating batch PcIye5YBoGs1Jp2XV8lP for shard ID [test][0] with shard [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,327][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Starting shard allocation execution for unassigned replica shards: [[test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]]
[2025-04-28T12:00:19,327][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Shard is not in unassigned state [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,327][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Ineligible replica shard [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,328][INFO ][o.o.g.S.InternalReplicaBatchShardAllocator] [node_t0] Executing decision awaiting_info for unassigned shard [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]
[2025-04-28T12:00:19,328][INFO ][o.o.c.r.a.AllocationService] [node_t0] Completed batch mode execution, unassigned shards [], initialising shards [[test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]]], relocating shards [], started shards [[test][0], node[JALHgQ1lTyWDZTdB4cqe1A], [P], s[STARTED], a[id=5AdKLtJkTQeEc3LNvmJXIA]]
[2025-04-28T12:00:19,343][INFO ][o.o.i.r.IndexRecoveryIT ] [testOngoingRecoveryAndClusterManagerFailOver] Routing table: IndexRoutingTable{shards={0=IndexShardRoutingTable([test][0]){[test][0], node[JALHgQ1lTyWDZTdB4cqe1A], [P], s[STARTED], a[id=5AdKLtJkTQeEc3LNvmJXIA], [test][0], node[LAnB2nPrT6KDDN1zoUS5Fw], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=tXyDz0WbSkOMgd9ks2l50w], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:16.384Z], delayed=false, allocation_status[no_attempt]], [test][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2025-04-28T07:00:18.892Z], delayed=false, allocation_status[no_attempt]]}}, index=[test/LG5RgaVDQCOwoP52__ejDA]}
[2025-04-28T12:01:16,946][WARN ][o.o.i.c.IndicesClusterStateService] [node_t3] [test][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t2}{JALHgQ1lTyWDZTdB4cqe1A}{hsSnbbJmTWyRzwGq2SXnGQ}{127.0.0.1}{127.0.0.1:61768}{d}{shard_indexing_pressure_enabled=true} into {node_t3}{LAnB2nPrT6KDDN1zoUS5Fw}{3KRUV8TTTs-oJkICn_cY9w}{127.0.0.1}{127.0.0.1:61784}{d}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t2}{JALHgQ1lTyWDZTdB4cqe1A}{hsSnbbJmTWyRzwGq2SXnGQ}{127.0.0.1}{127.0.0.1:61768}{d}{shard_indexing_pressure_enabled=true} into {node_t3}{LAnB2nPrT6KDDN1zoUS5Fw}{3KRUV8TTTs-oJkICn_cY9w}{127.0.0.1}{127.0.0.1:61784}{d}{shard_indexing_pressure_enabled=true})
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) ~[main/:?]
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) ~[main/:?]
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) ~[main/:?]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:760) ~[main/:?]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:690) ~[main/:?]
at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) ~[main/:?]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1505) ~[main/:?]
at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:454) ~[main/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[main/:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t2}{JALHgQ1lTyWDZTdB4cqe1A}{hsSnbbJmTWyRzwGq2SXnGQ}{127.0.0.1}{127.0.0.1:61768}{d}{shard_indexing_pressure_enabled=true} into {node_t3}{LAnB2nPrT6KDDN1zoUS5Fw}{3KRUV8TTTs-oJkICn_cY9w}{127.0.0.1}{127.0.0.1:61784}{d}{shard_indexing_pressure_enabled=true}
... 9 more
Caused by: org.opensearch.transport.RemoteTransportException: [node_t2][127.0.0.1:61768][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [node_t3][127.0.0.1:61784][internal:index/shard/recovery/clean_files] request_id [22] timed out after [60000ms]
at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1399) ~[main/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) [main/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1583) [?:?]
Related component
Cluster Manager
To Reproduce
- Enable the batch mode setting.
- Run the test testOngoingRecoveryAndClusterManagerFailOver.
Expected behavior
Unassigned replica shards should get assigned even when another replica of the same shard ID is in the INITIALIZING state.
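One possible direction, sketched under the assumption that the batch stays keyed by shard ID (the addToBatch helper is hypothetical, not actual OpenSearch code): when two routings collide on the same shard ID, prefer keeping the UNASSIGNED one so the replica stays eligible for fetch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical merge rule: never let a non-UNASSIGNED routing replace an
// UNASSIGNED one for the same shard ID. A sketch, not the actual fix.
public class BatchMergeSketch {

    enum State { UNASSIGNED, INITIALIZING }

    record ShardRouting(String shardId, State state) {}

    // Keep an existing UNASSIGNED routing so ReplicaShardBatchAllocator
    // still sees an eligible shard and triggers the async fetch.
    static void addToBatch(Map<String, ShardRouting> batch, ShardRouting incoming) {
        batch.merge(incoming.shardId(), incoming, (existing, candidate) ->
                existing.state() == State.UNASSIGNED ? existing : candidate);
    }

    public static void main(String[] args) {
        Map<String, ShardRouting> batch = new HashMap<>();
        addToBatch(batch, new ShardRouting("[test][0]", State.UNASSIGNED));
        addToBatch(batch, new ShardRouting("[test][0]", State.INITIALIZING));
        System.out.println(batch.get("[test][0]").state()); // prints "UNASSIGNED"
    }
}
```

An alternative is to keep all routings per shard ID (a collection rather than a single entry), which would preserve both replicas' states; either way the UNASSIGNED replica must remain visible to the allocator.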