
[Feature Request] Remote Translog Optimisations for faster and resilient recoveries #15277

Open
@Bukhtawar

Description


Is your feature request related to a problem? Please describe

  1. If a recovery has a large number of uncommitted operations, it times out and the translogs have to be downloaded afresh, causing a loop of failed recoveries that each time out again because retries do not download incrementally.
  2. Remote translog recovery acquires a shard lock while downloading translog files. If the recovery then fails, the shard cannot be closed until the translog download completes; this blocks the cluster applier thread, so the node lags behind the cluster state and can ultimately drop out of the cluster. The thread dumps below capture this, and a simplified sketch of the locking pattern follows them.
"opensearch[691aeed35826ecc93653e3011d18c9b1][clusterApplierService#updateTask][T#1]" #268 daemon prio=5 os_prio=0 cpu=69394.87ms elapsed=10325.08s tid=0x0000ffdde862cd40 nid=0x487c waiting for monitor entry  [0x0000ffdc2f4fd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:265)
        - waiting to lock <0x0000ffe0683bbfa8> (a org.opensearch.indices.cluster.IndicesClusterStateService)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:608)
        at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:595)
        at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:563)
        at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486)
        at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
        at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run([email protected]/Thread.java:840)
        
   Locked ownable synchronizers:
        - <0x0000ffe06a6cba78> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"opensearch[691aeed35826ecc93653e3011d18c9b1][generic][T#26]" #290 daemon prio=5 os_prio=0 cpu=464.40ms elapsed=10325.07s tid=0x0000ffdc90029390 nid=0x4892 waiting for monitor entry  [0x0000ffdc2defd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.opensearch.index.shard.IndexShard.close(IndexShard.java:2110)
        - waiting to lock <0x0000ffe1449c1d98> (a java.lang.Object)
        at org.opensearch.index.IndexService.closeShard(IndexService.java:644)
        at org.opensearch.index.IndexService.removeShard(IndexService.java:620)
        - locked <0x0000ffe0931fbd80> (a org.opensearch.index.IndexService)
        at org.opensearch.indices.cluster.IndicesClusterStateService.failAndRemoveShard(IndicesClusterStateService.java:817)
        at org.opensearch.indices.cluster.IndicesClusterStateService.handleRecoveryFailure(IndicesClusterStateService.java:797)
        - locked <0x0000ffe0683bbfa8> (a org.opensearch.indices.cluster.IndicesClusterStateService)
        at org.opensearch.indices.recovery.RecoveryListener.onFailure(RecoveryListener.java:55)
        at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:136)
        at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180)
        at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:756)
        at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:682)
        at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:430)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1515)
        at org.opensearch.transport.InboundHandler.lambda$handleException$5(InboundHandler.java:447)
        at org.opensearch.transport.InboundHandler$$Lambda$8371/0x000000a00227e220.run(Unknown Source)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run([email protected]/Thread.java:840)
        
   Locked ownable synchronizers:
        - <0x0000ffe06a605900> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"opensearch[691aeed35826ecc93653e3011d18c9b1][generic][T#19]" #283 daemon prio=5 os_prio=0 cpu=187528.43ms elapsed=10325.07s tid=0x0000ffdc90021ff0 nid=0x488b waiting on condition  [0x0000ffdc2e5fd000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x0000fffa222465d0> (a java.util.concurrent.FutureTask)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:211)
        at java.util.concurrent.FutureTask.awaitDone([email protected]/FutureTask.java:447)
        at java.util.concurrent.FutureTask.get([email protected]/FutureTask.java:190)
        at org.opensearch.encryption.frame.CryptoInputStream.read(CryptoInputStream.java:193)
        at java.io.InputStream.transferTo([email protected]/InputStream.java:782)
        at java.nio.file.Files.copy([email protected]/Files.java:3171)
        at org.opensearch.index.translog.transfer.TranslogTransferManager.downloadToFS(TranslogTransferManager.java:312)
        at org.opensearch.index.translog.transfer.TranslogTransferManager.downloadTranslog(TranslogTransferManager.java:258)
        at org.opensearch.index.translog.RemoteFsTranslog.downloadOnce(RemoteFsTranslog.java:246)
        at org.opensearch.index.translog.RemoteFsTranslog.download(RemoteFsTranslog.java:213)
        at org.opensearch.index.translog.RemoteFsTranslog.download(RemoteFsTranslog.java:196)
        at org.opensearch.index.shard.IndexShard.syncTranslogFilesFromRemoteTranslog(IndexShard.java:5000)
        at org.opensearch.index.shard.IndexShard.syncRemoteTranslogAndUpdateGlobalCheckpoint(IndexShard.java:4978)
        at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2584)
        - locked <0x0000ffe1449c1d98> (a java.lang.Object)
        at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2554)
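
The dumps above form a chain: the recovery thread holds the engine mutex for the entire remote translog download, IndexShard.close() blocks on that mutex, handleRecoveryFailure therefore keeps the IndicesClusterStateService monitor locked, and the cluster applier thread stalls behind it. A minimal sketch of that pattern, using hypothetical names rather than the actual IndexShard code:

    /** Illustrative only: a long, non-cancellable download under the engine mutex blocks shard close. */
    class ShardSketch {
        private final Object engineMutex = new Object();

        // Recovery path (generic thread): holds the mutex for the whole remote translog download.
        void openEngineAndTranslog() throws InterruptedException {
            synchronized (engineMutex) {
                downloadRemoteTranslog();
            }
        }

        // Close path, reached from the cluster applier when the recovery is failed:
        // it blocks here until the download finishes, keeping the applier thread stuck.
        void close() {
            synchronized (engineMutex) {
                // release engine resources
            }
        }

        private void downloadRemoteTranslog() throws InterruptedException {
            Thread.sleep(60_000L); // stand-in for a multi-minute remote translog download
        }
    }

If the download honoured a cancellation signal, close() could return promptly and the applier thread would not be held up.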

Describe the solution you'd like

  1. Make translog downloads on recovery incremental
  2. Make translog downloads on recovery cancellable (a sketch of items 1 and 2 follows this list)
  3. Parallelise translog downloads and translog replays
  4. Attempt to trigger a flush on recovery failures, so that fewer uncommitted operations remain for the next attempt
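
A minimal sketch of items 1 and 2, assuming an illustrative listing API and a plain cancellation flag (the names below are hypothetical, not the actual TranslogTransferManager API):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicBoolean;

    /** Illustrative only: incremental, cancellable translog download on recovery. */
    class IncrementalTranslogDownloader {
        /** Hypothetical handle to one remote translog/checkpoint file. */
        interface RemoteTranslogFile {
            String name();
            long length();
            void copyTo(Path dest) throws IOException;
        }

        private final AtomicBoolean cancelled = new AtomicBoolean(false);

        void cancel() {
            cancelled.set(true); // lets a failed/cancelled recovery release the shard quickly
        }

        void download(List<RemoteTranslogFile> remoteFiles, Path localTranslogDir) throws IOException {
            for (RemoteTranslogFile remote : remoteFiles) {
                if (cancelled.get()) {
                    throw new IOException("translog download cancelled");
                }
                Path local = localTranslogDir.resolve(remote.name());
                // Incremental: reuse files left behind by a previous (timed-out) attempt instead of re-fetching them.
                if (Files.exists(local) && Files.size(local) == remote.length()) {
                    continue;
                }
                remote.copyTo(local);
            }
        }
    }

Item 3 could then replay each downloaded generation while the next one is still being fetched, and item 4 would bound how much translog a subsequent attempt has to download at all.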

Related component

Storage:Remote

Describe alternatives you've considered

No response

Additional context

No response
