Skip to content

Orca tf1 train gets No route to host error when launching from remote on k8s #3725

Open
@hkvision

Description

@hkvision

Run on Almaren-Node-188
cd kai
conda activate orca-demo
export PYSPARK_PYTHON=/usr/local/envs/pytf1/bin/python
python ncf_orca.py

get the error below when the training is finished:

2021-12-21 09:47:30 INFO  DistriOptimizer$:431 - [Epoch 2 80000/80668][Iteration 21][Wall Clock 10.99408137s] Trained 8000.0 records in 0.341889744 seconds. Throughput is 23399.357 records/second. Loss is 1.6157264.
2021-12-21 09:47:30 INFO  DistriOptimizer$:431 - [Epoch 2 88000/80668][Iteration 22][Wall Clock 11.309833195s] Trained 8000.0 records in 0.315751825 seconds. Throughput is 25336.354 records/second. Loss is 1.2710276.
2021-12-21 09:47:30 INFO  DistriOptimizer$:473 - [Epoch 2 88000/80668][Iteration 22][Wall Clock 11.309833195s] Epoch finished. Wall clock time is 11328.070709 ms
2021-12-14 14:04:46 ERROR RetryingBlockFetcher:155 - Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /172.30.96.12:34942
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:123)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:153)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:133)
        at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
        at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
        at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1061)
        at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:1005)
        at scala.Option.orElse(Option.scala:447)
        at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:1005)
        at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1143)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /A.B.C.D:X 
Caused by: java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)


py4j.protocol.Py4JJavaError: An error occurred while calling o88.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 851.0 failed 4 times, most recent failure: Lost task 0.3 in stage 851.0 (TID 438) (172.30.96.12 executor 1): TaskResultLost (result lost from block manager)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
        at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1449)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1422)
        at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1463)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.first(RDD.scala:1463)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer$.$anonfun$getModel$4(Topology.scala:1471)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer$.getModel(Topology.scala:1470)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:987)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1123)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:793)
        at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
        at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

Seems 172.30.96.12 is an internal ip.
Previously there is no error, not sure what has happened during the past week for the docker image and the k8s cluster.
cc @pinggao18

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions