Skip to content

Commit be441e8

Browse files
committed
[SPARK-35714][FOLLOW-UP][CORE] WorkerWatcher should run System.exit in a thread out of RpcEnv
### What changes were proposed in this pull request? This PR proposes to let `WorkerWatcher` run `System.exit` in a separate thread instead of some thread of `RpcEnv`. ### Why are the changes needed? `System.exit` will trigger the shutdown hook to run `executor.stop`, which will result in the same deadlock issue with SPARK-14180. But note that since Spark upgrades to Hadoop 3 recently, each hook now will have a [timeout threshold](https://github.com/apache/hadoop/blob/d4794dd3b2ba365a9d95ad6aafcf43a1ea40f777/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ShutdownHookManager.java#L205-L209) which forcibly interrupt the hook execution once reaches timeout. So, the deadlock issue doesn't really exist in the master branch. However, it's still critical for previous releases and is a wrong behavior that should be fixed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tested manually. Closes #35069 from Ngone51/fix-workerwatcher-exit. Authored-by: yi.wu <[email protected]> Signed-off-by: yi.wu <[email protected]> (cherry picked from commit 639d6f4) Signed-off-by: yi.wu <[email protected]>
1 parent 3aaf722 commit be441e8

File tree

1 file changed

+6
-2
lines changed

1 file changed

+6
-2
lines changed

core/src/main/scala/org/apache/spark/deploy/worker/WorkerWatcher.scala

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -54,8 +54,12 @@ private[spark] class WorkerWatcher(
5454
if (isTesting) {
5555
isShutDown = true
5656
} else if (isChildProcessStopping.compareAndSet(false, true)) {
57-
// SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock
58-
System.exit(-1)
57+
// SPARK-35714: avoid the duplicate call of `System.exit` to avoid the dead lock.
58+
// Same as SPARK-14180, we should run `System.exit` in a separate thread to avoid
59+
// dead lock since `System.exit` will trigger the shutdown hook of `executor.stop`.
60+
new Thread("WorkerWatcher-exit-executor") {
61+
override def run(): Unit = System.exit(-1)
62+
}.start()
5963
}
6064

6165
override def receive: PartialFunction[Any, Unit] = {

0 commit comments

Comments
 (0)