
[BUG] Job timeout registration pathologically fails in some [databricks] CI_PART1 pipelines #12408


Closed
gerashegalov opened this issue Mar 28, 2025 · 0 comments · Fixed by #12426
Labels: bug (Something isn't working), test (Only impacts tests)

@gerashegalov (Collaborator)

Describe the bug
Some CI logs exhibit an issue where the PySpark child JVM dies; subsequent py4j calls then fail with a symptom similar to https://issues.apache.org/jira/browse/SPARK-48711:

[2025-03-28T15:44:40.154Z] request = <SubRequest 'set_spark_job_timeout' for <Function test_lru_cache_datagen>>
[2025-03-28T15:44:40.154Z] 
[2025-03-28T15:44:40.154Z]     @pytest.fixture(scope="function", autouse=True)
[2025-03-28T15:44:40.154Z]     def set_spark_job_timeout(request):
[2025-03-28T15:44:40.154Z]         # TODO dial down after identifying all long tests
[2025-03-28T15:44:40.154Z]         # and set exceptions there
[2025-03-28T15:44:40.154Z]         default_timeout_seconds = 900
[2025-03-28T15:44:40.154Z]         logger.debug("set_spark_job_timeout: BEFORE TEST\n")
[2025-03-28T15:44:40.154Z]         tm = request.node.get_closest_marker("spark_job_timeout")
[2025-03-28T15:44:40.154Z]         if tm:
[2025-03-28T15:44:40.154Z]             spark_timeout = tm.kwargs.get('seconds', default_timeout_seconds)
[2025-03-28T15:44:40.154Z]             dump_threads = tm.kwargs.get('dump_threads', True)
[2025-03-28T15:44:40.154Z]         else:
[2025-03-28T15:44:40.154Z]             spark_timeout = default_timeout_seconds
[2025-03-28T15:44:40.154Z]             dump_threads = True
[2025-03-28T15:44:40.154Z]         # before the test
[2025-03-28T15:44:40.154Z]         hung_job_listener = (
[2025-03-28T15:44:40.154Z] >         _spark._jvm.org.apache.spark.rapids.tests.TimeoutSparkListener(
[2025-03-28T15:44:40.154Z]               _spark._jsc,
[2025-03-28T15:44:40.154Z]               spark_timeout,
[2025-03-28T15:44:40.154Z]               dump_threads)
[2025-03-28T15:44:40.154Z]         )
[2025-03-28T15:44:40.154Z] 
[2025-03-28T15:44:40.154Z] ../../src/main/python/spark_init_internal.py:275: 
[2025-03-28T15:44:40.154Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2025-03-28T15:44:40.154Z] /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1745: in __getattr__
[2025-03-28T15:44:40.154Z]     answer = self._gateway_client.send_command(
[2025-03-28T15:44:40.154Z] /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1036: in send_command
[2025-03-28T15:44:40.154Z]     connection = self._get_connection()
[2025-03-28T15:44:40.154Z] /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:284: in _get_connection
[2025-03-28T15:44:40.154Z]     connection = self._create_new_connection()
[2025-03-28T15:44:40.154Z] /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py:291: in _create_new_connection
[2025-03-28T15:44:40.154Z]     connection.connect_to_java_server()
[2025-03-28T15:44:40.154Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2025-03-28T15:44:40.154Z] 
[2025-03-28T15:44:40.154Z] self = <py4j.clientserver.ClientServerConnection object at 0x7f24a5be1a80>
[2025-03-28T15:44:40.154Z] 
[2025-03-28T15:44:40.154Z]     def connect_to_java_server(self):
[2025-03-28T15:44:40.154Z]         try:
[2025-03-28T15:44:40.154Z]             self.socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
[2025-03-28T15:44:40.154Z]             if self.java_parameters.read_timeout:
[2025-03-28T15:44:40.154Z]                 self.socket.settimeout(self.java_parameters.read_timeout)
[2025-03-28T15:44:40.155Z]             if self.ssl_context:
[2025-03-28T15:44:40.155Z]                 self.socket = self.ssl_context.wrap_socket(
[2025-03-28T15:44:40.155Z]                     self.socket, server_hostname=self.java_address)
[2025-03-28T15:44:40.155Z] >           self.socket.connect((self.java_address, self.java_port))
[2025-03-28T15:44:40.155Z] E           ConnectionRefusedError: [Errno 111] Connection refused

Steps/Code to reproduce bug
The failure occurs fairly consistently on CI, but there is no known manual reproduction at this time.

Expected behavior
Since issues like SPARK-48711 are hardly preventable, the timeout registration code should only do best-effort registration and ignore (just log) any exceptions.
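
A minimal sketch of what such best-effort registration could look like, based on the fixture shown in the traceback above (this assumes `_spark` and `logger` are in scope in spark_init_internal.py, as the traceback suggests; the actual fix in #12426 may differ):

    @pytest.fixture(scope="function", autouse=True)
    def set_spark_job_timeout(request):
        default_timeout_seconds = 900
        tm = request.node.get_closest_marker("spark_job_timeout")
        if tm:
            spark_timeout = tm.kwargs.get('seconds', default_timeout_seconds)
            dump_threads = tm.kwargs.get('dump_threads', True)
        else:
            spark_timeout = default_timeout_seconds
            dump_threads = True
        try:
            # Best-effort: if the child JVM already died (SPARK-48711), the py4j
            # call below raises (e.g. ConnectionRefusedError); log it and let the
            # test proceed instead of failing in fixture setup.
            hung_job_listener = _spark._jvm.org.apache.spark.rapids.tests.TimeoutSparkListener(
                _spark._jsc,
                spark_timeout,
                dump_threads)
        except Exception as e:
            hung_job_listener = None
            logger.warning("set_spark_job_timeout: best-effort registration failed: %s", e)
        yield
        # Teardown (removing the listener) is omitted here and would need the same
        # best-effort exception handling.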

Environment details
Databricks nightly CI

Additional context
#12346
