[7.2.0-rc1] Bazel crashed due to an internal error: RejectedExecutionException #22393
Wasn't |
Nevermind, my bad. I didn't realize this occurred with 7.2.0-rc1. Let's see what @Wyverald has to say then! |
@bazel-io fork 7.2.0 |
Do you see this consistently? I'm asking because the only way I see this happening is that, between line 153 (which sets workerFuture to null and creates a fresh workerExecutorService) and line 157 (which checks if workerFuture is null), the workerExecutorService got shut down because of memory pressure. That seems extremely unlikely to me, so if you're seeing this more than once, I probably am seriously misunderstanding how this code works.
Either way, this issue raises the priority of resolving the TODO on line 148. If we're allowed to acquire a lock in |
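For context, RejectedExecutionException is exactly what java.util.concurrent throws when a task is submitted to an ExecutorService that has already been shut down. A minimal sketch of the suspected race (illustration only, not the actual Bazel code; the names merely echo the ones discussed above):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Minimal illustration: if the executor backing the worker is shut down
// (e.g. by a memory pressure handler) between the null check and the submit
// call, the submit throws RejectedExecutionException -- the crash in the title.
public class RejectedExecutionDemo {
  public static void main(String[] args) {
    ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();
    // Simulates another thread reacting to memory pressure:
    workerExecutorService.shutdownNow();
    try {
      workerExecutorService.submit(() -> System.out.println("fetching repo"));
    } catch (RejectedExecutionException e) {
      System.out.println("submit after shutdown: " + e);
    }
  }
}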
The problem is not seen consistently - only seen on about 1 build out of 6. FWIW it happened on a particular builder type only (as opposed to the few other, similar but slightly different, ChromeOS builders), but even there it also succeeded once. So it's occurring flakily. |
Thanks. But did it occur more than once? Just a singular flaky failure I could understand, but if this occurred even twice, I'd feel a bit alarmed. |
It was twice (i.e., twice in a row on a particular build type). |
Thanks. I'll look at resolving the TODO on line 148 for 7.2.0rc2 then. |
Taking about a zillionth look at the code: in what case is it possible that the
But then why the check? Looking at the code, I can imagine one simple way in which this crash can happen: when the

@Wyverald -- I'd be alarmed even if there was a single occurrence. Why do you think one case is not a reason to be concerned? (there is also #22215 as an alternative) |
@emaxx-google do you have a way of testing experimental Bazel versions for the occurrence of this bug? It would be nice if we could "battle-test" changes to fix this issue with a quicker round-trip time than a week or so. |
I haven't managed to repro it locally yet, so, so far, the way to reproduce or test it is uploading a ChromeOS CL that changes |
Changing
For a bit more certainty, @Wyverald could add a very conspicuous message on the console to make sure that your commit queue got the right version. |
That's a very good point! I for some reason got it in my head that on high memory pressure we just But in any case, what I was referring to was the case where a memory pressure listener triggered and called
What you said could indeed happen (and is a good argument against using recursion here, I see now... because
That SGTM -- I'm on it. |
In all fairness, I had to look at the code and then the documentation of

One possible simple fix would be to make it so that shutting down the worker thread is either deferred or outright not done (I didn't think it through) when the pertinent
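A rough sketch of what this "deferred shutdown" idea could look like (purely hypothetical; neither the names nor the structure come from the actual Bazel code): track whether the owning SkyFunction is currently using the worker executor, and if a memory-pressure cleanup arrives in the meantime, only record that a shutdown is pending and perform it once the function is done.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: the worker executor is only shut down when no
// SkyFunction invocation is using it; a cleanup request that arrives
// mid-invocation is remembered and applied afterwards.
class DeferredShutdownState {
  private final ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();
  private boolean inUse = false;
  private boolean shutdownRequested = false;

  synchronized ExecutorService acquire() {
    inUse = true;
    return workerExecutorService;
  }

  synchronized void release() {
    inUse = false;
    if (shutdownRequested) {
      workerExecutorService.shutdownNow();
    }
  }

  // Called by the memory-pressure listener instead of shutting down directly.
  synchronized void requestShutdown() {
    shutdownRequested = true;
    if (!inUse) {
      workerExecutorService.shutdownNow();
    }
  }
}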
Indeed! I missed that bit. I don't have a theory as to what could be the problem then, but the stack trace does show that this crash is in a recursive

It's suspicious, though, because the symptom matches pretty well with reusing a closed state object: if

Absent any other theories, add logging, ask @emaxx-google to try it out?
|
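To make the "reusing a closed state object" theory concrete, here is a small hypothetical sketch (invented names, not the actual SkyKeyComputeState implementation): if close() shuts down the executor owned by the state object and the same instance is later handed out again, the very next submit fails exactly like the stack trace shows.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: once close() has shut down the executor owned by this
// state object, any later reuse of the same instance fails on submit with
// RejectedExecutionException.
class RepoFetchState implements AutoCloseable {
  private final ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();

  void startWorker() {
    workerExecutorService.submit(() -> System.out.println("fetching repo"));
  }

  @Override
  public void close() {
    workerExecutorService.shutdownNow();
  }
}

class ReuseAfterCloseDemo {
  public static void main(String[] args) {
    RepoFetchState state = new RepoFetchState();
    state.close(); // e.g. triggered by a memory pressure listener
    state.startWorker(); // throws RejectedExecutionException
  }
}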
@emaxx-google please try setting |
Hmm, I'm afraid the build via this version hung: it's been running on one of the bots for more than 4 hours without visible activity, meanwhile similar bots finished building within ~50 minutes. Excerpt from the logs:
(presumably, it's been sitting there for hours - all other timestamps in the build log fit within 10 minutes) |
@emaxx-google can you get us a stack trace? ( |
@emaxx-google Can you try your CQ with version |
Sure, I'll look into both ideas: getting the stack trace on the first commit (someone on the team might have the necessary permissions) and trying the second commit. BTW, given that digging information out from ChromeOS Commit Queue attempts is time-consuming and cumbersome, any ideas on how to get a local repro? I already tried running the same Bazel commands on the same checkout locally, without luck so far. The workstation hardware is very different from the build bots though. Should I just try |
Here is the stack trace from the java server. |
I've tested this as well - this failed with an exception:
|
Thank you all for the update and logs. I'll look at them now.
Not from me, sorry -- I'd eagerly welcome any ideas here, too :) |
I don't have any suggestion other than to rig up a stress test (while true do run Bazel build and do mean things to it; done) and see if it breaks. |
Unfortunately, |
Here you go: threads.log |
Thanks! This new log suggests that the worker thread is already dead. I have an idea for a fix; will update this thread with the Bazel commit when it's ready. |
Please try |
Dumping my notes before I sign off for the day:
Will continue digging tomorrow. |
@emaxx-google Can you try

What I would do if I had more than 20 minutes for this today is to rig up a test interface to Bazel using its gRPC server that allows one to trigger a Skyframe state cleanup at will, then run a stress-test build with many repositories with many Skyframe restarts each in a loop to see what would happen. If our theory is correct it should tickle the bug we are fighting with pretty reliably.

@Wyverald if you feel adventurous: you'd need to add a new method in |
I have a reproduction! It's quite involved, but I managed to reproduce it at least once, although not as a test case for the time being because it requires a temporary debug interface into Bazel.

1. Patch this diff into Bazel

This adds a debug command that posts a synthetic memory pressure event to the currently running command's event bus.

diff --git a/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java b/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java
index b20a1bb41b..8d5219254b 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java
@@ -73,6 +73,7 @@ import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
+import java.io.PrintStream;
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashSet;
@@ -100,6 +101,8 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
private static final ImmutableSet<String> ALL_HELP_OPTIONS =
ImmutableSet.of("--help", "-help", "-h");
+ private CommandEnvironment currentCommandEnv;
+
private final BlazeRuntime runtime;
private final int serverPid;
private final BugReporter bugReporter;
@@ -144,6 +147,27 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
this.commandLock = new Object();
}
+ private void debug(OutErr outErr, List<String> args) {
+ try (PrintStream ps = new PrintStream(outErr.getErrorStream())) {
+ CommandEnvironment env = currentCommandEnv;
+ if (env == null) {
+ ps.println("No executing command");
+ return;
+ }
+
+ ps.println("Sending memory pressure event");
+ ps.flush();
+ MemoryPressureEvent event = MemoryPressureEvent.newBuilder()
+ .setTenuredSpaceMaxBytes(Runtime.getRuntime().maxMemory())
+ .setTenuredSpaceUsedBytes(Runtime.getRuntime().maxMemory())
+ .setWasFullGc(true)
+ .setWasManualGc(false)
+ .build();
+ env.getEventBus().post(event);
+ ps.println("Memory pressure event sent");
+ }
+ }
+
@Override
public BlazeCommandResult exec(
InvocationPolicy invocationPolicy,
@@ -162,6 +186,11 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
String commandName = args.get(0);
+ if (commandName.equals("debug")) {
+ debug(outErr, args);
+ return BlazeCommandResult.success();
+ }
+
// Be gentle to users who want to find out about Blaze invocation.
if (ALL_HELP_OPTIONS.contains(commandName)) {
commandName = "help";
@@ -404,6 +433,8 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
Reporter reporter = env.getReporter();
OutErr.SystemPatcher systemOutErrPatcher = reporter.getOutErr().getSystemPatcher();
try {
+ currentCommandEnv = env;
+
// Temporary: there are modules that output events during beforeCommand, but the reporter
// isn't setup yet. Add the stored event handler to catch those events.
reporter.addHandler(storedEventHandler);
@@ -705,6 +736,7 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
result = BlazeCommandResult.createShutdown(crash);
return result;
} finally {
+ currentCommandEnv = null;
if (needToCallAfterCommand) {
BlazeCommandResult newResult = runtime.afterCommand(env, result);
if (!newResult.equals(result)) {
diff --git a/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java b/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java
index c7c21960e6..ce36f7558d 100644
--- a/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java
+++ b/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java
@@ -641,6 +641,7 @@ public class GrpcServerImpl extends CommandServerGrpc.CommandServerImplBase impl
((ServerCallStreamObserver<RunResponse>) observer);
BlockingStreamObserver<RunResponse> blockingStreamObserver =
new BlockingStreamObserver<>(serverCallStreamObserver);
+
commandExecutorPool.execute(() -> executeCommand(request, blockingStreamObserver));
} 2. Create files in the local file system for Bazel to read
3. Create a set of repository rules to read these files

This creates 100 repository rules that do 100 Skyframe restarts each, thereby giving ample opportunity to test the interaction of memory pressure events with Starlark repositories.

cat > MODULE.bazel <<'EOF'
r = use_repo_rule("//:r.bzl", "r")
[r(name="r_" + str(i), index=i, dir="/tmp/d_" + str(i)) for i in range(0, 100)]
EOF
cat > r.bzl <<'EOF'
def _r_impl(rctx):
    d = rctx.path(rctx.attr.dir)
    contents = ""
    for f in sorted(d.readdir(), key = lambda p: p.basename):
        c = rctx.read(f)
        print("read %s: %s" % (f.realpath, c))
        contents += c
    rctx.file("BUILD", "filegroup(name='f', visibility=['//visibility:public'])")

r = repository_rule(
    implementation = _r_impl,
    attrs = {"index": attr.int(), "dir": attr.string()},
)
EOF
cat > BUILD.bazel <<'EOF'
genrule(
    name = "g",
    srcs = ["@r_%d//:f" % i for i in range(0, 100)],
    outs = ["go"],
    cmd = "sleep 1; echo O > $@",
)
genrule(
    name = "w",
    srcs = [],
    outs = ["wo"],
    cmd = "sleep 3600; echo W > $@",
)
EOF

Run these two commands at the same time

This makes Bazel fetch those 100 repositories in parallel while sending many memory pressure events. You have to be pretty quick to start the
If the reproduction does not work, you have to run

The results
|
Thanks Lukács for the repro tips. I think I finally have a solution that fixes this deadlock. (There's an internal CL going on right now; it's basically this commit da74178)

Description of the fix

To summarize, 7f2a9a8 was correct in its analysis (describing how fc930be caused the deadlock); pasting the commit message here:
But as I suspected in the comment yesterday:
I was going to do something crazy like having a semaphore for each repo name to ensure that only one "actor" is alive... but the central state cache was already doing that, before we started evicting entries! The trick, then, is to do the

Verifying the fix works

Thanks to @ismell, I was able to secure ACLs to run CI on ChromeOS+Bazel myself. The fix indeed worked! (although I only had time to do one run) I also tried @lberki's repro instructions, with some changes:

These two together make me feel much better about the chances of this being an actual fix. |
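For readers following along, here is a rough, hypothetical sketch of the pattern the fix describes (simplified names, not the actual StarlarkRepositoryFunction code): the worker future is only mutated while holding the state's lock, the memory-pressure cleanup cancels and shuts everything down under that same lock, and the "retry on cancellation" path is a while-loop rather than a recursive call.

import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: plain synchronization around workerFuture plus a
// retry loop instead of recursion.
class WorkerState {
  private ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();
  private Future<String> workerFuture;

  synchronized Future<String> getOrStartWorker() {
    if (workerFuture == null) {
      if (workerExecutorService.isShutdown()) {
        // A memory-pressure cleanup closed us; start over with a fresh pool.
        workerExecutorService = Executors.newSingleThreadExecutor();
      }
      workerFuture = workerExecutorService.submit(() -> "fetched repo contents");
    }
    return workerFuture;
  }

  synchronized void close() {
    if (workerFuture != null) {
      workerFuture.cancel(true); // pending get() calls now throw CancellationException
    }
    workerExecutorService.shutdownNow();
    workerFuture = null;
  }

  String fetch() throws Exception {
    while (true) {
      try {
        return getOrStartWorker().get();
      } catch (CancellationException e) {
        // The worker was cancelled (e.g. by a memory-pressure cleanup);
        // reset and retry with a loop rather than a recursive call.
        synchronized (this) {
          workerFuture = null;
        }
      }
    }
  }
}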
re: |
What I said at the time was that "I tried setting
Basically yes -- and just to emphasize it again, your stress test case also only worked with the "50" number. Higher numbers resulted in builds taking forever (just like when I was trying to build Bazel itself), and lower numbers resulted in the repro not showing up (ie. no |
This greatly simplifies the code flow. Instead of using `volatile` and resorting to some very unsavory workarounds, we can simply make sure only one thread is changing `state.workerFuture` using plain old synchronization, and on memory pressure, make absolutely sure that the state object is cleaned up after we remove it from the central state cache. This goes against the advice introduced in 8ef0a51; the wording for `SkyKeyComputeState#close()` has been updated.

Also changed the "retry on cancellation" logic from using recursion to using a `while`-loop for better clarity around nested `finally` blocks.

Fixes #22393.

PiperOrigin-RevId: 637975501
Change-Id: Ied43f0310ec8953f4ff1c2712fe07b8ccbd6c184

Commit de4d519
Co-authored-by: Googler <[email protected]>
Description of the bug:
ChromeOS Bazelification team tried to update Bazel version to 7.2.0-rc1, but the build failed due to Bazel's internal error.
CI Link: https://ci.chromium.org/ui/p/chromeos/builders/cq/brya-bazel-lite-cq/b8747820731539234353/overview (access limited to Google employees only, sorry)
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
N/A
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.2.0rc1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response

What's the output of git remote get-url origin; git rev-parse HEAD?

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
We've never seen this issue before, and our build has been stable for 2 months with Bazel 7.1.0.
Have you found anything relevant by searching the web?
N/A
Any other information, logs, or outputs that you want to share?
No response