[7.2.0-rc1] Bazel crashed due to an internal error: RejectedExecutionException #22393
Wasn't |
Nevermind, my bad. I didn't realize this occurred with 7.2.0-rc1. Let's see what @Wyverald has to say then! |
@bazel-io fork 7.2.0 |
Do you see this consistently? I'm asking because the only way I see this happening is that, between line 153 (which sets workerFuture to null and creates a fresh workerExecutorService) and line 157 (which checks if workerFuture is null), the workerExecutorService got shut down because of memory pressure. That seems extremely unlikely to me, so if you're seeing this more than once, I probably am seriously misunderstanding how this code works.
Either way, this issue raises the priority of resolving the TODO on line 148. If we're allowed to acquire a lock in |
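For context, RejectedExecutionException is exactly what java.util.concurrent throws when a task is submitted to an ExecutorService that has already been shut down. A minimal sketch of the suspected race (illustration only, not the actual Bazel code; the names merely echo the ones discussed above):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Minimal illustration: if the executor backing the worker is shut down
// (e.g. by a memory pressure handler) between the null check and the submit
// call, the submit throws RejectedExecutionException -- the crash in the title.
public class RejectedExecutionDemo {
  public static void main(String[] args) {
    ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();
    // Simulates another thread reacting to memory pressure:
    workerExecutorService.shutdownNow();
    try {
      workerExecutorService.submit(() -> System.out.println("fetching repo"));
    } catch (RejectedExecutionException e) {
      System.out.println("submit after shutdown: " + e);
    }
  }
}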
The problem is not seen consistently - only seen on about 1 build out of 6. FWIW it happened on a particular builder type only (as opposed to the few other, similar but slightly different, ChromeOS builders), but even there it also succeeded once. So it's occurring flakily. |
Thanks. But did it occur more than once? Just a singular flaky failure I could understand, but if this occurred even twice, I'd feel a bit alarmed. |
It was twice (i.e., twice in a row on a particular build type). |
Thanks. I'll look at resolving the TODO on line 148 for 7.2.0rc2 then. |
Taking about a zillionth look at the code: in what case is it possible that the
But then why the check? Looking at the code, I can imagine one simple way in which this crash can happen: when the

@Wyverald -- I'd be alarmed even if there was a single occurrence. Why do you think one case is not a reason to be concerned? (there is also #22215 as an alternative) |
@emaxx-google do you have a way of testing experimental Bazel versions for the occurrence of this bug? It would be nice if we could "battle-test" changes to fix this issue with a quicker round-trip time than a week or so. |
I haven't managed to repro it locally yet, so, so far, the way to reproduce or test it is uploading a ChromeOS CL that changes |
Changing
For a bit more certainty, @Wyverald could add a very conspicuous message on the console to make sure that your commit queue got the right version. |
That's a very good point! I for some reason got it in my head that on high memory pressure we just But in any case, what I was referring to was the case where a memory pressure listener triggered and called
What you said could indeed happen (and is a good argument against using recursion here, I see now... because
That SGTM -- I'm on it. |
In all fairness, I had to look at the code and then the documentation of

One possible simple fix would be to make it so that shutting down the worker thread is either deferred or outright not done (I didn't think it through) when the pertinent
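A rough sketch of what this "deferred shutdown" idea could look like (purely hypothetical; neither the names nor the structure come from the actual Bazel code): track whether the owning SkyFunction is currently using the worker executor, and if a memory-pressure cleanup arrives in the meantime, only record that a shutdown is pending and perform it once the function is done.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: the worker executor is only shut down when no
// SkyFunction invocation is using it; a cleanup request that arrives
// mid-invocation is remembered and applied afterwards.
class DeferredShutdownState {
  private final ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();
  private boolean inUse = false;
  private boolean shutdownRequested = false;

  synchronized ExecutorService acquire() {
    inUse = true;
    return workerExecutorService;
  }

  synchronized void release() {
    inUse = false;
    if (shutdownRequested) {
      workerExecutorService.shutdownNow();
    }
  }

  // Called by the memory-pressure listener instead of shutting down directly.
  synchronized void requestShutdown() {
    shutdownRequested = true;
    if (!inUse) {
      workerExecutorService.shutdownNow();
    }
  }
}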
Indeed! I missed that bit. I don't have a theory as to what could be the problem then, but the stack trace does show that this crash is in a recursive

It's suspicious, though, because the symptom matches pretty well with reusing a closed state object: if

Absent any other theories, add logging, ask @emaxx-google to try it out?
|
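To make the "reusing a closed state object" theory concrete, here is a small hypothetical sketch (invented names, not the actual SkyKeyComputeState implementation): if close() shuts down the executor owned by the state object and the same instance is later handed out again, the very next submit fails exactly like the stack trace shows.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: once close() has shut down the executor owned by this
// state object, any later reuse of the same instance fails on submit with
// RejectedExecutionException.
class RepoFetchState implements AutoCloseable {
  private final ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();

  void startWorker() {
    workerExecutorService.submit(() -> System.out.println("fetching repo"));
  }

  @Override
  public void close() {
    workerExecutorService.shutdownNow();
  }
}

class ReuseAfterCloseDemo {
  public static void main(String[] args) {
    RepoFetchState state = new RepoFetchState();
    state.close(); // e.g. triggered by a memory pressure listener
    state.startWorker(); // throws RejectedExecutionException
  }
}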
@emaxx-google please try setting |
Hmm, I'm afraid the build via this version hung: it's been running on one of the bots for more than 4 hours without visible activity, meanwhile similar bots finished building within ~50 minutes. Excerpt from the logs:
(presumably, it's been sitting there for hours - all other timestamps in the build log fit within 10 minutes) |
@emaxx-google can you get us a stack trace? ( |
@emaxx-google Can you try your CQ with version |
Sure, I'll look into both ideas: getting the stack trace on the first commit (someone on the team might have the necessary permissions) and trying the second commit. BTW, given that digging information out from ChromeOS Commit Queue attempts is time-consuming and cumbersome, any ideas on how to get a local repro? I already tried running the same Bazel commands on the same checkout locally, without luck so far. The workstation hardware is very different from the build bots though. Should I just try |
Here is the stack trace from the java server. |
I've tested this as well - this failed with an exception:
|
Thank you all for the update and logs. I'll look at them now.
Not from me, sorry -- I'd eagerly welcome any ideas here, too :) |
I don't have any suggestion other than to rig up a stress test (while true do run Bazel build and do mean things to it; done) and see if it breaks. |
Unfortunately, |
Here you go: threads.log |
Thanks! This new log suggests that the worker thread is already dead. I have an idea for a fix; will update this thread with the Bazel commit when it's ready. |
Please try |
Dumping my notes before I sign off for the day:
Will continue digging tomorrow. |
@emaxx-google Can you try

What I would do if I had more than 20 minutes for this today is to rig up a test interface to Bazel using its gRPC server that allows one to trigger a Skyframe state cleanup at will, then run a stress-test build with many repositories with many Skyframe restarts each in a loop to see what would happen. If our theory is correct it should tickle the bug we are fighting with pretty reliably.

@Wyverald if you feel adventurous: you'd need to add a new method in |
I have a reproduction! It's quite involved, but I managed to reproduce it at least once, although not as a test case for the time being because it requires a temporary debug interface into Bazel.

1. Patch this diff into Bazel

This adds a debug command that posts a synthetic memory pressure event to the currently running command's event bus.

diff --git a/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java b/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java
index b20a1bb41b..8d5219254b 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/BlazeCommandDispatcher.java
@@ -73,6 +73,7 @@ import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
+import java.io.PrintStream;
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashSet;
@@ -100,6 +101,8 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
private static final ImmutableSet<String> ALL_HELP_OPTIONS =
ImmutableSet.of("--help", "-help", "-h");
+ private CommandEnvironment currentCommandEnv;
+
private final BlazeRuntime runtime;
private final int serverPid;
private final BugReporter bugReporter;
@@ -144,6 +147,27 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
this.commandLock = new Object();
}
+ private void debug(OutErr outErr, List<String> args) {
+ try (PrintStream ps = new PrintStream(outErr.getErrorStream())) {
+ CommandEnvironment env = currentCommandEnv;
+ if (env == null) {
+ ps.println("No executing command");
+ return;
+ }
+
+ ps.println("Sending memory pressure event");
+ ps.flush();
+ MemoryPressureEvent event = MemoryPressureEvent.newBuilder()
+ .setTenuredSpaceMaxBytes(Runtime.getRuntime().maxMemory())
+ .setTenuredSpaceUsedBytes(Runtime.getRuntime().maxMemory())
+ .setWasFullGc(true)
+ .setWasManualGc(false)
+ .build();
+ env.getEventBus().post(event);
+ ps.println("Memory pressure event sent");
+ }
+ }
+
@Override
public BlazeCommandResult exec(
InvocationPolicy invocationPolicy,
@@ -162,6 +186,11 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
String commandName = args.get(0);
+ if (commandName.equals("debug")) {
+ debug(outErr, args);
+ return BlazeCommandResult.success();
+ }
+
// Be gentle to users who want to find out about Blaze invocation.
if (ALL_HELP_OPTIONS.contains(commandName)) {
commandName = "help";
@@ -404,6 +433,8 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
Reporter reporter = env.getReporter();
OutErr.SystemPatcher systemOutErrPatcher = reporter.getOutErr().getSystemPatcher();
try {
+ currentCommandEnv = env;
+
// Temporary: there are modules that output events during beforeCommand, but the reporter
// isn't setup yet. Add the stored event handler to catch those events.
reporter.addHandler(storedEventHandler);
@@ -705,6 +736,7 @@ public class BlazeCommandDispatcher implements CommandDispatcher {
result = BlazeCommandResult.createShutdown(crash);
return result;
} finally {
+ currentCommandEnv = null;
if (needToCallAfterCommand) {
BlazeCommandResult newResult = runtime.afterCommand(env, result);
if (!newResult.equals(result)) {
diff --git a/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java b/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java
index c7c21960e6..ce36f7558d 100644
--- a/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java
+++ b/src/main/java/com/google/devtools/build/lib/server/GrpcServerImpl.java
@@ -641,6 +641,7 @@ public class GrpcServerImpl extends CommandServerGrpc.CommandServerImplBase impl
((ServerCallStreamObserver<RunResponse>) observer);
BlockingStreamObserver<RunResponse> blockingStreamObserver =
new BlockingStreamObserver<>(serverCallStreamObserver);
+
commandExecutorPool.execute(() -> executeCommand(request, blockingStreamObserver));
} 2. Create files in the local file system for Bazel to read
3. Create a set of repository rules to read these files

This creates 100 repository rules that do 100 Skyframe restarts each, thereby giving ample opportunity to test the interaction of memory pressure events with Starlark repositories.

cat > MODULE.bazel <<'EOF'
r = use_repo_rule("//:r.bzl", "r")
[r(name="r_" + str(i), index=i, dir="/tmp/d_" + str(i)) for i in range(0, 100)]
EOF
cat > r.bzl <<'EOF'
def _r_impl(rctx):
    d = rctx.path(rctx.attr.dir)
    contents = ""
    for f in sorted(d.readdir(), key = lambda p: p.basename):
        c = rctx.read(f)
        print("read %s: %s" % (f.realpath, c))
        contents += c
    rctx.file("BUILD", "filegroup(name='f', visibility=['//visibility:public'])")

r = repository_rule(
    implementation = _r_impl,
    attrs = {"index": attr.int(), "dir": attr.string()},
)
EOF
cat > BUILD.bazel <<'EOF'
genrule(
    name = "g",
    srcs = ["@r_%d//:f" % i for i in range(0, 100)],
    outs = ["go"],
    cmd = "sleep 1; echo O > $@",
)
genrule(
    name = "w",
    srcs = [],
    outs = ["wo"],
    cmd = "sleep 3600; echo W > $@",
)
EOF

Run these two commands at the same time

This makes Bazel fetch those 100 repositories in parallel while sending many memory pressure events. You have to be pretty quick to start the
If the reproduction does not work, you have to run

The results
|
Thanks Lukács for the repro tips. I think I finally have a solution that fixes this deadlock. (There's an internal CL going on right now; it's basically this commit da74178)

Description of the fix

To summarize, 7f2a9a8 was correct in its analysis (describing how fc930be caused the deadlock); pasting the commit message here:
But as I suspected in the comment yesterday:
I was going to do something crazy like having a semaphore for each repo name to ensure that only one "actor" is alive... but the central state cache was already doing that, before we started evicting entries! The trick, then, is to do the

Verifying the fix works

Thanks to @ismell, I was able to secure ACLs to run CI on ChromeOS+Bazel myself. The fix indeed worked! (although I only had time to do one run) I also tried @lberki's repro instructions, with some changes:

These two together make me feel much better about the chances of this being an actual fix. |
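For readers following along, here is a rough, hypothetical sketch of the pattern the fix describes (simplified names, not the actual StarlarkRepositoryFunction code): the worker future is only mutated while holding the state's lock, the memory-pressure cleanup cancels and shuts everything down under that same lock, and the "retry on cancellation" path is a while-loop rather than a recursive call.

import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: plain synchronization around workerFuture plus a
// retry loop instead of recursion.
class WorkerState {
  private ExecutorService workerExecutorService = Executors.newSingleThreadExecutor();
  private Future<String> workerFuture;

  synchronized Future<String> getOrStartWorker() {
    if (workerFuture == null) {
      if (workerExecutorService.isShutdown()) {
        // A memory-pressure cleanup closed us; start over with a fresh pool.
        workerExecutorService = Executors.newSingleThreadExecutor();
      }
      workerFuture = workerExecutorService.submit(() -> "fetched repo contents");
    }
    return workerFuture;
  }

  synchronized void close() {
    if (workerFuture != null) {
      workerFuture.cancel(true); // pending get() calls now throw CancellationException
    }
    workerExecutorService.shutdownNow();
    workerFuture = null;
  }

  String fetch() throws Exception {
    while (true) {
      try {
        return getOrStartWorker().get();
      } catch (CancellationException e) {
        // The worker was cancelled (e.g. by a memory-pressure cleanup);
        // reset and retry with a loop rather than a recursive call.
        synchronized (this) {
          workerFuture = null;
        }
      }
    }
  }
}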
re: |
What I said at the time was that "I tried setting
Basically yes -- and just to emphasize it again, your stress test case also only worked with the "50" number. Higher numbers resulted in builds taking forever (just like when I was trying to build Bazel itself), and lower numbers resulted in the repro not showing up (ie. no |
This greatly simplifies the code flow. Instead of using `volatile` and resorting to some very unsavory workarounds, we can simply make sure only one thread is changing `state.workerFuture` using plain old synchronization, and on memory pressure, make absolutely sure that the state object is cleaned up after we remove it from the central state cache. This goes against the advice introduced in 8ef0a51; the wording for `SkyKeyComputeState#close()` has been updated.

Also changed the "retry on cancellation" logic from using recursion to using a `while`-loop for better clarity around nested `finally` blocks.

Fixes #22393.

PiperOrigin-RevId: 637975501
Change-Id: Ied43f0310ec8953f4ff1c2712fe07b8ccbd6c184

Commit de4d519
Co-authored-by: Googler <[email protected]>
Description of the bug:
ChromeOS Bazelification team tried to update Bazel version to 7.2.0-rc1, but the build failed due to Bazel's internal error.
CI Link: https://ci.chromium.org/ui/p/chromeos/builders/cq/brya-bazel-lite-cq/b8747820731539234353/overview (access limited to Google employees only, sorry)
Which category does this issue belong to?
Core
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
N/A
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.2.0rc1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response

What's the output of git remote get-url origin; git rev-parse HEAD?

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
We've never seen this issue before, and our build has been stable for 2 months with Bazel 7.1.0.
Have you found anything relevant by searching the web?
N/A
Any other information, logs, or outputs that you want to share?
No response