Commit a698bef

Support for cancellation in WorkRequestHandler.
To actually use cancellation, a worker implementation still has to implement a cancellation callback that really cancels the work and add `supports-worker-cancellation = 1` to the execution requirements, and the build must then run with `--experimental_worker_cancellation`.

Cancellation design doc: https://docs.google.com/document/d/1-h4gcBV8Jn6DK9G_e23kZQ159jmX__uckhub1Gv9dzc

RELNOTES: None.

PiperOrigin-RevId: 373749452
1 parent c366d30 commit a698bef

5 files changed, with 473 additions and 65 deletions

site/docs/creating-workers.md

Lines changed: 45 additions & 0 deletions
@@ -72,6 +72,51 @@ id, so the request id must be specified if it is nonzero. This is a valid
7272
}
7373
```
7474

75+
A `request_id` of 0 indicates a "singleplex" request, i.e. this request cannot
76+
be processed in parallel with other requests. The server guarantees that a
77+
given worker receives requests with either only `request_id` 0 or only
78+
`request_id` greater than zero. Singleplex requests are sent in serial, i.e. the
79+
server doesn't send another request until it has received a response (except
80+
for cancel requests, see below).
81+
82+
**Notes**
83+
84+
* Each protocol buffer is preceded by its length in `varint` format (see
85+
[`MessageLite.writeDelimitedTo()`](https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite.html#writeDelimitedTo-java.io.OutputStream-)).
86+
* JSON requests and responses are not preceded by a size indicator.
87+
* JSON requests follow the same structure as the protobuf, but use standard
88+
JSON.
89+
* Bazel stores requests as protobufs and converts them to JSON using
90+
[protobuf's JSON format](https://cs.opensource.google/protobuf/protobuf/+/master:java/util/src/main/java/com/google/protobuf/util/JsonFormat.java).
91+
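The first note above says that each protocol buffer is preceded by its length as a `varint`. As a rough illustration of what that framing looks like on the wire, here is a minimal sketch in plain Java that does not use the protobuf library. `VarintFraming` and its methods are hypothetical names made up for this example; a real worker would normally rely on `writeDelimitedTo()` and the matching delimited parse call instead.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper illustrating varint length-prefixed framing; not part of Bazel. */
final class VarintFraming {
  private VarintFraming() {}

  /** Encodes a non-negative int as a base-128 varint, least significant group first. */
  static byte[] encodeVarint(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
      value >>>= 7;
    }
    out.write(value); // last group, continuation bit clear
    return out.toByteArray();
  }

  /** Returns the payload preceded by its length encoded as a varint. */
  static byte[] frame(byte[] payload) {
    byte[] prefix = encodeVarint(payload.length);
    byte[] framed = new byte[prefix.length + payload.length];
    System.arraycopy(prefix, 0, framed, 0, prefix.length);
    System.arraycopy(payload, 0, framed, prefix.length, payload.length);
    return framed;
  }

  /** Decodes the varint that starts at the beginning of {@code data}. */
  static int decodeVarint(byte[] data) {
    int result = 0;
    for (int i = 0, shift = 0; i < data.length && shift < 32; i++, shift += 7) {
      result |= (data[i] & 0x7F) << shift;
      if ((data[i] & 0x80) == 0) {
        return result;
      }
    }
    throw new IllegalArgumentException("Malformed or truncated varint");
  }
}
```

For example, a 300-byte message gets the two-byte prefix `0xAC 0x02`, so the framed message is 302 bytes long. JSON requests have no such prefix.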
92+
### Cancellation
93+
94+
Workers can optionally allow work requests to be cancelled before they finish.
95+
This is particularly useful in connection with dynamic execution, where local
96+
execution can regularly be interrupted by a faster remote execution. To allow
97+
cancellation, add `supports-worker-cancellation: 1` to the
98+
`execution-requirements` field (see below) and set the
99+
`--experimental_worker_cancellation` flag.
100+
101+
A **cancel request** is a `WorkRequest` with the `cancel` field set (and
102+
similarly a **cancel response** is a `WorkResponse` with the `was_cancelled`
103+
field set). The only other field that must be in a cancel request or cancel
104+
response is `request_id`, indicating which
105+
request to cancel. The `request_id` field will be 0 for singleplex workers
106+
or the non-0 `request_id` of a previously sent `WorkRequest` for multiplex
107+
workers. The server may send cancel requests for requests that the worker has
108+
already responded to, in which case the cancel request must be ignored.
109+
110+
Each non-cancel `WorkRequest` message must be answered exactly once, whether
111+
or not it was cancelled. Once the server has sent a cancel request, the worker
112+
may respond with a `WorkResponse` with the `request_id` set
113+
and the `was_cancelled` field set to true. Sending a regular `WorkResponse`
114+
is also accepted, but the `output` and `exit_code` fields will be ignored.
115+
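The exactly-once rule means that the thread finishing the work and the code handling a cancel request compete for the right to send the single response. The following is a minimal sketch of one way to arbitrate that, assuming a worker that tracks unanswered request IDs in a concurrent set. `ResponseLedger` and its method names are hypothetical and are not part of Bazel; the `WorkRequestHandler` in this commit gets a similar effect from `RequestInfo.takeBuilder()`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical bookkeeping for the "answer exactly once" rule; not Bazel code. */
final class ResponseLedger {
  /** IDs of requests that have been received but not yet answered. */
  private final Set<Integer> unanswered = ConcurrentHashMap.newKeySet();

  /** Records a newly received non-cancel request. */
  void onWorkRequest(int requestId) {
    unanswered.add(requestId);
  }

  /**
   * Atomically claims the right to answer a request. Returns true for exactly one caller per
   * request. Returns false if the request was already answered or was never seen, which is the
   * case where a late cancel request must simply be ignored.
   */
  boolean claimResponse(int requestId) {
    return unanswered.remove(requestId);
  }
}
```

Both the normal completion path and the cancel path would call `claimResponse` before writing anything: the winner sends either the regular response or one with `was_cancelled` set, and the loser sends nothing.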
116+
Once a response has been sent for a `WorkRequest`, the worker must not touch
117+
the files in its working directory. The server is free to clean up the files,
118+
including temporary files.
119+
75120
## Making the rule that uses the worker
76121

77122
You'll also need to create a rule that generates actions to be performed by the

src/main/java/com/google/devtools/build/lib/worker/WorkRequestHandler.java

Lines changed: 132 additions & 52 deletions
@@ -13,7 +13,6 @@
1313
// limitations under the License.
1414
package com.google.devtools.build.lib.worker;
1515

16-
1716
import com.google.common.annotations.VisibleForTesting;
1817
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkRequest;
1918
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;
@@ -24,13 +23,12 @@
2423
import java.io.StringWriter;
2524
import java.lang.management.ManagementFactory;
2625
import java.time.Duration;
27-
import java.util.ArrayDeque;
2826
import java.util.List;
29-
import java.util.Map;
3027
import java.util.Optional;
31-
import java.util.Queue;
3228
import java.util.concurrent.ConcurrentHashMap;
29+
import java.util.concurrent.ConcurrentMap;
3330
import java.util.concurrent.atomic.AtomicReference;
31+
import java.util.function.BiConsumer;
3432
import java.util.function.BiFunction;
3533

3634
/**
@@ -56,13 +54,31 @@ public interface WorkerMessageProcessor {
5654

5755
/** Holds information necessary to properly handle a request, especially for cancellation. */
5856
static class RequestInfo {
57+
/** The thread handling the request. */
58+
final Thread thread;
59+
/** If true, we have received a cancel request for this request. */
60+
private boolean cancelled;
5961
/**
6062
* The builder for the response to this request. Since only one response must be sent per
6163
* request, this builder must be accessed through takeBuilder(), which zeroes this field and
6264
* returns the builder.
6365
*/
6466
private WorkResponse.Builder responseBuilder = WorkResponse.newBuilder();
6567

68+
RequestInfo(Thread thread) {
69+
this.thread = thread;
70+
}
71+
72+
/** Marks this request as cancelled. */
73+
void setCancelled() {
74+
cancelled = true;
75+
}
76+
77+
/** Returns true if this request has been cancelled. */
78+
boolean isCancelled() {
79+
return cancelled;
80+
}
81+
6682
/**
6783
* Returns the response builder. If called more than once on the same instance, subsequent calls
6884
* will return an empty {@code Optional}.
@@ -72,13 +88,22 @@ synchronized Optional<WorkResponse.Builder> takeBuilder() {
7288
responseBuilder = null;
7389
return Optional.ofNullable(b);
7490
}
91+
92+
/**
93+
* Adds {@code s} to the output of the response when it eventually gets built. Does nothing if the
94+
* response has already been taken. There is no guarantee that the response hasn't already been
95+
* taken, making this call a no-op. This may be called multiple times. No delimiters are added
96+
* between strings from multiple calls.
97+
*/
98+
synchronized void addOutput(String s) {
99+
if (responseBuilder != null) {
100+
responseBuilder.setOutput(responseBuilder.getOutput() + s);
101+
}
102+
}
75103
}
76104

77105
/** Requests that are currently being processed. Visible for testing. */
78-
final Map<Integer, RequestInfo> activeRequests = new ConcurrentHashMap<>();
79-
80-
/** WorkRequests that have been received but could not be processed yet. */
81-
private final Queue<WorkRequest> availableRequests = new ArrayDeque<>();
106+
final ConcurrentMap<Integer, RequestInfo> activeRequests = new ConcurrentHashMap<>();
82107

83108
/** The function to be called after each {@link WorkRequest} is read. */
84109
private final BiFunction<List<String>, PrintWriter, Integer> callback;
@@ -88,6 +113,7 @@ synchronized Optional<WorkResponse.Builder> takeBuilder() {
88113

89114
final WorkerMessageProcessor messageProcessor;
90115

116+
private final BiConsumer<Integer, Thread> cancelCallback;
91117

92118
private final CpuTimeBasedGcScheduler gcScheduler;
93119

@@ -107,7 +133,7 @@ public WorkRequestHandler(
107133
BiFunction<List<String>, PrintWriter, Integer> callback,
108134
PrintStream stderr,
109135
WorkerMessageProcessor messageProcessor) {
110-
this(callback, stderr, messageProcessor, Duration.ZERO);
136+
this(callback, stderr, messageProcessor, Duration.ZERO, null);
111137
}
112138

113139
/**
@@ -131,10 +157,24 @@ public WorkRequestHandler(
131157
PrintStream stderr,
132158
WorkerMessageProcessor messageProcessor,
133159
Duration cpuUsageBeforeGc) {
160+
this(callback, stderr, messageProcessor, cpuUsageBeforeGc, null);
161+
}
162+
163+
/**
164+
* Creates a {@code WorkRequestHandler} that will call {@code callback} for each WorkRequest
165+
* received. Only used for the Builder.
166+
*/
167+
private WorkRequestHandler(
168+
BiFunction<List<String>, PrintWriter, Integer> callback,
169+
PrintStream stderr,
170+
WorkerMessageProcessor messageProcessor,
171+
Duration cpuUsageBeforeGc,
172+
BiConsumer<Integer, Thread> cancelCallback) {
134173
this.callback = callback;
135174
this.stderr = stderr;
136175
this.messageProcessor = messageProcessor;
137176
this.gcScheduler = new CpuTimeBasedGcScheduler(cpuUsageBeforeGc);
177+
this.cancelCallback = cancelCallback;
138178
}
139179

140180
/** Builder class for WorkRequestHandler. Required parameters are passed to the constructor. */
@@ -143,6 +183,7 @@ public static class WorkRequestHandlerBuilder {
143183
private final PrintStream stderr;
144184
private final WorkerMessageProcessor messageProcessor;
145185
private Duration cpuUsageBeforeGc = Duration.ZERO;
186+
private BiConsumer<Integer, Thread> cancelCallback;
146187

147188
/**
148189
* Creates a {@code WorkRequestHandlerBuilder}.
@@ -173,9 +214,19 @@ public WorkRequestHandlerBuilder setCpuUsageBeforeGc(Duration cpuUsageBeforeGc)
173214
return this;
174215
}
175216

217+
/**
218+
* Sets a callback that will be called when a cancellation message has been received. The callback
219+
* will be called with the request ID and the thread executing the request.
220+
*/
221+
public WorkRequestHandlerBuilder setCancelCallback(BiConsumer<Integer, Thread> cancelCallback) {
222+
this.cancelCallback = cancelCallback;
223+
return this;
224+
}
225+
176226
/** Returns a WorkRequestHandler instance with the values in this Builder. */
177227
public WorkRequestHandler build() {
178-
return new WorkRequestHandler(callback, stderr, messageProcessor, cpuUsageBeforeGc);
228+
return new WorkRequestHandler(
229+
callback, stderr, messageProcessor, cpuUsageBeforeGc, cancelCallback);
179230
}
180231
}
181232

@@ -191,56 +242,42 @@ public void processRequests() throws IOException {
191242
if (request == null) {
192243
break;
193244
}
194-
availableRequests.add(request);
195-
startRequestThreads();
196-
}
197-
}
198-
199-
/**
200-
* Starts threads for as many outstanding requests as possible. This is the only method that adds
201-
* to {@code activeRequests}.
202-
*/
203-
private synchronized void startRequestThreads() {
204-
while (!availableRequests.isEmpty()) {
205-
// If there's a singleplex request in process, don't start more processes.
206-
if (activeRequests.containsKey(0)) {
207-
return;
245+
if (request.getCancel()) {
246+
respondToCancelRequest(request);
247+
} else {
248+
startResponseThread(request);
208249
}
209-
WorkRequest request = availableRequests.peek();
210-
// Don't start new singleplex requests if there are other requests running.
211-
if (request.getRequestId() == 0 && !activeRequests.isEmpty()) {
212-
return;
213-
}
214-
availableRequests.remove();
215-
Thread t = createResponseThread(request);
216-
activeRequests.put(request.getRequestId(), new RequestInfo());
217-
t.start();
218250
}
219251
}
220252

221-
/** Creates a new {@link Thread} to process a multiplex request. */
222-
Thread createResponseThread(WorkRequest request) {
253+
/** Starts a thread for the given request. */
254+
void startResponseThread(WorkRequest request) {
223255
Thread currentThread = Thread.currentThread();
224256
String threadName =
225257
request.getRequestId() > 0
226258
? "multiplex-request-" + request.getRequestId()
227259
: "singleplex-request";
228-
return new Thread(
229-
() -> {
230-
RequestInfo requestInfo = activeRequests.get(request.getRequestId());
231-
try {
232-
respondToRequest(request, requestInfo);
233-
} catch (IOException e) {
234-
e.printStackTrace(stderr);
235-
// In case of error, shut down the entire worker.
236-
currentThread.interrupt();
237-
} finally {
238-
activeRequests.remove(request.getRequestId());
239-
// A good time to start more requests, especially if we finished a singleplex request
240-
startRequestThreads();
241-
}
242-
},
243-
threadName);
260+
Thread t =
261+
new Thread(
262+
() -> {
263+
RequestInfo requestInfo = activeRequests.get(request.getRequestId());
264+
if (requestInfo == null) {
265+
// Already cancelled
266+
return;
267+
}
268+
try {
269+
respondToRequest(request, requestInfo);
270+
} catch (IOException e) {
271+
e.printStackTrace(stderr);
272+
// In case of error, shut down the entire worker.
273+
currentThread.interrupt();
274+
} finally {
275+
activeRequests.remove(request.getRequestId());
276+
}
277+
},
278+
threadName);
279+
activeRequests.put(request.getRequestId(), new RequestInfo(t));
280+
t.start();
244281
}
245282

246283
/** Handles and responds to the given {@link WorkRequest}. */
@@ -260,7 +297,11 @@ void respondToRequest(WorkRequest request, RequestInfo requestInfo) throws IOExc
260297
if (optBuilder.isPresent()) {
261298
WorkResponse.Builder builder = optBuilder.get();
262299
builder.setRequestId(request.getRequestId());
263-
builder.setOutput(builder.getOutput() + sw.toString()).setExitCode(exitCode);
300+
if (requestInfo.isCancelled()) {
301+
builder.setWasCancelled(true);
302+
} else {
303+
builder.setOutput(builder.getOutput() + sw).setExitCode(exitCode);
304+
}
264305
WorkResponse response = builder.build();
265306
synchronized (this) {
266307
messageProcessor.writeWorkResponse(response);
@@ -270,6 +311,45 @@ void respondToRequest(WorkRequest request, RequestInfo requestInfo) throws IOExc
270311
}
271312
}
272313

314+
/**
315+
* Handles cancelling an existing request, including sending a response if that is not done by the
316+
* time {@code cancelCallback.accept} returns.
317+
*/
318+
void respondToCancelRequest(WorkRequest request) throws IOException {
319+
// Theoretically, we could have gotten two singleplex requests, and we can't tell those apart.
320+
// However, that's a violation of the protocol, so we don't try to handle it (not least because
321+
// handling it would be quite error-prone).
322+
RequestInfo ri = activeRequests.remove(request.getRequestId());
323+
324+
if (ri == null) {
325+
return;
326+
}
327+
if (cancelCallback == null) {
328+
ri.setCancelled();
329+
// This is either an error on the server side or a version mismatch between the server setup
330+
// and the binary. It's better to wait for the regular work to finish instead of breaking the
331+
// build, but we should inform the user about the bad setup.
332+
ri.addOutput(
333+
String.format(
334+
"Cancellation request received for worker request %d, but this worker does not"
335+
+ " support cancellation.\n",
336+
request.getRequestId()));
337+
} else {
338+
if (ri.thread.isAlive() && !ri.isCancelled()) {
339+
ri.setCancelled();
340+
cancelCallback.accept(request.getRequestId(), ri.thread);
341+
Optional<WorkResponse.Builder> builder = ri.takeBuilder();
342+
if (builder.isPresent()) {
343+
WorkResponse response =
344+
builder.get().setWasCancelled(true).setRequestId(request.getRequestId()).build();
345+
synchronized (this) {
346+
messageProcessor.writeWorkResponse(response);
347+
}
348+
}
349+
}
350+
}
351+
}
352+
273353
@Override
274354
public void close() throws IOException {
275355
messageProcessor.close();
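
The cancel callback added in this file receives the request ID and the thread executing the request, as a `BiConsumer<Integer, Thread>`. As a sketch of what a simple callback could look like, the example below interrupts the request thread. It assumes the work itself is interruptible, which the commit does not guarantee for any particular worker. `InterruptingCancelCallback` and `runAndCancel` are hypothetical names used only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BiConsumer;

/** Hypothetical example of a cancel callback; not part of Bazel. */
final class InterruptingCancelCallback {
  private InterruptingCancelCallback() {}

  /** Receives the request ID and the thread running it, and interrupts that thread. */
  static final BiConsumer<Integer, Thread> INTERRUPT =
      (requestId, thread) -> thread.interrupt();

  /** Runs a fake request, cancels it, and reports whether the request thread stopped early. */
  static boolean runAndCancel() {
    AtomicBoolean sawInterrupt = new AtomicBoolean(false);
    CountDownLatch started = new CountDownLatch(1);
    Thread requestThread =
        new Thread(
            () -> {
              started.countDown();
              try {
                Thread.sleep(60_000); // stand-in for long-running, interruptible work
              } catch (InterruptedException e) {
                sawInterrupt.set(true); // abandon the work instead of finishing it
              }
            },
            "multiplex-request-1");
    requestThread.start();
    try {
      started.await();
      INTERRUPT.accept(1, requestThread); // what the handler would do on a cancel request
      requestThread.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return sawInterrupt.get() && !requestThread.isAlive();
  }
}
```

With the builder shown in this diff, a callback of this shape would be registered through `setCancelCallback(...)` before calling `build()`. Work that never checks for interruption would keep running, in which case the handler still sends the `was_cancelled` response once the callback returns.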

src/main/protobuf/worker_protocol.proto

Lines changed: 6 additions & 5 deletions
@@ -41,11 +41,12 @@ message WorkRequest {
4141

4242
// Each WorkRequest must have either a unique
4343
// request_id or request_id = 0. If request_id is 0, this WorkRequest must be
44-
// processed alone, otherwise the worker may process multiple WorkRequests in
45-
// parallel (multiplexing). As an exception to the above, if the cancel field
46-
// is true, the request_id must be the same as a previously sent WorkRequest.
47-
// The request_id must be attached unchanged to the corresponding
48-
// WorkResponse.
44+
// processed alone (singleplex), otherwise the worker may process multiple
45+
// WorkRequests in parallel (multiplexing). As an exception to the above, if
46+
// the cancel field is true, the request_id must be the same as a previously
47+
// sent WorkRequest. The request_id must be attached unchanged to the
48+
// corresponding WorkResponse. Only one singleplex request may be sent to a
49+
// worker at a time.
4950
int32 request_id = 3;
5051

5152
// EXPERIMENTAL: When true, this is a cancel request, indicating that a

src/test/java/com/google/devtools/build/lib/worker/ExampleWorker.java

Lines changed: 3 additions & 4 deletions
@@ -94,11 +94,10 @@ public void processRequests() throws IOException {
9494
if (poisoned && workerOptions.hardPoison) {
9595
throw new IllegalStateException("I'm a very poisoned worker and will just crash.");
9696
}
97-
if (request.getRequestId() != 0) {
98-
Thread t = createResponseThread(request);
99-
t.start();
97+
if (request.getCancel()) {
98+
respondToCancelRequest(request);
10099
} else {
101-
respondToRequest(request, new RequestInfo());
100+
startResponseThread(request);
102101
}
103102
if (workerOptions.exitAfter > 0 && workUnitCounter > workerOptions.exitAfter) {
104103
System.exit(0);
