Description
To support cancellation, the curl client polls the underlying socket for read/writability with a short timeout (1 sec). However, at least on POSIX systems, the call to poll
may be interrupted by signals. Currently, such an interruption returns an error and results in a spurious failure of the HTTP request. Instead, the poll
call should be retried.
Request retries mask this issue in many cases, but this issue surfaced because in an application I work on the process is sometimes signaled with a regular frequency, which resulted in operations failing at a very high rate.
This is hard to unit test, but to reproduce it should be sufficient to signal the process with a high frequency while making HTTP requests. This is on a POSIX system (Linux, in my case). I'm not familiar enough with Windows to know if there's a similar issue around WSAPoll
.
Below is a patch I'm using to resolve the issue. I'd send a pull request but this is for POSIX platforms only.
From 02a4b073a45ed5d54a3c04ac78aabf72f0a1a416 Mon Sep 17 00:00:00 2001
From: John Heffner <[email protected]>
Date: Wed, 1 Jun 2022 16:55:48 -0400
Subject: [PATCH] Retry poll calls on EINTR
When signals are delievered to the process, calls here to poll may be
interrupted and return with a spurious failure. The call instead should be
restarted.
---
sdk/core/azure-core/src/http/curl/curl.cpp | 26 ++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/sdk/core/azure-core/src/http/curl/curl.cpp b/sdk/core/azure-core/src/http/curl/curl.cpp
index ea94d08c..fbbb1764 100644
--- a/sdk/core/azure-core/src/http/curl/curl.cpp
+++ b/sdk/core/azure-core/src/http/curl/curl.cpp
@@ -16,6 +16,7 @@
#if defined(AZ_PLATFORM_POSIX)
#include <poll.h> // for poll()
#include <sys/socket.h> // for socket shutdown
+#include <chrono>
#elif defined(AZ_PLATFORM_WINDOWS)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
@@ -103,17 +104,38 @@ int pollSocketUntilEventOrTimeout(
interval = timeout;
}
int result = 0;
- for (long counter = 0; counter < timeout && result == 0; counter = counter + interval)
+ bool poll_interrupted = false; // deadline and now are only valid if poll_interrupted
+ long deadline;
+ long now;
+ for (long counter = 0; (poll_interrupted ? now < deadline : counter < timeout) && result == 0;
+ counter = counter + interval)
{
// check cancelation
context.ThrowIfCancelled();
#if defined(AZ_PLATFORM_POSIX)
result = poll(&poller, 1, interval);
+ if (result < 0 && EINTR == errno)
+ {
+ // Only make extra syscalls to get the time in the (rare) event that poll
+ // is interrupted. We jump through these hoops because the poll call does
+ // not tell us how long it waited before being interrupted. For the first
+ // interruption only, assume no time has passed during the interrupted
+ // call.
+ now = std::chrono::duration_cast<std::chrono::milliseconds>(
+ std::chrono::steady_clock::now().time_since_epoch())
+ .count();
+ if (!poll_interrupted)
+ {
+ poll_interrupted = true;
+ deadline = now + timeout - counter;
+ result = 0;
+ }
+ }
#elif defined(AZ_PLATFORM_WINDOWS)
result = WSAPoll(&poller, 1, interval);
#endif
}
- // result can be either 0 (timeout) or > 1 (socket ready)
+ // result can be 0 (timeout), > 0 (socket ready), or < 0 (error)
return result;
}
--
2.25.1