Skip to content

Interrupted poll calls cause spurious HTTP request failures #3692

Closed
@johnwheffner

Description

@johnwheffner

To support cancellation, the curl client polls the underlying socket for read/writability with a short timeout (1 sec). However, at least on POSIX systems, the call to poll may be interrupted by signals. Currently, such an interruption returns an error and results in a spurious failure of the HTTP request. Instead, the poll call should be retried.

Request retries mask this issue in many cases, but this issue surfaced because in an application I work on the process is sometimes signaled with a regular frequency, which resulted in operations failing at a very high rate.

This is hard to unit test, but to reproduce it should be sufficient to signal the process with a high frequency while making HTTP requests. This is on a POSIX system (Linux, in my case). I'm not familiar enough with Windows to know if there's a similar issue around WSAPoll.

Below is a patch I'm using to resolve the issue. I'd send a pull request but this is for POSIX platforms only.

From 02a4b073a45ed5d54a3c04ac78aabf72f0a1a416 Mon Sep 17 00:00:00 2001
From: John Heffner <[email protected]>
Date: Wed, 1 Jun 2022 16:55:48 -0400
Subject: [PATCH] Retry poll calls on EINTR

When signals are delievered to the process, calls here to poll may be
interrupted and return with a spurious failure.  The call instead should be
restarted.
---
 sdk/core/azure-core/src/http/curl/curl.cpp | 26 ++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/sdk/core/azure-core/src/http/curl/curl.cpp b/sdk/core/azure-core/src/http/curl/curl.cpp
index ea94d08c..fbbb1764 100644
--- a/sdk/core/azure-core/src/http/curl/curl.cpp
+++ b/sdk/core/azure-core/src/http/curl/curl.cpp
@@ -16,6 +16,7 @@
 #if defined(AZ_PLATFORM_POSIX)
 #include <poll.h> // for poll()
 #include <sys/socket.h> // for socket shutdown
+#include <chrono>
 #elif defined(AZ_PLATFORM_WINDOWS)
 #if !defined(WIN32_LEAN_AND_MEAN)
 #define WIN32_LEAN_AND_MEAN
@@ -103,17 +104,38 @@ int pollSocketUntilEventOrTimeout(
     interval = timeout;
   }
   int result = 0;
-  for (long counter = 0; counter < timeout && result == 0; counter = counter + interval)
+  bool poll_interrupted = false;  // deadline and now are only valid if poll_interrupted
+  long deadline;
+  long now;
+  for (long counter = 0; (poll_interrupted ? now < deadline : counter < timeout) && result == 0;
+       counter = counter + interval)
   {
     // check cancelation
     context.ThrowIfCancelled();
 #if defined(AZ_PLATFORM_POSIX)
     result = poll(&poller, 1, interval);
+    if (result < 0 && EINTR == errno)
+    {
+      // Only make extra syscalls to get the time in the (rare) event that poll
+      // is interrupted. We jump through these hoops because the poll call does
+      // not tell us how long it waited before being interrupted.  For the first
+      // interruption only, assume no time has passed during the interrupted
+      // call.
+      now = std::chrono::duration_cast<std::chrono::milliseconds>(
+                std::chrono::steady_clock::now().time_since_epoch())
+                .count();
+      if (!poll_interrupted)
+      {
+        poll_interrupted = true;
+        deadline = now + timeout - counter;
+        result = 0;
+      }
+    }
 #elif defined(AZ_PLATFORM_WINDOWS)
     result = WSAPoll(&poller, 1, interval);
 #endif
   }
-  // result can be either 0 (timeout) or > 1 (socket ready)
+  // result can be 0 (timeout), > 0 (socket ready), or < 0 (error)
   return result;
 }
 
-- 
2.25.1

Metadata

Metadata

Labels

Azure.CoreClientThis issue points to a problem in the data-plane of the library.customer-reportedIssues that are reported by GitHub users external to the Azure organization.questionThe issue doesn't require a change to the product in order to be resolved. Most issues start as that

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions