hyper-util: Expose attempted SocketAddr in failed request #3888

Open
rcoh opened this issue May 14, 2025 · 12 comments
Labels
C-feature Category: feature. This is adding a new feature.

Comments

@rcoh

rcoh commented May 14, 2025

Is your feature request related to a problem? Please describe.
For the purposes of client-side load balancing, it's important to be able to track which IP addresses are failing. Currently, it's possible to capture the HttpInfo with capture_request, but only when the HTTP connection actually succeeds. If it fails, the conn_info extras are not written into the extensions:

https://github.com/hyperium/hyper-util/blob/7e742487c0680697857fd8a3eb403e0432c0d894/src/client/legacy/client.rs#L343-L346

Describe the solution you'd like
Unconditionally write HttpInfo into the request extensions. This probably requires stashing it in the ConnectorError, at least privately.

Additional context
cc @commonsensesoftware

@rcoh rcoh added the C-feature Category: feature. This is adding a new feature. label May 14, 2025
@commonsensesoftware

I'm working in a scenario where physical load balancers do not exist and are not an option. As a result, I need to implement a client-side load balancing strategy. The necessary hooks and extensions exist to control how DNS is queried and candidate endpoints are provided for selection, but I also need to know what the candidate endpoint was when a connection fails. I require this information so that I can mark down and weigh away from that endpoint in the candidate endpoint selection logic. I don't think all of the candidate endpoints need to be included in the error. The final candidate that propagates the error should be sufficient. In my case, I ensure the candidate list is always a sequence of exactly one endpoint (which makes correlation simple).

@seanmonstar
Member

Interesting. I'll just include here that currently, the HttpConnector tries all the addresses returned by the DNS lookup, and only if they all fail does it return an error. So, which address failed loses a bit of its meaning... I suppose a custom connector could do differently, and return that info in its error.

Unless you're specifically interested in the period between a TcpStream successfully connecting and an HTTP-ish error occurring before the first request is able to go out.

> In my case, I ensure the candidate list is always a sequence of exactly one endpoint (which makes correlation simple).

How do you do this? Through DNS? Some custom connector? Would that not mean you can already correlate the failed address in that case?

@commonsensesoftware

Unfortunately, I don't have something OSS that I can share, but let me add some more context that will hopefully paint the picture.

Consider the following snippets and pseudocode:

use std::net::SocketAddr;
use std::sync::{Arc, Mutex};
use std::time::Instant;

struct Endpoint {
    address: SocketAddr,
    weight: usize,
    last_computed: Instant,
}

impl Endpoint {
    fn fail(&mut self) {
        self.weight = 0;
        self.last_computed = Instant::now();
    }

    fn weigh(&mut self) { /* re-weigh endpoint toward eventual recovery */ }
}

#[derive(Clone)]
struct State(Arc<Mutex<Vec<Endpoint>>>);

The custom DNS resolver component is responsible for periodically querying DNS and generating endpoints. There's nothing that special happening. It's good ol' DNS resolution, but mapped with a weight. Each time this happens, the weights of existing endpoints are copied over so that the mere act of refreshing doesn't clear any weights. Old endpoints are removed. New endpoints come in at full health. The result is updated in the shared State.
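The refresh-and-merge behavior described above might be sketched like this; `FULL_WEIGHT` and the simplified `Endpoint` (with `last_computed` omitted) are illustrative assumptions, not hyper-util types:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Assumed "full health" starting weight for newly discovered endpoints.
const FULL_WEIGHT: usize = 100;

#[derive(Clone, Debug)]
struct Endpoint {
    address: SocketAddr,
    weight: usize,
}

// Merge a fresh DNS result with the previous endpoint list: existing
// weights carry over so a refresh doesn't clear mark-downs, stale
// endpoints drop out, and new endpoints come in at full health.
fn refresh(old: &[Endpoint], resolved: &[SocketAddr]) -> Vec<Endpoint> {
    let prior: HashMap<SocketAddr, usize> =
        old.iter().map(|e| (e.address, e.weight)).collect();
    resolved
        .iter()
        .map(|&address| Endpoint {
            address,
            weight: *prior.get(&address).unwrap_or(&FULL_WEIGHT),
        })
        .collect()
}
```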

To ensure that only a single candidate is handed out at a time, the resolved result is always vec![endpoint]. If this one candidate fails, an error is propagated, but can be retried. If a retry happens, a different candidate will be selected the next go around when the DNS resolver service is invoked. This process repeats until a good endpoint is found or all the candidates are bad. When that happens, retries are exhausted and any random endpoint is selected because there are no good choices.
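The single-candidate hand-out could be sketched as below, again with a simplified illustrative `Endpoint` (weight 0 meaning "marked down"):

```rust
use std::net::SocketAddr;

struct Endpoint {
    address: SocketAddr,
    weight: usize,
}

// Hand out exactly one candidate per resolution, as described above:
// prefer the healthiest endpoint. When every weight is zero there is no
// good choice, and whichever endpoint max_by_key lands on plays the
// "any random endpoint" fallback role.
fn select_one(endpoints: &[Endpoint]) -> Vec<SocketAddr> {
    endpoints
        .iter()
        .max_by_key(|e| e.weight)
        .map(|e| vec![e.address])
        .unwrap_or_default()
}
```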

The other side, the side that's missing, is middleware that can inspect what's happening in a response. Endpoint::fail should be invoked by matching up known endpoints based on their SocketAddr found in the State passed to the middleware. This will happen for:

  1. 4xx HTTP responses
  2. 5xx HTTP responses
  3. Transport level failures

A single client could be making multiple requests concurrently. In order to mark an endpoint as failed, the endpoint in the response needs to be correlated. This is provided in #1 and #2, but it is not provided in #3. As a result, there isn't a way to ensure that endpoint isn't selected again. There isn't a way to provide correlation from DNS because there's no opportunity to add metadata to the outbound property bag. Presumably, this would be provided via conn_info extras regardless of whether the request succeeds or fails. In the failure case, this would likely be the last candidate attempted. This could still be None if the DNS resolver doesn't provide any candidates, which is a different error and problem.
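The mark-down step the middleware would perform, once it can correlate an address, might look like the following sketch (types simplified for illustration):

```rust
use std::net::SocketAddr;
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug)]
struct Endpoint {
    address: SocketAddr,
    weight: usize,
}

#[derive(Clone)]
struct State(Arc<Mutex<Vec<Endpoint>>>);

impl State {
    // What the middleware would do on a 4xx/5xx response (and, if the
    // transport error carried the address, on a connect failure too):
    // correlate by SocketAddr and zero out that endpoint's weight.
    fn mark_down(&self, failed: SocketAddr) -> bool {
        let mut endpoints = self.0.lock().unwrap();
        match endpoints.iter_mut().find(|e| e.address == failed) {
            Some(e) => {
                e.weight = 0;
                true
            }
            None => false, // unknown address: nothing to correlate
        }
    }
}
```

The missing piece in case #3 is precisely the `failed` argument: the transport error does not currently expose which SocketAddr was attempted.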

I can elaborate more if that isn't clear enough.

@sfackler
Contributor

You might want to consider performing the DNS resolution above the hyper layer, so you can map IPs to connections explicitly.

@seanmonstar
Member

Yea, all that sounds fairly standard load balancing. We did similar in the linkerd2-proxy. We combined tower::balance and tower::discover to resolve addresses in a weighted way, and then provided hyper the socket addr directly, kind of like what sfackler suggests.

That's not to say we can't figure out improvements in hyper-util. We certainly can do that too.

@commonsensesoftware

The DNS resolution is happening at a much higher level and not directly in the path of hyper, but it made the most sense to put it in that context to understand the issue. There's no preemptive effort to verify a resolved endpoint is reachable. ICMP isn't enabled. Ultimately, I just need to know which endpoint was attempted when the request failed so I can mark down that address. If there is some other piece of metadata that I can ensure will propagate from the request to the response, I can correlate other ways. I just need to know, when a response fails for any reason, which endpoint was used for the original request. When a client makes parallel requests, I can't guarantee the response correlates to the last selected endpoint. I'm open to other suggestions or workarounds to enable that.

@seanmonstar
Member

OK, let me try discussing this from a different angle. You said your code is proprietary, no worries. Can you sketch out some code showing how, after you've done your load balancing and such, you would hope to call hyper? Do you supply a custom resolver? Is it just the default, and you've edited the DNS results somewhere else?

What does a solution look like here, as a user call?

@commonsensesoftware

That feels like a reasonable ask. Let me put something together that I can share. Yes, I am using a custom resolver. The resolved addresses are mapped into an Endpoint (see above), which is put into a shared State (also see above). The custom resolver and middleware both have a copy of this state, but do not otherwise know about each other. When the middleware detects a 4xx, 5xx, or the entire connection fails, it marks down the target endpoint a la Endpoint::fail. The issue for the middleware is that it doesn't know which SocketAddr (or any form of address) it should mark down, because there is no response and the connection error doesn't indicate which candidate it gave up on. If the error provided this information, it could match it up pretty easily. This might be possible if it were possible to attach this information via a property bag, but I'm not 100% sure that would work either, because what is really needed is the disposition of the final address selection.

@commonsensesoftware

There's more than one possible solution here. Rather than try to return the failed endpoint, which could arguably be the first or the last candidate in the list, the resolver could add support to handle rejections. This would be backward compatible with a default implementation. The mark down behavior would be split between the resolver and the middleware. The middleware would only handle 4xx and 5xx, but it can get the endpoint information in those paths. The resolver would handle failed requests in an unambiguous way.

I imagine it would look something like this:

use std::future::Future;
use std::net::SocketAddr;
use std::task::{self, Poll};

use hyper_util::client::legacy::connect::dns::Name;

pub trait Resolve {
    type Addrs: Iterator<Item = SocketAddr>;
    type Error: Into<Box<dyn std::error::Error + Send + Sync>>;
    type Future: Future<Output = Result<Self::Addrs, Self::Error>>;

    fn poll_ready(&mut self, cx: &mut task::Context<'_>) -> Poll<Result<(), Self::Error>>;
    fn resolve(&mut self, name: Name) -> Self::Future;

    /// Notifies the resolver that the specified candidate was rejected.
    fn reject(&mut self, _candidate: &SocketAddr) {
        // does nothing by default
    }
}

Since Resolve is the thing that provides the Iterator<Item = SocketAddr>, it only feels appropriate that it would also be the thing that is notified if one of its candidates is rejected. I'm open to other ideas, but something like this might be more appropriate and straightforward as opposed to attaching a SocketAddr to an Error.

@seanmonstar
Member

Hm. So, I could update it so the HttpConnector's error that is returned, ConnectError, is public, and add a method to check for a possible Option<SocketAddr> on the connect error. It would be based on the first connect error out of any list of addresses that were resolved (so if you resolved to many, it's assumed the first in the list is the most relevant). Then you can look for the ConnectError in the Error::source() chain.
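Walking the source() chain for a typed connector error might look like the sketch below. The `ConnectError` here is a hypothetical stand-in; the real hyper-util type, its field, and any accessor are assumptions, not the actual API:

```rust
use std::error::Error;
use std::fmt;
use std::net::SocketAddr;

// Hypothetical stand-in for a public hyper-util ConnectError.
#[derive(Debug)]
struct ConnectError {
    addr: Option<SocketAddr>,
}

impl fmt::Display for ConnectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connect error")
    }
}

impl Error for ConnectError {}

// Walk an error's source() chain looking for the connector's error,
// as a caller would have to do under this proposal.
fn failed_addr(err: &(dyn Error + 'static)) -> Option<SocketAddr> {
    let mut cur: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = cur {
        if let Some(ce) = e.downcast_ref::<ConnectError>() {
            return ce.addr;
        }
        cur = e.source();
    }
    None
}
```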

However, this is probably pretty fragile. And it doesn't account for any other middleware you may have, or any errors that might occur between the connection being established and the first request being sent.


With all that I've read so far, I'd recommend towards making your request scope aware of the endpoint that will be resolved. Either in a task-local, in the request extensions, or just as part of the middleware.

let target_addr: SocketAddr = ([127, 0, 0, 1], 8080).into();
endpoints.resolve_to(target_addr);

match client.request(some_req).await {
    Ok(_) => (),
    Err(_err) => {
        endpoints.mark(target_addr);
        // etc
    }
}

@commonsensesoftware

I can get behind not having it on the error. I certainly agree there's not a clear expectation for the caller whether the first or last candidate is returned.

You didn't comment on it, but did you not like the idea of signaling to the resolver that its candidate was rejected (or some other similar word)? IMHO this has a much clearer chain of responsibility that wouldn't affect any middleware. This behavior also doesn't necessarily have to be on the resolver. This could be some other type of service that is notified when an attempted address fails. The default implementation simply does nothing.

@seanmonstar
Copy link
Member

No, I'd rather not add such a method to Resolve. Largely because it's not a design I see in pretty much any other HTTP client library in other languages. Plus, Resolve isn't implemented by users, it's auto-implemented for impl Service<Uri>.
