Skip to content

QUIC: generate unique server_name for client connections #7260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 4, 2025

Conversation

alexpyattaev
Copy link

@alexpyattaev alexpyattaev commented Jul 31, 2025

Problem

  • QUIC clients reuse the same server_name for all connections which can cause QUIC tokens to be mixed up, as they are stored in a big hashmap deep inside Quinn, keyed by server_name
  • This prevents features like address validation and 0RTT from working correctly

Summary of Changes

  • Generate server_name based on the SocketAddr of the peer we are connecting to, thus ensuring it is appropriate for the server.

Fixes #7197

@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.8%. Comparing base (bca4a36) to head (3cfe1ac).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7260   +/-   ##
=======================================
  Coverage    82.8%    82.8%           
=======================================
  Files         803      803           
  Lines      364439   364447    +8     
=======================================
+ Hits       302068   302079   +11     
+ Misses      62371    62368    -3     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@alexpyattaev alexpyattaev marked this pull request as ready for review July 31, 2025 20:53
@alexpyattaev alexpyattaev requested a review from lijunwangs July 31, 2025 21:08
@lijunwangs
Copy link

Can you share details about token mixed up? What is the impact?

@ripatel-fd
Copy link

ripatel-fd commented Aug 1, 2025

Can you share details about token mixed up? What is the impact?

@lijunwangs Agave clients have been using the same hostname for all servers it connects to. It also has a "TokenStore" that caches address validation tokens received by servers, but it is keyed by hostname. In other words, all servers share the same "TokenStore" entry.

When dialing a new connection, the Agave client will send the last received address validation token in the Initial packet "Token" field. Note that this "Token" field can be interpreted ambiguously since the schema is defined by the server's implementation. Since all Agave servers use the same implementation, they all interpret this incoming garbage token as an address validation token, and just ignore it per spec. So the impact to Agave-Agave connections is zero.

Firedancer, however, interprets this garbage token as a "Retry token". The spec mandates that the Initial is dropped or rejected in this case. So, Agave-Firedancer connection attempts fail frequently, but this condition is recoverable, since the Agave client will omit the token on the second conn attempt.

Firedancer deployed a workaround (ignore these Agave tokens / parse them as address validation tokens).

/// since Solana does not rely on DNS names, but we need to provide a unique
/// one to ensure that we present correct QUIC tokens to the correct server.
pub fn socket_addr_to_quic_server_name(peer: SocketAddr) -> String {
format!("{}.{}.sol", peer.ip(), peer.port())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would have been cleaner to use the identity key here. But anything works, as long as it's unique

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identity key stays unchanged during identity transfer, so we would be again connecting to a different host while using incorrect keys. So, can not use just pubkey.

Copy link

@lijunwangs lijunwangs Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From server name,

/// Attempt to make a ServerName from a string by parsing as a DNS name or IP address.
impl<'a> TryFrom<&'a str> for ServerName<'a> {
    type Error = InvalidDnsNameError;
    fn try_from(s: &'a str) -> Result<Self, Self::Error> {
        match DnsName::try_from(s) {
            Ok(dns) => Ok(Self::DnsName(dns)),
            Err(InvalidDnsNameError) => match IpAddr::try_from(s) {
                Ok(ip) => Ok(Self::IpAddress(ip)),
                Err(_) => Err(InvalidDnsNameError),
            },
        }
    }
}

, it is expecting either a DNS name or IP, can we simply use the IP address? Why do we need a port number?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can not simply use the IP address because it is perfectly legitimate to run more than one node on the same IP address.

Copy link

@ripatel-fd ripatel-fd Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identity key stays unchanged during identity transfer, so we would be again connecting to a different host while using incorrect keys. So, can not use just pubkey.

@alexpyattaev I'm confused, sorry ... So in Agave, set-identity does not update keys for all its QUIC clients and servers? Isn't that an implementation bug? Since this is a network protocol change, shouldn't we fix the implementation than work around in the protocol?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set identity does update the keys, but on the host where identity is transplanted to. As a random client connecting via e.g. tpu-client-next, I'd have no idea that the identity has been transplanted, I'm just connecting to the server using SocketAddress. For the client it is also indistinguishable from a multihomed server changing its primary IP (in which case the old tokens would still work). But since we can not know for sure, we assume it is an identity transplant (i.e. old tokens are invalid).

Copy link

@KirillLykov KirillLykov Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ripatel-fd if you mean Admin RPC call to change the identity, it changes the all client identities.
I think Alex means that if a validator is running on host H_1, it may decide to switch to the other host H_2. He will use the same identity but on a different host simultaneously stopping the validator on the old host.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and now if I use identity as the key for tokens, this key will be the same while host is different => problems.

Copy link

@lijunwangs lijunwangs Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me some time to review how the server name is used in Quinn and rustls

  1. Resumption token for crypto handshaking -- our configuration essentially disabled it.
  2. SNI extensions in the initial packet -- server will need to provide a certificate with the name when server certificate validation is done -- we are not validating it now. Our server always have the SAN (subject alternative name) to localhost.
  3. The token in the initial packet for address validations (retry and future connection address validation) mentioned by @ripatel-fd.
    4 ECH (encrypted client hello) -- our implementation does not use it.

Copy link

@ripatel-fd ripatel-fd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing

Copy link

@KirillLykov KirillLykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to add some tests, but might be a bit elaborated to create one

@alexpyattaev
Copy link
Author

would be nice to add some tests, but might be a bit elaborated to create one

Do you mean to test that when we transition to another server the token is reused? Yeah it would be a decent test around streamer crate (since there we have better access to the internals of the server to check if it actually had to validate the address).

@alexpyattaev alexpyattaev merged commit eb42788 into anza-xyz:master Aug 4, 2025
41 checks passed
@alexpyattaev alexpyattaev deleted the tpu_client_server_id branch August 4, 2025 20:30
@alexpyattaev alexpyattaev added the v2.3 Backport to v2.3 branch label Aug 5, 2025
Copy link

mergify bot commented Aug 5, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request Aug 5, 2025
generate unique server_name for QUIC connections

(cherry picked from commit eb42788)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v2.3 Backport to v2.3 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QUIC] Client mistakenly recycles tokens across different servers
5 participants