Skip to content

Gossip: Add dynamic stake weighting based on fraction of unstaked nodes - Fixed Point Math #7098

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

gregcusack
Copy link

@gregcusack gregcusack commented Jul 23, 2025

Problem

This PR addresses this Issue: #6903
This achieves the same goal as PR: #7070 but does it with fixed point instead of floating point math.

Unstaked and low staked nodes in gossip do not receive many messages on testnet due to the testnet stake distribution and the aggressive quadratic weighting of peers by stake. As a result, unstaked nodes have a poor and inconsistent view of the peers in the network.

Linear weighting is not aggressive enough and causes a disproportionate number of messages to be sent to unstaked nodes on mainnet.

We need weighting that falls in between linear and quadratic.

Summary of Changes

Add a dynamic weighting element alpha. alpha fluctuates between 1 and 2 at runtime and is linearly proportional to the number of unstaked nodes in the cluster.
For example, on mainnet, there are 82% unstaked nodes, so alpha will converge on 1.82. On testnet, there are 4% unstaked nodes, so alpha will converge on 1.04.

We use a first-order, low-pass filter to update alpha to avoid constant alpha thrashing and overshooting the new target. Unit tests show that alpha converges on the target alpha within approximately 4-5 push active set rotations (30-38s).

Note: floating point operations are expensive but rotate() is called once every 7.5s, so should not add any significant overhead.

Testing has shown that flattening out the weight curve when their aren't many unstaked nodes (as on testnet), increases an unstaked nodes ingress gossip traffic as desired. On the other hand, high staked nodes don't see a significant reduction in received traffic.

All calculations and data that backup and support this PR here:

  1. Statistical Gossip Propagation Analysis
  2. Low Pass Filter Discussion on Discord
  3. Testing data on mininet and using statistical analysis on Discord

@codecov-commenter
Copy link

codecov-commenter commented Jul 23, 2025

Codecov Report

❌ Patch coverage is 95.41885% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.3%. Comparing base (f6e6ff2) to head (97ecb64).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #7098    +/-   ##
========================================
  Coverage    83.2%    83.3%            
========================================
  Files         800      802     +2     
  Lines      362783   363517   +734     
========================================
+ Hits       302195   302884   +689     
- Misses      60588    60633    +45     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gregcusack gregcusack marked this pull request as ready for review July 23, 2025 15:05
@gregcusack
Copy link
Author

per the discussion here: https://discord.com/channels/428295358100013066/478692221441409024/1397296872984678506, going to implement account control over TC, FS, Static/Dynamic weighting

@gregcusack gregcusack marked this pull request as draft July 23, 2025 15:41
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch 7 times, most recently from 57ef75a to 1c58bc0 Compare July 24, 2025 19:20
@gregcusack gregcusack marked this pull request as ready for review July 24, 2025 19:21
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch from 1c58bc0 to 23eaef6 Compare July 24, 2025 19:32

#[derive(Serialize, Deserialize, Debug, Default, Clone)]
#[repr(C)]
pub struct WeightingConfig {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure but would you not get some additional header written here by record program? Or are you using a different program to write onchain state?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may also be nice to comment why we are deriving Serialize here as we are only deserializing this one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right, we do not need deserialize here. Also, ya i wrote my own program to update onchain state since I figure it would be good to learn how to do it. Not sure either tbh, don't have any issues deserializing or updating push active set values.

// alpha = 2.0 -> Quadratic
Static,
// alpha in [1.0, 2.0], smoothed over time, scaled up by 1000 to avoid floating-point math
Dynamic,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not be better to wrap alpha into the Dynamic enum variant to avoid carrying it around when mode is set to Static?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yesss much better idea.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be better placed in net-utils? We may want to reuse this for other protocols later, and it is clearly not gossip-specific.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya definitely not gossip specific. was trying to figure out where to put it. net-utils didn't feel like the best place either since it's not net specific. other idea is leave it in gossip for now and if someone wants to use it in the future, we can find a better place for it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but by then it will be part of public API =) net-utils is also small enough and pulls almost no other deps, unlike gossip which is rather huge already. It is also somewhat unlikely this kind of filter will be used outside of traffic shaping applications.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crates are free. give it its own. hide everything behind an agave-unstable-api cargo feature

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya we can do this. just makes the push active set setup a little harder to read. would it be best to have this crate behind a feature and then define some dummy low pass filter api that is compiled when the feature is not enabled? otherwise we end up having a ton of gated definitions throughout push_active_set.rs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you only need to activate the feature in Cargo.toml, the code in push_active_set.rs will be effectively the same, just pulling lpf from another crate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so when we activate the feature, we pull lpf from the actual crate. but then when it's not activated, we just use some dummy lpf which never gets used since we'd be running WeightingMode::Static? or am i misunderstanding?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always activate the feature in agave, kinda like DCOU. but external users of our repo do not get to use that code without activating the "i will not complain about broken public API" feature. Think of it as DCOU but for prod code.

Copy link

@t-nelson t-nelson Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but then when it's not activated

this doesn't exist. if the feature isn't specified, the crate exports no symbols. we're forcing consumers into a contract where they acknowledge consuming an unstable api

// Fixed point scale for K and `alpha` calculation
pub const SCALE: NonZeroU64 = NonZeroU64::new(1000).unwrap();
// 2 * pi * SCALE
const TWO_PI_SCALED: u64 = 6_283;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const TWO_PI_SCALED: u64 = 6_283;
const TWO_PI_SCALED: u64 = (std::f64::consts::PI * 2.0 * 1000.0) as u64;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this suggestion leaves rounding to the platform. prefer a literal

Copy link

@alexpyattaev alexpyattaev Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not round on the platform. The rounding is done at compile time and is 100% reproducible, else it would not build. https://godbolt.org/z/vonv18Kos both asm outputs return literal 3141592 .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is true if we ship a binary, right? but if we ship the source code and people build from source, the platform still has to build this and round at compile time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rounding is 100% deterministic.

Copy link

@t-nelson t-nelson Aug 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rounding is 100% deterministic.

thanks for completely ignoring "to the platform"

use std::num::NonZeroU64;

// Fixed point scale for K and `alpha` calculation
pub const SCALE: NonZeroU64 = NonZeroU64::new(1000).unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of having this as NonZero type if this is a constant anyway?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may further be preferable to have this as a power of two for performance reasons.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think cross the PoT claim when slowlana tells us to. in the meantime, base-10 is easier to reason about

/// If future code changes make `alpha_target` jump larger, we must retune
/// `TC`/`K` or use a higher‑order filter to avoid lag/overshoot.
/// Returns `alpha_new = K * target + (1 - K) * prev`, rounded and clamped.
pub fn filter_alpha(prev: u64, target: u64, k: u64, min: u64, max: u64) -> u64 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not be better to supply a range such as 1..2 rather than min and max parameters? A function taking 4 u64 parameters is asking to get misused.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya good call. can do something like:

pub fn filter_alpha(alpha_prev: u64, alpha_target: u64, config: FilterConfig) -> u64 {...}

where:

pub struct FilterConfig {
    output_range: Range<u64>,
    k: u64,
}

@@ -0,0 +1,62 @@
//! Fixed-point IIR filter for smoothing `alpha` updates.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to note that this is an order 1 Butterworth filter in case someone greps for it?

Copy link

@t-nelson t-nelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass looking good. didn't review any of the tests yet

use std::num::NonZeroU64;

// Fixed point scale for K and `alpha` calculation
pub const SCALE: NonZeroU64 = NonZeroU64::new(1000).unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure millies is sufficient precision? i usually try to take as many orders as the expected math will allow. we can always truncate in place if necessary

pub fn compute_k(fs_ms: u64, tc_ms: u64) -> u64 {
if tc_ms == 0 {
return 0;
} // disabled / passthrough

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure this consumer detail belongs down here

return 0;
} // disabled / passthrough
let scale = SCALE.get();
let wc_scaled = (TWO_PI_SCALED * fs_ms) / tc_ms;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fix all of the math in this module to saturating variants?

debug_assert!(t <= scale, "interpolation t={} > SCALE={}", t, scale);
let base_squared = base.saturating_mul(base);
((base * (scale - t) + base_squared * t) + scale / 2) / scale
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this module could use some tests. especially at the boundaries of mathematical operations

}

mod weighting_config_control_pubkey {
solana_pubkey::declare_id!("maCwV5aXW6srR4kppBWTeNdizYhyEwVcVh6mSWnFZxo");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider grinding a vanity key for this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we need multisig for this same as for all2all?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just the address to load the record account from, no? multisig is not applicable here

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the validator part of things does not care how the account is handled on chain.

tc_ms: u64,
}

impl Default for PushActiveSet {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default is usually bug. prefer literal instantiation. add a fn default_for_tests() under dcou if that's where the "convenience" of Default is desired

}

impl PushActiveSet {
#[cfg(test)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will only be visible within the crate

}

fn apply_cfg(&mut self, cfg: WeightingConfig) {
match cfg.weighting_mode {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deserializer can handle this for us, no?

}

// Only recompute K if tc_ms changed
let new_tc_ms = if cfg.tc_ms != 0 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're going to use a sentinel value, prefer an enum with self-documenting variant names

.collect();
entry.rotate(rng, size, num_bloom_filter_items, nodes, &weights);

match self.mode {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this block be refactored to avoid replicode? seems like the map closure is the only difference

@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch 2 times, most recently from 17baf80 to d93b7da Compare August 4, 2025 20:09
@alexpyattaev alexpyattaev self-requested a review August 4, 2025 20:31
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch from d93b7da to 320c570 Compare August 5, 2025 05:33
Copy link

@alexpyattaev alexpyattaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally, left a couple of nits, mostly in low-pass-filter crate.

// Fixed point scale for K and `alpha` calculation
pub const SCALE: NonZeroU64 = NonZeroU64::new(1_000_000).unwrap();
// 2 * pi * SCALE
const TWO_PI_SCALED: u64 = 6_283_185;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have const eval and it works!

Suggested change
const TWO_PI_SCALED: u64 = 6_283_185;
const TWO_PI_SCALED: u64 = (std::f64::consts::PI * SCALE as f64) as u64;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trent thought we should define the literal. see: #7098 (comment)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, missed it. Consts are evaluated at compile time and are 100% consistent across builds on any architecture. You can trivially test it on a mac vs devbox.

/// Note: This function is most accurate when `base` is small e.g. < ~25.
#[inline]
#[allow(clippy::arithmetic_side_effects)]
pub fn interpolate(base: u64, t: u64) -> u64 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in comment parameter is called alpha and in signature it is t. Also a bit unclear why this function is called interpolate and which function is it actually approximating

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. will update parameters to be uniform.

also, this probably should be in push_active_set. it's not related to the filter at all. we need interpolation here in order to avoid floating point math:

In Agave, we currently weight nodes by: (bucket + 1)^2. But the square is too severe and underweights unstaked nodes, especially when there aren't many unstaked in the network (aka testnet). So we are going to replace the square with alpha where alpha is in the range [1,2] and is based on the fraction of unstaked nodes in the network. so if there are 60% unstaked nodes, then alpha will be 1.6.

However, we can't just do (bucket + 1)^alpha since this would require floating point math. So, we scale alpha by 1_000_000. so now alpha is in [1000000, 2000000]. and we still can't do (bucket + 1)^alpha since we will overflow. We need to approximate (bucket + 1)^alpha by linearly interpolating between (bucket + 1)^1 and (bucket + 1)^2.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, maybe i can call it linearly_interpolate() or something.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok now I fully understand what it does, thank you for explaining!

You could directly sidestep the overflow concern by using u128 (yes it will be slower but this does not have to be ridiculously fast).

I think the core cause of confusion is that this function's signature does not match what one would expect from a "normal" interpolator (since it is gossip-specific). Normally one would expect it to be defined like

fn linear_interpolate (x1:u64 /* start */, x2:u64 /* stop */, t:u64 /* parameter */ ); // t < SCALE  (i.e. between 0 and 1)

So extracting the interpolator from the gossip-specific callsite where we call

 interpolate ((bucket + 1), (bucket+1)^2, (alpha *SCALE) as u64 - SCALE) 

would go a long way towards making it obvious. Then you could keep the interpolator in this crate (maybe call crate fixed-point-math or something generic like that) and call it from gossip with gossip-specific parameters as in example above.

Alternatively, this can be moved to gossip crate as in that context it would be more clear, just call it gossip_bucket_interpolate or something like it, so people grepping for a linear interpolator would not get confused & you need not make any changes to the implementation/tests. Probably better way to ship this in 3.0.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya think the last option is best for now. will rename to make more clear

@gregcusack gregcusack requested a review from a team as a code owner August 7, 2025 21:46
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch 3 times, most recently from 756e0fc to c56f65a Compare August 7, 2025 22:06
@gregcusack gregcusack removed the request for review from a team August 7, 2025 22:07
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch from c56f65a to acc661b Compare August 7, 2025 22:20
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch 3 times, most recently from 41a83ee to 9686c5d Compare August 8, 2025 21:35
Copy link

@alexpyattaev alexpyattaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left a few nits but nothing blocking at this point.

@@ -0,0 +1,76 @@
use {serde::Deserialize, solana_account::ReadableAccount, solana_runtime::bank::Bank};

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe hide this whole module behind agave-unstable-api? Then we can safely axe it once tests are done without breaking semver. Does not have to be this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the best way to do this? with the low-pass-filter crate i could gate it behind agave-unstable-api and then pull it into gossip crate in cargo.toml and activate the agave-unstable-api feature. but here, this file is already in the gossip crate...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well actually you could do better and just not mark any items here "pub". They are not used outside of gossip anyway so no need for them to be pub, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

mod weighting_config_control_pubkey {
solana_pubkey::declare_id!("goSwVUizoqNYKEaaiTjkgdN2RgLpvsTvFt1MEVGibY9");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please allocate this account on testnet before merging the changes and write all zeros in there so we can be sure this does not activate accidentally as we are setting up the multisigs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so i can deploy this account without setting up multisig first?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can set authority later.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why tf does this need a control account? these are not toys to be used at the slightest whim. where is the cluster gate? there is no way in hell we can deploy this logic to mb. please think before you merge

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. I forgot to add the cluster gate.

the thing we're changing with the control account is the filter parameter tc_ms and then whether we want the dynamic weighting or the fixed weighting. Math says dynamic weighting should work as expected + filter tests show filter converging pretty quickly (~4 iterations) with the current 30000 tc_ms. Downside if we don't have the control is that if something goes wrong testnet may go down (i guess that's the point of testnet) or we have to make a new PR with a different tc_ms (not expected).

Are you suggesting we just axe the control account and just roll out dynamic weighting and just see if it matches expectations on testnet?

@@ -212,21 +409,20 @@ mod tests {
}

#[test]
fn test_push_active_set() {
fn test_push_active_set_static_weighting() {
const CLUSTER_SIZE: usize = 117;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but should we not make this ~4000 for this test to be more representative?

Copy link
Author

@gregcusack gregcusack Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya maybe in a future PR. it doesn't really matter. we're just testing rotating nodes into our active set and pruning. cluster size doesn't have an impact here beyond like 20

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok then, nevermind=)

#[test]
fn test_alpha_converges_to_expected_target() {
const CLUSTER_SIZE: usize = 415;
const TOLERANCE_MILLI: u64 = 10_000; // ±1% of alpha

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, scale is hardcoded here. This will explode if we change the scale, right? Maybe worth updating this one? Doe not have to be this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

path = "src/lib.rs"

[features]
default = []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI by default all features are disabled https://doc.rust-lang.org/cargo/reference/features.html#the-default-feature. Does not matter here of course.

@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch 2 times, most recently from 8490937 to 3ca64a7 Compare August 11, 2025 19:04
@gregcusack gregcusack force-pushed the gossip-adaptive-stake-weighting-fixed-point branch from 3ca64a7 to 97ecb64 Compare August 11, 2025 20:12
Copy link

@alexpyattaev alexpyattaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@gregcusack gregcusack merged commit 299fd87 into anza-xyz:master Aug 12, 2025
51 checks passed
@gregcusack gregcusack deleted the gossip-adaptive-stake-weighting-fixed-point branch August 12, 2025 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants