-
Notifications
You must be signed in to change notification settings - Fork 659
Gossip: Add dynamic stake weighting based on fraction of unstaked nodes - Fixed Point Math #7098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gossip: Add dynamic stake weighting based on fraction of unstaked nodes - Fixed Point Math #7098
Conversation
c0d5a77
to
cf6ebe0
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #7098 +/- ##
========================================
Coverage 83.2% 83.3%
========================================
Files 800 802 +2
Lines 362783 363517 +734
========================================
+ Hits 302195 302884 +689
- Misses 60588 60633 +45 🚀 New features to boost your workflow:
|
per the discussion here: https://discord.com/channels/428295358100013066/478692221441409024/1397296872984678506, going to implement account control over TC, FS, Static/Dynamic weighting |
57ef75a
to
1c58bc0
Compare
1c58bc0
to
23eaef6
Compare
gossip/src/stake_weighting_config.rs
Outdated
|
||
#[derive(Serialize, Deserialize, Debug, Default, Clone)] | ||
#[repr(C)] | ||
pub struct WeightingConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure but would you not get some additional header written here by record program? Or are you using a different program to write onchain state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may also be nice to comment why we are deriving Serialize here as we are only deserializing this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right, we do not need deserialize here. Also, ya i wrote my own program to update onchain state since I figure it would be good to learn how to do it. Not sure either tbh, don't have any issues deserializing or updating push active set values.
gossip/src/push_active_set.rs
Outdated
// alpha = 2.0 -> Quadratic | ||
Static, | ||
// alpha in [1.0, 2.0], smoothed over time, scaled up by 1000 to avoid floating-point math | ||
Dynamic, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it not be better to wrap alpha into the Dynamic enum variant to avoid carrying it around when mode is set to Static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yesss much better idea.
gossip/src/low_pass_filter.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be better placed in net-utils? We may want to reuse this for other protocols later, and it is clearly not gossip-specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya definitely not gossip specific. was trying to figure out where to put it. net-utils didn't feel like the best place either since it's not net specific. other idea is leave it in gossip for now and if someone wants to use it in the future, we can find a better place for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but by then it will be part of public API =) net-utils is also small enough and pulls almost no other deps, unlike gossip which is rather huge already. It is also somewhat unlikely this kind of filter will be used outside of traffic shaping applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crates are free. give it its own. hide everything behind an agave-unstable-api
cargo feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya we can do this. just makes the push active set setup a little harder to read. would it be best to have this crate behind a feature and then define some dummy low pass filter api that is compiled when the feature is not enabled? otherwise we end up having a ton of gated definitions throughout push_active_set.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you only need to activate the feature in Cargo.toml, the code in push_active_set.rs will be effectively the same, just pulling lpf from another crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so when we activate the feature, we pull lpf
from the actual crate. but then when it's not activated, we just use some dummy lpf which never gets used since we'd be running WeightingMode::Static
? or am i misunderstanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we always activate the feature in agave, kinda like DCOU. but external users of our repo do not get to use that code without activating the "i will not complain about broken public API" feature. Think of it as DCOU but for prod code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but then when it's not activated
this doesn't exist. if the feature isn't specified, the crate exports no symbols. we're forcing consumers into a contract where they acknowledge consuming an unstable api
gossip/src/low_pass_filter.rs
Outdated
// Fixed point scale for K and `alpha` calculation | ||
pub const SCALE: NonZeroU64 = NonZeroU64::new(1000).unwrap(); | ||
// 2 * pi * SCALE | ||
const TWO_PI_SCALED: u64 = 6_283; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const TWO_PI_SCALED: u64 = 6_283; | |
const TWO_PI_SCALED: u64 = (std::f64::consts::PI * 2.0 * 1000.0) as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this suggestion leaves rounding to the platform. prefer a literal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not round on the platform. The rounding is done at compile time and is 100% reproducible, else it would not build. https://godbolt.org/z/vonv18Kos both asm outputs return literal 3141592 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is true if we ship a binary, right? but if we ship the source code and people build from source, the platform still has to build this and round at compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rounding is 100% deterministic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the rounding is 100% deterministic.
thanks for completely ignoring "to the platform"
gossip/src/low_pass_filter.rs
Outdated
use std::num::NonZeroU64; | ||
|
||
// Fixed point scale for K and `alpha` calculation | ||
pub const SCALE: NonZeroU64 = NonZeroU64::new(1000).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of having this as NonZero type if this is a constant anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may further be preferable to have this as a power of two for performance reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think cross the PoT claim when slowlana tells us to. in the meantime, base-10 is easier to reason about
gossip/src/low_pass_filter.rs
Outdated
/// If future code changes make `alpha_target` jump larger, we must retune | ||
/// `TC`/`K` or use a higher‑order filter to avoid lag/overshoot. | ||
/// Returns `alpha_new = K * target + (1 - K) * prev`, rounded and clamped. | ||
pub fn filter_alpha(prev: u64, target: u64, k: u64, min: u64, max: u64) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it not be better to supply a range such as 1..2 rather than min and max parameters? A function taking 4 u64 parameters is asking to get misused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya good call. can do something like:
pub fn filter_alpha(alpha_prev: u64, alpha_target: u64, config: FilterConfig) -> u64 {...}
where:
pub struct FilterConfig {
output_range: Range<u64>,
k: u64,
}
gossip/src/low_pass_filter.rs
Outdated
@@ -0,0 +1,62 @@ | |||
//! Fixed-point IIR filter for smoothing `alpha` updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to note that this is an order 1 Butterworth filter in case someone greps for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first pass looking good. didn't review any of the tests yet
gossip/src/low_pass_filter.rs
Outdated
use std::num::NonZeroU64; | ||
|
||
// Fixed point scale for K and `alpha` calculation | ||
pub const SCALE: NonZeroU64 = NonZeroU64::new(1000).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure millies is sufficient precision? i usually try to take as many orders as the expected math will allow. we can always truncate in place if necessary
gossip/src/low_pass_filter.rs
Outdated
pub fn compute_k(fs_ms: u64, tc_ms: u64) -> u64 { | ||
if tc_ms == 0 { | ||
return 0; | ||
} // disabled / passthrough |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure this consumer detail belongs down here
gossip/src/low_pass_filter.rs
Outdated
return 0; | ||
} // disabled / passthrough | ||
let scale = SCALE.get(); | ||
let wc_scaled = (TWO_PI_SCALED * fs_ms) / tc_ms; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we fix all of the math in this module to saturating variants?
gossip/src/low_pass_filter.rs
Outdated
debug_assert!(t <= scale, "interpolation t={} > SCALE={}", t, scale); | ||
let base_squared = base.saturating_mul(base); | ||
((base * (scale - t) + base_squared * t) + scale / 2) / scale | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this module could use some tests. especially at the boundaries of mathematical operations
gossip/src/stake_weighting_config.rs
Outdated
} | ||
|
||
mod weighting_config_control_pubkey { | ||
solana_pubkey::declare_id!("maCwV5aXW6srR4kppBWTeNdizYhyEwVcVh6mSWnFZxo"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider grinding a vanity key for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we need multisig for this same as for all2all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just the address to load the record account from, no? multisig is not applicable here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the validator part of things does not care how the account is handled on chain.
gossip/src/push_active_set.rs
Outdated
tc_ms: u64, | ||
} | ||
|
||
impl Default for PushActiveSet { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default
is usually bug. prefer literal instantiation. add a fn default_for_tests()
under dcou if that's where the "convenience" of Default
is desired
gossip/src/push_active_set.rs
Outdated
} | ||
|
||
impl PushActiveSet { | ||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will only be visible within the crate
gossip/src/push_active_set.rs
Outdated
} | ||
|
||
fn apply_cfg(&mut self, cfg: WeightingConfig) { | ||
match cfg.weighting_mode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deserializer can handle this for us, no?
gossip/src/push_active_set.rs
Outdated
} | ||
|
||
// Only recompute K if tc_ms changed | ||
let new_tc_ms = if cfg.tc_ms != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're going to use a sentinel value, prefer an enum with self-documenting variant names
.collect(); | ||
entry.rotate(rng, size, num_bloom_filter_items, nodes, &weights); | ||
|
||
match self.mode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this block be refactored to avoid replicode? seems like the map closure is the only difference
17baf80
to
d93b7da
Compare
d93b7da
to
320c570
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good generally, left a couple of nits, mostly in low-pass-filter crate.
low-pass-filter/src/lib.rs
Outdated
// Fixed point scale for K and `alpha` calculation | ||
pub const SCALE: NonZeroU64 = NonZeroU64::new(1_000_000).unwrap(); | ||
// 2 * pi * SCALE | ||
const TWO_PI_SCALED: u64 = 6_283_185; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have const eval and it works!
const TWO_PI_SCALED: u64 = 6_283_185; | |
const TWO_PI_SCALED: u64 = (std::f64::consts::PI * SCALE as f64) as u64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trent thought we should define the literal. see: #7098 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, missed it. Consts are evaluated at compile time and are 100% consistent across builds on any architecture. You can trivially test it on a mac vs devbox.
low-pass-filter/src/lib.rs
Outdated
/// Note: This function is most accurate when `base` is small e.g. < ~25. | ||
#[inline] | ||
#[allow(clippy::arithmetic_side_effects)] | ||
pub fn interpolate(base: u64, t: u64) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in comment parameter is called alpha and in signature it is t. Also a bit unclear why this function is called interpolate and which function is it actually approximating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. will update parameters to be uniform.
also, this probably should be in push_active_set
. it's not related to the filter at all. we need interpolation here in order to avoid floating point math:
In Agave, we currently weight nodes by: (bucket + 1)^2
. But the square is too severe and underweights unstaked nodes, especially when there aren't many unstaked in the network (aka testnet). So we are going to replace the square with alpha
where alpha
is in the range [1,2] and is based on the fraction of unstaked nodes in the network. so if there are 60% unstaked nodes, then alpha
will be 1.6.
However, we can't just do (bucket + 1)^alpha
since this would require floating point math. So, we scale alpha
by 1_000_000
. so now alpha
is in [1000000, 2000000]
. and we still can't do (bucket + 1)^alpha
since we will overflow. We need to approximate (bucket + 1)^alpha
by linearly interpolating between (bucket + 1)^1
and (bucket + 1)^2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, maybe i can call it linearly_interpolate()
or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok now I fully understand what it does, thank you for explaining!
You could directly sidestep the overflow concern by using u128 (yes it will be slower but this does not have to be ridiculously fast).
I think the core cause of confusion is that this function's signature does not match what one would expect from a "normal" interpolator (since it is gossip-specific). Normally one would expect it to be defined like
fn linear_interpolate (x1:u64 /* start */, x2:u64 /* stop */, t:u64 /* parameter */ ); // t < SCALE (i.e. between 0 and 1)
So extracting the interpolator from the gossip-specific callsite where we call
interpolate ((bucket + 1), (bucket+1)^2, (alpha *SCALE) as u64 - SCALE)
would go a long way towards making it obvious. Then you could keep the interpolator in this crate (maybe call crate fixed-point-math
or something generic like that) and call it from gossip with gossip-specific parameters as in example above.
Alternatively, this can be moved to gossip crate as in that context it would be more clear, just call it gossip_bucket_interpolate
or something like it, so people grepping for a linear interpolator would not get confused & you need not make any changes to the implementation/tests. Probably better way to ship this in 3.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya think the last option is best for now. will rename to make more clear
756e0fc
to
c56f65a
Compare
c56f65a
to
acc661b
Compare
41a83ee
to
9686c5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left a few nits but nothing blocking at this point.
gossip/src/stake_weighting_config.rs
Outdated
@@ -0,0 +1,76 @@ | |||
use {serde::Deserialize, solana_account::ReadableAccount, solana_runtime::bank::Bank}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe hide this whole module behind agave-unstable-api? Then we can safely axe it once tests are done without breaking semver. Does not have to be this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the best way to do this? with the low-pass-filter crate i could gate it behind agave-unstable-api and then pull it into gossip crate in cargo.toml and activate the agave-unstable-api feature. but here, this file is already in the gossip crate...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well actually you could do better and just not mark any items here "pub". They are not used outside of gossip anyway so no need for them to be pub, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
mod weighting_config_control_pubkey { | ||
solana_pubkey::declare_id!("goSwVUizoqNYKEaaiTjkgdN2RgLpvsTvFt1MEVGibY9"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please allocate this account on testnet before merging the changes and write all zeros in there so we can be sure this does not activate accidentally as we are setting up the multisigs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so i can deploy this account without setting up multisig first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you can set authority later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why tf does this need a control account? these are not toys to be used at the slightest whim. where is the cluster gate? there is no way in hell we can deploy this logic to mb. please think before you merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. I forgot to add the cluster gate.
the thing we're changing with the control account is the filter parameter tc_ms
and then whether we want the dynamic weighting or the fixed weighting. Math says dynamic weighting should work as expected + filter tests show filter converging pretty quickly (~4 iterations) with the current 30000 tc_ms. Downside if we don't have the control is that if something goes wrong testnet may go down (i guess that's the point of testnet) or we have to make a new PR with a different tc_ms (not expected).
Are you suggesting we just axe the control account and just roll out dynamic weighting and just see if it matches expectations on testnet?
@@ -212,21 +409,20 @@ mod tests { | |||
} | |||
|
|||
#[test] | |||
fn test_push_active_set() { | |||
fn test_push_active_set_static_weighting() { | |||
const CLUSTER_SIZE: usize = 117; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but should we not make this ~4000 for this test to be more representative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya maybe in a future PR. it doesn't really matter. we're just testing rotating nodes into our active set and pruning. cluster size doesn't have an impact here beyond like 20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok then, nevermind=)
gossip/src/push_active_set.rs
Outdated
#[test] | ||
fn test_alpha_converges_to_expected_target() { | ||
const CLUSTER_SIZE: usize = 415; | ||
const TOLERANCE_MILLI: u64 = 10_000; // ±1% of alpha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand, scale is hardcoded here. This will explode if we change the scale, right? Maybe worth updating this one? Doe not have to be this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
low-pass-filter/Cargo.toml
Outdated
path = "src/lib.rs" | ||
|
||
[features] | ||
default = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI by default all features are disabled https://doc.rust-lang.org/cargo/reference/features.html#the-default-feature. Does not matter here of course.
8490937
to
3ca64a7
Compare
3ca64a7
to
97ecb64
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
Problem
This PR addresses this Issue: #6903
This achieves the same goal as PR: #7070 but does it with fixed point instead of floating point math.
Unstaked and low staked nodes in gossip do not receive many messages on testnet due to the testnet stake distribution and the aggressive quadratic weighting of peers by stake. As a result, unstaked nodes have a poor and inconsistent view of the peers in the network.
Linear weighting is not aggressive enough and causes a disproportionate number of messages to be sent to unstaked nodes on mainnet.
We need weighting that falls in between linear and quadratic.
Summary of Changes
Add a dynamic weighting element
alpha
.alpha
fluctuates between 1 and 2 at runtime and is linearly proportional to the number of unstaked nodes in the cluster.For example, on mainnet, there are 82% unstaked nodes, so
alpha
will converge on 1.82. On testnet, there are 4% unstaked nodes, soalpha
will converge on 1.04.We use a first-order, low-pass filter to update
alpha
to avoid constantalpha
thrashing and overshooting the new target. Unit tests show that alpha converges on the targetalpha
within approximately 4-5 push active set rotations (30-38s).Note: floating point operations are expensive but
rotate()
is called once every 7.5s, so should not add any significant overhead.Testing has shown that flattening out the weight curve when their aren't many unstaked nodes (as on testnet), increases an unstaked nodes ingress gossip traffic as desired. On the other hand, high staked nodes don't see a significant reduction in received traffic.
All calculations and data that backup and support this PR here: