Make range sync peer loadbalancing PeerDAS-friendly #6922
Conversation
I would like to wait to merge this PR first, to have more test coverage.
Force-pushed from 1debbbf to 4792275.
@jimmygchen I have reduced the scope of this PR. I intended to deprecate the check; in the future we can deprecate it. I added a reference to the check at lighthouse/beacon_node/network/src/sync/range_sync/chain.rs, lines 913 to 918 in 1debbbf.
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter, `good_peers_on_sampling_subnets`. Another PR #6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the metrics spawned task, which is refreshed every 5 seconds. I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: track the health of all subnets in your node.
- `sync_peers_per_custody_column_subnet`: track the health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot.
- ~~`sync_column_subnets_with_zero_peers`: equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric, as your node stalls if `sync_column_subnets_with_zero_peers > 0`.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`.~~
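For context, a minimal sketch of what a periodic metric refresh like this could look like, using the `prometheus` crate directly; Lighthouse's own metrics helpers and peer-DB accessors differ, and the names `register_peer_subnet_gauge`, `refresh_peer_subnet_metrics`, and `peers_per_subnet` are hypothetical:

```rust
use std::collections::HashMap;

use prometheus::{IntGaugeVec, Opts, Registry};

// Hypothetical registration of the per-subnet gauge described above.
fn register_peer_subnet_gauge(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new(
            "sync_peers_per_column_subnet",
            "Synced peers connected per data column subnet",
        ),
        &["subnet"],
    )?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}

// Called from the periodic metrics task (e.g. every 5 seconds) instead of
// from a getter such as `good_peers_on_sampling_subnets`.
fn refresh_peer_subnet_metrics(gauge: &IntGaugeVec, peers_per_subnet: &HashMap<u64, usize>) {
    for (subnet, count) in peers_per_subnet {
        let label = subnet.to_string();
        gauge.with_label_values(&[label.as_str()]).set(*count as i64);
    }
}
```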
This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏
Changes look good. There are 2 failing range sync tests:
https://github.com/sigp/lighthouse/actions/runs/14340199806/job/40197309292
@dapplion @pawanjay176 as discussed, we need more testing on this PR, but we would like to avoid having too many sync branches. Once this is reviewed, we can merge it into the new base branch; I've updated the base branch for this PR.
Pull Request Overview
This PR refactors peer selection and request handling for both range and backfill syncs to support PeerDAS while improving error handling and load balancing. The changes include updates to peer management and removal in range sync, refactoring of batch state transitions and error handling, and modifications to the network context to select peers based on active request counts.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Summary per file:
File | Description |
---|---|
beacon_node/network/src/sync/range_sync/range.rs | Removed the network argument from remove_peer callbacks. |
beacon_node/network/src/sync/range_sync/chain.rs | Refactored peer removal, updated batch removal and download completion APIs. |
beacon_node/network/src/sync/range_sync/batch.rs | Updated BatchInfo state variants and renamed current_peer to processing_peer. |
beacon_node/network/src/sync/network_context/requests.rs | Added helper method to iterate over active request peers. |
beacon_node/network/src/sync/network_context/custody.rs | Renamed error variant from NoPeers to NoPeer for consistency. |
beacon_node/network/src/sync/network_context.rs | Modified peer selection logic for block components requests to use active request counts (see the sketch below). |
beacon_node/network/src/sync/manager.rs | Updated backfill sync peer disconnect handling. |
beacon_node/network/src/sync/backfill_sync/mod.rs | Refactored backfill sync to remove active request tracking and update retry logic. |
beacon_node/lighthouse_network/src/types/globals.rs | Introduced a helper to verify custody peer membership. |
Comments suppressed due to low confidence (1)
beacon_node/network/src/sync/backfill_sync/mod.rs:652
- [nitpick] Consider resolving the TODO regarding penalizing custody column peers to ensure all relevant peers are consistently handled during backfill sync.
// TODO(das): `participating_peers` only includes block peers. Should we penalize the custody column peers too?
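The per-file summary above notes that network_context.rs now selects peers for block components requests based on active request counts. A minimal sketch of that style of load balancing, assuming a `PeerId` placeholder and hypothetical names (`select_peer`, `active_requests`) rather than Lighthouse's actual types:

```rust
use std::collections::HashMap;

// Placeholder for libp2p's PeerId, just to keep the sketch self-contained.
type PeerId = u64;

/// Pick the candidate peer with the fewest in-flight requests; returning
/// `None` corresponds to a `NoPeer`-style error at the call site.
fn select_peer(
    candidates: &[PeerId],
    active_requests: &HashMap<PeerId, usize>,
) -> Option<PeerId> {
    candidates
        .iter()
        .min_by_key(|peer| active_requests.get(*peer).copied().unwrap_or(0))
        .copied()
}
```

Choosing the least-loaded peer spreads by-range requests across the syncing peer set instead of pinning each batch to a single peer, which is the assumption that breaks under PeerDAS.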
I still can't figure out why the tests are failing; it seems like adding peers doesn't cause a new chain to be created.
Fixed the test by requesting peers from the global peer pool (only synced peers). Requesting from the SyncingChain pool of peers needs more tweaks, which we should do in another PR.
Changes look good to me.
I'll start testing this while waiting for Pawan's review.
LGTM.
I like the direction of moving the peer selection to the network context. I have tested this in the mainnet context and found no issues.
I haven't tested this on a peerdas network yet.
Given that we still have a bunch of TODOs wrt attribution, I think it might be a good idea to have a peerdas-syncing branch that we merge all sync improvements into and test extensively before merging it back to unstable. The new branch is only meant as a testing branch for all sync improvements, so we still review properly before merging into it.
cc @dapplion @jimmygchen what do you guys think?
Yeah, sounds good to me. I can also confirm this works fine on mainnet. I synced a node from scratch and it completed backfill successfully. I haven't been able to get it to work under PeerDAS though. I'll check the logs to confirm we haven't broken anything there.
I couldn't get sync to work on a local devnet; it seems to be getting stuck because the metadata is acquired after the peer is added to the chain:
I'm going to try to revive #6975 and retest.
Confirmed this works on a peerdas devnet too, with #6975. Merging this now.
Issue Addressed
Range sync and backfill sync still assume that each batch request is served by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.
Issues with current unstable:
Proposed Changes
- `block_components_by_range_request` gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error `RpcRequestSendError::NoPeer`.
- Sync handles `RpcRequestSendError::NoPeer` explicitly: the batch goes back to the `AwaitingDownload` state and does nothing (a sketch of this handling follows at the end of this description). TODO: we should have some mechanism to fail the chain if it's stale for too long. EDIT: Not done in this PR.

TODOs
Note: this touches the mainnet path!
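As referenced in the Proposed Changes above, here is a rough sketch of the `RpcRequestSendError::NoPeer` handling; the enums are simplified stand-ins, not the actual Lighthouse definitions:

```rust
// Simplified stand-ins for the error and batch-state types named above.
#[derive(Debug)]
enum RpcRequestSendError {
    NoPeer,
}

#[derive(Debug)]
enum BatchState {
    AwaitingDownload,
    Downloading { request_id: u64 },
}

// If the request was sent, track the download; if no peer is currently
// available, park the batch in AwaitingDownload and retry once peers join.
// TODO (as noted above): fail the chain if it stays stale for too long.
fn on_batch_send_result(result: Result<u64, RpcRequestSendError>) -> BatchState {
    match result {
        Ok(request_id) => BatchState::Downloading { request_id },
        Err(RpcRequestSendError::NoPeer) => BatchState::AwaitingDownload,
    }
}
```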