
Make range sync peer loadbalancing PeerDAS-friendly #6922


Merged
27 commits merged into sigp:unstable on May 7, 2025

Conversation

dapplion
Collaborator

@dapplion dapplion commented Feb 6, 2025

Issue Addressed

Range sync and backfill sync still assume that each batch request is served by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

  • Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted
  • Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers
  • The implementation is very strict when we have no peers to request from. With PeerDAS this case is very common, and we want to be flexible and handle it better than hard failing everything

Proposed Changes

  • Upstream peer prioritization to the network context, which knows exactly how many active requests each peer has (including data columns by range)
  • Upstream peer selection to the network context: block_components_by_range_request now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error RpcRequestSendError::NoPeer (a rough sketch of this selection follows this list)
  • Range sync and backfill sync handle RpcRequestSendError::NoPeer explicitly
    • Range sync: leaves the batch in AwaitingDownload state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long - EDIT: Not done in this PR
    • Backfill sync: pauses the sync until another peer joins - EDIT: Same logic as unstable
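
To illustrate the idea, here is a minimal sketch with made-up types (`PeerId`, `active_requests`, and `select_peer` are stand-ins, not the actual Lighthouse code): least-loaded selection over a candidate set with an explicit `NoPeer` error.

```rust
// Minimal illustrative sketch only: the real implementation lives in the sync
// network context and uses libp2p's PeerId and Lighthouse's request tracking.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PeerId(u64); // stand-in for libp2p's PeerId

#[derive(Debug)]
enum RpcRequestSendError {
    NoPeer,
}

struct SyncNetworkContext {
    /// Active requests currently in flight per peer, counting data columns
    /// by range as well as blocks/blobs by range.
    active_requests: HashMap<PeerId, usize>,
}

impl SyncNetworkContext {
    /// Pick the candidate peer with the fewest active requests, or return
    /// `NoPeer` if the candidate set is empty.
    fn select_peer(
        &self,
        candidates: impl IntoIterator<Item = PeerId>,
    ) -> Result<PeerId, RpcRequestSendError> {
        candidates
            .into_iter()
            .min_by_key(|peer| self.active_requests.get(peer).copied().unwrap_or(0))
            .ok_or(RpcRequestSendError::NoPeer)
    }
}

fn main() {
    let ctx = SyncNetworkContext {
        active_requests: HashMap::from([(PeerId(1), 3), (PeerId(2), 1)]),
    };
    // Peer 2 is the least loaded, so it gets the next batch request.
    assert_eq!(ctx.select_peer([PeerId(1), PeerId(2)]).unwrap(), PeerId(2));
    // With no candidates the caller gets `NoPeer`, which range sync and
    // backfill sync then handle explicitly instead of hard failing.
    assert!(ctx.select_peer(std::iter::empty()).is_err());
}
```

One useful property of counting all active requests in a single place is that data columns by range and blocks by range load-balance against each other, which is exactly the accounting the first bullet above says was missing.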

TODOs

  • Add tests :)
  • Manually test backfill sync

Note: this touches the mainnet path!

@dapplion dapplion requested a review from jxs as a code owner February 6, 2025 05:16
@dapplion dapplion added ready-for-review The code is ready for review syncing das Data Availability Sampling labels Feb 6, 2025
@dapplion dapplion requested a review from jimmygchen February 6, 2025 05:19
@dapplion
Collaborator Author

dapplion commented Feb 6, 2025

I would like to wait to merge this PR until we have more test coverage

@dapplion
Collaborator Author

dapplion commented Feb 6, 2025

@jimmygchen I have reduced the scope of this PR. I intended to deprecate the check good_peers_on_sampling_subnets in this PR, but it's a very sensitive change. I left that logic untouched; what we do now is:

  • Check if good_peers_on_sampling_subnets
    • If no, don't create batch
    • If yes, create batch and send it
      • If we don't have enough custody peers, error and drop chain

In the future we can deprecate the good_peers_on_sampling_subnets check by allowing batches to remain in the AwaitingDownload state. It's essentially duplicate code, as we check for peers twice. It should also make sync less likely to drop chains, as we did in lookup sync by allowing requests to be peer-less for some time.

I added a TODO(das) to tackle in another PR

// TODO(das): Handle the NoPeer case explicitly and don't drop the batch. For
// sync to work properly it must be okay to have "stalled" batches in
// AwaitingDownload state. Currently it will error with invalid state if
// that happens. Sync manager must periodically prune stalled batches like
// we do for lookup sync. Then we can deprecate the redundant
// `good_peers_on_sampling_subnets` checks.
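
For context, here is a hypothetical sketch of the follow-up the TODO describes (the names, states, and timeout are assumptions, not code from this PR): a batch with no available peer stays parked in `AwaitingDownload`, and the sync manager periodically prunes batches that have been stalled too long, as lookup sync already does for peer-less lookups.

```rust
// Hypothetical sketch of the TODO above, not part of this PR. `BatchState`,
// `Batch`, and the timeout value are illustrative stand-ins.
use std::time::{Duration, Instant};

enum BatchState {
    /// No peer was available; the batch waits instead of failing the chain.
    AwaitingDownload { since: Instant },
    Downloading,
}

struct Batch {
    state: BatchState,
}

/// Assumed stall timeout; a real value would need tuning.
const STALL_TIMEOUT: Duration = Duration::from_secs(120);

/// Called periodically by the sync manager: drop batches that have been
/// waiting for a peer longer than `STALL_TIMEOUT`, keep everything else.
fn prune_stalled_batches(batches: &mut Vec<Batch>) {
    batches.retain(|batch| match batch.state {
        BatchState::AwaitingDownload { since } => since.elapsed() < STALL_TIMEOUT,
        BatchState::Downloading => true,
    });
}

fn main() {
    let mut batches = vec![
        Batch { state: BatchState::AwaitingDownload { since: Instant::now() } },
        Batch { state: BatchState::Downloading },
    ];
    prune_stalled_batches(&mut batches);
    // A freshly parked batch is kept; only long-stalled ones would be pruned.
    assert_eq!(batches.len(), 2);
}
```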

mergify bot pushed a commit that referenced this pull request Feb 11, 2025
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR #6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which is refreshed every 5 seconds.

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
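
As a rough illustration of how such a gauge could be wired up (using the generic `prometheus` crate rather than Lighthouse's own metrics helpers; only the metric name comes from the list above, everything else is assumed):

```rust
// Illustrative only: registering a per-subnet peer-count gauge and refreshing it
// from a periodic metrics task, using the `prometheus` crate rather than
// Lighthouse's actual metrics module. The sample data is made up.
use prometheus::{IntGaugeVec, Opts, Registry};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();
    let peers_per_subnet = IntGaugeVec::new(
        Opts::new(
            "sync_peers_per_custody_column_subnet",
            "Connected peers serving each custody column subnet",
        ),
        &["subnet"],
    )?;
    registry.register(Box::new(peers_per_subnet.clone()))?;

    // In the spawned metrics task (refreshed roughly every 5 seconds), set the
    // gauge for each custody subnet from the current peer table.
    let peer_counts: &[(u64, i64)] = &[(0, 4), (1, 0), (2, 7)]; // assumed sample data
    for &(subnet, count) in peer_counts {
        let label = subnet.to_string();
        peers_per_subnet.with_label_values(&[label.as_str()]).set(count);
    }
    Ok(())
}
```

The struck-through "zero peers" gauges can instead be derived in Grafana with `count(sync_peers_per_custody_column_subnet == 0) by (instance)`, which is the trade-off described in the list above.
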
pawanjay176 pushed a commit to pawanjay176/lighthouse that referenced this pull request Feb 21, 2025
eserilev pushed a commit to eserilev/lighthouse that referenced this pull request Mar 5, 2025

mergify bot commented Mar 12, 2025

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

Member

@jimmygchen jimmygchen left a comment


Changes look good. There are 2 failing range sync tests:
https://github.com/sigp/lighthouse/actions/runs/14340199806/job/40197309292

@jimmygchen jimmygchen requested a review from Copilot April 9, 2025 04:16

@pawanjay176 pawanjay176 added the under-review A reviewer has only partially completed a review. label Apr 30, 2025
@jimmygchen jimmygchen changed the base branch from unstable to peerdas-devnet-6 May 1, 2025 02:22
@jimmygchen
Member

@dapplion @pawanjay176 as discussed, we need more testing on this PR, but would like to avoid having too many sync branches. Once this is reviewed, we can merge this to peerdas-devnet-6 branch for testing all the sync fixes together.

I've updated the base branch for this PR.

@jimmygchen jimmygchen requested a review from Copilot May 1, 2025 02:24

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors peer selection and request handling for both range and backfill syncs to support PeerDAS while improving error handling and load balancing. The changes include updates to peer management and removal in range sync, refactoring of batch state transitions and error handling, and modifications to the network context to select peers based on active request counts.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| beacon_node/network/src/sync/range_sync/range.rs | Removed the network argument from remove_peer callbacks. |
| beacon_node/network/src/sync/range_sync/chain.rs | Refactored peer removal, updated batch removal and download completion APIs. |
| beacon_node/network/src/sync/range_sync/batch.rs | Updated BatchInfo state variants and renamed current_peer to processing_peer. |
| beacon_node/network/src/sync/network_context/requests.rs | Added helper method to iterate over active request peers. |
| beacon_node/network/src/sync/network_context/custody.rs | Renamed error variant from NoPeers to NoPeer for consistency. |
| beacon_node/network/src/sync/network_context.rs | Modified peer selection logic for block components requests using active request counts. |
| beacon_node/network/src/sync/manager.rs | Updated backfill sync peer disconnect handling. |
| beacon_node/network/src/sync/backfill_sync/mod.rs | Refactored backfill sync to remove active request tracking and update retry logic. |
| beacon_node/lighthouse_network/src/types/globals.rs | Introduced a helper to verify custody peer membership. |
Comments suppressed due to low confidence (1)

beacon_node/network/src/sync/backfill_sync/mod.rs:652

  • [nitpick] Consider resolving the TODO regarding penalizing custody column peers to ensure all relevant peers are consistently handled during backfill sync.
// TODO(das): `participating_peers` only includes block peers. Should we penalize the custody column peers too?

@sigp sigp deleted a comment from Copilot AI May 1, 2025
@jimmygchen
Member

I still can't figure out why the tests are failing; it seems like adding peers doesn't cause a new chain to be created.
I think we'll need this PR to prevent Lighthouse from requesting from peers that are either out of sync or not on the same chain.

@dapplion
Collaborator Author

dapplion commented May 3, 2025

Fixed the test by requesting peers from the global peer pool (only synced peers). To request from the SyncingChain pool of peers we need more tweaks, which we should do in another PR.

@dapplion dapplion requested a review from jimmygchen May 3, 2025 00:30
Member

@jimmygchen jimmygchen left a comment


Changes look good to me.
I'll start testing this while waiting for Pawan's review.

Member

@pawanjay176 pawanjay176 left a comment


LGTM.
I like the direction of moving the peer selection to the network context. I have tested this in the mainnet context and found no issues.
I haven't tested this on a peerdas network yet.

Given that we still have a bunch of TODOs w.r.t. attribution, I think it might be a good idea to have a peerdas-syncing branch that we merge all sync improvements into and test extensively before merging it back to unstable.
The new branch is only meant as a testing branch for all sync improvements, so we would still review properly before merging into it.
cc @dapplion @jimmygchen what do you guys think?

@jimmygchen
Member

> LGTM. I like the direction of moving the peer selection to the network context. I have tested this in the mainnet context and found no issues. I haven't tested this on a peerdas network yet.
>
> Given that we still have a bunch of TODOs w.r.t. attribution, I think it might be a good idea to have a peerdas-syncing branch that we merge all sync improvements into and test extensively before merging it back to unstable. The new branch is only meant as a testing branch for all sync improvements, so we would still review properly before merging into it. cc @dapplion @jimmygchen what do you guys think?

Yeah sounds good to me. I can also confirm this works fine with mainnet. I synced a node from scratch and it completed backfill successfully.

I haven't been able to get it to work under PeerDAS though. I'll check the logs to confirm we haven't broken anything there.

@jimmygchen jimmygchen changed the base branch from peerdas-devnet-6 to unstable May 5, 2025 23:12
@jimmygchen
Member

I couldn't get sync to work on a local devnet; it seems to be getting stuck due to metadata being acquired after the peer is added to the chain:

May 06 06:03:14.963 DEBUG Syncing new finalized chain                   id: 1, component: "range_sync"
May 06 06:03:14.963 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"
May 06 06:03:14.968 DEBUG Finalization sync peer joined                 peer_id: 16Uiu2HAmFWoqu6Y2EFTWzuZtdoupagnacQunLH2HoAkzT8ZuDi9y, component: "range_sync"
May 06 06:03:14.968 DEBUG Adding peer to known chain                    peer_id: 16Uiu2HAmFWoqu6Y2EFTWzuZtdoupagnacQunLH2HoAkzT8ZuDi9y, sync_type: Finalized, id: 1, component: "range_sync"
May 06 06:03:14.968 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"
May 06 06:03:14.968 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"
May 06 06:03:14.973 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"

I'm going to try to revive #6975 and retest.

@jimmygchen jimmygchen removed the waiting-on-author The reviewer has suggested changes and awaits their implementation. label May 6, 2025
@jimmygchen
Member

Confirmed this works on the peerdas devnet too with #6975. Merging this now.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed under-review A reviewer has only partially completed a review. labels May 7, 2025
mergify bot added a commit that referenced this pull request May 7, 2025
@mergify mergify bot merged commit beb0ce6 into sigp:unstable May 7, 2025
31 checks passed