
Make range sync peer loadbalancing PeerDAS-friendly #6922


Merged
27 commits merged into sigp:unstable on May 7, 2025

Conversation

dapplion
Collaborator

@dapplion dapplion commented Feb 6, 2025

Issue Addressed

Range sync and backfill sync still assume that each batch request is served by a single peer. This assumption breaks with PeerDAS, where we request custody columns from N peers.

Issues with current unstable:

  • Peer prioritization counts batch requests per peer. This accounting is now broken: data columns by range requests are not counted
  • Peer selection for data columns by range ignores the set of peers on a syncing chain and instead draws from the global pool of peers
  • The implementation is very strict when we have no peers to request from. With PeerDAS this case is very common, and we want to be flexible and handle it better than hard failing everything

Proposed Changes

  • Upstream peer prioritization to the network context, which knows exactly how many active requests each peer has (including data columns by range)
  • Upstream peer selection to the network context: block_components_by_range_request now gets a set of peers to choose from instead of a single peer. If it can't find a peer, it returns the error RpcRequestSendError::NoPeer (a rough sketch of this selection follows this list)
  • Range sync and backfill sync handle RpcRequestSendError::NoPeer explicitly
    • Range sync: leaves the batch in AwaitingDownload state and does nothing. TODO: we should have some mechanism to fail the chain if it's stale for too long - EDIT: Not done in this PR
    • Backfill sync: pauses the sync until another peer joins - EDIT: Same logic as unstable
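
To illustrate the idea, here is a minimal sketch with made-up types (`PeerId`, `active_requests`, and `select_peer` are stand-ins, not the actual Lighthouse code): least-loaded selection over a candidate set with an explicit `NoPeer` error.

```rust
// Minimal illustrative sketch only: the real implementation lives in the sync
// network context and uses libp2p's PeerId and Lighthouse's request tracking.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PeerId(u64); // stand-in for libp2p's PeerId

#[derive(Debug)]
enum RpcRequestSendError {
    NoPeer,
}

struct SyncNetworkContext {
    /// Active requests currently in flight per peer, counting data columns
    /// by range as well as blocks/blobs by range.
    active_requests: HashMap<PeerId, usize>,
}

impl SyncNetworkContext {
    /// Pick the candidate peer with the fewest active requests, or return
    /// `NoPeer` if the candidate set is empty.
    fn select_peer(
        &self,
        candidates: impl IntoIterator<Item = PeerId>,
    ) -> Result<PeerId, RpcRequestSendError> {
        candidates
            .into_iter()
            .min_by_key(|peer| self.active_requests.get(peer).copied().unwrap_or(0))
            .ok_or(RpcRequestSendError::NoPeer)
    }
}

fn main() {
    let ctx = SyncNetworkContext {
        active_requests: HashMap::from([(PeerId(1), 3), (PeerId(2), 1)]),
    };
    // Peer 2 is the least loaded, so it gets the next batch request.
    assert_eq!(ctx.select_peer([PeerId(1), PeerId(2)]).unwrap(), PeerId(2));
    // With no candidates the caller gets `NoPeer`, which range sync and
    // backfill sync then handle explicitly instead of hard failing.
    assert!(ctx.select_peer(std::iter::empty()).is_err());
}
```

One useful property of counting all active requests in a single place is that data columns by range and blocks by range load-balance against each other, which is exactly the accounting the first bullet above says was missing.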

TODOs

  • Add tests :)
  • Manually test backfill sync

Note: this touches the mainnet path!

@dapplion dapplion requested a review from jxs as a code owner February 6, 2025 05:16
@dapplion dapplion added ready-for-review The code is ready for review syncing das Data Availability Sampling labels Feb 6, 2025
@dapplion dapplion requested a review from jimmygchen February 6, 2025 05:19
@dapplion
Collaborator Author

dapplion commented Feb 6, 2025

I would like to wait to merge this PR until we have more test coverage

@dapplion
Collaborator Author

dapplion commented Feb 6, 2025

@jimmygchen I have reduced the scope of this PR. I intended to deprecate the check good_peers_on_sampling_subnets in this PR, but it's a very sensitive change. I left that logic untouched; what we do now is:

  • Check if good_peers_on_sampling_subnets
    • If no, don't create batch
    • If yes, create batch and send it
      • If we don't have enough custody peers, error and drop chain

In the future we can deprecate the good_peers_on_sampling_subnets check by allowing batches to remain in the AwaitingDownload state. It's essentially duplicate code, as we check for peers twice. It should also make sync less likely to drop chains, as we did in lookup sync by allowing requests to be peer-less for some time.

I added a TODO(das) to tackle in another PR

// TODO(das): Handle the NoPeer case explicitly and don't drop the batch. For
// sync to work properly it must be okay to have "stalled" batches in
// AwaitingDownload state. Currently it will error with invalid state if
// that happens. Sync manager must periodically prune stalled batches like
// we do for lookup sync. Then we can deprecate the redundant
// `good_peers_on_sampling_subnets` checks.
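
For context, here is a hypothetical sketch of the follow-up the TODO describes (the names, states, and timeout are assumptions, not code from this PR): a batch with no available peer stays parked in `AwaitingDownload`, and the sync manager periodically prunes batches that have been stalled too long, as lookup sync already does for peer-less lookups.

```rust
// Hypothetical sketch of the TODO above, not part of this PR. `BatchState`,
// `Batch`, and the timeout value are illustrative stand-ins.
use std::time::{Duration, Instant};

enum BatchState {
    /// No peer was available; the batch waits instead of failing the chain.
    AwaitingDownload { since: Instant },
    Downloading,
}

struct Batch {
    state: BatchState,
}

/// Assumed stall timeout; a real value would need tuning.
const STALL_TIMEOUT: Duration = Duration::from_secs(120);

/// Called periodically by the sync manager: drop batches that have been
/// waiting for a peer longer than `STALL_TIMEOUT`, keep everything else.
fn prune_stalled_batches(batches: &mut Vec<Batch>) {
    batches.retain(|batch| match batch.state {
        BatchState::AwaitingDownload { since } => since.elapsed() < STALL_TIMEOUT,
        BatchState::Downloading => true,
    });
}

fn main() {
    let mut batches = vec![
        Batch { state: BatchState::AwaitingDownload { since: Instant::now() } },
        Batch { state: BatchState::Downloading },
    ];
    prune_stalled_batches(&mut batches);
    // A freshly parked batch is kept; only long-stalled ones would be pruned.
    assert_eq!(batches.len(), 2);
}
```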

mergify bot pushed a commit that referenced this pull request Feb 11, 2025
Currently we track a key metric `PEERS_PER_COLUMN_SUBNET` in a getter `good_peers_on_sampling_subnets`. Another PR #6922 deletes that function, so we have to move the metric anyway. This PR moves that metric computation to the spawned metrics task, which is refreshed every 5 seconds.

I also added a few more useful metrics. The total set and intended usage is:

- `sync_peers_per_column_subnet`: Track health of overall subnets in your node
- `sync_peers_per_custody_column_subnet`: Track health of the subnets your node needs. We should track this metric closely in our dashboards with a heatmap and bar plot
- ~~`sync_column_subnets_with_zero_peers`: Is equivalent to the Grafana query `count(sync_peers_per_column_subnet == 0) by (instance)`. We may prefer to skip it, but I believe it's the most important metric as if `sync_column_subnets_with_zero_peers > 0` your node stalls.~~
- ~~`sync_custody_column_subnets_with_zero_peers`: `count(sync_peers_per_custody_column_subnet == 0) by (instance)`~~
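
As a rough illustration of how such a gauge could be wired up (using the generic `prometheus` crate rather than Lighthouse's own metrics helpers; only the metric name comes from the list above, everything else is assumed):

```rust
// Illustrative only: registering a per-subnet peer-count gauge and refreshing it
// from a periodic metrics task, using the `prometheus` crate rather than
// Lighthouse's actual metrics module. The sample data is made up.
use prometheus::{IntGaugeVec, Opts, Registry};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();
    let peers_per_subnet = IntGaugeVec::new(
        Opts::new(
            "sync_peers_per_custody_column_subnet",
            "Connected peers serving each custody column subnet",
        ),
        &["subnet"],
    )?;
    registry.register(Box::new(peers_per_subnet.clone()))?;

    // In the spawned metrics task (refreshed roughly every 5 seconds), set the
    // gauge for each custody subnet from the current peer table.
    let peer_counts: &[(u64, i64)] = &[(0, 4), (1, 0), (2, 7)]; // assumed sample data
    for &(subnet, count) in peer_counts {
        let label = subnet.to_string();
        peers_per_subnet.with_label_values(&[label.as_str()]).set(count);
    }
    Ok(())
}
```

The struck-through "zero peers" gauges can instead be derived in Grafana with `count(sync_peers_per_custody_column_subnet == 0) by (instance)`, which is the trade-off described in the list above.
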
pawanjay176 pushed a commit to pawanjay176/lighthouse that referenced this pull request Feb 21, 2025
eserilev pushed a commit to eserilev/lighthouse that referenced this pull request Mar 5, 2025

mergify bot commented Mar 12, 2025

This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏

Member

@jimmygchen jimmygchen left a comment


Changes look good. There are 2 failing range sync tests:
https://github.com/sigp/lighthouse/actions/runs/14340199806/job/40197309292

@jimmygchen jimmygchen requested a review from Copilot April 9, 2025 04:16

@pawanjay176 pawanjay176 added the under-review A reviewer has only partially completed a review. label Apr 30, 2025
@jimmygchen jimmygchen changed the base branch from unstable to peerdas-devnet-6 May 1, 2025 02:22
@jimmygchen
Member

@dapplion @pawanjay176 as discussed, we need more testing on this PR, but would like to avoid having too many sync branches. Once this is reviewed, we can merge this to peerdas-devnet-6 branch for testing all the sync fixes together.

I've updated the base branch for this PR.

@jimmygchen jimmygchen requested a review from Copilot May 1, 2025 02:24

@Copilot Copilot AI left a comment


Pull Request Overview

This PR refactors peer selection and request handling for both range and backfill syncs to support PeerDAS while improving error handling and load balancing. The changes include updates to peer management and removal in range sync, refactoring of batch state transitions and error handling, and modifications to the network context to select peers based on active request counts.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| beacon_node/network/src/sync/range_sync/range.rs | Removed the network argument from remove_peer callbacks. |
| beacon_node/network/src/sync/range_sync/chain.rs | Refactored peer removal, updated batch removal and download completion APIs. |
| beacon_node/network/src/sync/range_sync/batch.rs | Updated BatchInfo state variants and renamed current_peer to processing_peer. |
| beacon_node/network/src/sync/network_context/requests.rs | Added helper method to iterate over active request peers. |
| beacon_node/network/src/sync/network_context/custody.rs | Renamed error variant from NoPeers to NoPeer for consistency. |
| beacon_node/network/src/sync/network_context.rs | Modified peer selection logic for block components requests using active request counts. |
| beacon_node/network/src/sync/manager.rs | Updated backfill sync peer disconnect handling. |
| beacon_node/network/src/sync/backfill_sync/mod.rs | Refactored backfill sync to remove active request tracking and update retry logic. |
| beacon_node/lighthouse_network/src/types/globals.rs | Introduced a helper to verify custody peer membership. |
Comments suppressed due to low confidence (1)

beacon_node/network/src/sync/backfill_sync/mod.rs:652

  • [nitpick] Consider resolving the TODO regarding penalizing custody column peers to ensure all relevant peers are consistently handled during backfill sync.
// TODO(das): `participating_peers` only includes block peers. Should we penalize the custody column peers too?

@sigp sigp deleted a comment from Copilot AI May 1, 2025
@jimmygchen
Member

I still can't figure out why the tests are failing; it seems like adding peers doesn't cause a new chain to be created.
I think we'll need this PR to prevent Lighthouse from requesting from peers that are either out of sync or not on the same chain.

@dapplion
Collaborator Author

dapplion commented May 3, 2025

Fixed the test by requesting peers from the global peer pool (only synced peers). To request from the SyncingChain pool of peers we need more tweaks, which we should do in another PR.

@dapplion dapplion requested a review from jimmygchen May 3, 2025 00:30
Member

@jimmygchen jimmygchen left a comment


Changes look good to me.
I'll start testing this while waiting for Pawan's review.

Member

@pawanjay176 pawanjay176 left a comment


LGTM.
I like the direction of moving the peer selection to the network context. I have tested this in the mainnet context and found no issues.
I haven't tested this on a peerdas network yet.

Given that we still have a bunch of TODOs w.r.t. attribution, I think it might be a good idea to have a peerdas-syncing branch that we merge all sync improvements into and test extensively before merging it back to unstable.
The new branch is only meant as a testing branch for all sync improvements, so we would still review properly before merging into it.
cc @dapplion @jimmygchen what do you guys think?

@jimmygchen
Member

> LGTM. I like the direction of moving the peer selection to the network context. I have tested this in the mainnet context and found no issues. I haven't tested this on a peerdas network yet.
>
> Given that we still have a bunch of TODOs w.r.t. attribution, I think it might be a good idea to have a peerdas-syncing branch that we merge all sync improvements into and test extensively before merging it back to unstable. The new branch is only meant as a testing branch for all sync improvements, so we would still review properly before merging into it. cc @dapplion @jimmygchen what do you guys think?

Yeah sounds good to me. I can also confirm this works fine with mainnet. I synced a node from scratch and it completed backfill successfully.

I haven't been able to get it to work under PeerDAS though. I'll check the logs to confirm we haven't broken anything there.

@jimmygchen jimmygchen changed the base branch from peerdas-devnet-6 to unstable May 5, 2025 23:12
@jimmygchen
Member

I couldn't get sync to work on a local devnet; it seems to be getting stuck due to metadata being acquired after the peer is added to the chain:

May 06 06:03:14.963 DEBUG Syncing new finalized chain                   id: 1, component: "range_sync"
May 06 06:03:14.963 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"
May 06 06:03:14.968 DEBUG Finalization sync peer joined                 peer_id: 16Uiu2HAmFWoqu6Y2EFTWzuZtdoupagnacQunLH2HoAkzT8ZuDi9y, component: "range_sync"
May 06 06:03:14.968 DEBUG Adding peer to known chain                    peer_id: 16Uiu2HAmFWoqu6Y2EFTWzuZtdoupagnacQunLH2HoAkzT8ZuDi9y, sync_type: Finalized, id: 1, component: "range_sync"
May 06 06:03:14.968 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"
May 06 06:03:14.968 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"
May 06 06:03:14.973 DEBUG Waiting for peers to be available on sampling column subnets chain: 1, service: "range_sync"

I'm going to try to revive #6975 and retest.

@jimmygchen jimmygchen removed the waiting-on-author The reviewer has suggested changes and awaits their implementation. label May 6, 2025
@jimmygchen
Member

Confirmed this works on the peerdas devnet too with #6975. Merging this now.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed under-review A reviewer has only partially completed a review. labels May 7, 2025
mergify bot added a commit that referenced this pull request May 7, 2025
@mergify mergify bot merged commit beb0ce6 into sigp:unstable May 7, 2025
31 checks passed