Aura alterations for Asynchronous Backing compatibility #2476
Description
The bulk of the changes here should also apply to SASSAFRAS.
High Level
(feel free to skip ahead to the proposed modifications section)
As a recap (more details in #2301), we need to prepare Cumulus collators for more advanced parachains protocol features such as Asynchronous Backing, Elastic Scaling, and Blockspace Regions. These changes impact the collator logic and protocols in similar ways: they enable more blockspace to be utilized, and they enable that blockspace to be consumed optimistically, based on predictions of what will happen on the relay chain.
Collator-selection algorithms are not responsible for consensus safety (i.e. security), only for consensus liveness (not halting). Liveness is a spectrum, though: from a formal standpoint, consensus algorithms usually focus on guaranteeing some minimal level of liveness, whereas in practice we care about maximizing throughput, minimizing latency, and reducing waste.
To be more specific, we can clarify two desirable properties for a collator-selection algorithm:
- Maximize utilization of available blockspace (Goal L1)
- Minimize effort in fulfilling L1 (Goal L2)
The phrase "available blockspace" here means something very specific. It refers to the number of cores and the proportion of each of those cores allocated to the parachain. This deliberately ignores the Pay-as-you-Go model, as Aura and similar collator-selection mechanisms are focused on dividing up authorship rights within already allocated blockspace regions, as opposed to Pay-as-you-Go, which is focused on purchasing and consuming blockspace in small increments.
The available blockspace can be roughly condensed into two variables: the maximum capacity C and the expected velocity V. Capacity refers to the number of blocks that can be pending at any given time; e.g. for 1 core this might be 1 + 4 (the 4 comes from some lookahead performed by Polkadot's validators). The velocity of blockspace refers to the rate at which the relay chain processes parachain blocks. Roughly speaking, with one core V = 1, with two cores V = 2, and so on. Velocity calculations become more subtle when we take into account parachains sharing cores with strided blockspace regions.
The term "unincluded segment" (cc #1866) refers to the currently used capacity of the chain. Parachains control the maximum size of the unincluded segment and set C themselves, but the C variable is in practice capped by the relay chain's configuration. The remaining capacity of the unincluded segment is c = C - len(unincluded).
New relay-chain blocks increase c at velocity v <= V (the actual velocity is always less than or equal to the maximum) via the advancement of the backing and availability protocols. That is, the advancement of the relay chain opens up new capacity to be filled by parachain block authors. New parachain blocks extend the unincluded segment and consume capacity.
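For concreteness, here is a minimal sketch of that bookkeeping. The names (`BlockspaceTracker`, `on_relay_block`, etc.) are illustrative only and do not correspond to actual Cumulus types:

```rust
// Illustrative model of the capacity/velocity bookkeeping described above.
// Names and types are hypothetical, not the real Cumulus data structures.

/// Tracks the parachain's unincluded segment against its configured limits.
struct BlockspaceTracker {
    /// Maximum capacity C: how many parachain blocks may be pending at once.
    max_capacity: u32,
    /// Expected velocity V: pending blocks the relay chain can process per relay block.
    max_velocity: u32,
    /// Blocks built but not yet included (the unincluded segment).
    unincluded: Vec<[u8; 32]>,
}

impl BlockspaceTracker {
    /// Remaining capacity c = C - len(unincluded).
    fn remaining_capacity(&self) -> u32 {
        self.max_capacity.saturating_sub(self.unincluded.len() as u32)
    }

    /// A new relay-chain block processes up to v <= V pending parachain blocks,
    /// freeing capacity at the front of the segment.
    fn on_relay_block(&mut self, included: u32) {
        let included = included.min(self.max_velocity) as usize;
        self.unincluded.drain(..included.min(self.unincluded.len()));
    }

    /// A new parachain block extends the unincluded segment, if capacity allows.
    fn on_para_block(&mut self, hash: [u8; 32]) -> Result<(), &'static str> {
        if self.remaining_capacity() == 0 {
            return Err("unincluded segment is full");
        }
        self.unincluded.push(hash);
        Ok(())
    }
}
```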
This "capacity"/"velocity" framing may seem overwrought given the current state of blockspace in Polkadot: parachains have 1 core and they are scheduled at every moment on that core. However, with elastic scaling and divisible blockspace regions, it starts to become more reasonable to talk about capacity and velocity values that are much greater than 1 and much smaller than 1.
Proposed Aura modifications
Aura slots in Cumulus are the same as the relay-parent's slot.
- Allow block authors to build multiple blocks per slot.
- Only allow block authors to build a new block when the unincluded segment would not exceed its maximum.
- Only allow block authors to build up to V + 1 blocks per slot.
All of these checks can be guaranteed by runtime pallets. The velocity can be deduced from the relay-chain state and the maximum unincluded segment length is determined by the parachain's own runtime.
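A rough sketch of what these runtime-side checks amount to, using hypothetical parameter names and a free function rather than the real pallet-aura / parachain-system APIs:

```rust
// Sketch of the runtime checks described above; inputs and names are
// illustrative, not the actual pallet storage items or call signatures.

/// Validate that a newly authored parachain block respects both limits.
fn check_new_block(
    unincluded_len: u32,     // current length of the unincluded segment
    max_unincluded_len: u32, // C, chosen by the parachain's own runtime
    blocks_this_slot: u32,   // blocks already authored by this author in the slot
    velocity: u32,           // V, deduced from relay-chain state
) -> Result<(), &'static str> {
    // The unincluded segment must not exceed its maximum.
    if unincluded_len >= max_unincluded_len {
        return Err("unincluded segment would exceed its maximum");
    }
    // At most V + 1 blocks per author per slot.
    if blocks_this_slot >= velocity.saturating_add(1) {
        return Err("author exceeded V + 1 blocks in this slot");
    }
    Ok(())
}
```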
In practice, we need a few changes to support this.
- Generalize pallet-aura or create a drop-in replacement that allows multiple blocks per slot (pallet-aura: Allow multiple blocks per slot substrate#14024)
- Limit unincluded segment length ((async backing) parachain-system: track limitations for unincluded blocks #2438)
- Enable runtime consensus logic to determine the current velocity of the parachain (refactor unincluded segment length into a ConsensusHook #2501); see the sketch after this list
- Ensure only the first V + 1 blocks from a given author in a slot are accepted by the runtime (#2544)
- Expose information about unincluded segment and velocity to the node-side code that controls authorship
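To illustrate the ConsensusHook idea referenced in #2501, here is a simplified sketch; the trait shape, method, and type names are illustrative only, not the actual API:

```rust
// Simplified illustration of a "ConsensusHook": the runtime's consensus logic
// tells parachain-system how much unincluded-segment capacity is acceptable
// right now, based on whatever it can deduce from relay-chain state.

/// How many blocks the parachain allows to remain unincluded at the moment.
struct SegmentCapacity(u32);

trait ConsensusHook {
    /// Inspect relay-chain information (e.g. to deduce velocity) and return
    /// the capacity the unincluded segment may grow to for this block.
    fn unincluded_segment_capacity(relay_parent_number: u32) -> SegmentCapacity;
}

/// A fixed-velocity hook: always allow the configured capacity C.
struct FixedVelocity<const C: u32>;

impl<const C: u32> ConsensusHook for FixedVelocity<C> {
    fn unincluded_segment_capacity(_relay_parent_number: u32) -> SegmentCapacity {
        SegmentCapacity(C)
    }
}
```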
Liveness Analysis
There are two states a (parachain parent, relay-parent) pair can be in, and we'll analyze the outcomes in each of them.
c = 0
By the definition of c, this means there is no remaining capacity for new blocks at this relay-parent.
The slot author for any relay-parent where the unincluded segment is totally full cannot author any blocks.
c > 0
The unincluded segment has some remaining capacity c. The slot owner as of relay-parent R can create min(V + 1, c) blocks with R as their relay-parent. This increases the utilization of the maximum capacity.
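As a sketch, the rule above reduces to a single expression (the function and parameter names here are illustrative):

```rust
/// How many blocks the slot owner may build on relay-parent R: min(V + 1, c).
fn blocks_allowed_at(velocity: u32, remaining_capacity: u32) -> u32 {
    velocity.saturating_add(1).min(remaining_capacity)
}

// e.g. with V = 1 and an empty segment of capacity 4: min(2, 4) = 2 blocks;
// with the segment nearly full (c = 1): min(2, 1) = 1 block.
```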
Implications
The goal is to build as long a backlog of blocks as the parachain will allow (a latency/throughput tradeoff). When the backlog is empty, each collator adds one extra parachain block per relay-chain block until the backlog is full. When the backlog is full, each collator simply replaces the blocks that have now been included into the relay chain.
This does imply that it takes N relay-chain blocks to fill an empty backlog of N blocks (the backlog grows by one net block per relay-chain block), but backlogs are expected to be short, e.g. 2-4 blocks, so it will only take 12-24 seconds to fill the backlog completely. This fulfills Goal L1.
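A tiny worked example of that arithmetic, assuming 6-second relay-chain blocks (the function is purely illustrative):

```rust
/// Time to fill an empty backlog: one net parachain block per relay-chain block
/// (the author builds V + 1 while the relay chain includes V).
fn seconds_to_fill(backlog_len: u32, relay_block_time_secs: u32) -> u32 {
    backlog_len * relay_block_time_secs
}

// e.g. seconds_to_fill(2, 6) == 12, seconds_to_fill(4, 6) == 24.
```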
Goal L2 is mostly fulfilled by limiting the contention over which collators get to fill up which parts of the backlog. By limiting each collator to at most V + 1 blocks per slot, the overlap between collators is bounded. There are still cases where collators can decide to build on entirely competing forks, but those need to be addressed at the fork-choice and incentive levels.