An idle event persister can hold up outbound federation of PDUs and other processes #15595
Description
During a recent federation outage on matrix.org, where federation senders would get stuck for minutes, @richvdh noticed that the minimum stream position of RoomStreamTokens was stuck because event_persister-4 wasn't advancing.
When an event persister is idle, the minimum stream position in RoomStreamTokens will be stuck at the last persisted stream position of the idle event persister plus any continuous run of stream positions seen over replication after that. That is, the minimum stream position gets stuck at the first gap. See here for how the minimum stream position is calculated.
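As a rough illustration of the behaviour described above (not Synapse's actual code), the sketch below models the minimum stream position as the lowest position reported by any writer, advanced over any contiguous run of positions seen afterwards. The function name, argument names and data shapes are all hypothetical.

```python
def minimum_stream_position(last_persisted: dict[str, int], seen: set[int]) -> int:
    """Hypothetical model of where the minimum stream position ends up.

    last_persisted: last persisted stream position reported by each event persister.
    seen: stream positions observed over replication.
    """
    # Start from the slowest writer; an idle persister pins this value.
    position = min(last_persisted.values())
    # Advance over the contiguous run of positions that follows, stopping at the first gap.
    while position + 1 in seen:
        position += 1
    return position


# event_persister-4 is idle at position 10 while another writer is at 20.
# Positions 11 and 12 were seen, 13 was not, so the minimum sticks at 12.
stuck_at = minimum_stream_position(
    {"event_persister-1": 20, "event_persister-4": 10},
    {11, 12, 14, 15, 20},
)
```

In this model the minimum stays at 12 no matter how far the active writer advances, until position 13 is seen or the idle persister reports a newer position.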
For an explanation of the fields in a RoomStreamToken, see
synapse/synapse/types/__init__.py
Lines 476 to 496 in 36df9c5
The federation senders use this minimum stream position to determine where it is safe to process up to (since new events can't appear with an earlier stream position). Thus when the minimum stream position gets stuck, the federation senders stop making progress even when there are new events from local users needing to be sent.
synapse/synapse/replication/tcp/client.py
Lines 176 to 188 in 3bf973e
synapse/synapse/federation/sender/__init__.py
Lines 444 to 453 in 36df9c5
synapse/synapse/federation/sender/__init__.py
Lines 467 to 473 in 36df9c5
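The effect on the federation senders can be sketched as follows. This is a simplified model under assumed names (events_safe_to_send, pending_positions, last_sent and min_position are hypothetical), not the logic in synapse/federation/sender: only events at or below the minimum stream position are treated as safe to process.

```python
def events_safe_to_send(
    pending_positions: list[int], last_sent: int, min_position: int
) -> list[int]:
    """Return the stream positions a sender may process in this model.

    Events above min_position are held back, because an event with an
    earlier stream position could in principle still appear.
    """
    return [p for p in sorted(pending_positions) if last_sent < p <= min_position]


# With the minimum stuck at 12, the local events at 30 and 31 are not sent.
sendable = events_safe_to_send([11, 12, 30, 31], last_sent=10, min_position=12)
```

Here the events at 30 and 31 stay queued for as long as the minimum stream position does not move, which matches the stall described above.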
The idle event persister could likely do something to fix the problem, since it can tell when it is behind.
synapse/synapse/storage/util/id_generators.py
Lines 759 to 794 in 36df9c5
Note that _persisted_upto_position can end up ahead of the event persister's own position in _current_positions if it has nothing in flight. However, the event persister's own position doesn't appear to be updated and _persisted_upto_position isn't broadcast over replication.
Even if we did broadcast _persisted_upto_position over replication, this would only work for a single idle event persister. When there are two or more idle event persisters, we would just get stuck.
Note that the second half of the code is responsible for advancing the minimum stream position up to the first gap in stream positions.