
federation_sender spinning endlessly(?) and hammering get_catch_up_room_event_ids() after purging a room #18491

Open
@intelfx

Description

After purging a room (!QQpfJfZvqxbCfeDgCj:matrix.org, which is a recently tombstoned TWIM), my federation_sender started spinning on something, consuming 100% CPU and hammering its database connections.

2 days later, the federation_sender instance and 5 postgres backends have accumulated more than 3 days of combined CPU time:

    PID CPU USER  VIRT   RES   SHR S  CPU% MEM%   TIME+ ▽Command
  10613  20 995  2599M  409M 20060 R 122.4  0.3 2d06h14m /usr/bin/python3 -m synapse.app.federation_sender --config-path=/etc/synapse/homeserver.yaml --config-path=/etc/synapse/workers/federation_sender.yaml
  11051  20 88   16.5G 49036 12588 S  20.8  0.0  8h42:57 postgres: synapse synapse [local] idle
  11088   1 88   16.5G 48876 12208 R  20.2  0.0  8h42:41 postgres: synapse synapse [local] idle
  11082  17 88   16.5G 48816 11476 S  14.5  0.0  8h40:45 postgres: synapse synapse [local] idle
  11085   5 88   16.5G 49328 12464 S  20.2  0.0  8h23:05 postgres: synapse synapse [local] idle
  11087   6 88   16.5G 48644 12392 S  20.8  0.0  8h22:46 postgres: synapse synapse [local] idle

Looking at pg_stat_statements, this query ranks first in both cumulative execution time and call count (23,089,417 calls and 159,560 ms total, accrued in slightly over 1 hour since a statistics reset):

SELECT event_id FROM destination_rooms
                 JOIN events USING (stream_ordering)
                WHERE destination = $1
                  AND stream_ordering > $2
                ORDER BY stream_ordering
                LIMIT $3

I cannot isolate statistics for these backend processes specifically (pg_stat_statements is global for the entire database), so this is the best evidence I have.
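For reference, figures like the ones above can be pulled with something along these lines (a sketch, not necessarily the exact query used here: it assumes the psycopg2 driver and that pg_stat_statements is enabled; the column is total_exec_time on PostgreSQL 13+, total_time on older releases):

# Sketch: rank pg_stat_statements entries touching destination_rooms by call count.
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT calls, total_exec_time, query
          FROM pg_stat_statements
         WHERE query LIKE '%destination_rooms%'
         ORDER BY calls DESC
         LIMIT 5
        """
    )
    for calls, total_ms, query in cur.fetchall():
        print(f"{calls} calls, {total_ms:.0f} ms total: {query[:80]}")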

This is the second time this has happened; judging by the first occurrence, the problem disappears when the worker is restarted, so I cannot easily try different builds or debugging tactics.

Steps to reproduce

  • purge a big room (here via the delete-room admin API; see the sketch below)
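A sketch of the kind of request that schedules the shutdown_and_purge_room task seen in the purge log below (assumptions: the v2 delete-room admin API was used and an admin access token is available; the block/purge parameters match what the task logged):

import requests

HOMESERVER = "https://intelfx.name"   # assumption: base URL of the homeserver
ADMIN_TOKEN = "<admin access token>"  # assumption: access token of an admin user
ROOM_ID = "!QQpfJfZvqxbCfeDgCj:matrix.org"

# DELETE /_synapse/admin/v2/rooms/<room_id> shuts the room down and, with
# "purge": true, removes its events from the database afterwards.
resp = requests.delete(
    f"{HOMESERVER}/_synapse/admin/v2/rooms/{ROOM_ID}",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={"purge": True, "block": False},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns a delete_id that can be polled for progress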

Homeserver

intelfx.name

Synapse Version

{"server_version":"1.130.0"}

Installation Method

Other (please mention below)

Database

PostgreSQL

Workers

Multiple workers

Platform

  • Arch Linux
  • commodity hardware (homelab-grade)
  • bare metal

Configuration

  • presence: on
  • message retention: indefinite
  • synapse modules: none

Relevant log output

Purge log (UTC):

May 29 18:05:51 synapse[2986]: synapse.handlers.room: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] Shutting down room '!QQpfJfZvqxbCfeDgCj:matrix.org'
May 29 18:05:51 synapse[2986]: synapse.handlers.pagination: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] starting purge room_id=!QQpfJfZvqxbCfeDgCj:matrix.org force=False
<...>
May 29 18:07:04 synapse[2986]: synapse.handlers.pagination: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] purge complete for room_id !QQpfJfZvqxbCfeDgCj:matrix.org
May 29 18:07:04 synapse[2986]: synapse.util.task_scheduler: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] Task complete: 73.538sec (0.371sec, 0.246sec) (0.001sec/73.515sec/10) [0 dbevts] '!QQpfJfZvqxbCfeDgCj:matrix.org', {'new_room_user_id': None, 'new_room_name': None, 'message': None, 'requester_user_id': '@intelfx:intelfx.name', 'block': False, 'purge': True, 'force_purge': False}

CPU usage graph (UTC): [image attached]

Full federation_sender log for the last 1 hour (UTC): federation_sender.txt

Anything else that would be useful to know?

Curiously, the only call site of get_catch_up_room_event_ids() (and hence the only place this query is emitted from, in per_destination_queue.py) is followed shortly afterwards by the "Catching up destination <...>" log message:

event_ids = await self._store.get_catch_up_room_event_ids(
    self._destination, last_successful_stream_ordering
)

if not event_ids:
    # No more events to catch up on, but we can't ignore the chance
    # of a race condition, so we check that no new events have been
    # skipped due to us being in catch-up mode
    if self._catchup_last_skipped > last_successful_stream_ordering:
        # another event has been skipped because we were in catch-up mode
        continue

    # we are done catching up!
    self._catching_up = False
    break

if first_catch_up_check:
    # as this is our check for needing catch-up, we may have PDUs in
    # the queue from before we *knew* we had to do catch-up, so
    # clear those out now.
    self._start_catching_up()

# fetch the relevant events from the event store
# - redacted behaviour of REDACT is fine, since we only send metadata
#   of redacted events to the destination.
# - don't need to worry about rejected events as we do not actively
#   forward received events over federation.
catchup_pdus = await self._store.get_events_as_list(event_ids)

if not catchup_pdus:
    raise AssertionError(
        "No events retrieved when we asked for %r. "
        "This should not happen." % event_ids
    )

logger.info(
    "Catching up destination %s with %d PDUs",
    self._destination,
    len(catchup_pdus),
)

However, this message is only emitted once every few minutes (compare with the 23,089,417 queries per hour above). This suggests that the worker is busy-looping via the `continue` branch shown above (line 467), i.e. repeatedly taking the path where get_catch_up_room_event_ids() returns nothing but _catchup_last_skipped is still greater than last_successful_stream_ordering.
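To illustrate that hypothesis, here is a self-contained simulation (not Synapse code, just a toy model whose names mirror the excerpt above): if every event newer than last_successful_stream_ordering has been purged, the query keeps returning an empty list while _catchup_last_skipped still points past last_successful_stream_ordering, so `continue` is taken on every iteration and neither value ever advances.

# Toy model of the suspected busy-loop; simplified stand-in, not the real code.
def catch_up_loop(stream_orderings, last_successful_stream_ordering,
                  catchup_last_skipped, max_iterations=1_000_000):
    iterations = 0
    while True:
        iterations += 1
        # stands in for get_catch_up_room_event_ids(): events newer than the
        # last successfully sent stream ordering -- empty once those events
        # have been purged from the database
        event_ids = [s for s in stream_orderings
                     if s > last_successful_stream_ordering]
        if not event_ids:
            if catchup_last_skipped > last_successful_stream_ordering:
                # the real loop does `continue` here without updating either
                # variable, so this condition can never become false
                if iterations >= max_iterations:  # guard so the demo terminates
                    return iterations
                continue
            return iterations  # "done catching up" -- unreachable in this scenario
        break  # would go on to fetch and send the PDUs
    return iterations

# All catch-up events purged, but a skip was recorded past the last success:
print(catch_up_loop([], last_successful_stream_ordering=100,
                    catchup_last_skipped=150))
# -> 1000000: the loop only stops because of the artificial guard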

`federation_sender` log for the last 1 hour, grepped for "Catching up destination":
May 31 15:05:40 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287855] Catching up destination ravemusic.de with 1 PDUs
May 31 15:06:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287864] Catching up destination zemos.net with 3 PDUs
May 31 15:09:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287897] Catching up destination ravemusic.de with 1 PDUs
May 31 15:10:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287905] Catching up destination zemos.net with 3 PDUs
May 31 15:13:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287942] Catching up destination ravemusic.de with 1 PDUs
May 31 15:14:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287950] Catching up destination zemos.net with 3 PDUs
May 31 15:17:13 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-289747] Catching up destination zemos.net with 3 PDUs
May 31 15:17:15 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-289924] Catching up destination ravemusic.de with 1 PDUs
May 31 15:17:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291093] Catching up destination ravemusic.de with 1 PDUs
May 31 15:18:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291100] Catching up destination zemos.net with 3 PDUs
May 31 15:21:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291136] Catching up destination ravemusic.de with 1 PDUs
May 31 15:22:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291143] Catching up destination zemos.net with 3 PDUs
May 31 15:25:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291174] Catching up destination ravemusic.de with 1 PDUs
May 31 15:26:15 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291183] Catching up destination zemos.net with 3 PDUs
May 31 15:29:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291217] Catching up destination ravemusic.de with 1 PDUs
May 31 15:30:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291228] Catching up destination zemos.net with 3 PDUs
May 31 15:33:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291260] Catching up destination ravemusic.de with 1 PDUs
May 31 15:34:10 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291267] Catching up destination zemos.net with 3 PDUs
May 31 15:37:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291302] Catching up destination ravemusic.de with 1 PDUs
May 31 15:38:05 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291309] Catching up destination zemos.net with 3 PDUs
May 31 15:41:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291344] Catching up destination ravemusic.de with 1 PDUs
May 31 15:42:14 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-293116] Catching up destination zemos.net with 3 PDUs
May 31 15:42:16 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-293296] Catching up destination ravemusic.de with 1 PDUs
May 31 15:42:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294033] Catching up destination zemos.net with 3 PDUs
May 31 15:45:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294496] Catching up destination ravemusic.de with 1 PDUs
May 31 15:46:15 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294504] Catching up destination zemos.net with 3 PDUs
May 31 15:49:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294540] Catching up destination ravemusic.de with 1 PDUs
May 31 15:50:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294549] Catching up destination zemos.net with 3 PDUs
May 31 15:53:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294581] Catching up destination ravemusic.de with 1 PDUs
May 31 15:53:45 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294584] Catching up destination synthetix.io with 1 PDUs
May 31 15:54:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294593] Catching up destination zemos.net with 3 PDUs
May 31 15:57:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294628] Catching up destination ravemusic.de with 1 PDUs
May 31 15:58:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294639] Catching up destination zemos.net with 3 PDUs
May 31 16:01:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294674] Catching up destination ravemusic.de with 1 PDUs
May 31 16:02:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294682] Catching up destination zemos.net with 3 PDUs
