Description
Description
After purging a room (!QQpfJfZvqxbCfeDgCj:matrix.org
, which is a recently tombstoned TWIM), my federation_sender started spinning on something consuming 100% CPU and hammering its database connections.
2 days later, the federation_sender
instance and 5 postgres backends have accumulated more than 3 days of combined CPU time:
PID CPU USER VIRT RES SHR S CPU% MEM% TIME+ ▽Command
10613 20 995 2599M 409M 20060 R 122.4 0.3 2d06h14m /usr/bin/python3 -m synapse.app.federation_sender --config-path=/etc/synapse/homeserver.yaml --config-path=/etc/synapse/workers/federation_sender.yaml
11051 20 88 16.5G 49036 12588 S 20.8 0.0 8h42:57 postgres: synapse synapse [local] idle
11088 1 88 16.5G 48876 12208 R 20.2 0.0 8h42:41 postgres: synapse synapse [local] idle
11082 17 88 16.5G 48816 11476 S 14.5 0.0 8h40:45 postgres: synapse synapse [local] idle
11085 5 88 16.5G 49328 12464 S 20.2 0.0 8h23:05 postgres: synapse synapse [local] idle
11087 6 88 16.5G 48644 12392 S 20.8 0.0 8h22:46 postgres: synapse synapse [local] idle
Looking at pg_stat_statements
, this query is top 1 in cumulative execution time and call count (23089417 calls and 159560 ms total accrued in slightly over 1 hour after statistics reset):
SELECT event_id FROM destination_rooms
JOIN events USING (stream_ordering)
WHERE destination = $1
AND stream_ordering > $2
ORDER BY stream_ordering
LIMIT $3
I cannot isolate statistics for these backend processes specifically (pg_stat_statements is global for the entire database), so this is the best evidence I have.
This has happened for the second time, and judging by the first one, the problem disappears when the worker is restarted, so I cannot easily try different builds or debugging tactics.
Steps to reproduce
- purge a big room
Homeserver
intelfx.name
Synapse Version
{"server_version":"1.130.0"}
Installation Method
Other (please mention below)
Database
PostgreSQL
Workers
Multiple workers
Platform
- Arch Linux
- commodity hardware (homelab-grade)
- bare metal
Configuration
- presence: on
- message retention: indefinite
- synapse modules: none
Relevant log output
Purge log (UTC):
May 29 18:05:51 synapse[2986]: synapse.handlers.room: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] Shutting down room '!QQpfJfZvqxbCfeDgCj:matrix.org'
May 29 18:05:51 synapse[2986]: synapse.handlers.pagination: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] starting purge room_id=!QQpfJfZvqxbCfeDgCj:matrix.org force=False
<...>
May 29 18:07:04 synapse[2986]: synapse.handlers.pagination: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] purge complete for room_id !QQpfJfZvqxbCfeDgCj:matrix.org
May 29 18:07:04 synapse[2986]: synapse.util.task_scheduler: [task-shutdown_and_purge_room-0-nMTEFSKFptEhTuba] Task complete: 73.538sec (0.371sec, 0.246sec) (0.001sec/73.515sec/10) [0 dbevts] '!QQpfJfZvqxbCfeDgCj:matrix.org', {'new_room_user_id': None, 'new_room_name': None, 'message': None, 'requester_user_id': '@intelfx:intelfx.name', 'block': False, 'purge': True, 'force_purge': False}
Full federation_sender
log for the last 1 hour (UTC): federation_sender.txt
Anything else that would be useful to know?
Curiously, the only place where this query is emitted is followed by the "Catching up destination <...>"
log message:
synapse/synapse/federation/sender/per_destination_queue.py
Lines 456 to 495 in 2436512
However, this message is only emitted once in a few minutes (compare and contrast with 23089417 queries per hour). This seems to suggest that the worker is busy-looping via the continue
branch on line 467.
`federation_sender` log for the last 1 hour, grep by "Catching up destination"
May 31 15:05:40 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287855] Catching up destination ravemusic.de with 1 PDUs
May 31 15:06:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287864] Catching up destination zemos.net with 3 PDUs
May 31 15:09:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287897] Catching up destination ravemusic.de with 1 PDUs
May 31 15:10:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287905] Catching up destination zemos.net with 3 PDUs
May 31 15:13:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287942] Catching up destination ravemusic.de with 1 PDUs
May 31 15:14:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-287950] Catching up destination zemos.net with 3 PDUs
May 31 15:17:13 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-289747] Catching up destination zemos.net with 3 PDUs
May 31 15:17:15 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-289924] Catching up destination ravemusic.de with 1 PDUs
May 31 15:17:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291093] Catching up destination ravemusic.de with 1 PDUs
May 31 15:18:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291100] Catching up destination zemos.net with 3 PDUs
May 31 15:21:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291136] Catching up destination ravemusic.de with 1 PDUs
May 31 15:22:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291143] Catching up destination zemos.net with 3 PDUs
May 31 15:25:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291174] Catching up destination ravemusic.de with 1 PDUs
May 31 15:26:15 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291183] Catching up destination zemos.net with 3 PDUs
May 31 15:29:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291217] Catching up destination ravemusic.de with 1 PDUs
May 31 15:30:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291228] Catching up destination zemos.net with 3 PDUs
May 31 15:33:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291260] Catching up destination ravemusic.de with 1 PDUs
May 31 15:34:10 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291267] Catching up destination zemos.net with 3 PDUs
May 31 15:37:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291302] Catching up destination ravemusic.de with 1 PDUs
May 31 15:38:05 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291309] Catching up destination zemos.net with 3 PDUs
May 31 15:41:35 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-291344] Catching up destination ravemusic.de with 1 PDUs
May 31 15:42:14 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-293116] Catching up destination zemos.net with 3 PDUs
May 31 15:42:16 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-293296] Catching up destination ravemusic.de with 1 PDUs
May 31 15:42:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294033] Catching up destination zemos.net with 3 PDUs
May 31 15:45:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294496] Catching up destination ravemusic.de with 1 PDUs
May 31 15:46:15 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294504] Catching up destination zemos.net with 3 PDUs
May 31 15:49:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294540] Catching up destination ravemusic.de with 1 PDUs
May 31 15:50:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294549] Catching up destination zemos.net with 3 PDUs
May 31 15:53:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294581] Catching up destination ravemusic.de with 1 PDUs
May 31 15:53:45 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294584] Catching up destination synthetix.io with 1 PDUs
May 31 15:54:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294593] Catching up destination zemos.net with 3 PDUs
May 31 15:57:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294628] Catching up destination ravemusic.de with 1 PDUs
May 31 15:58:25 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294639] Catching up destination zemos.net with 3 PDUs
May 31 16:01:30 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294674] Catching up destination ravemusic.de with 1 PDUs
May 31 16:02:20 synapse[2990]: synapse.federation.sender.per_destination_queue: [federation_transaction_transmission_loop-294682] Catching up destination zemos.net with 3 PDUs