This repository was archived by the owner on Apr 26, 2024. It is now read-only.
Race condition with replication means that publishing room aliases lacks read-after-write consistency between workers #14210
Open
Description
Consider the following sequence of events:
1. Alice creates a room without any aliases.
2. Alice lists aliases for that room.
3. Alice sets an alias for that room.
4. Alice lists aliases for that room.
If the alias writes occur on a different worker from the reads, this is vulnerable to a classic worker cache invalidation race:
- (2) succeeds because the reader has no cached alias information for the room. It queries the database (which is written before (1) completes) and caches the result.
- (3) succeeds on the writer, which fires off a message telling readers to invalidate their caches.
⚠️ If request (4) arrives before the reader has received and processed the invalidation, the reader will return the (now stale) data in its cache. This means Alice has failed to read her own write.
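The race above can be reproduced deterministically with a small simulation. This is only a sketch under stated assumptions: `Database`, `ReaderWorker`, `WriterWorker` and the `deque` standing in for the replication stream are hypothetical names for illustration, not Synapse's actual classes or replication protocol.

```python
from collections import deque


class Database:
    """Stand-in for the shared database: room_id -> list of aliases."""

    def __init__(self):
        self.aliases = {}


class ReaderWorker:
    """Hypothetical reader that caches alias lookups per room."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def list_aliases(self, room_id):
        if room_id not in self.cache:
            # Cache miss: query the database and remember a copy of the result.
            self.cache[room_id] = list(self.db.aliases.get(room_id, []))
        return self.cache[room_id]

    def process_replication(self, stream):
        # Apply any pending invalidations sent by the writer.
        while stream:
            self.cache.pop(stream.popleft(), None)


class WriterWorker:
    """Hypothetical writer that persists aliases and queues invalidations."""

    def __init__(self, db, stream):
        self.db = db
        self.stream = stream

    def set_alias(self, room_id, alias):
        self.db.aliases.setdefault(room_id, []).append(alias)
        # The invalidation is only queued; the reader applies it later.
        self.stream.append(room_id)


db = Database()
stream = deque()
reader = ReaderWorker(db)
writer = WriterWorker(db, stream)

room = "!room:example.org"
db.aliases[room] = []                         # (1) room created, no aliases
first = reader.list_aliases(room)             # (2) reader caches the empty list
writer.set_alias(room, "#alias:example.org")  # (3) write plus queued invalidation
stale = reader.list_aliases(room)             # (4) served from the stale cache
reader.process_replication(stream)            # invalidation finally processed
fresh = reader.list_aliases(room)             # a later read sees the alias
```

In this sketch `stale` is still empty even though the database already holds the alias, while `fresh` reflects the write once the reader has processed the queued invalidation.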
I don't think actual humans edit and then immediately list aliases that often, so I suggest we don't worry about fixing this. (i.e. I think this only manifests as test flakes). But I wanted to write this up as a reference. (It would be nice to have a catalogue of known races like this).
History:
See issues labeled with `Z-Read-After-Write`.
And previous related history specifically around aliases:
- Move some more workers off master sytest#1053: moved requests off the main worker in sytest
- `GET /rooms/:room_id/aliases lists aliases` is broken sytest#1055: this causes a broken/flakey test
- Retry aliases query sytest#1056: introduces retry logic to work around this
- Add 13 alias tests (clearing 6 sytests) complement#266: sytest ported to complement
- Test Synapse in worker mode under Complement #12638: we start testing complement with workers
- Unknown: something happens to cause that test to start failing in worker mode on complement. Possibly Strip whitespace from worker types in Dockerfile-workers #14165? Unconfirmed.
- Complement flake(?): `GET_/rooms/:room_id/aliases_lists_aliases` #14183: we notice the failures.
- Deflake `GET /rooms/:room_id/aliases lists aliases` complement#521: re-adds the retry logic
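The retry workaround mentioned in the history can be illustrated with a generic polling helper. This is a minimal sketch, not the code from sytest#1056 or complement#521 (which are written in Perl and Go respectively); `retry_until` and its parameters are hypothetical names chosen for illustration.

```python
import time


def retry_until(fetch, is_ready, attempts=10, delay=0.1):
    """Call fetch() until is_ready(result) is true or attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if is_ready(result):
            return result
        if attempt < attempts - 1:
            # Give the reader time to process the cache invalidation.
            time.sleep(delay)
    raise TimeoutError(f"condition not met after {attempts} attempts")


# Simulated reader responses: two stale reads, then the fresh one.
responses = iter([[], [], ["#alias:example.org"]])
aliases = retry_until(
    lambda: next(responses),
    lambda result: "#alias:example.org" in result,
    delay=0,
)
```

A test using a helper like this tolerates the stale window instead of asserting on the first response, which hides the race rather than fixing it.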
Metadata
Labels
- Issues related to testing in complement, synapse, etc
- Problems related to running Synapse in Worker Mode (or replication)
- Most users are unlikely to come across this or unexpected workflow
- Minor significance, cosmetic issues, low or no impact to users.
- Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
- A lack of read-after-write consistency, usually due to cache invalidation races with workers