Skip to content
This repository was archived by the owner on Apr 26, 2024. It is now read-only.
This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Open
@reivilibre

Description

@reivilibre

(This is the cause of a flake in Complement's TestRestrictedRoomsLocalJoin tests when running with workers.)

If room A is a restricted room, restricted to members of room B, then if:

  1. a user first attempts to join room A and is rejected;
  2. the user then joins room B successfully; and
  3. the user then attempts to join room A again

the user can be erroneously rejected from joining room A (in step (3)).

This is because there is a cache, get_rooms_for_user_with_stream_ordering, which is populated in step (1) on the event creator.
In step (2), the event persister will invalidate that cache and send a command over replication for other workers to do the same.
This issue then occurs if the event creator doesn't receive that command until after step (3).

2022-07-05-TestRestrictedRoomsLocalJoin_race

This occurs easily in Complement+workers on CI, where it takes ~200 ms for the invalidation to be received on the new worker (CPU contention in CI is probably playing a big part).

An obvious workaround is for the client to just sleep & retry, but it seems very ugly that we're issuing 403s to clients when they're making fully serial requests.

Another incomplete workaround is for the event creator to invalidate its own cache, but that won't help if the join to room B takes place on a different event creator.

I think a potential solution that would work is to:

  1. when persisting the join event and sending the invalidation, ensure this is done before finishing the replication request and the client request
  2. when starting a new join for a restricted room on an event creator, fetch the latest cache invalidation position from Redis and then only start processing the join once replication has caught up past that point.

(I'm not exactly sure how you 'fetch the latest position'; I was rather hoping there'd be SETMAX in Redis but didn't find it. Alternatively, it's probably possible to just PING Redis and treat the PONG as the barrier, but it's not a very nice solution.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-TestingIssues related to testing in complement, synapse, etcT-DefectBugs, crashes, hangs, security vulnerabilities, or other reported issues.Z-Read-After-WriteA lack of read-after-write consistency, usually due to cache invalidation races with workers

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions