Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room

(This is the cause of a flake in Complement's `TestRestrictedRoomsLocalJoin` tests when _running with workers_.)

If room A is a restricted room, restricted to members of room B, then if:
1. a user first attempts to join room A and is rejected;
2. the user then joins room B successfully; and
3. the user then attempts to join room A again

the user can be erroneously rejected from joining room A (in step (3)).

This is because there is a cache, `get_rooms_for_user_with_stream_ordering`, which is populated in step (1) on the event *creator*.
In step (2), the event *persister* will invalidate that cache and send a command over replication for other workers to do the same.
This issue then occurs if the event *creator* doesn't receive that command until after step (3).

![2022-07-05-TestRestrictedRoomsLocalJoin_race](https://user-images.githubusercontent.com/38398653/177368542-2171a20c-87d3-4552-9342-18fbf9fc7e06.png)

This occurs easily in Complement+workers on CI, where it takes ~200 ms for the invalidation to be received on the new worker (CPU contention in CI is probably playing a big part).

An obvious workaround is for the client to just sleep & retry, but it seems very ugly that we're issuing 403s to clients when they're making fully **serial** requests.

Another incomplete workaround is for the event creator to invalidate its own cache, but that won't help if the join to room B takes place on a different event creator.

I think a potential solution that would work is to:

1. when persisting the join event and sending the invalidation, ensure this is done before finishing the replication request and the client request
2. when starting a new join for a restricted room on an event creator, fetch the latest cache invalidation position from Redis and then only start processing the join once replication has caught up past that point.

(I'm not exactly sure how you 'fetch the latest position'; I was rather hoping there'd be `SETMAX` in Redis but didn't find it. Alternatively, it's probably possible to just PING Redis and treat the PONG as the barrier, but it's not a very nice solution.)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Race condition where a second join races with cache invalidation over replication and erroneously rejects joining Restricted Room #13185

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions