Split out mutable event content from event cache into new caches that are keyed by room ID #13916
Description
In the hopes of fixing #11521 and paving the way towards an immutable external event cache (#2123), a new architecture for the event cache is proposed.
The current state
Currently there are a couple separate data structures related to caching event contents in memory in Synapse (see code for fetching an event from cache/db here):
EventsWorkerStore._get_event_cache
- An instance of AsyncLruCache implemented as a map ofEventID
-> (EventBase
, redactedEventBase
| None).- This cache is populated after fetching an event from the database.
- Entries in this cache are invalidated when an event is deleted from the database (in most cases, c.f. completely broken room after purge_room/rejoin #11521), redacted or marked as rejected.
- Entries in this cache are invalidated when the size limit of the LruCache are reached.
EventsWorkerStore._event_ref
- A WeakValueDictionary which serves as a single point of reference for EventBase's in memory, ensuring that we don't end up with multiple, unnecessary copies of a single EventBase in memory.- This data structure is populated after fetched an event from the database.
- Because this is a WeakValueDictionary, entries in this cache are invalidated when all other references to the EventBase in an entry are gone.
- Entries in this cache are invalidated when an event is deleted from the database (in most cases, c.f. completely broken room after purge_room/rejoin #11521), redacted or marked as rejected.
- Entries in this cache are not invalidated when an entry is evicted from
EventsWorkerStore._get_event_cache
, as something else may still be processing the event, even if it's been removed from that cache.
What's the problem?
See #11521; because each of these caches are keyed by EventID alone, it becomes tricky to invalidate them when all you have is a RoomID (i.e. when purging a room completely). We could query all known events for a room from the database, but that may result in millions of events. Ideally we'd have some map of RoomID -> EventID which only covers the events that are actually currently held in memory. We could then use that to invalidate all three of these caches.
Additionally, as get_event_cache
contains mutable EventCacheEntry
s (comprised of EventBase, redacted EventBase | None
), invalidating them is necessary when an event is both redacted or marked as rejected. These can differ per-homeserver, so removing this component from the cache entries opens up avenues for multiple homeservers sharing the same, immutable event cache.
Proposal
After speaking with @erikjohnston we've (mostly Erik :) came up with the following idea:
EventsWorkerStore._get_event_cache
would simply become a map ofEventID
->EventBase
.- We add a separate cache which is a nested map of
RoomID
->EventID
->{rejected_status: bool, redacted_event_content: Optional[EventBase]}
.- Entries are added to this map when an event is pulled from the database. We know the RoomID at this point.
- Entries are not invalidated from this map when an entry is
EventsWorkerStore._get_event_cache
is invalidated due to hitting the cache size. - This does mean that we'll need to know the RoomID when querying for rejected/redacted status though... But we can get that from the event cache?
The beauty of this is that we no longer need to invalidate the _get_event_cache
at all (unless the size limit is hit)! Even in the room purge use case! How? Here are some examples of using this system:
Fetch EventID A which is not in-memory
- Some calling function asks for EventID A.
- This does not exist in
_get_event_cache
(nor other caches) so we query from the database. The event and related metadata is fetched from the DB (event_json
,redactions
,rejections
) and both the_get_event_cache
and event metadata cache are populated. - Return information from the database.
Fetch EventID A which is in-memory
- Some calling function asks for EventID A.
- This already exists in
_get_event_cache
, and presumably the metadata cache. We take the RoomID from the EventBase in the_get_event_cache
and query the event metadata cache. - Return information from both caches.
EventID A is not in-memory but the event has been purged
- Some calling function asks for EventID A.
- This already exists in
_get_event_cache
, and presumably the metadata cache. We take the RoomID from the EventBase in the cache and query the event metadata cache - but uh oh, there's no matching entry in the metadata cache! The event must have been purged. - We invalidate the entry in the event cache as well and return None.
Thus when purging a room, we only need to purge entries in the metadata cache (which we can easily do by RoomID due to the metadata cache's structure). Entries in the get_event_cache
and event_ref
will be invalidated as they are fetched.
I'm curious for thoughts on whether this sounds reasonable from other members of the Synapse team + cc @Fizzadar.