Open
Description
Add a metric for when a page request times out. Record which Matrix API request was still running and how long the request took.
Things to record in each event:
- Response status code we ended up sending
- Total time spent on the server rendering the request (this will just end up being the timeout configured)
- Homeserver
  - Potentially high cardinality, although the list will be limited to a known set of homeservers (Fetch events from multiple homeservers #5)
- Room ID
  - Since this has very high cardinality (lots of possible values), we might not be able to index it, but it would be good to have on each metric event for inspection.
  - These extra details are useful if we want to investigate why a particular room/homeserver combo is timing out
- Matrix API endpoint path that was still running when we timed out (like /join, /messages)
  - Is this useful? It would be nice to know where most requests get stuck
We can also record a success metric and response time, so we can compare how many requests we fail to serve against total traffic.
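As a minimal sketch of the fields listed above, the following plain Node.js code tracks which Matrix API calls are still pending during a render and assembles a timeout event from them. `InFlightTracker`, `buildTimeoutEvent`, and every field name are hypothetical and not taken from the codebase; the real render pipeline may need a different hook point.

```javascript
// Hypothetical helper: remembers which Matrix API requests are still in
// flight for one page render so they can be reported if the render times out.
class InFlightTracker {
  constructor() {
    this.nextId = 0;
    this.pending = new Map(); // id -> { endpoint, startedAt }
  }

  // Wrap a promise-returning Matrix API call and forget it once it settles.
  async track(endpoint, fn) {
    const id = this.nextId++;
    this.pending.set(id, { endpoint, startedAt: Date.now() });
    try {
      return await fn();
    } finally {
      this.pending.delete(id);
    }
  }

  // Snapshot of what is still running and how long each call has been pending.
  snapshot(now = Date.now()) {
    return [...this.pending.values()].map(({ endpoint, startedAt }) => ({
      endpoint,
      pendingMs: now - startedAt,
    }));
  }
}

// Assemble the event fields from the list above (names are assumptions).
function buildTimeoutEvent({ statusCode, timeoutMs, homeserver, roomId, tracker }) {
  return {
    statusCode, // response status code we ended up sending
    durationMs: timeoutMs, // on a timeout this is just the configured timeout
    homeserver,
    roomId, // high cardinality: keep on the event, avoid indexing
    pendingEndpoints: tracker.snapshot(),
  };
}
```

Each Matrix API call would be wrapped as `tracker.track('/messages', () => fetchMessages(...))`, where `fetchMessages` stands in for whatever function performs the request. When the timeout fires, `buildTimeoutEvent` reads the snapshot to see where the render was stuck.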
Dev notes
We probably just need to add something like prom-client, expose a Prometheus /metrics scrape endpoint that serves await register.metrics(), then add a scrape annotation to the K8s service (which is still being finalized)
Adjacent: Here is an example middleware from the Gitter webapp that logs and records a metric when a request has been pending for more than 60 seconds, https://gitlab.com/gitterHQ/webapp/-/blob/676fadc3693260c8c51f448a0ca4c3e180d1b4a2/server/web/middlewares/pending-request.js#L50-84