Description
To diagnose element-hq/element-web#23113 I went on a mission to trace the lifetime of to-device messages from me (EI) -> matrix.org -> vector.modular.im -> Rick (EW), to find out where they got lost. (It turns out they got lost at Rick's end; see that bug for details.)
However, the process of tracing the to-device messages is utterly miserable due to insufficient debug logging. Looking at the logs, things start off okay on Element iOS, which encrypts the room_key for the appropriate destination device:
2022-08-18 22:41:07.611 Element[5250:589660] [MXMegolmEncryption] shareKey: Sharing keys with device @rxl881:vector.modular.im:ZAUPWAYFGG
2022-08-18 22:41:07.625 Element[5250:589660] [MXOlmDevice] encryptMessage: Olm Session id NPkU5Y8QyqmlDSKkgZ/hTROQ1cGWdrpvOcx/2vWkU0Q to ZRFNAAnDjik0Z5QO74hvi901ULy4OmnVu50jD7tUJTg
2022-08-18 22:41:07.626 Element[5250:589660] [MXBackgroundTask] Start background task #9663 - [MXRealmCryptoStore] performSessionOperationWithDevice
2022-08-18 22:41:07.626 Element[5250:589660] [MXBackgroundTask] Background task [MXRealmCryptoStore] performSessionOperationWithDevice started with app state: active and estimated background time remaining: undetermined
2022-08-18 22:41:07.649 Element[5250:589660] [MXBackgroundTask] End background task #9663 - [MXRealmCryptoStore] performSessionOperationWithDevice after 22.861ms
...and then it bundles it up with a bunch of others...
2022-08-18 22:41:08.257 Element[5250:589660] [MXMegolmEncryption] shareKey: Actually share with 2 users and 28 devices
...and sends it to the server in a to-device request:
2022-08-18 22:41:08.278 Element[5250:589660] [MXHTTPClient] #2963 - PUT _matrix/client/r0/sendToDevice/m.room.encrypted/m1692882576.68
However, while we have a transaction ID for this set of to-device messages, we don't have any other IDs for tracing them to their various recipients.
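For reference, the request being PUT there looks roughly like this per the client-server spec (sketched in Python with placeholder values): the txnid in the path is scoped to the whole request, and nothing inside the per-device payloads identifies an individual message once it leaves the client.

```python
import requests

# Rough shape of the sendToDevice call per the client-server spec; the
# homeserver URL, access token and ciphertext below are placeholders.
# The txnid in the path ("m1692882576.68" above) only identifies this one
# HTTP request; the per-device message contents carry no ID of their own.
homeserver = "https://matrix-client.matrix.org"   # placeholder
txn_id = "m1692882576.68"
body = {
    "messages": {
        "@rxl881:vector.modular.im": {
            "ZAUPWAYFGG": {
                "algorithm": "m.olm.v1.curve25519-aes-sha2",
                "sender_key": "...",                 # elided
                "ciphertext": {"...": "..."},        # Olm-encrypted room_key, elided
            }
        }
    }
}
requests.put(
    f"{homeserver}/_matrix/client/r0/sendToDevice/m.room.encrypted/{txn_id}",
    headers={"Authorization": "Bearer <access_token>"},  # placeholder
    json=body,
)
```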
We can see the inbound request in synapse's logs on matrix.org:
encryption-1:
2022-08-18 19:41:09,001 - synapse.access.http.28101 - 450 - INFO - PUT-354700 - 82.132.228.213 - 28101 - {@matthew:matrix.org} Processed request: 0.018sec/0.002sec (0.007sec, 0.001sec) (0.001sec/0.009sec/3) 2B 200 "PUT /_matrix/client/r0/sendToDevice/m.room.encrypted/m1692882576.68 HTTP/1.1" "Element/1.8.25 (iPhone; iOS 15.6; Scale/3.00)" [0 dbevts]
...but Jaeger currently can't view the parent span due to matrix-org/synapse#13567. Even having found the span based on timeframes, it doesn't tell us much, as we don't log which remote devices we're sending to - only local ones. So all you get told is that it's trying to send a to-device message to vector.modular.im.
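(With no ID to search on, the span can really only be located by time window - e.g. something like the sketch below against the Jaeger query service. The base URL and service name are placeholders for whatever your deployment uses, and /api/traces is the internal endpoint the Jaeger UI itself queries.)

```python
import time
import requests

# Sketch of the brute-force approach: ask Jaeger for Synapse spans in the
# window around the request, since there's no ID to search by.
jaeger = "http://localhost:16686"                   # placeholder
window_end_us = int(time.time() * 1_000_000)
window_start_us = window_end_us - 60 * 1_000_000    # last 60 seconds
resp = requests.get(
    f"{jaeger}/api/traces",
    params={
        "service": "matrix.org",                    # placeholder service name
        "start": window_start_us,
        "end": window_end_us,
        "limit": 100,
    },
)
traces = resp.json().get("data", [])
```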
There's then no way to correlate that at all with the actual attempt to send the EDU containing the to-device message to vector.modular.im: the only logs you get are:
fed-sender-14:
2022-08-18 19:41:09,010 - synapse.federation.sender.transaction_manager - 121 - INFO - federation_transaction_transmission_loop-275093- - TX [vector.modular.im] {1660839599478} Sending transaction [1660839599478], (PDUs: 0, EDUs: 1)
2022-08-18 19:41:09,092 - synapse.http.matrixfederationclient - 299 - INFO - federation_transaction_transmission_loop-275093- - {PUT-O-194819} [vector.modular.im] Completed request: 200 OK in 0.08 secs, got 11 bytes - PUT matrix://vector.modular.im/_matrix/federation/v1/send/1660839599478
2022-08-18 19:41:09,092 - synapse.federation.sender.transaction_manager - 174 - INFO - federation_transaction_transmission_loop-275093- - TX [vector.modular.im] {1660839599478} got 200 response
...which, as far as I can see, don't contain any IDs at all (other than the local txnid), and don't come up in Jaeger either - neither correlated with the original request, nor when viewing the outbound traffic on its own.
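For context, the transaction being PUT to /_matrix/federation/v1/send/1660839599478 carries the to-device messages inside an m.direct_to_device EDU, which per the federation spec looks roughly like the sketch below (placeholder values). The message_id in there is just a per-EDU dedupe token - it isn't logged above and doesn't identify individual recipient messages.

```python
# Rough shape of the federation transaction per the spec (placeholders).
# The only identifier inside is message_id, a dedupe token covering the whole
# EDU; nothing names an individual per-device message, and neither the
# message_id nor the device IDs appear in the transaction_manager logs above.
transaction = {
    "origin": "matrix.org",
    "origin_server_ts": 1660851669010,   # placeholder
    "pdus": [],
    "edus": [
        {
            "edu_type": "m.direct_to_device",
            "content": {
                "message_id": "...",     # per-EDU dedupe token, elided
                "sender": "@matthew:matrix.org",
                "type": "m.room.encrypted",
                "messages": {
                    "@rxl881:vector.modular.im": {
                        "ZAUPWAYFGG": {"...": "..."},   # Olm ciphertext, elided
                    }
                },
            },
        }
    ],
}
```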
At this point I gave up on trying to track the inbound request on vector.modular.im and confirming that it went down /sync to Rick, as I assume we're missing logging and tracing for that too.
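For completeness, this is roughly how the message would finally surface in Rick's /sync response per the client-server spec (sketch, placeholders): to-device events carry no event_id or any other identifier, which is presumably why the js-sdk logs below print id: undefined.

```python
# Rough shape of the /sync fragment delivering a to-device message, per the
# client-server spec (placeholders). There is no event_id or other identifier
# on the event - hence, presumably, the "id: undefined" in the js-sdk logs.
sync_fragment = {
    "to_device": {
        "events": [
            {
                "sender": "@matthew:matrix.org",
                "type": "m.room.encrypted",
                "content": {
                    "algorithm": "m.olm.v1.curve25519-aes-sha2",
                    "sender_key": "...",            # elided
                    "ciphertext": {"...": "..."},   # Olm payload, elided
                },
            }
        ]
    }
}
```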
Finally, on js-sdk, I had to completely guess which of the to-device messages was actually the one in question - it will have been one of these (from Rick's side of the logs):
2022-08-19T08:11:27.271Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.582Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.583Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
If we want to be able to trace UISI root causes, we have to fill in the missing observability here, so we can see which Olm traffic went missing, where, and then why.
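One possible shape for that (purely a sketch, not an agreed design): have the sending client stamp every outgoing to-device message with a unique ID before it hits sendToDevice, and then log that same ID at each hop - origin homeserver, federation sender, destination homeserver, and the receiving client. The field name below is hypothetical.

```python
import uuid

# Purely illustrative sketch: stamp each per-device message with a unique ID
# at send time so every hop can log it. "org.matrix.msgid" is a hypothetical
# field name, not an agreed one; the messages dict is the same shape as the
# sendToDevice body sketched earlier.
def stamp_to_device_messages(messages: dict) -> dict:
    for devices in messages.values():
        for content in devices.values():
            content["org.matrix.msgid"] = str(uuid.uuid4())
    return messages
```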