
Synapse takes forever to send large responses - dead time gap after encode_json_response  #17722

@MadLittleMods

Description


Synapse takes a long time to send large responses. In some cases, just sending the response takes longer than processing the request and encoding the JSON (`encode_json_response`) combined.

Examples

98s to process the request and run `encode_json_response`, but the request isn't finished sending until 484s (~8 minutes), which is ~6.5 minutes of dead time. The response size is 36MB.

Jaeger trace: 4238bdbadd9f3077.json

Jaeger trace with big gap for sending the response

59s to process, but not finished until 199s. The response size is 36MB.

Jaeger trace: 2149cc5e59306446.json

Jaeger trace with big gap for sending the response

I've come across this before and it's not a new thing. For example, in #13620 (original issue), I described it as the "mystery gap at the end after we encode the JSON response (encode_json_response)", but I'd never encountered it being this egregious.

It can also happen for small responses: 2s to process, but not finished until 5s. The response size is 120KB.

Jaeger trace with big gap for sending the response

Investigation

@kegsay pointed out `_write_bytes_to_request`, which runs after `encode_json_response` and has comments like "Write until there's backpressure telling us to stop." that definitely hint at some areas of interest.

```python
with start_active_span("encode_json_response"):
    span = active_span()
    json_str = await defer_to_thread(request.reactor, encode, span)

_write_bytes_to_request(request, json_str)
```

The JSON serialization is done in a background thread because it can block the reactor for many seconds. This part seems normal and fast (no problem).
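As a rough illustration of why the encode step is offloaded, here is a minimal sketch using `asyncio.to_thread` as a stand-in for Synapse's `defer_to_thread` on the Twisted reactor (the payload and function names here are made up for the example):

```python
import asyncio
import json

async def encode_json_response(obj):
    # json.dumps on a multi-megabyte object can block for seconds,
    # so run it in a worker thread and keep the event loop free
    # (analogous to Synapse deferring `encode` to a reactor thread).
    return await asyncio.to_thread(json.dumps, obj)

# Hypothetical large-ish payload, just for demonstration.
payload = {"rooms": [{"id": i} for i in range(1000)]}
json_str = asyncio.run(encode_json_response(payload))
```

The event loop stays responsive while the thread does the CPU-bound work; the slow part in these traces comes after this step.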

But we also use `_ByteProducer` to stream the bytes down to the client. Using a producer ensures we can send all of the bytes to the client without hitting a 60s timeout (see context in the comments below):

```python
# The problem with dumping all of the response into the `Request` object at
# once (via `Request.write`) is that doing so starts the timeout for the
# next request to be received: so if it takes longer than 60s to stream back
# the response to the client, the client never gets it.
#
# The correct solution is to use a Producer; then the timeout is only
# started once all of the content is sent over the TCP connection.
```
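To make the backpressure behavior concrete, here is a simplified, purely illustrative model of the push-producer pattern. This is not Synapse's actual `_ByteProducer` (the class names and constants below are invented): the producer writes chunks until the transport's buffer crosses a high-water mark, pauses, and resumes once the buffer drains.

```python
CHUNK_SIZE = 64 * 1024   # hypothetical chunk size
HIGH_WATER = 256 * 1024  # pause producing above this much buffered data

class FakeTransport:
    """Stand-in for a TCP transport that buffers writes and applies backpressure."""

    def __init__(self):
        self.buffer = b""
        self.producer_paused = False

    def write(self, data):
        self.buffer += data
        if len(self.buffer) >= HIGH_WATER:
            # Transport asks the producer to stop (pauseProducing in Twisted terms).
            self.producer_paused = True

    def drain(self):
        # Pretend the kernel flushed the socket buffer; resume the producer.
        self.buffer = b""
        self.producer_paused = False

def stream(data, transport):
    """Write `data` in chunks, backing off whenever the transport is paused."""
    sent = 0
    while sent < len(data):
        if transport.producer_paused:
            # In reality we'd return to the reactor and wait for resumeProducing();
            # here we just drain synchronously to keep the model simple.
            transport.drain()
            continue
        transport.write(data[sent : sent + CHUNK_SIZE])
        sent += CHUNK_SIZE
    return sent
```

If any of the knobs in a loop like this are mistuned (tiny chunks, frequent pause/resume cycles, slow resumption), total send time balloons even though no single step looks expensive.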

This logic was added in:

Some extra time is expected since we're cooperating with the reactor instead of blocking it, but it seems like something isn't tuned optimally (chunk size, starting/stopping the producer too often, etc.).
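For scale, a back-of-the-envelope calculation of the effective send rate implied by the first example above (numbers taken from that trace):

```python
response_bytes = 36 * 1024 * 1024  # 36MB response
dead_time_s = 484 - 98             # seconds spent only sending the response
rate_kib_s = response_bytes / dead_time_s / 1024
print(f"{rate_kib_s:.0f} KiB/s")   # roughly 96 KiB/s
```

~96 KiB/s is far below what a server-side write path should sustain, which supports the suspicion that the producer loop (rather than raw network throughput) is the bottleneck.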
