
Allow for (subgraph) response size limits and traffic shaping to save the router from OOM due to large responses #6999

Open
@frittentheke

Description


Is your feature request related to a problem? Please describe.

We recently had a case of huge responses from one particular subgraph causing the Router to OOM.
To be exact, the subgraph responded with megabytes of "errors": [].

While there are lots of settings to apply traffic shaping or to limit requests (from clients), I found no way to configure any limit on subgraph response sizes that would serve as a circuit breaker in such cases.

In essence this feature request is just another aspect (like max_depth and max_height) by which the resources consumed by an individual request can be limited.
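For reference, the existing request-side limits live in the router's `limits` config section; a sketch, assuming the key names from the router documentation (values are illustrative) — note that none of these apply to subgraph responses:

```yaml
# Existing request-side limits (illustrative values)
limits:
  max_depth: 100             # maximum selection-set depth of an operation
  max_height: 200            # maximum "height" of an operation
  max_aliases: 30
  http_max_request_bytes: 2000000  # caps the incoming *request* body only
```

The gap this issue describes is the absence of an equivalent knob on the response side.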

Describe the solution you'd like

I'd like to be able to set a limit on the size of an individual subgraph response that the router will parse and then compile into the client response, in order to cap the maximum memory required per original request.

This does not necessarily have to be a limit per individual subgraph response; some configurable maximum per request the router processes would also do, in order not to allow a few requests to fill all of the memory.
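A hypothetical configuration for this could look like the following — every key name below is made up for illustration; nothing like this exists in the router today:

```yaml
# Hypothetical keys — NOT part of the current router config
limits:
  subgraph_max_response_bytes: 10000000       # cap per individual subgraph response
  max_response_memory_per_request: 50000000   # cap across all subgraph fetches of one client request
```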

Certainly there has to be a log message indicating that requests were dropped / rejected because they exceeded the allowed memory, just as for all other request limits.

It might also make sense to indicate to the client that their response is larger than allowed by the router, maybe using HTTP 413 (Content Too Large): https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413
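Mechanically, such a circuit breaker could be a capped body read: stop buffering as soon as the configured limit is crossed and fail the subgraph fetch, which the router could then map to an HTTP 413 for the client. A minimal sketch of the idea in Rust (not router code; `read_capped` and the limit value are made up for illustration):

```rust
use std::io::{self, Read};

/// Read at most `limit` bytes from `reader` into memory.
/// Errors out as soon as the body exceeds the limit instead of
/// buffering an arbitrarily large response.
fn read_capped<R: Read>(reader: &mut R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Take one extra byte so we can detect "over the limit".
    reader.take(limit.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "subgraph response body exceeds configured limit",
        ));
    }
    Ok(buf)
}

fn main() {
    let small: &[u8] = b"{\"data\":{}}";
    assert!(read_capped(&mut &small[..], 1024).is_ok());

    // A multi-megabyte "errors": [] payload would trip the cap:
    let big = vec![b'x'; 2048];
    assert!(read_capped(&mut &big[..], 1024).is_err());
}
```

Because the read aborts at `limit + 1` bytes, memory use per fetch stays bounded regardless of how large the subgraph's response actually is.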

Describe alternatives you've considered

An alternative would be some sort of overload protection that kicks in when the router's memory usage approaches a certain threshold, similar to the overload manager built into Envoy: https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/overload_manager/overload_manager
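For comparison, Envoy's overload manager keys actions off a heap-usage monitor; a condensed sketch in the style of the Envoy docs (threshold values are illustrative):

```yaml
# Envoy-style overload protection (illustrative thresholds)
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648
  actions:
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95   # shed load at 95% of the heap budget
```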

While this might be beneficial in any case, since lots of small in-flight requests can also cause a router to go OOM, it tackles a different problem than the one we had: a single malfunctioning subgraph was the troublemaker, not the number of concurrent requests per se.

Being unable to limit the memory footprint of handling a single request (of potentially many concurrent ones) makes it hard to determine the memory the router requires at "full throttle", i.e. with all connections / threads / workers / ... busy.

Additional context

There are some related issues and feature requests I found
