# What
Cadence has some somewhat-uncommon call patterns that seem to be causing major memory issues via yarpc when enough planets align:
- We have many thousands of concurrent requests doing long-polls to acquire work (~1 minute each before it gives up and polls again)
  - These are mostly "small" responses, <0.5MB, and the graphs I'm looking at right now are averaging ~5KB.
- We have occasional very large requests
  - These are essentially unbounded, but e.g. downloading a large workflow's history can very reasonably exceed tens of megabytes, and our replication-related requests are largely unlimited (they're tuned for throughput, e.g. hundreds of MB might be reasonable).
  - I'm not sure if we have any concrete numbers here, but I've seen single GC cycles release >10GB of memory. "large" can be safely assumed.
Internally, yarpc uses a shared global sync.Pool wrapper in internal/bufferpool, and it holds items for the full length of a unary request:
- yarpc-go/transport/tchannel/handler.go, lines 199 to 200 (at 42ee479)
- yarpc-go/transport/grpc/handler.go, lines 176 to 177 (at 7b06a11)
(Acquired when a call is received, defer-released after it finishes writing the response)
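For context, the pattern in those handlers is roughly the following. This is a simplified, self-contained model of the behavior, not the actual yarpc code; the names here are stand-ins:

```go
// Simplified model (not the actual yarpc code): one process-wide pool of
// bytes.Buffers, where a handler takes a buffer when the call arrives and
// only returns it after the response has been written. For a long-poll, that
// means the buffer is pinned for the full ~1 minute.
package main

import (
	"bytes"
	"sync"
)

var bufferPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func handleUnary(encodeResponse func(*bytes.Buffer)) {
	buf := bufferPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // Reset keeps the underlying capacity
		bufferPool.Put(buf)
	}()

	// ... wait out the long-poll, then encode the response into buf ...
	encodeResponse(buf)
}

func main() {
	handleUnary(func(buf *bytes.Buffer) { buf.WriteString("response body") })
}
```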
(Semi-related: I am curious how this behaves with streaming requests, as we are planning to switch to gRPC streaming eventually)
From pprof output, this bufferpool sometimes ends up dominating our inuse heap, and hosts with 48GB+ of memory can hit OOMs within about an hour of starting up when a single abnormal caller triggers a bad combination of calls.
A sample pprof from one of these incidents shows ~38GB of inuse memory attributable to this pool, with many byte-slices well over 50MB among the most-commonly-sampled size ranges:

[screenshot: pprof in-use heap graph]

(Both in-bound arrows are from call stacks that pass through this bufferpool, not from other stdlib bytes.Buffer usage; they're just not included in the screenshot because they are rendered very far away.)
# Why
For Cadence's purposes, this pooling behavior means thousands of buffers held for ~1 minute each, and essentially never an "idle" period with few requests in flight during which the pool could release these objects. Poll requests restart themselves almost immediately in nearly all cases, so our concurrently-running request count has a roughly constant baseline in the tens of thousands.
When combined with our occasional large requests, I believe what's happening is:
- A small buffer is created to handle a poll
- This ~never gets GC'd, so when it handles a random "large" request it grows to match (several megabytes)
- ... because it never gets GC'd, it keeps holding megabytes when it usually handles only a few kilobytes, essentially for the rest of the process's lifetime
- ... and eventually all long-poll-using buffers go through this cycle, gradually growing them all (randomly).
Some buffers may be freed after a burst of other requests completes, of course, but that isn't something we can control, and in practice it doesn't seem to keep up in some situations.
This is all probably made worse by sync.Pool's "it takes 2 GC cycles of no use to actually free memory" design, combined with infrequent GCs and lots and lots of requests.
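The capacity-retention half of this is plain stdlib behavior and easy to demonstrate in isolation (a toy example, nothing yarpc-specific):

```go
// Demonstrates why one large response permanently inflates a pooled buffer:
// bytes.Buffer.Reset keeps the underlying array, and sync.Pool hands the same
// (now large) buffer back out for later small requests.
package main

import (
	"bytes"
	"fmt"
	"sync"
)

func main() {
	pool := sync.Pool{New: func() any { return new(bytes.Buffer) }}

	// A poll-sized response (~5KB).
	buf := pool.Get().(*bytes.Buffer)
	buf.Write(make([]byte, 5<<10))
	fmt.Println("cap after small response:", buf.Cap())

	// One large response (~50MB) grows the same buffer...
	buf.Write(make([]byte, 50<<20))
	fmt.Println("cap after large response:", buf.Cap())

	// ...and Reset+Put keeps all of that capacity around.
	buf.Reset()
	pool.Put(buf)

	// The next Get (typically) returns the same, still ~50MB, buffer.
	next := pool.Get().(*bytes.Buffer)
	fmt.Println("cap handed to the next small request:", next.Cap())
}
```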
Currently our only mitigation is to restart hosts more frequently while we try to identify the caller, figure out what they're doing that causes the memory blow-up, and get them to make changes. That often takes hours or days to fully address.
# What do we do now
I assume this pool CPU-benchmarked quite well compared to no buffer reuse, and it's a common and very reasonable thing to have, but CPU is not the only resource to be concerned about.
I kinda think being able to control this isn't optional, since the alternative is "the process does not continue running".
From a bit of exploring of the code, I'm really only seeing two high-level options:
- Provide some way to configure this shared global pool.
  - The most-minimal option here might just be a constant "max bytes to reuse" size param, which could be set with `go build -ldflags "-X go.uber.org/yarpc/internal/bufferpool.MaxReuseBytes=123456"`, and either don't return things larger than that to the pool at all, or truncate the buffer to fit the max.
  - We could also add a mutex/atomic, a SetPool(...) func, and re-expose it somewhere publicly (since this is in internal).
  - There does not seem to be a simple way to build a "don't pool more than X MB total", since pools don't report when they evict data, but it might be possible now that weak pointers exist (a rough sketch of that idea appears right after this list).
- Allow injecting a pool implementation, so different calls(/clients/servers/?) can have different pools (which could even include a noop impl).
  - For our purposes, per procedure might be needed, or we'd probably set a rather small max (likely 100KB or so). In the "does not rapidly crash" vs "saves a few %-CPU" equation, we will absolutely choose to prevent crashes.
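To make the weak-pointer idea slightly more concrete, here is a very rough sketch of a "don't retain more than X MB total" budget around the pool, using Go 1.24's weak package to notice when the pool has evicted a buffer. All names and the budget value are invented for illustration; none of this is proposed API:

```go
// Rough sketch only: a total retained-bytes budget for a sync.Pool of
// bytes.Buffers, using weak pointers to detect buffers the pool has evicted
// (since sync.Pool itself never reports evictions). Names are hypothetical.
package bufferpool

import (
	"bytes"
	"sync"
	"weak"
)

const maxPooledBytes = 64 << 20 // assumed budget: retain at most ~64MB in the pool

var (
	pool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

	mu       sync.Mutex
	tracked  = map[weak.Pointer[bytes.Buffer]]int64{} // buffers we think the pool holds -> their cap
	retained int64                                    // sum of caps in tracked
)

func Get() *bytes.Buffer {
	buf := pool.Get().(*bytes.Buffer)
	mu.Lock()
	if size, ok := tracked[weak.Make(buf)]; ok {
		// The buffer is back in use, so it no longer counts against the budget.
		retained -= size
		delete(tracked, weak.Make(buf))
	}
	mu.Unlock()
	return buf
}

func Put(buf *bytes.Buffer) {
	size := int64(buf.Cap()) // Reset keeps capacity, so cap is what the pool would retain
	mu.Lock()
	defer mu.Unlock()
	sweepLocked()
	if retained+size > maxPooledBytes {
		return // over budget: drop the buffer and let the GC reclaim it
	}
	buf.Reset()
	tracked[weak.Make(buf)] = size
	retained += size
	pool.Put(buf)
}

// sweepLocked forgets buffers whose weak targets have been collected, i.e.
// ones sync.Pool silently evicted, so their bytes stop counting toward the budget.
func sweepLocked() {
	for wp, size := range tracked {
		if wp.Value() == nil {
			retained -= size
			delete(tracked, wp)
		}
	}
}
```

This is obviously just an accounting approximation (and it adds a lock to the Get/Put path), so I'm not claiming it's the right trade-off, only that weak pointers make this shape of thing expressible at all.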
Option 1 might be reasonable as a short-term hotfix, and it fairly trivially has no effect on anyone who isn't deliberately making clearly-unstable changes to that value. Safe and quick, but also obviously not a "good" solution. (I personally lean towards the -ldflags route because it does not add a new "public" API that could be assumed to be stable.)
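For illustration, the guard could look something like this inside internal/bufferpool. All names here are invented, and note that -ldflags -X can only set string variables, so the value would have to be parsed at init:

```go
// Hypothetical sketch of option 1 (names invented, not existing yarpc code):
// a build-time cap on how large a buffer may be and still go back into the pool.
package bufferpool

import (
	"bytes"
	"strconv"
	"sync"
)

// MaxReuseBytes can be overridden at link time, e.g.
//   go build -ldflags "-X go.uber.org/yarpc/internal/bufferpool.MaxReuseBytes=131072"
// It is a string because -ldflags -X only sets string variables.
var MaxReuseBytes = "1048576" // default: reuse buffers up to 1MB

var maxReuseBytes = func() int {
	n, err := strconv.Atoi(MaxReuseBytes)
	if err != nil || n <= 0 {
		return 1 << 20 // fall back to the 1MB default on a bad override
	}
	return n
}()

var pool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func Get() *bytes.Buffer { return pool.Get().(*bytes.Buffer) }

// Put drops oversized buffers instead of pooling them, so a single large
// response can no longer permanently inflate a pooled buffer.
func Put(buf *bytes.Buffer) {
	if buf.Cap() > maxReuseBytes {
		return // oversized: let the GC reclaim it
	}
	buf.Reset()
	pool.Put(buf)
}
```

(Dropping rather than truncating is shown only because it's shorter; either would bound what each pooled buffer can retain.)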
Option 2 seems like a lot of work. The pool is used in a LOT of places, through a LOT of very-different call stacks, and where to draw the line between "you can customize this instance with yarpc-option-X passed to call-Y" and "there is no user-convenient way to get an argument into this call stack" will probably take some time to nail down.
Or maybe I'm just not familiar enough with the code and this will be reasonably simple (though clearly still changing quite a few lines of code).
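To show the rough shape I have in mind for option 2 (again hypothetical, none of this is existing yarpc API): an interface the transports could depend on, with the current global behavior as the default and a no-op implementation available for callers like us:

```go
// Hypothetical sketch of option 2 (not existing yarpc API): let callers inject
// the buffer pool used for request/response bodies.
package bufferpool

import (
	"bytes"
	"sync"
)

// Pool is what the transports would need from an injected buffer pool.
type Pool interface {
	Get() *bytes.Buffer
	Put(*bytes.Buffer)
}

// SyncPool matches the current behavior: unbounded reuse via sync.Pool.
type SyncPool struct {
	p sync.Pool
}

func NewSyncPool() *SyncPool {
	return &SyncPool{p: sync.Pool{New: func() any { return new(bytes.Buffer) }}}
}

func (s *SyncPool) Get() *bytes.Buffer { return s.p.Get().(*bytes.Buffer) }

func (s *SyncPool) Put(buf *bytes.Buffer) {
	buf.Reset()
	s.p.Put(buf)
}

// NopPool never reuses buffers: every Get allocates, and Put lets the GC
// reclaim the buffer. Trades some CPU/allocations for bounded memory.
type NopPool struct{}

func (NopPool) Get() *bytes.Buffer { return new(bytes.Buffer) }
func (NopPool) Put(*bytes.Buffer)  {}
```

A transport or dispatcher option (something along the lines of a hypothetical WithBufferPool(pool) option on tchannel/grpc) would then need to thread the chosen Pool down to the handlers, which is exactly the plumbing I expect to be the bulk of the work.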
Are there other options? Do you foresee any blockers in following option 2? We essentially need something, so we might be willing to do much of the implementation work, but it's not worth doing if the result won't be approved in the end.