# What
Cadence has some somewhat-uncommon call patterns that seem to be causing major memory issues via yarpc when enough planets align:
- We have many thousands of concurrent requests doing long-polls to acquire work (~1 minute each before it gives up and polls again)
  - These are mostly "small" responses, <0.5MB, and the graphs I'm looking at right now are averaging ~5KB.
- We have occasional very large requests
  - These are essentially unbounded, but e.g. downloading a large workflow's history can very reasonably exceed tens of megabytes, and our replication-related requests are largely unlimited (they're tuned for throughput, e.g. hundreds of MB might be reasonable).
  - I'm not sure if we have any concrete numbers here, but I've seen single GC cycles release >10GB of memory. "large" can be safely assumed.
Internally, yarpc uses a shared global sync.Pool wrapper in internal/bufferpool, and it holds items for the full length of a unary request:
- yarpc-go/transport/tchannel/handler.go, lines 199 to 200 (at 42ee479)
- yarpc-go/transport/grpc/handler.go, lines 176 to 177 (at 7b06a11)
(Acquired when a call is received, defer-released after it finishes writing the response)
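For context, the pattern in those handlers is roughly the following. This is a simplified, self-contained model of the behavior, not the actual yarpc code; the names here are stand-ins:

```go
// Simplified model (not the actual yarpc code): one process-wide pool of
// bytes.Buffers, where a handler takes a buffer when the call arrives and
// only returns it after the response has been written. For a long-poll, that
// means the buffer is pinned for the full ~1 minute.
package main

import (
	"bytes"
	"sync"
)

var bufferPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func handleUnary(encodeResponse func(*bytes.Buffer)) {
	buf := bufferPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // Reset keeps the underlying capacity
		bufferPool.Put(buf)
	}()

	// ... wait out the long-poll, then encode the response into buf ...
	encodeResponse(buf)
}

func main() {
	handleUnary(func(buf *bytes.Buffer) { buf.WriteString("response body") })
}
```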
(Semi-related: I am curious how this behaves with streaming requests, as we are planning to switch to gRPC streaming eventually)
From pprof output, this bufferpool sometimes ends up dominating our inuse heap, and hosts with 48GB+ of memory can hit OOMs within about an hour of starting up when a single abnormal caller triggers a bad combination of calls.
A sample pprof from one of these incidents shows ~38GB of inuse memory attributable to this pool, with many byte-slices well over 50MB among the most-commonly-sampled size ranges:

[screenshot: pprof in-use heap graph]

(Both in-bound arrows are from call stacks that pass through this bufferpool, not from other stdlib bytes.Buffer usage; they're just not included in the screenshot because they are rendered very far away.)
# Why
For Cadence's purposes, this pooling behavior means thousands of buffers held for ~1 minute each, and essentially never an "idle" period with few requests in flight during which the pool could release these objects. Poll requests restart themselves almost immediately in nearly all cases, so our concurrently-running request count has a roughly constant baseline in the tens of thousands.
When combined with our occasional large requests, I believe what's happening is:
- A small buffer is created to handle a poll
- This ~never gets GC'd, so when it handles a random "large" request it grows to match (several megabytes)
- ... because it never gets GC'd, it keeps holding megabytes when it usually handles only a few kilobytes, essentially for the rest of the process's lifetime
- ... and eventually all long-poll-using buffers go through this cycle, gradually growing them all (randomly).
Some buffers may be freed after a burst of other requests completes, of course, but that isn't something we can control, and in practice it doesn't seem to keep up in some situations.
This is all probably made worse by sync.Pool's "it takes 2 GC cycles of no use to actually free memory" design, combined with infrequent GCs and lots and lots of requests.
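The capacity-retention half of this is plain stdlib behavior and easy to demonstrate in isolation (a toy example, nothing yarpc-specific):

```go
// Demonstrates why one large response permanently inflates a pooled buffer:
// bytes.Buffer.Reset keeps the underlying array, and sync.Pool hands the same
// (now large) buffer back out for later small requests.
package main

import (
	"bytes"
	"fmt"
	"sync"
)

func main() {
	pool := sync.Pool{New: func() any { return new(bytes.Buffer) }}

	// A poll-sized response (~5KB).
	buf := pool.Get().(*bytes.Buffer)
	buf.Write(make([]byte, 5<<10))
	fmt.Println("cap after small response:", buf.Cap())

	// One large response (~50MB) grows the same buffer...
	buf.Write(make([]byte, 50<<20))
	fmt.Println("cap after large response:", buf.Cap())

	// ...and Reset+Put keeps all of that capacity around.
	buf.Reset()
	pool.Put(buf)

	// The next Get (typically) returns the same, still ~50MB, buffer.
	next := pool.Get().(*bytes.Buffer)
	fmt.Println("cap handed to the next small request:", next.Cap())
}
```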
Currently our only mitigation is to restart hosts more frequently while we try to identify the caller, figure out what they're doing that causes the memory blow-up, and get them to make changes. That often takes hours or days to fully address.
# What do we do now
I assume this pool CPU-benchmarked quite well compared to no buffer reuse, and it's a common and very reasonable thing to have, but CPU is not the only resource to be concerned about.
I kinda think being able to control this isn't optional, since the alternative is "the process does not continue running".
From a bit of exploring of the code, I'm really only seeing two high-level options:
- Provide some way to configure this shared global pool.
  - The most-minimal option here might just be a constant "max bytes to reuse" size param, which could be set with `go build -ldflags "-X go.uber.org/yarpc/internal/bufferpool.MaxReuseBytes=123456"`, and either don't return things larger than that to the pool at all, or truncate the buffer to fit the max.
  - We could also add a mutex/atomic, a SetPool(...) func, and re-expose it somewhere publicly (since this is in internal).
  - There does not seem to be a simple way to build a "don't pool more than X MB total", since pools don't report when they evict data, but it might be possible now that weak pointers exist (a rough sketch of that idea appears right after this list).
- Allow injecting a pool implementation, so different calls(/clients/servers/?) can have different pools (which could even include a noop impl).
  - For our purposes, per procedure might be needed, or we'd probably set a rather small max (likely 100KB or so). In the "does not rapidly crash" vs "saves a few %-CPU" equation, we will absolutely choose to prevent crashes.
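To make the weak-pointer idea slightly more concrete, here is a very rough sketch of a "don't retain more than X MB total" budget around the pool, using Go 1.24's weak package to notice when the pool has evicted a buffer. All names and the budget value are invented for illustration; none of this is proposed API:

```go
// Rough sketch only: a total retained-bytes budget for a sync.Pool of
// bytes.Buffers, using weak pointers to detect buffers the pool has evicted
// (since sync.Pool itself never reports evictions). Names are hypothetical.
package bufferpool

import (
	"bytes"
	"sync"
	"weak"
)

const maxPooledBytes = 64 << 20 // assumed budget: retain at most ~64MB in the pool

var (
	pool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

	mu       sync.Mutex
	tracked  = map[weak.Pointer[bytes.Buffer]]int64{} // buffers we think the pool holds -> their cap
	retained int64                                    // sum of caps in tracked
)

func Get() *bytes.Buffer {
	buf := pool.Get().(*bytes.Buffer)
	mu.Lock()
	if size, ok := tracked[weak.Make(buf)]; ok {
		// The buffer is back in use, so it no longer counts against the budget.
		retained -= size
		delete(tracked, weak.Make(buf))
	}
	mu.Unlock()
	return buf
}

func Put(buf *bytes.Buffer) {
	size := int64(buf.Cap()) // Reset keeps capacity, so cap is what the pool would retain
	mu.Lock()
	defer mu.Unlock()
	sweepLocked()
	if retained+size > maxPooledBytes {
		return // over budget: drop the buffer and let the GC reclaim it
	}
	buf.Reset()
	tracked[weak.Make(buf)] = size
	retained += size
	pool.Put(buf)
}

// sweepLocked forgets buffers whose weak targets have been collected, i.e.
// ones sync.Pool silently evicted, so their bytes stop counting toward the budget.
func sweepLocked() {
	for wp, size := range tracked {
		if wp.Value() == nil {
			retained -= size
			delete(tracked, wp)
		}
	}
}
```

This is obviously just an accounting approximation (and it adds a lock to the Get/Put path), so I'm not claiming it's the right trade-off, only that weak pointers make this shape of thing expressible at all.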
Option 1 might be reasonable as a short-term hotfix, and it fairly trivially has no effect on anyone who isn't deliberately making clearly-unstable changes to that value. Safe and quick, but also obviously not a "good" solution. (I personally lean towards the -ldflags route because it does not add a new "public" API that could be assumed to be stable.)
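For illustration, the guard could look something like this inside internal/bufferpool. All names here are invented, and note that -ldflags -X can only set string variables, so the value would have to be parsed at init:

```go
// Hypothetical sketch of option 1 (names invented, not existing yarpc code):
// a build-time cap on how large a buffer may be and still go back into the pool.
package bufferpool

import (
	"bytes"
	"strconv"
	"sync"
)

// MaxReuseBytes can be overridden at link time, e.g.
//   go build -ldflags "-X go.uber.org/yarpc/internal/bufferpool.MaxReuseBytes=131072"
// It is a string because -ldflags -X only sets string variables.
var MaxReuseBytes = "1048576" // default: reuse buffers up to 1MB

var maxReuseBytes = func() int {
	n, err := strconv.Atoi(MaxReuseBytes)
	if err != nil || n <= 0 {
		return 1 << 20 // fall back to the 1MB default on a bad override
	}
	return n
}()

var pool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func Get() *bytes.Buffer { return pool.Get().(*bytes.Buffer) }

// Put drops oversized buffers instead of pooling them, so a single large
// response can no longer permanently inflate a pooled buffer.
func Put(buf *bytes.Buffer) {
	if buf.Cap() > maxReuseBytes {
		return // oversized: let the GC reclaim it
	}
	buf.Reset()
	pool.Put(buf)
}
```

(Dropping rather than truncating is shown only because it's shorter; either would bound what each pooled buffer can retain.)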
Option 2 seems like a lot of work. The pool is used in a LOT of places, through a LOT of very-different call stacks, and where to draw the line between "you can customize this instance with yarpc-option-X passed to call-Y" and "there is no user-convenient way to get an argument into this call stack" will probably take some time to nail down.
Or maybe I'm just not familiar enough with the code and this will be reasonably simple (though clearly still changing quite a few lines of code).
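To show the rough shape I have in mind for option 2 (again hypothetical, none of this is existing yarpc API): an interface the transports could depend on, with the current global behavior as the default and a no-op implementation available for callers like us:

```go
// Hypothetical sketch of option 2 (not existing yarpc API): let callers inject
// the buffer pool used for request/response bodies.
package bufferpool

import (
	"bytes"
	"sync"
)

// Pool is what the transports would need from an injected buffer pool.
type Pool interface {
	Get() *bytes.Buffer
	Put(*bytes.Buffer)
}

// SyncPool matches the current behavior: unbounded reuse via sync.Pool.
type SyncPool struct {
	p sync.Pool
}

func NewSyncPool() *SyncPool {
	return &SyncPool{p: sync.Pool{New: func() any { return new(bytes.Buffer) }}}
}

func (s *SyncPool) Get() *bytes.Buffer { return s.p.Get().(*bytes.Buffer) }

func (s *SyncPool) Put(buf *bytes.Buffer) {
	buf.Reset()
	s.p.Put(buf)
}

// NopPool never reuses buffers: every Get allocates, and Put lets the GC
// reclaim the buffer. Trades some CPU/allocations for bounded memory.
type NopPool struct{}

func (NopPool) Get() *bytes.Buffer { return new(bytes.Buffer) }
func (NopPool) Put(*bytes.Buffer)  {}
```

A transport or dispatcher option (something along the lines of a hypothetical WithBufferPool(pool) option on tchannel/grpc) would then need to thread the chosen Pool down to the handlers, which is exactly the plumbing I expect to be the bulk of the work.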
Are there other options? Do you foresee any blockers in following option 2? We essentially need something, so we might be willing to do much of the implementation work, but it's not worth doing if the result won't be approved in the end.