RFC: Rewrite Postgres <-> Pageserver communication #10799

Draft: wants to merge 2 commits into base: main

Conversation

hlinnaka
Contributor

This is not ready; I'm still collecting and organizing my own thoughts. I will update when it's ready for review.

That said, feel free to leave comments already if you wish.


github-actions bot commented Feb 13, 2025

7744 tests run: 7366 passed, 0 failed, 378 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage* (full report)

  • functions: 32.8% (8643 of 26362 functions)
  • lines: 48.6% (73209 of 150552 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results: feead26 at 2025-02-28T12:14:03.915Z

- we will have a feature flag to switch between old and new communicator. Once we're
comfortable with the new communicator, remove old code and protocol.

- What about relation size cache? Out of scope? Or move it to the communicator process,
Contributor

Moving it to the communicator process allows having a relation size cache of unlimited size, because extending a hash table in normal memory is much simpler than in shared memory.
But it means that to get a relation's size, the backend needs to send a request to the communicator process and then wait for the reply. Postgres tries to minimise the number of smgrnblocks/smgrexists calls, but they are still performed quite frequently.

There is one communicator process in the Postgres server. It's a background
worker process. It handles all communication with the pageservers.

The communicator process is written in a mix of C and Rust. Mostly in Rust, but
Contributor

Are we absolutely sure that mixing C and Rust code in the same process is a good idea?
Yes, Rust is "safe", but it needs to access data in shared memory. I'm not sure that can be done in a "safe" way.
Certainly Rust allows us to use the same libraries on the compute and PS side, for example serde for request serialization/deserialization. But I'm still not sure whether there are more pros than cons.

Contributor

Several of the mentioned interfaces are already done:

Contributor

Maybe I am missing something, but it is just shmem initialization, not shmem access.

write the rest of the communicator in safe rust.

The Rust parts of the communicator process can use multiple threads and
tokio. The glue code is written taking that into account, making it safe.
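To make the excerpt above concrete, here is a minimal sketch of how a single communicator background worker might be registered from the extension's `_PG_init()`. The library and entry-point names (`neon`, `communicator_main`) and the restart policy are illustrative assumptions; the worker's main function would be a thin C shim that starts the Rust/tokio side.

```c
/* Hypothetical registration of the communicator as a background worker.
 * "neon" / "communicator_main" are assumed names, not the actual ones. */
#include "postgres.h"
#include "postmaster/bgworker.h"

static void
register_communicator_worker(void)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;   /* talks to backends via shmem */
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = 5;                /* restart a few seconds after a crash */
    strcpy(worker.bgw_library_name, "neon");
    strcpy(worker.bgw_function_name, "communicator_main"); /* C shim that hands off to Rust */
    strcpy(worker.bgw_name, "pageserver communicator");
    strcpy(worker.bgw_type, "pageserver communicator");

    /* must be called from _PG_init() while the library is being preloaded */
    RegisterBackgroundWorker(&worker);
}
```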
Contributor
@knizhnik knizhnik Feb 13, 2025

Do we need Rust AIO here? It is not quite clear from the RFC whether you are going to launch a separate Tokio task for each backend. But in that case we need a separate socket for each backend and there will be no multiplexing (one of the goals). And if we have a pool of threads which serve all backends, then there is not so much sense IMHO in using Tokio and adding an extra layer of mapping Tokio tasks to OS threads.

Contributor Author

I'm pretty sure we will need tokio. All the interesting libraries for network communication are async. You can have a task per backend, but still share a pool of network connections. I'm not sure yet exactly what that will look like.

Contributor
@erikgrinaker erikgrinaker Feb 20, 2025

Yeah, the async model tends to be a better fit for networking -- sync Rust sometimes has annoying limitations (like accept not being cancellable). And if we're going to use a thread pool anyway, we may as well let Tokio manage it.

But given that we have one process per backend anyway, we could get away with one thread per backend if we really wanted to.

the next available slot. The request also includes a pointer or buffer ID where
the resulting page should be written. The backend then wakes up the
communicator, with a signal/futex/latch or something, telling the communicator
that it has work to do.
Contributor

So the number of slots should actually be equal to PG_IOV_MAX?

Contributor Author

Hmm, maybe, not sure. It's a little different. PG_IOV_MAX is the max number of blocks you can request in a single preadv() / pwritev() call. If we use one "request slot" per buffer, then yeah, it makes sense to have at least PG_IOV_MAX request slots for each backend, so that you can fit all the blocks from a single preadv() call in the slots. Alternatively, if we make it so that one request can contain multiple target buffers, then we need just a single request slot for one preadv() call.
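For illustration, a per-backend request slot in shared memory could look roughly like the sketch below. Every name here (NeonRequestSlot, the state values, the PG_IOV_MAX-sized array) is a hypothetical reading of the scheme discussed above, not a settled design.

```c
/* Hypothetical shared-memory layout for one backend's request slots. */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "port/atomics.h"
#include "port/pg_iovec.h"
#include "storage/buf_internals.h"

typedef enum NeonSlotState
{
    NEON_SLOT_FREE = 0,
    NEON_SLOT_SUBMITTED,        /* filled in by the backend, not yet picked up */
    NEON_SLOT_IN_PROGRESS,      /* communicator has sent the GetPage request */
    NEON_SLOT_DONE              /* page written to the target, backend may proceed */
} NeonSlotState;

typedef struct NeonRequestSlot
{
    BufferTag   tag;            /* relation / fork / block being requested */
    XLogRecPtr  request_lsn;    /* LSN to read the page at */
    char       *dest;           /* where to write the page (a pinned shared buffer) */
    pg_atomic_uint32 state;     /* NeonSlotState, advanced by backend/communicator */
} NeonRequestSlot;

/* One preadv()-sized batch worth of slots per backend, as discussed above. */
typedef struct NeonBackendSlots
{
    NeonRequestSlot slots[PG_IOV_MAX];
} NeonBackendSlots;
```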

Contributor

I am not sure that per-backend "request slots" are the best solution. It is not clear how to efficiently notify the communicator process that a new request has been enqueued. The communicator process has N worker threads that take requests from shared memory, write them to the socket, and receive the replies.
So generally each worker should wait for one of the following events:

  1. Socket readable (some response has been received from the PS)
  2. Socket writable (it is possible to send a new request to the PS)
  3. New request arrived from a backend
  4. Postmaster shutdown

This is naturally handled by WaitLatchOrSocket (I wonder how you are going to do it in Rust, especially if using some external communication library). So each worker should have a latch which can be signalled by a backend when a new request is enqueued. But how will the worker understand which backend sent the signal and which ring to inspect?

Alternatively, we can have a shared request ring for each worker. A backend will use a round-robin discipline to choose a ring (worker). Yes, in this case it is necessary to synchronise access to the ring, but it can be done using just an atomic increment, without heavier synchronisation primitives. OK, not quite so easy: we can use an atomic increment to reserve space in the ring, but we need one more variable pointing to the current position in the ring, which can be advanced using CAS. So a request can be queued in this way (see the sketch after this list):

  1. Choose a ring (round robin)
  2. Obtain the current insert position in the ring (atomic increment)
  3. Write the request into the ring (no synchronization needed)
  4. Advance the current read-up-to position in the ring (CAS)
  5. Notify the worker (SetLatch)
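A minimal sketch of the enqueue steps above, assuming hypothetical CommRing/CommRequest types and a Postgres latch for the wakeup. Step 1 (picking a ring round-robin) and ring-full handling are omitted for brevity.

```c
/* Sketch of the lock-free enqueue described in steps 2-5; CommRequest,
 * CommRing and COMM_RING_SIZE are hypothetical names. */
#include "postgres.h"
#include "port/atomics.h"
#include "storage/latch.h"

#define COMM_RING_SIZE 1024     /* power of two */

typedef struct CommRequest
{
    uint64      payload;        /* stand-in for the real request fields */
} CommRequest;

typedef struct CommRing
{
    pg_atomic_uint64 reserve_pos;   /* next free slot handed out to producers */
    pg_atomic_uint64 ready_pos;     /* everything below this is fully written */
    Latch      *worker_latch;       /* latch of the worker serving this ring */
    CommRequest slots[COMM_RING_SIZE];
} CommRing;

static void
comm_ring_enqueue(CommRing *ring, const CommRequest *req)
{
    /* 2. reserve a slot with an atomic increment */
    uint64      pos = pg_atomic_fetch_add_u64(&ring->reserve_pos, 1);

    /* 3. write the request; the slot belongs to us, no lock needed */
    ring->slots[pos % COMM_RING_SIZE] = *req;

    /* 4. publish: advance ready_pos with CAS, waiting for earlier writers */
    for (;;)
    {
        uint64      expected = pos;

        if (pg_atomic_compare_exchange_u64(&ring->ready_pos, &expected, pos + 1))
            break;
    }

    /* 5. wake up the worker */
    SetLatch(ring->worker_latch);
}
```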

Contributor

I forgot about shards. It looks like each worker should work with one shard.
So the number of rings will be n_parallel_connections * n_shards, and the size of each ring should be (max_prefetch_distance + PG_IOV_MAX) * max_backends, where n_parallel_connections is the number of connections/sockets we want to establish between the communicator process and one shard (assuming that a single connection may not be enough to fully utilise the network throughput).

That doesn't seem too much to me.
With max_prefetch_distance=68, PG_IOV_MAX=32, max_backends=1000, n_shards=10 and n_parallel_connections=2
it is only 2M entries. The size of each entry will be about 64 bytes, so the total size will be about 128 MB.
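Spelling out that estimate, using exactly the assumed values from the comment above:

```c
/* Worked version of the sizing estimate; all values are the assumptions
 * stated in the comment, not tuned constants. */
#define MAX_PREFETCH_DISTANCE   68
#define PG_IOV_MAX_ASSUMED      32
#define MAX_BACKENDS            1000
#define N_SHARDS                10
#define N_PARALLEL_CONNECTIONS  2
#define RING_ENTRY_SIZE         64      /* bytes */

/*
 * rings   = N_PARALLEL_CONNECTIONS * N_SHARDS                      =        20
 * entries = rings * (MAX_PREFETCH_DISTANCE + PG_IOV_MAX_ASSUMED)
 *                 * MAX_BACKENDS                                   = 2,000,000
 * memory  = entries * RING_ENTRY_SIZE                              =  ~128 MB
 */
```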

shared buffer that the backend is holding a lock on), marks the request as
completed, and wakes up the backend.

In this design, because each backend has its own small ring, a backend doesn't
Contributor

How are we going to handle backend termination while it is waiting for a response? Should we allow it, or wait for request completion? If the PS is down, it can take quite a long time, during which the user will not be able to interrupt the query.


A backend can also issue a "blind" prefetch request. When a communicator
processes a blind prefetch request, it starts the GetPage request and writes the
result to a local buffer within the communicator process. But it could also
Contributor

It means that the communicator process needs to have max_connections * effective_io_concurrency buffers?

Contributor Author

Hmm, yes. That's a lot :-(. Or it can have fewer, and ignore or queue up the prefetch requests if there are no buffers available. We effectively have that already; the buffers just live in backend private memory today. When the buffers are private to the communicator process, we don't necessarily need to allocate all the memory up-front; they can be allocated as needed.

Contributor

I wonder whether we should have a common prefetch ring in the communicator process (shared by all backends) or maintain a separate ring for each backend. The first one seems to be easier and more efficient, but it may cause flushing of prefetch requests of one (slow) backend by another (fast) backend.

Assume that two backends are doing seqscans through large tables:

Q1: select sum(sin(x)) from t1;
Q2: select sum(x) from t2;

The first one is expected to execute more slowly than the second. If they compete for the same ring buffer space, then pages prefetched for Q1 will be flushed by Q2's prefetches before Q1 gets to request them.


### Prefetching

A backend can also issue a "blind" prefetch request. When a communicator
Contributor
@knizhnik knizhnik Feb 13, 2025

Are prefetch requests sent through the same "request slots" as other requests? In that case the number of slots per backend should be at least PG_IOV_MAX + effective_io_concurrency.

Should the backend first check whether a prefetch request for this buffer is already registered? As far as I understand, all the current prefetch ring logic, with its prefetch ring and hash, will be moved to the communicator process. But that means the backend will not be able to effectively check for duplicate prefetch requests (unless it pins the buffer). And duplicated prefetches seem to be quite common for indexscan heap page prefetch.

Contributor Author

> Are prefetch requests sent through the same "request slots" as other requests? In that case the number of slots per backend should be at least PG_IOV_MAX + effective_io_concurrency.

Yes.

> Should the backend first check whether a prefetch request for this buffer is already registered? As far as I understand, all the current prefetch ring logic, with its prefetch ring and hash, will be moved to the communicator process. But that means the backend will not be able to effectively check for duplicate prefetch requests (unless it pins the buffer). And duplicated prefetches seem to be quite common for indexscan heap page prefetch.

I don't think it's necessary. There's no deduplication mechanism like that in the posix_fadvise calls we make either. Probably needs some stress testing though.


### PostgreSQL version 17

In version 17, when prefetching is requested, the pages are already pinned in
Contributor

It is true only for seqscan (read stream). But what about prefetch for other plan nodes, e.g. indexscan?
Should we also redo it according to the new schema (pin buffer...)? It seems problematic, because there are no actual sequential streams in this case - just randomly accessed pages. Pinning the buffer would prevent duplicated prefetch requests, but the pins have to be held somewhere. Will it be possible to reuse the read stream API for this, or should we invent our own data structure for it?

Contributor Author

We can continue to use the "blind" prefetch requests for these in v17.

shared buffer. On completion of an AIO, the process that processes the
completion will have to call a small callback routine that releases the buffer
lock and wakes up any processes waiting on the I/O. It'll require some care to
execute that safely from the communicator process.
Contributor

I wonder whether a prefetch request into a pinned and locked shared buffer somehow differs from a normal get page request?

I am actually concerned that we would have to implement three different prefetch schemas: <17, =17, >17.
I wonder if we can implement the same approach for all of them?
Yes, it would require patching Postgres code, but we modify it in any case, so we would not have to implement our own prefetch (maybe except seqscan in pg17).

So what if we use the same schema for all of them: for example, pinning and locking shared buffers and registering a normal get page request instead of some special prefetch request?

One of the problems of such an approach may be a too large number of pinned shared buffers: max_connections * effective_io_concurrency, for example 100 * 100 => ~82 MB of shared buffers - it is quite easy to exhaust the 128 MB of shared buffers we have now.

Contributor Author
@hlinnaka hlinnaka Feb 13, 2025

> I wonder whether a prefetch request into a pinned and locked shared buffer somehow differs from a normal get page request?

No difference.

> I am actually concerned that we would have to implement three different prefetch schemas: <17, =17, >17.
> I wonder if we can implement the same approach for all of them?
> Yes, it would require patching Postgres code, but we modify it in any case, so we would not have to implement our own prefetch (maybe except seqscan in pg17).

I think there won't be much difference between versions <17 and 17. In v17, the read stream facility keeps the pages pinned while they're prefetched, but we don't need to care about that in the smgr.

> One of the problems of such an approach may be a too large number of pinned shared buffers: max_connections * effective_io_concurrency, for example 100 * 100 => ~82 MB of shared buffers - it is quite easy to exhaust the 128 MB of shared buffers we have now.

Yeah, that's a problem we already have. There is some logic in Postgres to limit the amount of prefetching when shared_buffers is low. It's probably not working very well for us though.

establishing or holding a connection. No head-of-line blocking; prefetch
requests can be processed with lower priority. We would control our own
destiny. But it has its own set of challenges: congestion control,
authentication & encryption.
Contributor

Also it does somehow address the problems with prefetch.


## Non Goals (if relevant)

- We will keep LFC unmodified for now. It might be a good idea to rewrite it
Contributor

If we have a separate communicator process (through which all get page requests will be passed), then it seems quite easy to implement caching (in memory or on disk) in this process. It can be done in local memory and so easily scaled up and down. Also, if requests are fetched from the request buffers by a single thread, then we do not need synchronisation (but I am not sure whether that is a good idea).

Contributor Author

Yeah, we could move the LFC into the communicator process. It would add some delay to LFC cache hits though, as you'd need to go through the shared memory request/response even for cache hits. It might still be worth it, if it makes the LFC management simpler or more flexible.

Anyway, I don't want to tackle that right now.

that it has work to do.

The communicator picks up the request from the backend's ring, and performs
it. It writes the result page to the address requested by the backend (most likely a
Contributor

Right now the buffer id is not passed to the SMGR API; it only gets a pointer to where the data should be placed.
We certainly can calculate the buffer id ourselves as (ptr - BufferBlocks)/BLCKSZ + 1,
but that seems to be a hack. Actually the communicator process does not need to know the buffer id: it is not going to unlock or release the buffer. We can just pass the pointer to the communicator process, but only if it points to shared memory; otherwise sending it to another process is not correct. Is there some guarantee that ReadBuffer can only be called for a pinned shared buffer?
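For reference, the reverse mapping mentioned above would look something like this sketch. It is only valid when the destination pointer really lies inside the shared buffer pool; a private-memory destination would need a different path.

```c
/* Sketch: recover a buffer ID from a pointer into the shared buffer pool. */
#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

static Buffer
buffer_from_block_pointer(const void *ptr)
{
    Size        offset = (const char *) ptr - BufferBlocks;

    Assert(offset % BLCKSZ == 0);
    Assert(offset / BLCKSZ < (Size) NBuffers);

    return (Buffer) (offset / BLCKSZ + 1);
}
```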

@knizhnik knizhnik mentioned this pull request Feb 19, 2025

Implementation phases:

- Implement new protocol in pageserver. In first prototype, maybe just
Contributor

Might be nice to decouple this by having the executor initially use the old protocol (but if necessary we can execute these in parallel).

- Using a library might help with managing the pool of pageserver connections,
so we won't need to implement that ourselves

### Reliability, failure modes and corner cases (if relevant)
Contributor

Thinking of observability/debugging -- once the communicator is in Rust, shall we give it its own Prometheus endpoint to scrape?

Failure modes: how will we debug/diagnose if a request somehow gets lost/stuck due to a bug in the IPC between worker and communicator?

Comment on lines +137 to +138
- Use protobuf or something else more standard. Maybe gRPC. So that we can use
standard tools like Wireshark to easily analyze the traffic.
Contributor

I think gRPC is a reasonable default.

We can eke out more performance with something like FlatBuffers or Cap'n'Proto (Protobuf is pretty allocation-heavy), but will trade off flexibility and tooling. So let's decide how performant this has to be.

If we're going to multiplex backends, I'd also strongly prefer to attempt gRPC over QUIC rather than TCP to avoid head-of-line blocking.


## Alternative implementation (if relevant)

I think UDP might also be a good fit for the protocol. No overhead of
Contributor

I mentioned this above, but we should use QUIC rather than raw UDP. It has the same benefits but we don't have to build it ourselves, comes with prioritization and congestion control, and could presumably be combined with gRPC.

I'm not sure what the current state of gRPC over QUIC is in Rust though, but we should look into it.

Contributor
@erikgrinaker erikgrinaker Feb 20, 2025

Ok, we're not all there yet:

  • QUIC implementation: quinn is stable.
  • HTTP/3 implementation: h3 is experimental.
  • Hyper server library: in progress.
  • Reqwest client library: experimental.
  • Tonic gRPC library: blocked on the above.

So I think we have a few options:

  1. YOLO with experimental/unstable gRPC-over-QUIC.
  2. Use gRPC-over-TCP, swap the transport to QUIC when stable.
  3. Do something fancy like FlatBuffers over QUIC.

Contributor

I am +1 with @jcsp on not changing the current protocol.
There are IMHO no urgent requirements for a protocol change.
Yes, the current manual encoding/decoding of messages is awful. But it is already done. If we write the communicator in Rust, we can reuse the existing code.

The current protocol doesn't support batching, but batching is done more or less efficiently on the PS side. So I'm not sure that we need to support it at the protocol level, at least right now.

But the main argument for preserving the protocol is to provide backward compatibility and simplify deploying the communicator in prod. If we develop a new protocol, then at least for some time the PS will have to support both, which can be really problematic and complicate the PS code.

So I think that these two tasks, multiplexing and batching at the protocol level, can and should be separated. And we had better start by reusing the existing protocol, to allow the communicator to work with the existing PS.

Contributor

That's fair, we can decouple these changes if you prefer.

I think a request/response model would be easier to deal with than a stream protocol though, and that may fundamentally change how we'd build the communicator. We may also want to offload e.g. connection/stream management to the gRPC library, which would avoid a lot of unnecessary work. But I'll leave that for the compute team to judge.

> If we develop a new protocol, then at least for some time the PS will have to support both, which can be really problematic and complicate the PS code.

That will also be true if we change the protocol later. I don't think it's that much of a burden on the Pageserver side, it's just a transport adapter.

Contributor Author

I'd prefer to change the protocol at the same time. If we keep using the libpq protocol, we'll need to use either libpq, or the rust client, from the communicator process, and it'll be some effort to make that performant and integrate it well. Instead of spending time tweaking that, I'd rather go straight to a gRPC library and start to learn how to tune and tweak that.

I am assuming that implementing the pageserver side of a new protocol will be pretty straightforward. If it's not, that perhaps changes the tradeoff.

Contributor
@erikgrinaker erikgrinaker Feb 24, 2025

> I am assuming that implementing the pageserver side of a new protocol will be pretty straightforward.

That's my assumption too. The only thing I'm unsure of is the lock management if we move to a gRPC request/response model, as I'm not intimately familiar with this area, but I expect it to be manageable.

Edit: actually, that would necessarily have to be the same as today, since we already have multiple backends connected to the same shard.

@knizhnik
Contributor

There are two possible ways of implementing prefetch in the communicator:

  1. Move the existing per-backend prefetch ring implementation to the communicator (some changes are definitely needed). I.e. there will be some ring with registered prefetch requests and a hash table to locate them. Since they can be accessed from multiple threads, they should be protected by some mutex/rwlock. The prefetch ring contains the received pages, and a get page request should perform a lookup in the prefetch cache to check whether the page has already been prefetched.
  2. Do not maintain any prefetch state in the communicator and just store prefetch results in the LFC. This is what I have implemented in my PoC: PS communicator #10896. In this case the LFC can be accessed from multiple threads, so it cannot use Postgres LWLocks/condition variables because they are not thread-safe; I had to replace them with POSIX analogues.

Let's discuss the pros and cons of both approaches.
The first approach allows using prefetch without the LFC. I'm not sure whether that is important, because we do not have nodes without the LFC.
Also, access to the LFC is slightly slower than access to a shared buffer, so in theory the first approach can be faster, although I have not observed that in practice.

But there are several essential drawbacks of such an approach:

  1. Extra memory for storing prefetch results in the communicator (in the LFC they are stored in the OS cache and can be swapped to the local disk).
  2. It is not clear what the size of the prefetch ring should be. Right now for each backend it is 128. If we can have a thousand backends, should the size of the ring be 128k, or is that too much?
  3. Backends access data at different rates. Prefetch requests of one backend can push the prefetch requests of another backend out of the ring.
  4. Something like "priority inversion". Assume that one backend places a prefetch request for page P at the maximal prefetch distance (100 now). But at the same time another backend tries to read this page. It performs a lookup in the prefetch ring and finds this prefetch request at the end of the prefetch queue. Should it wait until this request is sent, or should it send its own request to the PS immediately?

The second approach is simpler - it doesn't require maintaining any prefetch state in the communicator. But as I already noted, it requires some changes in the LFC implementation to make it thread-safe (a sketch of the kind of change follows).
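As an illustration of the thread-safety change mentioned in the second approach, replacing a Postgres LWLock with a process-shared POSIX rwlock could look roughly like the sketch below. The names (LfcLock, lfc_lock_init) are hypothetical; the point is only that Postgres LWLocks assume single-threaded backends, while a PTHREAD_PROCESS_SHARED rwlock can be taken both by backend processes and by communicator threads.

```c
/* Sketch: a process-shared POSIX rwlock protecting an LFC structure, as a
 * thread-safe replacement for an LWLock. The struct lives in shared memory. */
#include <pthread.h>

typedef struct LfcLock
{
    pthread_rwlock_t lock;
} LfcLock;

static void
lfc_lock_init(LfcLock *l)
{
    pthread_rwlockattr_t attr;

    pthread_rwlockattr_init(&attr);
    /* taken both by backends (processes) and by communicator threads */
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(&l->lock, &attr);
    pthread_rwlockattr_destroy(&attr);
}
```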

@knizhnik
Contributor

knizhnik commented Feb 23, 2025

Should we have some discussion/brainstorming concerning this RFC?
Right now, after my PoC implementation #10896, I see 4 alternative ways of going ahead:

  1. As proposed in this RFC, implement the communicator as an async multitasking application in Rust. The main problem with this approach is IMHO synchronisation between Postgres backends in C and Rust tasks. Even if there is a separate queue for each backend, the backend still needs some way to ping the communicator that a new request has been added. The problem is that the standard Postgres sync primitives (Latch, LWLock, ConditionVariable) cannot be accessed from Rust, at the very least because they are not thread-safe. And calling Rust synchronisation primitives, especially Tokio ones requiring a Tokio runtime, seems problematic.
  2. Implement a multithreaded communicator in C (as is done in my PoC). The drawbacks are mostly the same: it is not possible to use Postgres sync primitives in a multithreaded environment. That is why I had to rewrite the work with sockets (libpq wrappers) and the LFC locking.
  3. Stay in the Postgres multiprocess paradigm and launch BGWs instead of threads. In this case it is possible to implement the communicator in the standard Postgres way with minimal changes in the code (all synchronisation and the work with libpq connections can be preserved). The drawback is the large number of extra processes. The compute has to establish connections with multiple shards. Right now the number of shards is not very large (~10), but in the future it may be increased to ~100. And for each shard we need to establish several parallel connections; they are needed to execute get page requests on the page server in parallel. Otherwise the performance of a system with multiple parallel queries will be much worse than we have now, when each backend has its own connection to the PS. I am not sure about the optimal number of parallel connections, but it seems to be not less than 10, taking into account the possibility of hundreds of active backends. Also it would be nice to handle requests and responses in parallel, to avoid, say, blocking the sending of a request to the PS on storing received responses in the LFC, so the number of processes should be doubled (reader + writer). 2 * 10 * 10 = 200 background workers seems to be too much, and that is the lower boundary; 2 * 100 * 100 = 20k BGWs is IMHO completely unacceptable.
  4. The same as 3 - use a standard Postgres BGW for the communicator, but let it serve multiple channels. It will fix the thread-safety problems and at the same time reduce the number of communicator backends. Most likely a single communicator will not be enough, especially if it also has to write to the LFC, but it is possible to have a pool of such BGWs, e.g. 10, and scatter clients between them.


## Summary

## Motivation
Contributor

It is possible to split these motivations/goals into two categories:

  1. Problems/limitations of the current architecture
  2. Useful refactoring (safe Rust, complex async libpq, code reuse, ...)

There are two major architectural problems we want to address:

  1. Too large a number of connections between compute and pageserver (each backend needs to establish a connection with each shard)
  2. Inefficient and complicated prefetch: duplicates, results that are never retrieved, ...

There are two lessons I learned from my PoC:

  • It will be hard to determine the optimal number of connections to the PS and obtain at least the same level of performance as we have now. The communicator is in any case an extra proxy, and even the most efficient mechanism of communication between backends and the communicator will add some overhead. We will always have to find some compromise between bottlenecks and lock contention. In my PoC, even using lockless communication through shared memory, performance is still almost ~2x slower for a large number of active backends.
  • By copying all the request information into the response (V3) it is possible to significantly simplify the prefetch logic. Actually we do not need all this complex prefetch state (ring) maintenance. There is no need to match prefetch requests and responses. Prefetch results can be stored in the LFC, and that can be done by some process other than the one that issued the prefetch request.

So taking that into account, I wonder whether we can/should investigate other alternatives for addressing these two architectural problems. One was already mentioned in this PR: using UDP instead of TCP. It should in theory solve the problem of too many connections. I am not sure about the speed of this approach, but the fact that the fastest network protocols are based on UDP rather than TCP makes me think that it is possible.

And concerning prefetch - it can be significantly simplified even without protocol changes and without introducing a communicator. The problem with fetching prefetch results can be solved by spawning one or more prefetch/prewarm workers which just receive prefetch results (for requests sent directly by backends) and store them in the LFC. The same workers can be used to perform prewarm on cold starts.

Contributor

> Too large a number of connections between compute and pageserver

What, specifically, is the problem with too many connections? Lots of idle tasks? Memory usage?

If we move to something like QUIC/UDP, we'll still have to maintain stream or request state as long as requests are ongoing, and stream setup is not free. With raw UDP, we'll have to re-implement flow control and packet retransmission/reassembly ourselves.

> performance is almost ~2x slower for a large number of active backends

Isn't this mostly a tuning problem? I don't think we have to be overly cheap with active connections, I think the main problem is to avoid all of the idle connections, no? Ideally, in steady state, there should be sufficient connections that there is no contention.

Contributor

> What, specifically, is the problem with too many connections? Lots of idle tasks? Memory usage?

Actually I don't know; it was not my concern. I will also be glad to understand the problem.
A large number of tasks and memory usage should not be a problem as long as we are using Tokio: the overhead of its tasks is quite small. What can matter is the number of sockets/ports; I can imagine a situation where we hit some limit.

Whether UDP can help depends on the answer to this question. I agree that using raw UDP is not an option, but we can use some existing higher-level protocol on top of UDP. But I am not an expert here.

> Isn't this mostly a tuning problem?

I am not sure that it is just a tuning problem. In my test there were 10 channels/threads in the communicator and 90 backends. If we have 90 channels/threads, then the problem most likely will be gone. But in that case the number of sockets/connections to the PS is the same as when each backend establishes its own connection. Then why do we need the communicator at all, if it doesn't perform multiplexing?

If we just need to avoid idle connections (why?), it can be done without introducing a communicator: just add some timeout and drop the connection if it is not active within some interval, then re-establish it on demand.

Contributor
@erikgrinaker erikgrinaker Mar 3, 2025

> If we just need to avoid idle connections (why?)

The proximate cause afaik is that idle connections take up a Tokio task and socket on the Pageserver, so we can't have hundreds of thousands of them on the server side (memory usage, scheduler pressure, etc).

I was just curious if there were other problems we wanted to solve re: idle connections, or if that's the only one.

> it can be done without introducing a communicator: just add some timeout and drop the connection if it is not active within some interval, then re-establish it on demand.

This is a fair point. I think the main downside is that re-establishing it incurs a TCP+TLS handshake on the next request (3 RTTs).

Contributor

We in any case have inactive node suspension (after 5 minutes of inactivity). Certainly the PS disconnect timeout can be made smaller (seconds, not minutes), but due to the TLS handshake overhead mentioned above it should not be too small. If the connection drop threshold is 100 seconds and we have 100k backends sending requests to the PS at intervals of more than 100 seconds, then the PS needs to handle 1000 SSL handshakes per second, which seems quite a lot. So I do not think that we need some other timeout here beyond the 5 minute node suspension.

What do you think, at least as a first step in this direction, about this quite simple architecture:

  1. Leave compute->PS communication as it is: each backend maintains its own TCP connection to the PS. Using UDP would require us to maintain stream state, i.e. redo the work which TCP already does.
  2. Stateless prefetch: there should be some number of prefetch workers listening on the same UDP port (this is possible in Linux; see the sketch after this comment). When receiving a prefetch request, the PS doesn't respond to the sender, but rather sends a UDP message to this port. One (random) prefetch worker receives this response and places it in the LFC (all the required information is present in the response itself).

So it requires minimal changes in the Neon architecture and still addresses most of the existing problems of compute-PS communication. All the current complex prefetch machinery with its hash and ring can be removed, prefetch responses will be processed at the maximum possible speed, balancing is done by the OS, and no sophisticated tuning is needed. The only thing we need to choose is the number of prefetch workers, but it is quite easy to decrease or increase them dynamically.
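The "several workers on one UDP port" part of step 2 relies on SO_REUSEPORT; a minimal sketch (plain POSIX sockets, the port number and function name are hypothetical) of what each prefetch worker would do when it starts up:

```c
/* Sketch: each prefetch worker binds the same UDP port with SO_REUSEPORT,
 * so the kernel spreads incoming prefetch responses across the workers. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int
open_prefetch_socket(uint16_t port)
{
    int         fd = socket(AF_INET, SOCK_DGRAM, 0);
    int         one = 1;
    struct sockaddr_in addr;

    if (fd < 0)
        return -1;

    /* lets every worker bind the same port (Linux >= 3.9) */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0)
    {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
    {
        close(fd);
        return -1;
    }

    /* the worker then loops on recvfrom() and stores received pages in the LFC */
    return fd;
}
```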

Contributor

> Leave compute->PS communication as it is: each backend maintains its own TCP connection to the PS.

Pageservers currently have up to 5k concurrent pageservice connections per node. That number doesn't scare me too much, as long as they're mostly idle. I think we can probably go up to 20k without too much trouble, but I asked #team-proxy what they think. We could drop the timeout down to e.g. 1 minute to reduce this a bit.

> Stateless prefetch: there should be some number of prefetch workers listening on the same UDP port (this is possible in Linux). When receiving a prefetch request, the PS doesn't respond to the sender, but rather sends a UDP message to this port.

This should be fine on the Pageserver-side, as long as we treat prefetches as best-effort and don't make any attempts to retransmit them. We'll just fire and forget.

Prefetch requests will still cause head-of-line blocking and affect tail latencies, since the requests are transmitted across the same TCP connection and GetPage processing is single-threaded. I think ideally we'd want to have lower-priority processing of prefetches that don't interfere with foreground GetPages. We may want to solve that problem now, if we're making changes anyway -- we could add UDP prefetch workers on the Pageserver-side too, but at that point we may as well consider other protocols like gRPC too.

I also think there are qualitative reasons to use something like gRPC, even if they may not be urgent: connection/stream management, request/response and stream models, standardized and extensible protocol, client/server code generation, built-in encryption/authentication/compression, prioritization, cancellation and deadline propagation, metrics and observability tooling, migration path to QUIC, etc. But it's fine to decouple it.

Contributor

> Prefetch requests will still cause head-of-line blocking and affect tail latencies, since the requests are transmitted across the same TCP connection and GetPage processing is single-threaded.

Of course, we could use a separate pool of TCP connections specifically for prefetches. That wouldn't require protocol changes nor Pageserver changes.

hlinnaka added a commit that referenced this pull request Apr 5, 2025
pagestore_smgr.c had grown pretty large. Split into two parts, such
that the smgr routines that PostgreSQL code calls stays in
pagestore_smgr.c, and all the prefetching logic and other lower-level
routines related to communicating with the pageserver are moved to a
new source file, "communicator.c".

There are plans to replace communicator parts with a new
implementation. See #10799.
This commit doesn't implement any of the new things yet, but it is
good preparation for it. I'm imagining that the new implementation
will approximately replace the current "communicator.c" code, exposing
roughly the same functions to pagestore_smgr.c.

This commit doesn't change any functionality or behavior, or make any
other changes to the existing code: It just moves existing code
around.
hlinnaka added further commits that referenced this pull request on Apr 5, Apr 7, and Apr 8, 2025, and github-merge-queue bot and Bodobolero pushed commits referencing it on Apr 9, 2025, all with the same commit message as above.
@hlinnaka hlinnaka self-assigned this Apr 14, 2025