Historical Data Access via RPC #1717
Replies: 7 comments 9 replies
-
+1 on supporting an archival node option; it's a common concept in other ecosystems. Sometimes folks just want to run a node in archival mode. The answer shouldn’t be “use Horizon.” I’m in favor of options 1, 3, and 5. Option 5 aligns with what node operators are already familiar with, so if we’re considering deviating from that industry standard, the alternative should deliver clear benefits to node operators, whether in cost, usability, stability, or other meaningful gains. I'd also like us to consider worst-case scenarios: what happens if Galexie and Hubble fail? Are there any single points of failure we might be overlooking? Lastly, maintaining a full archive of the network within one’s own infra is a hard requirement for some users (e.g., exchanges). Any proposed alternative has to offer a significant improvement over the status quo to drive adoption.
-
Yes, I've heard something like this asked for at the start of most of the integrations I've been a part of. I like the idea of the 2nd approach because, to the end user, it follows the RPC pattern. I think providing a single endpoint and abstracting the "how" away from the end user will be the best experience. Would option 2 allow the user to configure how far back they want to go?
-
+1 for option 'numero uno' with a focus on RPC using a Galexie fueled
For clients of RPC, acquiring this historical data will come down to throughput: how fast can they get it? Rather than changing the existing JSON-RPC endpoints to dynamically proxy to the appropriate back end for historical vs. near-term data, it might be worthwhile to retain the existing JSON-RPC endpoints as-is to provide near-term data for native and browser JS clients. Consider having RPC host a second service on a different port, with a transport better suited to native clients that want big data as fast as they can handle it: streams with back-pressure and no paging/polling. This could be gRPC streams. Here's a mock-up of how a gRPC interface for historical data could potentially look:
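To make the "streams with back-pressure and no paging/polling" idea concrete, here is a rough Python sketch. It is not the gRPC mock-up mentioned above and does not use gRPC at all: it only imitates the flow-control behaviour with a bounded in-process queue. The function names, the `buffer_size` parameter, and the use of bare ledger sequence numbers as payloads are all assumptions made for illustration.

```python
import asyncio


async def produce_ledgers(queue, start, end):
    """Hypothetical server side: push ledger sequence numbers into a bounded queue."""
    for seq in range(start, end + 1):
        # put() suspends while the queue is full, so the producer can never
        # get more than `maxsize` items ahead of the consumer (back-pressure).
        await queue.put(seq)
    await queue.put(None)  # sentinel marking the end of the stream


async def consume_ledgers(queue, depth):
    """Hypothetical client side: drain the stream at its own pace, no paging."""
    received = []
    while True:
        depth[0] = max(depth[0], queue.qsize())  # record the deepest backlog seen
        seq = await queue.get()
        if seq is None:
            break
        await asyncio.sleep(0)  # stand-in for client-side processing
        received.append(seq)
    return received


async def stream_range(start, end, buffer_size=4):
    """Stream ledgers [start, end]; return what arrived and the deepest backlog."""
    queue = asyncio.Queue(maxsize=buffer_size)
    depth = [0]
    producer = asyncio.create_task(produce_ledgers(queue, start, end))
    received = await consume_ledgers(queue, depth)
    await producer
    return received, depth[0]


received, max_depth = asyncio.run(stream_range(100, 120))
print(len(received), "ledgers received; deepest backlog:", max_depth)
```

The point of the sketch is that the client never asks for "the next page": it just keeps reading, and the sender is slowed down automatically whenever the client falls behind. A real gRPC server-streaming call would get comparable behaviour from the transport's flow control rather than from an explicit queue.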
-
Another option to consider is a combination of 1️⃣/3️⃣ + 2️⃣/4️⃣, where 1️⃣ and 3️⃣ are actually the same thing (RPC falling back to another RPC), and 2️⃣ and 4️⃣ are RPC APIs that can be used either directly or via fallback. This idea isn't new; I think it's been discussed a few times going back to 2022. There was this doc that touched on it. Diagram copied below so anyone can see it. The diagram described things that are not relevant to this conversation, like pluggable functionality, but the part that is relevant is the idea that RPC could layer with other RPC-compatible APIs. The goal is to share resources, such as a stellar-core, or to cache, while allowing anyone to spin up light RPCs that could still do things like simulation.
This layering pattern removes the downsides of choosing between 1️⃣ and 2️⃣. The user experience becomes the same. What's different is the operator deployment model, and how operators scale their instances. For example, 2️⃣ may be an advantage if it also caches, allowing the cache to be shared by a fleet of 1️⃣s connected to a number of 2️⃣s. 2️⃣ may be a disadvantage if an operator would rather host with a simpler single-instance model.
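A minimal Python sketch of the layering idea, assuming every layer exposes the same lookup call. None of this is taken from the actual RPC codebase: `LayeredRpc`, `ArchiveBackend`, `get_ledger`, and the string payloads are hypothetical names and data chosen only to show the fallback and shared-cache behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Protocol


class LedgerSource(Protocol):
    """Anything that can answer a ledger lookup (hypothetical interface)."""

    def get_ledger(self, seq: int) -> Optional[str]: ...


@dataclass
class ArchiveBackend:
    """Hypothetical stand-in for a full-history source; counts how often it is hit."""

    ledgers: Dict[int, str]
    calls: int = 0

    def get_ledger(self, seq: int) -> Optional[str]:
        self.calls += 1
        return self.ledgers.get(seq)


@dataclass
class LayeredRpc:
    """An RPC-like node: answers locally if it can, otherwise asks its upstream."""

    recent: Dict[int, str]                    # ledgers inside the local retention window
    upstream: Optional[LedgerSource] = None   # another RPC-compatible layer, if any
    cache: Dict[int, str] = field(default_factory=dict)

    def get_ledger(self, seq: int) -> Optional[str]:
        if seq in self.recent:
            return self.recent[seq]
        if seq in self.cache:
            return self.cache[seq]
        if self.upstream is None:
            return None
        ledger = self.upstream.get_ledger(seq)
        if ledger is not None:
            self.cache[seq] = ledger  # remember historical answers for later callers
        return ledger


# One archive, one shared caching layer, and two light front-line nodes.
archive = ArchiveBackend({n: f"ledger-{n}" for n in range(1, 101)})
shared = LayeredRpc(recent={}, upstream=archive)
light_a = LayeredRpc(recent={100: "ledger-100"}, upstream=shared)
light_b = LayeredRpc(recent={100: "ledger-100"}, upstream=shared)

print(light_a.get_ledger(100))  # served from the local retention window
print(light_a.get_ledger(5))    # falls through to the archive
print(light_b.get_ledger(5))    # served from the shared cache
print(archive.calls)            # the archive was hit only once
```

Callers of `light_a` and `light_b` see one API regardless of where the data lives. The deployment choice (single instance versus a fleet sharing a caching layer) only changes how the objects are wired together.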
-
I think we need to agree on a definition for "historical data" here. This definition starkly changes the architecture we'd want to follow. Do we strictly mean the paginated endpoints,
or do we also want to include Should we architect with the expectation that the future will require some indexed Anyway, it'd be valuable to have clarity on what exactly people expect when they say "archive node" and "historical data." Whether or not that includes indexed data has heavy implications for the architecture we want to recommend or build.
-
Some feedback from dfuse on their ideal archival node/backfilling experience
-
Posting back here for posterity. We'll leave this discussion open to collect further comments as features in this area develop and progress, but for now we are moving forward as follows:
-
What
The concept of a "full archive node" is commonplace in other ecosystems. Partners that support many different chains typically expect something like this to be available. At present, RPC is not architected in a way that would support retaining full history in its database, but by and large people want to use the same API for historical and current data. So ultimately, we'd like to provide access to historical data through RPC's API (or an RPC-like API) to ease the integration burden as much as possible.
Why
Historical data is available today through a number of different means:
Most providers must run RPC (presuming they are interested in simulating and sending Soroban transactions and/or reading Soroban events). Even for those that may not, we'd like to encourage the trend away from Horizon given its imminent scalability challenges as TPS increases. However, using one of the above for historical data and RPC for non-historical data requires an end user to integrate with two different APIs depending on whether they are looking for historical or recent data, which is a huge point of friction. Since the RPC API is the API that we'd like to push people towards, it follows that supporting some kind of "RPC-like" API on top of these historical sources could achieve multiple goals here. This would also align with expectations for people coming over from other L1s, as we would have something that looks approximately like a "full archive node".
There are a number of different ways that people may use historical data, and surely more that we are not aware of. At a minimum, these include:
How
Exactly how this user story gets built into our products is somewhat up for debate. There are at least 5 (maybe more) options that have been thrown around already as possibilities:
I am personally partial to 1, for the following reasons: