Historical Data Access via RPC #1717

mollykarcher · 2025-04-09T20:58:38Z

mollykarcher
Apr 9, 2025
Maintainer

What

The concept of a "full archive node" is commonplace in other ecosystems. Typically when dealing with partners that support many different chains, they expect something like this to be available. At present, RPC is not architected in a way that would support retaining full history in its database, but by and large people want to use the same API for historical and current data. So ultimately, we'd like to provide access to historical data through RPC's API (or an RPC-like API) to ease the integration burden as much as possible.

Why

Historical data is available today through a number of different means:

Galexie-exported data lake
Hubble
Horizon, via limited third party infrastructure providers or self-host

Most providers must run RPC (presuming they are interested in simulating+sending Soroban txs and/or reading Soroban events). Even for those that may not, we'd like to encourage the trend away from Horizon given its imminent scalability challenges as TPS increases. However, using one of the above for historical data and RPC for non-historical data requires an end user to integrate with two different APIs depending on whether they are looking for historical data or recent data, which is a huge point of friction. Since the RPC API is the API that we'd like to push people towards, it follows that supporting some kind of "RPC-like" API on top of these historical sources could achieve multiple goals here. This would also align with expectations for people coming over from other L1s, as we would have something that looks approximately like a "full archive node".

There are a number of different ways that people may use historical data, and surely more that we are not aware of. At a minimum, these include:

How do I get a one-time dump of all historical data?
- Example: providers (data, analytics, bridges, indexers, etc) that want a one-time dump of all historical data, but will then use RPC on a go-forward basis
How do I get all historical data, within a given range?
- Example: institutional players (financial, compliance, etc) that are looking at specific points in history to satisfy auditing requirements or perform reconciliations
How do I get random access to any ledger/transaction in history?
- Example: block explorers or indexers (note that this problem should be solved by these types of developers/users themselves, whereby they create indexing products to serve ecosystem needs)
How do I rewind ledger entries to a specific point in time? (related issue: ingest/cdp: derive ledger entry state for consumers go#5612)
- Example: developer tooling that simulates a contract against a specific ledger

How

Exactly how this user story gets built into our products is somewhat up for debate. There are at least 5 (maybe more) options that have been thrown around already as possibilities:

1. RPC can "fall back" or act as a proxy to a Galexie-exported data lake to retrieve data older than its retention window.
- Clients/users would only ever interact with the RPC API
- RPC internally delegates requests either to on-disk storage or a data lake
2. Galexie-exported data lake is fronted with a lightweight wrapper/facade that mimics RPC's current API
- Wrapper/facade could be either a service, or a library/SDK
- A service has the drawback of requiring deployment of an additional piece of infrastructure; a library/SDK has the drawback of requiring separate implementations in multiple languages
3. RPC can "fall back" or act as a proxy to Hubble to retrieve data older than its retention window
4. Hubble is fronted with a lightweight wrapper/facade that mimics RPC's current API
5. RPC's database/architecture is reassessed to support full history
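
To make the first option concrete, here is a minimal sketch of the "fall back" behaviour: one entry point that serves recent ledgers from local storage and delegates anything older than the retention window to an archive source. It is an illustration only, not taken from the RPC codebase; `LedgerStore`, `FallbackRpc`, and the ledger numbers are all hypothetical, and the real retention window and storage layers may behave differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LedgerStore:
    """Toy stand-in for a ledger source (on-disk storage or a data lake)."""
    name: str
    ledgers: Dict[int, str]

    def get_ledger(self, sequence: int) -> str:
        return f"{self.name}:{self.ledgers[sequence]}"


@dataclass
class FallbackRpc:
    """Hypothetical single entry point that hides where the data lives."""
    local: LedgerStore
    archive: LedgerStore
    latest_ledger: int
    retention_window: int  # number of most recent ledgers kept locally

    def oldest_local_ledger(self) -> int:
        return max(1, self.latest_ledger - self.retention_window + 1)

    def get_ledger(self, sequence: int) -> str:
        if sequence < 1 or sequence > self.latest_ledger:
            raise LookupError(f"ledger {sequence} is out of range")
        if sequence >= self.oldest_local_ledger():
            # Inside the retention window: answer from local storage.
            return self.local.get_ledger(sequence)
        # Older than the retention window: delegate to the archive source.
        return self.archive.get_ledger(sequence)


local = LedgerStore("local", {seq: f"ledger-{seq}" for seq in range(901, 1001)})
archive = LedgerStore("archive", {seq: f"ledger-{seq}" for seq in range(1, 901)})
rpc = FallbackRpc(local, archive, latest_ledger=1000, retention_window=100)

recent = rpc.get_ledger(1000)  # served from local storage
old = rpc.get_ledger(5)        # served from the archive source
```

The caller only ever talks to `rpc`; which source answered is an internal detail, which is the property options 1 and 3 are after.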

I am personally partial to 1, for the following reasons:

Galexie will be faster and closer to real-time than Hubble (as compared to 3/4)
Clients use the same exact service API (RPC's). The 1 vs 2 or 3 vs 4 distinction is just effectively moving the lack-of-single-API issue to the client SDK layer, where we'd also likely need to re-solve it but in multiple different languages
RPC performance for recent history/current state does not suffer (as compared to 5), and this is still likely to be the majority of API calls to RPC

janewang · 2025-04-09T21:32:42Z

janewang
Apr 9, 2025

+1 on supporting an archival node option - it's a common concept in other ecosystems. Sometimes folks just want to run a node in archival mode. The answer shouldn’t be “use Horizon.”

I’m in favor of options 1, 3, and 5. Option 5 aligns with what node operators are already familiar with - so if we’re considering deviating from that industry standard, the alternative should deliver clear benefits to node operators, whether in terms of cost, usability, stability, or other meaningful gains.

I'd also like us to consider worst-case scenarios: what happens if Galexie and Hubble fail? Are there any single points of failure we might be overlooking?

Lastly, maintaining full archival of a network within one’s own infra is a hard requirement for some users (e.g., exchanges). Any proposed alternative has to offer a significant improvement over the status quo to drive adoption.

1 reply

Shaptic Apr 15, 2025
Maintainer

Sometimes folks just want to run a node in archival mode. The answer shouldn’t be “use Horizon.”

Can I ask you to clarify this? I don't get the push-back on Horizon: it literally is the archive node solution we've always had and built for this purpose. We also realized that running and maintaining one was a Bad Idea:tm: with a high expense, hence the history cutoff last year, but that doesn't stop others from running it. Running an RPC with full history would be much much worse than running Horizon because it was never built with this in mind and it won't be.

I guess what I'm asking is: what is so bad about Horizon? What functionality is lacking? If we knew better what the difference in perception is for downstream folks, we could bridge that gap and build something better that makes everyone happy without trying to reinvent the wheel and go down the same ugly scope creep path that Horizon followed.

johncanneto · 2025-04-09T22:17:41Z

johncanneto
Apr 9, 2025
Maintainer

Yes - I've heard something like this asked for at the start of most of the integrations I've been a part of. I like the idea of the 2nd approach because to the end user, it follows the RPC pattern. I think providing a single endpoint and abstracting away the how from the end user will be the best experience. Would option 2 allow the user to configure how far back they want to go?

2 replies

mollykarcher Apr 9, 2025
Maintainer Author

I think providing a single endpoint and abstracting away the how from the end user will be the best experience.

Based on this, I think you actually mean option 1, not option 2, right? Option 1 means clients/end-users only ever interact with RPC and its API directly. RPC internally will then either pull data from its on-disk storage or proxy through to Galexie, based on how old the data being requested is.

Edit: I added some sub-bullets in the descriptions for clarity

johncanneto Apr 23, 2025
Maintainer

late reply here: You are correct. 1 it is. Thanks for clarifying, @mollykarcher.

sreuland · 2025-04-11T23:47:38Z

sreuland
Apr 11, 2025
Maintainer

+1 for option 'numero uno' with a focus on RPC using a Galexie fueled Datastore as the interface to historical data.

Datastore is an abstraction and currently has one implementation for GCS bucket but will have more implementations over time such as a file-system implementation which equates to more scalability on how RPC deployments source historical data. For example, some RPC deployments might not have access to a cloud based GCS bucket for Datastore and it could instead be configured to use other implementations based on different physical storage layers. The ability for a Datastore to be pluggable to different backend storage layers like this is being discussed further in a separate design spike.
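
A rough sketch of what "pluggable" means here: callers depend on one small interface and the storage layer behind it can be swapped. This does not reproduce the actual Datastore interface; the class names, the `get_object` method, and the `ledger-<sequence>.bin` key layout are assumptions made up for the example.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class DataStore(ABC):
    """Hypothetical minimal storage interface: fetch an object by key."""

    @abstractmethod
    def get_object(self, key: str) -> bytes:
        ...


class InMemoryDataStore(DataStore):
    """Stand-in for a remote bucket; keeps objects in a dict."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def get_object(self, key: str) -> bytes:
        return self._objects[key]


class FilesystemDataStore(DataStore):
    """Reads objects from files under a local root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def get_object(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def read_ledger_blob(store: DataStore, sequence: int) -> bytes:
    # The key layout is invented for this example.
    return store.get_object(f"ledger-{sequence}.bin")


payload = b"serialized ledger 7"

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ledger-7.bin").write_bytes(payload)
    from_disk = read_ledger_blob(FilesystemDataStore(root), 7)

from_memory = read_ledger_blob(InMemoryDataStore({"ledger-7.bin": payload}), 7)
```

`read_ledger_blob` is identical for both backends, so a deployment without bucket access would only change which `DataStore` it constructs.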

For clients of RPC, acquiring this historical data will be focused on throughput, that is, how fast they can get it. Rather than changing the existing JSON-RPC endpoints to dynamically proxy to the appropriate back end for historical vs. near-term data, it might be worthwhile to retain existing JSON-RPC endpoints as-is to provide near-term data for native and browser-js clients. Consider RPC hosting a second service on a different port with a transport more optimal for just native clients receiving big data as fast as they can handle it: streams with back-pressure and no paging/polling. This could be GRPC streams. Here's a mock-up of how a GRPC interface for historical data could potentially look:

history_rpc.proto.zip

The PayloadType is a thought on how to tunnel serialized forms of the network XDR models through the GRPC stream transport.
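
As a loose analogy for "streams with back-pressure and no paging/polling", the sketch below uses a plain Python generator. It is not GRPC and does not reflect the attached proto; `stream_ledgers` and the `produced` list are hypothetical names. The point it illustrates is that a pull-driven stream only produces what the consumer is ready to take, with no page tokens to manage.

```python
from itertools import islice
from typing import Iterator, List

# Records which ledgers the producer actually generated.
produced: List[int] = []


def stream_ledgers(start: int, end: int) -> Iterator[str]:
    """Lazily yield ledgers; nothing is produced until the consumer asks."""
    for sequence in range(start, end + 1):
        produced.append(sequence)
        yield f"ledger-{sequence}"


# The consumer asks for a huge range but only pulls three items.
first_three = list(islice(stream_ledgers(1, 1_000_000), 3))
```

After this runs, `produced` holds only the three ledgers that were consumed, even though the requested range was a million ledgers long.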

1 reply

mollykarcher Apr 30, 2025
Maintainer Author

This is really interesting. We know that it's a desire of node providers to not have to deal with any additional service deployments. As an example, even Horizon has been stated to be undesirable because it requires postgres to be separately deployed/managed, whereas most RPC nodes in other chains are self-contained along with their data storage.

While we had discussed the idea of a local filesystem backend before, I had always imagined it being used just for testing. But given that we know in GCS the full size of the data lake is only about 3-4 TB, that's actually significantly smaller than the disk space required for an Ethereum archival node (~8-10 TB), and so this would actually be a completely feasible solution for a "full archival node", if RPC were to "fall back" to a local filesystem data lake.

leighmcculloch · 2025-04-14T02:33:29Z

leighmcculloch
Apr 14, 2025
Maintainer

Another option to consider is a combination of 1️⃣/3️⃣ + 2️⃣/4️⃣ , where 1️⃣ and 3️⃣ are actually the same thing, RPC falling back to another RPC, and then 2️⃣ and 4️⃣ are RPC APIs that can be used either directly or via fallback.

This idea isn't new, I think it's been discussed a few times going back to 2022. There was this doc that touched on it. Diagram copied below so anyone can see it. The diagram described things that are not relevant to this conversation, like pluggable functionality, but the part that is relevant is the idea that RPC could layer with other RPC-compatible APIs. The goal being to share resources, such as a stellar-core, or to cache, while allowing anyone to spin up light RPCs that could still do things like simulation.

The 1 vs 2 or 3 vs 4 distinction is just effectively moving the lack-of-single-API issue to the client SDK layer

This layering pattern removes the downsides between 1️⃣ and 2️⃣. The user experience becomes the same. What's different is the operator deployment model, and how they scale their instances.

For example, 2️⃣ may be an advantage if it also caches, allowing the cache to be shared by a fleet of 1️⃣s connected to a number of 2️⃣s. 2️⃣ may be a disadvantage if an operator would rather host with a simpler single-instance model.

3 replies

sreuland Apr 14, 2025
Maintainer

Much of the architecture of this diagram seems realizable as an outcome from 1️⃣ as well if we swap out Zenith for Galexie/Datastores? Is the ability to proxy other RPC instances for deployment resource maximization a key aspect? Seems like it could be an addition to existing 1️⃣ design pattern as it is another back-end pathway abstracted by the server, client doesn't know.

An interesting aspect on pluggable network data sourcing is also potential with 1️⃣ through Pluggable CDP Datastores.

If we merge that diagram with 1️⃣ further and look at protos for IDL which can re-use event models already expressed in protos and use GRPC for built-in streaming transport the high level design for short-term and historical data from a single rpc starts shaping up like:

leighmcculloch Apr 15, 2025
Maintainer

There's a lot going on in that diagram that doesn't seem like a requirement, for example, GRPC?

The option I'm describing is a future where, irrespective of whether the code is implemented in the existing RPC code base or separately, someone can run an RPC that is configured to collect data from other RPC-compatible API sources:

Options 1️⃣ vs 2️⃣ and 3️⃣ vs 4️⃣ mostly differ on where the implementation occurs and what pros/cons arise from that where.

What I'm saying is that the where doesn't matter if RPC uses its own API as a data source: we can implement them separately or together, but present them as one API to a developer, and by reusing the existing API we don't need to define any new APIs.

sreuland Apr 16, 2025
Maintainer

No requirements on implementation details are stated for the discussion such as GRPC. I think the goal is to debate ideas on how historical data can be realized from RPC. In that context, GRPC is proposed as part of architecture to accomplish this via 1️⃣ and by using more performant transport for clients to retrieve larger volumes of historical data rather than using JSON-RPC paging.

The historical data retrievals will likely entail larger result sets. Do we want to impose manual long polling/paging over JSON-RPC requests on clients to retrieve historical data results? Or is this an opportunity to provide a more fluent client interface to get larger volume of historical data reactively as fast as the client is capable of receiving with stream semantics, which is what GRPC can deliver.

I think both of these designs are otherwise quite similar in abstracting the real data sources behind the RPC interface, with some variance in the GRPC design proposal:

It depicts 'where the implementation occurs' as the RPC server back-end code, and does not use 'rpc api-compatible micro-services'
It proposes that server-side RPC clients can change their client code to use the new historical data streaming endpoints when it makes sense for their use case.

Shaptic · 2025-04-15T16:30:18Z

Shaptic
Apr 15, 2025
Maintainer

we'd like to provide access to historical data through RPC's API

I think we need to agree on a definition for "historical data" here. This definition starkly changes the architecture we'd want to follow. Do we strictly mean the paginated endpoints,

getTransactions
getLedgers
getEvents

or do we also want to include getTransaction(hash)? This requires indexing transaction hashes and can't be done via a Galexie proxy.
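
A toy illustration of why the hash lookup is a different problem from the paginated endpoints: if the archive is organised by ledger sequence, finding a transaction by hash means either scanning ledgers or maintaining a separate hash-to-ledger index. The data and function names below are hypothetical and the hashes are placeholders.

```python
from typing import Dict, List, Optional

# Toy archive keyed by ledger sequence: sequence -> transaction hashes.
LEDGERS: Dict[int, List[str]] = {
    1: ["aa", "ab"],
    2: ["ba"],
    3: ["ca", "cb"],
}


def find_by_scan(tx_hash: str) -> Optional[int]:
    """Without an index, every ledger may need to be read."""
    for sequence, hashes in LEDGERS.items():
        if tx_hash in hashes:
            return sequence
    return None


def build_hash_index(ledgers: Dict[int, List[str]]) -> Dict[str, int]:
    """The extra structure a hash lookup needs: hash -> ledger sequence."""
    return {h: sequence for sequence, hashes in ledgers.items() for h in hashes}


index = build_hash_index(LEDGERS)
by_scan = find_by_scan("ca")
by_index = index.get("ca")
```

Both approaches return the same ledger, but the scan grows with the length of history while the index has to be built, stored, and kept up to date somewhere.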

Should we architect with the expectation that the future will require some indexed getPayments or deep inspection getOrderBook? There's a sentiment https://github.com/stellar/stellar-rpc/discussions/401#discussioncomment-12783807 that we don't want to reference Horizon but if the scope creeps just a little more, we're basically already there... Ethereum's archive node takes >10TB of disk space to run which is ~1yr of Horizon (if DB growth rates are still accurate).

Anyway, it'd be valuable to have clarity on what exactly people expect when they say "archive node" and "historical data." Whether or not that includes indexed data has heavy implications for the architecture we want to recommend or build.

2 replies

leighmcculloch Apr 15, 2025
Maintainer

+1 we should be clearer about the type of historical data. I think all of these are important, but have different use cases. getLedgerEntries is the one most relevant to fork testing and is the one most pressing for:

leighmcculloch Apr 15, 2025
Maintainer

The fact that Galexie can support some historical data and not others suggests that a fallback model that doesn't require the user to navigate the options is even more important, because there might be different fallback options for different types of data (getLedgerEntries vs. getTransaction) and navigating these lower-level details could be challenging.

That suggests to me there's even more value in the idea presented at https://github.com/stellar/stellar-rpc/discussions/401#discussioncomment-12847493, where RPC can be layered with other RPC API-compatible implementations that expose subsets of the API, where different implementations target supporting these different types of historical data in ways that make most sense for them.
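
One way to picture layering over implementations that each expose a subset of the API: a dispatcher walks an ordered list of backends and uses the first one that both supports the method and has the data. Everything below is a hypothetical sketch; the backend names, the simplified `sequence` and `hash` parameters, and the `RECENT_FROM` cutoff are invented and are not the real request shapes.

```python
from typing import Callable, Dict, List, Optional

Handler = Callable[[dict], Optional[dict]]


class Backend:
    """An RPC-compatible source that may expose only some methods."""

    def __init__(self, name: str, handlers: Dict[str, Handler]) -> None:
        self.name = name
        self.handlers = handlers


def dispatch(backends: List[Backend], method: str, params: dict) -> dict:
    for backend in backends:
        handler = backend.handlers.get(method)
        if handler is None:
            continue  # this backend does not expose the method at all
        result = handler(params)
        if result is not None:
            return {"served_by": backend.name, **result}
    raise LookupError(f"no backend could serve {method}")


RECENT_FROM = 901  # hypothetical start of the local retention window


def recent_ledgers(params: dict) -> Optional[dict]:
    # Returns None for ledgers outside the window so the next backend is tried.
    if params["sequence"] >= RECENT_FROM:
        return {"ledger": params["sequence"]}
    return None


backends = [
    Backend("local", {"getLedgers": recent_ledgers}),
    Backend("data-lake", {"getLedgers": lambda p: {"ledger": p["sequence"]}}),
    Backend("tx-index", {"getTransaction": lambda p: {"hash": p["hash"]}}),
]

recent = dispatch(backends, "getLedgers", {"sequence": 950})
old = dispatch(backends, "getLedgers", {"sequence": 5})
tx = dispatch(backends, "getTransaction", {"hash": "ca"})
```

The client sees one API either way; which backends exist, and which methods each one covers, is purely an operator decision.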

johncanneto · 2025-05-05T15:19:58Z

johncanneto
May 5, 2025
Maintainer

Some feedback from dfuse on their ideal archival node/backfilling experience

Required RPC Endpoints: For EVM chains, only three core RPC endpoints are needed for the initial data pull: eth_blockNumber, eth_getBlockByNumber, and eth_getTransactionReceipt.
Ingestion Process: One-time data pull starting from the Genesis block. This data is then stored locally in compressed flat files, meaning they won't need to hit our node repeatedly for the same historical data.
Rate Limiting is Critical: The speed of this initial ingestion is heavily dependent on the rate limits imposed by the RPC node provider. Uncapped or very high limits are strongly preferred.
Desired Speed: They aim for around 900 requests per second.
Timeframe: With ideal (high/uncapped) rate limits, the full history ingestion should take less than two weeks, ideally within one week or even a few days.
With restrictive rate limits (using Tron's current limits as an example), the process could take as long as 44 days, which is considered problematic due to the risk of timeouts or failures during the long process.
Post-Ingestion Indexing: Once the data is ingested into their system, they can index it very quickly (e.g., Arbitrum from Genesis in ~12 hours, Solana in a few hours), taking the burden off the original node provider.
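
For a sense of scale, the arithmetic behind these numbers can be sketched as below. Only the 900 requests per second figure comes from the feedback above; the helper names are made up, and the calculation assumes a constant sustained rate with no retries or downtime.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_in_days(rate_per_second: float, days: float) -> int:
    """Total requests that fit in a period at a sustained rate."""
    return int(rate_per_second * days * SECONDS_PER_DAY)


def days_to_ingest(total_requests: int, rate_per_second: float) -> float:
    """Days needed to issue a given number of requests at a sustained rate."""
    return total_requests / (rate_per_second * SECONDS_PER_DAY)


one_week_budget = requests_in_days(900, 7)    # 544,320,000 requests
two_week_budget = requests_in_days(900, 14)   # 1,088,640,000 requests
check = days_to_ingest(one_week_budget, 900)  # 7.0 days
```

In other words, 900 requests per second sustained for a week is roughly 544 million requests, which gives a rough ceiling on how many calls a one-week backfill can make.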

CC @Shaptic @mollykarcher @2opremio

0 replies

mollykarcher · 2025-05-08T20:30:35Z

mollykarcher
May 8, 2025
Maintainer Author

Posting back here for posterity. We'll leave this discussion open to collect further comments as features in this area develop and progress, but for now we are moving forward as follows:

Option 1: Enable getLedgers to proxy to Galexie for requests outside its retention window stellar-rpc#425
- Quicker solution that we can use in the shorter term for onboarding data providers
- We expect that there will be a third party provider hosting a public data lake within the next ~4 weeks, making this option more attractive for those that may have been turned off by having to run Galexie
Option 5: Spike: assess feasibility and hardware requirements for an archival node stellar-rpc#424
- More of a spike/investigation into what a longer term solution could be and what technical challenges it would entail
- In a possible outcome where this spike finds that option 5 is not possible, option 1 is already available as a fallback

0 replies
