Skip to content

Implement Cross-Cluster Rate Limiting in Gateway #692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
vishalya opened this issue May 29, 2025 · 12 comments
Open

Implement Cross-Cluster Rate Limiting in Gateway #692

vishalya opened this issue May 29, 2025 · 12 comments
Assignees

Comments

@vishalya
Copy link
Member

Motivation

The gateway distributes queries across multiple Trino clusters, making it the ideal place to enforce query limits per user or user group. This ensures fair usage and prevents overloading any single cluster.

Enhancement Proposal

1. Central State for Gateway

  • The gateway currently lacks a central state due to its distributed nature.
  • Introduce Redis (or compatible forks like Valkey) for fast, centralized state management.
  • Configuration for this shared state will be provided via a config file.
  • The central state will be abstracted to allow future migration to other storage systems.
  • If the central configuration is missing, the gateway will operate as before.
  • It should be extendable to store cluster statistics for use by routers.

2. Rate Limiting Configuration

  • Configuration will follow or closely resemble the Trino resource group schema.
  • Store configurations in a database.
  • Refresh configurations at regular intervals.
  • Load configurations into the central state for efficient access.

3. Rate Limiting Logic

  • Support request bursts to maintain a good user experience.
  • Limit queries based on:
    • User
    • User group
    • Query types
@RoeyoOgen
Copy link
Contributor

Great addition to the project!

A few comments:

  1. I'd love for an expansion of the "state" that's being saved, could you provide an example JSON for what this might look like?

  2. Maybe add a flow diagram to where this will sit in the flow of the query. For instance, it could be before the routing logic (and thus save redundant calls to the router) Or alternatively, be after as we want to rate-limit per cluster.

@andythsu
Copy link
Member

andythsu commented Jun 3, 2025

@RoeyoOgen

  1. We are thinking to make this generic enough so that each company can store whatever they want to the db choice they desire (not necessarily Redis). For now we are thinking to provide an interface like
interface CentralCache<T> {
    save(key): void
    delete(): void
    get(): T
}

This will help abstract the underlying logic and help modularize the code.

  1. It's a bit hard to decide where to put this logic because as you mentioned, we could put it before or after the routing logic. I was thinking to maybe just leave it for users to decide. I guess the first cut of this issue can just focus on getting CentralCache as a module into gateway, then we can decide what to do with this.

LMK your thoughts

@EdenKik
Copy link

EdenKik commented Jun 3, 2025

Hey,
Happy to see motivation for this feature :)

A few questions:

  1. Regarding the query flow mentioned above, do you think applying rate limiting after the routing logic is safe? Since clients could potentially send queries directly to different Trino cluster, wouldn't enforcing limits before routing be more effective?

  2. It would be great to support additional query limit parameters, such as limiting by IP address or by query source (Superset, Python client, etc.). These would be very beneficial for our use cases.

Also was wondering, which rate limiting algorithm are you considering for this? From my previous research, the token bucket algorithm seems like it suites good for this use case

@RoeyoOgen
Copy link
Contributor

RoeyoOgen commented Jun 3, 2025

@andythsu
Regarding:

  1. We are thinking to make this generic enough so that each company can store whatever they want to the db choice they desire (not necessarily Redis). For now we are thinking to provide an interface like

I don't seem to understand how this would work.

Wouldn't the "state" need to be the same for everyone? How will you make this customisable?

Also, does this approach expand towards a system similar to the routing rule system where limitations can use cached items as parameters?

Perhaps after @vishalya will add some sort of flow / pseudocode it will be clearer

@andythsu
Copy link
Member

andythsu commented Jun 3, 2025

I don't think the "state" needs to be the same for everyone.... eventually you could come up with your own routing rule that decides what to do by looking at the cache. Also, it is impossible to satisfy everyone's requirements in Redis. Some may want a specific format of the key, some may want a different eviction strategy... etc. So I think enforcing the same state for everyone is a bit hard

@RoeyoOgen
Copy link
Contributor

how do you plan to make the state customisable then? where will a user configure his cache properties?

@andythsu
Copy link
Member

andythsu commented Jun 4, 2025

I'm still thinking about it. Do you have any ideas on that?

@RoeyoOgen
Copy link
Contributor

Not really :( , we can maybe hold a dedicated Google-meet to further discuss this and the other points above

@andythsu
Copy link
Member

andythsu commented Jun 5, 2025

Yep. I think as starting point we can start a thread in trino-gateway channel, and then discuss the points during bi-weekly sync meetings. Here's the thread https://trinodb.slack.com/archives/C059KUNPTSP/p1749154145837889

@oneonestar
Copy link
Member

  1. The resource consumption for each query could vary by a lot. Do you plan
    to use query runtime information for the rate limiting logic? For example,
    a query run for 2hr is heavier than 10 queries running for 10s.
    Without monitoring the actual resource consumption for the running
    queries could result in a non-fair rate limit.

  2. Is the rate limiting system planned to work with Trino resource group or
    replacing it?

  3. Are multiple rate limiting algorithms being supported? It'll be difficult
    to share a common configuration for different routing algorithms. We
    should make it flexible when designing the configuration.

@vishalya
Copy link
Member Author

vishalya commented Jun 9, 2025

Thanks for all the feedback, I'll bring these points up in this week's dev sync, and we can dive deeper into our scope and ideas there.

@andythsu
Copy link
Member

Here's the architectural diagram for the rate limiting module. We will follow similar structure as Trino's resource groups

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants