Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust rate-limit usage on partial ingestion failure #3890

Open
bboreham opened this issue Mar 1, 2021 · 7 comments
Open

Adjust rate-limit usage on partial ingestion failure #3890

bboreham opened this issue Mar 1, 2021 · 7 comments

Comments

@bboreham
Copy link
Contributor

bboreham commented Mar 1, 2021

After #3825 (since reverted), any failure to ingest samples will cause the rate-limit reservation to be canceled.
However it is quite possible in Cortex that some samples were accepted and some rejected; we can only send one result back to the caller so we send an error.

I think we would need a special gRPC error object to carry back the count of succeeded/failed samples to get this more accurate.

@pracucci
Copy link
Contributor

pracucci commented Mar 1, 2021

You're right. What if we cancel only in case the error is a httpgrpc 5xx error or a non-httpgrpc error?

@bboreham
Copy link
Contributor Author

bboreham commented Mar 1, 2021

Right now I have some samples rejected due to being over the limit on series per metric, and the same user being over rate-limit, so #3825 (which we haven't rolled out yet) should improve matters.

Your suggestion would then make things worse, in this particular case.

@stevesg
Copy link
Contributor

stevesg commented Mar 1, 2021

A simple and minor improvement would be to only roll-back if all the samples fail to ingest.

This would still help with the original issue of ingesters being unavailable, but prevent a single bad sample from circumventing the rate limit (reverting to existing behavior).

@bboreham
Copy link
Contributor Author

bboreham commented Mar 1, 2021

How would we know that?

@stevesg
Copy link
Contributor

stevesg commented Mar 1, 2021

Good question, scratch that...

@pracucci
Copy link
Contributor

pracucci commented Mar 2, 2021

Your suggestion would then make things worse, in this particular case.

My suggestion was to cancel the rate-limiter reservation only in the case the distributor returns a 5xx, which means the client will retry it, regardless some samples have been ingested or not. I understand it's not as accurate as you propose (count the exact number of samples ingested), but may be a good compromise to solve the original issue which was the case when 2+ ingesters are unhealthy.

@bboreham
Copy link
Contributor Author

bboreham commented Mar 2, 2021

My point is that I don't want to report two errors when there is only one.
I understand that your suggestion solves your issue, but this is my issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants