Adjust rate-limit usage on partial ingestion failure #3890
Comments
You're right. What if we cancel only when the error is an httpgrpc 5xx error or a non-httpgrpc error?
Right now I have some samples rejected for being over the series-per-metric limit while the same user is also over the rate limit, so #3825 (which we haven't rolled out yet) should improve matters. Your suggestion would then make things worse in this particular case.
A simple, minor improvement would be to roll back only if all the samples fail to ingest. This would still help with the original issue of ingesters being unavailable, but would prevent a single bad sample from circumventing the rate limit (that is, it would revert to the existing behavior in that case).
How would we know that?
Good question, scratch that...
My suggestion was to cancel the rate-limiter reservation only when the distributor returns a 5xx, which means the client will retry the request regardless of whether some samples have been ingested. I understand it's not as accurate as what you propose (counting the exact number of samples ingested), but it may be a good compromise for solving the original issue, which was the case where 2+ ingesters are unhealthy.
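To make the proposed rule concrete, here is a minimal Go sketch of the decision only. It is not Cortex code: `statusError` is a hypothetical stand-in for an httpgrpc error that carries an HTTP status code, and `shouldCancelReservation` and `errTimeout` are illustrative names. It assumes the combined rule from this thread: cancel on a 5xx or on an error with no status at all, and keep the reservation on success or on a 4xx.

```go
package main

import (
	"errors"
	"fmt"
)

// statusError is a hypothetical stand-in for an httpgrpc error:
// an error that carries an HTTP status code.
type statusError struct {
	Code int
	Msg  string
}

func (e *statusError) Error() string { return fmt.Sprintf("%d: %s", e.Code, e.Msg) }

// errTimeout is an example of an error that carries no status code.
var errTimeout = errors.New("context deadline exceeded")

// shouldCancelReservation reports whether the rate-limit reservation taken
// for a push should be handed back (assumed policy, see the text above).
func shouldCancelReservation(err error) bool {
	if err == nil {
		return false // push succeeded, so the reservation was legitimately used
	}
	var se *statusError
	if errors.As(err, &se) {
		// 5xx: the client is expected to retry, so hand the reservation back.
		// Anything else (e.g. 4xx): the request was rejected, keep the charge.
		return se.Code >= 500 && se.Code <= 599
	}
	// No status code at all (timeout, connection failure): treat like a 5xx.
	return true
}

func main() {
	fmt.Println(shouldCancelReservation(&statusError{Code: 503, Msg: "ingesters unavailable"})) // true
	fmt.Println(shouldCancelReservation(&statusError{Code: 400, Msg: "series limit exceeded"})) // false
	fmt.Println(shouldCancelReservation(errTimeout))                                            // true
}
```

As noted in the comment, this is coarser than counting ingested samples: a 5xx returned after some samples were already accepted still hands back the whole reservation.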
My point is that I don't want to report two errors when there is only one.
After #3825 (since reverted), any failure to ingest samples will cause the rate-limit reservation to be canceled.
However, it is quite possible in Cortex that some samples were accepted and some rejected; we can only send one result back to the caller, so we send an error.
I think we would need a special gRPC error object that carries back the count of succeeded and failed samples to make this more accurate.
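As a rough illustration of that idea, the sketch below uses a plain Go error type in place of a real gRPC error object. Everything in it is hypothetical: `partialIngestError`, `samplesToRefund`, and `errUnavailable` are not Cortex or gRPC APIs. It assumes that only failed samples are handed back to the rate limiter, and that an error without counts falls back to canceling the whole reservation, as described above for #3825.

```go
package main

import (
	"errors"
	"fmt"
)

// partialIngestError is a hypothetical error that reports how many samples
// of a push were accepted and how many were not.
type partialIngestError struct {
	Succeeded int
	Failed    int
	Cause     error
}

func (e *partialIngestError) Error() string {
	return fmt.Sprintf("%d samples ingested, %d failed: %v", e.Succeeded, e.Failed, e.Cause)
}

func (e *partialIngestError) Unwrap() error { return e.Cause }

// errUnavailable is an example of an error that carries no counts.
var errUnavailable = errors.New("ingesters unavailable")

// samplesToRefund returns how many of the reserved samples should be handed
// back to the rate limiter for a push that returned err.
func samplesToRefund(reserved int, err error) int {
	if err == nil {
		return 0 // everything was ingested, nothing to hand back
	}
	var pe *partialIngestError
	if errors.As(err, &pe) {
		// Hand back only the failed samples, clamped to [0, reserved].
		switch {
		case pe.Failed < 0:
			return 0
		case pe.Failed > reserved:
			return reserved
		default:
			return pe.Failed
		}
	}
	// No counts available: cancel the whole reservation.
	return reserved
}

func main() {
	partial := &partialIngestError{Succeeded: 70, Failed: 30, Cause: errUnavailable}
	fmt.Println(samplesToRefund(100, partial))        // 30
	fmt.Println(samplesToRefund(100, errUnavailable)) // 100
	fmt.Println(samplesToRefund(100, nil))            // 0
}
```

Whether samples rejected for exceeding a limit should be refunded at all is a separate policy question raised in the comments; this sketch does not distinguish the reason a sample failed.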