StatusException: UNAVAILABLE on client when maxConnectionAge is set on server #9566
Comments
I looked at the logs but I don't see any server-side logs. Also, in "Client sends requests with large response size (4 MB)", did you by any chance mean to say "Client sends requests with large request message size (4 MB)"?
The logs in the minimal project are indeed only client-side, and are there primarily to show loop progress and exceptions with stack traces. In this project, the server runs in the same process as the client, all created and started in main().
I meant to say: the client sends requests, and to each request the server sends a response with a large message size (4 MB). This is what the minimal project tests. We also observed the same exception with different code, so it also happens when the client sends requests with a large request message size - again, not for each request, but eventually the exception is thrown. That case is not reflected in the minimal project.
Okay, I am taking a look at your repo - the source code and the logs. It does look like there's a bug. One question: in the logs I see lines like
Can you explain what each of the numbers is? When does the
There is a nested loop in the code:
The first two numbers are
There is no timestamp; the exception happens at different iteration numbers on different runs of the same code. I had some runs that did not throw an exception at all, but most of them do - sooner or later.
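The reporter's snippet did not survive extraction. As a stand-in, a nested loop of roughly this shape would match the description, with the two printed numbers taken to be the outer and inner iteration indices; that reading, and the EchoCoroutineStub / EchoRequest names, are assumptions rather than the actual reproducer code:

```kotlin
import io.grpc.StatusException
import kotlinx.coroutines.runBlocking

// Sketch only: `stub` and `request` stand in for the reproducer's generated
// grpc-kotlin stub and a request whose response carries a ~4 MB payload.
fun hammer(stub: EchoCoroutineStub, request: EchoRequest) = runBlocking {
    repeat(100) { outer ->
        repeat(100) { inner ->
            try {
                stub.echo(request) // unary call returning the large response
            } catch (e: StatusException) {
                // Prints something like "12 34 failed: Status{code=UNAVAILABLE, ...}"
                println("$outer $inner failed: ${e.status}")
                throw e
            }
        }
    }
}
```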
And the exception happens long before the maxConnectionAgeGrace of 300 seconds (5 minutes), is that right? In that case this is most probably an issue. CC @ejona86
That is right. I have updated the code to print timestamps, and here is the resulting output. The exception happened less than 5 seconds after the first request.
Okay, so I looked at the code and I realized this is working as expected with your values of maxConnectionAge of 1 second and maxConnectionAgeGrace of 300 seconds. Note that once 1 second has elapsed, the server will only allow existing RPCs (those started before the second elapsed) to finish. New RPCs after that 1 second may not be allowed. In this case your RPCs were allowed for about 4 to 5 seconds, which is beyond the 1-second window, but I don't think that's an issue for you. Do you agree?
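For reference, the configuration being discussed comes down to two NettyServerBuilder calls. A minimal sketch, with the port and EchoService as placeholders for the reproducer's actual values:

```kotlin
import io.grpc.netty.NettyServerBuilder
import java.util.concurrent.TimeUnit

// Sketch of the settings under discussion; EchoService is a placeholder for the
// reproducer's actual service implementation.
val server = NettyServerBuilder.forPort(50051)
    // Roughly 1 second after a connection is established, the server GOAWAYs it
    // and stops accepting new RPCs on it.
    .maxConnectionAge(1, TimeUnit.SECONDS)
    // RPCs already in flight get up to 5 minutes to finish before a forced close.
    .maxConnectionAgeGrace(300, TimeUnit.SECONDS)
    .addService(EchoService())
    .build()
    .start()
```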
Why do you say that? The original connection was probably already closed; we're looking at the 2nd or 3rd connection. And each connection should be getting a graceful GOAWAY, where it doesn't cause RPCs to fail due to the network race. There are internal gRPC races, but those should be handled by transparent retry. In any case, this RPC seems like it was accepted by the server, because we use a different error message for RPCs rejected by GOAWAY (it generally mentions "abrupt" goaway). This message simply means there was a GOAWAY in the past for the connection, and then later the connection happened to be closed.
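One way to see which of those two paths produced a failure is to log the full status and cause on the client. An illustrative snippet; doRpc() is a placeholder for the actual call:

```kotlin
import io.grpc.StatusException

// Illustrative: the status description distinguishes an RPC rejected by an
// "abrupt" GOAWAY from one that was accepted and then lost when the
// connection later closed.
suspend fun callAndLog() {
    try {
        doRpc() // placeholder for the actual RPC
    } catch (e: StatusException) {
        println("code=${e.status.code}, description=${e.status.description}")
        e.status.cause?.printStackTrace()
    }
}
```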
I'm wondering if this is caused by a TCP RST getting received before the trailers have been processed. I used
I'm seeing the same issue in our prod environment. Any updates here?
@ejona86 I've debugged this problem using the provided reproducer and I could confirm that, sometimes, the function
When I move the call to
@laurovenancio, I'm suspicious of that change.
@ejona86 It doesn't seem to be just a timing issue because I am still able to consistently reproduce the issue when I add a 100ms sleep at the
The following stack trace shows that the
So, it seems to me that the root cause of this issue is the call to
It seems, to fix it, we have to move the
Oh, right, we are extending that HTTP/2 base class, so the call flow becomes complicated. The issue at hand is "why are there active streams when the channel becomes inactive (the TCP connection closes)?" And the RST I found is a smoking gun for that. That changes the question to "why is the RST getting sent / how can we avoid the RST?"
I'd have to look very closely, but right now I'd be worried that such a change orphans some streams and they never get closed. So you might not be seeing an error anymore because they hang instead.
The provided reproducer shuts down both the client and the server when the issue is detected. I believe the RST we see in the tcpdump comes from this shutdown process and not from the issue. I've modified the reproducer to always run until the end. After this change, no RST is sent when the problem is detected.
I agree that just moving the call to
I've validated that all gRPC calls finished correctly, so there was no hang.
@ejona86 You are right. To move the
After debugging a little more, I found the following timeline:
This kind of issue has already been described in some HTTP/2 implementations. For instance, it was fixed by the following patch in the Go HTTP/2 implementation.
Oh, I really like that theory. Although, looking at ejona-trial1.pcap.gz above, it doesn't entirely fit, as the client wasn't sending HTTP/2 frames. The client does send a TCP ACK though, and it didn't ACK all the remaining data. So I think the general idea is likely right. Delaying a second seems fair, in terms of complexity. I'm not quite sure where we'd want to make this change: in Netty or in gRPC.
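The Go patch referenced above delays closing the socket after the final GOAWAY so the peer has time to read everything already written. Expressed against Netty's public API, the idea looks roughly like the sketch below; this is only an illustration of the approach, not how grpc-java or Netty actually implement it, and DelayedCloseHandler is a made-up name:

```kotlin
import io.netty.channel.ChannelHandlerContext
import io.netty.channel.ChannelOutboundHandlerAdapter
import io.netty.channel.ChannelPromise
import java.util.concurrent.TimeUnit

// Sketch only: postpone the actual socket close so data already written
// (DATA frames, trailers, the final GOAWAY) can reach the peer before the
// FIN/RST races it.
class DelayedCloseHandler(private val delayMillis: Long) : ChannelOutboundHandlerAdapter() {
    override fun close(ctx: ChannelHandlerContext, promise: ChannelPromise) {
        ctx.executor().schedule(Runnable { ctx.close(promise) }, delayMillis, TimeUnit.MILLISECONDS)
    }
}
```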
Was this issue resolved?
Hello, is a fix planned for this one?
@glyiu98 @jinseo-jang Are you still seeing this? We are unable to reproduce it in 1.68.1.
I'd expect this is still happening with 1.68.1. It is racy, so testing is very timing-sensitive.
@ejona86 @shivaspeaks I have run the POC locally using different versions, from v1.49.0 to v1.68.2, including v1.68.1. I observed that the issue is more likely to occur in the older versions. However, starting from v1.60.0, I am unable to see the issue, and it is not reproducible in the latest versions. I also ran v1.68.1 multiple times and was unable to reproduce it. POC: https://github.com/NaveenPrasannaV/MaxConnectionAgeV2.git
Screenshots are really annoying to work with. I accept there are times you may have to use them, but "logs" is the worst usage of screenshots, as I can't Ctrl+F or copy/paste, and I have to zoom in really far to make them legible. Simply copying and pasting the text into your document would have been superior. The change in behavior is most important, so seeing the error in 1.59.0 and not in 1.60.0 is the take-away. But since it is a race, that doesn't guarantee it is fixed, as we may have just changed the timing. Your reproducer doesn't reproduce on my machine, for example. I don't see any changes that would have fixed this. I looked through all the gRPC changes between 1.59.0 and 1.60.0. We did upgrade Netty, though: 7ba30f6. But I looked through Netty 4.1.98-100 and didn't see any relevant changes there either. One test to do would be to use grpc-java 1.59.0 but upgrade Netty:
Those three were chosen because they are brought in by grpc-netty. If the reproduction stops, you can then figure out which Netty release changed the behavior/timing.
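The list of three artifacts was in a snippet that did not survive extraction. As a generic stand-in, Gradle can force every io.netty module to the Netty version under test (Kotlin DSL sketch; the version string is just an example):

```kotlin
// build.gradle.kts - illustrative only; this pins all io.netty artifacts rather
// than the three specific ones named in the original comment.
configurations.all {
    resolutionStrategy.eachDependency {
        if (requested.group == "io.netty") {
            useVersion("4.1.100.Final")
        }
    }
}
```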
I tried using Netty version 4.1.100.Final and gRPC version v1.59.0. I was able to reproduce the exception 1 or 2 times out of 10 runs. I am attaching the error logs as documents.
@ejona86 @shivaspeaks I tried with a local build of gRPC Java v1.60.0 after reverting the changes from PRs #9314 and #10563. Previously, the issue was racy and occurred rarely. However, after reverting these two merged PR changes, I can now reproduce the issue more frequently in my local environment. Most likely, #10563 had an impact, as it includes changes to NettyClientHandler. Could you provide some suggestions on this?
What version of gRPC-Java are you using?
1.49.0
Also grpc-kotlin 1.3.0
What is your environment?
macOS 12.5.1 with OpenJDK 17.0.2
Ubuntu 18.04.3 with OpenJDK 17.0.3
What did you expect to see?
The gRPC NettyServer is configured with maxConnectionAge of 1 second and maxConnectionAgeGrace of 300 seconds. The client sends requests with a large response size (4 MB). The client receives responses successfully.
What did you see instead?
Eventually the client throws io.grpc.StatusException: UNAVAILABLE after receiving GOAWAY from the server. We first saw this behavior with a large request size on our production and testing environments running on Linux. Then we were able to reproduce it using a minimal project with a large response size on macOS.
Steps to reproduce the bug
Run the main() function in Main.kt, or use ./gradlew run.
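Not the actual Main.kt, but a sketch of the single-process setup the report and the earlier comments describe; the port, EchoService, and the request loop are placeholders:

```kotlin
import io.grpc.ManagedChannel
import io.grpc.netty.NettyChannelBuilder
import io.grpc.netty.NettyServerBuilder
import java.util.concurrent.TimeUnit

// Sketch: server and client run in the same JVM, as in the reproducer.
fun main() {
    val server = NettyServerBuilder.forPort(50051)
        .maxConnectionAge(1, TimeUnit.SECONDS)
        .maxConnectionAgeGrace(300, TimeUnit.SECONDS)
        .addService(EchoService()) // placeholder service returning ~4 MB responses
        .build()
        .start()

    val channel: ManagedChannel = NettyChannelBuilder
        .forAddress("localhost", 50051)
        .usePlaintext()
        .build()

    try {
        // ... drive the request loop against a stub bound to `channel` ...
    } finally {
        channel.shutdownNow()
        server.shutdownNow()
        server.awaitTermination()
    }
}
```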