StatusException: UNAVAILABLE on client when maxConnectionAge is set on server #9566
Comments
I looked at the logs but I don't see any server-side logs. Also, in "Client sends requests with large response size (4 MB)", did you by any chance mean to say "Client sends requests with large request message size (4 MB)"?
The logs in the minimal project are indeed only client-side, and are there primarily to show loop progress and exceptions with stack traces. In this project, the server runs in the same process as the client, all created and started in main().
I meant to say: the client sends requests, and to each request the server sends a response with a large message size (4 MB). This is what the minimal project tests. We also observed the same exception with different code, so it also happens when the client sends requests with a large request message size - again, not for each request, but eventually the exception is thrown. That case is not reflected in the minimal project.
Okay, I am taking a look at your repo - the source code and the logs. It does look like there's a bug. One question: in the logs I see lines like
Can you explain what each of the numbers is? When does the
There is a nested loop in the code:
The first two numbers are
There is no timestamp; the exception happens at different iteration numbers on different runs of the same code. I had some runs that did not throw an exception at all, but most of them do - sooner or later.
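The reporter's snippet did not survive extraction. As a stand-in, a nested loop of roughly this shape would match the description, with the two printed numbers taken to be the outer and inner iteration indices; that reading, and the EchoCoroutineStub / EchoRequest names, are assumptions rather than the actual reproducer code:

```kotlin
import io.grpc.StatusException
import kotlinx.coroutines.runBlocking

// Sketch only: `stub` and `request` stand in for the reproducer's generated
// grpc-kotlin stub and a request whose response carries a ~4 MB payload.
fun hammer(stub: EchoCoroutineStub, request: EchoRequest) = runBlocking {
    repeat(100) { outer ->
        repeat(100) { inner ->
            try {
                stub.echo(request) // unary call returning the large response
            } catch (e: StatusException) {
                // Prints something like "12 34 failed: Status{code=UNAVAILABLE, ...}"
                println("$outer $inner failed: ${e.status}")
                throw e
            }
        }
    }
}
```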
And the exception happens long before the maxConnectionAgeGrace of 300 seconds (5 minutes), is that right? In that case this is most probably an issue. CC @ejona86
That is right. I have updated the code to print timestamps, and here is the resulting output. The exception happened less than 5 seconds after the first request.
Okay, so I looked at the code and I realized this is working as expected with your values of maxConnectionAge of 1 second and maxConnectionAgeGrace of 300 seconds. Note that once 1 second has elapsed, the server will only allow existing RPCs (those started before the second elapsed) to finish. New RPCs after that 1 second may not be allowed. In this case your RPCs were allowed for about 4 to 5 seconds, which is beyond the 1-second window, but I don't think that's an issue for you. Do you agree?
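For reference, the configuration being discussed comes down to two NettyServerBuilder calls. A minimal sketch, with the port and EchoService as placeholders for the reproducer's actual values:

```kotlin
import io.grpc.netty.NettyServerBuilder
import java.util.concurrent.TimeUnit

// Sketch of the settings under discussion; EchoService is a placeholder for the
// reproducer's actual service implementation.
val server = NettyServerBuilder.forPort(50051)
    // Roughly 1 second after a connection is established, the server GOAWAYs it
    // and stops accepting new RPCs on it.
    .maxConnectionAge(1, TimeUnit.SECONDS)
    // RPCs already in flight get up to 5 minutes to finish before a forced close.
    .maxConnectionAgeGrace(300, TimeUnit.SECONDS)
    .addService(EchoService())
    .build()
    .start()
```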
Why do you say that? The original connection was probably already closed; we're looking at the 2nd or 3rd connection. And each connection should be getting a graceful GOAWAY, where it doesn't cause RPCs to fail due to the network race. There are internal gRPC races, but those should be handled by transparent retry. In any case, this RPC seems like it was accepted by the server, because we use a different error message for RPCs rejected by GOAWAY (it generally mentions "abrupt" goaway). This message simply means there was a GOAWAY in the past for the connection, and then later the connection happened to be closed.
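One way to see which of those two paths produced a failure is to log the full status and cause on the client. An illustrative snippet; doRpc() is a placeholder for the actual call:

```kotlin
import io.grpc.StatusException

// Illustrative: the status description distinguishes an RPC rejected by an
// "abrupt" GOAWAY from one that was accepted and then lost when the
// connection later closed.
suspend fun callAndLog() {
    try {
        doRpc() // placeholder for the actual RPC
    } catch (e: StatusException) {
        println("code=${e.status.code}, description=${e.status.description}")
        e.status.cause?.printStackTrace()
    }
}
```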
I'm wondering if this is caused by a TCP RST getting received before the trailers have been processed. I used
I'm seeing the same issue in our prod environment. Any updates here?
@ejona86 I've debugged this problem using the provided reproducer and I could confirm that, sometimes, the function
When I move the call to
@laurovenancio, I'm suspicious of that change.
@ejona86 It doesn't seem to be just a timing issue because I am still able to consistently reproduce the issue when I add a 100ms sleep at the
The following stack trace shows that the
So, it seems to me that the root cause of this issue is the call to
It seems, to fix it, we have to move the
Oh, right, we are extending that HTTP/2 base class, so the call flow becomes complicated. The issue at hand is "why are there active streams when the channel becomes inactive (the TCP connection closes)?" And the RST I found is a smoking gun for that. That changes the question to "why is the RST getting sent / how can we avoid the RST?"
I'd have to look very closely, but right now I'd be worried that such a change orphans some streams and they never get closed. So you might not be seeing an error anymore because they hang instead.
The provided reproducer shuts down both the client and the server when the issue is detected. I believe the RST we see in the tcpdump comes from this shutdown process and not from the issue. I've modified the reproducer to always run until the end. After this change, no RST is sent when the problem is detected.
I agree that just moving the call to
I've validated that all gRPC calls finished correctly, so there was no hang.
@ejona86 You are right. To move the
After debugging a little more, I found the following timeline:
This kind of issue has already been described in some HTTP/2 implementations. For instance, it was fixed by the following patch in the Go HTTP/2 implementation.
Oh, I really like that theory. Although, looking at ejona-trial1.pcap.gz above, it doesn't entirely fit, as the client wasn't sending HTTP/2 frames. The client does send a TCP ACK though, and it didn't ACK all the remaining data. So I think the general idea is likely right. Delaying a second seems fair, in terms of complexity. I'm not quite sure where we'd want to make this change: in Netty or in gRPC.
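The Go patch referenced above delays closing the socket after the final GOAWAY so the peer has time to read everything already written. Expressed against Netty's public API, the idea looks roughly like the sketch below; this is only an illustration of the approach, not how grpc-java or Netty actually implement it, and DelayedCloseHandler is a made-up name:

```kotlin
import io.netty.channel.ChannelHandlerContext
import io.netty.channel.ChannelOutboundHandlerAdapter
import io.netty.channel.ChannelPromise
import java.util.concurrent.TimeUnit

// Sketch only: postpone the actual socket close so data already written
// (DATA frames, trailers, the final GOAWAY) can reach the peer before the
// FIN/RST races it.
class DelayedCloseHandler(private val delayMillis: Long) : ChannelOutboundHandlerAdapter() {
    override fun close(ctx: ChannelHandlerContext, promise: ChannelPromise) {
        ctx.executor().schedule(Runnable { ctx.close(promise) }, delayMillis, TimeUnit.MILLISECONDS)
    }
}
```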
Was this issue resolved?
Hello, is a fix planned for this one?
@glyiu98 @jinseo-jang Are you still seeing this? We are unable to reproduce it in 1.68.1.
I'd expect this is still happening with 1.68.1. It is racy, so testing is very timing-sensitive.
@ejona86 @shivaspeaks I have run the POC locally using different versions, from v1.49.0 to v1.68.2, including v1.68.1. I observed that the issue is more likely to occur in the older versions. However, starting from v1.60.0, I am unable to see the issue, and it is not reproducible in the latest versions. I also ran v1.68.1 multiple times and was unable to reproduce it. POC: https://github.com/NaveenPrasannaV/MaxConnectionAgeV2.git
Screenshots are really annoying to work with. I accept there are times you may have to use them, but "logs" is the worst usage of screenshots, as I can't Ctrl+F or copy/paste, and I have to zoom in really far to make them legible. Simply copying and pasting the text into your document would have been superior. The change in behavior is most important, so seeing the error in 1.59.0 and not in 1.60.0 is the take-away. But since it is a race, that doesn't guarantee it is fixed, as we may have just changed the timing. Your reproducer doesn't reproduce on my machine, for example. I don't see any changes that would have fixed this. I looked through all the gRPC changes between 1.59.0 and 1.60.0. We did upgrade Netty, though: 7ba30f6. But I looked through Netty 4.1.98-100 and didn't see any relevant changes there either. One test to do would be to use grpc-java 1.59.0 but upgrade Netty:
Those three were chosen because they are brought in by grpc-netty. If the reproduction stops, you can then figure out which Netty release changed the behavior/timing.
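The list of three artifacts was in a snippet that did not survive extraction. As a generic stand-in, Gradle can force every io.netty module to the Netty version under test (Kotlin DSL sketch; the version string is just an example):

```kotlin
// build.gradle.kts - illustrative only; this pins all io.netty artifacts rather
// than the three specific ones named in the original comment.
configurations.all {
    resolutionStrategy.eachDependency {
        if (requested.group == "io.netty") {
            useVersion("4.1.100.Final")
        }
    }
}
```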
I tried using Netty version 4.1.100.Final and gRPC version v1.59.0. I was able to reproduce the exception 1 or 2 times out of 10 runs. I am attaching the error logs as documents.
@ejona86 @shivaspeaks I tried with a local build of gRPC Java v1.60.0 after reverting the changes from PRs #9314 and #10563. Previously, the issue was racy and occurred rarely. However, after reverting these two merged PR changes, I can now reproduce the issue more frequently in my local environment. Most likely, #10563 had an impact, as it includes changes to NettyClientHandler. Could you provide some suggestions on this?
What version of gRPC-Java are you using?
1.49.0
Also grpc-kotlin 1.3.0
What is your environment?
macOS 12.5.1 with OpenJDK 17.0.2
Ubuntu 18.04.3 with OpenJDK 17.0.3
What did you expect to see?
The gRPC NettyServer is configured with maxConnectionAge of 1 second and maxConnectionAgeGrace of 300 seconds. The client sends requests with a large response size (4 MB). The client receives responses successfully.
What did you see instead?
Eventually the client throws io.grpc.StatusException: UNAVAILABLE after receiving GOAWAY from the server. We first saw this behavior with a large request size on our production and testing environments running on Linux. Then we were able to reproduce it using a minimal project with a large response size on macOS.
Steps to reproduce the bug
Run the main() function in Main.kt, or use ./gradlew run.
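Not the actual Main.kt, but a sketch of the single-process setup the report and the earlier comments describe; the port, EchoService, and the request loop are placeholders:

```kotlin
import io.grpc.ManagedChannel
import io.grpc.netty.NettyChannelBuilder
import io.grpc.netty.NettyServerBuilder
import java.util.concurrent.TimeUnit

// Sketch: server and client run in the same JVM, as in the reproducer.
fun main() {
    val server = NettyServerBuilder.forPort(50051)
        .maxConnectionAge(1, TimeUnit.SECONDS)
        .maxConnectionAgeGrace(300, TimeUnit.SECONDS)
        .addService(EchoService()) // placeholder service returning ~4 MB responses
        .build()
        .start()

    val channel: ManagedChannel = NettyChannelBuilder
        .forAddress("localhost", 50051)
        .usePlaintext()
        .build()

    try {
        // ... drive the request loop against a stub bound to `channel` ...
    } finally {
        channel.shutdownNow()
        server.shutdownNow()
        server.awaitTermination()
    }
}
```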