Long running streams fail #7641

Closed
daveoxley opened this issue Sep 18, 2024 · 5 comments · Fixed by #7660
Labels: Area: Transport, Type: Bug

Comments


daveoxley commented Sep 18, 2024

What version of gRPC are you using?

1.66.2
Seen on all versions of 1.66.x so far. The issue did not occur on 1.65.

What version of Go are you using (go version)?

N/A

What operating system (Linux, Windows, …) and version?

N/A

What did you do?

Long running stream fails with an error. Test case in #7642.

What did you expect to see?

Stream completing successfully:

=== RUN   TestLongRunningStream
--- PASS: TestLongRunningStream (127.39s)
PASS

What did you see instead?

The errors vary between the two below:

=== RUN   TestLongRunningStream
    stream_test.go:70: Failed to receive response: rpc error: code = Internal desc = received 4294967270-bytes data exceeding the limit 65535 bytes
--- FAIL: TestLongRunningStream (17.32s)

and

=== RUN   TestLongRunningStream
    stream_test.go:70: Failed to receive response: rpc error: code = Internal desc = unexpected EOF
--- FAIL: TestLongRunningStream (8.14s)
daveoxley added a commit to daveoxley/grpc-go that referenced this issue Sep 18, 2024
@eshitachandwani eshitachandwani self-assigned this Sep 19, 2024
@eshitachandwani eshitachandwani added the Area: Client label Sep 19, 2024

arjan-bal commented Sep 20, 2024

After running git bisect over the range from tag 1.65.0-dev to the HEAD of master, the commit causing the regression appears to be 9ab8b62

Command:

# Run 10 times in parallel for higher confidence
seq 1 10 | xargs -n1 -P10 -I{}  go test ./test/stream/ -run TestLongRunningStream -count 1 -v || echo "A test failed"

I then verified that the test flakes on 9ab8b62 and passes 30/30 times on the commit immediately before it.
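
For anyone reproducing the bisect: the test from #7642 isn't present on the older commits, so it has to be applied on top of each candidate. A sketch of the workflow (the exact ref names and the cherry-pick step are assumptions about how the test was carried along):

git bisect start
git bisect bad master        # test flakes here
git bisect good v1.65.0-dev  # test passes here
# at each step suggested by bisect: apply the test commit from #7642, e.g.
#   git cherry-pick <test-commit>
# run the parallel test command above, undo the test commit
# (git reset --hard HEAD~1), then mark the result before moving on:
git bisect good              # or: git bisect bad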

@arjan-bal arjan-bal removed their assignment Sep 20, 2024

zhouyan commented Sep 22, 2024

I have been seeing the same problem since upgrading to 1.66 and later.

The error I got was a flow control error saying the received data exceeds the limit. The received data is around 4 GB according to the logs.

@arjan-bal

It seems like the number of bytes being read is greater than the number of bytes received. This causes an unsigned 32-bit integer to underflow, resulting in 4294967270, which is -26 interpreted as a uint32 (2^32 - 26).
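
As a quick illustration of the underflow (a standalone sketch with made-up numbers, not grpc-go code):

package main

import "fmt"

func main() {
    // Hypothetical values: the window accounting is told more bytes were
    // consumed than it had granted.
    var granted uint32 = 100
    var reported uint32 = 126

    // Unsigned subtraction wraps around: 100 - 126 = -26, which as a
    // uint32 is 2^32 - 26 = 4294967270, the number from the error above.
    fmt.Println(granted - reported) // prints 4294967270
}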

I think I found the bug. The headers are being read incrementally here:

func (s *Stream) ReadHeader(header []byte) (err error) {
    // Don't request a read if there was an error earlier
    if er := s.trReader.er; er != nil {
        return er
    }
    s.requestRead(len(header))
    for len(header) != 0 {
        n, err := s.trReader.ReadHeader(header)
        header = header[n:]
        if len(header) == 0 {
            err = nil
        }
        if err != nil {
            if n > 0 && err == io.EOF {
                err = io.ErrUnexpectedEOF
            }
            return err
        }
    }
    return nil
}

We keep filling the header slice until it's full, advancing the start of the slice forward after each read. In the following function, which is called by the function above to perform the reads and update the flow control window, we do t.windowHandler(len(header)):

func (t *transportReader) ReadHeader(header []byte) (int, error) {
    n, err := t.reader.ReadHeader(header)
    if err != nil {
        t.er = err
        return 0, err
    }
    // BUG: len(header) is the size of the remaining buffer, which can be
    // larger than n, the number of bytes actually read.
    t.windowHandler(len(header))
    return n, nil
}

The code has read n bytes, but it reports that it read len(header) bytes. len(header) is the number of bytes that remain to be read, which is not always equal to the number of bytes that were actually read. I was able to get the test to pass after changing the line to t.windowHandler(n).
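
For reference, this is what the corrected function looks like with that one-line change (the actual fix landed in #7660 and may differ in detail):

func (t *transportReader) ReadHeader(header []byte) (int, error) {
    n, err := t.reader.ReadHeader(header)
    if err != nil {
        t.er = err
        return 0, err
    }
    // Report only the n bytes actually read to the flow control window.
    t.windowHandler(n)
    return n, nil
}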

@arjan-bal arjan-bal added the Area: Transport label and removed the Area: Client label Sep 22, 2024
@arjan-bal arjan-bal self-assigned this Sep 22, 2024
@arjan-bal

The fix has been merged in #7660. Keeping the issue open to track the cherry-pick into the release branches.

@arjan-bal

The fix is part of the latest release: https://github.com/grpc/grpc-go/releases/tag/v1.67.1

Closing the issue now.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 31, 2025