Skip to content

Serverside keepalive error detection and cleanups #2756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Serverside keepalive error detection and cleanups #2756

wants to merge 2 commits into from

Conversation

davidfiala
Copy link
Contributor

Improve server-side keepalives, possibly resolve bug where keepalive errors were not being treated as errors.

  • Ensure that http2 ping failures are detected at request time and in the callback. Code inspection indicates both cases may have silently been ignored before.
  • Ensure keepalive intervals and timeouts are always cleared.
  • Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue)
  • Rename timers/intervals for clarity
  • Ensure all paths do contain a channelz keepalive trace message with unique error strings.

Note: I think that in nodejs/node#18447 half of the issue was clarified already, but the code just wasn't updated to reflect the answer there. The other half is checking the return value of session.ping for a non-true result, too. Given my experience of keepalives not working in #2734 my current conclusion is that although keepalives are being sent (I inspected via manual logging), failures may not have been properly detected.

…errors were not being treated as errors.

- Ensure that http2 ping failures are detected at request time and in the callback. Code inspection indicates both cases may have silently been ignored before.
- Ensure keepalive intervals and timeouts are always cleared.
- Bug Fix: keepalive intervals were being cleared with an incorrect clearTimeout before. Not sure if this was causing intervals leaks in some nodejs impls or not. (v20.13.1 seems to accept this mismatch without issue)
- Rename timers/intervals for clarity
- Ensure all paths do contain a channelz keepalive trace message with unique error strings.
…ace and session destroyed. transport.ts has a more comprehensive approach, but this ensures that an overly ping-happy server will at least take action based on a client's expectations. AIUI, before this patch the server did not take any action on a goaway frame.
@murgatroid99
Copy link
Member

Unfortunately, the master branch is actually behind the release branch on this, so your changes are going to conflict with that. How about I merge those changes into master and then we look at this again?

On the plus side, that version does already handle the ping error, so you won't need to do that again. On the minus side, there are two parallel implementations of keepalives for performance reasons, so any changes will need to be made twice.

@davidfiala
Copy link
Contributor Author

Thank you, Michael. Happy to come back to this after the merge. If OTOH you would be happy with me making a PR(s) to fix this in one or more release branches, I can do that too. Whichever you think is easier for you to get it out faster. LMK which to ones target if you think it's better for me to PR to branches.

@murgatroid99
Copy link
Member

The current release branch is @grpc/[email protected]. You can make the change on that branch.

@davidfiala
Copy link
Contributor Author

davidfiala commented May 28, 2024

Thank you- I've created #2760 -- I think we can close this PR here then if it is too stale to use? Closing now.

@davidfiala davidfiala closed this May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants