Skip to content

[Feature] Add timeout for apiserver grpc server #3427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

machichima
Copy link
Contributor

@machichima machichima commented Apr 19, 2025

  • Add time out grpc server interceptor
  • Add unit test to ensure the timeout interceptor works well
  • Enable setting the timeout value through flag -grpc_timeout (e.g. -grpc_timeout 30s), default to 60 seconds

Why are these changes needed?

Currently, there's no timeout setting in gRPC server side. This PR added the timeout to ensure resource access doesn't get overloaded in all cases.

Related issue number

Part of #3344

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

-e
Signed-off-by: machichima <[email protected]>
@machichima
Copy link
Contributor Author

Hi @dentiny ,

I noticed that there's a PR related to this which works on http timeout (#3350), do we need to have a same timeout for http and grpc server or is it fine to have them different?

Thanks

@dentiny
Copy link
Contributor

dentiny commented Apr 20, 2025

I noticed that there's a PR related to this which works on http timeout (#3350), do we need to have a same timeout for http and grpc server or is it fine to have them different?

There're three timeout related issues, as far as I am aware of:

  • grpc service timeout, which limits the processing time for each request (i.e. query kubernetes apiserver)
  • http server timeout, which is used to limit http request read and response write timeout
  • client side timeout

@machichima
Copy link
Contributor Author

There're three timeout related issues, as far as I am aware of:

Got it, I think this PR is working on grpc service timeout, while PR #3350 works on limiting http request?

As I saw the comment in PR #3350 that they want to decide a default timeout, I am thinking if the default value here should be the same as what they set?

@dentiny
Copy link
Contributor

dentiny commented Apr 20, 2025

Got it, I think this PR is working on grpc service timeout, while PR #3350 works on limiting http request?

yes

As I saw the comment in PR #3350 that they want to decide a default timeout, I am thinking if the default value here should be the same as what they set?

No, I don't think they need to be the same

@bhks
Copy link

bhks commented Apr 23, 2025

Got it, I think this PR is working on grpc service timeout, while PR #3350 works on limiting http request?

yes

As I saw the comment in PR #3350 that they want to decide a default timeout, I am thinking if the default value here should be the same as what they set?

No, I don't think they need to be the same

60 seconds should be a good number ?

@machichima machichima marked this pull request as ready for review April 23, 2025 14:54
@machichima
Copy link
Contributor Author

60 seconds should be a good number ?

Yes I think its good for default value. I also add an env variable to let user set the timeout value.

@machichima
Copy link
Contributor Author

@dentiny PTAL

case <-ctx.Done():
// Raise error if time out
if ctx.Err() == context.DeadlineExceeded {
return nil, fmt.Errorf("grpc server timed out")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name the grpc server with KubeRay API server ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Just changed

@@ -50,11 +51,21 @@ func main() {
_ = flagSet.Set("log_file", *logFile)
}

grpcTimeout := 60 * time.Second // Default timeout
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we following mechanisms to define constants or we are adding to each files where we are using it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quickly glancing over the code, we have constants.go for other components (i.e. operator)
https://github.com/ray-project/kuberay/blob/ebb5ba441b0a7f888c17aa5c2d33943084a9a2d9/ray-operator/controllers/ray/utils/constant.go

I usually do this in two ways:

  • either place it to constant file, just as what we did for kuberay operator
    • the benefit of which is we group all constants in one place, rather than scattered around the codebase
  • or define a util function getGrpcServerTimeoutOrDefault and have default timeout besides
    • the benefit of which is it's easy to locate all timeout related functions and features

Our codebase seems to prefer (1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I found that in apiserver, they put constants in config.go, I'll add it here

const (
// Label keys
RayClusterNameLabelKey = "ray.io/cluster-name"
RayClusterUserLabelKey = "ray.io/user"
RayClusterVersionLabelKey = "ray.io/version"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -50,11 +51,21 @@ func main() {
_ = flagSet.Set("log_file", *logFile)
}

grpcTimeout := 60 * time.Second // Default timeout
if timeoutStr := os.Getenv("GRPC_SERVER_TIMEOUT"); timeoutStr != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw why do we use env var instead of flags? I think flags are strictly better in a few ways:

  • program checks env variables; for example, bazel uses env to decide whether we could reuse cache
  • impose security issue, because env is shared among all processes which could be accessed everywhere

I almost only use env variables when:

  • across language boundary
  • across process boundary, if no other easier way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the guidance!

The reason why I put it in environment variable instead of flag is because I search through the code base and find they put this (which I think is a bit similar to timeout?) in the environment variable, so I just simply follow what it does

requeueAfterSeconds, err := strconv.Atoi(os.Getenv(utils.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV))

I agree to your points, if there's no other places that need this value, I think I'll just move it to flag instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to flag

grpcTimeout = timeout
klog.Infof("gRPC servier timeout set to %v", grpcTimeout)
} else {
klog.Warningf("Invalid GRPC_SERVER_TIMEOUT value: %v, using default timeout (60 seconds)", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use %d to print out default value, in case we change in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just added

@machichima machichima requested a review from dentiny April 26, 2025 08:12
@dentiny
Copy link
Contributor

dentiny commented Apr 28, 2025

let me know when it's ready for review, feel free to ping me any time :)

Comment on lines +29 to +43
select {
case <-time.After(delay):
return "test_response", h.returnErr
case <-ctx.Done():
var grpcCode codes.Code
switch ctx.Err() {
case context.Canceled:
grpcCode = codes.Canceled
case context.DeadlineExceeded:
grpcCode = codes.DeadlineExceeded
default:
grpcCode = codes.Unknown
}
return nil, status.Error(grpcCode, ctx.Err().Error())
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this to mimic the grpc IO handler for testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is automatically updated when running make test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is automatically updated when running make test

@@ -50,11 +51,21 @@ func main() {
_ = flagSet.Set("log_file", *logFile)
}

grpcTimeout := 60 * time.Second // Default timeout
if timeoutStr := os.Getenv("GRPC_SERVER_TIMEOUT"); timeoutStr != "" {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to flag

@machichima machichima requested a review from dentiny April 29, 2025 01:45
@machichima
Copy link
Contributor Author

Sorry I just found that I didn't submit the review as comment

@machichima
Copy link
Contributor Author

@dentiny PTAL, Thanks!

Copy link
Contributor

@dentiny dentiny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for the help!
And sorry about the delay

func (mr *MockKubernetesClientInterfaceMockRecorder) ConfigMapClient(namespace interface{}) *gomock.Call {
mr.mock.ctrl.T.Helper()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mock call setup fails (e.g., wrong argument types), Go's testing output will show the line in your 
actual test code where the error originated, rather than pointing you to this ConfigMapClient() method in 
the mock recorder.

TIL

@@ -61,7 +83,7 @@ func TestAPIServerInterceptor(t *testing.T) {
req,
info,
func(ctx context.Context, req interface{}) (interface{}, error) {
return tt.handler.Handle(ctx, req)
return tt.handler.Handle(ctx, req, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 0 /*delay*/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

@@ -153,7 +175,7 @@ func TestAPIServerInterceptorLogging(t *testing.T) {
"test_request",
info,
func(receivedCtx context.Context, req interface{}) (interface{}, error) {
return handler.Handle(receivedCtx, req)
return handler.Handle(receivedCtx, req, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a comment besides constants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

@kevin85421 kevin85421 merged commit f45155b into ray-project:master May 6, 2025
24 checks passed
@dentiny
Copy link
Contributor

dentiny commented May 7, 2025

Enable setting the timeout value through env variable GRPC_SERVER_TIMEOUT

PR description needs to be updated :)

laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request May 7, 2025
@machichima
Copy link
Contributor Author

PR description needs to be updated :)

Thanks! Just updated!

laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants