
[Feature] Support configurable RayCluster deletion delay in RayService #3864

Merged

Conversation

machichima (Contributor) commented Jul 12, 2025

  • Add the RayClusterDeletionDelaySeconds field to the RayService CRD to allow customization of the deletion delay (in seconds) before removing old RayClusters.
  • Enforce a minimum value constraint (>= 0) for RayClusterDeletionDelaySeconds in the CRD schema.
  • Update the controller logic to use the default value (60 seconds) if RayClusterDeletionDelaySeconds is omitted or set to a negative value.
  • Add tests to verify the correct handling of RayClusterDeletionDelaySeconds.

Why are these changes needed?

Previously, RayService would delete old RayCluster resources 60 seconds after they became unused. This could result in longer-running tasks on the old cluster being terminated before completion.

By making this deletion delay configurable, users can adjust it to better suit their workloads and avoid premature termination of in-flight tasks.
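
For illustration, a minimal sketch of how the new field could be set when building a RayService object in Go (the newRayService helper and the 300-second value are examples, not part of this PR):

package example

import (
    rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
    "k8s.io/utils/ptr"
)

// newRayService returns a RayService whose old RayClusters are kept for
// 5 minutes after they become unused, instead of the default 60 seconds.
func newRayService() *rayv1.RayService {
    return &rayv1.RayService{
        Spec: rayv1.RayServiceSpec{
            RayClusterDeletionDelaySeconds: ptr.To[int32](300),
        },
    }
}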

Related issue number

Closes #3602

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@machichima machichima marked this pull request as ready for review July 13, 2025 04:34
@machichima machichima changed the title from "[Feature] Zero downtime upgrade for long-running requests" to "[Feature] Support configurable RayCluster deletion delay in RayService" on Jul 13, 2025
@machichima (Contributor Author):

@kevin85421 PTAL, Thanks!

ts, exists := r.RayClusterDeletionTimestamps.Get(rayClusterName)
assert.True(t, exists, "Deletion timestamp should be set for the cluster")
expectedTs := now.Add(tc.expectedDuration)
assert.Equal(t, expectedTs.Unix(), ts.Unix(), "Deletion timestamp should match expected timestamp")
Contributor:

Could this be a flaky test if the execution between time.Now() in the test and metav1.Now() in cleanUpRayClusterInstance(...) isn't quick enough, since the two timestamps are compared directly?

If so, how about using WithinDuration or WithinRange?

machichima (Contributor Author) commented Jul 13, 2025:

Makes sense! I changed it to use InDelta instead.

Thanks!
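
For reference, a rough sketch of the InDelta-based comparison (variable names follow the snippet above; the 1-second tolerance is an assumption):

expectedTs := now.Add(tc.expectedDuration)
// Tolerate up to 1 second of skew between time.Now() in the test and
// metav1.Now() inside cleanUpRayClusterInstance(...).
assert.InDelta(t, expectedTs.Unix(), ts.Unix(), 1,
    "Deletion timestamp should be within 1s of the expected timestamp")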

)
} else {
deletionDelay = time.Duration(*rayServiceInstance.Spec.RayClusterDeletionDelaySeconds) * time.Second
logger.Info("RayClusterDeletionDelaySeconds is set correctly and will be used.",
Contributor:

I wonder if the logs are too verbose. Will they be printed for every reconciliation before the shutdown is complete?

machichima (Contributor Author):

Yes, I think so.

I removed the logging here for the case where rayServiceInstance.Spec.RayClusterDeletionDelaySeconds is a valid value, to prevent the logs from becoming too noisy.

Instead, I now log the deletionDelay alongside the deletionTimestamp in the existing logger here.
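
For illustration only, the consolidated structured log might take roughly this shape (message text and key names are illustrative, not the PR's actual code):

logger.Info("Scheduling RayCluster deletion",
    "rayClusterName", rayClusterName,
    "deletionTimestamp", deletionTimestamp,
    "deletionDelay", deletionDelay)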

kevin85421 (Member) left a comment:

Overall LGTM. Because this is a public API and user-facing behavior change, would you mind opening an issue to track the progress of e2e tests?

@@ -65,6 +65,11 @@ type RayServiceUpgradeStrategy struct {

// RayServiceSpec defines the desired state of RayService
type RayServiceSpec struct {
// RayClusterDeletionDelaySeconds specifies the delay in seconds before deleting old RayClusters.
Member:

Suggested change
// RayClusterDeletionDelaySeconds specifies the delay in seconds before deleting old RayClusters.
// RayClusterDeletionDelaySeconds specifies the delay, in seconds, before deleting old RayClusters.
// The default value is 60 seconds.

@@ -65,6 +65,11 @@ type RayServiceUpgradeStrategy struct {

// RayServiceSpec defines the desired state of RayService
type RayServiceSpec struct {
// RayClusterDeletionDelaySeconds specifies the delay in seconds before deleting old RayClusters.
// Only values >= 0 are allowed. If omitted or negative, defaults to 60 seconds.
Member:

If the value is negative, we should not reconcile the CR, rather than falling back to the default value.

Suggested change
// Only values >= 0 are allowed. If omitted or negative, defaults to 60 seconds.

// Determine the ray cluster deletion delay seconds
deletionDelay := RayClusterDeletionDelayDuration
if rayServiceInstance.Spec.RayClusterDeletionDelaySeconds != nil {
if *rayServiceInstance.Spec.RayClusterDeletionDelaySeconds < 0 {
Member:

Move the check to https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/utils/validation.go

In addition, we have already checked the value in OpenAPI. Do we still need the check?

machichima (Contributor Author):

If we create the RayService in Go directly from the RayService struct, the OpenAPI check will not be performed. I added this condition here to ensure that we prevent the use of negative delay seconds in any case.

machichima (Contributor Author):

Moved the check!
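
For context, the check moved into ValidateRayServiceSpec plausibly takes a shape like the following (the exact error message is an assumption):

if rayService.Spec.RayClusterDeletionDelaySeconds != nil &&
    *rayService.Spec.RayClusterDeletionDelaySeconds < 0 {
    return fmt.Errorf("Spec.RayClusterDeletionDelaySeconds should be a non-negative integer, got %d",
        *rayService.Spec.RayClusterDeletionDelaySeconds)
}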

@machichima (Contributor Author):

Opened an issue for adding e2e tests here: #3870

name: "Use custom delay when set to positive",
delaySeconds: ptr.To[int32](5),
expectedDuration: 5 * time.Second,
},
machichima (Contributor Author):

This does not test a delay value < 0, because that check was moved to ray-operator/controllers/ray/utils/validation.go.

I will probably add tests for the behavior of setting a negative deletion delay value in the e2e tests.

kevin85421 (Member) left a comment:

LGTM. Please add a test to validation_test.go.

@@ -256,5 +256,11 @@ func ValidateRayServiceSpec(rayService *rayv1.RayService) error {
*rayService.Spec.UpgradeStrategy.Type != rayv1.NewCluster {
return fmt.Errorf("Spec.UpgradeStrategy.Type value %s is invalid, valid options are %s or %s", *rayService.Spec.UpgradeStrategy.Type, rayv1.NewCluster, rayv1.None)
}

Member:

Please also update validation_test.go.

machichima (Contributor Author):

Added, thanks!
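
For illustration, a unit test along these lines could cover the new check (test name and message are assumptions, not the PR's actual test; it assumes the test lives alongside validation.go in the utils package, with the usual rayv1, ptr, and testify imports):

func TestValidateRayServiceSpecDeletionDelay(t *testing.T) {
    rayService := &rayv1.RayService{
        Spec: rayv1.RayServiceSpec{
            // A negative delay should be rejected by validation.
            RayClusterDeletionDelaySeconds: ptr.To[int32](-1),
        },
    }
    err := ValidateRayServiceSpec(rayService)
    require.Error(t, err, "expected an error for a negative RayClusterDeletionDelaySeconds")
}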

@kevin85421 kevin85421 merged commit 017e58f into ray-project:master Jul 16, 2025
25 checks passed