[Feature] Support configurable RayCluster deletion delay in RayService #3864
Conversation
@kevin85421 PTAL, thanks!
```go
ts, exists := r.RayClusterDeletionTimestamps.Get(rayClusterName)
assert.True(t, exists, "Deletion timestamp should be set for the cluster")
expectedTs := now.Add(tc.expectedDuration)
assert.Equal(t, expectedTs.Unix(), ts.Unix(), "Deletion timestamp should match expected timestamp")
```
Could comparing these two timestamps directly make the test flaky if the execution between `time.Now()` and `metav1.Now()` in `cleanUpRayClusterInstance(...)` isn't quick enough? If so, how about using `WithinDuration` or `WithinRange`?
Makes sense! I changed it to `InDelta` instead. Thanks!
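A minimal sketch of what the `InDelta`-based assertion could look like (variable names follow the earlier snippet; the 1-second tolerance is an assumption):

```go
// Compare Unix timestamps with a tolerance instead of exact equality, so a slow
// gap between time.Now() in the test and metav1.Now() in the controller cannot
// flake the test.
assert.InDelta(t, expectedTs.Unix(), ts.Unix(), 1,
	"Deletion timestamp should be within 1s of the expected timestamp")
```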
```go
	)
} else {
	deletionDelay = time.Duration(*rayServiceInstance.Spec.RayClusterDeletionDelaySeconds) * time.Second
	logger.Info("RayClusterDeletionDelaySeconds is set correctly and will be used.",
```
I wonder if the logs are too verbose. Will they be printed for every reconciliation before the shutdown is complete?
Yes, I think so. I removed the logging here for the case where `rayServiceInstance.Spec.RayClusterDeletionDelaySeconds` is a valid value, to prevent the logs from becoming too noisy. Instead, I log the `deletionDelay` alongside the `deletionTimestamp` in the existing logger.
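A sketch of how that consolidated log line might look (the message and key names are assumptions):

```go
// Log the effective delay and the computed deletion timestamp in one place,
// once per cluster marked for deletion, rather than on every reconciliation.
logger.Info("Marked RayCluster for deletion",
	"rayClusterName", rayClusterName,
	"deletionDelay", deletionDelay,
	"deletionTimestamp", deletionTimestamp)
```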
Overall LGTM. Because this is a public API and user-facing behavior change, would you mind opening an issue to track the progress of e2e tests?
```go
// RayServiceSpec defines the desired state of RayService
type RayServiceSpec struct {
	// RayClusterDeletionDelaySeconds specifies the delay in seconds before deleting old RayClusters.
```
Suggested change:

```diff
-// RayClusterDeletionDelaySeconds specifies the delay in seconds before deleting old RayClusters.
+// RayClusterDeletionDelaySeconds specifies the delay, in seconds, before deleting old RayClusters.
+// The default value is 60 seconds.
```
```go
// RayServiceSpec defines the desired state of RayService
type RayServiceSpec struct {
	// RayClusterDeletionDelaySeconds specifies the delay in seconds before deleting old RayClusters.
	// Only values >= 0 are allowed. If omitted or negative, defaults to 60 seconds.
```
Instead of falling back to the default value, we should not reconcile the CR.

Suggested change:

```diff
-// Only values >= 0 are allowed. If omitted or negative, defaults to 60 seconds.
```
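For reference, the `>= 0` constraint in the CRD schema would typically come from a kubebuilder validation marker on the field; a sketch, assuming an optional `*int32` field (marker placement and JSON tag are assumptions):

```go
// RayClusterDeletionDelaySeconds specifies the delay, in seconds, before deleting old RayClusters.
// The default value is 60 seconds.
// +kubebuilder:validation:Minimum=0
// +optional
RayClusterDeletionDelaySeconds *int32 `json:"rayClusterDeletionDelaySeconds,omitempty"`
```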
```go
// Determine the ray cluster deletion delay seconds
deletionDelay := RayClusterDeletionDelayDuration
if rayServiceInstance.Spec.RayClusterDeletionDelaySeconds != nil {
	if *rayServiceInstance.Spec.RayClusterDeletionDelaySeconds < 0 {
```
Move the check to https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/utils/validation.go. In addition, we have already checked the value in OpenAPI. Do we still need this check?
If we create the RayService in Golang from the `RayService` struct, the OpenAPI check will not be performed. I added this condition here to ensure that, in any case, we prevent the use of negative delay seconds.
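For example, a hypothetical unit-test construction that bypasses the OpenAPI schema (values are illustrative):

```go
// Building the object directly in Go (e.g., for a fake client in unit tests)
// never passes through the API server, so the Minimum=0 OpenAPI rule is not
// enforced and a negative value can reach the controller.
rayService := &rayv1.RayService{
	ObjectMeta: metav1.ObjectMeta{Name: "test-rayservice", Namespace: "default"},
	Spec: rayv1.RayServiceSpec{
		RayClusterDeletionDelaySeconds: ptr.To[int32](-1),
	},
}
```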
Moved the check!
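A sketch of what the moved check in `ValidateRayServiceSpec` might look like (the error wording is an assumption):

```go
// Reject the CR outright rather than silently falling back to the default.
if rayService.Spec.RayClusterDeletionDelaySeconds != nil &&
	*rayService.Spec.RayClusterDeletionDelaySeconds < 0 {
	return fmt.Errorf("Spec.RayClusterDeletionDelaySeconds value %d is invalid, only values >= 0 are allowed",
		*rayService.Spec.RayClusterDeletionDelaySeconds)
}
```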
Opened an issue for adding e2e tests here: #3870
name: "Use custom delay when set to positive", | ||
delaySeconds: ptr.To[int32](5), | ||
expectedDuration: 5 * time.Second, | ||
}, |
This does not test for a delay value < 0 because the check was moved to `ray-operator/controllers/ray/utils/validation.go`. I will probably add tests for the behavior of setting a negative deletion delay value in the e2e tests.
LGTM. Please add a test to `validation_test.go`.
```diff
@@ -256,5 +256,11 @@ func ValidateRayServiceSpec(rayService *rayv1.RayService) error {
 		*rayService.Spec.UpgradeStrategy.Type != rayv1.NewCluster {
 		return fmt.Errorf("Spec.UpgradeStrategy.Type value %s is invalid, valid options are %s or %s", *rayService.Spec.UpgradeStrategy.Type, rayv1.NewCluster, rayv1.None)
 	}
```
Please also update `validation_test.go`.
Added, thanks!
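A sketch of the kind of case that could be added to `validation_test.go` (the test-table field names are assumptions):

```go
{
	name: "negative RayClusterDeletionDelaySeconds is rejected",
	rayService: &rayv1.RayService{
		Spec: rayv1.RayServiceSpec{
			RayClusterDeletionDelaySeconds: ptr.To[int32](-1),
		},
	},
	expectError: true, // ValidateRayServiceSpec should return an error
},
```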
- Added a `RayClusterDeletionDelaySeconds` field to the RayService CRD to allow customization of the deletion delay (in seconds) before removing old RayClusters.
- Added validation for `RayClusterDeletionDelaySeconds` in the CRD schema.
- Defaults to 60 seconds when `RayClusterDeletionDelaySeconds` is omitted or set to a negative value.
- Added unit tests for `RayClusterDeletionDelaySeconds`.

Why are these changes needed?
Previously, `RayService` would delete old `RayCluster` resources 60 seconds after they became unused. This could result in longer-running tasks on the old cluster being terminated before completion. By making the deletion delay configurable, users can adjust the delay to better suit their workloads and avoid premature termination of tasks.
Related issue number
Closes #3602