-
Notifications
You must be signed in to change notification settings - Fork 32
Add initial Kueue integration #313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial Kueue integration #313
Conversation
Hi @tedhtchang. Thanks for your PR. I'm waiting for a trustyai-explainability member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
24ae784
to
ffe34a3
Compare
ffe34a3
to
b087db4
Compare
b087db4
to
80ad816
Compare
Hi @tedhtchang this is looking great, thanks! Just a question, we cannot guarantee that Kueue is installed on the cluster. Would enabling Kueue support for LM-Eval (and not having Kueue installed) be a problem? |
@ruivieira @yhwang Can you guys change the PR to merge to main branch Or I can close & open a new PR?(done) |
@ruivieira We can start the operator without the job_mgr_controller if Kueue is not enabled. It shouldn't be a problem. |
80ad816
to
6639b9b
Compare
6639b9b
to
2003c61
Compare
/cc @ruivieira @yhwang Use the controllers/job_mgr/README.md to setup env and test this PR. |
PR image build and manifest generation completed successfully! 📦 PR image: 📦 LMES driver image: 📦 LMES job image: 🗂️ CI manifests
|
cmd/main.go
Outdated
@@ -69,7 +72,7 @@ func main() { | |||
var probeAddr string | |||
var configMap string | |||
var enabledServices controllers.EnabledServices | |||
flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") | |||
flag.StringVar(&metricsAddr, "metrics-bind-address", ":9443", "The address and port the metric endpoint binds to.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching the default value here to 9443 may conflict with the webhook port although we don't have a webhook server yet. No cert for running 9443 at this moment. And the metric port is not exposed either.
controllers/job_mgr/README.md
Outdated
kServeServerless: disabled | ||
lmes-default-batch-size: "8" | ||
lmes-driver-image: quay.io/yhwang/ta-lmes-driver:latest | ||
lmes-grpc-port: "8082" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no more gRPC settings. this line and the following line.
@@ -0,0 +1,5 @@ | |||
package job_mgr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing the license header.
Hi @tedhtchang , is this PR rebased on |
@ruivieira You should merge @yhwang's PR first because my PR was based on it. Once you merged it the main, I will rebased the main into my PR. Then you should be able to merged mine. |
@tedhtchang ack, thanks for the heads up. I've approved #340 now. |
c94e081
to
78cbad0
Compare
New changes are detected. LGTM label has been removed. |
@ruivieira @yhwang I rebased and resolved conflicts from main and update-lmevaljob-pod branches. |
78cbad0
to
f80c1b8
Compare
f80c1b8
to
353cf02
Compare
k8sClient = fake.NewClientBuilder().WithScheme(scheme.Scheme).WithStatusSubresource( | ||
&routev1.Route{}, | ||
&trustyaiopendatahubiov1alpha1.TrustyAIService{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to register the status subresoruce separately with the updated controller-runtime package.
Thanks @yhwang figured this out.
@ruivieira The controller-test failure was fix. Not sure about the smoke test error. Does it always fail ? Nvm, missing service secrets which are generated by Openshift. KinD would not generate those secrets. I try to create these secrets in this PR to pass the smoke test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tedhtchang LGTM, thanks for the changes!
I do have a final request: is it possible to regenerate the go.sum
?
I'm getting a checksum mismatch for github.com/kserve/[email protected]
dependency.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: yhwang The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: ted chang <[email protected]>
Signed-off-by: ted chang <[email protected]>
Signed-off-by: ted chang <[email protected]>
Signed-off-by: ted chang <[email protected]>
353cf02
to
605323e
Compare
@ruivieira I saw no difference in go.sum after
|
/retest |
@tedhtchang: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR is to optionally enable Kueue for LMEvalJob by implementing the Kueue's GenericJob interface. It still needs some other functionalities so I want to broke it down a little bit:
1. Kueue requires LMEvalJob to have Spec.Suspend. We need to handle these 3 suspend cases in the LMEvalJob reconciler:
See PR#317
2. Get the PodSpec from the LMEValJob. Currently PodSpec is not stored in the LMEvalJob object.
Kueue needs to admit the job as a kueue.x-k8s.io/v1beta1/Workload. So the Workload can represent the resources (cpu/memory) requirements for the job.
We need the PodSpec from the LMEvalJob object to create a PodSet for Workload. See PR#323
3. Node Affinity
When we want to run job on particular node. We can create a spec.nodeLabels in ResourceFlavor which can be referenced in a ClusterQueue.
Implement these runWithPodInfo and restoreWithPodInfo.
Extract nodeSelector object from from the PodSetInfo object and then overwrite the LMevalJob PodSpec spec affinity. @yhwang created another PR to introduce Pod.Spec.NodeAffinity into LMevalJob.