Skip to content

Add initial Kueue integration #313

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

tedhtchang
Copy link
Contributor

@tedhtchang tedhtchang commented Oct 9, 2024

This PR is to optionally enable Kueue for LMEvalJob by implementing the Kueue's GenericJob interface. It still needs some other functionalities so I want to broke it down a little bit:

1. Kueue requires LMEvalJob to have Spec.Suspend. We need to handle these 3 suspend cases in the LMEvalJob reconciler:

  1. Spec.Suspend(=true) upon LMEvalJob creation. This means the job is created but pods are running.
  2. Spec.Suspend(=true) while the LMEvalJob is running. This means delete the pods owned by the job.
  3. Spec.Suspend(=false) while the LMEvalJob is in suspended state. This means re-create the pods owned by the job.
    See PR#317

2. Get the PodSpec from the LMEValJob. Currently PodSpec is not stored in the LMEvalJob object.

Kueue needs to admit the job as a kueue.x-k8s.io/v1beta1/Workload. So the Workload can represent the resources (cpu/memory) requirements for the job.
We need the PodSpec from the LMEvalJob object to create a PodSet for Workload. See PR#323

3. Node Affinity

When we want to run job on particular node. We can create a spec.nodeLabels in ResourceFlavor which can be referenced in a ClusterQueue.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
spec:
  nodeLabels:
    kubernetes.io/hostname: kueue-worker

Implement these runWithPodInfo and restoreWithPodInfo.
Extract nodeSelector object from from the PodSetInfo object and then overwrite the LMevalJob PodSpec spec affinity. @yhwang created another PR to introduce Pod.Spec.NodeAffinity into LMevalJob.

Copy link

openshift-ci bot commented Oct 9, 2024

Hi @tedhtchang. Thanks for your PR.

I'm waiting for a trustyai-explainability member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch 3 times, most recently from 24ae784 to ffe34a3 Compare October 9, 2024 23:11
@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch from ffe34a3 to b087db4 Compare October 17, 2024 22:39
@yhwang yhwang added the lm-eval Issues related to LM-Eval label Oct 18, 2024
@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch from b087db4 to 80ad816 Compare October 18, 2024 20:44
@ruivieira
Copy link
Member

Hi @tedhtchang this is looking great, thanks!

Just a question, we cannot guarantee that Kueue is installed on the cluster. Would enabling Kueue support for LM-Eval (and not having Kueue installed) be a problem?

@tedhtchang
Copy link
Contributor Author

tedhtchang commented Oct 23, 2024

@ruivieira @yhwang Can you guys change the PR to merge to main branch Or I can close & open a new PR?(done)
Thanks @yhwang

@tedhtchang tedhtchang changed the base branch from dev/lm-eval to main October 23, 2024 18:07
@tedhtchang
Copy link
Contributor Author

Hi @tedhtchang this is looking great, thanks!

Just a question, we cannot guarantee that Kueue is installed on the cluster. Would enabling Kueue support for LM-Eval (and not having Kueue installed) be a problem?

@ruivieira We can start the operator without the job_mgr_controller if Kueue is not enabled. It shouldn't be a problem.

@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch from 80ad816 to 6639b9b Compare October 23, 2024 19:52
@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch from 6639b9b to 2003c61 Compare October 24, 2024 09:34
@tedhtchang tedhtchang marked this pull request as ready for review October 24, 2024 09:36
@tedhtchang
Copy link
Contributor Author

/cc @ruivieira @yhwang Use the controllers/job_mgr/README.md to setup env and test this PR.

Copy link

github-actions bot commented Oct 24, 2024

PR image build and manifest generation completed successfully!

📦 PR image: quay.io/trustyai/trustyai-service-operator-ci:605323ebc1f1eb567a11e60671a3a09a1767b0df

📦 LMES driver image: quay.io/trustyai/ta-lmes-driver:605323ebc1f1eb567a11e60671a3a09a1767b0df

📦 LMES job image: quay.io/trustyai/ta-lmes-job:605323ebc1f1eb567a11e60671a3a09a1767b0df

🗂️ CI manifests

devFlags:
  manifests:
    - contextDir: config
      sourcePath: ''
      uri: https://api.github.com/repos/trustyai-explainability/trustyai-service-operator-ci/tarball/operator-605323ebc1f1eb567a11e60671a3a09a1767b0df

cmd/main.go Outdated
@@ -69,7 +72,7 @@ func main() {
var probeAddr string
var configMap string
var enabledServices controllers.EnabledServices
flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.")
flag.StringVar(&metricsAddr, "metrics-bind-address", ":9443", "The address and port the metric endpoint binds to.")
Copy link
Collaborator

@yhwang yhwang Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching the default value here to 9443 may conflict with the webhook port although we don't have a webhook server yet. No cert for running 9443 at this moment. And the metric port is not exposed either.

kServeServerless: disabled
lmes-default-batch-size: "8"
lmes-driver-image: quay.io/yhwang/ta-lmes-driver:latest
lmes-grpc-port: "8082"
Copy link
Collaborator

@yhwang yhwang Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no more gRPC settings. this line and the following line.

@@ -0,0 +1,5 @@
package job_mgr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing the license header.

@ruivieira
Copy link
Member

Hi @tedhtchang , is this PR rebased on main? If not, is it ok to rebase it? Thank you!

@tedhtchang
Copy link
Contributor Author

tedhtchang commented Oct 30, 2024

Hi @tedhtchang , is this PR rebased on main? If not, is it ok to rebase it? Thank you!

@ruivieira You should merge @yhwang's PR first because my PR was based on it. Once you merged it the main, I will rebased the main into my PR. Then you should be able to merged mine.

@ruivieira
Copy link
Member

@tedhtchang ack, thanks for the heads up. I've approved #340 now.

@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch from c94e081 to 78cbad0 Compare November 5, 2024 00:32
@openshift-ci openshift-ci bot removed the lgtm label Nov 5, 2024
Copy link

openshift-ci bot commented Nov 5, 2024

New changes are detected. LGTM label has been removed.

@tedhtchang
Copy link
Contributor Author

@ruivieira @yhwang I rebased and resolved conflicts from main and update-lmevaljob-pod branches.

Comment on lines +55 to +57
k8sClient = fake.NewClientBuilder().WithScheme(scheme.Scheme).WithStatusSubresource(
&routev1.Route{},
&trustyaiopendatahubiov1alpha1.TrustyAIService{},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to register the status subresoruce separately with the updated controller-runtime package.
Thanks @yhwang figured this out.

@tedhtchang
Copy link
Contributor Author

tedhtchang commented Nov 7, 2024

@ruivieira The controller-test failure was fix. Not sure about the smoke test error. Does it always fail ? Nvm, missing service secrets which are generated by Openshift. KinD would not generate those secrets. I try to create these secrets in this PR to pass the smoke test.

Copy link
Member

@ruivieira ruivieira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tedhtchang LGTM, thanks for the changes!

I do have a final request: is it possible to regenerate the go.sum?
I'm getting a checksum mismatch for github.com/kserve/[email protected] dependency.

Copy link

openshift-ci bot commented Nov 8, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yhwang

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tedhtchang tedhtchang force-pushed the optional-kueue-integration branch from 353cf02 to 605323e Compare November 8, 2024 17:38
@tedhtchang
Copy link
Contributor Author

@ruivieira I saw no difference in go.sum after go mod tidy. I rebase PR anyway.
however @yhwang found mismatched go.sum in the main branch. This is the output:

go mod tidy
go: downloading golang.org/x/mod v0.12.0
go: downloading github.com/grpc-ecosystem/grpc-gateway/v2 v2.15.2

@tedhtchang
Copy link
Contributor Author

/retest

Copy link

openshift-ci bot commented Nov 8, 2024

@tedhtchang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/trustyai-service-operator-e2e 605323e link true /test trustyai-service-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ruivieira ruivieira merged commit 02de274 into trustyai-explainability:main Nov 11, 2024
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lm-eval Issues related to LM-Eval ok-to-test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants