dcm taint toleration from GPU Operator to KMM Operator #209

Closed
wants to merge 35 commits into from

sriram-30
Contributor

@sriram-30 sriram-30 commented May 20, 2025

This PR contains the proto changes that will support the latest KMM image once it is released publicly in v1.3.1.
It should therefore be merged only after the v1.3.1 GPU Operator and our v1.3.1 KMM are both out.

yansun1996 and others added 30 commits May 8, 2025 16:47
* fix: Only remove node labeller managed labels

The reconciler currently removes all labels with the amd.com and beta.amd.com prefix on nodes during cleanup. This is overly aggressive and can delete labels added by other users or systems.

This commit corrects the behavior to only remove labels that are specifically managed by node labeller, ensuring that only relevant labels are automatically cleaned up.
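
A minimal sketch of the narrower cleanup described above, assuming the set of managed keys is known up front. The function name `removeManagedLabels`, the `managedLabels` set, and the label keys in it are hypothetical illustrations, not the operator's actual code. The point is that the filter matches an explicit set of keys rather than the `amd.com` / `beta.amd.com` prefix.

```go
package main

import (
	"fmt"
	"sort"
)

// managedLabels is a hypothetical allow-list of label keys owned by
// node labeller. The real set lives in the operator and may differ.
var managedLabels = map[string]struct{}{
	"amd.com/gpu.family":      {},
	"beta.amd.com/gpu.family": {},
}

// removeManagedLabels returns a copy of labels without the keys listed in
// managed, plus the sorted list of keys that were dropped. Keys that merely
// share the amd.com prefix but are not in managed are left untouched.
func removeManagedLabels(labels map[string]string, managed map[string]struct{}) (map[string]string, []string) {
	kept := make(map[string]string, len(labels))
	var removed []string
	for k, v := range labels {
		if _, ok := managed[k]; ok {
			removed = append(removed, k)
			continue
		}
		kept[k] = v
	}
	sort.Strings(removed)
	return kept, removed
}

func main() {
	nodeLabels := map[string]string{
		"amd.com/gpu.family":     "AI",       // managed, should be removed
		"amd.com/team":           "ml-infra", // added by another user, must survive
		"kubernetes.io/hostname": "node-1",
	}
	kept, removed := removeManagedLabels(nodeLabels, managedLabels)
	fmt.Println("kept:", kept)
	fmt.Println("removed:", removed)
}
```

A prefix check would also have deleted `amd.com/team` here; the explicit set keeps it.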

* test: Add more unit test cases for label cleanup modification
Argo turns pre-upgrade hooks into pre-sync hooks, which means you cannot
even install, as said hook relies on CRDs, service accounts, etc., which
aren't installed until after the hook executes. Make the pre-upgrade
hook more tolerant by not doing anything if the CRD isn't installed.
…cheduling

Amend some details for nodeAffinity preferredDuringSchedulingIgnoredDuringExecution
…) (#410) (#411)

(cherry picked from commit b2154e2a62687de7faea8f037c0896eb18cd5e7b)

Co-authored-by: Titus Ou <[email protected]>
(cherry picked from commit aab02bfbd0bb585e5fdc770df6058467e859cd38)


(cherry picked from commit 740d293eab7a307c88383c65e02358713e5de025)

Co-authored-by: Titus Ou <[email protected]>
(cherry picked from commit 5f6edff5dd55afcf61e007473af6ec82b82ba741)

Co-authored-by: Titus Ou <[email protected]>
(cherry picked from commit 8f67ad10b875f2d2b73880b77c0642383398f8aa)


(cherry picked from commit a959f23a3cfa5319f07b34bb61e19ca934f9935f)

Co-authored-by: Titus Ou <[email protected]>
* Prometheus Integration support in GPU Operator (#594)

* Prometheus Integration: CRD Additions and Vendoring

- Adds CRD fields to support ServiceMonitorConfig. This change does not
  include TLS or Auth support in the CRD.
- Vendor prometheus-operator monitoring APIs

* Add validation for new CRD fields in ServiceMonitorConfig

- kubebuilder validation for interval
- verify ServiceMonitor CRD in the cluster if enabled in DeviceConfig

* Deploy ServiceMonitor objects in Operator

* Bump controller-gen version to 0.17.0

* Add ServiceMonitor, APIExtension CRUD permissions to Operator SA

- The ServiceAccount attached to the Operator needs elevated permissions
  to perform CRUD operations on the K8s APIExtension and CoreOS Monitoring
  API groups, so that it can read installed CRDs and install/delete
  ServiceMonitor objects.

* Refactor code, address review comments

* Add TLS/Auth sections to Kube rbac proxy and ServiceMonitorConfig

---------

Co-authored-by: Nitish Bhat <[email protected]>

* Handle ServiceMonitor CRD not found error (#609)

- When ServiceMonitor CRD (monitoringv1) is not found, the error returned
  is a NoMatchError. There's nothing to delete when we see this error, so
  we have to handle it gracefully.

Co-authored-by: Nitish Bhat <[email protected]>

* Add monitoringv1.ServiceMonitor Patch RBAC Permission to GPU Operator

---------

Co-authored-by: Nitish Bhat <[email protected]>
…04) (#607)

(cherry picked from commit db1ee89cc0c8911cff473536953c9615d42629f6)

Co-authored-by: Nitish Bhat <[email protected]>
Co-authored-by: Nitish Bhat <[email protected]>
…ict (#632) (#633)

Co-authored-by: Nitish Bhat <[email protected]>
(cherry picked from commit 5a4fe675365227a818b8e2deb54fb8db3f93407d)

Co-authored-by: Nitish Bhat <[email protected]>
sriram-30 and others added 5 commits May 14, 2025 09:46
* Add default DeviceConfig CR for Helm Chart

* Fix helm chart pre-upgrade hook to support Argo deployment (#611)

* Optimize default CR's default values

* Add e2e test for helm chart

* Address comment
@sriram-30 sriram-30 marked this pull request as ready for review May 21, 2025 06:51
@sriram-30 sriram-30 changed the title dcm taint toleration from GPU Operator to KMM Operator dcm taint toleration from GPU Operator to KMM Operator [DO NOT MERGE] May 21, 2025
@sriram-30 sriram-30 changed the title dcm taint toleration from GPU Operator to KMM Operator [DO NOT MERGE] dcm taint toleration from GPU Operator to KMM Operator May 22, 2025
@sriram-30
Contributor Author

Not sure why, but this PR has a lot of other commits pulled in by someone. This happened once before as well. Closing it; will re-open.

@sriram-30 sriram-30 closed this Jun 5, 2025
7 participants