CA DRA: handle device taints and tolerations (KEP-5055) #7947

Open
x13n opened this issue Mar 19, 2025 · 3 comments
Labels
area/cluster-autoscaler, area/core-autoscaler, kind/feature, wg/device-management

Comments

@x13n
Member

x13n commented Mar 19, 2025

Which component are you using?:

/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

KEP-5055 adds support for device taints to DRA. Individual Devices exposed in ResourceSlices can now be tainted much like Nodes are today (with tolerations, taint-based eviction, etc.). The feature is behind its own feature gate and went to alpha in Kubernetes 1.33.

As part of the KEP, admins can now create patch objects (DeviceTaintRule) that automatically add taints to all Devices matching certain conditions. Cluster Autoscaler needs to apply these patches to ResourceSlices before exposing them to the DRA scheduler plugin via the scheduler framework.

Describe the solution you'd like.:

We should update

func (s snapshotSliceLister) List() ([]*resourceapi.ResourceSlice, error) {
to apply the taint patches before returning ResourceSlices. Ideally we would reuse k8s.io/dynamic-resource-allocation/resourceslice/tracker if possible. We need to make sure that the patches are applied to the fake ResourceSlices created by CA as well, before they're used for anything else.
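
The patching step described above might look roughly like the following. This is a minimal sketch, not the Cluster Autoscaler or `resourceslice/tracker` implementation: `ResourceSlice`, `Device`, `DeviceTaint` and `DeviceTaintRule` are simplified stand-ins for the real API types, and `SliceLister`, `taintingLister`, `staticLister` and `applyRules` are hypothetical names introduced only for illustration. The selector is reduced to exact matches on driver, pool and device name.

```go
package main

import "fmt"

// Simplified stand-ins for the resource.k8s.io API types.
// These are assumptions for illustration, not the real definitions.
type DeviceTaint struct {
	Key, Value, Effect string
}

type Device struct {
	Name   string
	Taints []DeviceTaint
}

type ResourceSlice struct {
	Driver, Pool string
	Devices      []Device
}

// DeviceTaintRule taints every device matching its selector fields.
// A nil selector field means "match any value".
type DeviceTaintRule struct {
	Driver, Pool, Device *string
	Taint                DeviceTaint
}

func matches(sel *string, val string) bool { return sel == nil || *sel == val }

// SliceLister is a hypothetical interface with the same shape as
// snapshotSliceLister.List.
type SliceLister interface {
	List() ([]*ResourceSlice, error)
}

// taintingLister wraps another lister and returns patched copies.
type taintingLister struct {
	inner SliceLister
	rules []DeviceTaintRule
}

func (l taintingLister) List() ([]*ResourceSlice, error) {
	slices, err := l.inner.List()
	if err != nil {
		return nil, err
	}
	out := make([]*ResourceSlice, 0, len(slices))
	for _, s := range slices {
		out = append(out, applyRules(s, l.rules))
	}
	return out, nil
}

// applyRules returns a copy of s with matching rule taints appended.
// The input slice is left untouched so snapshot state is not mutated.
func applyRules(s *ResourceSlice, rules []DeviceTaintRule) *ResourceSlice {
	patched := &ResourceSlice{
		Driver:  s.Driver,
		Pool:    s.Pool,
		Devices: make([]Device, len(s.Devices)),
	}
	for i, d := range s.Devices {
		nd := Device{Name: d.Name, Taints: append([]DeviceTaint(nil), d.Taints...)}
		for _, r := range rules {
			if matches(r.Driver, s.Driver) && matches(r.Pool, s.Pool) && matches(r.Device, d.Name) {
				nd.Taints = append(nd.Taints, r.Taint)
			}
		}
		patched.Devices[i] = nd
	}
	return patched
}

// staticLister is a trivial in-memory lister used for the demo.
type staticLister []*ResourceSlice

func (l staticLister) List() ([]*ResourceSlice, error) { return l, nil }

func main() {
	driver := "gpu.example.com"
	base := staticLister{
		{Driver: driver, Pool: "node-a", Devices: []Device{{Name: "gpu-0"}}},
		{Driver: "nic.example.com", Pool: "node-a", Devices: []Device{{Name: "nic-0"}}},
	}
	rule := DeviceTaintRule{
		Driver: &driver,
		Taint:  DeviceTaint{Key: "example.com/maintenance", Effect: "NoSchedule"},
	}
	lister := taintingLister{inner: base, rules: []DeviceTaintRule{rule}}
	slices, err := lister.List()
	if err != nil {
		panic(err)
	}
	for _, s := range slices {
		for _, d := range s.Devices {
			fmt.Printf("%s/%s: %d taint(s)\n", s.Driver, d.Name, len(d.Taints))
		}
	}
}
```

Wrapping the lister means the fake ResourceSlices created by CA would go through the same patching path, as long as they are served by the wrapped lister. Returning copies rather than patching in place keeps the underlying snapshot unchanged.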

Furthermore, we should extend the existing Node taint handling to Device taints:

  • Device taints should be filtered from template NodeInfos using the same logic as Node taints:
    • Well-known transient taints are filtered out.
    • Users can configure startup and status taints which are filtered out.
  • Device taints applied by DeviceTaintRules shouldn't be filtered out from template NodeInfos if we can detect that they will apply to the new Node (e.g. if they apply to all Devices from a driver, regardless of the pool or other attributes).
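
The filtering rules in the list above could be sketched as follows. This is an illustration under stated assumptions, not existing Cluster Autoscaler code: `SourcedTaint`, `TaintRule`, `FilterConfig`, `appliesToNewNode` and `TemplateDeviceTaints` are hypothetical names, the taint keys are made up, and "will apply to the new Node" is approximated as "the rule selects by driver only". Rule-derived taints that fail this check are dropped here, which is one possible interpretation.

```go
package main

import "fmt"

// DeviceTaint is a simplified stand-in for the real API type (assumption).
type DeviceTaint struct{ Key, Value, Effect string }

// TaintRule is a hypothetical, reduced view of a DeviceTaintRule selector.
// A nil field means "match any value".
type TaintRule struct {
	Driver, Pool, Device *string
}

// SourcedTaint pairs a taint with the rule that added it.
// Rule is nil when the taint came from the ResourceSlice itself.
type SourcedTaint struct {
	Taint DeviceTaint
	Rule  *TaintRule
}

// FilterConfig lists taint keys to strip from templates, mirroring the
// categories used for Node taints: transient, startup and status.
type FilterConfig struct {
	TransientKeys, StartupKeys, StatusKeys map[string]bool
}

func (c FilterConfig) drops(key string) bool {
	return c.TransientKeys[key] || c.StartupKeys[key] || c.StatusKeys[key]
}

// appliesToNewNode reports whether a rule is driver-wide, meaning it does
// not depend on a pool or device name that a new Node would not share.
func appliesToNewNode(r *TaintRule) bool {
	return r.Driver != nil && r.Pool == nil && r.Device == nil
}

// TemplateDeviceTaints returns the taints that should remain on a device
// in a template NodeInfo.
func TemplateDeviceTaints(in []SourcedTaint, cfg FilterConfig) []DeviceTaint {
	var out []DeviceTaint
	for _, st := range in {
		switch {
		case st.Rule != nil:
			// Rule-derived taint: keep only if the rule would also match
			// devices on the new Node (assumption: driver-wide rules).
			if appliesToNewNode(st.Rule) {
				out = append(out, st.Taint)
			}
		case !cfg.drops(st.Taint.Key):
			// Taint published in the slice itself: apply the key filter.
			out = append(out, st.Taint)
		}
	}
	return out
}

func main() {
	driver, pool := "gpu.example.com", "node-a"
	cfg := FilterConfig{StartupKeys: map[string]bool{"example.com/initializing": true}}
	taints := []SourcedTaint{
		{Taint: DeviceTaint{Key: "example.com/initializing", Effect: "NoSchedule"}},
		{Taint: DeviceTaint{Key: "example.com/degraded", Effect: "NoSchedule"}},
		{Taint: DeviceTaint{Key: "example.com/recall", Effect: "NoSchedule"}, Rule: &TaintRule{Driver: &driver}},
		{Taint: DeviceTaint{Key: "example.com/repair", Effect: "NoSchedule"}, Rule: &TaintRule{Driver: &driver, Pool: &pool}},
	}
	for _, t := range TemplateDeviceTaints(taints, cfg) {
		fmt.Println(t.Key)
	}
}
```

Tracking where each taint came from is the key design choice in this sketch: without the rule reference, a rule-derived taint is indistinguishable from one published by the driver, and the second bullet cannot be implemented.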

Additional context.:

@x13n added the kind/feature label on Mar 19, 2025
@x13n
Member Author

x13n commented Mar 19, 2025

/cc @towca

@towca added the wg/device-management and area/core-autoscaler labels on Apr 22, 2025
@towca changed the title from "Update CA to handle KEP 5055" to "CA DRA: handle device taints and tolerations (KEP-5055)" on Apr 24, 2025
@MenD32

MenD32 commented Apr 30, 2025

I started working on this. It requires upgrading the Kubernetes packages from v0.33.0.alpha to v0.33.0, which in turn requires upgrading Go to v1.24.0. This could affect many other parts of CA. Should the upgrade be a separate chore-type issue/PR?

@x13n
Member Author

x13n commented May 5, 2025

Yes, bumping deps is separate. I tried to do that in #7919, but many tests are failing and I haven't had time to get back to it yet.
