diff --git a/OWNERS_ALIASES b/OWNERS_ALIASES index eac30035171..061142f13b7 100644 --- a/OWNERS_ALIASES +++ b/OWNERS_ALIASES @@ -142,6 +142,11 @@ aliases: - jeremyrickard - liggitt - micahhausler + wg-node-lifecycle-leads: + - atiratree + - fabriziopandini + - humblec + - rthallisey wg-policy-leads: - JimBugwadia - poonam-lamba diff --git a/communication/slack-config/channels.yaml b/communication/slack-config/channels.yaml index be7dfe88d44..805b9fbd123 100644 --- a/communication/slack-config/channels.yaml +++ b/communication/slack-config/channels.yaml @@ -584,6 +584,7 @@ channels: - name: wg-multitenancy - name: wg-naming archived: true + - name: wg-node-lifecycle - name: wg-onprem archived: true - name: wg-policy diff --git a/liaisons.md b/liaisons.md index 42a3c54f5b6..f43a7a205b5 100644 --- a/liaisons.md +++ b/liaisons.md @@ -59,6 +59,7 @@ members will assume one of the departing members groups. | [WG Device Management](wg-device-management/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) | | [WG etcd Operator](wg-etcd-operator/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) | | [WG LTS](wg-lts/README.md) | Sascha Grunert (**[@saschagrunert](https://github.com/saschagrunert)**) | +| [WG Node Lifecycle](wg-node-lifecycle/README.md) | TBD (**[@TBD](https://github.com/TBD)**) | | [WG Policy](wg-policy/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) | | [WG Serving](wg-serving/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) | | [WG Structured Logging](wg-structured-logging/README.md) | Sascha Grunert (**[@saschagrunert](https://github.com/saschagrunert)**) | diff --git a/sig-apps/README.md b/sig-apps/README.md index ba5e073d7b3..fa2e645ea70 100644 --- a/sig-apps/README.md +++ b/sig-apps/README.md @@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. 
The following [working groups][working-group-definition] are sponsored by sig-apps: * [WG Batch](/wg-batch) * [WG Data Protection](/wg-data-protection) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-architecture/README.md b/sig-architecture/README.md index 2013d2b772d..649a7128794 100644 --- a/sig-architecture/README.md +++ b/sig-architecture/README.md @@ -58,6 +58,7 @@ The Chairs of the SIG run operations and processes governing the SIG. The following [working groups][working-group-definition] are sponsored by sig-architecture: * [WG Device Management](/wg-device-management) * [WG LTS](/wg-lts) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Policy](/wg-policy) * [WG Serving](/wg-serving) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-autoscaling/README.md b/sig-autoscaling/README.md index 79d95480628..6c5a132ded9 100644 --- a/sig-autoscaling/README.md +++ b/sig-autoscaling/README.md @@ -48,6 +48,7 @@ The Chairs of the SIG run operations and processes governing the SIG. The following [working groups][working-group-definition] are sponsored by sig-autoscaling: * [WG Batch](/wg-batch) * [WG Device Management](/wg-device-management) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Serving](/wg-serving) diff --git a/sig-cli/README.md b/sig-cli/README.md index f28cac88771..3fe661cb7ea 100644 --- a/sig-cli/README.md +++ b/sig-cli/README.md @@ -60,6 +60,12 @@ subprojects, and resolve cross-subproject technical issues and decisions. 
- [@kubernetes/sig-cli-test-failures](https://github.com/orgs/kubernetes/teams/sig-cli-test-failures) - Test Failures and Triage - Steering Committee Liaison: Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**) +## Working Groups + +The following [working groups][working-group-definition] are sponsored by sig-cli: +* [WG Node Lifecycle](/wg-node-lifecycle) + + ## Subprojects The following [subprojects][subproject-definition] are owned by sig-cli: diff --git a/sig-cloud-provider/README.md b/sig-cloud-provider/README.md index dceabfd51ae..d694a1abce4 100644 --- a/sig-cloud-provider/README.md +++ b/sig-cloud-provider/README.md @@ -58,6 +58,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. ## Working Groups The following [working groups][working-group-definition] are sponsored by sig-cloud-provider: +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-cluster-lifecycle/README.md b/sig-cluster-lifecycle/README.md index afc4e9a431f..aeb59b569cb 100644 --- a/sig-cluster-lifecycle/README.md +++ b/sig-cluster-lifecycle/README.md @@ -52,6 +52,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-cluster-lifecycle: * [WG LTS](/wg-lts) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG etcd Operator](/wg-etcd-operator) diff --git a/sig-list.md b/sig-list.md index a45672f9536..fb47bd3e591 100644 --- a/sig-list.md +++ b/sig-list.md @@ -66,6 +66,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md) |[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture
* Autoscaling
* Network
* Node
* Scheduling
|* [John Belamaric](https://github.com/johnbelamaric), Google
* [Kevin Klues](https://github.com/klueska), NVIDIA
* [Patrick Ohly](https://github.com/pohly), Intel
|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting: [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](TBD)
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle
* etcd
|* [Benjamin Wang](https://github.com/ahrtr), VMware
* [Ciprian Hacman](https://github.com/hakman), Microsoft
* [Josh Berkus](https://github.com/jberkus), Red Hat
* [James Blair](https://github.com/jmhbnz), Red Hat
* [Justin Santa Barbara](https://github.com/justinsb), Google
|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)
|[LTS](wg-lts/README.md)|[lts](https://github.com/kubernetes/kubernetes/labels/wg%2Flts)|* Architecture
* Cluster Lifecycle
* K8s Infra
* Release
* Security
* Testing
|* [Jeremy Rickard](https://github.com/jeremyrickard), Microsoft
* [Jordan Liggitt](https://github.com/liggitt), Google
* [Micah Hausler](https://github.com/micahhausler), Amazon
|* [Slack](https://kubernetes.slack.com/messages/wg-lts)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-lts)|* Regular WG Meeting: [Tuesdays at 07:00 PT (Pacific Time) (biweekly)](https://zoom.us/j/92480197536?pwd=dmtSMGJRQmNYYTIyZkFlQ25JRngrdz09)
+|[Node Lifecycle](wg-node-lifecycle/README.md)|[node-lifecycle](https://github.com/kubernetes/kubernetes/labels/wg%2Fnode-lifecycle)|* Apps
* Architecture
* Autoscaling
* CLI
* Cloud Provider
* Cluster Lifecycle
* Network
* Node
* Scheduling
* Storage
|* [Filip Křepinský](https://github.com/atiratree), Red Hat
* [Fabrizio Pandini](https://github.com/fabriziopandini), VMware
* [Humble Chirammal](https://github.com/humblec), VMware
* [Ryan Hallisey](https://github.com/rthallisey), NVIDIA
|* [Slack](https://kubernetes.slack.com/messages/wg-node-lifecycle)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle)|* WG Node Lifecycle Weekly Meeting: [TBDs at TBD TBD (weekly)]()
|[Policy](wg-policy/README.md)|[policy](https://github.com/kubernetes/kubernetes/labels/wg%2Fpolicy)|* Architecture
* Auth
* Multicluster
* Network
* Node
* Scheduling
* Storage
|* [Jim Bugwadia](https://github.com/JimBugwadia), Kyverno/Nirmata
* [Poonam Lamba](https://github.com/poonam-lamba), Google
* [Andy Suderman](https://github.com/sudermanjr), Fairwinds
|* [Slack](https://kubernetes.slack.com/messages/wg-policy)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-policy)|* Regular WG Meeting: [Wednesdays at 8:00 PT (Pacific Time) (semimonthly)](https://zoom.us/j/7375677271)
|[Serving](wg-serving/README.md)|[serving](https://github.com/kubernetes/kubernetes/labels/wg%2Fserving)|* Apps
* Architecture
* Autoscaling
* Instrumentation
* Network
* Node
* Scheduling
* Storage
|* [Eduardo Arango](https://github.com/ArangoGutierrez), NVIDIA
* [Jiaxin Shan](https://github.com/Jeffwan), Bytedance
* [Sergey Kanzhelev](https://github.com/SergeyKanzhelev), Google
* [Yuan Tang](https://github.com/terrytangyuan), Red Hat
|* [Slack](https://kubernetes.slack.com/messages/wg-serving)
* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-serving)|* WG Serving Weekly Meeting ([calendar](https://calendar.google.com/calendar/embed?src=e896b769743f3877edfab2d4c6a14132b2aa53287021e9bbf113cab676da54ba%40group.calendar.google.com)): [Wednesdays at 9:00 PT (Pacific Time) (weekly)](https://zoom.us/j/92615874244?pwd=VGhxZlJjRTNRWTZIS0dQV2MrZUJ5dz09)
|[Structured Logging](wg-structured-logging/README.md)|[structured-logging](https://github.com/kubernetes/kubernetes/labels/wg%2Fstructured-logging)|* API Machinery
* Architecture
* Cloud Provider
* Instrumentation
* Network
* Node
* Scheduling
* Storage
|* [Mengjiao Liu](https://github.com/mengjiao-liu), Independent
* [Patrick Ohly](https://github.com/pohly), Intel
|* [Slack](https://kubernetes.slack.com/messages/wg-structured-logging)
* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-structured-logging)| diff --git a/sig-network/README.md b/sig-network/README.md index 494bc7a0866..09c4ea1c830 100644 --- a/sig-network/README.md +++ b/sig-network/README.md @@ -70,6 +70,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-network: * [WG Device Management](/wg-device-management) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Policy](/wg-policy) * [WG Serving](/wg-serving) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-node/README.md b/sig-node/README.md index fdc411e48e9..3c5e7539833 100644 --- a/sig-node/README.md +++ b/sig-node/README.md @@ -55,6 +55,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-node: * [WG Batch](/wg-batch) * [WG Device Management](/wg-device-management) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Policy](/wg-policy) * [WG Serving](/wg-serving) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-scheduling/README.md b/sig-scheduling/README.md index b760a57182f..d667b17df1f 100644 --- a/sig-scheduling/README.md +++ b/sig-scheduling/README.md @@ -67,6 +67,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. The following [working groups][working-group-definition] are sponsored by sig-scheduling: * [WG Batch](/wg-batch) * [WG Device Management](/wg-device-management) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Policy](/wg-policy) * [WG Serving](/wg-serving) * [WG Structured Logging](/wg-structured-logging) diff --git a/sig-storage/README.md b/sig-storage/README.md index 9847e62f299..ba854e7b8e2 100644 --- a/sig-storage/README.md +++ b/sig-storage/README.md @@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions. 
The following [working groups][working-group-definition] are sponsored by sig-storage: * [WG Data Protection](/wg-data-protection) +* [WG Node Lifecycle](/wg-node-lifecycle) * [WG Policy](/wg-policy) * [WG Serving](/wg-serving) * [WG Structured Logging](/wg-structured-logging) diff --git a/sigs.yaml index 8417ef70d90..7de0bb5c5a9 100644 --- a/sigs.yaml +++ b/sigs.yaml @@ -3697,6 +3697,58 @@ workinggroups: liaison: github: saschagrunert name: Sascha Grunert +- dir: wg-node-lifecycle + name: Node Lifecycle + mission_statement: > + Explore and improve node and pod lifecycle in Kubernetes. This should result in + better node drain/maintenance support and better pod disruption/termination. It + should also improve node and pod autoscaling, application migration and + availability, load balancing, de/scheduling, node shutdown, cloud provider integrations, + and support other new scenarios and integrations. + + charter_link: charter.md + stakeholder_sigs: + - Apps + - Architecture + - Autoscaling + - CLI + - Cloud Provider + - Cluster Lifecycle + - Network + - Node + - Scheduling + - Storage + label: node-lifecycle + leadership: + chairs: + - github: atiratree + name: Filip Křepinský + company: Red Hat + email: atiratree@gmail.com + - github: fabriziopandini + name: Fabrizio Pandini + company: VMware + email: fabrizio.pandini@gmail.com + - github: humblec + name: Humble Chirammal + company: VMware + email: humble.devassy@gmail.com + - github: rthallisey + name: Ryan Hallisey + company: NVIDIA + email: rhallisey@nvidia.com + meetings: + - description: WG Node Lifecycle Weekly Meeting + day: TBD + time: TBD + tz: TBD + frequency: weekly + contact: + slack: wg-node-lifecycle + mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle + liaison: + github: TBD + name: TBD - dir: wg-policy + name: Policy + mission_statement: > diff --git a/wg-node-lifecycle/OWNERS new file mode 100644 index 00000000000..1a6563e77fe ---
/dev/null +++ b/wg-node-lifecycle/OWNERS @@ -0,0 +1,8 @@ +# See the OWNERS docs at https://go.k8s.io/owners + +reviewers: + - wg-node-lifecycle-leads +approvers: + - wg-node-lifecycle-leads +labels: + - wg/node-lifecycle diff --git a/wg-node-lifecycle/README.md new file mode 100644 index 00000000000..919d42269d0 --- /dev/null +++ b/wg-node-lifecycle/README.md @@ -0,0 +1,45 @@ + +# Node Lifecycle Working Group + +Explore and improve node and pod lifecycle in Kubernetes. This should result in better node drain/maintenance support and better pod disruption/termination. It should also improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node shutdown, cloud provider integrations, and support other new scenarios and integrations. + +The [charter](charter.md) defines the scope and governance of the Node Lifecycle Working Group. + +## Stakeholder SIGs +* [SIG Apps](/sig-apps) +* [SIG Architecture](/sig-architecture) +* [SIG Autoscaling](/sig-autoscaling) +* [SIG CLI](/sig-cli) +* [SIG Cloud Provider](/sig-cloud-provider) +* [SIG Cluster Lifecycle](/sig-cluster-lifecycle) +* [SIG Network](/sig-network) +* [SIG Node](/sig-node) +* [SIG Scheduling](/sig-scheduling) +* [SIG Storage](/sig-storage) + +## Meetings +*Joining the [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle) for the group will typically add invites for the following meetings to your calendar.* +* WG Node Lifecycle Weekly Meeting: [TBDs at TBD TBD]() (weekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=TBD&tz=TBD).
+ +## Organizers + +* Filip Křepinský (**[@atiratree](https://github.com/atiratree)**), Red Hat +* Fabrizio Pandini (**[@fabriziopandini](https://github.com/fabriziopandini)**), VMware +* Humble Chirammal (**[@humblec](https://github.com/humblec)**), VMware +* Ryan Hallisey (**[@rthallisey](https://github.com/rthallisey)**), NVIDIA + +## Contact +- Slack: [#wg-node-lifecycle](https://kubernetes.slack.com/messages/wg-node-lifecycle) +- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle) +- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fnode-lifecycle) +- Steering Committee Liaison: TBD (**[@TBD](https://github.com/TBD)**) + + + diff --git a/wg-node-lifecycle/charter.md b/wg-node-lifecycle/charter.md new file mode 100644 index 00000000000..90671be8361 --- /dev/null +++ b/wg-node-lifecycle/charter.md @@ -0,0 +1,163 @@ +# WG Node Lifecycle Charter + +This charter adheres to the conventions described in the [Kubernetes Charter README] and uses +the Roles and Organization Management outlined in [wg-governance]. + +[Kubernetes Charter README]: /committee-steering/governance/README.md + +## Scope + +The Kubernetes ecosystem currently faces challenges in node maintenance scenarios, with multiple +projects independently addressing similar issues. The goal of this working group is to develop +unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across +projects and addressing scenarios that impede node drain or cause improper pod termination. Our +objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with +existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to +support advanced use cases across the ecosystem. + +To properly solve the node drain, we must first understand the node lifecycle. 
This includes +provisioning/sunsetting of the nodes, PodDisruptionBudgets, API-initiated eviction and node +shutdown. This then impacts both the node and pod autoscaling, de/scheduling, load balancing, and +the applications running in the cluster. All of these areas have issues and would benefit from a +unified approach. + +### In scope + +- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs + and extending the current ones. This includes exploring extensions to or interactions with the Node + object. +- Analyze the node lifecycle, the Node API, and possible interactions. We want to explore augmenting + the Node API to expose additional state or status in order to coalesce other core Kubernetes and + community APIs around node lifecycle management. +- Improve the disruption model that is currently implemented by the API-initiated Eviction API and PDBs. + Improve the descheduling, availability and migration capabilities of today's application + workloads. Also explore the interactions with other eviction mechanisms. +- Coordinate pod termination and issues around de/scheduling, preemption and eviction. +- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle. + This includes graduating the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000) + feature to GA and resolving the associated node shutdown issues. +- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and + the new disruption model/evictions. This includes balancing pods according to scheduling + constraints. +- Consider improving the pod lifecycle of DaemonSets and static pods during node maintenance. +- Explore the cloud provider use cases and how they can hook into the node lifecycle, so that + users can use the same APIs or configurations across the board.
+- Migrate users of the eviction-based kubectl-like drain (kubectl, cluster autoscaler, karpenter, + ...) and other scenarios to use the new unified node draining approach. +- Explore the possible reasons why a node was terminated/drained/killed and how to + track and react to each of them. Consider past discussions/historical perspective + (e.g. "tombstones"). + +### Out of scope + +- Implementing cloud-provider-specific logic; the goal is to have a high-level API that providers + can use, hook into, or extend. +- Infrastructure provisioning/deprovisioning solutions or physical infrastructure lifecycle + management solutions. + +## Stakeholders + +- SIG Apps +- SIG Architecture +- SIG Autoscaling +- SIG CLI +- SIG Cloud Provider +- SIG Cluster Lifecycle +- SIG Network +- SIG Node +- SIG Scheduling +- SIG Storage + +Stakeholders span from multiple SIGs to a broad set of end users, +public and private cloud providers, Kubernetes distribution providers, +and cloud provider end-users. Here are some user stories: + +- As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without + any required manual interventions. I also want to be able to observe the node drain via the API + and check on its progress. I also want to be able to discover workloads that are blocking the node + drain. +- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet, + and other actors want to use a new eviction API to gracefully remove pods. This would enable new + migration strategies that prefer to surge (upscale) pods first rather than downscale them. It + would also allow other users/components to monitor pods that are gracefully removed/terminated + and provide better behaviour in terms of de/scheduling, scaling and availability.
+- As a cluster admin, I want to be able to perform arbitrary actions after the node drain is + complete, such as resetting GPU drivers, resetting NICs, performing software updates or shutting + down the machine. +- As an end user, I would like more alternatives to blue-green upgrades, especially with special + hardware accelerators; it's far too expensive. I would like to choose a strategy on how to + coordinate the node drain and the upgrade to achieve better cost-effectiveness. +- As a cloud provider, I need to perform regular maintenance on the hardware in my fleet. Enhancing + Kubernetes to help CSPs safely remove hardware will reduce operational costs. +- The cost of doing accelerator maintenance in today's world can be massive. And since + hardware accelerators tend to need more love and care, having software support to coordinate + maintenance will reduce operational costs. +- As a cluster admin, I would like to use a mixture of on-demand and temporary spot instances in my + clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for + nodes will improve cluster stability in scenarios where instances may be terminated by the cloud + provider due to cost-related thresholds. +- As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with + accelerators) and either prevent termination altogether or have a reliable migration path. + Features like `terminationGracePeriodSeconds` are not sufficient as the termination/migration can + take hours if not days. +- As a user, I want my application to finish all network and storage operations before terminating a + pod. This includes closing pod connections, removing pods from endpoints, flushing cached writes + to the underlying storage and completing storage cleanup routines. + +## Deliverables + +The WG will coordinate requirement gathering and design, eventually leading to +KEP(s) and code associated with the ideas.
+ +Areas we expect to explore: + +- An API to express node drain/maintenance. + Currently tracked in https://github.com/kubernetes/enhancements/issues/4212. +- An API to solve the problems with the API-initiated Eviction API and PDBs. + Currently tracked in https://github.com/kubernetes/enhancements/issues/4563. +- An API/mechanism to gracefully terminate pods during a node shutdown. + Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000. +- An API to deschedule pods that use DRA devices. + DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055. +- An API to remove pods from endpoints before they terminate. + Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y. +- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve + a wide range of issues. + +We expect to provide reference implementations of the new APIs, including but not limited to +controllers, API validation, integration with existing core components and extension points for the +ecosystem. This should be accompanied by E2E / Conformance tests.
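For background on the Eviction/PDB deliverable, this is what today's `policy/v1` PodDisruptionBudget looks like — the API whose limitations (for example, no notion of an ongoing node maintenance and no graceful migration path) motivate the work tracked above. A minimal sketch; the `my-app` and `my-app-pdb` names are illustrative only:

```yaml
# Illustrative only: a minimal policy/v1 PodDisruptionBudget. Today this is the
# main knob applications have for limiting voluntary disruptions such as an
# eviction-based node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: default
spec:
  minAvailable: 2        # keep at least 2 of the selected pods running
  selector:
    matchLabels:
      app: my-app
```

API-initiated eviction (e.g. `kubectl drain`) is refused while deleting a pod would violate this budget, which is one of the node-drain blocking scenarios described in the scope above.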
+ +## Relevant Projects + +This is a list of known projects that solve similar problems in the ecosystem or would benefit from +the efforts of this WG: + +- https://github.com/aws/aws-node-termination-handler +- https://github.com/foriequal0/pod-graceful-drain +- https://github.com/jukie/karpenter-deprovision-controller +- https://github.com/kubereboot/kured +- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler +- https://github.com/kubernetes-sigs/karpenter +- https://github.com/kubevirt/kubevirt +- https://github.com/medik8s/node-maintenance-operator +- https://github.com/Mellanox/maintenance-operator +- https://github.com/openshift/machine-config-operator +- https://github.com/planetlabs/draino +- https://github.com/strimzi/drain-cleaner + +There are also internal custom solutions that companies use. + +## Roles and Organization Management + +This WG adheres to the Roles and Organization Management outlined in [wg-governance] +and opts-in to updates and modifications to [wg-governance]. + +[wg-governance]: /committee-steering/governance/wg-governance.md + +## Timelines and Disbanding + +The working group will disband once the features and core APIs defined in the KEPs have reached a +stable state (GA) and ongoing maintenance ownership is established within the relevant SIGs. We will +review whether the working group should disband if appropriate SIG ownership +can't be reached.