Machine fails to finish draining/volume detachment after successful completion #11591

Danil-Grigorev · 2024-12-17T18:15:46Z

What steps did you take and what happened?

After upgrading CAPI to 1.9 we observed an issue with CAPRKE2 provider.

RKE2 uses kubelet local mode by default, so etcd membership management logic behaves as in k/k 1.32 in Kubeadm.
The problem causes loss of API server access after etcd member is removed, leading to inability to proceed with infrastructure machine deletion.

The issue is that in rke2 deployments, kubelet is configured to use local api server (127.0.0.1:443), which in turn relies on local etcd pod. But as this node is removed from etcd cluster, kubelet won't be able to reach the API any more, and it will fail to properly drain the node as all pods will remain stuck in Terminating state from kubernetes perspective.

Logs from the cluster:

12:21:45.068153       1 recorder.go:104] "success waiting for node volumes detaching Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\"" logger="events" type="Normal" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2116"} reason="NodeVolumesDetached"
12:21:56.066942       1 recorder.go:104] "error waiting for node volumes detaching, Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\": failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" logger="events" type="Warning" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2162"} reason="FailedWaitForVolumeDetach"
12:21:56.087814       1 controller.go:316] "Reconciler error" err="failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="create-workload-cluster-s51eu2/caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" namespace="create-workload-cluster-s51eu2" name="caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" reconcileID="960a3889-d9b7-41a3-92c4-63f438b0c980"

What did you expect to happen?

Draining and Volume detachment to succeed, and machine get deleted without issues.

Cluster API version

v1.9.0

Kubernetes version

v1.29.2 - management
v1.31.0 - workload

Anything else you would like to add?

Logs from CI run with all details: https://github.com/rancher/cluster-api-provider-rke2/actions/runs/12372669685/artifacts/2332172988

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

The text was updated successfully, but these errors were encountered:

chrischdi · 2024-12-17T19:02:11Z

Questions:

is this using KCP? (I guess no?)
v1.29.2 Is only the management cluster, right? WL cluster is somewhat >= v1.31.

Danil-Grigorev · 2024-12-18T10:07:27Z

It is using RKE2 as a bootstrap provider, workload cluster is 1.31.0. I opened a PR, which from what I could see followed up with machine deletion.

chrischdi · 2024-12-20T05:38:14Z

Just to mention it: until we have a proper fix, there might be the workaround viable to add the following two annotations from the control-plane provider's side, once the time is reached that no drain/detach should be done:

machine.cluster.x-k8s.io/exclude-node-draining
machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach

enxebre · 2025-01-07T15:21:02Z

is this a control plane Node? wouldn't this scenario make any other upcoming node deletion fail to query through the remote client as well?
EDIT: let's follow up here #11590 (comment)

chrischdi · 2025-01-22T14:28:25Z

/assign @Danil-Grigorev
/triage accepted
/priority important-soon

As discussion is happening and PR is already WIP

k8s-triage-robot · 2025-04-22T14:40:26Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 17, 2024

Danil-Grigorev linked a pull request Dec 17, 2024 that will close this issue

🐛 Perform draining and volume detachment once until completion #11590

Open

Danil-Grigorev mentioned this issue Dec 18, 2024

Update CAPI support to 1.9 rancher/cluster-api-provider-rke2#525

Closed

1 task

k8s-ci-robot assigned Danil-Grigorev Jan 22, 2025

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Apr 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Machine fails to finish draining/volume detachment after successful completion #11591

Machine fails to finish draining/volume detachment after successful completion #11591

Danil-Grigorev commented Dec 17, 2024 •

edited

Loading

chrischdi commented Dec 17, 2024 •

edited

Loading

Danil-Grigorev commented Dec 18, 2024

chrischdi commented Dec 20, 2024

enxebre commented Jan 7, 2025 •

edited

Loading

chrischdi commented Jan 22, 2025

k8s-triage-robot commented Apr 22, 2025

Machine fails to finish draining/volume detachment after successful completion #11591

Machine fails to finish draining/volume detachment after successful completion #11591

Comments

Danil-Grigorev commented Dec 17, 2024 • edited Loading

What steps did you take and what happened?

What did you expect to happen?

Cluster API version

Kubernetes version

Anything else you would like to add?

Label(s) to be applied

chrischdi commented Dec 17, 2024 • edited Loading

Danil-Grigorev commented Dec 18, 2024

chrischdi commented Dec 20, 2024

enxebre commented Jan 7, 2025 • edited Loading

chrischdi commented Jan 22, 2025

k8s-triage-robot commented Apr 22, 2025

Danil-Grigorev commented Dec 17, 2024 •

edited

Loading

chrischdi commented Dec 17, 2024 •

edited

Loading

enxebre commented Jan 7, 2025 •

edited

Loading