Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kill-host-pods.py: filter pods by node #4819

Merged
merged 1 commit into from
Jan 14, 2025
Merged

Conversation

petrutlucian94
Copy link
Contributor

@petrutlucian94 petrutlucian94 commented Jan 10, 2025

Starting with k8s 1.32, AuthorizeNodeWithSelectors is enabled by default: https://kubernetes.io/docs/reference/access-authn-authz/node/

If the rbac microk8s addon is enabled, the kube-apiserver will run with "--authorization-mode=RBAC,Node". This means that kublets (system:node:$node) will no longer be allowed to access pods that reside on other nodes.

For this reason, the "kill-host-pods.py" script is now getting access denied errors:

  Error from server (Forbidden): pods is forbidden:
  User "system:node:myhostname" cannot list resource "pods" in API group ""
  at the cluster scope: can only list/watch pods with spec.nodeName field selector

As suggested by the error message, we'll solve it by filtering pods by the node name.

Fixes: #4802

Summary

Changes

Testing

Possible Regressions

Checklist

  • Read the contributions page.
  • Submitted the CLA form, if you are a first time contributor.
  • The introduced changes are covered by unit and/or integration tests.

Notes

Starting with k8s 1.32, AuthorizeNodeWithSelectors is enabled by
default:

https://kubernetes.io/docs/reference/access-authn-authz/node/

If the rbac microk8s addon is enabled, the kube-apiserver will
run with "--authorization-mode=RBAC,Node". This means that
kublets (system:node:$node) will no longer be allowed to access
pods that reside on other nodes.

For this reason, the "kill-host-pods.py" script is now getting
access denied errors:

  Error from server (Forbidden): pods is forbidden:
  User "system:node:myhostname" cannot list resource "pods" in API group ""
  at the cluster scope: can only list/watch pods with spec.nodeName field selector

As suggested by the error message, we'll solve it by filtering
pods by the node name.

Fixes: #4802
Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, I'll run some manual tests and merge if all is well

@petrutlucian94
Copy link
Contributor Author

LGTM overall, I'll run some manual tests and merge if all is well

Thanks! fwiw, I used the following to trigger a pod cleanup:

sudo touch /var/snap/microk8s/current/var/lock/snapdata-mounts-need-reload

Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@berkayoz berkayoz merged commit 45a6abb into master Jan 14, 2025
12 checks passed
@berkayoz berkayoz deleted the lpetrut/fix-pods-restart branch January 14, 2025 09:32
@ianroberts
Copy link

I just ran into the linked issue on upgrading to 1.32 - this PR is a sensible fix, but I wasn't aware that it is necessary to manually sync /snap/microk8s/current/default-hooks over to /var/snap/microk8s/common/hooks after upgrading. If you don't do that then you end up still using the 10-pods-restart from an earlier revision, and you get the same error message.

snap/hooks/configure probably needs to be updated to copy in the new 10-pods-restart hook on upgrade if required, or else it should be documented somewhere that you have to manually sync any hook changes on every snap refresh.

@petrutlucian94
Copy link
Contributor Author

@ianroberts Thanks for bringing this up. It seems to be by design, the default-hooks are copied over only for fresh installations, probably so that it won't override custom hooks:

# Install default-hooks
if ! [ -d "${SNAP_COMMON}/hooks" ]
then
cp -r --preserve=mode ${SNAP}/default-hooks ${SNAP_COMMON}/hooks
fi

We have the following options:

  • document this
  • always copy the default hooks, even if this might override user defined hooks
    • the team suggested to only copy 10-pods-restart as part of snap/hooks/configure
      • I'll open a PR to keep the conversation going

petrutlucian94 added a commit that referenced this pull request Apr 3, 2025
Another PR fixed one of the default hooks [1], however we're not copying
over the hooks from $SNAP/default-hooks to $SNAP/hooks when
refreshing existing snap installations [2].

This is probably by design so that we don't override user defined hooks
However, it doesn't seem to be documented anywhere.

As suggested by the team, this patch will copy over the
reconcile.d/10-pods-restart hook.

Downsides:
* copying just one of the hooks seems a bit unintuitive and inconsistent
* we may override user hooks, which can be unexpected

Alternatives:
* document the fact that these hooks are not refreshed automatically
  and that users can/should copy over the default hooks
* always refresh all hooks

[1] #4819
[2] #4819 (comment)
@ianroberts
Copy link

@petrutlucian94 if someone can merge #4473 then that will at least mitigate the issue as it fixes the kill script itself rather than just the hook that calls the kill script.

@petrutlucian94
Copy link
Contributor Author

@ianroberts #4473 seems to fix a slightly different issue. Without the spec.nodeName=$(hostname) filter we aren't getting any pod at all, but instead we have the following error:

Error from server (Forbidden): pods is forbidden:
  User "system:node:myhostname" cannot list resource "pods" in API group ""
  at the cluster scope: can only list/watch pods with spec.nodeName field selector

Anyway, I've retriggered the CI jobs on your PR, it should pass now. Once we have the green light from the CI, it should be ready to merge.

petrutlucian94 added a commit that referenced this pull request Apr 7, 2025
Another PR fixed one of the default hooks [1], however we're not copying
over the hooks from $SNAP/default-hooks to $SNAP/hooks when
refreshing existing snap installations [2].

This is probably by design so that we don't override user defined hooks
However, it doesn't seem to be documented anywhere.

As suggested by the team, this patch will copy over the
reconcile.d/10-pods-restart hook.

Downsides:
* copying just one of the hooks seems a bit unintuitive and inconsistent
* we may override user hooks, which can be unexpected

Alternatives:
* document the fact that these hooks are not refreshed automatically
  and that users can/should copy over the default hooks
* always refresh all hooks

[1] #4819
[2] #4819 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kill-host-pods permission error after upgrade to 1.32
3 participants