
Consider implementing pod (anti)affinity checks in NodeFit #1279

Closed
logyball opened this issue Oct 30, 2023 · 7 comments · Fixed by #1356
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

logyball commented Oct 30, 2023

Is your feature request related to a problem? Please describe.
Hello, we have been running into a situation that was somewhat surprising until I dug into the code and learned that NodeFit does not take into account all scheduler predicates, notably pod (anti)affinity. At our organization, we've had multiple occasions where using the descheduler HighNodeUtilization profile + GKE's optimize-utilization autoscaling profile has bitten us, because we assumed that NodeFit and the kube-scheduler would agree.

The basic situation is:

  • Node A contains Pod X
  • Node B contains Pod Y
  • Pod X and Pod Y have a pod anti-affinity rule that prevents them from being scheduled on the same node (see the sketch after this list)
  • Node A is considered "underutilized" by the descheduler
  • Node B is not considered "underutilized"
  • NodeFit algorithm is run and determines that Pod X can schedule on Node B
  • descheduler evicts Pod X
  • kube-scheduler determines Pod X cannot schedule on Node B because of Pod X/Y's anti-affinity
  • kube-scheduler schedules Pod X on Node A
  • The cycle continues with each loop of the descheduler
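
For concreteness, here is a minimal sketch (not from the original report; the label key/value and helper name are hypothetical) of the kind of required anti-affinity rule this scenario assumes on Pod X and Pod Y, expressed with the Kubernetes Go API types: both pods carry the label app=example and refuse to share a node with any other pod carrying that label.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleAntiAffinity builds the hard (required) anti-affinity rule assumed above:
// no two pods labeled app=example may land on the same node (kubernetes.io/hostname).
func exampleAntiAffinity() *v1.Affinity {
	return &v1.Affinity{
		PodAntiAffinity: &v1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": "example"},
					},
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
}

func main() {
	// Pod X and Pod Y would both set Spec.Affinity to this value (and carry the app=example label).
	fmt.Println(exampleAntiAffinity().PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution[0].TopologyKey)
}

Because the rule is a required (hard) constraint, the kube-scheduler will never co-locate the two pods, while NodeFit currently ignores it, which is what produces the evict/reschedule loop above.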

Describe the solution you'd like
I suggest that we look into adding pod (anti)affinity checks to the NodeFit calculation, and I would be happy to look into contributing it. It probably adds a decent amount of complexity, but it would likely make the descheduler's decisions more closely resemble those of the kube-scheduler. What do you think?

Describe alternatives you've considered

  • Do nothing: perhaps call out in the documentation that the NodeFit arg does not take into account pod (anti)affinity

What version of descheduler are you using?

descheduler version: v0.27.x

logyball commented Oct 30, 2023

Also, as a bit of follow-up: after digging into how the scheduler implements these checks, I'm finding them hard to parse. I'd love some guidance here if this is an avenue we'd like to pursue. It looks like in the past (last seen in v1.9.11 of k8s) there was a predicate checker that we could probably have used to good effect; the ecosystem now looks a bit more complex.

@gyuvaraj10

We have also run into this issue; in our case the pod anti-affinity is a preferred (soft) rule. We noticed that after the pod eviction the node still remained, and the pod was scheduled back onto the same node because the cluster autoscaler did not scale the node down immediately.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2024
@logyball

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2024

nikimanoledaki commented Feb 12, 2024

Hi! 👋 I am interested in picking up this issue and contributing a PR to add this feature.

In the scenario outlined above by @logyball, NodeFit() runs and determines that Pod X can be scheduled on Node B when it should not be. The happy path would be for NodeFit() to determine that Pod X cannot be scheduled on Node B.

@a7i mentioned here that this could be implemented as a patch.

The solution would be to add a predicate to NodeFit() so that it considers pod anti-affinity. Something like this:

func NodeFit(nodeIndexer podutil.GetPodsAssignedToNodeFunc, pod *v1.Pod, node *v1.Node) []error {

	// ... existing NodeFit checks ...

	// Check if the pod matches a pod anti-affinity rule of another pod on the node
	if ok, err := utils.PodMatchesAntiAffinityRule(pod, node); err != nil { // helper to be added to pkg/utils/predicates.go
		errors = append(errors, err)
	} else if ok { // ok means an anti-affinity rule would be violated
		errors = append(errors, fmt.Errorf("pod violates the pod anti-affinity rule of another pod on the node"))
	}

	return errors
}

Implementation detail: refactor the RemovePodsViolatingInterPodAntiAffinity plugin, which already contains logic for fetching and comparing pods' anti-affinity rules in checkPodsWithAntiAffinityExist. That logic could be moved to pkg/utils/predicates.go and reused for NodeFit().
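
As a rough illustration only: the function name, signature, and placement below are assumptions, not the merged implementation. The sketch takes the node's pods directly (which NodeFit could fetch via its nodeIndexer) rather than the node itself, and it only compares required-term label selectors, ignoring namespace and topology-key handling for brevity.

package utils

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// PodMatchesAntiAffinityRule reports whether scheduling candidate onto the node
// that currently hosts nodePods would violate a required inter-pod anti-affinity
// term declared by any of those pods, or by the candidate against them.
func PodMatchesAntiAffinityRule(candidate *v1.Pod, nodePods []*v1.Pod) (bool, error) {
	// check reports whether owner's required anti-affinity terms select other's labels.
	check := func(owner, other *v1.Pod) (bool, error) {
		if owner.Spec.Affinity == nil || owner.Spec.Affinity.PodAntiAffinity == nil {
			return false, nil
		}
		for _, term := range owner.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution {
			selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
			if err != nil {
				return false, err
			}
			// Both pods would share the node being evaluated, so any matching
			// required term counts as a violation here (topology keys ignored).
			if selector.Matches(labels.Set(other.Labels)) {
				return true, nil
			}
		}
		return false, nil
	}

	for _, existing := range nodePods {
		// A required anti-affinity rule can be declared on either pod.
		if match, err := check(existing, candidate); err != nil || match {
			return match, err
		}
		if match, err := check(candidate, existing); err != nil || match {
			return match, err
		}
	}
	return false, nil
}

The real descheduler code would presumably also need to filter by namespaces and honor each term's topology key, as checkPodsWithAntiAffinityExist already does.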

Happy to assign myself to this issue and get started when I get the green light from SIG Scheduling/descheduler maintainers :)

@logyball

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 12, 2024
@nikimanoledaki

/assign
