manager: fix task scheduler infinite loop #3200

Open: corhere wants to merge 1 commit into master

corhere (Contributor) commented May 28, 2025

- What I did
- How I did it
If the running tasks for a service are not well balanced across the placement-preference tree, the task scheduler could enter an infinite loop when scaling the service up. The scheduleNTasksOnSubtree loop terminates when either all tasks have been scheduled onto nodes, or the nodes in all subtrees are out of room to accept new tasks. The trouble is that the algorithm only considers a subtree to be out of room if an attempt was made to schedule tasks onto its nodes and not all of them could be scheduled. Subtrees that already have more tasks running than the desired number for a balanced tree are skipped without any attempt to assign tasks, so they never get a chance to be marked as out of room. The scheduler therefore enters a tight infinite loop whenever some node of the placement-preference tree has at least one subtree with more tasks running than desired while all of its other subtrees are out of room for more tasks.

It would be incorrect to consider a subtree as out of room just because there are more tasks running than desired at a particular iteration of the scheduling loop. The desired number of tasks to assign changes as the scheduler iteratively schedules tasks and other subtrees run out of room, so it is possible for a subtree to become eligible in a future iteration.

Add a third condition to the task scheduler loop: the loop now also exits if no subtree is eligible for task scheduling, whether because it is out of room or because it has more tasks running than desired.
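
The three exit conditions can be illustrated with a toy model. This is a minimal sketch, not the code in this PR: `branch`, `place`, and `scheduleN` are hypothetical names, and the bookkeeping is an assumption chosen so that a pass with no eligible branch is reachable (the tree-level count `treeTasks` is a snapshot taken at the start of the call, while per-branch counts grow as tasks are placed). It makes no claim about how SwarmKit tracks these counts internally.

```go
package main

// branch is a hypothetical stand-in for one subtree of the
// placement-preference tree; it is not a SwarmKit type.
type branch struct {
	tasks int // tasks currently running in this subtree
	room  int // how many more tasks its nodes can accept
}

// place puts up to n tasks on the branch and reports how many fit.
func (b *branch) place(n int) int {
	if n > b.room {
		n = b.room
	}
	b.tasks += n
	b.room -= n
	return n
}

// scheduleN tries to spread n new tasks evenly over the branches and
// returns how many were placed.
func scheduleN(n int, branches []*branch) int {
	scheduled := 0
	// Assumption of this toy model: treeTasks is a snapshot of the tasks
	// running in usable branches when the call starts.
	treeTasks := 0
	for _, b := range branches {
		treeTasks += b.tasks
	}
	noRoom := make(map[*branch]bool)

	// Exit conditions 1 and 2: everything is scheduled, or every branch
	// has been marked as out of room.
	for scheduled != n && len(noRoom) != len(branches) {
		usable := len(branches) - len(noRoom)
		total := treeTasks + n - scheduled
		desired, remainder := total/usable, total%usable

		anyEligible := false
		for _, b := range branches {
			if noRoom[b] {
				continue
			}
			before := b.tasks
			if before < desired || (before == desired && remainder > 0) {
				anyEligible = true
				want := desired - before
				if remainder > 0 {
					want++
				}
				// Never ask for more tasks than are left to schedule.
				if left := n - scheduled; want > left {
					want = left
				}
				got := b.place(want)
				if got < want {
					// Only an attempted placement can mark a branch full.
					noRoom[b] = true
					treeTasks -= before
				} else if remainder > 0 {
					remainder--
				}
				scheduled += got
			}
		}
		// Exit condition 3: no branch was eligible in this pass, so
		// another pass would recompute the same numbers forever.
		if !anyEligible {
			break
		}
	}
	return scheduled
}
```

In this model, `scheduleN(5, ...)` over an empty branch with room for 10 tasks and an empty branch with no room places 3 tasks and then returns: on the second pass the only usable branch holds 3 tasks against a desired count of 2, so without the `anyEligible` check the loop would never exit. By contrast, `scheduleN(4, ...)` over a branch with 10 tasks and room for 5 plus an empty branch with room for 1 places all 4 tasks, because the first branch is skipped in the first pass but becomes eligible once the second runs out of room.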

- How to test it
With a new regression test. TestMultiplePreferencesScaleUp times out without the scheduler change, but passes with it.
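
A regression test for a hang is easier to work with when it fails quickly instead of waiting for the test runner's global timeout. The sketch below shows one way that could be done, assuming a hypothetical helper named `finishesWithin`; it is not part of this PR or of SwarmKit's test suite.

```go
package main

import "time"

// finishesWithin is a hypothetical helper. It runs f in its own goroutine
// and reports whether f returned before the deadline d elapsed.
func finishesWithin(d time.Duration, f func()) bool {
	done := make(chan struct{})
	go func() {
		defer close(done)
		f()
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		// f is still running; its goroutine is abandoned, which is
		// acceptable in a test process that is about to fail anyway.
		return false
	}
}
```

A test could wrap the scheduling call in `finishesWithin` with a deadline of a few seconds and fail if it returns false.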

- Description for the changelog

Fix an issue where all new tasks in the Swarm could get stuck in the PENDING state forever after scaling up a service with placement preferences.

Co-authored-by: Xinfeng Liu <[email protected]>
Signed-off-by: Cory Snider <[email protected]>
corhere force-pushed the fix-scheduler-placementpref-infinite-loop branch from 3624081 to 2d6aff7 on May 28, 2025 20:45