-
Notifications
You must be signed in to change notification settings - Fork 622
Fix an edge case that causes container dependency deadlock #2734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// 1. Container's known status is earlier than running; | ||
// 2. Container's desired status is stopped; | ||
// 3. Container is not in the middle a transition (indicated by applied status is none status). | ||
func (c *Container) HasNotAndWillNotStart() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use the unsafe versions of the variables and add a read lock here, since as of now this comparison is not atomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense. updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good find.
Scenario: two containers container A and container B, container A is a non-essential container, container B has a dependency on container A. Workflow leading to deadlock: 1. We start to pull container A's mage; 2. We fail to pull container A's image; 3. We try to create container A anyway (this is intended as we want to try cached image if possible); 4. We fail to create the container A due to missing image; 5. We try to stop container A; 6. We can't stop container A because when container B has a dependency on A, we implicitly have a reversed dependency on shutdown, such that container A won't stop before container B stops; 7. However we can't move forward container B either, because it's waiting on container A. This ends up in a deadlock. Fix: return an error when checking container B's dependency resolution when it sees that container A will never start.
Summary
Fix an edge case that can cause container dependency deadlock and task stuck in pending.
Scenario
Scenario: two containers container A and container B, container A is a non-essential container, container B has a dependency on container A, and we fail to pull container A's image.
Workflow leading to deadlock
Options to fix
On a high level, I see three ways to address this issue:
For option 1, it's somewhat tricky because the shutdown dependency is not modeled in container's dependsOn field. Instead it's implicit and is checked specially (
amazon-ecs-agent/agent/engine/dependencygraph/graph.go
Line 456 in 953af90
For option 3, this has a much larger impact due to it not limited to task using container dependency. I'm not fully certain what other consequence there would be if we skip a transition while we used to have.
Therefore, option 2 is chosen.
Actual fix
As mentioned above, option 2 is chosen. Take the above scenario as example, this is done by returning an error when checking container B's dependency resolution when it sees that container A has not started and will never start.
Implementation details
Add a helper function for container to indicate whether it's in a state that it has not started and will not start in the future. In dependency check, return an error if the dependency container has entered such state.
Testing
Added unit test. Manually tested with the following task def that triggers the scenario mentioned in summary:
(container_1 intentionally has an invalid image to fail the pull)
Before the change, the above task def will stuck in pending. With the change, it stops with "Task failed to start" and the pull image failure can be seen in container_1's status.
I will add a functional test separately.
New tests cover the changes: yes
Description for the changelog
Bug - Fix an edge case that can cause container dependency deadlock
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.