Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NTH queue handler behaviour broken by PR #940 #990

Closed
stevehipwell opened this issue Apr 16, 2024 · 8 comments · Fixed by #999
Closed

NTH queue handler behaviour broken by PR #940 #990

stevehipwell opened this issue Apr 16, 2024 · 8 comments · Fixed by #999
Assignees

Comments

@stevehipwell
Copy link
Contributor

Describe the bug
PR #940 removed the continue lifecycle action behaviour for ASGs which breaks the expected NTH behaviour; which is to signal to the ASG that the node can be deleted once it's drained.

If the PR author wanted to disable this functionality for their use case they should have provided configuration to do so, and only changed the default behaviour as part of a major version (if even wanted).

Steps to reproduce
Replace an ASG instance with instance refresh, the instance will not be terminated until the lifecycle hook times out.

Expected outcome
I'd expect the instance to be terminated by the ASG once NTH has completed draining it.

Application Logs
The log output when experiencing the issue.

Environment

  • NTH App Version: v1.21.0
  • NTH Mode (IMDS/Queue processor): Queue processor
  • OS/Arch: n/a
  • Kubernetes version: n/a
  • Installation method: Helm
@stevehipwell
Copy link
Contributor Author

CC @cjerad @bwagner5

@der-eismann
Copy link
Contributor

Can @GavinBurris42 say if this change was intended or a mistake?

@stevehipwell
Copy link
Contributor Author

Could someone from @aws/ec2-guacamole please respond to this. This issue has a significant cost overhead where nodes which should be terminated continue to run for the whole grace period (as grace period is cluster scoped this has to be the max permitted grace period).

@LikithaVemulapalli
Copy link
Contributor

Hello @stevehipwell, apologies for the late response and thank you for noticing the broken NTH behavior. This issue is considered as priority and will take measures to mitigate this ASAP. Thank you.

@stevehipwell
Copy link
Contributor Author

@LikithaVemulapalli I've opened PR #999 to solve this issue.

It looks like the problem the original PR author was attempting to solve could be handled better by setting the CompleteLifecycleActionDelaySeconds value, which would add a delay to all terminations but is user controlled rather than breaking NTH for everyone using it for other patterns (such as instance refresh).

@LikithaVemulapalli
Copy link
Contributor

Hello @stevehipwell, thanks for opening a PR to resolve this issue, yes our team had a discussion and identified the root cause with the post drain task that was modified as a part of latest changes. Appreciate the effort, will review the PR that you opened. Thank you.

@LikithaVemulapalli LikithaVemulapalli self-assigned this May 9, 2024
@LikithaVemulapalli LikithaVemulapalli added Pending-Release Pending an NTH or eks-charts release and removed Pending-Release Pending an NTH or eks-charts release labels May 10, 2024
@stevehipwell
Copy link
Contributor Author

@LikithaVemulapalli the build has failed so the Helm chart hasn't been released for the NTH release.

@LikithaVemulapalli
Copy link
Contributor

Hello @stevehipwell, yes the build failed adding windows binaries to assets, currently taking a look at that, we are aware of this issue, we are trying to fix it, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants