-
Notifications
You must be signed in to change notification settings - Fork 2.9k
[cmd/opampsupervisor] Add support for optional agent SIGHUP config reload #40522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cmd/opampsupervisor] Add support for optional agent SIGHUP config reload #40522
Conversation
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
48f2ef4
to
413842e
Compare
@TylerHelmuth this is good for a review, if you got some time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like the approach here, and allowing users to take advantage of the Collector's SIGHUP feels like an improvement (though I'm not 100% sure how strong the benefits are), so thanks for taking this on.
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
…tor-contrib into supervisor-hup-collector
Signed-off-by: Douglas Camata <[email protected]>
The new name is `waitForAgentReady`. Other related variables were also renamed to keep it consistent. Signed-off-by: Douglas Camata <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like the way this looks with the channel, thanks for making that adaptation. Just a few more questions on the implementation.
Using a 1:1 communication channel instead of a more complex 1:N. Signed-off-by: Douglas Camata <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good, thanks for your patience and attention to detail while dealing with these.
I like the markAgentReady
and resetAgentReady
methods, I think those will make it easier to switch to a broadcast model in the future should we find a need.
@evan-bradley all the comments taken care of, I think this is perfect now. Thanks for the great review and for your attention to detail too. Sorry that the PR got big and some small things were a bit halfway through some old vs new names I used (i.e. |
Signed-off-by: Douglas Camata <[email protected]>
…tor-contrib into supervisor-hup-collector
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
I thought this was required to make things work, but it isn't. Signed-off-by: Douglas Camata <[email protected]>
@evan-bradley ready for another pass. Supervisor's HUP e2e tests are all good now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @douglascamata!
One thing I want to call out so it is written somewhere is that it's still unclear how much practical benefit SIGHUP reloading brings vs. reloading the whole process at this point, despite being conceptually cleaner. I haven't profiled it myself, but I suspect that starting pipelines comprises the majority of the Collector's startup time before it is ready to receive data, and that SIGHUP reloads may not be substantially faster. The biggest benefit I can think of is that confmap Providers with persistent connections will likely stay alive through the reload. I think we'll likely want to consider benchmarking and documenting this in the future, especially if we want to consider enabling this by default.
All that said, this seems like something we want, and the functionality required to support this is sufficiently encapsulated that I'm alright including it.
I agree with you 200% on this statement, @evan-bradley. Currently we don't know if it is better than the stop-start reload AND it might even not be effectively better when a benchmark test is done and we do the math. Benchmarking, testing, and documenting are definitely necessary before making the SIGUP setting on by default. 👍 |
return errors.New("agent process is not running") | ||
} | ||
|
||
c.logger.Debug("Sending SIGHUP to agent process to reload config", zap.Int("pid", c.cmd.Process.Pid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SIGHUP
isn't supported on Windows. Did this change break e2e-tests-windows / supervisor-test)(?
{"error": "failed to send SIGHUP to agent process: not supported by windows"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR was merged despite the error. Should changes in this module trigger e2e-tests-windows / supervisor-test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the notification @nenadnoveljic. Unfortunately right now we have to manually apply a label to trigger those tests, which I forgot to do this time. @douglascamata could you take a look? We will likely need to disable this feature at startup if the Supervisor is running on Windows and skip the SIGHUP tests on Windows as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This element can be made conditional to skip SIGHUP
tests on Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will take care of this tomorrow as early as possible, it's EOD for me already. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nenadnoveljic I think your link doesn't work well. Maybe it requires certain files to be expanded for the anchor to work? Could you share your suggestion in a different way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I figured exactly what to expand to see your suggestion already, @nenadnoveljic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing the tests here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, that looks good to me (I'm approving it). Tomorrow I can add proper configuration validation, with tests, to ensure Windows users cannot enable this feature.
…sed on Windows (#41077) <!--Ex. Fixing a bug - Describe the bug and how this fixes the issue. Ex. Adding a feature - Explain what this achieves.--> #### Description The changes in this PR add a validation in the Supervisor's agent configuration that prevents the usage of the SIGHUP configuration reload feature in Windows, as this OS doesn't support such signal. This PR is a follow up to #40522. Thank you @nenadnoveljic for point it out and notifying us of the Windows build failure. <!--Describe what testing was performed and which tests were added.--> #### Testing Automated test added. --------- Signed-off-by: Douglas Camata <[email protected]>
Description
This pull request adds a configuration option for the Collector Supervisor at
agent::use_hup_restart
that can be used to control how the Supervisor restarts the agent.When enabled, agent restarts will be done through sending a
SIGHUP
instead of stopping and restarting the process.While running e2e tests with HUP restart I found an interesting race condition between the Supervisor and the Collector: if the Collector receives a SIGHUP while it's not fully started yet, it might exit prematurely. This is bad because it's a scenario that could happen quite often: a whole fleet might have local + remote configuration, then when any new machine spins up (i.e. auto-scaling) it will immediately receive a remote config. Before opening an issue on the Collector to try to figure things out over there I decided to solve it doing the following:
I'm also open to any ideas on what to do if the agent didn't report health yet to the Supervisor.
Link to tracking issue
Fixes #40410.
Testing
I updated a bunch of e2e tests with an extra scenario, so that they are executed with and without the SIGHUP restart logic.
Documentation
None so far, but I plan to update the Supervisor specification here if/when we agree regarding the issue I mentioned above on the description section.