[cmd/opampsupervisor] Add support for optional agent SIGHUP config reload #40522

douglascamata · 2025-06-06T13:12:57Z

Description

This pull request adds a configuration option for the Collector Supervisor at agent::use_hup_restart that can be used to control how the Supervisor restarts the agent.

When enabled, agent restarts will be done through sending a SIGHUP instead of stopping and restarting the process.

While running e2e tests with HUP restart I found an interesting race condition between the Supervisor and the Collector: if the Collector receives a SIGHUP before it has fully started, it might exit prematurely. This is bad because it's a scenario that could happen quite often: a whole fleet might have local + remote configuration, and when any new machine spins up (e.g. during auto-scaling) it will immediately receive a remote config. Before opening an issue on the Collector to try to figure things out over there, I decided to work around it as follows:

If the Supervisor gets a remote config, it will only send SIGHUP to the Collector if the Collector has already reported its health at least once. This seems to ensure that everything has started and is ready to handle the signal.
If the health report hasn't arrived yet, I have to wait a bit (yes, I'm also sad to write this). I didn't know what a good amount of time to wait would be, so I picked the bootstrap timeout; I'm open to better suggestions if you have any. So every 100ms I check the health, up to a maximum wait time equal to the bootstrap timeout. If no health report arrives within that time, the whole configuration application is considered failed.

I'm also open to any ideas on what to do if the agent didn't report health yet to the Supervisor.

Link to tracking issue

Fixes #40410.

Testing

I updated a bunch of e2e tests with an extra scenario, so that they are executed with and without the SIGHUP restart logic.

Documentation

None so far, but I plan to update the Supervisor specification if/when we agree on the issue I mentioned in the description section above.

Signed-off-by: Douglas Camata <[email protected]>

douglascamata · 2025-06-12T09:43:33Z

@TylerHelmuth this is ready for a review, if you have some time.

evan-bradley

Overall I like the approach here, and allowing users to take advantage of the Collector's SIGHUP feels like an improvement (though I'm not 100% sure how strong the benefits are), so thanks for taking this on.

cmd/opampsupervisor/supervisor/supervisor.go

evan-bradley

Overall I like the way this looks with the channel, thanks for making that adaptation. Just a few more questions on the implementation.

cmd/opampsupervisor/supervisor/commander/commander.go

cmd/opampsupervisor/supervisor/supervisor.go

evan-bradley

This is looking pretty good, thanks for your patience and attention to detail while dealing with these.

I like the `markAgentReady` and `resetAgentReady` methods; I think those will make it easier to switch to a broadcast model in the future should we find a need.

cmd/opampsupervisor/supervisor/supervisor.go

douglascamata · 2025-07-01T14:49:57Z

@evan-bradley all the comments are taken care of, I think this is perfect now. Thanks for the great review and for your attention to detail too. Sorry that the PR got big and that some small things were left halfway between the old and new names I used (e.g. started vs. ready).

cmd/opampsupervisor/e2e_test.go

douglascamata · 2025-07-02T10:40:27Z

@evan-bradley ready for another pass. Supervisor's HUP e2e tests are all good now.

evan-bradley

Thanks @douglascamata!

One thing I want to call out so it is written somewhere is that it's still unclear how much practical benefit SIGHUP reloading brings vs. reloading the whole process at this point, despite being conceptually cleaner. I haven't profiled it myself, but I suspect that starting pipelines comprises the majority of the Collector's startup time before it is ready to receive data, and that SIGHUP reloads may not be substantially faster. The biggest benefit I can think of is that confmap Providers with persistent connections will likely stay alive through the reload. I think we'll likely want to consider benchmarking and documenting this in the future, especially if we want to consider enabling this by default.

All that said, this seems like something we want, and the functionality required to support this is sufficiently encapsulated that I'm alright including it.

cmd/opampsupervisor/supervisor/supervisor.go

douglascamata · 2025-07-02T14:25:59Z

> One thing I want to call out so it is written somewhere is that it's still unclear how much practical benefit SIGHUP reloading brings vs. reloading the whole process at this point, despite being conceptually cleaner. I haven't profiled it myself, but I suspect that starting pipelines comprises the majority of the Collector's startup time before it is ready to receive data, and that SIGHUP reloads may not be substantially faster. The biggest benefit I can think of is that confmap Providers with persistent connections will likely stay alive through the reload. I think we'll likely want to consider benchmarking and documenting this in the future, especially if we want to consider enabling this by default.

I agree with you 200% on this statement, @evan-bradley. Currently we don't know whether it is better than the stop-start reload, and it might even turn out not to be better once a benchmark is run and we do the math.

Benchmarking, testing, and documenting are definitely necessary before turning the SIGHUP setting on by default. 👍

nenadnoveljic · 2025-07-02T17:19:52Z

cmd/opampsupervisor/supervisor/commander/commander.go

+		return errors.New("agent process is not running")
+	}
+
+	c.logger.Debug("Sending SIGHUP to agent process to reload config", zap.Int("pid", c.cmd.Process.Pid))


SIGHUP isn't supported on Windows. Did this change break e2e-tests-windows / supervisor-test?
{"error": "failed to send SIGHUP to agent process: not supported by windows"}

The PR was merged despite the error. Should changes in this module trigger e2e-tests-windows / supervisor-test?

Thanks for the notification @nenadnoveljic. Unfortunately right now we have to manually apply a label to trigger those tests, which I forgot to do this time. @douglascamata could you take a look? We will likely need to disable this feature at startup if the Supervisor is running on Windows and skip the SIGHUP tests on Windows as well.

This element can be made conditional to skip SIGHUP tests on Windows.
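
One possible shape for making the SIGHUP scenarios conditional, sketched as a plain helper so it runs outside a test binary. `reloadModes` is a hypothetical name and not part of the e2e suite; in a real test the same check on `runtime.GOOS` could instead guard a call to `t.Skip`.

```go
package main

import (
	"fmt"
	"runtime"
)

// reloadModes returns the values of a (hypothetical) "use HUP reload" flag
// that an e2e test should iterate over on the given operating system.
func reloadModes(goos string) []bool {
	modes := []bool{false} // the stop/start restart works everywhere
	if goos != "windows" {
		modes = append(modes, true) // SIGHUP reload only where the signal exists
	}
	return modes
}

func main() {
	for _, useHUP := range reloadModes(runtime.GOOS) {
		fmt.Printf("running scenario with HUP reload = %v\n", useHUP)
	}
}
```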

Will take care of this tomorrow as early as possible, it's EOD for me already. 👍

@nenadnoveljic I think your link doesn't work well. Maybe it requires certain files to be expanded for the anchor to work? Could you share your suggestion in a different way?

Never mind, I already figured out exactly what to expand to see your suggestion, @nenadnoveljic.

Fixing the tests here

Cool, that looks good to me (I'm approving it). Tomorrow I can add proper configuration validation, with tests, to ensure Windows users cannot enable this feature.

@nenadnoveljic

…sed on Windows (#41077)

#### Description

The changes in this PR add a validation to the Supervisor's agent configuration that prevents the use of the SIGHUP configuration reload feature on Windows, as this OS doesn't support that signal. This PR is a follow-up to #40522. Thank you @nenadnoveljic for pointing it out and notifying us of the Windows build failure.

#### Testing

Automated test added.

---------

Signed-off-by: Douglas Camata <[email protected]>
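
A rough sketch of the kind of validation this follow-up describes. The `agentConfig` type, its field name, and the error text are assumptions for illustration; the actual change in #41077 may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// agentConfig is a hypothetical, trimmed-down config struct; the field name
// is illustrative only.
type agentConfig struct {
	UseHUPConfigReload bool
}

// validate rejects the SIGHUP reload option on Windows. The OS is passed in
// as a parameter so the check can be exercised on any platform.
func (c agentConfig) validate(goos string) error {
	if c.UseHUPConfigReload && goos == "windows" {
		return errors.New("SIGHUP config reload is not supported on Windows")
	}
	return nil
}

func main() {
	cfg := agentConfig{UseHUPConfigReload: true}
	fmt.Println("validation error:", cfg.validate(runtime.GOOS))
}
```

Failing at configuration load time surfaces the problem before the agent is started, instead of at the first reload attempt.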

douglascamata added 4 commits June 6, 2025 12:13

[supervisor] Add support for optional agent SIGHUP config reload

6fc9d18

Signed-off-by: Douglas Camata <[email protected]>

Enable supervisor template configs to use agent hup restart

abe8566

Signed-off-by: Douglas Camata <[email protected]>

Run relevant tests agent hup restart besides normal restart

e169c76

Signed-off-by: Douglas Camata <[email protected]>

Fail hup restart is didn't get health from agent

41eb6aa

douglascamata requested review from evan-bradley, atoulme and a team as code owners June 6, 2025 13:12

github-actions bot assigned dmitryax Jun 6, 2025

github-actions bot added the cmd/opampsupervisor label Jun 6, 2025

Add changelog entry

413842e

Signed-off-by: Douglas Camata <[email protected]>

douglascamata force-pushed the supervisor-hup-collector branch from 48f2ef4 to 413842e June 6, 2025 13:17

douglascamata added 3 commits June 6, 2025 15:26

Make linter happy

ac5b900

Merge branch 'main' into supervisor-hup-collector

2534261

Put back e2e go build flag

d1f9307

evan-bradley reviewed Jun 23, 2025

cmd/opampsupervisor/supervisor/supervisor.go Outdated

douglascamata added 3 commits June 26, 2025 14:31

Remove unused fields in the Supervisor type

4dfdbae

Address PR review comments

d3bf969

Improve names (hup restart -> hup reload)

7181c06

github-actions bot requested a review from tigrannajaryan June 26, 2025 12:45

Protect access to agentStartedChan when closing it

5a82591

douglascamata requested a review from evan-bradley June 26, 2025 13:08

douglascamata added 5 commits June 27, 2025 10:21

Fix Supervisor config generation for tests

098dc4c

Signed-off-by: Douglas Camata <[email protected]>

Fix e2e test

6887fdc

Signed-off-by: Douglas Camata <[email protected]>

Merge branch 'main' of github.com:open-telemetry/opentelemetry-collec…

85cf701

…tor-contrib into supervisor-hup-collector

Put back go:build tag in supervisor e2e test

84d4c19

Signed-off-by: Douglas Camata <[email protected]>

Fix and rename Supervisor.waitForAgentStart

21cab60

The new name is `waitForAgentReady`. Other related variables were also renamed to keep it consistent. Signed-off-by: Douglas Camata <[email protected]>

evan-bradley reviewed Jun 30, 2025

cmd/opampsupervisor/supervisor/commander/commander.go

cmd/opampsupervisor/supervisor/supervisor.go Outdated

evan-bradley reviewed Jun 30, 2025

cmd/opampsupervisor/supervisor/supervisor.go Outdated

Simplify approach for HUP reload implementation

601e2f6

Using a 1:1 communication channel instead of a more complex 1:N. Signed-off-by: Douglas Camata <[email protected]>

evan-bradley reviewed Jul 1, 2025

douglascamata added 2 commits July 1, 2025 16:44

Improve comments

e72d662

Remove unnecessary for loop when selecting on channels

6e2ebf7

douglascamata added 2 commits July 1, 2025 16:50

Fix typo to make linter happy

c1dbc6e

Signed-off-by: Douglas Camata <[email protected]>

Merge branch 'main' of github.com:open-telemetry/opentelemetry-collec…

b291949

…tor-contrib into supervisor-hup-collector

evan-bradley reviewed Jul 1, 2025

cmd/opampsupervisor/e2e_test.go Outdated

douglascamata added 5 commits July 2, 2025 09:13

Passing all e2e tests

c130bc8

Signed-off-by: Douglas Camata <[email protected]>

Delegate agent start only to Supervisor.startAgent()

9dc39c1

Signed-off-by: Douglas Camata <[email protected]>

Ensure HUP e2e tests get the HUP reload config enabled

52644bc

Signed-off-by: Douglas Camata <[email protected]>

Remove commented code that got in by accident

8a160ae

Signed-off-by: Douglas Camata <[email protected]>

Move back nil health assignment to the top of new config handling

abebd1f

I thought this was required to make things work, but it isn't. Signed-off-by: Douglas Camata <[email protected]>

douglascamata requested a review from evan-bradley July 2, 2025 10:40

evan-bradley approved these changes Jul 2, 2025

cmd/opampsupervisor/supervisor/supervisor.go Outdated

evan-bradley added 2 commits July 2, 2025 10:19

Update cmd/opampsupervisor/supervisor/supervisor.go

2ac33cd

Merge branch 'main' into supervisor-hup-collector

9397d52

evan-bradley merged commit a3a6f02 into open-telemetry:main Jul 2, 2025
177 checks passed

github-actions bot added this to the next release milestone Jul 2, 2025

nenadnoveljic reviewed Jul 2, 2025

nenadnoveljic mentioned this pull request Jul 2, 2025

[chore][cmd/opampsupervisor] Don't use SIGHUP in Windows tests #41070

Merged

douglascamata mentioned this pull request Jul 3, 2025

[cmd/opampsupervisor] Validate that the HUP config reload cannot be used on Windows #41077

Merged
