[cmd/opampsupervisor] Supervisor reports last collector STDERR message #39954

Open: dpaasman00 wants to merge 4 commits into main from supervisor-reports-last-collector-stderr

dpaasman00
Contributor

Description

If the supervisor receives a "bad" remote config (one with which the collector cannot start, or fails shortly after starting) and starts the collector with it, the supervisor reports a "Failed" RemoteConfigStatus along with an error message. This message is usually either "Config apply timeout exceeded" or "Agent process PID=1234 exited unexpectedly, exit code=1. Will restart in a bit...".

Neither message says why the collector failed, so finding the root cause requires retrieving the collector's log. When those logs aren't accessible, debugging becomes very difficult, if not impossible.

This PR changes how the collector process is run so that the supervisor can keep track of the last message the collector writes to STDERR. Whenever the collector process fails, the supervisor includes this last error message in its description of the issue.

For example, if the failure is an unrecognized component in the config, this is the error reported to the OpAMP server:

"Config apply timeout exceeded: \nerror decoding 'exporters': unknown type: \"doesntexist\" for id: \"doesntexist\" (valid values: [file opensearch rabbitmq sapm signalfx splunk_hec nop alertmanager alibabacloud_logservice datadog elasticsearch googlecloud googlecloudpubsub sumologic azureblob influxdb sentry syslog zipkin otlphttp dataset stef debug awss3 awsxray azuredataexplorer honeycombmarker kafka logzio opencensus awscloudwatchlogs awsemf azuremonitor bmchelix loki mezmo prometheus pulsar carbon clickhouse tencentcloud_logservice otlp awskinesis doris googlemanagedprometheus loadbalancing logicmonitor otelarrow prometheusremotewrite cassandra coralogix])"

Testing

E2E test for restarting after a bad config is updated to check for an error message.

Documentation

@dpaasman00 dpaasman00 force-pushed the supervisor-reports-last-collector-stderr branch from 8df8cc6 to 6e51851 Compare June 4, 2025 15:35
@dpaasman00 dpaasman00 marked this pull request as ready for review June 4, 2025 15:35
@@ -79,11 +83,20 @@ func (c *Commander) Start(ctx context.Context) error {
c.cmd.Env = common.EnvVarMapToEnvMapSlice(c.cfg.Env)
c.cmd.SysProcAttr = sysProcAttrs()

// PassthroughLogging changes how collector start up happens
// grab cmd pipes
Contributor

I think it would be nice to have a comment explaining why we need to do this.

6 participants