Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using new otlpfile pipeline for self metrics #1922

Merged
merged 30 commits into from
Apr 9, 2025
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
843a16f
Using new otlpfile pipeline for self metrics
rafaelwestphal Apr 1, 2025
3c459b2
Remove Diagnostic agent test
rafaelwestphal Apr 1, 2025
7a90e76
Fixing windows implmenetation
rafaelwestphal Apr 1, 2025
6b5aedb
Fixing windows config path and UAP implementation
rafaelwestphal Apr 1, 2025
cf393d6
Fixing test
rafaelwestphal Apr 1, 2025
9ccd96c
Increasing wait time for feature tracking metrics to be written
rafaelwestphal Apr 2, 2025
4cf50d9
Addressing code review
rafaelwestphal Apr 2, 2025
ed7ee16
Updating golden
rafaelwestphal Apr 2, 2025
3f788ef
Reverting sleep to 2 mins
rafaelwestphal Apr 2, 2025
730e968
Merge branch 'master' into westphalrafael-json-otlp-receiver
rafaelwestphal Apr 2, 2025
0fff72d
Giving more time for windows to boot and write feature tracking
rafaelwestphal Apr 3, 2025
edfeac6
Adding initial window metrics
rafaelwestphal Apr 3, 2025
238c4a6
Revert last commit
rafaelwestphal Apr 3, 2025
88e0c05
Merge branch 'master' into westphalrafael-json-otlp-receiver
rafaelwestphal Apr 3, 2025
2b0476b
Removing diagnostics from windows
rafaelwestphal Apr 3, 2025
5657c66
Revert "Giving more time for windows to boot and write feature tracking"
rafaelwestphal Apr 3, 2025
5efb332
Fix integration test
rafaelwestphal Apr 4, 2025
e440609
Merge branch 'master' into westphalrafael-json-otlp-receiver
rafaelwestphal Apr 4, 2025
7cae18f
actually generate the json files in windows plugin
XuechunHou Apr 4, 2025
3365aac
Fixing tests failure
rafaelwestphal Apr 7, 2025
77c1cd2
Fix typpo
rafaelwestphal Apr 7, 2025
762a64e
Removing binary from packing
rafaelwestphal Apr 7, 2025
1ae2cbc
remove diagnostics service reference in goospec
XuechunHou Apr 8, 2025
3adc7d9
explicitly delete the diagnostics windows service
XuechunHou Apr 8, 2025
f09c2a9
log install/uninstall errors to windows event log
XuechunHou Apr 8, 2025
07c68bc
delete the diagnostics windows Service during install
XuechunHou Apr 8, 2025
71d134e
Removing sleep
rafaelwestphal Apr 8, 2025
f94eff6
Reverting some windows changes
rafaelwestphal Apr 8, 2025
256bad3
Cleaing up window simplementation
rafaelwestphal Apr 8, 2025
a26f1ec
Address code review
rafaelwestphal Apr 8, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
10 changes: 2 additions & 8 deletions cmd/google_cloud_ops_agent_diagnostics/main_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -24,20 +24,19 @@ import (
"syscall"

"github.com/GoogleCloudPlatform/ops-agent/cmd/google_cloud_ops_agent_diagnostics/utils"
"github.com/GoogleCloudPlatform/ops-agent/internal/self_metrics"
)

var (
config = flag.String("config", "/etc/google-cloud-ops-agent/config.yaml", "path to the user specified agent config")
)

func run(ctx context.Context) error {
userUc, mergedUc, err := utils.GetUserAndMergedConfigs(ctx, *config)
_, _, err := utils.GetUserAndMergedConfigs(ctx, *config)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can just return an err now. We don't ever use the first 2 return values I don't think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all will go away on the next PR that removes the diagnostics binary. I'm trying to change the least i can and maintain functionality to make it easier to review.

if err != nil {
return err
}

ctx, cancel := context.WithCancel(ctx)
_, cancel := context.WithCancel(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this anymore then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all will go away on the next PR that removes the diagnostics binary. I'm trying to change the least i can and maintain functionality to make it easier to review.

defer cancel()

go func() {
Expand All @@ -54,10 +53,5 @@ func run(ctx context.Context) error {
}
}()

err = self_metrics.CollectOpsAgentSelfMetrics(ctx, userUc, mergedUc)
if err != nil {
return err
}

return nil
}
9 changes: 1 addition & 8 deletions cmd/google_cloud_ops_agent_diagnostics/main_windows.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@ import (
"os"

"github.com/GoogleCloudPlatform/ops-agent/cmd/google_cloud_ops_agent_diagnostics/utils"
"github.com/GoogleCloudPlatform/ops-agent/internal/self_metrics"
"go.opentelemetry.io/otel"
"golang.org/x/sys/windows/svc"
"golang.org/x/sys/windows/svc/debug"
Expand Down Expand Up @@ -81,7 +80,7 @@ func (s *service) Execute(args []string, r <-chan svc.ChangeRequest, changes cha
return false, ERROR_INVALID_PARAMETER
}

userUc, mergedUc, err := utils.GetUserAndMergedConfigs(ctx, s.userConf)
_, _, err := utils.GetUserAndMergedConfigs(ctx, s.userConf)
if err != nil {
s.log.Error(DiagnosticsEventID, fmt.Sprintf("failed to obtain unified configuration: %v", err))
return false, ERROR_FILE_NOT_FOUND
Expand Down Expand Up @@ -113,12 +112,6 @@ func (s *service) Execute(args []string, r <-chan svc.ChangeRequest, changes cha
// Set otel error handler
otel.SetErrorHandler(s)

err = self_metrics.CollectOpsAgentSelfMetrics(ctx, userUc, mergedUc)
if err != nil {
s.log.Error(DiagnosticsEventID, fmt.Sprintf("failed to collect ops agent self metrics: %v", err))
return false, ERROR_INVALID_DATA
}

return false, ERROR_SUCCESS
}

Expand Down
7 changes: 0 additions & 7 deletions cmd/ops_agent_uap_plugin/service_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -185,13 +185,6 @@ func runSubagents(ctx context.Context, cancel context.CancelFunc, pluginInstallD
})

var wg sync.WaitGroup
// Starting the diagnostics service
runDiagnosticsCmd := exec.CommandContext(ctx,
path.Join(pluginInstallDirectory, DiagnosticsBinary),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also remove the DiagnosticsBinary const declaration on L41 of this file:

DiagnosticsBinary = "libexec/google_cloud_ops_agent_diagnostics"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"-config", OpsAgentConfigLocationLinux,
)
wg.Add(1)
go runSubAgentCommand(ctx, cancel, runDiagnosticsCmd, runCommand, &wg)

// Starting Otel
runOtelCmd := exec.CommandContext(ctx,
Expand Down
39 changes: 2 additions & 37 deletions cmd/ops_agent_uap_plugin/service_windows.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,13 +29,10 @@ import (
"unsafe"

"github.com/GoogleCloudPlatform/ops-agent/apps"
"github.com/GoogleCloudPlatform/ops-agent/cmd/google_cloud_ops_agent_diagnostics/utils"
"github.com/GoogleCloudPlatform/ops-agent/confgenerator"
"github.com/GoogleCloudPlatform/ops-agent/internal/healthchecks"
"github.com/GoogleCloudPlatform/ops-agent/internal/logs"
"github.com/GoogleCloudPlatform/ops-agent/internal/self_metrics"
"github.com/kardianos/osext"
"go.opentelemetry.io/otel"
"golang.org/x/sys/windows"
"golang.org/x/sys/windows/svc/debug"
"golang.org/x/sys/windows/svc/eventlog"
Expand Down Expand Up @@ -160,8 +157,7 @@ func (ps *OpsAgentPluginServer) Start(ctx context.Context, msg *pb.StartRequest)
windowsEventLogger.Close()
}

otelErrorHandler := &otelErrorHandler{windowsEventLogger: windowsEventLogger, windowsEventId: OpsAgentUAPPluginEventID}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

go runSubagents(pContext, cancelFunc, pluginInstallDir, pluginStateDir, runSubAgentCommand, ps.runCommand, otelErrorHandler)
go runSubagents(pContext, cancelFunc, pluginInstallDir, pluginStateDir, runSubAgentCommand, ps.runCommand)

return &pb.StartResponse{}, nil
}
Expand Down Expand Up @@ -302,11 +298,6 @@ func generateSubAgentConfigs(ctx context.Context, userConfigPath string, pluginS
windowsEventLogger.Info(OpsAgentUAPPluginEventID, fmt.Sprintf("Built-in config:\n%s\n", apps.BuiltInConfStructs["windows"]))
windowsEventLogger.Info(OpsAgentUAPPluginEventID, fmt.Sprintf("Merged config:\n%s\n", uc))

// The generated otlp metric json files are used only by the otel service.
if err = self_metrics.GenerateOpsAgentSelfMetricsOTLPJSON(ctx, userConfigPath, filepath.Join(pluginStateDir, GeneratedConfigsOutDir, "otel")); err != nil {
return err
}

for _, subagent := range []string{
"otel",
"fluentbit",
Expand Down Expand Up @@ -378,14 +369,9 @@ func createWindowsJobHandle() (windows.Handle, error) {
//
// cancel: the cancel function for the parent context. By calling this function, the parent context is canceled,
// and GetStatus() returns a non-healthy status, signaling UAP to re-trigger Start().
//
// otelErrorHandler: an implementation of otel.ErrorHandler that is used in the diagnostics service to log otel errors to the Windows event log.
func runSubagents(ctx context.Context, cancel context.CancelFunc, pluginInstallDirectory string, pluginStateDirectory string, runSubAgentCommand RunSubAgentCommandFunc, runCommand RunCommandFunc, otelErrorHandler otel.ErrorHandler) {
func runSubagents(ctx context.Context, cancel context.CancelFunc, pluginInstallDirectory string, pluginStateDirectory string, runSubAgentCommand RunSubAgentCommandFunc, runCommand RunCommandFunc) {

var wg sync.WaitGroup
// Starting the diagnostics service
wg.Add(1)
go runDiagnosticsService(ctx, cancel, otelErrorHandler, &wg)

// Starting Otel
runOtelCmd := exec.CommandContext(ctx,
Expand All @@ -411,27 +397,6 @@ func runSubagents(ctx context.Context, cancel context.CancelFunc, pluginInstallD
wg.Wait()
}

func runDiagnosticsService(ctx context.Context, cancel context.CancelFunc, otelErrorHandler otel.ErrorHandler, wg *sync.WaitGroup) {
defer wg.Done()

userUc, mergedUc, err := utils.GetUserAndMergedConfigs(ctx, OpsAgentConfigLocationWindows)
if err != nil {
log.Printf("Failed to run the diagnostics service: %v", err)
cancel()
return
}

// Set otel error handler
otel.SetErrorHandler(otelErrorHandler)

err = self_metrics.CollectOpsAgentSelfMetrics(ctx, userUc, mergedUc)
if err != nil {
log.Printf("Failed to run the diagnostics service: %v", err)
cancel()
return
}
}

func runCommand(cmd *exec.Cmd) (string, error) {
if cmd == nil {
return "", nil
Expand Down
27 changes: 27 additions & 0 deletions confgenerator/agentmetrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,10 @@
package confgenerator

import (
"context"
"fmt"
"path/filepath"
"time"

"github.com/GoogleCloudPlatform/ops-agent/confgenerator/otel"
"github.com/GoogleCloudPlatform/ops-agent/confgenerator/otel/ottl"
Expand Down Expand Up @@ -166,4 +169,28 @@ func (r AgentSelfMetrics) LoggingSubmodulePipeline() otel.ReceiverPipeline {
}
}

func EnabledReceiversFeatureTrackingMetricsPipeline(ctx context.Context, outDir string) otel.ReceiverPipeline {
receiver_config := map[string]any{
"include": []string{
filepath.Join(outDir, "enabled_receivers_otlp.json"),
filepath.Join(outDir, "feature_tracking_otlp.json")},
"replay_file": true,
"poll_interval": time.Duration(60 * time.Second).String(),
}
return otel.ReceiverPipeline{
Receiver: otel.Component{
Type: "otlpjsonfile",
Config: receiver_config,
},
ExporterTypes: map[string]otel.ExporterType{
"metrics": otel.System,
},
Processors: map[string][]otel.Component{
"metrics": {
otel.Transform("metric", "datapoint", []ottl.Statement{"set(time, Now())"}),
},
},
}
}

// intentionally not registered as a component because this is not created by users
8 changes: 7 additions & 1 deletion confgenerator/confgenerator.go
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ func (uc *UnifiedConfig) getOTelLogLevel() string {
return logLevel
}

func (uc *UnifiedConfig) GenerateOtelConfig(ctx context.Context) (string, error) {
func (uc *UnifiedConfig) GenerateOtelConfig(ctx context.Context, outDir string) (string, error) {
p := platform.FromContext(ctx)
userAgent, _ := p.UserAgent("Google-Cloud-Ops-Agent-Metrics")
metricVersionLabel, _ := p.VersionLabel("google-cloud-ops-agent-metrics")
Expand All @@ -100,6 +100,12 @@ func (uc *UnifiedConfig) GenerateOtelConfig(ctx context.Context) (string, error)
ReceiverPipelineName: "otel",
}

receiverPipelines["enabled_receivers_feature_tracking"] = EnabledReceiversFeatureTrackingMetricsPipeline(ctx, outDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest a more descriptive name for the pipeline instead of enabled_receivers_feature_tracking. A suggestion is otlp_agent_self_metrics.

Another alternative is to call the pipeline ops_agent since these metrics are Ops Agent self metrics. This following the naming of the others otel and fluentbit which represent self metrics about subagents.

pipelines["enabled_receivers_feature_tracking"] = otel.Pipeline{
Type: "metrics",
ReceiverPipelineName: "enabled_receivers_feature_tracking",
}

receiverPipelines["fluentbit"] = AgentSelfMetrics{
Version: loggingVersionLabel,
Port: fluentbit.MetricsPort,
Expand Down
2 changes: 1 addition & 1 deletion confgenerator/confgenerator_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -279,7 +279,7 @@ func generateConfigs(pc platformConfig, testDir string) (got map[string]string,
}

// Otel configs
otelGeneratedConfig, err := mergedUc.GenerateOtelConfig(ctx)
otelGeneratedConfig, err := mergedUc.GenerateOtelConfig(ctx, "")
if err != nil {
return
}
Expand Down
2 changes: 1 addition & 1 deletion confgenerator/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -1154,7 +1154,7 @@ func (uc *UnifiedConfig) OTelLoggingSupported(ctx context.Context) bool {
ucLoggingCopy.Logging.Service = &LoggingService{}
}
ucLoggingCopy.Logging.Service.OTelLogging = true
_, err = ucLoggingCopy.GenerateOtelConfig(ctx)
_, err = ucLoggingCopy.GenerateOtelConfig(ctx, "")
return err == nil
}

Expand Down
2 changes: 1 addition & 1 deletion confgenerator/files.go
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ func (uc *UnifiedConfig) GenerateFilesFromConfig(ctx context.Context, service, l
}
}
case "otel":
otelConfig, err := uc.GenerateOtelConfig(ctx)
otelConfig, err := uc.GenerateOtelConfig(ctx, outDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] In the fluent bit case, the Generate Command just returns the files and then we write in this function. For OTel though, it looks like we write the file in side this call. Lets make the behaviour consistent between the two?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do on a follow up

Copy link
Contributor

@franciscovalentecastro franciscovalentecastro Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't write the otel generated configs inside GenerateOtelConfig, they are written in here the same way as the fluent-bit case (see the next 10 lines).

In the PR, @rafaelwestphal is passing outDir to know where did the *_oltp.json files were written too.

if err != nil {
return fmt.Errorf("can't parse configuration: %w", err)
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -448,6 +448,12 @@ processors:
resourcedetection/_global_0:
detectors:
- gcp
transform/enabled_receivers_feature_tracking_0:
error_mode: ignore
metric_statements:
- context: datapoint
statements:
- set(time, Now())
transform/otel_1:
error_mode: ignore
metric_statements:
Expand All @@ -471,6 +477,12 @@ receivers:
processes: {}
nvml/hostmetrics_1:
collection_interval: 60s
otlpjsonfile/enabled_receivers_feature_tracking:
include:
- enabled_receivers_otlp.json
- feature_tracking_otlp.json
poll_interval: 1m0s
replay_file: true
prometheus/fluentbit:
config:
scrape_configs:
Expand Down Expand Up @@ -508,6 +520,14 @@ service:
- resourcedetection/_global_0
receivers:
- nvml/hostmetrics_1
metrics/enabled_receivers_feature_tracking:
exporters:
- googlecloud
processors:
- transform/enabled_receivers_feature_tracking_0
- resourcedetection/_global_0
receivers:
- otlpjsonfile/enabled_receivers_feature_tracking
metrics/fluentbit:
exporters:
- googlecloud
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -424,6 +424,12 @@ processors:
resourcedetection/_global_0:
detectors:
- gcp
transform/enabled_receivers_feature_tracking_0:
error_mode: ignore
metric_statements:
- context: datapoint
statements:
- set(time, Now())
transform/otel_1:
error_mode: ignore
metric_statements:
Expand All @@ -445,6 +451,12 @@ receivers:
mute_process_exe_error: true
mute_process_name_error: true
processes: {}
otlpjsonfile/enabled_receivers_feature_tracking:
include:
- enabled_receivers_otlp.json
- feature_tracking_otlp.json
poll_interval: 1m0s
replay_file: true
prometheus/fluentbit:
config:
scrape_configs:
Expand Down Expand Up @@ -474,6 +486,14 @@ service:
- resourcedetection/_global_0
receivers:
- hostmetrics/hostmetrics
metrics/enabled_receivers_feature_tracking:
exporters:
- googlecloud
processors:
- transform/enabled_receivers_feature_tracking_0
- resourcedetection/_global_0
receivers:
- otlpjsonfile/enabled_receivers_feature_tracking
metrics/fluentbit:
exporters:
- googlecloud
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -431,6 +431,12 @@ processors:
resourcedetection/_global_0:
detectors:
- gcp
transform/enabled_receivers_feature_tracking_0:
error_mode: ignore
metric_statements:
- context: datapoint
statements:
- set(time, Now())
transform/otel_1:
error_mode: ignore
metric_statements:
Expand All @@ -455,6 +461,12 @@ receivers:
mute_process_exe_error: true
mute_process_name_error: true
processes: {}
otlpjsonfile/enabled_receivers_feature_tracking:
include:
- enabled_receivers_otlp.json
- feature_tracking_otlp.json
poll_interval: 1m0s
replay_file: true
prometheus/fluentbit:
config:
scrape_configs:
Expand Down Expand Up @@ -484,6 +496,14 @@ service:
- resourcedetection/_global_0
receivers:
- hostmetrics/hostmetrics
metrics/enabled_receivers_feature_tracking:
exporters:
- googlecloud
processors:
- transform/enabled_receivers_feature_tracking_0
- resourcedetection/_global_0
receivers:
- otlpjsonfile/enabled_receivers_feature_tracking
metrics/fluentbit:
exporters:
- googlecloud
Expand Down
Loading