Component instancing is complicated. Is it really necessary? #10534

Open
@djaglowski

Is your feature request related to a problem? Please describe.

Users define component.IDs in their collector configurations using the format type[/name], where type is, for example, otlp or filelog, and name is an additional arbitrary string that helps identify a unique configuration.

When the collector runs with a given configuration, it often will create and run multiple instances of a given component. The rules that govern how instancing works are not well documented but I have previously described them in detail here.

In short:

  • Receivers and exporters are instanced per data type
  • Processors are instanced per pipeline
  • Connectors are instanced per ordered pair of data types
  • sharedcomponent is a hacky "best practice" which allows component authors to circumvent instancing

I believe that instancing solves some problems but introduces others. Ultimately, I believe we may be better off removing the notion of instancing as described below. Although we are working aggressively towards 1.0 of the collector, I think we should seriously consider this change because it would substantially improve one requirement of 1.0 (self-observability).


Terminology

Even discussing this issue is difficult because we do not have clear terminology so I will attempt to introduce some terms here which are useful for the rest of the discussion:

A "Component Configuration" is an element which a user defines within the receivers, processors, exporters, connectors, or extensions section. For example, in the following configuration, there are 3 Component Configurations (otlp/in, batch and otlp/out):

receivers:
  otlp/in:
processors:
  batch:
exporters:
  otlp/out:

A "Component Instance" is a corresponding struct which the collector instantiates internally. The following configuration will result in more than 3 Component Instances:

service:
  pipelines:
    logs/1:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    logs/2:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    traces/1:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]

If you are doubting that our current instancing behavior is confusing, please consider whether you can easily answer the following:

  1. How many Component Instances are instantiated for the above configuration?
  2. For each instance, what is an appropriate set of attributes that allows a user to understand that a given piece of telemetry originated from that instance?

(my answer here)
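For a rough feel of question 1, the sketch below applies only the three rules from the "In short" list to the configuration above. It is a deliberately naive model, not the collector's actual graph code: the pipeline struct and countInstances are hypothetical names, connectors are left out, and sharedcomponent is ignored, so real components that opt into sharing may behave differently. It is also not necessarily the same as the linked answer.

```go
package main

import "fmt"

// pipeline is a hypothetical, simplified view of one entry under
// service.pipelines. It is not a type from the collector codebase.
type pipeline struct {
	dataType   string // "logs", "traces", or "metrics"
	name       string // the part after the slash, e.g. "1"
	receivers  []string
	processors []string
	exporters  []string
}

// countInstances applies the naive rules only:
//   - receivers and exporters: one instance per (component ID, data type)
//   - processors: one instance per (component ID, pipeline)
// Connectors and sharedcomponent are intentionally not modeled.
func countInstances(pipelines []pipeline) int {
	instances := map[string]struct{}{}
	for _, p := range pipelines {
		for _, id := range p.receivers {
			instances["receiver|"+id+"|"+p.dataType] = struct{}{}
		}
		for _, id := range p.exporters {
			instances["exporter|"+id+"|"+p.dataType] = struct{}{}
		}
		for _, id := range p.processors {
			instances["processor|"+id+"|"+p.dataType+"/"+p.name] = struct{}{}
		}
	}
	return len(instances)
}

// examplePipelines mirrors the logs/1, logs/2, traces/1 configuration above.
func examplePipelines() []pipeline {
	mk := func(dataType, name string) pipeline {
		return pipeline{
			dataType:   dataType,
			name:       name,
			receivers:  []string{"otlp/in"},
			processors: []string{"batch"},
			exporters:  []string{"otlp/out"},
		}
	}
	return []pipeline{mk("logs", "1"), mk("logs", "2"), mk("traces", "1")}
}

func main() {
	fmt.Println(countInstances(examplePipelines()))
}
```

Under these assumptions the count comes from two receiver keys (logs, traces), two exporter keys (logs, traces), and three processor keys (one per pipeline). Any component that uses sharedcomponent would change that picture, which is part of why the question is hard to answer from the configuration alone.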

Problem 1 - Observability

Observability is all about understanding the internal state of a system. Because the collector is an observability tool, it is critical that it be a good exemplar of an observable system. However, our current notion of instancing is difficult to understand and therefore makes the collector less observable than it should be.

Users should be able to understand when telemetry describes a specific Component Instance but we do not currently have an externally defined schema for this. In fact, proper identity is dependent on both the class (receiver/processor/exporter/connector) and type (otlp/filelog) of component so it may not even be possible to define a static schema for identity attributes.

Additionally, users should be able to understand the relationship between a Component Configuration and its associated Component Instance(s). Even if the identity problem were reasonably well solved, any attempt to communicate to users about a specific Instance is muddled by the rules that define these relationships. Even a simple configuration such as the example above may require a deep understanding of collector internals in order to understand the number of Component Instances, how they each relate to a Component Configuration, and which pipeline or pipelines contain which Component Instances.

Problem 2 - Maintainability

Identifying and managing Component Instances alongside Component Configurations is difficult to do even within the collector codebase.

The effort to design and implement a robust component status model was largely complicated by the fact that distinct Component Instances that each correspond to the same Component Configuration may be in different states. For example, if an exporter pushes logs to one endpoint and traces to another, it should be very clear when one is healthy while the other is not.

Additionally, having worked on some early prototypes of a hot-reload capability, I recently discovered it is very difficult to accomplish something as simple as getting a handle to the set of Component Instances that were instantiated from a given Component Configuration. (Once the collector is started, if we want to apply a new config which only changes a single Component Configuration, we need to get a handle to the corresponding Component Instances before we can take any meaningful action.)

Describe the solution you'd like

I think we should seriously consider whether or not component instancing is actually a net positive, and what it would look like if each Component Configuration were instantiated exactly once. In order to determine whether this model would be an improvement, it is necessary to consider the reasons why we currently create Component Instances.

Processors

Processors are currently instanced in a unique way relative to other components. Every pipeline is given a unique instance of a processor in order to ensure that pipelines do not overlap. For example:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ batch ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ batch ]
    exporters: [ otlp/out/2 ]

If batch were a single instance, then the inputs from both pipelines would all flow into the same processor, and the output of that processor would flow to both exporters. Essentially it would be the same as:

pipelines:
  logs:
    receivers: [ otlp/in/1, otlp/in/2 ]
    processors: [ batch ]
    exporters: [ otlp/out/1, otlp/out/2 ]

However, users could very easily explicitly configure their pipelines to remain separate by declaring two separate batch processors:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ batch/1 ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ batch/2 ]
    exporters: [ otlp/out/2 ]
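The difference between sharing batch and declaring batch/1 and batch/2 can be sketched as a reachability check over a graph in which every Component Configuration is exactly one node. This is only an illustration of the proposed single-instance model: the pipeline type, buildEdges, and reaches are hypothetical helpers, not collector APIs.

```go
package main

import "fmt"

// pipeline is a hypothetical, minimal pipeline description.
type pipeline struct {
	receivers, processors, exporters []string
}

// prefixed tags each ID with its component class so that a receiver and an
// exporter sharing the same ID remain distinct nodes.
func prefixed(class string, ids []string) []string {
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		out = append(out, class+":"+id)
	}
	return out
}

// buildEdges treats every component ID as a single node (the single-instance
// model) and links each pipeline stage to the stage that follows it.
func buildEdges(pipelines []pipeline) map[string][]string {
	edges := map[string][]string{}
	for _, p := range pipelines {
		stages := [][]string{prefixed("receiver", p.receivers)}
		for _, id := range p.processors {
			stages = append(stages, []string{"processor:" + id})
		}
		stages = append(stages, prefixed("exporter", p.exporters))
		for i := 0; i+1 < len(stages); i++ {
			for _, from := range stages[i] {
				edges[from] = append(edges[from], stages[i+1]...)
			}
		}
	}
	return edges
}

// reaches reports whether data entering at from can arrive at to.
func reaches(edges map[string][]string, from, to string) bool {
	seen := map[string]bool{}
	stack := []string{from}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if n == to {
			return true
		}
		if seen[n] {
			continue
		}
		seen[n] = true
		stack = append(stack, edges[n]...)
	}
	return false
}

// twoPipelines builds logs/1 and logs/2 with the given processor IDs.
func twoPipelines(proc1, proc2 string) []pipeline {
	return []pipeline{
		{receivers: []string{"otlp/in/1"}, processors: []string{proc1}, exporters: []string{"otlp/out/1"}},
		{receivers: []string{"otlp/in/2"}, processors: []string{proc2}, exporters: []string{"otlp/out/2"}},
	}
}

func main() {
	shared := buildEdges(twoPipelines("batch", "batch"))
	separate := buildEdges(twoPipelines("batch/1", "batch/2"))
	fmt.Println(reaches(shared, "receiver:otlp/in/1", "exporter:otlp/out/2"))
	fmt.Println(reaches(separate, "receiver:otlp/in/1", "exporter:otlp/out/2"))
}
```

With a shared batch node, data from otlp/in/1 can reach otlp/out/2; with batch/1 and batch/2, the two pipelines stay disjoint.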

Importantly, we would lose one clear benefit here, which is that a single processor configuration could not be reused. In other words, the user would have to define batch/1 and batch/2 separately in the processors section. This may be a small problem in many cases but for complex processor configurations it could lead to a lot of duplicated configuration. Users can instead rely on YAML anchors if necessary. See Explicitly Instanced Configurations below for an explanation of how this can be resolved.

A notable implementation detail for this design is that we would need the service graph to manage fanouts and capabilities for every consumer. This is certainly doable and might actually be less complicated because we can have one generic way to calculate these factors vs the current logic which has one set of logic immediately after receivers and another immediately before exporters.

Receivers & Exporters

Receivers and exporters are currently instanced per data type. I believe the primary reason for this is to protect against order of instantiation bugs. For example, consider the following pseudocode:

// Current pattern
factory := GetFactory("foo")
logsReceiver := factory.CreateLogsReceiver(...)
tracesReceiver := factory.CreateTracesReceiver(...)

// Possible alternative pattern
factory := GetFactory("foo")
receiver := factory.CreateReceiver(WithLogs(...), WithTraces(...))

In the alternative pattern there is a chance that during instantiation of the component, each additional type may introduce an incorrect interaction into the state of the struct. In my opinion, this is the biggest open question about the feasibility and impact of this approach. That said, this is already opted-into by components using the sharedcomponent pattern. I don't believe it has proven to be very difficult to work with in those cases, so perhaps this is not much of a concern. If there is sufficient interest in this proposal I will investigate further.
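To make the alternative pattern slightly more concrete, here is a minimal sketch using Go functional options. Everything in it is hypothetical: fooReceiver, Option, WithLogs, WithTraces, CreateReceiver, and the two consumer types are stand-ins invented for illustration and do not correspond to the collector's real factory or consumer interfaces.

```go
package main

import "fmt"

// LogsConsumer and TracesConsumer are hypothetical stand-ins for the
// collector's consumer interfaces; they are not the real API.
type LogsConsumer func(record string)
type TracesConsumer func(span string)

// fooReceiver is a single instance that may serve several data types.
type fooReceiver struct {
	logs   LogsConsumer
	traces TracesConsumer
}

// Option configures one data type on the receiver.
type Option func(*fooReceiver)

// WithLogs wires the logs consumer. It touches only the logs field, so it
// cannot disturb state set up for another data type.
func WithLogs(next LogsConsumer) Option {
	return func(r *fooReceiver) { r.logs = next }
}

// WithTraces wires the traces consumer, again touching only its own field.
func WithTraces(next TracesConsumer) Option {
	return func(r *fooReceiver) { r.traces = next }
}

// CreateReceiver builds exactly one instance and applies all options to it.
func CreateReceiver(opts ...Option) *fooReceiver {
	r := &fooReceiver{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// Signals lists the data types this single instance was wired for.
func (r *fooReceiver) Signals() []string {
	var s []string
	if r.logs != nil {
		s = append(s, "logs")
	}
	if r.traces != nil {
		s = append(s, "traces")
	}
	return s
}

func main() {
	r := CreateReceiver(
		WithTraces(func(span string) {}),
		WithLogs(func(record string) {}),
	)
	fmt.Println(r.Signals())
}
```

In this sketch each option writes to its own field, so the result does not depend on the order in which options are passed. A real component with shared state such as a listener or connection pool would need more care, which is the open question raised above.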

Connectors

Connectors are currently instanced per ordered pair of data types. I believe the considerations here are inherited from receivers and exporters so I am not aware of any unique considerations at this point.

Extensions

Extensions already follow the one-instance-per-configuration pattern which I am proposing to apply across the board.

Explicitly Instanced Configurations

As mentioned earlier, one of the immediate drawbacks of removing our current instancing logic is that users could not necessarily reuse configurations for processors. This is already the case for other components, but arguably it is useful functionality nonetheless, so I think we should consider how else to provide it. First, YAML anchors already exist and we could recommend them. However, I believe it would not be difficult to provide users a native alternative.

In short, we can make all Component Configurations reusable by allowing users to explicitly instance components within the pipeline definitions. I believe this would require us to designate one additional special character for this syntax. As a placeholder I'll use # but obviously we would need to consider this more carefully. A complete example:

receivers:
  otlp/in/1:
  otlp/in/2:
processors:
  transform:
    # lots of statements I don't want to repeat
exporters:
  otlp/out/1:
  otlp/out/2:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ transform#1 ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ transform#2 ]
    exporters: [ otlp/out/2 ]

In this example, we've defined a Component Configuration for a transform processor once, and then used it twice by applying an explicit name to each usage of it. Two separate instances are created because we explicitly indicated to do so. Additional scrutiny of this syntax is necessary but the point is that some syntax could presumably allow users to indicate whether an additional instance of the configuration is intended.
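As a sketch of how such a reference might be parsed, the function below splits a pipeline entry of the form type[/name][#instance]. The # separator is the placeholder from this proposal, and instanceRef, parseRef, and ConfigID are hypothetical names; nothing here exists in the collector today.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceRef is a hypothetical parsed form of "type[/name][#instance]".
type instanceRef struct {
	Type     string // component type, e.g. "transform" or "otlp"
	Name     string // optional name after the first slash, e.g. "in/1"
	Instance string // optional explicit instance label after "#"
}

// parseRef splits a pipeline reference using "#" as the placeholder
// instance separator from the proposal.
func parseRef(s string) (instanceRef, error) {
	id, instance, hasInstance := strings.Cut(s, "#")
	if hasInstance && instance == "" {
		return instanceRef{}, fmt.Errorf("empty instance label in %q", s)
	}
	typ, name, _ := strings.Cut(id, "/")
	if typ == "" {
		return instanceRef{}, fmt.Errorf("missing component type in %q", s)
	}
	return instanceRef{Type: typ, Name: name, Instance: instance}, nil
}

// ConfigID returns the Component Configuration this reference points at,
// which is shared by every explicit instance of it.
func (r instanceRef) ConfigID() string {
	if r.Name == "" {
		return r.Type
	}
	return r.Type + "/" + r.Name
}

func main() {
	for _, s := range []string{"transform#1", "transform#2", "otlp/in/1"} {
		ref, err := parseRef(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> config %q, instance %q\n", s, ref.ConfigID(), ref.Instance)
	}
}
```

Both transform#1 and transform#2 resolve to the same ConfigID, so the single transform configuration is looked up once and instantiated once per distinct instance label.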
