Component instancing is complicated. Is it really necessary?

**Is your feature request related to a problem? Please describe.**

Users define `component.ID`s in their collector configurations using the format `type[/name]`, where `type` is, for example, `otlp` or `filelog`, and `name` is an additional arbitrary string which helps identify a unique configuration.

When the collector runs with a given configuration, it will often create and run multiple instances of a given component. The rules that govern how instancing works are not well documented, but I have [previously described them in detail here](https://github.com/open-telemetry/opentelemetry-collector/pull/8169#discussion_r1283727573).

In short:
- Receivers and exporters are instanced _per data type_
- Processors are instanced _per pipeline_
- Connectors are instanced _per ordered pair of data types_
- `sharedcomponent` is a hacky "best practice" which allows component authors to circumvent instancing

I believe that instancing _solves some problems_ but _introduces others_. Ultimately, I believe we may be better off removing the notion of instancing as described below. Although we are working aggressively towards 1.0 of the collector, I think we should seriously consider this change because it would substantially improve one requirement of 1.0 (self-observability).

---

### Terminology

Even discussing this issue is difficult because we do not have clear terminology, so I will introduce some terms here which are useful for the rest of the discussion:

A "Component Configuration" is an element which a user defines within the `receivers`, `processors`, `exporters`, `connectors`, or `extensions` section. For example, in the following configuration, there are 3 Component Configurations (`otlp/in`, `batch` and `otlp/out`):

```yaml
receivers:
  otlp/in:
processors:
  batch:
exporters:
  otlp/out:
```

A "Component Instance" is a corresponding struct which the collector instantiates internally. The following configuration will result in more than 3 Component Instances:

```yaml
service:
  pipelines:
    logs/1:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    logs/2:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    traces/1:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
```

If you doubt that our current instancing behavior is confusing, please consider whether you can easily answer the following:
1. How many Component Instances are instantiated for the above configuration? 
2. For each instance, what is an appropriate set of attributes that allows a user to understand that a given piece of telemetry originated from that instance?

([my answer here](https://gist.github.com/djaglowski/99f471545ecc630b8f65f3fbae9d5282))

### Problem 1 - Observability

Observability is all about understanding the internal state of a system. Because the collector is an observability tool, it is critical that it be a good exemplar of an observable system. However, our current notion of instancing is difficult to understand and therefore makes the collector less observable than it should be.

Users should be able to understand when telemetry describes a specific Component Instance but we do not currently have an externally defined schema for this. In fact, proper identity is dependent on both the class (receiver/processor/exporter/connector) and type (otlp/filelog) of component so it may not even be possible to define a _static_ schema for identity attributes.

Additionally, users should be able to understand the relationship between a Component Configuration and its associated Component Instance(s). Even if the identity problem were reasonably well solved, any attempt to communicate to users about a specific Instance is muddled by the rules that define these relationships. Even a simple configuration such as the example above may require a deep understanding of collector internals in order to understand the number of Component Instances, how they each relate to a Component Configuration, and which pipeline or pipelines contain which Component Instances.

### Problem 2 - Maintainability

Identifying and managing Component Instances alongside Component Configurations is difficult to do even within the collector codebase.

The effort to design and implement a robust component status model was largely complicated by the fact that distinct Component Instances that each correspond to the same Component Configuration may be in different states. For example, an exporter which pushes logs to one endpoint and traces to another should make it very clear when one is healthy while the other is not.

Additionally, having worked on some early prototypes of a hot-reload capability, I recently discovered it is very difficult to accomplish something as simple as getting a handle to the set of Component Instances that were instantiated from a given Component Configuration. (Once the collector is started, if we want to apply a new config which only changes a single Component Configuration, we need to get a handle to the corresponding Component Instances before we can take any meaningful action.)
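
To make the hot-reload requirement concrete, here is a minimal sketch of the kind of lookup that is currently hard to obtain: an index from a Component Configuration ID to every Component Instance created from it. The `Instance` and `Registry` types are hypothetical and do not exist in the collector today.

```go
package main

import "fmt"

// Instance is a hypothetical handle to a running Component Instance.
type Instance struct {
	ConfigID string // ID of the Component Configuration, e.g. "otlp/out"
	Scope    string // data type or pipeline the instance was created for
}

// Registry is a hypothetical index from Component Configuration to instances.
type Registry struct {
	byConfig map[string][]*Instance
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{byConfig: map[string][]*Instance{}}
}

// Add records an instance under the configuration it was created from.
func (r *Registry) Add(inst *Instance) {
	r.byConfig[inst.ConfigID] = append(r.byConfig[inst.ConfigID], inst)
}

// InstancesOf returns every instance created from the given configuration.
func (r *Registry) InstancesOf(configID string) []*Instance {
	return r.byConfig[configID]
}

func main() {
	reg := NewRegistry()
	reg.Add(&Instance{ConfigID: "otlp/out", Scope: "logs"})
	reg.Add(&Instance{ConfigID: "otlp/out", Scope: "traces"})
	reg.Add(&Instance{ConfigID: "batch", Scope: "logs/1"})

	// A reload that only changes "otlp/out" needs exactly these handles.
	for _, inst := range reg.InstancesOf("otlp/out") {
		fmt.Printf("reload %s (%s)\n", inst.ConfigID, inst.Scope)
	}
}
```

If each Component Configuration were instantiated exactly once, the slice would always have length one and the index would reduce to a plain map from ID to instance.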

**Describe the solution you'd like**

I think we should seriously consider whether or not component instancing is actually a net positive, and what it would look like if each Component Configuration were instantiated exactly once. In order to determine whether this model would be an improvement, it is necessary to consider the reasons why we currently create Component Instances.

### Processors 

Processors are currently instanced in a unique way relative to other components. Every pipeline is given a unique instance of a processor in order _to ensure that pipelines do not overlap_. For example: 

```yaml
pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ batch ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ batch ]
    exporters: [ otlp/out/2 ]
```

If `batch` were a single instance, then the inputs from both pipelines would all flow into the same processor, and the output of that processor would flow to both exporters. Essentially it would be the same as:

```yaml
pipelines:
  logs:
    receivers: [ otlp/in/1, otlp/in/2 ]
    processors: [ batch ]
    exporters: [ otlp/out/1, otlp/out/2 ]
```

However, users could very easily explicitly configure their pipelines to remain separate by declaring two separate `batch` processors:

```yaml
pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ batch/1 ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ batch/2 ]
    exporters: [ otlp/out/2 ]
```

Importantly, we would lose one clear benefit here, which is that a single processor _configuration_ could not be reused. In other words, the user would have to define `batch/1` and `batch/2` separately in the `processors` section. This may be a small problem in many cases but for complex processor configurations it could lead to a lot of duplicated configuration. Users can instead rely on YAML anchors if necessary. ~~See `Explicitly Instanced Configurations` below for an explanation of how this can be resolved.~~

A notable implementation detail for this design is that the service graph would need to manage fanouts and capabilities for every consumer. This is certainly doable and might actually be less complicated, because we could have one generic way to calculate these factors, whereas the current logic handles them in one place immediately after receivers and in another immediately before exporters.
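
A minimal sketch of what a generic fanout could look like if any consumer may have several downstream consumers. All types here (`Consumer`, `Fanout`, `Passthrough`, `Sink`) are hypothetical simplifications, records are plain strings, and capabilities are not modelled at all.

```go
package main

import "fmt"

// Consumer is a hypothetical, signal-agnostic consumer interface.
type Consumer interface {
	Consume(record string)
}

// Fanout forwards each record to every downstream consumer.
type Fanout []Consumer

func (f Fanout) Consume(record string) {
	for _, next := range f {
		next.Consume(record)
	}
}

// Passthrough stands in for a single shared processor instance.
type Passthrough struct {
	Next Consumer
}

func (p *Passthrough) Consume(record string) {
	p.Next.Consume(record)
}

// Sink stands in for an exporter and records what it received.
type Sink struct {
	Name string
	Got  []string
}

func (s *Sink) Consume(record string) {
	s.Got = append(s.Got, record)
}

func main() {
	out1 := &Sink{Name: "otlp/out/1"}
	out2 := &Sink{Name: "otlp/out/2"}

	// One shared "batch" instance whose output fans out to both exporters.
	batch := &Passthrough{Next: Fanout{out1, out2}}

	// Both receivers feed the same instance, so the pipelines merge.
	batch.Consume("from otlp/in/1")
	batch.Consume("from otlp/in/2")

	fmt.Println(out1.Name, out1.Got)
	fmt.Println(out2.Name, out2.Got)
}
```

Both sinks end up with records from both receivers, which is the merged behaviour described for a single `batch` instance. Keeping the pipelines separate would require two distinct processor nodes.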

### Receivers & Exporters

Receivers and exporters are currently instanced _per data type_. I believe the primary reason for this is to protect against _order of instantiation_ bugs. For example, consider the following pseudocode:

```go
// Current pattern
factory := GetFactory("foo")
logsReceiver := factory.CreateLogsReceiver(...)
tracesReceiver := factory.CreateTracesReceiver(...)

// Possible alternative pattern
factory := GetFactory("foo")
receiver := factory.CreateReceiver(WithLogs(...), WithTraces(...))
```

In the alternative pattern there is a chance that during instantiation of the component, each additional type may introduce an incorrect interaction into the state of the struct. **In my opinion, this is the biggest open question about the feasibility and impact of this approach.** That said, this is already opted-into by components using the `sharedcomponent` pattern. I don't believe it has proven to be very difficult to work with in those cases, so perhaps this is not much of a concern. If there is sufficient interest in this proposal I will investigate further.
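
For illustration only, the alternative pattern could be sketched with Go's functional options idiom. Every name below (`Receiver`, `Option`, `WithLogs`, `WithTraces`, `CreateReceiver`) is hypothetical and does not correspond to an existing collector API. Consumers are reduced to plain functions.

```go
package main

import "fmt"

// LogsConsumer and TracesConsumer are hypothetical, simplified consumers.
type LogsConsumer func(record string)
type TracesConsumer func(span string)

// Receiver is a single instance that may serve several data types.
type Receiver struct {
	logs   LogsConsumer
	traces TracesConsumer
}

// Option configures one data type on a Receiver.
type Option func(*Receiver)

// WithLogs attaches the next logs consumer.
func WithLogs(next LogsConsumer) Option {
	return func(r *Receiver) { r.logs = next }
}

// WithTraces attaches the next traces consumer.
func WithTraces(next TracesConsumer) Option {
	return func(r *Receiver) { r.traces = next }
}

// CreateReceiver builds exactly one instance from all supplied options.
func CreateReceiver(opts ...Option) *Receiver {
	r := &Receiver{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// Signals reports which data types this instance was configured for.
func (r *Receiver) Signals() []string {
	var signals []string
	if r.logs != nil {
		signals = append(signals, "logs")
	}
	if r.traces != nil {
		signals = append(signals, "traces")
	}
	return signals
}

func main() {
	r := CreateReceiver(
		WithLogs(func(string) {}),
		WithTraces(func(string) {}),
	)
	fmt.Println(r.Signals())
}
```

Because each option in this sketch only sets its own field, the order in which options are passed does not matter. The concern raised above applies to components where supporting an additional data type touches state shared with the others.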

### Connectors

Connectors are currently instanced _per ordered pair of data types_. I believe the considerations here are inherited from receivers and exporters so I am not aware of any unique considerations at this point. 

### Extensions

Extensions already follow the one-instance-per-configuration pattern which I am proposing to apply across the board.

~~### _Explicitly_ Instanced Configurations~~

~~As mentioned earlier one of the immediate drawbacks of removing our current instancing logic is that users could not necessarily reuse configurations for _processors_. This is already the case for other components but arguably it is useful functionality nonetheless so I think we should consider how else to provide it. First, there is a notion of YAML anchors which we could recommend. However, I believe it would not be difficult to provide users a native alternative.~~

~~In short, we can make _all_ Component Configurations reusable by allowing users to _explicitly_ instance components within the pipeline definitions. I believe this would require us to designate one additional special character for this syntax. As a placeholder I'll use `#` but obviously we would need to consider this more carefully. A complete example:~~

```yaml
receivers:
  otlp/in/1:
  otlp/in/2:
processors:
  transform:
    # lots of statements I don't want to repeat
exporters:
  otlp/out/1:
  otlp/out/2:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ transform#1 ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ transform#2 ]
    exporters: [ otlp/out/2 ]
```

~~In this example, we've defined a Component Configuration for a `transform` processor once, and then used it twice by applying an explicit name to each usage of it. Two separate instances are created because we explicitly indicated to do so. Additional scrutiny of this syntax is necessary but the point is that _some syntax_ could presumably allow users to indicate whether an additional instance of the configuration is intended.~~
