diff --git a/text/images/otel-pipeline-monitoring.png b/text/images/otel-pipeline-monitoring.png
new file mode 100644
index 000000000..dc42a6625
Binary files /dev/null and b/text/images/otel-pipeline-monitoring.png differ
diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
new file mode 100644
index 000000000..aebe94508
--- /dev/null
+++ b/text/metrics/0238-pipeline-monitoring.md
@@ -0,0 +1,181 @@
+# OpenTelemetry Telemetry Pipeline metrics
+
+Propose a uniform standard for telemetry pipeline metrics generated by
+OpenTelemetry SDKs and Collectors with support for several levels of
+detail.
+
+**WIP**: This document has been edited recently, based on reviewer
+feedback. Since it has changed substantially, I removed a lot of
+text. I will restore this document after sharing the revisions with
+reviewers.
+
+## Motivation
+
+OpenTelemetry desires to standardize conventions for the metrics
+emitted by SDKs about success and failure of telemetry reporting. At
+the same time, the OpenTelemetry Collector is becoming a stable and
+critical part of the ecosystem, and it has existing conventions which
+are expected to connect with metrics emitted by SDKs.
+
+We use the term "pipeline" to describe an arrangement of system
+components which produce, consume, and process telemetry on its way
+from the point of origin to the endpoint(s) in its journey.
+
+## Explanation
+
+### Detailed design
+
+The proposed metric instrument would be named distinctly depending on
+whether it is a collector or an SDK, to prevent accidental aggregation
+of these timeseries. The specified counter names would be:
+
+- `otelsdk.producer.items`: count of successful and failed items of
+  telemetry produced, by signal type, by an OpenTelemetry SDK.
+- `otelcol.receiver.items`: count of successful and failed items of
+  telemetry received, by signal type, by an OpenTelemetry Collector
+  receiver component.
+- `otelcol.processor.items`: count of successful and failed items of
+  telemetry processed, by signal type, by an OpenTelemetry Collector
+  processor component.
+- `otelcol.exporter.items`: count of successful and failed items of
+  telemetry exported, by signal type, by an OpenTelemetry Collector
+  exporter component.
+
+### Recommended conventional attributes
+
+- `otel.success` (boolean): This is true or false depending on whether the
+  component considers the outcome a success or a failure.
+- `otel.outcome` (string): This describes the outcome in a more specific
+  way than `otel.success`, with recommended values specified below.
+- `otel.signal` (string): This is the name of the signal (e.g., "logs",
+  "metrics", "traces").
+- `otel.name` (string): Name of the component in a pipeline.
+- `otel.pipeline` (string): Name of the pipeline in a collector.
+
+### Specified `otel.outcome` attribute values
+
+The `otel.outcome` attribute indicates extra information about a
+success or failure. A set of standard conventional attribute values
+is supplied and is considered a closed set. If these outcomes do not
+accurately explain the reason for a success or failure outcome, they
+SHOULD be extended by OpenTelemetry.
+
+For success=true:
+
+- `accepted`: Indicates a normal, synchronous request success case.
+  The item was consumed by the next stage of the pipeline, which
+  returned success. Note the item could have been suppressed by a
+  subsequent component, but as far as this component knows, the
+  request was successful.
+- `suppressed:` prefix (e.g., `suppressed:accepted`): Used when the true
+  outcome is not known at the time of counting, and the component
+  intentionally returns success to its producer. Examples are given
+  below.
+
+For both success=true and success=false, there is a special outcome
+indicating items did not reach the next stage in the pipeline,
+considered "dropped".
+When comparing pipeline metrics from one stage to the next, items
+dropped by a component are expected not to appear in the totals of
+the subsequent stage.
+
+- `dropped`: Processors may use this outcome with either success=true
+  or success=false. Examples of success include sampling processors
+  and filtering processors, which intentionally avoid sending data
+  based on configuration. For all components, `dropped` with
+  success=false indicates that the component introduced an original
+  failure and did not send to the next stage in the pipeline.
+
+For success=false, transient and potentially retryable:
+
+- `deadline_exceeded`: The item was in the process of being sent but the request
+  timed out, or its deadline was exceeded.
+- `resource_exhausted`: The item was handled by the next stage of the
+  pipeline, which returned an error code indicating that it was
+  overloaded. If the resource being exhausted is local and the item
+  was not handled by the next stage of the pipeline, use `dropped`.
+- `retryable`: The item was handled by the next stage of the pipeline,
+  which returned a retryable error status not covered by any of the
+  above values.
+
+For success=false, permanent category:
+
+- `rejected`: The item was handled by the next stage of the pipeline,
+  which returned a permanent error status or partial success status
+  indicating that some items could not be accepted.
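The relationship between `otel.outcome` and `otel.success` described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `success_for` and `item_attributes` and the `intentional_drop` flag are hypothetical and not part of the proposal, and the `unknown` suffix is taken from the outcome matrix in this document. Only the attribute keys and outcome strings come from the text.

```python
# Outcome strings are those listed in this proposal; everything else
# (function names, the intentional_drop flag) is a hypothetical sketch.
SUPPRESSED_PREFIX = "suppressed:"
TRANSIENT_FAILURES = {"deadline_exceeded", "resource_exhausted", "retryable"}
PERMANENT_FAILURES = {"rejected"}
# Values that may follow the "suppressed:" prefix, per the outcome matrix.
SUPPRESSIBLE = {"accepted", "dropped", "unknown"} | TRANSIENT_FAILURES | PERMANENT_FAILURES


def success_for(outcome: str, intentional_drop: bool = False) -> bool:
    """Return the otel.success value implied by an otel.outcome value."""
    if outcome == "accepted":
        return True
    if outcome.startswith(SUPPRESSED_PREFIX):
        if outcome[len(SUPPRESSED_PREFIX):] not in SUPPRESSIBLE:
            raise ValueError(f"unknown suppressed outcome: {outcome!r}")
        # The producer saw success regardless of the true outcome.
        return True
    if outcome == "dropped":
        # "dropped" is the one outcome that is valid with both success values.
        return intentional_drop
    if outcome in TRANSIENT_FAILURES or outcome in PERMANENT_FAILURES:
        return False
    # The proposal treats the outcome values as a closed set.
    raise ValueError(f"outcome outside the closed set: {outcome!r}")


def item_attributes(signal: str, outcome: str, intentional_drop: bool = False) -> dict:
    """Build the attribute set to attach to one counter increment."""
    return {
        "otel.success": success_for(outcome, intentional_drop),
        "otel.outcome": outcome,
        "otel.signal": signal,
    }
```

For example, a filtering processor that drops spans by configuration would call `item_attributes("traces", "dropped", intentional_drop=True)`, while an exporter whose queue overflowed would omit the flag and report the same outcome with `otel.success` set to false.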
+
+
+#### Success, Outcome matrix
+
+| Success | Outcome                       | Meaning                                                           |
+|---------|-------------------------------|-------------------------------------------------------------------|
+| true    | accepted                      | Synchronous send succeeded                                        |
+| true    | dropped                       | Dropped by intention                                              |
+| false   | dropped                       | Producer saw the component return failure, request was not sent   |
+| false   | deadline_exceeded             | Producer saw the component return failure, request timed out      |
+| false   | resource_exhausted            | Producer saw the component return failure, insufficient resources |
+| false   | retryable                     | Producer saw the component return other non-permanent condition   |
+| false   | rejected                      | Producer saw the component return a permanent condition           |
+| true    | suppressed:accepted           | Producer saw success; eventually accepted                         |
+| true    | suppressed:dropped            | Producer saw success; request was not sent                        |
+| true    | suppressed:deadline_exceeded  | Producer saw success; request sent, timed out                     |
+| true    | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources        |
+| true    | suppressed:retryable          | Producer saw success; request sent, other non-permanent condition |
+| true    | suppressed:rejected           | Producer saw success; request sent, permanent condition           |
+| true    | suppressed:unknown            | Producer saw success; no effort to report true outcome            |
+
+#### Examples of each outcome
+
+##### Success, Accepted
+
+This is the common success case. The item(s) were sent to the next
+stage in the pipeline while blocking the producer.
+
+##### Success, Dropped
+
+A processor was configured with instructions not to pass certain data.
+
+##### Success, Suppressed-Accepted
+
+A component returned success to its producer, and later the outcome
+was successful.
+
+##### Failure, Dropped and Success, Suppressed-Dropped
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component never sent the item(s) due to limits in effect.
+For example, shutdown was ordered and the queue could not be drained in
+time due to a limit on parallelism.
+
+##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the item(s) did not
+succeed before the deadline expired. If there were attempts to retry,
+this is the outcome of the final attempt.
+
+##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+indicated its (or its consumers') resources were exceeded. If there
+were attempts to retry, this is the outcome of the final attempt.
+
+##### Failure, Retryable and Success, Suppressed-Retryable
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+indicated some kind of transient condition other than deadline- or
+resource-related (e.g., connection not accepted). If there were
+attempts to retry, this is the outcome of the final attempt.
+
+##### Failure, Rejected and Success, Suppressed-Rejected
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+returned a permanent error.
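The examples above share two rules: when retries occur, the counted outcome is that of the final attempt, and a component that already returned success to its producer reports the true outcome behind the `suppressed:` prefix. A minimal Python sketch of that bookkeeping follows. The names `final_outcome`, `count_items`, and the in-memory `items` counter are hypothetical stand-ins for a real metrics counter instrument, and the sketch does not cover intentional drops (success=true, `dropped`).

```python
from collections import Counter

# Hypothetical in-memory stand-in for a counter instrument, keyed by
# (otel.signal, otel.success, otel.outcome).
items = Counter()


def final_outcome(attempts, suppressed=False):
    """Return (otel.success, otel.outcome) for one batch of items.

    attempts: per-attempt outcome strings in order; an empty list means
              the component never sent the item(s), counted as "dropped".
    suppressed: True if the component already returned success to its
              producer before the true outcome was known.
    """
    # Only the final attempt determines the counted outcome.
    outcome = attempts[-1] if attempts else "dropped"
    if suppressed:
        return True, "suppressed:" + outcome
    return outcome == "accepted", outcome


def count_items(n, signal, attempts, suppressed=False):
    """Add n items to the counter under the attributes derived above."""
    success, outcome = final_outcome(attempts, suppressed)
    items[(signal, success, outcome)] += n


# A synchronous export that was retried once and then timed out.
count_items(10, "traces", ["retryable", "deadline_exceeded"])
# A queued export: the producer already saw success, the send was rejected.
count_items(5, "logs", ["rejected"], suppressed=True)
```

Keying the counter on the full attribute tuple keeps suppressed and non-suppressed failures in separate timeseries, which is what allows the two cases in each heading above to be told apart.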