Commit 0fa3dc8

Attribute sets not observed during async callbacks are not exported (open-telemetry#3242)
1 parent 584fa06 commit 0fa3dc8

3 files changed

+82
-58
lines changed

CHANGELOG.md

+2
@@ -17,6 +17,8 @@ release.
1717
([#3648](https://github.com/open-telemetry/opentelemetry-specification/pull/3648))
1818
- MetricReader.Collect ignores Resource from MetricProducer.Produce.
1919
([#3636](https://github.com/open-telemetry/opentelemetry-specification/pull/3636))
20+
- Attribute sets not observed during async callbacks are not exported.
21+
([#3242](https://github.com/open-telemetry/opentelemetry-specification/pull/3242))
2022

2123
### Logs
2224

specification/metrics/sdk.md

+4
@@ -701,6 +701,10 @@ execution.
701701
The implementation MUST complete the execution of all callbacks for a
702702
given instrument before starting a subsequent round of collection.
703703

704+
The implementation SHOULD NOT produce aggregated metric data for a
705+
previously-observed attribute set which is not observed during a successful
706+
callback.
707+
704708
### Cardinality limits
705709

706710
**Status**: [Experimental](../document-status.md)
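
The rule added to `specification/metrics/sdk.md` above can be sketched as follows. This is a minimal illustration, not SDK code: `collect_round` and the `(attributes, value)` observation shape are hypothetical names chosen for the example, and handling of failed callbacks is deliberately left out because the added text only covers successful ones.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

# Hypothetical types for this sketch: an attribute set is a frozenset of
# key/value pairs, and a callback yields (attributes, value) observations.
Attributes = FrozenSet[Tuple[str, str]]
Observation = Tuple[Dict[str, str], int]


def collect_round(callback: Callable[[], Iterable[Observation]]) -> Dict[Attributes, int]:
    """Run one successful callback and return the points to export.

    The result is built only from this invocation, so an attribute set seen
    in an earlier round but not observed now produces no data point.
    """
    points: Dict[Attributes, int] = {}
    for attrs, value in callback():
        points[frozenset(attrs.items())] = value
    return points


# Two rounds taken from the page-fault example: pid 1001 disappears in the second.
rounds = [
    [({"pid": "1001"}, 60), ({"pid": "1002"}, 47)],
    [({"pid": "1002"}, 53), ({"pid": "1003"}, 5)],
]
for observations in rounds:
    exported = collect_round(lambda: observations)
    print(sorted(dict(key)["pid"] for key in exported))
```

The second round exports only pids `1002` and `1003`; nothing is carried over for pid `1001`.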

specification/metrics/supplementary-guidelines.md

+76-58
@@ -431,25 +431,30 @@ Instrument](./api.md#histogram). What if we collect measurements from an
431431
[Asynchronous Counter](./api.md#asynchronous-counter)?
432432

433433
The following example shows the number of [page
434-
faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the
435-
thread ever started:
434+
faults](https://en.wikipedia.org/wiki/Page_fault) of each process since
435+
it started:
436436

437437
* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
438-
* pid = `1001`, tid = `1`, #PF = `50`
439-
* pid = `1001`, tid = `2`, #PF = `30`
438+
* pid = `1001`, #PF = `50`
439+
* pid = `1002`, #PF = `30`
440440
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
441-
* pid = `1001`, tid = `1`, #PF = `53`
442-
* pid = `1001`, tid = `2`, #PF = `38`
441+
* pid = `1001`, #PF = `53`
442+
* pid = `1002`, #PF = `38`
443443
* During the time range (T<sub>2</sub>, T<sub>3</sub>]
444-
* pid = `1001`, tid = `1`, #PF = `56`
445-
* pid = `1001`, tid = `2`, #PF = `42`
444+
* pid = `1001`, #PF = `56`
445+
* pid = `1002`, #PF = `42`
446446
* During the time range (T<sub>3</sub>, T<sub>4</sub>]:
447-
* pid = `1001`, tid = `1`, #PF = `60`
448-
* pid = `1001`, tid = `2`, #PF = `47`
447+
* pid = `1001`, #PF = `60`
448+
* pid = `1002`, #PF = `47`
449449
* During the time range (T<sub>4</sub>, T<sub>5</sub>]:
450-
* thread 1 died, thread 3 started
451-
* pid = `1001`, tid = `2`, #PF = `53`
452-
* pid = `1001`, tid = `3`, #PF = `5`
450+
* process 1001 died, process 1003 started
451+
* pid = `1002`, #PF = `53`
452+
* pid = `1003`, #PF = `5`
453+
* During the time range (T<sub>5</sub>, T<sub>6</sub>]:
454+
* A new process 1001 started
455+
* pid = `1001`, #PF = `10`
456+
* pid = `1002`, #PF = `57`
457+
* pid = `1003`, #PF = `8`
453458

454459
Note that in the following examples, Cumulative aggregation
455460
temporality is discussed before Delta aggregation temporality because
@@ -461,47 +466,56 @@ API with specified Cumulative aggregation temporality.
461466
If we export the metrics using **Cumulative Temporality**:
462467

463468
* (T<sub>0</sub>, T<sub>1</sub>]
464-
* attributes: {pid = `1001`, tid = `1`}, sum: `50`
465-
* attributes: {pid = `1001`, tid = `2`}, sum: `30`
469+
* attributes: {pid = `1001`}, sum: `50`
470+
* attributes: {pid = `1002`}, sum: `30`
466471
* (T<sub>0</sub>, T<sub>2</sub>]
467-
* attributes: {pid = `1001`, tid = `1`}, sum: `53`
468-
* attributes: {pid = `1001`, tid = `2`}, sum: `38`
472+
* attributes: {pid = `1001`}, sum: `53`
473+
* attributes: {pid = `1002`}, sum: `38`
469474
* (T<sub>0</sub>, T<sub>3</sub>]
470-
* attributes: {pid = `1001`, tid = `1`}, sum: `56`
471-
* attributes: {pid = `1001`, tid = `2`}, sum: `42`
475+
* attributes: {pid = `1001`}, sum: `56`
476+
* attributes: {pid = `1002`}, sum: `42`
472477
* (T<sub>0</sub>, T<sub>4</sub>]
473-
* attributes: {pid = `1001`, tid = `1`}, sum: `60`
474-
* attributes: {pid = `1001`, tid = `2`}, sum: `47`
478+
* attributes: {pid = `1001`}, sum: `60`
479+
* attributes: {pid = `1002`}, sum: `47`
475480
* (T<sub>0</sub>, T<sub>5</sub>]
476-
* attributes: {pid = `1001`, tid = `2`}, sum: `53`
477-
* attributes: {pid = `1001`, tid = `3`}, sum: `5`
481+
* attributes: {pid = `1002`}, sum: `53`
482+
* (T<sub>4</sub>, T<sub>5</sub>]
483+
* attributes: {pid = `1003`}, sum: `5`
484+
* (T<sub>5</sub>, T<sub>6</sub>]
485+
* attributes: {pid = `1001`}, sum: `10`
486+
* (T<sub>0</sub>, T<sub>6</sub>]
487+
* attributes: {pid = `1002`}, sum: `57`
488+
* (T<sub>4</sub>, T<sub>6</sub>]
489+
* attributes: {pid = `1003`}, sum: `8`
478490

479491
The behavior in the first four periods is quite straightforward - we
480492
just take the data being reported from the asynchronous instruments
481493
and send them.
482494

483-
The data model prescribes several valid behaviors at T<sub>5</sub> in
484-
this case, where one stream dies and another starts. The [Resets and
485-
Gaps](./data-model.md#resets-and-gaps) section describes how start
486-
timestamps and staleness markers can be used to increase the
495+
The data model prescribes several valid behaviors at T<sub>5</sub> and
496+
T<sub>6</sub> in this case, where one stream dies and another starts.
497+
The [Resets and Gaps](./data-model.md#resets-and-gaps) section describes
498+
how start timestamps and staleness markers can be used to increase the
487499
receiver's understanding of these events.
488500

489501
Consider whether the SDK maintains individual timestamps for the
490502
individual stream, or just one per process. In this example, where a
491-
thread can die and start counting page faults from zero, the valid
492-
behaviors at T<sub>5</sub> are:
503+
process can die and restart, it starts counting page faults from zero.
504+
In this case, the valid behaviors at T<sub>5</sub> and T<sub>6</sub>
505+
are:
493506

494507
1. If all streams in the process share a start time, and the SDK is
495508
not required to remember all past streams: the thread restarts with
496-
zero sum. Receivers with reset detection are able to calculate a
497-
correct rate (except for frequent restarts relative to the
498-
collection interval), however the precise time of a reset will be
499-
unknown.
500-
2. If the SDK maintains per-stream start times, it signals to the
501-
receiver precisely when a stream started, making the first
502-
observation in a stream more useful for diagnostics. Receivers can
503-
perform overlap detection or duplicate suppression and do not
504-
require reset detection, in this case.
509+
zero sum, and the start time of the process. Receivers with reset
510+
detection are able to calculate a correct rate (except for frequent
511+
restarts relative to the collection interval), however the precise
512+
time of a reset will be unknown.
513+
2. If the SDK maintains per-stream start times, it provides the previous
514+
callback time as the start time, as this time is before the occurrence
515+
of any events which are measured during the subsequent callback. This
516+
makes the first observation in a stream more useful for diagnostics,
517+
as downstream consumers can perform overlap detection or duplicate
518+
suppression and do not require reset detection in this case.
505519
3. Independent of above treatments, the SDK can add a staleness marker
506520
to indicate the start of a gap in the stream when one thread dies
507521
by remembering which streams have previously reported but are not
@@ -519,20 +533,23 @@ data model.
519533
If we export the metrics using **Delta Temporality**:
520534

521535
* (T<sub>0</sub>, T<sub>1</sub>]
522-
* attributes: {pid = `1001`, tid = `1`}, delta: `50`
523-
* attributes: {pid = `1001`, tid = `2`}, delta: `30`
536+
* attributes: {pid = `1002`}, delta: `30`
524537
* (T<sub>1</sub>, T<sub>2</sub>]
525-
* attributes: {pid = `1001`, tid = `1`}, delta: `3`
526-
* attributes: {pid = `1001`, tid = `2`}, delta: `8`
538+
* attributes: {pid = `1001`}, delta: `3`
539+
* attributes: {pid = `1002`}, delta: `8`
527540
* (T<sub>2</sub>, T<sub>3</sub>]
528-
* attributes: {pid = `1001`, tid = `1`}, delta: `3`
529-
* attributes: {pid = `1001`, tid = `2`}, delta: `4`
541+
* attributes: {pid = `1001`}, delta: `3`
542+
* attributes: {pid = `1002`}, delta: `4`
530543
* (T<sub>3</sub>, T<sub>4</sub>]
531-
* attributes: {pid = `1001`, tid = `1`}, delta: `4`
532-
* attributes: {pid = `1001`, tid = `2`}, delta: `5`
544+
* attributes: {pid = `1001`}, delta: `4`
545+
* attributes: {pid = `1002`}, delta: `5`
533546
* (T<sub>4</sub>, T<sub>5</sub>]
534-
* attributes: {pid = `1001`, tid = `2`}, delta: `6`
535-
* attributes: {pid = `1001`, tid = `3`}, delta: `5`
547+
* attributes: {pid = `1002`}, delta: `6`
548+
* attributes: {pid = `1003`}, delta: `5`
549+
* (T<sub>5</sub>, T<sub>6</sub>]
550+
* attributes: {pid = `1001`}, delta: `10`
551+
* attributes: {pid = `1002`}, delta: `4`
552+
* attributes: {pid = `1003`}, delta: `3`
536553

537554
You can see that we are performing Cumulative->Delta conversion, and it requires
538555
us to remember the last value of **every single permutation we've encountered so
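
The Cumulative->Delta conversion in this hunk can be sketched as below. It is an illustration under stated assumptions: `to_delta` is a hypothetical helper, streams are keyed by `pid` only, and the converter forgets any stream that the latest callback did not observe, so that a reused pid starts again from zero.

```python
from typing import Dict, Tuple


def to_delta(last: Dict[str, int], current: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Convert one cumulative snapshot into deltas.

    `last` holds the previous cumulative value per stream. The returned state
    contains only the streams observed in `current`, so a stream that vanished
    is forgotten rather than compared against a stale value later.
    """
    deltas = {pid: value - last.get(pid, 0) for pid, value in current.items()}
    return deltas, dict(current)


# Cumulative page-fault counts from the example, one snapshot per callback.
snapshots = [
    {"1001": 50, "1002": 30},
    {"1001": 53, "1002": 38},
    {"1001": 56, "1002": 42},
    {"1001": 60, "1002": 47},
    {"1002": 53, "1003": 5},
    {"1001": 10, "1002": 57, "1003": 8},
]
last: Dict[str, int] = {}
for snapshot in snapshots:
    deltas, last = to_delta(last, snapshot)
    print(deltas)
```

For the last two snapshots this yields deltas of `6` and `5`, then `10`, `4` and `3`, matching the (T<sub>4</sub>, T<sub>5</sub>] and (T<sub>5</sub>, T<sub>6</sub>] entries in the list. Had the state for the first pid `1001` been kept, the new process would have produced `10 - 60 = -50`.
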
@@ -560,27 +577,28 @@ So here are some suggestions that we encourage SDK implementers to consider:
560577
##### Asynchronous example: attribute removal in a view
561578

562579
Suppose the metrics in the asynchronous example above are exported
563-
through a view configured to remove the `tid` attribute, leaving a
564-
single-dimensional count of page faults by `pid`. For each metric
565-
stream, two measurements are produced covering the same interval of
566-
time, which the SDK is expected to aggregate before producing the
567-
output.
580+
through a view configured to remove the `pid` attribute, leaving a
581+
count of page faults. For each metric stream, two measurements are produced
582+
covering the same interval of time, which the SDK is expected to aggregate
583+
before producing the output.
568584

569585
The data model specifies to use the "natural merge" function, in this
570586
case meaning to add the current point values together because they
571587
are `Sum` data points. The expected output is, still in **Cumulative
572588
Temporality**:
573589

574590
* (T<sub>0</sub>, T<sub>1</sub>]
575-
* dimensions: {pid = `1001`}, sum: `80`
591+
* dimensions: {}, sum: `80`
576592
* (T<sub>0</sub>, T<sub>2</sub>]
577-
* dimensions: {pid = `1001`}, sum: `91`
593+
* dimensions: {}, sum: `91`
578594
* (T<sub>0</sub>, T<sub>3</sub>]
579-
* dimensions: {pid = `1001`}, sum: `98`
595+
* dimensions: {}, sum: `98`
580596
* (T<sub>0</sub>, T<sub>4</sub>]
581-
* dimensions: {pid = `1001`}, sum: `107`
597+
* dimensions: {}, sum: `107`
582598
* (T<sub>0</sub>, T<sub>5</sub>]
583-
* dimensions: {pid = `1001`}, sum: `58`
599+
* dimensions: {}, sum: `58`
600+
* (T<sub>0</sub>, T<sub>6</sub>]
601+
* dimensions: {}, sum: `75`
584602

585603
As discussed in the asynchronous cumulative temporality example above,
586604
there are various treatments available for detecting resets. Even if
