Commit a551f33 - chore: RFC for Data Volume Insights (#17322)

[Rendered](https://github.com/vectordotdev/vector/blob/stephen/data_volume_rfc/rfcs/2023-05-03-data-volume-metrics.md)

Signed-off-by: Stephen Wakely <[email protected]>

1 parent 4ce3278 - 1 file changed, +240 -0 lines
# RFC 2023-05-02 - Data Volume Insights metrics

Vector needs to be able to emit accurate metrics that can be usefully queried
to give users insights into the volume of data moving through the system.

## Scope

### In scope

- All volume event metrics within Vector need to emit the estimated JSON size of the
  event. With a consistent method for determining the size, it will be easier to accurately
  compare data in vs data out.
  - `component_received_event_bytes_total`
  - `component_sent_event_bytes_total`
  - `component_received_events_total`
  - `component_sent_events_total`
- The metrics sent by each sink need to be tagged with the source ID of the
  event so the route an event takes through Vector can be queried.
- Each event needs to be labelled with a `service`. This is a new concept
  within Vector and represents the application that generated the log,
  metric, or trace.
- The `service` tag and `source` tag in the metrics need to be opt-in so users
  that don't need the increased cardinality are unaffected.

### Out of scope

- Separate metrics, `component_sent_bytes_total` and `component_received_bytes_total`,
  that indicate network bytes sent and received by Vector are not considered here.

## Pain

Currently it is difficult to accurately gauge the volume of data moving
through Vector, and to query where data being sent out has come from.

## Proposal

### User Experience

Global config options will be provided to indicate that the `service` tag and the
`source` tag should be sent. For example:

```yaml
telemetry:
  tags:
    service: true
    source_id: true
```

This will cause Vector to emit a metric like the following (note the last two tags):

```prometheus
vector_component_sent_event_bytes_total{component_id="out",component_kind="sink",component_name="out",component_type="console",host="machine",service="potato",source_id="stdin"} 123
```

The default will be to not emit these tags.

### Implementation

#### Metric tags

**service** - To attach the service, we need to add a new meaning to Vector -
`service`. Any sources that receive data that could potentially
be considered a service will need to indicate which field means
`service`. This work has largely already been done with the
LogNamespacing work, so it will be trivial to add this new field.
Not all sources will be able to specify a specific field to
indicate the `service`. In time, it will be possible to accomplish
this through `VRL`.

**source_id** - A new field will be added to the [Event metadata][event_metadata] -
`Arc<OutputId>` - that will indicate the source of the event.
`OutputId` will need to be serializable so it can be stored in
the disk buffer. Since this field is just an identifier, it can
still be used even if the source no longer exists by the time the event
is consumed by a sink.

We will need to audit all components to ensure the
bytes emitted for the `component_received_event_bytes_total` and
`component_sent_event_bytes_total` metrics are the estimated JSON size of the
event.

These tags will be given the names that were configured in
[User Experience](#user-experience).

The `reduce` and `aggregate` transforms combine multiple events into one. In this
case the `source` and `service` of the first event will be taken.

If there is no `source`, a source of `-` will be emitted. The only way this can
happen is if the event was created by the `lua` transform.

If there is no `service` available, a service of `-` will be emitted.

Emitting a `-` rather than not emitting anything at all makes it clear that
there was no value, rather than it having been forgotten, and ensures it
is clear that the metric represents no `service` or `source` rather than the
aggregate value across all services.

The [Component Spec][component_spec] will need updating to indicate that these
tags need to be included.
103+
**Performance** - There is going to be a performance hit when emitting these metrics.
104+
Currently for each batch a simple event is emitted containing the count and size
105+
of the entire batch. With this change it will be necessary to scan the entire
106+
batch to obtain the count of source, service combinations of events before emitting
107+
the counts. This will involve additional allocations to maintain the counts as well
108+
as the O(1) scan.
109+
110+
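
As a rough illustration of that per-batch scan, here is a minimal sketch. The `Event` struct and its fields below are hypothetical stand-ins for illustration, not Vector's actual types:

```rust
use std::collections::BTreeMap;

// Illustrative event type; Vector's real events carry this
// information in their metadata.
struct Event {
    source_id: String,
    service: String,
    estimated_json_size: usize,
}

/// Scan a batch once (O(n)) and accumulate the event count and
/// estimated JSON byte total for each (source, service) combination.
fn aggregate_batch(batch: &[Event]) -> BTreeMap<(String, String), (u64, usize)> {
    let mut counts: BTreeMap<(String, String), (u64, usize)> = BTreeMap::new();
    for event in batch {
        let key = (event.source_id.clone(), event.service.clone());
        let entry = counts.entry(key).or_insert((0, 0));
        entry.0 += 1;
        entry.1 += event.estimated_json_size;
    }
    counts
}
```

A `BTreeMap` is used here to mirror the keying proposed for the registered-metric cache below; a `HashMap` would serve equally well for per-batch counts.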

#### `component_received_event_bytes_total`

This metric is emitted by the framework [here][source_sender], so it looks like
the only change needed is to add the `service` tag.

#### `component_sent_event_bytes_total`

For stream-based sinks this will typically be the byte value returned by
`DriverResponse::events_sent`.

Despite this being in the [Component Spec][component_spec], not all sinks currently
conform to it.

As an example, from a cursory glance over a couple of sinks:

The `amqp` sink currently emits this value as the length of the binary
data that is sent. By the time the data has reached the code where the
`component_sent_event_bytes_total` event is emitted, the event has been
encoded and the actual estimated JSON size has been lost. The sink will need
to be updated so that when the event is encoded, the encoded event together
with the pre-encoded JSON byte size is sent to the service where the event
is emitted.

The `kafka` sink also currently sends the binary size, but it looks like the
estimated JSON byte size is easily accessible at the point of emitting, so it
would not need too much of a change.

To ensure that the correct metric is sent in a type-safe manner, we will wrap
the estimated JSON size in a newtype:

```rust
pub struct JsonSize(usize);
```

The `EventsSent` metric will only accept this type.
146+
### Registered metrics
147+
148+
It is currently not possible to have dynamic tags with preregistered metrics.
149+
150+
Preregistering these metrics are essential to ensure that they don't expire.
151+
152+
The current mechanism to expire metrics is to check if a handle to the given
153+
metric is being held. If it isn't, and nothing has updated that metric in
154+
the last cycle - the metric is dropped. If a metric is dropped, the next time
155+
that event is emitted with those tags, the count starts at zero again.
156+
157+
We will need to introduce a registered event caching layer that will register
158+
and cache new events keyed on the tags that are sent to it.
159+
160+
Currently a registered metrics is stored in a `Registered<EventSent>`.
161+
162+
We will need a new struct that can wrap this that will be generic over a tuple of
163+
the tags for each event and the event - eg. `Cached<(String, String), EventSent>`.
164+
This struct will maintain a BTreeMap of tags -> `Registered`. Since this will
165+
need to be shared across threads, the cache will need to be stored in an `RwLock`.
166+
167+
In pseudo rust:
168+
169+

```rust
struct Cached<Tags, Event> {
    cache: Arc<RwLock<BTreeMap<Tags, Registered<Event>>>>,
    register: fn(Tags) -> Registered<Event>,
}

impl<Tags, Event> Cached<Tags, Event> {
    // Locking via the RwLock is elided for brevity.
    fn emit(&mut self, tags: Tags, value: Event) {
        if let Some(event) = self.cache.get(&tags) {
            event.emit(value);
        } else {
            let event = (self.register)(tags.clone());
            event.emit(value);
            self.cache.insert(tags, event);
        }
    }
}
```

## Rationale

The ability to visualize data flowing through Vector will allow users to ascertain
how effectively Vector is currently being used. This will enable users to
optimise their configurations to make the best use of Vector's features.

## Drawbacks

The additional tags being added to the metrics will increase the cardinality of
those metrics if they are enabled.

## Prior Art

## Alternatives

We could use an alternative measure instead of the estimated JSON size.

- *Network bytes* - This provides a more accurate picture of the actual data being received
  and sent by Vector, but will regularly produce different sizes for an incoming event
  and its corresponding outgoing event.
- *In-memory size* - The size of the event as held in memory. This may be more accurate for
  determining the amount of memory Vector will be utilizing at any time, but will often be
  less accurate compared to the data being sent and received, which is often JSON.

## Outstanding Questions

## Plan Of Attack

Incremental steps to execute this change. These will be converted to issues after the RFC is approved:

- [ ] Add the `source` field to the Event metadata to indicate the source the event has come from.
- [ ] Update the volume event metrics to take a `JsonSize` value. Use the compiler to ensure all
      emitted metrics use this. The `EstimatedJsonEncodedSizeOf` trait will be updated to return a `JsonSize`.
- [ ] Add the `service` meaning. Update any sources that potentially create a service to point the meaning
      to the relevant field.
- [ ] Introduce an event caching layer that caches registered events based on the tags sent to it.
- [ ] Update the emitted events to accept the new tags, taking the `telemetry` configuration options
      into account.
- [ ] There is going to be a performance hit with these changes. Add benchmarking to help us understand
      how large the impact will be.

## Future Improvements

- Logs emitted by Vector should also be tagged with `source_id` and `service`.
- This RFC proposes storing the source and service as strings. This incurs a cost, since scanning each
  event to get the counts of events by source and service will involve multiple string comparisons. A
  future optimization could be to hash the combination of these values at the source into a single
  integer.
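
The hashing optimization above could be sketched with the standard library's hasher. This is purely illustrative: `DefaultHasher` is not stable across processes or Rust versions, so a real implementation would need a stable hash if the value were ever persisted (e.g. in a disk buffer):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Combine a source ID and a service name into a single integer key,
/// so per-event grouping becomes one integer comparison instead of
/// two string comparisons.
fn tag_key(source_id: &str, service: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    source_id.hash(&mut hasher);
    service.hash(&mut hasher);
    hasher.finish()
}
```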
[component_spec]: https://github.com/vectordotdev/vector/blob/master/docs/specs/component.md#componenteventssent
239+
[source_sender]: https://github.com/vectordotdev/vector/blob/master/src/source_sender/mod.rs#L265-L268
240+
[event_metadata]: https://github.com/vectordotdev/vector/blob/master/lib/vector-core/src/event/metadata.rs#L20-L38
