|
| 1 | +# RFC 2023-05-02 - Data Volume Insights metrics |
| 2 | + |
| 3 | +Vector needs to be able to emit accurate metrics that can be usefully queried |
| 4 | +to give users insights into the volume of data moving through the system. |
| 5 | + |
| 6 | +## Scope |
| 7 | + |
| 8 | +### In scope |
| 9 | + |
| 10 | +- All volume event metrics within Vector need to emit the estimated JSON size of the |
| 11 | + event. With a consistent method for determining the size it will be easier to accurately |
| 12 | + compare data in vs data out. |
| 13 | + - `component_received_event_bytes_total` |
| 14 | + - `component_sent_event_bytes_total` |
| 15 | + - `component_received_events_total` |
| 16 | + - `component_sent_events_total` |
| 17 | +- The metrics sent by each sink need to be tagged with the source id of the |
| 18 | + event so that the route an event takes through Vector can be queried. |
| 19 | +- Each event needs to be labelled with a `service`. This is a new concept |
| 20 | + within Vector and represents the application that generated the log, |
| 21 | + metric or trace. |
| 22 | +- The service tag and source tag in the metrics need to be opt-in so customers |
| 23 | + that don't need the increased cardinality are unaffected. |
| 24 | + |
| 25 | +### Out of scope |
| 26 | + |
| 27 | +- The separate metrics `component_sent_bytes_total` and `component_received_bytes_total`, |
| 28 | + which indicate the network bytes sent and received by Vector, are not considered here. |
| 29 | + |
| 30 | +## Pain |
| 31 | + |
| 32 | +Currently it is difficult to accurately gauge the volume of data that is moving |
| 33 | +through Vector. It is also difficult to query which source the data being sent |
| 34 | +out came from. |
| 35 | + |
| 36 | +## Proposal |
| 37 | + |
| 38 | +### User Experience |
| 39 | + |
| 40 | +Global config options will be provided to indicate that the `service` tag and the |
| 41 | +`source_id` tag should be sent. For example: |
| 42 | + |
| 43 | +```yaml |
| 44 | +telemetry: |
| 45 | + tags: |
| 46 | + service: true |
| 47 | + source_id: true |
| 48 | +``` |
| 49 | + |
| 50 | +This will cause Vector to emit a metric like (note the last two tags): |
| 51 | + |
| 52 | +```prometheus |
| 53 | +vector_component_sent_event_bytes_total{component_id="out",component_kind="sink",component_name="out",component_type="console" |
| 54 | + ,host="machine",service="potato",source_id="stdin"} 123 |
| 55 | +``` |
| 56 | + |
| 57 | +The default will be to not emit these tags. |
| 58 | + |
| 59 | +### Implementation |
| 60 | + |
| 61 | +#### Metric tags |
| 62 | + |
| 63 | +**service** - to attach the service, we need to add a new meaning to Vector - |
| 64 | + `service`. Any sources that receive data that could potentially |
| 65 | + be considered a service will need to indicate which field means |
| 66 | + `service`. This work has largely already been done with the |
| 67 | + LogNamespacing work, so it will be trivial to add this new field. |
| 68 | + Not all sources will be able to specify a specific field to |
| 69 | + indicate the `service`. In time it will be possible for this to |
| 70 | + be accomplished through `VRL`. |
| 71 | + |
| 72 | +**source_id** - A new field will be added to the [Event metadata][event_metadata] - |
| 73 | + `Arc<OutputId>` that will indicate the source of the event. |
| 74 | + `OutputId` will need to be serializable so it can be stored in |
| 75 | + the disk buffer. Since this field is just an identifier, it can |
| 76 | + still be used even if the source no longer exists when the event |
| 77 | + is consumed by a sink. |
| 78 | + |
| 79 | +We will need to do an audit of all components to ensure the |
| 80 | +bytes emitted for the `component_received_event_bytes_total` and |
| 81 | +`component_sent_event_bytes_total` metrics are the estimated JSON size of the |
| 82 | +event. |
| 83 | + |
| 84 | +These tags will be given the name that was configured in |
| 85 | +[User Experience](#user-experience). |
| 86 | + |
| 87 | +Transforms `reduce` and `aggregate` combine multiple events together. In this |
| 88 | +case the `source` and `service` of the first event will be taken. |
| 89 | + |
| 90 | +If there is no `source`, a source of `-` will be emitted. The only way this can |
| 91 | +happen is if the event was created by the `lua` transform. |
| 92 | + |
| 93 | +If there is no `service` available, a service of `-` will be emitted. |
| 94 | + |
| 95 | +Emitting a `-` rather than not emitting anything at all makes it clear that |
| 96 | +there was no value rather than it just having been forgotten and ensures it |
| 97 | +is clear that the metric represents no `service` or `source` rather than the |
| 98 | +aggregate value across all services. |
| 99 | + |
| 100 | +The [Component Spec][component_spec] will need to be updated to indicate that |
| 101 | +these tags must be included. |
| 102 | + |
| 103 | +**Performance** - There is going to be a performance hit when emitting these metrics. |
| 104 | +Currently, for each batch a single event is emitted containing the count and size |
| 105 | +of the entire batch. With this change it will be necessary to scan the entire |
| 106 | +batch to count the events for each source, service combination before emitting |
| 107 | +the counts. This will involve additional allocations to maintain the counts as well |
| 108 | +as the O(n) scan over the batch. |
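
The per-batch scan described above could look roughly like the following. This is only a sketch: `Event` and `Totals` are hypothetical, simplified stand-ins (not Vector's real types), and the missing-tag fallback of `-` is taken from the rule stated earlier in this section.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a Vector event (not the real type).
struct Event {
    source_id: Option<String>,
    service: Option<String>,
    /// Estimated JSON size of the event, in bytes.
    json_size: usize,
}

/// Event count and total estimated JSON size for one (source, service) pair.
#[derive(Debug, Default, PartialEq)]
struct Totals {
    count: usize,
    bytes: usize,
}

/// Scans the batch once (O(n)) and groups the totals by (source, service).
/// A missing source or service is reported as "-".
fn group_batch(batch: &[Event]) -> HashMap<(String, String), Totals> {
    let mut totals: HashMap<(String, String), Totals> = HashMap::new();
    for event in batch {
        // These clones are part of the extra allocation cost noted above.
        let key = (
            event.source_id.clone().unwrap_or_else(|| "-".to_string()),
            event.service.clone().unwrap_or_else(|| "-".to_string()),
        );
        let entry = totals.entry(key).or_default();
        entry.count += 1;
        entry.bytes += event.json_size;
    }
    totals
}
```

One metric update would then be emitted per map entry rather than one per batch, which is where the additional cost comes from.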
| 109 | + |
| 110 | +#### `component_received_event_bytes_total` |
| 111 | + |
| 112 | +This metric is emitted by the framework [here][source_sender], so it looks like |
| 113 | +the only change needed is to add the service tag. |
| 114 | + |
| 115 | +#### `component_sent_event_bytes_total` |
| 116 | + |
| 117 | +For stream based sinks this will typically be the byte value returned by |
| 118 | +`DriverResponse::events_sent`. |
| 119 | + |
| 120 | +Despite being in the [Component Spec][component_spec], not all sinks currently |
| 121 | +conform to this. |
| 122 | + |
| 123 | +As an example, from a cursory glance over a couple of sinks: |
| 124 | + |
| 125 | +The AMQP sink currently emits this value as the length of the binary |
| 126 | +data that is sent. By the time the data has reached the code where the |
| 127 | +`component_sent_event_bytes_total` event is emitted, that event has been |
| 128 | +encoded and the actual estimated JSON size has been lost. The sink will need |
| 129 | +to be updated so that when the event is encoded, the encoded event together |
| 130 | +with the pre-encoded JSON bytesize will be sent to the service where the event |
| 131 | +is emitted. |
| 132 | + |
| 133 | +The Kafka sink also currently sends the binary size, but it looks like the |
| 134 | +estimated JSON bytesize is easily accessible at the point of emitting, so would |
| 135 | +not need too much of a change. |
| 136 | + |
| 137 | +To ensure that the correct metric is sent in a type-safe manner, we will wrap |
| 138 | +the estimated JSON size in a newtype: |
| 139 | + |
| 140 | +```rust |
| 141 | +pub struct JsonSize(usize); |
| 142 | +``` |
| 143 | + |
| 144 | +The `EventsSent` metric will only accept this type. |
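
As a rough illustration of how the newtype enforces this, the sketch below adds a constructor, an accessor and summation. The `new` and `get` helpers, the trait impls and the shape of the `EventsSent` struct are assumptions made for the example, not the actual Vector definitions.

```rust
use std::iter::Sum;
use std::ops::Add;

/// Estimated JSON-encoded size of an event, in bytes.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord)]
pub struct JsonSize(usize);

impl JsonSize {
    /// Hypothetical constructor; in practice only the JSON size estimator
    /// would create values of this type.
    pub const fn new(bytes: usize) -> Self {
        JsonSize(bytes)
    }

    /// Returns the raw byte count.
    pub const fn get(self) -> usize {
        self.0
    }
}

impl Add for JsonSize {
    type Output = JsonSize;

    fn add(self, rhs: JsonSize) -> JsonSize {
        JsonSize(self.0 + rhs.0)
    }
}

impl Sum for JsonSize {
    fn sum<I: Iterator<Item = JsonSize>>(iter: I) -> JsonSize {
        iter.fold(JsonSize::default(), |acc, size| acc + size)
    }
}

/// Hypothetical stand-in for the `EventsSent` event. Because `byte_size` is a
/// `JsonSize`, passing a raw encoded length (a plain `usize`) does not compile.
pub struct EventsSent {
    pub count: usize,
    pub byte_size: JsonSize,
}
```

With this shape, a sink that only has the encoded byte length available fails to compile until the estimated JSON size is threaded through to the point of emission.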
| 145 | + |
| 146 | +### Registered metrics |
| 147 | + |
| 148 | +It is currently not possible to have dynamic tags with preregistered metrics. |
| 149 | + |
| 150 | +Preregistering these metrics is essential to ensure that they don't expire. |
| 151 | + |
| 152 | +The current mechanism to expire metrics is to check if a handle to the given |
| 153 | +metric is being held. If it isn't, and nothing has updated that metric in |
| 154 | +the last cycle, the metric is dropped. If a metric is dropped, the next time |
| 155 | +that event is emitted with those tags, the count starts at zero again. |
| 156 | + |
| 157 | +We will need to introduce a registered event caching layer that will register |
| 158 | +and cache new events keyed on the tags that are sent to it. |
| 159 | + |
| 160 | +Currently a registered metric is stored in a `Registered<EventsSent>`. |
| 161 | + |
| 162 | +We will need a new struct that can wrap this that will be generic over a tuple of |
| 163 | +the tags for each event and the event - e.g. `Cached<(String, String), EventsSent>`. |
| 164 | +This struct will maintain a `BTreeMap` of tags -> `Registered`. Since this will |
| 165 | +need to be shared across threads, the cache will need to be stored in an `RwLock`. |
| 166 | + |
| 167 | +In pseudo-Rust: |
| 168 | + |
| 169 | +```rust |
| 170 | +struct Cached<Tags, Event> { |
| 171 | +    cache: Arc<RwLock<BTreeMap<Tags, Registered<Event>>>>, |
| 172 | +    register: Box<dyn Fn(&Tags) -> Registered<Event> + Send + Sync>, |
| 173 | +} |
| 174 | + |
| 175 | +impl<Tags: Ord + Clone, Event> Cached<Tags, Event> { |
| 176 | +    fn emit(&self, tags: &Tags, value: Event) { |
| 177 | +        if let Some(event) = self.cache.read().unwrap().get(tags) { |
| 178 | +            return event.emit(value); |
| 179 | +        } |
| 180 | +        // Not cached yet: register under the write lock, then emit. |
| 181 | +        let mut cache = self.cache.write().unwrap(); |
| 182 | +        let event = cache.entry(tags.clone()).or_insert_with(|| (self.register)(tags)); |
| 183 | +        event.emit(value); |
| 184 | +    } |
| 185 | +} |
| 186 | +``` |
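
To make the locking behaviour concrete, here is a self-contained variant of the caching idea that compiles on its own. `Counter` is a hypothetical stand-in for `Registered<EventsSent>` (a plain atomic counter), and `CachedCounters` and its methods are illustrative names only.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a registered metric handle.
#[derive(Clone, Default)]
struct Counter(Arc<AtomicU64>);

impl Counter {
    fn emit(&self, bytes: u64) {
        self.0.fetch_add(bytes, Ordering::Relaxed);
    }

    fn value(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// (source_id, service)
type Tags = (String, String);

/// Cache of counters keyed on tags, shareable across threads via `clone`.
#[derive(Clone, Default)]
struct CachedCounters {
    cache: Arc<RwLock<BTreeMap<Tags, Counter>>>,
}

impl CachedCounters {
    fn emit(&self, tags: &Tags, bytes: u64) {
        // Fast path: only a read lock is needed once the tags are known.
        if let Some(counter) = self.cache.read().unwrap().get(tags) {
            counter.emit(bytes);
            return;
        }
        // Slow path: take the write lock. `entry` re-checks the map, so a
        // counter registered by another thread in the meantime is reused.
        let mut cache = self.cache.write().unwrap();
        cache.entry(tags.clone()).or_default().emit(bytes);
    }

    fn value(&self, tags: &Tags) -> Option<u64> {
        self.cache.read().unwrap().get(tags).map(Counter::value)
    }
}
```

Because the cache itself holds every handle, a counter for a given tag set is never dropped between emissions. That is the property needed to stop the count resetting to zero, at the cost of the map growing with tag cardinality.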
| 187 | + |
| 188 | +## Rationale |
| 189 | + |
| 190 | +The ability to visualize data flowing through Vector will allow users to ascertain |
| 191 | +the effectiveness of the current use of Vector. This will enable users to |
| 192 | +optimise their configurations to make the best use of Vector's features. |
| 193 | + |
| 194 | +## Drawbacks |
| 195 | + |
| 196 | +The additional tags being added to the metrics will increase the cardinality of |
| 197 | +those metrics if they are enabled. |
| 198 | + |
| 199 | +## Prior Art |
| 200 | + |
| 201 | + |
| 202 | +## Alternatives |
| 203 | + |
| 204 | +We could use an alternative metric instead of estimated JSON size. |
| 205 | + |
| 206 | +- *Network bytes* This provides a more accurate picture of the actual data being received |
| 207 | + and sent by Vector, but will regularly produce a different size for the same event on |
| 208 | + the way in and on the way out. |
| 209 | +- *In memory size* The size of the event as held in memory. This may be more accurate in |
| 210 | + determining the amount of memory Vector will be utilizing at any time, but will often be |
| 211 | + less accurate compared to the data being sent and received, which is often JSON. |
| 212 | + |
| 213 | +## Outstanding Questions |
| 214 | + |
| 215 | +## Plan Of Attack |
| 216 | + |
| 217 | +Incremental steps to execute this change. These will be converted to issues after the RFC is approved: |
| 218 | + |
| 219 | +- [ ] Add the `source` field to the Event metadata to indicate the source the event has come from. |
| 220 | +- [ ] Update the Volume event metrics to take a `JsonSize` value. Use the compiler to ensure all metrics |
| 221 | + emitted use this. The `EstimatedJsonEncodedSizeOf` trait will be updated to return a `JsonSize`. |
| 222 | +- [ ] Add the Service meaning. Update any sources that potentially create a service to point the meaning |
| 223 | + to the relevant field. |
| 224 | +- [ ] Introduce an event caching layer that caches registered events based on the tags sent to it. |
| 225 | +- [ ] Update the emitted events to accept the new tags - taking the `telemetry` configuration options |
| 226 | + into account. |
| 227 | +- [ ] There is going to be a hit on performance with these changes. Add benchmarking to help us understand |
| 228 | + how large the impact will be. |
| 229 | + |
| 230 | +## Future Improvements |
| 231 | + |
| 232 | +- Logs emitted by Vector should also be tagged with `source_id` and `service`. |
| 233 | +- This RFC proposes storing the source and service as strings. This incurs a cost since scanning each |
| 234 | + event to get the counts of events by source and service will involve multiple string comparisons. A |
| 235 | + future optimization could be to hash the combination of these values at the source into a single |
| 236 | + integer. |
| 237 | + |
| 238 | +[component_spec]: https://github.com/vectordotdev/vector/blob/master/docs/specs/component.md#componenteventssent |
| 239 | +[source_sender]: https://github.com/vectordotdev/vector/blob/master/src/source_sender/mod.rs#L265-L268 |
| 240 | +[event_metadata]: https://github.com/vectordotdev/vector/blob/master/lib/vector-core/src/event/metadata.rs#L20-L38 |