-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[pdata] Find replacement for now deprecated gogo/protobuf #7095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Yes, Collector uses Gogoproto instead of Google's implementation for performance reasons. Here are a few replacement options that I can think of:
Unfortunately I do not see great options here, all have some downsides. We will need to decide how much effort we are willing to invest into this. There is probably also an option to find some other alternate Protobuf library but I have not explored this. |
As far as replacements, there are other projects that seem to provide similar capabilities to what Gogo had: At minimum, they seem to be maintained at the moment, and hopefully are more narrow in scope than gogo. |
What is Jaeger using nowadays? Any experience you can share with us? |
Jaeger hasn't changed from gogo/proto |
I've not tried using It's probably worth another go at it now that it's been over a year. |
@codeboten I saw that PR, but can you share some findings / conclusions? It was closed without any explanation, so not clear if you ran into a dead end with something. |
The scope of the work then was to add support in the collector for From what I recall the performance numbers were not encouraging at first. Additionally I ran into issues w/ lack of support for |
**Description:** Reorganize `VERSIONING.md`, specify compatibility guarantees that have already been discussed and answer questions on #8002. Three new guarantees in the document had already been discussed elsewhere: - String representation: #7625 (comment) - Go version compatibility: Mentioned [on README.md](https://github.com/open-telemetry/opentelemetry-collector?tab=readme-ov-file#compatibility) - Dependency updates: Discussed in private, most recently in relation to whether #7095 should block `pdata`'s 1.0 (consensus seems to be that it should not). Regarding the questions mentioned on #8002: > Does adding new validation rules mean a breaking change? Generally, yes except when the configuration was already invalid (i.e. the Collector did not work properly with it). > Should we offer a "NewDefault...." for each config and say that the behavior of the config is stable only if initialized with NewDefault? I did not add anything for this one. I think we should discuss it, but: - I don't think there is a sensible output for `NewDefault...` for all configurations (e.g. for `confignet.TCPAddr`, what would the default be?) - It seems to me that we should handle configuration validity through `Validate`, and not through `NewDefault...`. `NewDefault` may help but ultimately we need to verify through `Validate`. > Is adding new fields means breaking change? Let's assume someone "squash" the message into another message, then adding a field with same mapstruct value will be a breaking change, what do we do with that? What are the rules we follow. This was already in this document, where we had decided that adding new fields was okay. I think it would be too stringent to bar us from adding new fields. **Link to tracking Issue:** Fixes #8002 --------- Co-authored-by: Yang Song <[email protected]>
This may be time to revisit this issue. I was dealing with grpc upgrade to v1.66.0 in Jaeger where gRPC introduced |
https://go.dev/blog/protobuf-opaque looks very promising as a path towards removing gogo/protobuf |
Yes, let's see if performance is on par. Also, yay for lazy decoding. |
We are very interested in this initiative, as we've observed the following in our OTel deployment:
Would you be interested in our contribution to switch to |
@els0r Thanks for offering to contribute this! I would definitely be interested in seeing how the performance compares and making the switch if we can do so :) Based on prior experience, this will take time and some back and forth, but the blog post shared by Yuri and your comments makes it feel like this is a very promising option. If it is possible to start small in some way (e.g. the simplest/smallest PoC you can think of), I think the first step should be to do this and bring it up on one of our Collector SIG virtual meetings, to get some early feedback. |
Why do you use JSON encoding? |
Thanks for the quick responses. We'll explore the option on a fork to get a sense of the complexity / the back and forth. We use json encoding when writing telemetry (mainly logs) onto a Kafka queue towards external parties. It's the OTLP -> JSON conversion step where we currently see the most CPU time spent. |
After a bit of experimentation (see https://github.com/denysvitali/opentelemetry-collector/tree/feature/use-protobuf-opaque-api) I was able to partially migrate to Before entering into the details of the feature, here are some results: ResultsNote I bet the results would be even faster as soon as golang/protobuf#1673 is implemented. The current implementation is still a bit inefficient, but already shows some nice improvements. Upstream implementation (6757cc7)Only addition:
Fork (bba99f7)Note This implementation doesn't generate the exact same JSON - but for the relevant parts (log lines) it does.
Comparison
This shows that the new implementation is 88% faster (CPU) and decreases the memory usage by 76%. ImplementationMy implementation is far from complete, and probably very buggy. My goal was to understand how much of a performance improvement this change would bring - so I focused entirely on that and didn't care too much about external users of the Adding
|
@denysvitali Thanks for sharing the results! The results are definitely promising for OTLP/HTTP JSON. I expect most of our users use OTLP/gRPC which uses a protobuf binary encoding, so it's very important for us to benchmark those as well, to ensure there are no regressions on this case. |
Yes, that would be also interesting. I can try to benchmark that too if I find the time in the following days. |
Can you run those two benchmarks with benchstat for an easier comparison? |
👋 Just some datapoints from the Prometheus ecosystem (SDKs, Prometheus, Thanos, Cortex etc).: We still use gogo/protobuf in general and we are (very) slowly progressing on the migration. I tried OpaqueAPI a bit and it feels similar to @tigrannajaryan https://github.com/splunk/exp-lazyproto indeed. However, I’m not sure OpaqueAPI+lazy decoding addresses all challenges we have with data-intensive proto use. The non-nullability of fields (e.g. []Elem instead of []*Elem ) is still beneficial in practice (less allocs, less GC, less need for pooling, trivial to benchmark) and lazy decoding makes already too big structs even bigger (what was the last time our code needed information about unknown fields?) . I wonder if there’s an opportunity to maintain together some simpler efficient Go generator for both Otel and Prometheus for simpler use case cases e.g.
Last two features were extremely beneficial for the tailored Prometheus use for scraping. Some recent discussion: https://groups.google.com/g/prometheus-developers/c/uFWRyqZaQis/m/IvnEjVkHAQAJ cc @bboreham |
@bwplotka I offered to work on this a while ago (when working on lazy-proto) but there were no takers. Unfortunately I no longer have time to work on this at the moment, but as a data point: doing lazy-proto didn't feel like an insurmountable amount of work to write your own Protobuf generator. When working on lazy-proto I also thought about doing a streaming Protobuf deserializer and eventually decided to spend my time on a better streaming format (STEF handily beats OTLP Protobuf both on size and speed dimensions. When it matures, I would love to discuss this with Prometheus folks). |
I don't know the full background of using gogo/protobuf, I suspect it was similar to Jaeger's use of gogoproto for serialization efficiency (in particular, the ability to embed structs by value, as in
[]KeyValue
instead of[]*KeyValue
, which significantly reduces memory allocations).Well, the bad news, which was long coming, is that GoGo is officially deprecated since Oct'22. Continue to depend on it going forward presents security and compatibility risks.
This ticket is to kick off a discussion for a viable path forward, since I don't have a solution in mind right now.
Some past related tickets:
The text was updated successfully, but these errors were encountered: