
[pdata] Find replacement for now deprecated gogo/protobuf #7095


Open · yurishkuro opened this issue Feb 2, 2023 · 20 comments
Labels: area:pdata · dependencies · priority:p1

Comments

@yurishkuro
Member

I don't know the full background of using gogo/protobuf, but I suspect it was similar to Jaeger's use of gogoproto for serialization efficiency (in particular, the ability to embed structs by value, as in []KeyValue instead of []*KeyValue, which significantly reduces memory allocations).
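For illustration, a minimal sketch (with simplified, hypothetical stand-in types, not the actual OTLP/pdata structs) of the allocation difference that gogoproto's embed-by-value generation makes:

package example

// Simplified, hypothetical stand-ins for OTLP types, only to illustrate
// the allocation difference; the real gogo-generated structs differ.
type KeyValue struct {
	Key   string
	Value string
}

// With gogoproto's nullable=false, repeated messages are generated as
// value slices: one backing array, no per-element heap allocation.
type AttrsByValue struct {
	Attributes []KeyValue
}

// The standard generator emits pointer slices: every element is a separate
// heap allocation that the GC has to track.
type AttrsByPointer struct {
	Attributes []*KeyValue
}

func buildByValue(n int) AttrsByValue {
	a := AttrsByValue{Attributes: make([]KeyValue, 0, n)} // one allocation
	for i := 0; i < n; i++ {
		a.Attributes = append(a.Attributes, KeyValue{Key: "k", Value: "v"})
	}
	return a
}

func buildByPointer(n int) AttrsByPointer {
	a := AttrsByPointer{Attributes: make([]*KeyValue, 0, n)} // one allocation...
	for i := 0; i < n; i++ {
		// ...plus n more, one per element.
		a.Attributes = append(a.Attributes, &KeyValue{Key: "k", Value: "v"})
	}
	return a
}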

Well, the bad news, which was a long time coming, is that GoGo has been officially deprecated since Oct '22. Continuing to depend on it going forward presents security and compatibility risks.

This ticket is to kick off a discussion for a viable path forward, since I don't have a solution in mind right now.

Some past related tickets:

@tigrannajaryan
Member

tigrannajaryan commented Feb 2, 2023

Yes, the Collector uses Gogoproto instead of Google's implementation for performance reasons.

Here are a few replacement options that I can think of:

  • Move to the Google Protobuf library. This will most likely result in worse performance for the Collector (about 1.3x worse in our past benchmarks, IIRC). The newer version of Google Protobuf (v2?) is even slower than the older version.
  • OTel takes responsibility for the entire Gogoproto project. This is a big undertaking; I don't think we want to sign up for that.
  • Fork Gogoproto and maintain an OTel fork just for ourselves. It is still a pretty large codebase to maintain, and my cursory assessment was that it is not a codebase I would want to take responsibility for maintaining myself.
  • Create our own Protobuf library. This may sound crazy but may be tractable if we limit it to only what OTLP needs. I managed to write an experimental compiler/generator in just a few days that works on OTLP Logs and may be faster than Gogoproto (and with optional-fields support, BTW). Obviously a production-quality library requires a much bigger investment, and I definitely don't want to be the sole author and maintainer of such a library.

Unfortunately I do not see great options here; all have some downsides. We will need to decide how much effort we are willing to invest in this.

There is probably also the option of finding some other alternative Protobuf library, but I have not explored this.

yurishkuro transferred this issue from open-telemetry/opentelemetry-proto on Feb 2, 2023
@yurishkuro
Member Author

As far as replacements go, there are other projects that seem to provide capabilities similar to what Gogo had:

At a minimum, they seem to be maintained at the moment, and hopefully are narrower in scope than gogo.

@jpkrohling
Member

What is Jaeger using nowadays? Any experience you can share with us?

@yurishkuro
Member Author

Jaeger hasn't moved away from gogo/protobuf.

@codeboten
Contributor

I've not tried csproto, but there were a few investigations I did in the past, including trying to use vtproto with the standard Google protobuf library here: #4658

It's probably worth another go now that it's been over a year.
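For reference, a rough sketch of how the vtprotobuf route is typically wired up, assuming messages generated with the protoc-gen-go-vtproto plugin alongside protoc-gen-go (the helper below is illustrative, not code from that PR):

package example

import (
	"google.golang.org/protobuf/proto"
)

// vtMarshaler matches the fast-path methods that protoc-gen-go-vtproto
// generates next to the regular protoc-gen-go code.
type vtMarshaler interface {
	proto.Message
	MarshalVT() ([]byte, error)
}

// marshal uses the vtprotobuf fast path when the message was generated with
// the vtproto plugin, and falls back to the reflection-based marshaler.
func marshal(msg proto.Message) ([]byte, error) {
	if m, ok := msg.(vtMarshaler); ok {
		return m.MarshalVT()
	}
	return proto.Marshal(msg)
}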

@yurishkuro
Member Author

@codeboten I saw that PR, but can you share some findings / conclusions? It was closed without any explanation, so it's not clear whether you ran into a dead end somewhere.

@codeboten
Contributor

codeboten commented Feb 7, 2023

The scope of the work at the time was to add support in the Collector for optional fields, which was achieved through a workaround (#4258 (comment)). That was the reason the PR was closed.

From what I recall, the performance numbers were not encouraging at first. Additionally, I ran into issues with the lack of support for oneof fields in pools.

mx-psi added the dependencies, priority:p1, and area:pdata labels on Apr 13, 2023
mx-psi added a commit that referenced this issue Dec 11, 2023
**Description:** Reorganize `VERSIONING.md`, specify compatibility
guarantees that have already been discussed and answer questions on
#8002.

Three new guarantees in the document had already been discussed
elsewhere:

- String representation:
#7625 (comment)
- Go version compatibility: Mentioned [on
README.md](https://github.com/open-telemetry/opentelemetry-collector?tab=readme-ov-file#compatibility)
- Dependency updates: Discussed in private, most recently in relation to
whether #7095 should block `pdata`'s 1.0 (consensus seems to be that it
should not).

Regarding the questions mentioned on #8002:

> Does adding new validation rules mean a breaking change?

Generally, yes except when the configuration was already invalid (i.e.
the Collector did not work properly with it).

> Should we offer a "NewDefault...." for each config and say that the
behavior of the config is stable only if initialized with NewDefault?

I did not add anything for this one. I think we should discuss it, but:
- I don't think there is a sensible output for `NewDefault...` for all
configurations (e.g. for `confignet.TCPAddr`, what would the default
be?)
- It seems to me that we should handle configuration validity through
`Validate`, and not through `NewDefault...`. `NewDefault` may help but
ultimately we need to verify through `Validate`.

> Does adding new fields mean a breaking change? Let's assume someone "squashes" the message into another message; then adding a field with the same mapstructure value will be a breaking change. What do we do with that? What rules do we follow?

This was already in this document, where we had decided that adding new
fields was okay. I think it would be too stringent to bar us from adding
new fields.

**Link to tracking Issue:** Fixes #8002

---------

Co-authored-by: Yang Song <[email protected]>
@yurishkuro
Member Author

yurishkuro commented Aug 29, 2024

It may be time to revisit this issue. I was dealing with the gRPC upgrade to v1.66.0 in Jaeger, where gRPC introduced CodecV2 and memory buffers that broke our integration with gogoproto. I was able to fix it by forcing the new buffers back to []byte, which is probably going to degrade the efficiency of marshaling, thus partially negating the benefits of gogoproto. I also noticed a new dependency on https://github.com/planetscale/vtprotobuf coming from grpc via envoyproxy/go-control-plane, which led me to envoyproxy/go-control-plane#824 showing large perf improvements from switching to vtprotobuf (which it describes as "gogo protobuf, but built on top of protobuf v2, instead of replacing it").
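For reference, a rough sketch of the kind of bridging adapter described above, assuming grpc-go v1.66+'s encoding.CodecV2 and mem packages and gogoproto's generated Marshal/Unmarshal methods (the details of the actual Jaeger fix may differ):

package example

import (
	"fmt"

	"google.golang.org/grpc/encoding"
	"google.golang.org/grpc/mem"
)

// gogoMessage matches the Marshal/Unmarshal methods emitted by gogoproto's
// marshaler/unmarshaler plugins.
type gogoMessage interface {
	Marshal() ([]byte, error)
	Unmarshal([]byte) error
}

// gogoCodecV2 bridges gogoproto-generated messages to gRPC's CodecV2 by
// flattening the new mem.BufferSlice back into a contiguous []byte. This
// keeps things working but gives up the zero-copy benefits of BufferSlice.
type gogoCodecV2 struct{}

func (gogoCodecV2) Marshal(v any) (mem.BufferSlice, error) {
	msg, ok := v.(gogoMessage)
	if !ok {
		return nil, fmt.Errorf("unsupported message type %T", v)
	}
	b, err := msg.Marshal()
	if err != nil {
		return nil, err
	}
	return mem.BufferSlice{mem.SliceBuffer(b)}, nil
}

func (gogoCodecV2) Unmarshal(data mem.BufferSlice, v any) error {
	msg, ok := v.(gogoMessage)
	if !ok {
		return fmt.Errorf("unsupported message type %T", v)
	}
	// Materialize copies the buffer chain into one []byte, which is the
	// "force the new buffers back to []byte" step mentioned above.
	return msg.Unmarshal(data.Materialize())
}

func (gogoCodecV2) Name() string { return "proto" }

func init() { encoding.RegisterCodecV2(gogoCodecV2{}) }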

@yurishkuro
Member Author

https://go.dev/blog/protobuf-opaque looks very promising as a path towards removing gogo/protobuf

@tigrannajaryan
Member

https://go.dev/blog/protobuf-opaque looks very promising as a path towards removing gogo/protobuf

Yes, let's see if performance is on par.

Also, yay for lazy decoding.

@els0r

els0r commented Jan 16, 2025

We are very interested in this initiative, as we've observed the following in our OTel deployment:

  • CPU is the defining cost driver in the OTel deployment (~15 TB / day shipped)
  • Profiles show that the hot path / allocation-heavy path is jsonpb.go from the gogo protobuf library

Would you be interested in a contribution from us to switch to google.golang.org/protobuf?

@mx-psi
Member

mx-psi commented Jan 16, 2025

@els0r Thanks for offering to contribute this!

I would definitely be interested in seeing how the performance compares and making the switch if we can do so :) Based on prior experience, this will take time and some back and forth, but the blog post shared by Yuri and your comments make it feel like this is a very promising option.

If it is possible to start small in some way (e.g. the simplest/smallest PoC you can think of), I think the first step should be to do that and bring it up at one of our Collector SIG virtual meetings, to get some early feedback.

@sfc-gh-bdrutu

@els0r

Profiles show that the hot path / allocation-heavy path is jsonpb.go from the gogo proto library

Why do you use JSON encoding?

@els0r

els0r commented Jan 17, 2025

Thanks for the quick responses. We'll explore the option in a fork to get a sense of the complexity / the back and forth.

We use JSON encoding when writing telemetry (mainly logs) onto a Kafka queue for external parties. The OTLP -> JSON conversion step is where we currently see the most CPU time spent.

@denysvitali

After a bit of experimentation (see https://github.com/denysvitali/opentelemetry-collector/tree/feature/use-protobuf-opaque-api), I was able to partially migrate to protobuf-go to test the performance improvement it would bring.

Before going into the details, here are some results:

Results

Note

I bet the results would be even better once golang/protobuf#1673 is implemented. The current implementation is still a bit inefficient, but already shows some nice improvements.

Upstream implementation (6757cc7)

Only addition: json_test.go (and the fix logRecord.SetSpanID([8]byte{0, 1, 2, 3, 4, 5, 6, 7}))

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/pdata/internal/json
cpu: Apple M3 Pro
BenchmarkEncoding-12    	     133	   8909393 ns/op	 5652242 B/op	  126244 allocs/op
BenchmarkEncoding-12    	     132	   9045715 ns/op	 5652374 B/op	  126244 allocs/op
BenchmarkEncoding-12    	     130	   8997149 ns/op	 5652260 B/op	  126243 allocs/op
BenchmarkEncoding-12    	     129	   9038421 ns/op	 5652305 B/op	  126244 allocs/op
BenchmarkEncoding-12    	     129	   8960452 ns/op	 5652328 B/op	  126244 allocs/op
BenchmarkEncoding-12    	     133	   8933958 ns/op	 5652268 B/op	  126244 allocs/op
BenchmarkEncoding-12    	     132	   9031978 ns/op	 5652378 B/op	  126245 allocs/op
BenchmarkEncoding-12    	     132	   8986970 ns/op	 5652293 B/op	  126243 allocs/op
BenchmarkEncoding-12    	     133	   8983367 ns/op	 5652272 B/op	  126243 allocs/op
BenchmarkEncoding-12    	     132	   8985762 ns/op	 5652313 B/op	  126244 allocs/op
PASS
ok  	go.opentelemetry.io/collector/pdata/internal/json	21.541s

Fork (bba99f7)

Note

This implementation doesn't generate the exact same JSON - but for the relevant parts (log lines) it does.
This is because the full refactoring was taking too long due to the generated code - I'll explain in detail below why this is the case.

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/pdata/internal/json
cpu: Apple M3 Pro
BenchmarkEncoding-12    	    1198	   1014819 ns/op	 1342051 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1183	   1016699 ns/op	 1342074 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1130	   1006196 ns/op	 1342030 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1204	   1015811 ns/op	 1342045 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1172	   1013799 ns/op	 1342047 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1197	   1029485 ns/op	 1342032 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1202	   1005060 ns/op	 1342055 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1231	    999719 ns/op	 1342044 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1195	    989382 ns/op	 1342058 B/op	   17065 allocs/op
BenchmarkEncoding-12    	    1206	    995486 ns/op	 1342062 B/op	   17065 allocs/op
PASS
ok  	go.opentelemetry.io/collector/pdata/internal/json	13.318s

Comparison

goos: darwin
goarch: arm64
pkg: go.opentelemetry.io/collector/pdata/internal/json
cpu: Apple M3 Pro
            │ /tmp/old.txt │            /tmp/new.txt             │
            │    sec/op    │   sec/op     vs base                │
Encoding-12    8.986m ± 1%   1.010m ± 1%  -88.76% (p=0.000 n=10)

            │ /tmp/old.txt │             /tmp/new.txt             │
            │     B/op     │     B/op      vs base                │
Encoding-12   5.390Mi ± 0%   1.280Mi ± 0%  -76.26% (p=0.000 n=10)

            │ /tmp/old.txt │            /tmp/new.txt             │
            │  allocs/op   │  allocs/op   vs base                │
Encoding-12   126.24k ± 0%   17.07k ± 0%  -86.48% (p=0.000 n=10)

This shows that the new implementation reduces CPU time by about 89% and memory usage by about 76%.
The number of allocations per operation also decreases by a good 86%.

Implementation

My implementation is far from complete, and probably very buggy. My goal was to understand how much of a performance improvement this change would bring - so I focused entirely on that and didn't care too much about external users of the pdata module or the correctness outside of the main benchmark part (log lines).

Adding protojson to the encoding part

Initially I thought it was enough to just call protojson in the json.go file to get better performance:

https://github.com/denysvitali/opentelemetry-collector/blob/bba99f7239fed1fb9379487383efb928b36507ab/pdata/internal/json/json.go#L20

Unfortunately I was very wrong: since the GoGo-generated protobuf files aren't compatible with google.golang.org/protobuf (duh!), we can't use them like this.
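For context, a minimal sketch of that kind of call, assuming messages regenerated with protoc-gen-go (illustrative only; as noted above it cannot work against the GoGo-generated types, and protojson's defaults do not match every OTLP/JSON convention):

package example

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
)

// marshalJSON renders a protoc-gen-go-generated message with protojson.
// Note: protojson's defaults (lowerCamelCase names, base64-encoded bytes) do
// not match every OTLP/JSON convention (e.g. hex-encoded trace/span IDs), so
// a drop-in replacement still needs extra handling for those fields.
func marshalJSON(msg proto.Message) ([]byte, error) {
	opts := protojson.MarshalOptions{
		// Keep unset fields out of the output.
		EmitUnpopulated: false,
	}
	return opts.Marshal(msg)
}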

Recompiling the protos with protoc-gen-go + API_OPAQUE

For this reason, I've recompiled the protos using protoc-gen-go and added the gRPC definitions via protoc-gen-go-grpc.
In this case, I had to create a Dockerfile since the upstream build-tools don't support this (yet).
Additionally, I had to use nix-unstable because the latest version of protoc-gen-go on Nix doesn't yet support the Opaque API (v1.36+ is needed).

I then built the Docker image with docker build -f Dockerfile.build -t opentelemetry-collector/build .
I then modified the way we compile the protos to use protoc-gen-go and protoc-gen-go-grpc and added the Opaque API switch: https://github.com/denysvitali/opentelemetry-collector/blob/bba99f7239fed1fb9379487383efb928b36507ab/Makefile#L235

Finally, I removed most of the original sed hacks (that part is horrible, sorry, but thankfully with the new protobuf library they would not be needed) and replaced the fixed-size SpanIDs and TraceIDs with slices, so that they match the proto type.

"Fixing" the code generation

In the Opaque API, most fields are no longer accessed directly; they should be read and written through the generated Get{{ .fieldName }} and Set{{ .fieldName }} accessors. On top of that, the way proto structs are populated changed, so the _builder struct should be used in those cases. For oneofs, the case matching changed slightly too.
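A minimal sketch of what that change looks like under the Opaque API, using a hypothetical generated package logspb with illustrative field names (not the actual regenerated OTLP code):

package example

import (
	"google.golang.org/protobuf/proto"

	// Hypothetical import path for a LogRecord message regenerated with
	// protoc-gen-go and the Opaque API enabled.
	logspb "example.com/gen/otlp/logs/v1"
)

func buildRecord() *logspb.LogRecord {
	// Fields are no longer exported struct members; they are populated
	// either through setters...
	lr := &logspb.LogRecord{}
	lr.SetSeverityText("ERROR")
	lr.SetTimeUnixNano(1700000000000000000)

	// ...or through the generated _builder struct, which takes pointers
	// for presence-tracked scalar fields.
	lr2 := logspb.LogRecord_builder{
		SeverityText: proto.String("ERROR"),
	}.Build()
	_ = lr2

	// Reads go through getters, and oneof handling switches on a
	// Which<OneofName>() method instead of a type switch on wrapper structs.
	switch lr.WhichValue() {
	case logspb.LogRecord_StringValue_case:
		_ = lr.GetStringValue()
	default:
	}
	return lr
}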

Of course pdata contains a lot of generated code, and this code needs to be fixed in order to test the encoding performance. In my branch, I've tried to do that without breaking the pdata types (too much).

After spending a lot of time refactoring, I managed to get the code to compile and fixed enough of the generated code to be able to test the JSON encoding performance.

My implementation is mostly patched together just for the sake of getting the encoder benchmark to work and giving us an idea of how big an improvement it would be.

Next Steps

I intend to work on golang/protobuf#1673 if it's trivial and no one picks it up before me, so that we can hopefully see a further increase in performance and decrease in memory consumption.

Conclusion

I will not open a PR to switch away from gogo/protobuf, as I don't think I have enough knowledge of pdata's code generation to properly fix the module - but I hope my little experiment shows that this is feasible, albeit complex on the code-generation side.

You're free to reuse my feature branch - but I would not recommend reusing my code-generation changes there, as they're pretty messy.

@mx-psi
Member

mx-psi commented Jan 30, 2025

@denysvitali Thanks for sharing the results! They are definitely promising for OTLP/HTTP JSON.

I expect most of our users use OTLP/gRPC, which uses the protobuf binary encoding, so it's very important for us to benchmark that as well, to ensure there are no regressions in that case.

@denysvitali

Yes, that would also be interesting. I can try to benchmark that too if I find the time in the coming days.

@dmathieu
Member

Can you run those two benchmarks with benchstat for an easier comparison?

@bwplotka

👋 Just some datapoints from the Prometheus ecosystem (SDKs, Prometheus, Thanos, Cortex, etc.):

We still use gogo/protobuf in general and we are (very) slowly progressing on the migration. I tried the Opaque API a bit and it indeed feels similar to @tigrannajaryan's https://github.com/splunk/exp-lazyproto.

However, I'm not sure the Opaque API plus lazy decoding addresses all the challenges we have with data-intensive proto use. The non-nullability of fields (e.g. []Elem instead of []*Elem) is still beneficial in practice (fewer allocs, less GC, less need for pooling, trivial to benchmark), and lazy decoding makes already-too-big structs even bigger (when was the last time our code needed information about unknown fields?).

I wonder if there's an opportunity to jointly maintain some simpler, efficient Go generator for both OTel and Prometheus for the simpler use cases, e.g.:

  • empty value == not specified, i.e. non-nullability
  • no unknown fields (or making this optional in the generator)
  • streaming (an alternative or different form of lazy decoding)
  • ways to extend generation to decode/encode directly to native types.

The last two features were extremely beneficial for Prometheus's tailored scraping use case.
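For illustration, a rough sketch of that streaming, decode-to-native-types style using google.golang.org/protobuf/encoding/protowire, on a hypothetical message with field 1 = name (string) and field 2 = value (double):

package example

import (
	"fmt"
	"math"

	"google.golang.org/protobuf/encoding/protowire"
)

// walkSample scans the wire format directly and hands native values to a
// callback, without materializing any generated message structs (and without
// retaining unknown fields). Field numbers and types are hypothetical.
func walkSample(b []byte, emit func(name string, value float64)) error {
	var name string
	var value float64
	for len(b) > 0 {
		num, typ, n := protowire.ConsumeTag(b)
		if n < 0 {
			return protowire.ParseError(n)
		}
		b = b[n:]
		switch {
		case num == 1 && typ == protowire.BytesType: // name
			v, n := protowire.ConsumeBytes(b)
			if n < 0 {
				return protowire.ParseError(n)
			}
			name, b = string(v), b[n:]
		case num == 2 && typ == protowire.Fixed64Type: // value
			v, n := protowire.ConsumeFixed64(b)
			if n < 0 {
				return protowire.ParseError(n)
			}
			value, b = math.Float64frombits(v), b[n:]
		default: // skip anything else instead of storing it
			n := protowire.ConsumeFieldValue(num, typ, b)
			if n < 0 {
				return protowire.ParseError(n)
			}
			b = b[n:]
		}
	}
	if name == "" {
		return fmt.Errorf("missing name field")
	}
	emit(name, value)
	return nil
}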

Some recent discussion: https://groups.google.com/g/prometheus-developers/c/uFWRyqZaQis/m/IvnEjVkHAQAJ

cc @bboreham

@tigrannajaryan
Member

I wonder if there's an opportunity to jointly maintain some simpler, efficient Go generator for both OTel and Prometheus for the simpler use cases

@bwplotka I offered to work on this a while ago (when working on lazy-proto) but there were no takers. Unfortunately I no longer have time to work on this at the moment, but as a data point: after doing lazy-proto, writing your own Protobuf generator didn't feel like an insurmountable amount of work.

When working on lazy-proto I also thought about doing a streaming Protobuf deserializer, and eventually decided to spend my time on a better streaming format instead (STEF handily beats OTLP Protobuf on both the size and speed dimensions; when it matures, I would love to discuss it with Prometheus folks).
