Backend Detection #38
I don't think this issue is in scope for this project. The project is about a common propagation format, not about instrumentation. If you're interested in standards for instrumentation, that's what OpenTracing is meant to address, and it already has data conventions for describing an outbound call with specific tags.
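For reference, a minimal sketch of those conventions using the OpenTracing Java API. The tag keys are the published OpenTracing semantic-convention names; the class, database, and query are hypothetical placeholders.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;

class OrderDao {
    private final Tracer tracer; // any OpenTracing-compatible tracer

    OrderDao(Tracer tracer) { this.tracer = tracer; }

    void findOrder(long id) {
        // Standard semantic-convention tags describing an outbound DB call.
        Span span = tracer.buildSpan("query")
            .withTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT) // outbound call
            .withTag(Tags.COMPONENT.getKey(), "java-jdbc")           // instrumented library
            .withTag(Tags.DB_TYPE.getKey(), "sql")                   // kind of backend
            .withTag(Tags.DB_INSTANCE.getKey(), "orders")            // database name
            .withTag(Tags.DB_STATEMENT.getKey(), "SELECT * FROM orders WHERE id = ?")
            .start();
        try {
            // ... execute the JDBC call here ...
        } finally {
            span.finish();
        }
    }
}
```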
Well, the idea would be to have common communication back to the tools (Zipkin, Lightstep, AppDynamics, Dynatrace, New Relic, Brave, etc.). OpenTracing doesn't do this at all. Even if we cannot do so from the instrumentation layer, we should see if there is a way to ensure topology visibility is consistent and works across implementations.
OpenTracing doesn't do it, but it enables it. If we all agree on a common out-of-band format for tracing data (I'm all for that - opentracing/specification#64), I can go to the Jaeger libraries and implement that output format, yet I don't have to go and rewrite thousands of places in the code that are doing instrumentation. It's like saying "I don't want to use slf4j because it prints my logs in JSON but I'm really into XML these days" :-) slf4j is not responsible for how logs are printed; just swap the implementation. The core of the issue you opened, I think, is how instrumentation should tell the tracing system about calls to uninstrumented backends. I argue that OpenTracing gives you that for, say, databases, and we're completely open to supporting more standard data semantics. But the format of out-of-band data is not a concern of the instrumentation, only of the implementation behind the OpenTracing API.
Yes, I understand your point. The way I see it, the instrumentation and the wire protocol should be one and the same, meaning the way instrumentation communicates back to the tool should follow some kind of standard wire protocol as well. Having two different projects (with two different teams and different conflicts) is going to cause issues. When I see APM vendors implementing OT I scratch my head, considering it does absolutely nothing for the user in terms of making data reuse possible or providing other advantages between tools and implementations. When there is a backend we should communicate that back to a tool, and thus it should be somewhere in here rather than within the instrumentation. Taking a step back, look at logs: I can send logs to many different tools (which is quite common in the enterprise). I want to see APM data follow the same path. Let's reuse this data across many tools, use cases, vendors, etc. The way things work today, that's never going to happen. Having a wire protocol is great for pcap, but it's going to lose a lot of the context that we can get from the software/instrumentation layer.
I see it exactly the opposite. Instrumentation's main goal is to collect semantic and perf data from the application. It has no opinion on what to do with that data - the actual implementation of the instrumentation API may even be a no-op. If I am an application developer and I want to log something, I don't give a rat's ass about how that log data will be extracted from my application, in what format, by what means; I just need to make my dead simple logging call. The reason APMs implement OT (case in point - New Relic) is because monkey-patching sucks (support-wise), doesn't even work in Go, and the application developers always know best the semantics of their application. Letting developers program towards an API decouples the instrumentation from data collection. We at Uber probably have millions of lines of Go code; if some vendor comes and says "just instrument all that code with APIs from our SDK", that's a non-starter. This TraceContext repo was started like 6 months ago, and it still doesn't have an agreed-upon spec even for the in-band wire format, not to mention out-of-band. Once the OpenTracing API was published, it allowed frameworks and applications to actually proceed with instrumentation (the most expensive part of tracing) without worrying about the wire formats. [end of rant]
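The slf4j analogy above, as a minimal sketch (the class is a hypothetical placeholder): the application compiles only against the facade, and the output format is decided by whichever binding is on the classpath at deploy time, not by this code.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class CheckoutService {
    // The application codes against the facade only; whether logs come out
    // as plain text, JSON, or XML is decided by the binding on the classpath
    // (logback, log4j, ...), never by the instrumented code itself.
    private static final Logger log = LoggerFactory.getLogger(CheckoutService.class);

    void checkout(String orderId) {
        log.info("checking out order {}", orderId);
    }
}
```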
That's kind of sad. OT doesn't replace a full-blown agent which handles instrumentation without developers. It also doesn't do any diagnostics beyond trace and timing. Monkey patching will continue, since APM vendors aren't going to replace full agents with OT. Java and .NET have better APIs, and they command a substantial portion of the code out there. I know this isn't the case at Uber (or other companies that are less than a decade old), but if you look at the enterprise it's another ball game. They are more advanced in some ways, but also have a substantial debt to handle, including a wide variance of technology stacks. When you look at developer focus I agree with your perspective, but most code being run out there is vendor code. Advanced APM tools can extract business data from running code, avoiding the need to change the code in the first place. The second point is that auto-instrumentation is the goal, enabling anyone to get the advantages without lock-in. I don't think we want to require developers to write instrumentation. Looking even longer term, wouldn't it be great if we had a standard way to share data between tools? We have so many customers, spanning service providers and their customers, who would love to have something like this. Developers don't think about the overhead they introduce with logging and manual instrumentation, which is why adding intelligence into auto-instrumentation is another good idea IMHO.
I do not disagree with that. However, if we're talking about java-agent-style instrumentation, then we're already in the proprietary space. I have not seen any vendor trying to open-source their agents (sky-walking and stagemonitor come to mind as OSS alternatives).
Yes, and that's why I think tracing hasn't progressed much since the Dapper paper, nor really become mainstream. I am not optimistic about the monkey-patching/agent approach, because for any language you pick there are tons of different RPC frameworks, database drivers, etc. No vendor can keep up with that, and they are not sharing the implementations. But an open instrumentation standard allows each of those frameworks to be instrumented exactly once. When you say not requiring developers to write instrumentation, I agree when it comes to application developers, but not so much for framework developers - if I write some fancy new async framework, who's better placed than me to know how to instrument it? A generic agent-based instrumentation can only work at the lower level, thus missing the semantics. And btw, it's not one vs. the other; they can co-exist. Not sure if we're progressing on this specific ticket. I think my point was - discussing semantic aspects of instrumentation (like "which backend") is best done in the instrumentation API, while defining the data reporting format is of secondary concern.
Actually, I do disagree with this: "It [OT] also doesn't do any diagnostics beyond trace and timing." There are OT implementations that simply collect metrics, and there are extensions that monitor HTTP connection events. These are all examples of instrumentation concerns; I would love to discuss specifics.
@yurishkuro that's my whole issue with calling it a "standard". There is a lot missing from something that would be considered well-defined. The base functionality of a tracing capability should be part of the definition of said standard. If we want to add diagnostics or additional capture, we can simply extend the standard for specific implementations and use cases. My thinking here would be to have this base capability work across implementations, so that we all have the ability to capture and measure end-to-end visibility regardless of how tracing is implemented.
I'd be interested in fleshing out what you mean by backend detection, ideally separate from the existential value discussion about OpenTracing.
Sure. Well, oftentimes you want to segment or visualize what is being called at the lowest level of a traced transaction - for example, a DB versus a queue versus an API. I'm happy to discuss and explain this further.
@jkowall I want to know more about what you mean by a "backend".
I guess I tried to explain it, but it wasn't clear. Backend - where a transaction terminates. It could be a database, an API call (without instrumentation on the other side), or some other non-instrumented end of a transaction. If your instrumentor sees a call to a database or a message queue, you might want to visualize or display them differently in the monitoring tool. Hence there should be room in the protocol to convey the backend, or whether a backend was detected.
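Today the closest analogue in instrumentation APIs is the `peer.*` tag family. A minimal sketch with the OpenTracing Java tags - the class, queue name, and host are hypothetical placeholders:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;

class QueuePublisher {
    private final Tracer tracer; // any OpenTracing-compatible tracer

    QueuePublisher(Tracer tracer) { this.tracer = tracer; }

    void publish(byte[] message) {
        // Mark the span as terminating at an uninstrumented backend, so a
        // monitoring tool can render it differently from an instrumented hop.
        Span span = tracer.buildSpan("publish")
            .withTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT)
            .withTag(Tags.PEER_SERVICE.getKey(), "orders-queue")     // the detected backend
            .withTag(Tags.PEER_HOSTNAME.getKey(), "mq1.example.com") // placeholder host
            .withTag(Tags.PEER_PORT.getKey(), 5672)
            .start();
        try {
            // ... hand the message to the broker client here ...
        } finally {
            span.finish();
        }
    }
}
```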
@jkowall how is this related to the protocol? The instrumentor will report information about this call to the telemetry store, and it will be displayed from there. Do you envision information about the backend being propagated back to the caller of the instrumentor with the response headers? Or is there something else I'm missing?
I'm looking at this standard as a way to interoperate with multiple tools, so that on the wire, or at an ingest point, I can see information about the transaction path. If we have the data needed to reconstruct the path and its details, we can leverage this across tooling without requiring instrumentors to write to specific tools. If I implement an instrumentor, then I know how it will communicate and propagate.
@jkowall can you please give a more specific example of what information you want to propagate over the wire? What does "transaction path" mean to you? As for the ingest specs - I don't think this repo is the forum for it, though I'd like to see some convergence there as well. With this issue - do you propose to start a telemetry ingest schema discussion?
Sure, that makes sense. Another use for the backend identifier would be to correlate across languages, e.g. Java -> Oracledb1 (JDBC) versus the same database reported under a different name by another language's driver. These may be the same backend; we could handle naming and correlation within the instrumentor to make them "go to the same place", allowing for answers to questions about backend usage. There are other use cases for this as well. Once again, these are advanced use cases, but things we do. I would like to have standards for tools and propagation so we can work together, but also keep in mind the wire-data ingestion possibilities. Those tools will not have access to anything which isn't passed over the network.
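A sketch of that naming/correlation idea, with entirely hypothetical names and rules: each instrumentor would canonicalize whatever identifier its driver reports, so that different drivers of the same database map to one backend id.

```java
import java.util.Locale;

final class BackendNames {
    private BackendNames() {}

    // Hypothetical canonicalization: lower-case the name, strip driver
    // suffixes, and drop the default port, so "Oracledb1 (JDBC)" reported
    // from Java and "ORACLEDB1:1521" from another language's driver both
    // map to the single backend id "oracledb1".
    static String canonical(String reportedName) {
        String name = reportedName.toLowerCase(Locale.ROOT);
        name = name.replaceAll("\\s*\\((jdbc|odbc|ole ?db)\\)$", ""); // driver suffix
        name = name.replaceAll(":1521$", "");                        // default Oracle port
        return name.trim();
    }
}

// BackendNames.canonical("Oracledb1 (JDBC)") -> "oracledb1"
// BackendNames.canonical("ORACLEDB1:1521")   -> "oracledb1"
```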
Handled by #35 |
I do believe we need some kind of construct to identify a "backend". This would mean having a way to pass back to the monitoring system that the transaction has reached its deepest part, and what that part was. For example, in Java, if the system calls a MongoDB backend, we should show this and pull other stats on the backend. We could go further and collect other data about the call to that backend. Example backends could be a message queue (without instrumentation on the other end), a database, a data platform, a streaming platform (Kafka and others), HTTP (a call to an external API), RMI, protobuf, or another transaction processing system (Tuxedo, mainframe, etc.). The instrumentation library would have to auto-recognize these, or allow the developer to define or otherwise mark them. This would be communicated back to the monitoring system or tool.
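One way to picture the proposed construct, purely as a hypothetical sketch (none of these names come from any spec): a closed set of backend kinds that an instrumentation library either auto-detects or lets the developer declare.

```java
// Hypothetical sketch of a "backend" construct; the names only
// illustrate the proposal above.
enum BackendKind {
    MESSAGE_QUEUE, DATABASE, DATA_PLATFORM, STREAMING_PLATFORM,
    HTTP_API, TRANSACTION_PROCESSOR, RMI, OTHER
}

final class Backend {
    final BackendKind kind; // auto-detected (e.g. MongoDB driver seen) or developer-defined
    final String name;      // canonical identifier, e.g. "orders-db"

    Backend(BackendKind kind, String name) {
        this.kind = kind;
        this.name = name;
    }
}

// e.g. an auto-instrumentor that recognizes the MongoDB driver might report:
//   new Backend(BackendKind.DATABASE, "mongodb://orders")
```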