Commit d027b3c

Move Agent document to README. (#29)
1 parent 327f040 commit d027b3c

3 files changed

+114
-0
lines changed

README.md

Lines changed: 114 additions & 0 deletions
@@ -73,3 +73,117 @@ $ opencensusd
You should be able to see the traces in Stackdriver and Zipkin.
If you stop opencensusd, the example application will stop exporting.
If you run it again, it will start exporting again.

## OpenCensus Agent

### Architecture Overview
On a typical VM/container, user applications run in some processes/pods with the
OpenCensus Library (Library). Previously, Library did all the recording, collecting, sampling and
aggregation of spans/stats/metrics, and exported them to persistent storage backends via the
Library exporters, or displayed them on local zPages. This pattern has several drawbacks, for
example:

1. For each OpenCensus Library, exporters/zPages need to be re-implemented in the native language.
2. In some programming languages (e.g. Ruby, PHP), it is difficult to do the stats aggregation in
process.
3. To enable exporting OpenCensus spans/stats/metrics, application users need to manually add
library exporters and redeploy their binaries. This is especially difficult when there’s already
an incident and users want to use OpenCensus to investigate what’s going on right away.
4. Application users need to take responsibility for configuring and initializing exporters.
This is error-prone (e.g. they may not set up the correct credentials/monitored resources), and
users may be reluctant to “pollute” their code with OpenCensus.

To resolve the issues above, we are introducing OpenCensus Agent (Agent). Agent runs as a daemon
in the VM/container and can be deployed independently of Library. Once Agent is deployed and
running, it should be able to retrieve spans/stats/metrics from Library and export them to other
backends. We MAY also give Agent the ability to push configurations (e.g. sampling probability) to
Library. Languages that cannot do stats aggregation in process should also be
able to send raw measurements and have Agent do the aggregation.

For developers/maintainers of other libraries: Agent can also be extended to accept spans/stats/metrics from
other tracing/monitoring libraries, such as Zipkin, Prometheus, etc. This is done by adding specific
interceptors. See [Interceptors](#interceptors) for details.

![agent-architecture](image/agent-architecture.png)

To support Agent, Library should have “agent exporters”, similar to the existing exporters to
other backends. There should be 3 separate agent exporters for tracing/stats/metrics
respectively. Agent exporters will be responsible for sending spans/stats/metrics and (possibly)
receiving configuration updates from Agent.
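
To make the two duties of an agent exporter concrete, here is a minimal Go sketch of what a tracing agent exporter might look like. All names (`Span`, `Config`, `TraceAgentExporter`, the `send` callback) are hypothetical and not part of any OpenCensus API. The initial sampling probability of 1 is an arbitrary placeholder, and `send` stands in for writing to the stream that connects Library to Agent.

```go
package main

import (
	"fmt"
	"sync"
)

// Span is a hypothetical, minimal stand-in for a recorded span.
type Span struct {
	Name string
}

// Config is a hypothetical configuration update pushed by Agent.
type Config struct {
	SamplingProbability float64
}

// TraceAgentExporter is a hypothetical agent exporter for tracing.
// send stands in for writing to the stream that connects Library to Agent.
type TraceAgentExporter struct {
	mu   sync.Mutex
	cfg  Config
	send func(Span)
}

// NewTraceAgentExporter starts with a placeholder probability of 1 (an assumption).
func NewTraceAgentExporter(send func(Span)) *TraceAgentExporter {
	return &TraceAgentExporter{cfg: Config{SamplingProbability: 1}, send: send}
}

// ExportSpan forwards one span towards Agent.
func (e *TraceAgentExporter) ExportSpan(s Span) {
	e.send(s)
}

// ApplyConfig stores a configuration update received from Agent.
func (e *TraceAgentExporter) ApplyConfig(c Config) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.cfg = c
}

// SamplingProbability returns the most recently applied probability,
// so that a sampler in Library could consult it.
func (e *TraceAgentExporter) SamplingProbability() float64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.cfg.SamplingProbability
}

func main() {
	var sent []Span
	exp := NewTraceAgentExporter(func(s Span) { sent = append(sent, s) })

	exp.ExportSpan(Span{Name: "GET /users"})
	exp.ApplyConfig(Config{SamplingProbability: 0.25})

	fmt.Println(len(sent), exp.SamplingProbability())
}
```

The stats and metrics exporters would presumably follow the same shape with their own payload types.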

### Communication

Communication between Library and Agent should use a bi-directional gRPC stream. Library should
initiate the connection, since there’s only one dedicated port for Agent, while there could be
multiple processes with Library running.
By default, Agent is available on port 55678.
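
Because Library is the side that dials, it needs to know where Agent listens. The following Go sketch shows one way a Library could resolve that address. The helper name `AgentAddress` and the fallback host `localhost` are assumptions made for illustration; only the port number 55678 comes from this document.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// DefaultAgentPort is the default Agent port stated in this document.
const DefaultAgentPort = 55678

// AgentAddress is a hypothetical helper that builds the address Library dials.
// An empty host falls back to "localhost" (an assumption of this sketch), and a
// non-positive port falls back to DefaultAgentPort.
func AgentAddress(host string, port int) string {
	if host == "" {
		host = "localhost"
	}
	if port <= 0 {
		port = DefaultAgentPort
	}
	// JoinHostPort adds brackets around IPv6 literals when needed.
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(AgentAddress("", 0))
	fmt.Println(AgentAddress("::1", 9000))
}
```

Every Library process on the VM/container would dial the same address, which is why the connection is initiated from the Library side.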

### Protocol Workflow

1. Library will try to directly establish connections for Config and Export streams.
2. As the first message in each stream, Library must send its identifier. Each identifier should
uniquely identify Library within the VM/container. The identifier is no longer needed once the streams
are established.
3. If streams are disconnected and retries fail, the Library identifier is considered
expired on the Agent side. Library needs to start a new connection with a unique identifier
(which MAY be different from the previous one).
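
The workflow above can be sketched from the Library side as follows. This is an illustration in Go, not the real protocol: `Message`, `Stream` and `IdentifierSource` are hypothetical types, and deriving the identifier from a process ID plus a counter is only one possible way to keep it unique within the VM/container. The sketch is not safe for concurrent use.

```go
package main

import "fmt"

// Message is a hypothetical stream message. Identifier is set only on the
// first message of a stream.
type Message struct {
	Identifier string
	Payload    string
}

// IdentifierSource hands out identifiers that are unique within one process.
// Combining the process ID with a counter is an assumption of this sketch.
type IdentifierSource struct {
	pid  int
	next int
}

// Next returns a fresh identifier for a new connection.
func (s *IdentifierSource) Next() string {
	s.next++
	return fmt.Sprintf("pid-%d-conn-%d", s.pid, s.next)
}

// Stream models one Config or Export stream from the Library side.
type Stream struct {
	identifier string
	started    bool
}

// NewStream creates a stream that has not sent anything yet.
func NewStream(identifier string) *Stream {
	return &Stream{identifier: identifier}
}

// Wrap attaches the identifier to the first message only.
func (s *Stream) Wrap(payload string) Message {
	m := Message{Payload: payload}
	if !s.started {
		m.Identifier = s.identifier
		s.started = true
	}
	return m
}

func main() {
	ids := &IdentifierSource{pid: 4242}

	export := NewStream(ids.Next())
	fmt.Printf("%+v\n", export.Wrap("spans-1"))
	fmt.Printf("%+v\n", export.Wrap("spans-2"))

	// After a disconnect whose retries failed, start over with a new identifier.
	export = NewStream(ids.Next())
	fmt.Printf("%+v\n", export.Wrap("spans-3"))
}
```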

### Implementation details of Agent Server

This section describes the in-process implementation details of OC-Agent.

![agent-implementation](image/agent-implementation.png)

Note: Red arrows represent RPCs or HTTP requests. Black arrows represent local method
invocations.

The Agent consists of three main parts:

1. The interceptors of different instrumentation libraries, such as OpenCensus, Zipkin,
Istio Mixer, Prometheus client, etc. Interceptors act as the “frontend” or “gateway” of
Agent. In addition, there MAY be one special receiver for receiving configuration updates
from outside.
2. The core Agent module. It acts as the “brain” or “dispatcher” of Agent.
3. The exporters to different monitoring backends or collector services, such as
Omnition Collector, Stackdriver Trace, Jaeger, Zipkin, etc.

#### Interceptors

Each interceptor can be connected with multiple instrumentation libraries. The
communication protocol between interceptors and libraries is the one described in the
proto files (for example trace_service.proto). When a library opens the connection with the
corresponding interceptor, the first message it sends must have the `Node` identifier. The
interceptor will then cache the `Node` for each library, and `Node` is not required for
the subsequent messages from libraries.
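
The `Node` caching rule can be sketched on the interceptor side like this. The Go types below (`Node`, `ExportRequest`, `StreamState`) are simplified, hypothetical stand-ins and not the generated proto types. Letting a later message that carries a `Node` replace the cached one is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified stand-in for the library identity sent on a stream.
type Node struct {
	Identifier string
}

// ExportRequest is a hypothetical message; Node is nil on messages that omit it.
type ExportRequest struct {
	Node  *Node
	Spans []string
}

// ErrMissingNode is returned when a stream has not yet identified itself.
var ErrMissingNode = errors.New("first message on a stream must carry a Node")

// StreamState holds what an interceptor remembers for one open stream.
type StreamState struct {
	node *Node
}

// Receive associates the spans in req with the cached Node for this stream.
// A Node on the message updates the cache (an assumption of this sketch).
func (s *StreamState) Receive(req ExportRequest) (Node, []string, error) {
	if req.Node != nil {
		s.node = req.Node
	}
	if s.node == nil {
		return Node{}, nil, ErrMissingNode
	}
	return *s.node, req.Spans, nil
}

func main() {
	var st StreamState

	// A stream that sends spans before identifying itself is rejected.
	_, _, err := st.Receive(ExportRequest{Spans: []string{"orphan"}})
	fmt.Println(err)

	node, spans, _ := st.Receive(ExportRequest{
		Node:  &Node{Identifier: "lib-1"},
		Spans: []string{"a"},
	})
	fmt.Println(node.Identifier, spans)

	// Later messages omit the Node and are attributed to the cached one.
	node, spans, _ = st.Receive(ExportRequest{Spans: []string{"b", "c"}})
	fmt.Println(node.Identifier, spans)
}
```

An interceptor would keep one such state per open stream, so every span handed to Agent Core arrives with a `Node` attached.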

#### Agent Core

Most of Agent's functionality is in Agent Core. Agent Core's responsibilities include:

1. Accept `SpanProto` from each interceptor. Note that the `SpanProto`s that are sent to
Agent Core must have `Node` associated, so that Agent Core can differentiate and group
`SpanProto`s by each `Node`.
2. Store and batch `SpanProto`s.
3. Augment the `SpanProto` or `Node` sent from the interceptor.
For example, in a Kubernetes container, Agent Core can detect the namespace, pod id
and container name and then add them to its record of the `Node` from the interceptor.
4. At a configured interval, Agent Core will push `SpanProto`s (grouped by
`Node`s) to Exporters.
5. Display the currently stored `SpanProto`s on local zPages.
6. MAY accept the updated configuration from Config Receiver, and apply it to all the
config service clients.
7. MAY track the status of all the connections of Config streams. Depending on the
language and implementation of the Config service protocol, Agent Core MAY either
store a list of active Config streams (e.g. gRPC-Java), or a list of last active times for
streams that cannot be kept alive all the time (e.g. gRPC-Python).
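
The accept, batch and periodic push responsibilities can be illustrated with a small Go sketch. It is not the Agent implementation: `Core`, `Exporter` and `Span` are hypothetical names, nodes are reduced to a string key, and augmentation, zPages and config handling are left out. Flushing in sorted node order and flushing once more on shutdown are choices made here for predictability.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Span is a minimal stand-in for SpanProto.
type Span struct {
	Name string
}

// Exporter receives one batch of spans for one node. Exporters must not
// modify the slice, because all exporters share it.
type Exporter func(node string, spans []Span)

// Core is a hypothetical Agent Core that batches spans per node.
type Core struct {
	mu        sync.Mutex
	pending   map[string][]Span
	exporters []Exporter
}

// NewCore creates a Core that pushes to the given exporters.
func NewCore(exporters ...Exporter) *Core {
	return &Core{pending: make(map[string][]Span), exporters: exporters}
}

// Accept stores spans under the node they came from.
func (c *Core) Accept(node string, spans ...Span) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[node] = append(c.pending[node], spans...)
}

// Flush pushes every pending batch to every exporter and clears the store.
func (c *Core) Flush() {
	c.mu.Lock()
	batches := c.pending
	c.pending = make(map[string][]Span)
	c.mu.Unlock()

	nodes := make([]string, 0, len(batches))
	for n := range batches {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes) // deterministic order, independent of map iteration
	for _, n := range nodes {
		for _, export := range c.exporters {
			export(n, batches[n])
		}
	}
}

// Run flushes at the configured interval until stop is closed.
func (c *Core) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.Flush()
		case <-stop:
			c.Flush() // push whatever is left before exiting
			return
		}
	}
}

func main() {
	core := NewCore(func(node string, spans []Span) {
		fmt.Printf("%s: %d span(s)\n", node, len(spans))
	})
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		core.Run(10*time.Millisecond, stop)
		close(done)
	}()

	core.Accept("node-b", Span{Name: "query"})
	core.Accept("node-a", Span{Name: "GET /"}, Span{Name: "render"})

	close(stop)
	<-done
}
```

Swapping the pending map under the lock and exporting outside it keeps slow exporters from blocking interceptors that call `Accept`.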

#### Exporters

Periodically, Agent Core will push `SpanProto` with `Node` to each exporter. After
receiving them, each exporter will translate `SpanProto` to the format supported by the
backend (e.g. Jaeger Thrift Span), and then push them to the corresponding backend or service.
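
As an illustration of the translate-then-push step, the Go sketch below converts a simplified span into a made-up JSON record. The `backendSpan` format is invented for this example and is not the Jaeger Thrift or Zipkin format. `Node`, `SpanProto`, `Translate`, `Export` and the `push` callback are likewise hypothetical stand-ins.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node and SpanProto are simplified stand-ins for the real proto messages.
type Node struct {
	ServiceName string
}

type SpanProto struct {
	TraceID string
	SpanID  string
	Name    string
}

// backendSpan is a made-up wire format for an imaginary backend.
type backendSpan struct {
	TraceID string `json:"traceId"`
	ID      string `json:"id"`
	Name    string `json:"name"`
	Service string `json:"service"`
}

// Translate converts one span, plus its Node, into the backend's format.
func Translate(n Node, s SpanProto) ([]byte, error) {
	return json.Marshal(backendSpan{
		TraceID: s.TraceID,
		ID:      s.SpanID,
		Name:    s.Name,
		Service: n.ServiceName,
	})
}

// Export translates each span and hands it to push, which stands in for the
// backend client. It stops at the first error.
func Export(n Node, spans []SpanProto, push func([]byte) error) error {
	for _, s := range spans {
		b, err := Translate(n, s)
		if err != nil {
			return err
		}
		if err := push(b); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	node := Node{ServiceName: "checkout"}
	spans := []SpanProto{{TraceID: "t1", SpanID: "s1", Name: "charge"}}

	err := Export(node, spans, func(b []byte) error {
		fmt.Println(string(b))
		return nil
	})
	if err != nil {
		fmt.Println("export failed:", err)
	}
}
```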

## OpenCensus Collector

TODO: add content about Collector.

image/agent-architecture.png

149 KB

image/agent-implementation.png

317 KB
