Commit 3cf68cf: rfcs: new RFC on per-session resource metrics

- Feature Name: Per-session CPU usage and other metrics
- Status: draft
- Start Date: 2021-02-08
- Authors: knz
- RFC PR: [#60589](https://github.com/cockroachdb/cockroach/pull/60589)
- Cockroach Issue: [#59998](https://github.com/cockroachdb/cockroach/issues/59998)

# Summary

This RFC proposes to use custom extensions to the Go runtime to track
CPU usage, as well as other metrics, per SQL session -- including all
goroutines that do work on behalf of a SQL session, and without
manually instrumenting the code inside CockroachDB.

In a later stage, this work will also enable constraining
CPU usage and other resources according to configurable quotas.

# Motivation

Today, when CPU usage is high, it is very difficult to track down
which SQL session or query is responsible for that load.

This difficulty stems from the Go runtime, which does not provide an
abstraction to extract metrics per group of goroutines performing
related work on behalf of a common client-defined query or session.

# Technical design

In a nutshell, the proposed change adds new APIs to Go's `runtime` package to
create a new “task group” abstraction, which captures counters and other
control mechanisms shared by a collection of goroutines and is inherited
automatically by all new goroutines spawned, directly or indirectly.

Once this API is in place inside Go, we can extend CockroachDB as follows:

- whenever a SQL session is created, a new task group is created for its
  goroutine and its children goroutines;
- distsql execution on other nodes also sets up a task group whenever
  one or more flows are set up on behalf of a query started on a different node,
  to connect all the flows for that query “under” a common, separate task group;
- as the distsql processors advance or complete work, the current metrics
  of the task group are pushed back to the gateway to update common counters
  for the session;
- on the gateway, the vtable `crdb_internal.node_sessions` and other related
  facilities report, for each session/query, the additional metrics from the task group;
- SQL job execution also creates task groups, one per job, in a way
  that can be inspected separately (e.g. via `crdb_internal.jobs`? TBD).
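
To make the metric flow above concrete, here is a minimal sketch in plain Go (no runtime changes) of remote flows pushing tick deltas into a per-session aggregate on the gateway. All names here (`SessionMetrics`, `PushDelta`) are illustrative assumptions, not actual CockroachDB APIs.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// SessionMetrics is a hypothetical per-session aggregate kept on the
// gateway node. Remote distsql flows push their task-group deltas into
// it; none of these names are actual CockroachDB APIs.
type SessionMetrics struct {
	schedTicks atomic.Int64
}

// PushDelta models a flow on another node reporting its progress back
// to the gateway.
func (m *SessionMetrics) PushDelta(ticks int64) {
	m.schedTicks.Add(ticks)
}

// SchedTicks returns the aggregated tick count for the session.
func (m *SessionMetrics) SchedTicks() int64 {
	return m.schedTicks.Load()
}

func main() {
	var m SessionMetrics
	var wg sync.WaitGroup
	// Three "flows" running on other nodes each report their ticks.
	for _, delta := range []int64{100, 250, 50} {
		wg.Add(1)
		go func(d int64) {
			defer wg.Done()
			m.PushDelta(d)
		}(delta)
	}
	wg.Wait()
	fmt.Println(m.SchedTicks()) // 400: sum of all flow deltas
}
```

The atomic counter keeps the gateway-side aggregation lock-free even when many flows report concurrently.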

A preliminary version of the proposed Go runtime changes is available
in this branch:

https://github.com/cockroachdb/go/commits/crdb-fixes

And an example application inside CockroachDB, to measure CPU scheduling ticks
separately for each SQL session, is available here:

https://github.com/cockroachdb/cockroach/pull/60588

## Go runtime extension

### Task group abstraction

The center of the proposed approach is the definition of a new “task
group” abstraction inside the Go runtime / scheduler, in addition to
the existing G / M / P abstractions.

(The reader of this RFC is expected to have some familiarity with the
G / M / P abstractions already. Refer to the Go team's own
documentation and the top comment inside the `runtime/proc.go` file in
the Go distribution.)

A task group is a collection of zero or more goroutines.

Each goroutine is associated with exactly one task group.

There is a “default” (“global”? “common”?) task group defined
initially for all goroutines spawned from the main/initial goroutine.

Each new goroutine (started with the `go` keyword) inherits the task
group of its parent.

A `runtime.SetInternalTaskGroup()` API enables Go applications to override the
task group of the current goroutine. After it has been overridden,
all goroutines spawned from that point on inherit the new task
group. (But all previously spawned goroutines retain their previously
defined task group.)

The Go runtime scheduler is aware of the task groups and is able to
update / maintain runtime counters and other metrics at the level
of the task group, not just the goroutine.

### Internal vs external state for task groups

A task group is represented at runtime by two different pieces of
state inside every goroutine:

- the internal task group struct has a definition known only to the
  `runtime` package, and cannot be read / changed by Go applications.

  This is defined as the (opaque) type `runtime.InternalTaskGroup`.
  It is also the state set via the main new API `SetInternalTaskGroup()`
  outlined above.

  This is the struct used inside the Go scheduler and other runtime
  subsystems (e.g. the heap GC in the future).

  There are public accessors in the `runtime` package to retrieve
  the values stored inside the internal task group struct.

  For example, `GetInternalTaskGroupSchedTicks(TaskGroup) uint64` retrieves
  the number of scheduling ticks incurred by the given task group.

- for applications that wish to attach additional state to a task group,
  state that can be left opaque to the Go runtime, another API is provided:

  ```go
  // GetLogicalTaskGroupID retrieves the current goroutine's task group
  // ID. This is inherited from the goroutine's parent. Top-level
  // goroutines are assigned their own ID as group ID.
  func GetLogicalTaskGroupID() int64

  // SetLogicalTaskGroupID sets the current goroutine's task group ID.
  // This value is inherited by children goroutines.
  func SetLogicalTaskGroupID(groupid int64)
  ```

These IDs can then be used in application code to serve as keys
in a Go map or other similar structure to attach additional
state to a task group.
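
For illustration, here is a minimal sketch of such an ID-keyed side table, assuming the proposed API. Since `GetLogicalTaskGroupID()` does not exist in stock Go, a hard-coded ID stands in for it; `groupState` and both helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// groupState is hypothetical application-level state keyed by the
// logical task group ID that GetLogicalTaskGroupID() would return.
type groupState struct {
	label           string
	applicationName string
}

var (
	mu     sync.RWMutex
	groups = map[int64]*groupState{}
)

// setGroupState attaches application state to a task group ID.
func setGroupState(id int64, s *groupState) {
	mu.Lock()
	defer mu.Unlock()
	groups[id] = s
}

// getGroupState looks up the state attached to a task group ID.
func getGroupState(id int64) *groupState {
	mu.RLock()
	defer mu.RUnlock()
	return groups[id]
}

func main() {
	const id int64 = 42 // would come from runtime.GetLogicalTaskGroupID()
	setGroupState(id, &groupState{label: "session-1", applicationName: "psql"})
	fmt.Println(getGroupState(id).label) // session-1
}
```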

## Tracked metrics and state

- The internal task group struct will be exclusively responsible for tracking
  metrics that cannot be tracked by application code (or not effectively).

  Examples:

  - CPU usage
  - current total RAM usage
  - number of discrete heap allocations

- The logical task group ID can be used to attach identification
  labels or descriptions, or metrics / counters that are updated at
  application level. Example valuable attachments:

  - task group label as a string
  - current SQL statement AST (or maybe some other aspects of the SQL session)
  - owner / user
  - `application_name`

All these additional fields are going to make it easier to understand
who / what is responsible for resource usage.

The reason why we prefer an ID-to-state mapping at the application
level is that we wish to avoid having to update the Go runtime every
time we want to attach new state to task groups. The internal task
group struct should remain restricted to those metrics / variables
that must absolutely be known inside the `runtime` package, in
particular because they are accessed/updated inside the Go scheduler
and other sub-systems.

## Implementation of the task group abstraction

Implementation-wise, we extend the `g` struct inside the Go runtime
with two fields: `taskGroupCtx` (for the internal task group)
and `taskGroupId` (for the logical task group ID).

Both are inherited when creating a new goroutine. When a goroutine is
created without a parent, they are initialized as follows:

- `taskGroupCtx` is set to `defaultTaskGroupCtx`, the default/common
  task group.
- `taskGroupId` is set to the goroutine's ID (this default may change, unsure).

These defaults are not too important/interesting since we will ensure
that the task group is overridden every time we create a new SQL
session or job.

## Example metrics: CPU scheduler ticks

Context: The Go scheduler is preemptive and runs goroutines in segments of time
called "ticks". The specific time length of a tick is variable but is
bounded to a maximum, which is quite small (there are more than 10 per
second, TBD how many precisely). For CPU-heavy loads every tick tends
to use its full allowed length before control is switched over to a
different goroutine, to ensure fair scheduling.

In order to estimate CPU usage for different SQL queries/sessions, and
generally identify which SQL session(s)/query(ies) is/are responsible
for peaks in CPU usage, we need to collect stats about scheduler
ticks.

However we cannot / do not want to do so separately per goroutine,
since a single SQL session/query may be served by many goroutines over
time. Instead, we wish to collect this metric across all goroutines
that participate in a SQL query/session.

For this we can use the new task group abstraction.

The RFC proposes to extend the internal task group struct with a new
`schedtick` field, incremented upon every scheduler tick for the
current task group.

Example diff to add this increment, in `runtime/proc.go`:

```go
func execute(gp *g, inheritTime bool) {
	...
	if gp.taskGroupCtx != nil {
		atomic.Xadd64(&gp.taskGroupCtx.schedtick, 1)
	}
	...
}
```

Then we can retrieve the scheduling ticks for a given task group, in
`runtime/task_group.go`:

```go
// GetInternalTaskGroupSchedTicks retrieves the number of scheduler ticks for
// all goroutines in the given task group.
func GetInternalTaskGroupSchedTicks(taskGroup TaskGroup) uint64 {
	tg := (*t)(taskGroup)
	return atomic.Load64(&tg.schedtick)
}
```

And then we can connect this inside SQL, in `sql/conn_executor.go`:

```go
func (ex *connExecutor) run(...) {
	...
	ex.taskGroup = runtime.SetInternalTaskGroup()
}

...

func (ex *connExecutor) serialize() serverpb.Session {
	...
	return serverpb.Session{
		...
		// Experimental
		SchedTicks: runtime.GetInternalTaskGroupSchedTicks(ex.taskGroup),
	}
}
```

And then, to make it observable for SQL DBAs, in `sql/crdb_internal.go`:

```go
func populateSessionsTable(...) {
	...
	... addRow(
		...
		tree.NewDInt(tree.DInt(session.SchedTicks)),
	)
	...
}
```

Example usage:

```
root@:26257/defaultdb> select node_id, client_address, last_active_query, sched_ticks
from crdb_internal.node_sessions;

  node_id | client_address | last_active_query | sched_ticks
----------+----------------+-------------------+--------------
        1 | [::1]:27580    | SHOW database     |         972
        1 | [::1]:27577    | SHOW database     |        6675
(2 rows)
```

This shows 2 distinct SQL sessions (there are two interactive terminals
running a SQL shell connected to this node). The `sched_ticks` column
reports that one session has used 972 ticks and the other has used 6675 ticks.

The value of sched ticks can be confirmed to increase in relation to
actual CPU usage. We can compare how the ticks are incremented for several operations:

| Operation                                           | Tick increment                              |
|-----------------------------------------------------|---------------------------------------------|
| `pg_sleep(0)`                                       | 20                                          |
| `pg_sleep(1)`                                       | 20 (sleeping time does not incur CPU usage) |
| `select count(*) from generate_series(1,10000000)`  | 280                                         |
| `select count(*) from generate_series(1,100000000)` | 2800 (ten times more than the previous)     |

As this confirms, idle sessions don't increase their `sched_ticks`, whereas busy sessions increment
it in proportion to the work actually done.

(NB: the Go changes outlined above do not include the aggregation of
sched ticks across multiple nodes when queries are distributed. We'd
need to think further about how to reason about CPU usage on multiple
nodes, and whether we even want this to be aggregated, or whether we
want the CPU ticks to be presented side-by-side for separate nodes.)

PR where this change has been prototyped: https://github.com/cockroachdb/cockroach/pull/60589

## Example metrics: heap usage

The basic concept here is to force the Go heap allocator to use
different heaps for different task groups.

To achieve this we change the heap allocator structures to attach
a task group to the "pages" (blocks) of system memory from which
heap allocations are carved.

When a heap allocation is performed, the RAM usage counter on the task
group is incremented; when the GC frees all the allocations on the
page/block, it decreases the task group's RAM usage by the size of the
page.
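
The accounting described above can be modeled as a simple counter, as in this toy sketch. The page size, the `ramAccount` type, and its methods are all illustrative assumptions, not the actual allocator design.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ramAccount is a toy model of the per-task-group RAM counter described
// above: allocating a page charges its bytes to the group, and freeing
// the page (as the GC would once every allocation on it is dead)
// credits them back.
type ramAccount struct {
	bytes atomic.Int64
}

const pageSize = 8192 // hypothetical heap page/block size

func (a *ramAccount) allocPage() { a.bytes.Add(pageSize) }
func (a *ramAccount) freePage()  { a.bytes.Add(-pageSize) }

func main() {
	var acct ramAccount
	acct.allocPage()
	acct.allocPage()
	acct.freePage() // the GC reclaimed one whole page
	fmt.Println(acct.bytes.Load()) // 8192: one page still charged
}
```

Note that because the charge is per page rather than per object, a single live small object can keep a whole page billed to its group, which is exactly the fragmentation concern discussed next.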

This is a bit more complicated to get right, because we probably
want to avoid fragmentation for "small objects" that are only allocated
once per SQL session. How do we deal with those?

There are multiple schemes we can envision:

- separate the abstraction of "task group" into sub-containers for
  heap usage, such that small allocations are "billed" against the
  common/default task group while "large" allocations remain billed
  per task group.

  This can be considered acceptable if we have confidence that the
  number of small allocations inside each SQL session/query remains
  small (i.e. most problematic heap usage is incurred by larger
  allocations).

- track the task group for every object inside the page, or for groups
  of objects of the same type (e.g. slices).

  This will yield more precise / complete heap usage metrics but incur
  higher RAM usage overall due to the additional task group tracking.

Which scheme to use needs to be guided by further experiments.

## Later benefit: resource control (not just measurement)

Once these facilities are in place, we can then extend the task group
abstraction inside the Go runtime with a per-group configurable
maximum value, such that an exception or other control mechanism is
triggered whenever the Go runtime notices the metric is exceeding the
configured budget.
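
A minimal sketch of such a budget check, in plain Go: the `budgetedGroup` type, its `charge` method, and the error are all hypothetical stand-ins for whatever control mechanism the runtime would actually provide.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errOverBudget stands in for the "exception or other control
// mechanism" mentioned above.
var errOverBudget = errors.New("task group exceeded its resource budget")

// budgetedGroup pairs a resource counter with a configurable maximum.
type budgetedGroup struct {
	used  atomic.Int64
	limit int64
}

// charge accounts n units to the group and fails once the configured
// budget is crossed.
func (g *budgetedGroup) charge(n int64) error {
	if g.used.Add(n) > g.limit {
		return errOverBudget
	}
	return nil
}

func main() {
	g := &budgetedGroup{limit: 100}
	fmt.Println(g.charge(60)) // <nil>: 60 <= 100
	fmt.Println(g.charge(60)) // over budget: 120 > 100
}
```

In the real design the check would live inside the runtime (e.g. at allocation or scheduling points), so application code could not bypass it.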


## Drawbacks

Maintaining a custom runtime extension to the Go project really means
creating a new project maintained by Cockroach Labs, which will be
responsible for importing all the changes made upstream by the Go
team. This is additional effort, also requiring a skill set that we
have not yet been hiring for inside the CockroachDB and Cockroach
Cloud projects.

That will in turn incur working hours for the foreseeable future.

We can mitigate that risk either:

- by setting up a communication forum
  with the Go team to demonstrate the runtime extensions we are
  working with and advocate for their inclusion in the mainstream
  Go distribution; or

- by developing the runtime extensions not as a repository fork, but
  instead as a patchset, and creating automation which will
  automatically detect changes to upstream Go and apply the patches
  every time a new Go release is made. (This is the
  approach taken by the `redact` package to override the `fmt`
  behavior; see
  [here](https://github.com/cockroachdb/redact/blob/master/internal/README.md#refreshing-the-sources).)

## Rationale and Alternatives

The main alternative to the approach taken here is to manually
instrument every single entry point in Go where "CPU time" or "memory
usage" can be incurred. This has been attempted in the past and was
largely unsuccessful.

# Explain it to someone else

Magic!

# Unresolved questions

- How do we ensure that CockroachDB can build and run with a standard
  Go distribution not equipped with the proposed API? What are the
  exact build steps to make this possible?

  (We accept that the resource monitoring benefits would not be available
  when the custom extension is not used.)

- Do we want to extend the approach here to also measure resource
  usage incurred by KV-level goroutines? How would that work?
