Commit dd65bd8

Formatting and style fixes for files in conceptual directory
1 parent 89bba0e commit dd65bd8

2 files changed

+75
-63
lines changed

docs/conceptual/how-omnitrace-works.rst

Lines changed: 66 additions & 54 deletions
@@ -11,89 +11,101 @@ some basic tips to help you get started. It also explains the main data
1111
collection modes, including a comparison between binary instrumentation
1212
and statistical sampling.
1313

14-
Omnitrace Nomenclature
14+
Omnitrace nomenclature
1515
========================================
1616

1717
The list provided below is intended to provide a basic glossary for those who
1818
are not familiar with binary instrumentation. It also clarifies ambiguities
1919
when certain terms have different
20-
contextual meanings, for example, Omnitrace's definition of the term "module"
20+
contextual meanings, for example, the Omnitrace meaning of the term "module"
2121
when instrumenting Python.
2222

2323
**Binary**
24-
A file written in the Executable and Linkable Format (ELF). This is the standard file format for executable files, shared libraries, etc.
24+
A file written in the Executable and Linkable Format (ELF). This is the standard file
25+
format for executable files, shared libraries, etc.
2526

26-
**Binary Instrumentation**
27-
Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically.
27+
**Binary instrumentation**
28+
Inserting callbacks to instrumentation into an existing binary. This can be performed
29+
statically or dynamically.
2830

29-
**Static Binary Instrumentation**
30-
Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded. It is applicable to executables and libraries but limited to only the functions defined in the binary. This is also known as **Binary Rewrite**.
31+
**Static binary instrumentation**
32+
Loads an existing binary, determines instrumentation points, and generates a new binary
33+
with instrumentation directly embedded. It is applicable to executables and libraries but
34+
limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
3135

32-
**Dynamic Binary Instrumentation**
33-
Loads an existing binary into memory, inserts instrumentation, and executes the binary. It is limited to executables but capable of instrumenting linked libraries. This is also known as: **Runtime Instrumentation**
36+
**Dynamic binary instrumentation**
37+
Loads an existing binary into memory, inserts instrumentation, and executes the binary.
38+
It is limited to executables but capable of instrumenting linked libraries.
39+
This is also known as **Runtime instrumentation**.
3440

35-
**Statistical Sampling**
36-
At periodic intervals, the application is paused and the current call-stack of the CPU is recorded alongside with various other metrics. It uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system. This is also known as just **sampling**.
41+
**Statistical sampling**
42+
At periodic intervals, the application is paused and the current call-stack of the CPU
43+
is recorded alongside various other metrics. It uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system. This is also known as just **sampling**.
3744

38-
**Sampling Rate**
45+
**Sampling rate**
3946
* The frequency at which (A) or (B) is triggered (in units of ``# interrupts / second``)
4047
* Higher values increase the number of samples
4148

42-
**Sampling Delay**
49+
**Sampling delay**
4350
* How long to wait before (A) and (B) begin triggering at their designated rate
4451

45-
**Sampling Duration**
46-
* The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
52+
**Sampling duration**
53+
* The time (in real-time) after the start of the application to record samples.
54+
* Once this time limit has been reached, no more samples will be recorded.
4755

48-
**Process Sampling**
49-
At periodic (realtime) intervals, a background thread records global metrics without
56+
**Process sampling**
57+
At periodic (real-time) intervals, a background thread records global metrics without
5058
interrupting the current process. These metrics include, but are not limited to:
5159
CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU Temperature,
5260
and GPU Power usage.
5361

54-
**Sampling Rate**
55-
* The realtime period for recording metrics (in units of ``# measurements / second``)
62+
**Sampling rate**
63+
* The real-time frequency for recording metrics (in units of ``# measurements / second``)
5664
* Higher values increase the number of samples
5765

58-
**Sampling Delay**
59-
* How long to wait (in realtime) before recording samples
66+
**Sampling delay**
67+
* How long to wait (in real-time) before recording samples
6068

61-
**Sampling Duration**
62-
* The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
69+
**Sampling duration**
70+
* The time (in real-time) after the start of the application to record samples.
71+
* Once this time limit has been reached, no more samples will be recorded.
6372

6473
**Module**
6574
With respect to binary instrumentation, a module is defined as either the filename
6675
(such as ``foo.c``) or library name (``libfoo.so``) which contains the definition
6776
of one or more functions.
6877

69-
With respect to Python instrumentation, a module is defined as the **file** which contains the definition of one or more functions. The full path to this file typically contains the name of the "Python module".
78+
With respect to Python instrumentation, a module is defined as the **file** which contains
79+
the definition of one or more functions. The full path to this file typically contains the
80+
name of the "Python module".
7081

71-
**Basic Block**
82+
**Basic block**
7283
Straight-line code sequence with no branches in (except for the entry) and
7384
no branches out (except for the exit).
7485

75-
**Address Range**
86+
**Address range**
7687
The instructions for a function in a binary start at a certain address within the ELF file and end at a certain address. The range is ``end - start``.
7788

7889
The address range is a decent approximation for the "cost" of a function.
7990
For example, a larger address range approximately equates to more instructions.
8091

81-
**Instrumentation Traps**
92+
**Instrumentation traps**
8293
On the x86 architecture, because instructions are of variable size, the instruction
8394
at a point may be too small for Dyninst to replace it with the normal code sequence
8495
used to call instrumentation. When instrumentation is placed at points other
8596
than subroutine entry, exit, or call points, traps may be used to ensure
86-
the instrumentation fits. (By default, omnitrace-instrument avoids instrumentation
97+
the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation
8798
which requires using a trap.)
8899

89100
**Overlapping functions**
90101
Due to language constructs or compiler optimizations, it may be possible for
91102
multiple functions to overlap (that is, share part of the same function body)
92103
or for a single function to have multiple entry points. In practice, it is
93104
impossible to determine the difference between multiple overlapping functions
94-
and a single function with multiple entry points. (By default, omnitrace-instrument avoids instrumenting overlapping functions.)
105+
and a single function with multiple entry points. (By default, ``omnitrace-instrument``
106+
avoids instrumenting overlapping functions.)
95107

96-
General Tips for Using Omnitrace
108+
General tips for using Omnitrace
97109
========================================
98110

99111
* Use ``omnitrace-avail`` to look up configuration settings, hardware counters, and data collection components
@@ -110,7 +122,7 @@ General Tips for Using Omnitrace
110122
* Use binary instrumentation for characterizing the performance of every invocation of specific functions
111123
* Use statistical sampling to characterize the performance of the entire application while minimizing overhead
112124
* Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
113-
* Use the user API to create custom regions, enable/disable omnitrace to specific processes, threads, and/or regions
125+
* Use the user API to create custom regions and to enable/disable Omnitrace for specific processes, threads, and/or regions
114126
* Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
115127

116128
* Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>`` options, e.g. ``OMNITRACE_USE_KOKKOSP``, ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools callbacks, respectively
@@ -122,7 +134,7 @@ General Tips for Using Omnitrace
122134
* When call-counts are high, improving the performance of this function or "inlining" the function can be quick and easy performance improvements
123135
* When the standard-deviation is high, collect a hierarchical profile and see if the high variation can be attributed to the calling context. In this scenario, consider creating a specialized version of the function for the longer running contexts
124136
* Collect a hierarchical profile and, keeping the flat-profiling data in mind, verify the functions noted in the flat profile are part of the "critical path" of your application
125-
* E.g. function(s) with high call counts, etc. which are part of a "setup" or "post-processing" phase which does not consume much time relative to the overall time is, generally, a lower priority for optimization
137+
* E.g. functions with high call counts, etc. which are part of a "setup" or "post-processing" phase which does not consume much time relative to the overall time are, generally, a lower priority for optimization
126138

127139
* Use the information from the profiles when analyzing detailed traces
128140
* When using binary instrumentation in the "trace" mode, the binary rewrites are preferable to runtime instrumentation.
@@ -134,10 +146,10 @@ General Tips for Using Omnitrace
134146
* Runtime instrumentation requires a fork + ptrace, which is generally incompatible with how MPI applications spawn their processes
135147
* Binary rewrite the executable using MPI (and, optionally, libraries used by the executable) and execute the generated instrumented executable via ``omnitrace-run`` instead of the original, e.g. ``mpirun -n 2 ./myexe`` should be ``mpirun -n 2 omnitrace-run -- ./myexe.inst`` where ``myexe.inst`` is the generated instrumented ``myexe`` executable.
136148

137-
Data Collection Modes
149+
Data collection modes
138150
========================================
139151

140-
OmniTrace supports several modes of recording trace and profiling data for your application:
152+
Omnitrace supports several modes of recording trace and profiling data for your application:
141153

142154
+-----------------------------+---------------------------------------------------------+
143155
| Mode | Description |
@@ -149,7 +161,7 @@ OmniTrace supports several modes of recording trace and profiling data for your
149161
| | and records various metrics for the given call-stack |
150162
+-----------------------------+---------------------------------------------------------+
151163
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
152-
| | make callbacks into omnitrace to provide information |
164+
| | make callbacks into Omnitrace to provide information |
153165
| | about the work the API is performing |
154166
+-----------------------------+---------------------------------------------------------+
155167
| Dynamic Symbol Interception | Wrap function symbols defined in position independent |
@@ -166,15 +178,15 @@ executable but for statistical sampling, it is highly recommended to use the
166178
``omnitrace-sample`` executable instead if no binary instrumentation is required/desired.
167179
With either tool, the callback APIs and dynamic symbol interception can be utilized.
168180

169-
Binary Instrumentation
181+
Binary instrumentation
170182
-----------------------------------
171183

172-
Binary instrumentation will allow one to deterministically record measurements for
184+
Binary instrumentation will allow one to record deterministic measurements for
173185
every single invocation of a given function.
174186
Binary instrumentation effectively adds instructions to the target application to
175187
collect the required information and, thus, has the potential to cause performance
176188
changes which may, in some cases, lead to inaccurate results. The effect depends on
177-
what information being collected and which features are activated in omnitrace.
189+
what information is being collected and which features are activated in Omnitrace.
178190
For example, collecting only the wall-clock timing data
179191
will have less effect than collecting the wall-clock timing, cpu-clock timing,
180192
memory usage, cache-misses, and number of instructions executed. Similarly,
@@ -186,14 +198,14 @@ In Omnitrace, the primary heuristic for controlling the overhead with binary
186198
instrumentation is the minimum number of instructions for selecting functions
187199
for instrumentation.
188200

189-
Statistical Sampling
201+
Statistical sampling
190202
-----------------------------------
191203

192204
Statistical call-stack sampling interrupts the application at
193205
regular intervals using operating system interrupts.
194206
Sampling is typically less numerically accurate and specific, but allows the
195207
target program to run at near full speed.
196-
In constrast to the data derived from binary instrumentation, the resulting
208+
In contrast to the data derived from binary instrumentation, the resulting
197209
data is not exact but, instead, a statistical approximation.
198210
However, sampling often provides a more accurate picture of the application
199211
execution because it is less intrusive to the target application and has fewer
@@ -206,27 +218,27 @@ In Omnitrace, the overhead for statistical sampling is a factor of the
206218
sampling rate and whether the samples are taken with respect to the CPU time
207219
and/or real time.
208220

209-
Binary Instrumentation vs. Statistical Sampling Example
221+
Binary instrumentation vs. statistical sampling example
210222
-------------------------------------------------------
211223

212224
Consider the following code:
213225

214226
.. code:: cpp
215227
216-
long fib(long n)
217-
{
228+
long fib(long n)
229+
{
218230
if(n < 2) return n;
219231
return fib(n - 1) + fib(n - 2);
220-
}
232+
}
221233
222-
void run(long n)
223-
{
234+
void run(long n)
235+
{
224236
long result = fib(nfib);
225237
printf("[%li] fibonacci(%li) = %li\n", i, nfib, result);
226-
}
238+
}
227239
228-
int main(int argc, char** argv)
229-
{
240+
int main(int argc, char** argv)
241+
{
230242
long nfib = 30;
231243
long nitr = 10;
232244
if(argc > 1) nfib = atol(argv[1]);
@@ -236,7 +248,7 @@ Consider the following code:
236248
run(nfib);
237249
238250
return 0;
239-
}
251+
}
240252
241253
Binary instrumentation of the ``fib`` function will record **every single invocation**
242254
of the function -- which for a very small function
@@ -259,14 +271,14 @@ Consider the level of details in the following traces where, in the top image,
259271
every instance of the ``fib`` function was instrumented vs. the bottom image
260272
where the ``fib`` call-stack was derived via sampling:
261273

262-
Binary Instrumentation of Fibonacci Function
274+
Binary instrumentation of the Fibonacci function
263275
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
264276

265277
.. image:: ../data/fibonacci-instrumented.png
266-
:alt: Visualization of the output of a binary instrumentation of the Fibonacci fucnction
278+
:alt: Visualization of the output of a binary instrumentation of the Fibonacci function
267279

268-
Statistical Sampling of Fibonacci Function
280+
Statistical sampling of the Fibonacci function
269281
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
270282

271283
.. image:: ../data/fibonacci-sampling.png
272-
:alt: Visualization of the output of a statistical sample of the Fibonacci fucnction
284+
:alt: Visualization of the output of a statistical sample of the Fibonacci function

docs/conceptual/omnitrace-feature-set.rst

Lines changed: 9 additions & 9 deletions
@@ -11,7 +11,7 @@ Internally, it leverages the `timemory performance analysis toolkit <https://git
1111
to manage extensions, resources, data, and other items. It supports the following features,
1212
modes, metrics, and APIs.
1313

14-
Data Collection Modes
14+
Data collection modes
1515
========================================
1616

1717
* Dynamic instrumentation
@@ -23,18 +23,18 @@ Data Collection Modes
2323
* Process-level sampling: Background thread records process-, system- and device-level metrics while the application executes
2424
* Causal profiling: Quantifies the potential impact of optimizations in parallel codes
2525

26-
Data Analysis
26+
Data analysis
2727
========================================
2828

29-
* High-level summary profiles with mean/min/max/stddev statistics
29+
* High-level summary profiles with mean/min/max/standard deviation statistics
3030

3131
* Low overhead, memory efficient
3232
* Ideal for running at scale
3333

3434
* Comprehensive traces for every individual event/measurement
3535
* Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
3636

37-
Parallelism API Support
37+
Parallelism API support
3838
========================================
3939

4040
* HIP
@@ -44,7 +44,7 @@ Parallelism API Support
4444
* Kokkos-Tools (KokkosP)
4545
* OpenMP-Tools (OMPT)
4646

47-
GPU Metrics
47+
GPU metrics
4848
========================================
4949

5050
* GPU hardware counters
@@ -59,7 +59,7 @@ GPU Metrics
5959
* Temperature
6060
* Utilization
6161

62-
CPU Metrics
62+
CPU metrics
6363
========================================
6464

6565
* CPU hardware counters sampling and profiles
@@ -98,7 +98,7 @@ Omnitrace use cases
9898

9999
When analyzing the performance of an application, it is always best to NOT
100100
assume you know where the performance bottlenecks are
101-
and why they are happening. OmniTrace is a tool for the entire execution
101+
and why they are happening. Omnitrace is a tool for the entire execution
102102
of an application. It is the sort of tool which is
103103
ideal for characterizing where optimization would have the greatest impact
104104
on the end-to-end execution of the application and/or
@@ -112,8 +112,8 @@ to 1 microsecond (1000x speed-up) but the original application never
112112
spent time waiting for kernel(s) to complete,
113113
you will see zero statistically significant speed-up in end-to-end
114114
runtime of your application. In other words, it does not matter
115-
how fast or slow the code on GPU is if the application is not
116-
bottlenecked waiting on the GPU.
115+
how fast or slow the code on GPU is if the application does not have a
116+
bottleneck waiting on the GPU.
117117

118118
Use OmniTrace to obtain a high-level view of the entire application. Use it
119119
to determine where the performance bottlenecks are and
