Commit df91a34

samjwu, amd-jnovotny, peterjunpark, ajanicijamd, dgaliffiAMD
authored
Cherry pick Omnitrace docs refactoring (#353) (#364)
Omnitrace docs refactoring (#353) --------- Signed-off-by: David Galiffi <[email protected]> Co-authored-by: Jeffrey Novotny <[email protected]> Co-authored-by: Peter Jun Park <[email protected]> Co-authored-by: ajanicijamd <[email protected]> Co-authored-by: David Galiffi <[email protected]> Co-authored-by: Jonathan R. Madsen <[email protected]>
1 parent f0bd912 commit df91a34

38 files changed

+7122 -9 lines changed

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,3 +4,4 @@
44
docs/* @ROCm/rocm-documentation
55
*.md @ROCm/rocm-documentation
66
*.rst @ROCm/rocm-documentation
7+
.readthedocs.yaml @ROCm/rocm-documentation

.github/dependabot.yml

Lines changed: 11 additions & 0 deletions
@@ -9,3 +9,14 @@ updates:
99
directory: "/" # Location of package manifests
1010
schedule:
1111
interval: "weekly"
12+
13+
- package-ecosystem: "pip" # See documentation for possible values
14+
directory: "/docs/sphinx" # Location of package manifests
15+
open-pull-requests-limit: 10
16+
schedule:
17+
interval: "daily"
18+
labels:
19+
- "documentation"
20+
- "dependencies"
21+
reviewers:
22+
- "samjwu"

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@
3737
# Python cache files
3838
*.pyc
3939

40+
# Documentation artifacts
41+
/_build
42+
_toc.yml
43+
4044
/build*
4145
/.vscode
4246
/.cache

.readthedocs.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
# Read the Docs configuration file
2+
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
3+
4+
version: 2
5+
6+
build:
7+
os: ubuntu-22.04
8+
tools:
9+
python: "3.10"
10+
11+
python:
12+
install:
13+
- requirements: docs/sphinx/requirements.txt
14+
15+
sphinx:
16+
configuration: docs/conf.py
17+
18+
formats: []

README.md

Lines changed: 7 additions & 9 deletions
@@ -8,8 +8,6 @@
88
[![Installer Packaging (CPack)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
99
[![Documentation](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)
1010

11-
> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
12-
1311
## Overview
1412

1513
AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
@@ -87,8 +85,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me
8785

8886
## Documentation
8987

90-
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
91-
See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
88+
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
89+
See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.
9290

9391
## Quick Start
9492

@@ -109,7 +107,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
109107
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
110108
```
111109

112-
See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
110+
See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.
113111

114112
### Setup
115113

@@ -298,13 +296,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
298296
- Select "Open trace file" from panel on the left
299297
- Locate the omnitrace perfetto output (extension: `.proto`)
300298

301-
![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
299+
![omnitrace-perfetto](docs/data/omnitrace-perfetto.png)
302300

303-
![omnitrace-rocm](source/docs/images/omnitrace-rocm.png)
301+
![omnitrace-rocm](docs/data/omnitrace-rocm.png)
304302

305-
![omnitrace-rocm-flow](source/docs/images/omnitrace-rocm-flow.png)
303+
![omnitrace-rocm-flow](docs/data/omnitrace-rocm-flow.png)
306304

307-
![omnitrace-user-api](source/docs/images/omnitrace-user-api.png)
305+
![omnitrace-user-api](docs/data/omnitrace-user-api.png)
308306

309307
## Using Perfetto tracing with System Backend
310308

docs/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
_build/
2+
_doxygen/
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
1+
.. meta::
2+
:description: Omnitrace documentation and reference
3+
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
4+
5+
**********************
6+
Data collection modes
7+
**********************
8+
9+
Omnitrace supports several modes of recording trace and profiling data for your application.
10+
11+
.. note::
12+
13+
For an explanation of the terms used in this topic, see
14+
the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
15+
16+
+-----------------------------+---------------------------------------------------------+
17+
| Mode | Description |
18+
+=============================+=========================================================+
19+
| Binary Instrumentation | Locates functions (and loops, if desired) in the binary |
20+
| | and inserts snippets at the entry and exit |
21+
+-----------------------------+---------------------------------------------------------+
22+
| Statistical Sampling | Periodically pauses application at specified intervals |
23+
| | and records various metrics for the given call stack |
24+
+-----------------------------+---------------------------------------------------------+
25+
| Callback APIs | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
26+
| | make callbacks into Omnitrace to provide information |
27+
| | about the work the API is performing |
28+
+-----------------------------+---------------------------------------------------------+
29+
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
30+
| | dynamic library/executable, like ``pthread_mutex_lock`` |
31+
| | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
32+
+-----------------------------+---------------------------------------------------------+
33+
| User API | User-defined regions and controls for Omnitrace |
34+
+-----------------------------+---------------------------------------------------------+
35+
36+
The two most generic and important modes are binary instrumentation and statistical sampling.
37+
It is important to understand their advantages and disadvantages.
38+
Binary instrumentation and statistical sampling can be performed with the ``omnitrace-instrument``
39+
executable. For statistical sampling, it's highly recommended to use the
40+
``omnitrace-sample`` executable instead if binary instrumentation isn't required.
41+
Callback APIs and dynamic symbol interception can be utilized with either tool.
42+
43+
Binary instrumentation
44+
-----------------------------------
45+
46+
Binary instrumentation lets you record deterministic measurements for
47+
every single invocation of a given function.
48+
Binary instrumentation effectively adds instructions to the target application to
49+
collect the required information. It therefore has the potential to cause performance
50+
changes which might, in some cases, lead to inaccurate results. The effect depends on
51+
the information being collected and which features are activated in Omnitrace.
52+
For example, collecting only the wall-clock timing data
53+
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
54+
memory usage, cache-misses, and number of instructions that were run. Similarly,
55+
collecting a flat profile has less overhead than a hierarchical profile
56+
and collecting a trace OR a profile has less overhead than collecting a
57+
trace AND a profile.
58+
59+
In Omnitrace, the primary heuristic for controlling the overhead with binary
60+
instrumentation is the minimum number of instructions for selecting functions
61+
for instrumentation.
62+
63+
Statistical sampling
64+
-----------------------------------
65+
66+
Statistical call-stack sampling interrupts the application at
67+
regular intervals using operating system interrupts.
68+
Sampling is typically less numerically accurate and specific, but the
69+
target program runs at nearly full speed.
70+
In contrast to the data derived from binary instrumentation, the resulting
71+
data is not exact but is instead a statistical approximation.
72+
However, sampling often provides a more accurate picture of the application
73+
execution because it is less intrusive to the target application and has fewer
74+
side effects on memory caches or instruction decoding pipelines. Furthermore,
75+
because sampling does not affect the execution speed as much, it is
76+
relatively immune to over-evaluating the cost of small, frequently called
77+
functions or "tight" loops.
78+
79+
In Omnitrace, the overhead for statistical sampling depends on the
80+
sampling rate and whether the samples are taken with respect to the CPU time
81+
and/or real time.
82+
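As a rough illustration of interrupt-driven sampling, the following C++ sketch uses the POSIX ``setitimer`` and ``SIGPROF`` interfaces to interrupt a busy function at a fixed CPU-time interval and count the interrupts. This is not Omnitrace's implementation, which the text above does not describe. The names ``count_samples``, ``on_sample``, and ``busy_fib`` are hypothetical, and a real sampler would record the call stack in the handler rather than increment a counter.

```cpp
#include <signal.h>
#include <sys/time.h>

// Incremented by the signal handler; sig_atomic_t is safe to write from a handler.
static volatile sig_atomic_t g_samples = 0;

// Hypothetical handler: a real sampler would capture the call stack here.
static void on_sample(int /*signum*/) { g_samples = g_samples + 1; }

// Deliberately slow recursive workload to be sampled.
static long busy_fib(long n) { return (n < 2) ? n : busy_fib(n - 1) + busy_fib(n - 2); }

// Runs busy_fib(n) while a CPU-time timer fires every interval_us microseconds,
// and returns how many times the workload was interrupted.
long count_samples(long n, long interval_us)
{
    g_samples = 0;

    struct sigaction action = {};
    action.sa_handler = on_sample;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESTART;
    sigaction(SIGPROF, &action, nullptr);

    // ITIMER_PROF counts CPU time consumed by the process and delivers SIGPROF.
    struct itimerval timer = {};
    timer.it_interval.tv_sec  = interval_us / 1000000;
    timer.it_interval.tv_usec = interval_us % 1000000;
    timer.it_value            = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, nullptr);

    // volatile keeps the compiler from folding the workload away.
    volatile long input  = n;
    volatile long result = busy_fib(input);
    (void) result;

    // A zeroed itimerval disarms the timer.
    struct itimerval disarm = {};
    setitimer(ITIMER_PROF, &disarm, nullptr);
    return g_samples;
}
```

Calling ``count_samples(35, 1000)`` samples the workload roughly once per millisecond of CPU time. Swapping ``ITIMER_PROF`` and ``SIGPROF`` for ``ITIMER_REAL`` and ``SIGALRM`` would sample against real (wall-clock) time instead. A shorter interval yields more samples at the cost of more overhead.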
83+
Binary instrumentation vs. statistical sampling example
84+
-------------------------------------------------------
85+
86+
Consider the following code:
87+
88+
.. code-block:: c++
89+
90+
long fib(long n)
91+
{
92+
if(n < 2) return n;
93+
return fib(n - 1) + fib(n - 2);
94+
}
95+
96+
void run(long n)
97+
{
98+
long result = fib(n);
99+
printf("fibonacci(%li) = %li\n", n, result);
100+
}
101+
102+
int main(int argc, char** argv)
103+
{
104+
long nfib = 30;
105+
long nitr = 10;
106+
if(argc > 1) nfib = atol(argv[1]);
107+
if(argc > 2) nitr = atol(argv[2]);
108+
109+
for(long i = 0; i < nitr; ++i)
110+
run(nfib);
111+
112+
return 0;
113+
}
114+
115+
Binary instrumentation of the ``fib`` function will record **every single invocation**
116+
of the function. For a very small function
117+
such as ``fib``, this results in **significant** overhead since this simple function
118+
takes about 20 instructions, whereas the entry and
119+
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
120+
instrumenting functions where the instrumented function has significantly fewer
121+
instructions than entry and exit instrumentation. (Note that many of the
122+
instructions in entry and exit functions are either logging functions or
123+
depend on the runtime settings and thus might never run). However,
124+
due to the number of potential instructions in the entry and exit snippets,
125+
the default behavior of ``omnitrace-instrument`` is to only instrument functions
126+
which contain at least 1024 instructions.
127+
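To put rough numbers on this overhead, the sketch below counts how many times the naive recursion enters ``fib`` and estimates the worst-case instruction inflation from the figures quoted above (about 20 instructions in the body, ~1024 in the snippets). The helper names ``fib_call_count`` and ``instrumentation_ratio`` are hypothetical, and the ratio is an upper bound because, as noted, many snippet instructions might never run.

```cpp
#include <cstdint>

// Number of times fib() is entered when fib(n) is computed with the
// naive recursion shown in the example.
std::uint64_t fib_call_count(long n)
{
    if(n < 2) return 1;
    return 1 + fib_call_count(n - 1) + fib_call_count(n - 2);
}

// Worst-case ratio of instrumented to uninstrumented instruction count
// for a single call, assuming every snippet instruction runs.
double instrumentation_ratio(double body_instructions, double snippet_instructions)
{
    return (body_instructions + snippet_instructions) / body_instructions;
}
```

With the default ``nfib = 30``, ``fib_call_count(30)`` is 2,692,537 calls per iteration, each of which would run the entry and exit snippets. ``instrumentation_ratio(20, 1024)`` is about 52, so in the worst case the instrumented function runs roughly fifty times as many instructions as the original.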
128+
However, recording every single invocation of the function can be extremely
129+
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
130+
than the average or a high standard deviation. In this case, the traces help you
131+
identify exactly when and where those instances deviated from the norm.
132+
Compare the level of detail in the following traces. In the top image,
133+
every instance of the ``fib`` function is instrumented, while in the bottom image,
134+
the ``fib`` call-stack is derived via sampling.
135+
136+
Binary instrumentation of the Fibonacci function
137+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
138+
139+
.. image:: ../data/fibonacci-instrumented.png
140+
:alt: Visualization of the output of a binary instrumentation of the Fibonacci function
141+
142+
Statistical sampling of the Fibonacci function
143+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
144+
145+
.. image:: ../data/fibonacci-sampling.png
146+
:alt: Visualization of the output of a statistical sample of the Fibonacci function
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
.. meta::
2+
:description: Omnitrace documentation and reference
3+
:keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
4+
5+
***************************************
6+
The Omnitrace feature set and use cases
7+
***************************************
8+
9+
`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
10+
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
11+
to manage extensions, resources, data, and other items. It supports the following features,
12+
modes, metrics, and APIs.
13+
14+
Data collection modes
15+
========================================
16+
17+
* Dynamic instrumentation
18+
19+
* Runtime instrumentation: Instrument executables and shared libraries at runtime
20+
* Binary rewriting: Generate a new executable and/or library with instrumentation built-in
21+
22+
* Statistical sampling: Periodic software interrupts per-thread
23+
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
24+
* Causal profiling: Quantifies the potential impact of optimizations in parallel code
25+
26+
.. note::
27+
28+
Critical trace support was removed in Omnitrace v1.11.0.
29+
It was replaced by the causal profiling feature.
30+
31+
Data analysis
32+
========================================
33+
34+
* High-level summary profiles with mean, min, max, and standard deviation statistics
35+
36+
* Low overhead and memory efficient
37+
* Ideal for running at scale
38+
39+
* Comprehensive traces for every individual event and measurement
40+
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling
41+
42+
Parallelism API support
43+
========================================
44+
45+
* HIP
46+
* HSA
47+
* Pthreads
48+
* MPI
49+
* Kokkos-Tools (KokkosP)
50+
* OpenMP-Tools (OMPT)
51+
52+
GPU metrics
53+
========================================
54+
55+
* GPU hardware counters
56+
* HIP API tracing
57+
* HIP kernel tracing
58+
* HSA API tracing
59+
* HSA operation tracing
60+
* System-level sampling (via rocm-smi)
61+
62+
* Memory usage
63+
* Power usage
64+
* Temperature
65+
* Utilization
66+
67+
CPU metrics
68+
========================================
69+
70+
* CPU hardware counters sampling and profiles
71+
* CPU frequency sampling
72+
* Various timing metrics
73+
74+
* Wall time
75+
* CPU time (process and thread)
76+
* CPU utilization (process and thread)
77+
* User CPU time
78+
* Kernel CPU time
79+
80+
* Various memory metrics
81+
82+
* High-water mark (sampling and profiles)
83+
* Memory page allocation
84+
* Virtual memory usage
85+
86+
* Network statistics
87+
* I/O metrics
88+
* Many others
89+
90+
Third-party API support
91+
========================================
92+
93+
* TAU
94+
* LIKWID
95+
* Caliper
96+
* CrayPAT
97+
* VTune
98+
* NVTX
99+
* ROCTX
100+
101+
Omnitrace use cases
102+
========================================
103+
104+
When analyzing the performance of an application, do NOT
105+
assume you know where the performance bottlenecks are
106+
and why they are happening. Omnitrace is a tool for analyzing the entire
107+
application and its performance. It is
108+
ideal for characterizing where optimization would have the greatest impact
109+
on an end-to-end run of the application and for
110+
viewing what else is happening on the system during a performance bottleneck.
111+
112+
When GPUs are involved, there is a tendency to assume that
113+
the quickest path to performance improvement is minimizing
114+
the runtime of the GPU kernels. This is a highly flawed assumption.
115+
If you optimize the runtime of a kernel from one millisecond
116+
to one microsecond (a 1000x speed-up) but the original application never
117+
spent time waiting for kernels to complete,
118+
there would be no statistically significant reduction in the end-to-end
119+
runtime of your application. In other words, it does not matter
120+
how fast or slow the code on the GPU is if the application does not have a
121+
bottleneck on waiting on the GPU.
122+
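The kernel speed-up argument can be checked with a toy model. The sketch below assumes the host launches a kernel asynchronously, performs independent CPU work, and only then waits for the kernel, so the end-to-end time is the larger of the two. The function name ``end_to_end_us`` and all of the timings are illustrative assumptions, not measurements.

```cpp
#include <algorithm>

// Toy model: the host launches a kernel asynchronously, does cpu_work_us of
// independent work, then waits for the kernel. Fully overlapped execution
// finishes when the slower of the two finishes.
double end_to_end_us(double cpu_work_us, double kernel_us)
{
    return std::max(cpu_work_us, kernel_us);
}
```

With 10,000 microseconds of CPU work, ``end_to_end_us(10000, 1000)`` and ``end_to_end_us(10000, 1)`` both return 10,000, so the 1000x kernel speed-up changes nothing. Only when the host actually waits on the GPU, for example ``end_to_end_us(100, 1000)`` versus ``end_to_end_us(100, 1)``, does the faster kernel shorten the run.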
123+
Use Omnitrace to obtain a high-level view of the entire application. Use it
124+
to determine where the performance bottlenecks are and
125+
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
126+
performance, start your investigation with Omnitrace, which characterizes the
127+
broad picture.
128+
129+
.. note::
130+
131+
For insight into the execution of individual kernels on the GPU,
132+
use `Omniperf <https://github.com/rocm/omniperf>`_.
133+
134+
In terms of CPU analysis, Omnitrace does not target any specific vendor.
135+
It works just as well on AMD and non-AMD CPUs.
136+
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
137+
and kernels running on AMD GPUs.
