Formatting and style fixes for files in conceptual directory

amd-jnovotny · amd-jnovotny · commit dd65bd8f3f2d · 2024-06-21T16:48:25.000-04:00
diff --git a/docs/conceptual/how-omnitrace-works.rst b/docs/conceptual/how-omnitrace-works.rst
@@ -11,89 +11,101 @@ some basic tips to help you get started. It also explains the main data
 collection modes, including a comparison between binary instrumentation 
 and statistical sampling.
 
-Omnitrace Nomenclature
+Omnitrace nomenclature
 ========================================
 
 The list provided below is intended to provide a basic glossary for those who 
 are not familiar with binary instrumentation. It also clarifies ambiguities 
 when certain terms have different 
-contextual meanings, for example, Omnitrace's definition of the term "module" 
+contextual meanings, for example, the Omnitrace meaning of the term "module" 
 when instrumenting Python.
 
 **Binary**
-  A file written in the Executable and Linkable Format (ELF). This is the standard file format for executable files, shared libraries, etc.
+  A file written in the Executable and Linkable Format (ELF). This is the standard file 
+  format for executable files, shared libraries, etc.
 
-**Binary Instrumentation**
-  Inserting callbacks to instrumentation into an existing binary. This can be performed statically or dynamically.
+**Binary instrumentation**
+  Inserting callbacks to instrumentation into an existing binary. This can be performed 
+  statically or dynamically.
 
-**Static Binary Instrumentation**
-  Loads an existing binary, determines instrumentation points, and generates a new binary with instrumentation directly embedded. It is applicable to executables and libraries but limited to only the functions defined in the binary. This is also known as **Binary Rewrite**.
+**Static binary instrumentation**
+  Loads an existing binary, determines instrumentation points, and generates a new binary 
+  with instrumentation directly embedded. It is applicable to executables and libraries but 
+  limited to only the functions defined in the binary. This is also known as **Binary rewrite**.
 
-**Dynamic Binary Instrumentation**
-  Loads an existing binary into memory, inserts instrumentation, and executes the binary. It is limited to executables but capable of instrumenting linked libraries. This is also known as: **Runtime Instrumentation**
+**Dynamic binary instrumentation**
+  Loads an existing binary into memory, inserts instrumentation, and executes the binary. 
+  It is limited to executables but capable of instrumenting linked libraries. 
+  This is also known as **Runtime instrumentation**.
 
-**Statistical Sampling**  
-  At periodic intervals, the application is paused and the current call-stack of the CPU is recorded alongside with various other metrics. It uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system. This is also known as just **sampling**.
+**Statistical sampling**  
+  At periodic intervals, the application is paused and the current call-stack of the CPU 
+  is recorded alongside various other metrics. It uses timers that measure either (A) real clock time or (B) the CPU time used by the current thread and the CPU time expended on behalf of the thread by the system. This is also known as just **sampling**.
 
-  **Sampling Rate**
+  **Sampling rate**
     * The period at which (A) or (B) are triggered (in units of ``# interrupts / second``)
     * Higher values increase the number of samples
 
-  **Sampling Delay**
+  **Sampling delay**
     * How long to wait before (A) and (B) begin triggering at their designated rate
 
-  **Sampling Duration**
-    * The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
+  **Sampling duration**
+    * The time (in real-time) after the start of the application to record samples. 
+    * Once this time limit has been reached, no more samples will be recorded.
 
-**Process Sampling**
-  At periodic (realtime) intervals, a background thread records global metrics without 
+**Process sampling**
+  At periodic (real-time) intervals, a background thread records global metrics without 
   interrupting the current process. These metrics include, but are not limited to: 
   CPU frequency, CPU memory high-water mark (i.e. peak memory usage), GPU Temperature,
   and GPU Power usage.
 
-  **Sampling Rate**
-    * The realtime period for recording metrics (in units of ``# measurements / second``)
+  **Sampling rate**
+    * The real-time period for recording metrics (in units of ``# measurements / second``)
     * Higher values increase the number of samples
 
-  **Sampling Delay**
-    * How long to wait (in realtime) before recording samples
+  **Sampling delay**
+    * How long to wait (in real-time) before recording samples
 
-  **Sampling Duration**
-    * The time (in realtime) after the start of the application to record samples. Once this time limit has been reached, no more samples will be recorded.
+  **Sampling duration**
+    * The time (in real-time) after the start of the application to record samples. 
+    * Once this time limit has been reached, no more samples will be recorded.
 
 **Module**
   With respect to binary instrumentation, a module is defined as either the filename 
   (such as ``foo.c``) or library name (``libfoo.so``) which contains the definition 
   of one or more functions.
 
-  With respect to Python instrumentation, a module is defined as the **file** which contains the definition of one or more functions. The full path to this file typically contains the name of the "Python module".
+  With respect to Python instrumentation, a module is defined as the **file** which contains 
+  the definition of one or more functions. The full path to this file typically contains the 
+  name of the "Python module".
 
-**Basic Block**
+**Basic block**
   Straight-line code sequence with no branches in (except for the entry) and 
   no branches out (except for the exit).
 
-**Address Range**
+**Address range**
   The instructions for a function in a binary start at certain address with the ELF file and end at a certain address. The range is ``end - start``.
 
   The address range is a decent approximation for the "cost" of a function. 
   For example, a larger address range approximately equates to more instructions.
 
-**Instrumentation Traps**
+**Instrumentation traps**
   On the x86 architecture, because instructions are of variable size, the instruction 
   at a point may be too small for Dyninst to replace it with the normal code sequence 
   used to call instrumentation. When instrumentation is placed at points other 
   than subroutine entry, exit, or call points, traps may be used to ensure 
-  the instrumentation fits. (By default, omnitrace-instrument avoids instrumentation 
+  the instrumentation fits. (By default, ``omnitrace-instrument`` avoids instrumentation 
   which requires using a trap.)
 
 **Overlapping functions**
   Due to language constructs or compiler optimizations, it may be possible for 
   multiple functions to overlap (that is, share part of the same function body) 
   or for a single function to have multiple entry points. In practice, it is 
   impossible to determine the difference between multiple overlapping functions 
-  and a single function with multiple entry points. (By default, omnitrace-instrument avoids instrumenting overlapping functions.)
+  and a single function with multiple entry points. (By default, ``omnitrace-instrument`` 
+  avoids instrumenting overlapping functions.)
 
-General Tips for Using Omnitrace
+General tips for using Omnitrace
 ========================================
 
 * Use ``omnitrace-avail`` to lookup configuration settings, hardware counters, and data collection components
@@ -110,7 +122,7 @@ General Tips for Using Omnitrace
 * Use binary instrumentation for characterizing the performance of every invocation of specific functions
 * Use statistical sampling to characterize the performance of the entire application while minimizing overhead
 * Enable statistical sampling after binary instrumentation to help "fill in the gaps" between instrumented regions
-* Use the user API to create custom regions, enable/disable omnitrace to specific processes, threads, and/or regions
+* Use the user API to create custom regions and to enable or disable Omnitrace for specific processes, threads, and/or regions
 * Dynamic symbol interception, callback APIs, and the user API are always available with binary instrumentation and sampling
 
   * Dynamic symbol interception and callback APIs are (generally) controlled through ``OMNITRACE_USE_<API>`` options, e.g. ``OMNITRACE_USE_KOKKOSP``, ``OMNITRACE_USE_OMPT`` enable Kokkos-Tools and OpenMP-Tools callbacks, respectively
@@ -122,7 +134,7 @@ General Tips for Using Omnitrace
   * When call-counts are high, improving the performance of this function or "inlining" the function can be quick and easy performance improvements
   * When the standard-deviation is high, collect a hierarchical profile and see if the high variation can be attributable to the calling context. In this scenario, consider creating a specialized version for the function for the longer running contexts
   * Collect a hierarchical profile and, keeping the flat-profiling data in mind, verify the functions noted in the flat profile are part of the "critical path" of your application
-  * E.g. function(s) with high call counts, etc. which are part of a "setup" or "post-processing" phase which does not consume much time relative to the overall time is, generally, a lower priority for optimization
+  * For example, functions with high call counts that are part of a "setup" or "post-processing" phase that consumes little time relative to the overall runtime are generally a lower priority for optimization
 
 * Use the information from the profiles when analyzing detailed traces
 * When using binary instrumentation in the "trace" mode, the binary rewrites are preferable to runtime instrumentation.
@@ -134,10 +146,10 @@ General Tips for Using Omnitrace
   * Runtime instrumentation requires a fork + ptrace: which is generally incompatible with how MPI applications spawn their processes
   * Binary rewrite the executable using MPI (and, optionally, libraries used by the executable) and execute the generated instrumented executable via ``omnitrace-run`` instead of the original, e.g. ``mpirun -n 2 ./myexe`` should be ``mpirun -n 2 omnitrace-run -- ./myexe.inst`` where ``myexe.inst`` is the generated instrumented ``myexe`` executable.
 
-Data Collection Modes
+Data collection modes
 ========================================
 
-OmniTrace supports several modes of recording trace and profiling data for your application:
+Omnitrace supports several modes of recording trace and profiling data for your application:
 
 +-----------------------------+---------------------------------------------------------+
 | Mode                        | Description                                             |
@@ -149,7 +161,7 @@ OmniTrace supports several modes of recording trace and profiling data for your
 |                             | and records various metrics for the given call-stack    |
 +-----------------------------+---------------------------------------------------------+
 | Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
-|                             | make callbacks into omnitrace to provide information    |
+|                             | make callbacks into Omnitrace to provide information    |
 |                             | about the work the API is performing                    |
 +-----------------------------+---------------------------------------------------------+
 | Dynamic Symbol Interception | Wrap function symbols defined in position independent   |
@@ -166,15 +178,15 @@ executable but for statistical sampling, it is highly recommended to use the
 ``omnitrace-sample`` executable instead if no binary instrumentation is required/desired. 
 With either tool, the callback APIs and dynamic symbol interception can be utilized.
 
-Binary Instrumentation
+Binary instrumentation
 -----------------------------------
 
-Binary instrumentation will allow one to deterministically record measurements for 
+Binary instrumentation will allow one to record deterministic measurements for 
 every single invocation of a given function.
 Binary instrumentation effectively adds instructions to the target application to 
 collect the required information and, thus, has the potential to cause performance 
 changes which may, in some cases, lead to inaccurate results. The effect depends on 
-what information being collected and which features are activated in omnitrace. 
+what information is being collected and which features are activated in Omnitrace. 
 For example, collecting only the wall-clock timing data
 will have less effect than collected the wall-clock timing, cpu-clock timing, 
 memory usage, cache-misses, and number of instructions executed. Similarly, 
@@ -186,14 +198,14 @@ In Omnitrace, the primary heuristic for controlling the overhead with binary
 instrumentation is the minimum number of instructions for selecting functions 
 for instrumentation.
 
-Statistical Sampling
+Statistical sampling
 -----------------------------------
 
 Statistical call-stack sampling periodically interrupts the application at 
 regular intervals using operating system interrupts.
 Sampling is typically less numerically accurate and specific, but allows the 
 target program to run at near full speed.
-In constrast to the data derived from binary instrumentation, the resulting 
+In contrast to the data derived from binary instrumentation, the resulting 
 data is not exact but, instead, a statistical approximation.
 However, sampling often provides a more accurate picture of the application 
 execution because it is less intrusive to the target application and has fewer
@@ -206,27 +218,27 @@ In Omnitrace, the overhead for statistical sampling is a factor of the
 sampling rate and whether the samples are taken with respect to the CPU time 
 and/or real time.
 
-Binary Instrumentation vs. Statistical Sampling Example
+Binary instrumentation vs. statistical sampling example
 -------------------------------------------------------
 
 Consider the following code:
 
 .. code:: cpp
 
-    long fib(long n)
-    {
+   long fib(long n)
+   {
         if(n < 2) return n;
         return fib(n - 1) + fib(n - 2);
-    }
+   }
 
-    void run(long n)
-    {
+   void run(long n)
+   {
         long result = fib(nfib);
         printf("[%li] fibonacci(%li) = %li\n", i, nfib, result);
-    }
+   }
 
-    int main(int argc, char** argv)
-    {
+   int main(int argc, char** argv)
+   {
         long nfib = 30;
         long nitr = 10;
         if(argc > 1) nfib = atol(argv[1]);
@@ -236,7 +248,7 @@ Consider the following code:
             run(nfib);
 
         return 0;
-    }
+   }
 
 Binary instrumentation of the ``fib`` function will record **every single invocation** 
 of the function -- which for a very small function
@@ -259,14 +271,14 @@ Consider the level of details in the following traces where, in the top image,
 every instance of the ``fib`` function was instrumented vs. the bottom image
 where the ``fib`` call-stack was derived via sampling:
 
-Binary Instrumentation of Fibonacci Function
+Binary instrumentation of the Fibonacci function
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. image:: ../data/fibonacci-instrumented.png
-   :alt: Visualization of the output of a binary instrumentation of the Fibonacci fucnction
+   :alt: Visualization of the output of a binary instrumentation of the Fibonacci function
 
-Statistical Sampling of Fibonacci Function
+Statistical sampling of the Fibonacci function
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. image:: ../data/fibonacci-sampling.png
-   :alt: Visualization of the output of a statistical sample of the Fibonacci fucnction
+   :alt: Visualization of the output of a statistical sample of the Fibonacci function
diff --git a/docs/conceptual/omnitrace-feature-set.rst b/docs/conceptual/omnitrace-feature-set.rst
@@ -11,7 +11,7 @@ Internally, it leverages the `timemory performance analysis toolkit <https://git
 to manage extensions, resources, data, and other items. It supports the following features, 
 modes, metrics, and APIs.
 
-Data Collection Modes
+Data collection modes
 ========================================
 
 * Dynamic instrumentation
@@ -23,18 +23,18 @@ Data Collection Modes
 * Process-level sampling: Background thread records process-, system- and device-level metrics while the application executes
 * Causal profiling: Quantifies the potential impact of optimizations in parallel codes
 
-Data Analysis
+Data analysis
 ========================================
 
-* High-level summary profiles with mean/min/max/stddev statistics
+* High-level summary profiles with mean/min/max/standard deviation statistics
 
   * Low overhead, memory efficient
   * Ideal for running at scale
 
 * Comprehensive traces for every individual event/measurement
 * Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
 
-Parallelism API Support
+Parallelism API support
 ========================================
 
 * HIP
@@ -44,7 +44,7 @@ Parallelism API Support
 * Kokkos-Tools (KokkosP)
 * OpenMP-Tools (OMPT)
 
-GPU Metrics
+GPU metrics
 ========================================
 
 * GPU hardware counters
@@ -59,7 +59,7 @@ GPU Metrics
   * Temperature
   * Utilization
 
-CPU Metrics
+CPU metrics
 ========================================
 
 * CPU hardware counters sampling and profiles
@@ -98,7 +98,7 @@ Omnitrace use cases
 
 When analyzing the performance of an application, it is always best to NOT 
 assume you know where the performance bottlenecks are
-and why they are happening. OmniTrace is a tool for the entire execution 
+and why they are happening. Omnitrace is a tool for the entire execution 
 of application. It is the sort of tool which is
 ideal for characterizing where optimization would have the greatest impact 
 on the end-to-end execution of the application and/or
@@ -112,8 +112,8 @@ to 1 microsecond (1000x speed-up) but the original application never
 spent time waiting for kernel(s) to complete,
 you will see zero statistically significant speed-up in end-to-end 
 runtime of your application. In other words, it does not matter
-how fast or slow the code on GPU is if the application is not 
-bottlenecked waiting on the GPU.
+how fast or slow the code on GPU is if the application does not have a 
+bottleneck waiting on the GPU.
 
 Use OmniTrace to obtain a high-level view of the entire application. Use it 
 to determine where the performance bottlenecks are and