[FEA] Add filtered diagnostic output for GPU slowness in Profiler tool #1548

Merged
10 commits merged into NVIDIA:dev on Mar 25, 2025

Conversation

@cindyyuanjiang (Collaborator) commented Feb 18, 2025

Contributes to #1374

Changes

  • Add a filtered diagnostic (operator and SQL metrics) view to the Profiler output: filtered_diagnostic_metrics.csv
    • Introduced the function generateFilteredDiagnosticAccums, which computes the filtered diagnostic metrics, in core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
    • Introduced FilteredAccumDiagnosticMetrics to hold the selected metric names and related helper methods (sketched after this list)
    • Introduced FilteredDiagnosticResult to represent a filtered diagnostic metrics result
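
For orientation, here is a minimal sketch of what the metric-name filter could look like. Only "output rows" and "output columnar batches" appear verbatim in the tables later in this conversation; the other names are illustrative guesses derived from the CSV columns, and the PR's actual FilteredAccumDiagnosticMetrics defines the authoritative set:

// Hedged sketch; the metric names below are illustrative, not the PR's exact set.
object FilteredAccumDiagnosticMetricsSketch {
  val filteredMetricNames: Set[String] = Set(
    "number of files read", "number of partitions", "metadata time",
    "output columnar batches", "input columnar batches", "output rows",
    "sort time", "peak memory", "shuffle bytes written", "shuffle write time")

  // True when an accumulable's name is one of the selected diagnostic metrics.
  def isFilteredMetric(name: String): Boolean = filteredMetricNames.contains(name)
}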

Testing

  • Added a unit test, test filtered diagnostic metrics, in core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala

Example Output

appIndex,appName,appId,sqlId,stageId,stageDurationMs,nodeId,nodeName,numFilesReadMin,numFilesReadMedian,numFilesReadMax,numFilesReadTotal,numPartitionsMin,numPartitionsMedian,numPartitionsMax,numPartitionsTotal,metadataTimeMin,metadataTimeMedian,metadataTimeMax,metadataTimeTotal,outputBatchesMin,outputBatchesMedian,outputBatchesMax,outputBatchesTotal,inputBatchesMin,inputBatchesMedian,inputBatchesMax,inputBatchesTotal,outputRowsMin,outputRowsMedian,outputRowsMax,outputRowsTotal,sortTimeMin,sortTimeMedian,sortTimeMax,sortTimeTotal,peakMemoryMin,peakMemoryMedian,peakMemoryMax,peakMemoryTotal,shuffleBytesWrittenMin,shuffleBytesWrittenMedian,shuffleBytesWrittenMax,shuffleBytesWrittenTotal,shuffleWriteTimeMin,shuffleWriteTimeMedian,shuffleWriteTimeMax,shuffleWriteTimeTotal
1,Spark shell,local-1622814619968,0,0,1743,16,"GpuColumnarExchange",0,0,0,0,0,0,0,0,0,0,0,0,200,200,200,1200,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,6688608,6688707,6688825,40132250,41434653,66714083,100858775,400284505
1,Spark shell,local-1622814619968,0,0,1743,21,"Scan",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,1,1631,8,"GpuColumnarExchange",0,0,0,0,0,0,0,0,0,0,0,0,200,200,200,1200,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,6688602,6688708,6688833,40132258,37444140,84791745,108992798,508750471
1,Spark shell,local-1622814619968,0,1,1631,13,"Scan",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,2,688,3,"GpuColumnarExchange",0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,200,0,0,0,0,1,1,1,200,0,0,0,0,0,0,0,0,77,77,77,15400,139875,465911,9747416,93193331
1,Spark shell,local-1622814619968,0,2,688,4,"GpuHashAggregate",0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,200,0,0,0,0,1,1,1,200,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Note:
For details about the logic of this PR, please see the comment below.

Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang changed the title from "WIP: [FEA] Add filtered diagnostic output for GPU slowness in Profiler tool" to "[FEA] Add filtered diagnostic output for GPU slowness in Profiler tool" on Feb 24, 2025
@cindyyuanjiang self-assigned this on Feb 24, 2025
@cindyyuanjiang added the feature request (New feature or request) label on Feb 24, 2025
@cindyyuanjiang marked this pull request as ready for review on February 24, 2025 at 06:10
@sayedbilalbari (Collaborator) left a comment:

Just a few doubts

@sayedbilalbari (Collaborator) left a comment:

Looks good overall! My two main doubts:

  1. Better to verify whether the mapping between accumID and nodeID is one-to-one.
  2. Everywhere we are using SQL Metrics accumulables (scan time, fetch time, etc.) instead of internal.metrics.shuffleWaitTime, etc. It would be good to cross-check how trustworthy the SQL accumulator updates are versus the task accumulator updates.

/**
 * Retrieves the top 7 stage IDs with the highest execution durations.
 * 7 is a magic number that was chosen from past experience.
 */
def topDurationStageIds: Set[Int] = {
@parthosa (Collaborator) left a comment:

Are we querying this set multiple times? Should we make this a lazy val?
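
A minimal sketch of the suggested change, assuming the stage records expose a stageId and a duration (the record and field names here are illustrative, not the PR's exact code). A lazy val is computed once on first access and then cached, so repeated queries skip the sort:

// Illustrative stage record; the real code reads this from the stage manager.
case class StageInfoSketch(stageId: Int, duration: Long)

class StageDurationsSketch(stageInfos: Seq[StageInfoSketch]) {
  // Computed once, then cached: top 7 stage IDs by execution duration.
  lazy val topDurationStageIds: Set[Int] =
    stageInfos.sortBy(-_.duration).take(7).map(_.stageId).toSet
}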

@cindyyuanjiang (Collaborator, Author) replied:

thanks @parthosa, updated this.

@cindyyuanjiang (Collaborator, Author) commented Mar 14, 2025

Had an offline discussion with @bilalbari yesterday. I will explain the logic of this PR here for future reviewers to better understand the changes.

This PR generates a filtered diagnostic SQL metrics view in which each row is uniquely identified by (SQL ID, stage ID, node ID) and contains the metric statistics for that specific node (e.g. HashAggregate) in that stage of that SQL query. These metric results are already produced by the Profiler tool; this PR reuses and reorganizes them without costing too much runtime or memory.

One option is to reuse the results from the SQL Plan Metrics for Application table, which looks like:

+--------+-----+------+---------------------+-------------+-----------------------+--------+--------+---------+----------+----------+--------+
|appIndex|sqlID|nodeID|nodeName             |accumulatorId|name                   |min     |median  |max      |total     |metricType|stageIds|
+--------+-----+------+---------------------+-------------+-----------------------+--------+--------+---------+----------+----------+--------+
|1       |0    |3     |GpuColumnarExchange  |46           |output rows            |1       |1       |1        |200       |sum       |2,3     |
|1       |0    |3     |GpuColumnarExchange  |47           |output columnar batches|1       |1       |1        |200       |sum       |2,3     |
+--------+-----+------+---------------------+-------------+-----------------------+--------+--------+---------+----------+----------+--------+

The caveat of using only this table is that each metric (row) can be associated with multiple stage IDs, so translating it into the desired diagnostic view requires recomputing the metric results for the rows that span multiple stage IDs.

There is another table, Stage Level All Metrics, which also contains metric statistics (at the stage level) for each accumulator ID and looks like:

+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+
|appIndex|stageId|accumulatorId|name                                               |min      |median   |max      |total     |
+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+
|1       |2      |46           |output rows                                        |1        |1        |1        |200       |
|1       |2      |47           |output columnar batches                            |1        |1        |1        |200       |
+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+

This table contains exactly the metric statistics we need, but it carries no node or SQL info.

The approach of this PR is to use both tables, which share accumulatorId as a unique identifier: we reuse the metric statistics from Stage Level All Metrics and the node and SQL info from SQL Plan Metrics for Application.

Steps:

  1. When generating results for Stage Level All Metrics, filter (based on the metrics we want) and store the results in a separate list, filteredStageAccumList.
  2. When generating results for SQL Plan Metrics for Application, construct a map accumIdToNodeInfoMap from accumulator ID to (SQL ID, node ID, node name).
  3. Iterate through filteredStageAccumList (this list contains all the metrics we want but is missing the node and SQL info).
  4. Construct a map nodeInfoToFilterMetricsMap from (SQL ID, node ID, node name, stage ID) to the metric statistics (stored as a map from metric name to statistics); each entry in this map becomes one row in the final diagnostic view.
  5. Iterate through nodeInfoToFilterMetricsMap to generate the list of diagnostic results (see the sketch after this list).
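
Put together, the steps amount to a join of the two result sets on accumulatorId followed by a group-by on (SQL ID, node ID, node name, stage ID). A condensed, runnable sketch with illustrative types, seeded with the sample values from the two tables above (the PR's actual implementation lives in AppSQLPlanAnalyzer.scala):

object FilteredDiagnosticSketch extends App {
  // Illustrative row types, not the PR's exact definitions.
  case class StageAccum(stageId: Int, accumulatorId: Long, name: String,
      min: Long, median: Long, max: Long, total: Long)
  case class NodeInfo(sqlId: Long, nodeId: Long, nodeName: String, stageId: Int)

  // Step 2: accumulator ID -> (SQL ID, node ID, node name), built while
  // generating "SQL Plan Metrics for Application".
  val accumIdToNodeInfoMap: Map[Long, (Long, Long, String)] = Map(
    46L -> (0L, 3L, "GpuColumnarExchange"),
    47L -> (0L, 3L, "GpuColumnarExchange"))

  // Step 1: stage-level accumulables kept only for the selected diagnostic
  // metrics, collected while generating "Stage Level All Metrics".
  val filteredStageAccumList: Seq[StageAccum] = Seq(
    StageAccum(2, 46L, "output rows", 1, 1, 1, 200),
    StageAccum(2, 47L, "output columnar batches", 1, 1, 1, 200))

  // Steps 3-4: attach node/SQL info to each stage-level accumulable, then
  // group the per-metric (min, median, max, total) statistics under a
  // (SQL ID, node ID, node name, stage ID) key.
  val nodeInfoToFilterMetricsMap: Map[NodeInfo, Map[String, (Long, Long, Long, Long)]] =
    filteredStageAccumList.flatMap { acc =>
      accumIdToNodeInfoMap.get(acc.accumulatorId).map { case (sqlId, nodeId, nodeName) =>
        (NodeInfo(sqlId, nodeId, nodeName, acc.stageId),
          acc.name -> (acc.min, acc.median, acc.max, acc.total))
      }
    }.groupBy(_._1).map { case (node, rows) => node -> rows.map(_._2).toMap }

  // Step 5: each entry becomes one row of the diagnostic view.
  nodeInfoToFilterMetricsMap.foreach { case (node, metrics) =>
    println(s"$node -> $metrics")
  }
}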

A row in the diagnostic view will look like:

+--------+-----------+-------------------+-----+-------+---------------+------+-------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+-------------+----------------+-------------+---------------+-----------+--------------+-----------+-------------+-------------+----------------+-------------+---------------+----------------------+-------------------------+----------------------+------------------------+-------------------+----------------------+-------------------+---------------------+
|appIndex|appName    |appId              |sqlId|stageId|stageDurationMs|nodeId|nodeName           |numFilesReadMin|numFilesReadMedian|numFilesReadMax|numFilesReadTotal|numPartitionsMin|numPartitionsMedian|numPartitionsMax|numPartitionsTotal|metadataTimeMin|metadataTimeMedian|metadataTimeMax|metadataTimeTotal|outputBatchesMin|outputBatchesMedian|outputBatchesMax|outputBatchesTotal|inputBatchesMin|inputBatchesMedian|inputBatchesMax|inputBatchesTotal|outputRowsMin|outputRowsMedian|outputRowsMax|outputRowsTotal|sortTimeMin|sortTimeMedian|sortTimeMax|sortTimeTotal|peakMemoryMin|peakMemoryMedian|peakMemoryMax|peakMemoryTotal|shuffleBytesWrittenMin|shuffleBytesWrittenMedian|shuffleBytesWrittenMax|shuffleBytesWrittenTotal|shuffleWriteTimeMin|shuffleWriteTimeMedian|shuffleWriteTimeMax|shuffleWriteTimeTotal|
+--------+-----------+-------------------+-----+-------+---------------+------+-------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+-------------+----------------+-------------+---------------+-----------+--------------+-----------+-------------+-------------+----------------+-------------+---------------+----------------------+-------------------------+----------------------+------------------------+-------------------+----------------------+-------------------+---------------------+
|1       |Spark shell|local-1622814619968|0    |0      |1743           |16    |GpuColumnarExchange|0              |0                 |0              |0                |0               |0                  |0               |0                 |0              |0                 |0              |0                |200             |200                |200             |1200              |0              |0                 |0              |0                |1666666      |1666666         |1666667      |10000000       |0          |0             |0          |0            |0            |0               |0            |0              |6688608               |6688707                  |6688825               |40132250                |41434653           |66714083              |100858775          |400284505            |
+--------+-----------+-------------------+-----+-------+---------------+------+-------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+-------------+----------------+-------------+---------------+-----------+--------------+-----------+-------------+-------------+----------------+-------------+---------------+----------------------+-------------------------+----------------------+------------------------+-------------------+----------------------+-------------------+---------------------+

cc: @parthosa @amahussein

@cindyyuanjiang added the core_tools (Scope the core module (scala)) label on Mar 14, 2025
@amahussein (Collaborator) left a comment:

Thanks @cindyyuanjiang
LGTME!

@amahussein merged commit 97cd249 into NVIDIA:dev on Mar 25, 2025
14 checks passed
amahussein added a commit to amahussein/spark-rapids-tools that referenced this pull request Mar 26, 2025
…iler tool (NVIDIA#1548)"

This reverts commit 97cd249.

PR NVIDIA#1548 is reverted because it has a 14% negative impact on the
performance of the tools, which is a considerable amount.
For now, the PR is being reverted until the implementation is revisited.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
amahussein added a commit that referenced this pull request Mar 26, 2025
…iler tool (#1548)" (#1602)

This reverts commit 97cd249.

PR #1548 is reverted because it has a 14% negative impact on the
performance of the tools, which is a considerable amount. For now, the PR
is being reverted until the implementation is revisited.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Labels
core_tools (Scope the core module (scala)), feature request (New feature or request)
4 participants