[FEA] Add filtered diagnostic output for GPU slowness in Profiler tool #1548

Merged
10 commits merged into NVIDIA:dev on Mar 25, 2025

Conversation

@cindyyuanjiang (Collaborator) commented Feb 18, 2025

Contributes to #1374

Changes

  • Add a filtered diagnostic (operator and SQL metrics) view to the Profiler output: filtered_diagnostic_metrics.csv
    • Introduced the function generateFilteredDiagnosticAccums, which computes the filtered diagnostic metrics, in core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
    • Introduced FilteredAccumDiagnosticMetrics to hold the selected metric names and related helper methods (sketched after this list)
    • Introduced FilteredDiagnosticResult to represent a filtered diagnostic metrics result
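
For orientation, here is a minimal sketch of what the metric-name filter could look like. Only "output rows" and "output columnar batches" appear verbatim in the tables later in this conversation; the other names are illustrative guesses derived from the CSV columns, and the PR's actual FilteredAccumDiagnosticMetrics defines the authoritative set:

// Hedged sketch; the metric names below are illustrative, not the PR's exact set.
object FilteredAccumDiagnosticMetricsSketch {
  val filteredMetricNames: Set[String] = Set(
    "number of files read", "number of partitions", "metadata time",
    "output columnar batches", "input columnar batches", "output rows",
    "sort time", "peak memory", "shuffle bytes written", "shuffle write time")

  // True when an accumulable's name is one of the selected diagnostic metrics.
  def isFilteredMetric(name: String): Boolean = filteredMetricNames.contains(name)
}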

Testing

  • Added a unit test, test filtered diagnostic metrics, in core/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala

Example Output

appIndex,appName,appId,sqlId,stageId,stageDurationMs,nodeId,nodeName,numFilesReadMin,numFilesReadMedian,numFilesReadMax,numFilesReadTotal,numPartitionsMin,numPartitionsMedian,numPartitionsMax,numPartitionsTotal,metadataTimeMin,metadataTimeMedian,metadataTimeMax,metadataTimeTotal,outputBatchesMin,outputBatchesMedian,outputBatchesMax,outputBatchesTotal,inputBatchesMin,inputBatchesMedian,inputBatchesMax,inputBatchesTotal,outputRowsMin,outputRowsMedian,outputRowsMax,outputRowsTotal,sortTimeMin,sortTimeMedian,sortTimeMax,sortTimeTotal,peakMemoryMin,peakMemoryMedian,peakMemoryMax,peakMemoryTotal,shuffleBytesWrittenMin,shuffleBytesWrittenMedian,shuffleBytesWrittenMax,shuffleBytesWrittenTotal,shuffleWriteTimeMin,shuffleWriteTimeMedian,shuffleWriteTimeMax,shuffleWriteTimeTotal
1,Spark shell,local-1622814619968,0,0,1743,16,"GpuColumnarExchange",0,0,0,0,0,0,0,0,0,0,0,0,200,200,200,1200,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,6688608,6688707,6688825,40132250,41434653,66714083,100858775,400284505
1,Spark shell,local-1622814619968,0,0,1743,21,"Scan",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,1,1631,8,"GpuColumnarExchange",0,0,0,0,0,0,0,0,0,0,0,0,200,200,200,1200,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,6688602,6688708,6688833,40132258,37444140,84791745,108992798,508750471
1,Spark shell,local-1622814619968,0,1,1631,13,"Scan",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1666666,1666666,1666667,10000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Spark shell,local-1622814619968,0,2,688,3,"GpuColumnarExchange",0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,200,0,0,0,0,1,1,1,200,0,0,0,0,0,0,0,0,77,77,77,15400,139875,465911,9747416,93193331
1,Spark shell,local-1622814619968,0,2,688,4,"GpuHashAggregate",0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,200,0,0,0,0,1,1,1,200,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Note:
For details about the logic of this PR, please see the comment below.

Signed-off-by: cindyyuanjiang <[email protected]>
@cindyyuanjiang changed the title from "WIP: [FEA] Add filtered diagnostic output for GPU slowness in Profiler tool" to "[FEA] Add filtered diagnostic output for GPU slowness in Profiler tool" on Feb 24, 2025
@cindyyuanjiang self-assigned this on Feb 24, 2025
@cindyyuanjiang added the feature request (New feature or request) label on Feb 24, 2025
@cindyyuanjiang marked this pull request as ready for review on February 24, 2025 at 06:10
@sayedbilalbari (Collaborator) left a comment:

Just a few doubts

@sayedbilalbari (Collaborator) left a comment:

Looks good overall! My two main doubts:

  1. Better to verify whether the mapping between accumID and nodeID is one-to-one.
  2. Everywhere we are using SQL Metrics accumulables (scan time, fetch time, etc.) instead of internal.metrics.shuffleWaitTime, etc. It would be good to cross-check how trustworthy the SQL accumulator updates are versus the task accumulator updates.

/**
 * Retrieves the top 7 stage IDs with the highest execution durations.
 * 7 is a magic number that was chosen from past experience.
 */
def topDurationStageIds: Set[Int] = {
@parthosa (Collaborator) left a comment:

Are we querying this set multiple times? Should we make this a lazy val?
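
A minimal sketch of the suggested change, assuming the stage records expose a stageId and a duration (the record and field names here are illustrative, not the PR's exact code). A lazy val is computed once on first access and then cached, so repeated queries skip the sort:

// Illustrative stage record; the real code reads this from the stage manager.
case class StageInfoSketch(stageId: Int, duration: Long)

class StageDurationsSketch(stageInfos: Seq[StageInfoSketch]) {
  // Computed once, then cached: top 7 stage IDs by execution duration.
  lazy val topDurationStageIds: Set[Int] =
    stageInfos.sortBy(-_.duration).take(7).map(_.stageId).toSet
}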

@cindyyuanjiang (Collaborator, Author) replied:

thanks @parthosa, updated this.

@cindyyuanjiang (Collaborator, Author) commented Mar 14, 2025

Had an offline discussion with @bilalbari yesterday. I will explain the logic of this PR here for future reviewers to better understand the changes.

This PR generates a filtered diagnostic SQL metrics view in which each row is uniquely identified by (SQL ID, stage ID, node ID) and contains the metric statistics for that specific node (e.g. HashAggregate) in that stage of that SQL query. These metric results are already produced by the Profiler tool; this PR reuses and reorganizes them without costing too much runtime or memory.

One option is to reuse the results from the SQL Plan Metrics for Application table, which looks like:

+--------+-----+------+---------------------+-------------+-----------------------+--------+--------+---------+----------+----------+--------+
|appIndex|sqlID|nodeID|nodeName             |accumulatorId|name                   |min     |median  |max      |total     |metricType|stageIds|
+--------+-----+------+---------------------+-------------+-----------------------+--------+--------+---------+----------+----------+--------+
|1       |0    |3     |GpuColumnarExchange  |46           |output rows            |1       |1       |1        |200       |sum       |2,3     |
|1       |0    |3     |GpuColumnarExchange  |47           |output columnar batches|1       |1       |1        |200       |sum       |2,3     |
+--------+-----+------+---------------------+-------------+-----------------------+--------+--------+---------+----------+----------+--------+

The caveat of using only this table is that each metric (row) can be associated with multiple stage IDs, so translating it into the desired diagnostic view requires recomputing the metric results for the rows that span multiple stage IDs.

There is another table, Stage Level All Metrics, which also contains metric statistics (at the stage level) for each accumulator ID and looks like:

+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+
|appIndex|stageId|accumulatorId|name                                               |min      |median   |max      |total     |
+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+
|1       |2      |46           |output rows                                        |1        |1        |1        |200       |
|1       |2      |47           |output columnar batches                            |1        |1        |1        |200       |
+--------+-------+-------------+---------------------------------------------------+---------+---------+---------+----------+

This table contains exactly the metric statistics we need, but it carries no node or SQL info.

The approach of this PR is to use both tables, which share accumulatorId as a unique identifier: we reuse the metric statistics from Stage Level All Metrics and the node and SQL info from SQL Plan Metrics for Application.

Steps:

  1. When generating results for Stage Level All Metrics, filter (based on the metrics we want) and store the results in a separate list, filteredStageAccumList.
  2. When generating results for SQL Plan Metrics for Application, construct a map accumIdToNodeInfoMap from accumulator ID to (SQL ID, node ID, node name).
  3. Iterate through filteredStageAccumList (this list contains all the metrics we want but is missing the node and SQL info).
  4. Construct a map nodeInfoToFilterMetricsMap from (SQL ID, node ID, node name, stage ID) to the metric statistics (stored as a map from metric name to statistics); each entry in this map becomes one row in the final diagnostic view.
  5. Iterate through nodeInfoToFilterMetricsMap to generate the list of diagnostic results (see the sketch after this list).
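
Put together, the steps amount to a join of the two result sets on accumulatorId followed by a group-by on (SQL ID, node ID, node name, stage ID). A condensed, runnable sketch with illustrative types, seeded with the sample values from the two tables above (the PR's actual implementation lives in AppSQLPlanAnalyzer.scala):

object FilteredDiagnosticSketch extends App {
  // Illustrative row types, not the PR's exact definitions.
  case class StageAccum(stageId: Int, accumulatorId: Long, name: String,
      min: Long, median: Long, max: Long, total: Long)
  case class NodeInfo(sqlId: Long, nodeId: Long, nodeName: String, stageId: Int)

  // Step 2: accumulator ID -> (SQL ID, node ID, node name), built while
  // generating "SQL Plan Metrics for Application".
  val accumIdToNodeInfoMap: Map[Long, (Long, Long, String)] = Map(
    46L -> (0L, 3L, "GpuColumnarExchange"),
    47L -> (0L, 3L, "GpuColumnarExchange"))

  // Step 1: stage-level accumulables kept only for the selected diagnostic
  // metrics, collected while generating "Stage Level All Metrics".
  val filteredStageAccumList: Seq[StageAccum] = Seq(
    StageAccum(2, 46L, "output rows", 1, 1, 1, 200),
    StageAccum(2, 47L, "output columnar batches", 1, 1, 1, 200))

  // Steps 3-4: attach node/SQL info to each stage-level accumulable, then
  // group the per-metric (min, median, max, total) statistics under a
  // (SQL ID, node ID, node name, stage ID) key.
  val nodeInfoToFilterMetricsMap: Map[NodeInfo, Map[String, (Long, Long, Long, Long)]] =
    filteredStageAccumList.flatMap { acc =>
      accumIdToNodeInfoMap.get(acc.accumulatorId).map { case (sqlId, nodeId, nodeName) =>
        (NodeInfo(sqlId, nodeId, nodeName, acc.stageId),
          acc.name -> (acc.min, acc.median, acc.max, acc.total))
      }
    }.groupBy(_._1).map { case (node, rows) => node -> rows.map(_._2).toMap }

  // Step 5: each entry becomes one row of the diagnostic view.
  nodeInfoToFilterMetricsMap.foreach { case (node, metrics) =>
    println(s"$node -> $metrics")
  }
}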

A row in the diagnostic view will look like:

+--------+-----------+-------------------+-----+-------+---------------+------+-------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+-------------+----------------+-------------+---------------+-----------+--------------+-----------+-------------+-------------+----------------+-------------+---------------+----------------------+-------------------------+----------------------+------------------------+-------------------+----------------------+-------------------+---------------------+
|appIndex|appName    |appId              |sqlId|stageId|stageDurationMs|nodeId|nodeName           |numFilesReadMin|numFilesReadMedian|numFilesReadMax|numFilesReadTotal|numPartitionsMin|numPartitionsMedian|numPartitionsMax|numPartitionsTotal|metadataTimeMin|metadataTimeMedian|metadataTimeMax|metadataTimeTotal|outputBatchesMin|outputBatchesMedian|outputBatchesMax|outputBatchesTotal|inputBatchesMin|inputBatchesMedian|inputBatchesMax|inputBatchesTotal|outputRowsMin|outputRowsMedian|outputRowsMax|outputRowsTotal|sortTimeMin|sortTimeMedian|sortTimeMax|sortTimeTotal|peakMemoryMin|peakMemoryMedian|peakMemoryMax|peakMemoryTotal|shuffleBytesWrittenMin|shuffleBytesWrittenMedian|shuffleBytesWrittenMax|shuffleBytesWrittenTotal|shuffleWriteTimeMin|shuffleWriteTimeMedian|shuffleWriteTimeMax|shuffleWriteTimeTotal|
+--------+-----------+-------------------+-----+-------+---------------+------+-------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+-------------+----------------+-------------+---------------+-----------+--------------+-----------+-------------+-------------+----------------+-------------+---------------+----------------------+-------------------------+----------------------+------------------------+-------------------+----------------------+-------------------+---------------------+
|1       |Spark shell|local-1622814619968|0    |0      |1743           |16    |GpuColumnarExchange|0              |0                 |0              |0                |0               |0                  |0               |0                 |0              |0                 |0              |0                |200             |200                |200             |1200              |0              |0                 |0              |0                |1666666      |1666666         |1666667      |10000000       |0          |0             |0          |0            |0            |0               |0            |0              |6688608               |6688707                  |6688825               |40132250                |41434653           |66714083              |100858775          |400284505            |
+--------+-----------+-------------------+-----+-------+---------------+------+-------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+----------------+-------------------+----------------+------------------+---------------+------------------+---------------+-----------------+-------------+----------------+-------------+---------------+-----------+--------------+-----------+-------------+-------------+----------------+-------------+---------------+----------------------+-------------------------+----------------------+------------------------+-------------------+----------------------+-------------------+---------------------+

cc: @parthosa @amahussein

@cindyyuanjiang added the core_tools (Scope the core module (scala)) label on Mar 14, 2025
@amahussein (Collaborator) left a comment:

Thanks @cindyyuanjiang
LGTME!

@amahussein merged commit 97cd249 into NVIDIA:dev on Mar 25, 2025
14 checks passed
amahussein added a commit to amahussein/spark-rapids-tools that referenced this pull request Mar 26, 2025
…iler tool (NVIDIA#1548)"

This reverts commit 97cd249.

PR NVIDIA#1548 is reverted because it has a 14% negative impact on the
performance of the tools, which is a considerable amount.
For now, the PR is being reverted until the implementation is revisited.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
amahussein added a commit that referenced this pull request Mar 26, 2025
…iler tool (#1548)" (#1602)

This reverts commit 97cd249.

PR #1548 is reverted because it has a 14% negative impact on the
performance of the tools, which is a considerable amount. For now, the PR
is being reverted until the implementation is revisited.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Labels
core_tools (Scope the core module (scala)), feature request (New feature or request)
4 participants