-
Notifications
You must be signed in to change notification settings - Fork 44
[FEA] Add filtered diagnostic output for GPU slowness in Profiler tool #1548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few doubts
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/util/DiagnosticMetrics.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/util/DiagnosticMetrics.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
Show resolved
Hide resolved
core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall ! My two main doubts -
- Better to verify whether the mapping between accumID and nodeID is one-one
- Everywhere we are using SQL Metrics accumulables ( scan time, fetch time, etc. ) instead of internal.metrics.shuffleWaitTime etc. Would be good to cross check this part about how trustworthy the SQL accum updates vs the task accum updates are
* Retrieves the top 7 stage IDs with the highest execution durations. | ||
* 7 is a magic number that was chosen from past experiences. | ||
*/ | ||
def topDurationStageIds: Set[Int] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we querying this set multiple times? Should we make this a lazy val?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @parthosa, updated this.
Had an offline discussion with @bilalbari yesterday. I will explain the logic of this PR here for future reviewers to better understand the changes. This PR wants to generate a filtered diagnostic SQL metrics view where each line is uniquely identified by One way is to reuse the results from
The caveat of using only this table is that each metric (row) in this table can associate with multiple stage IDs. So to translate this table to the desired diagnostic view, I need to recompute the metric results for those rows with multiple stage IDs. There is another table
This table contains metrics statistics which are exactly what we need, but the caveat is that the table has no node or SQL info. The approach of this PR is to utilize both tables which shares a unique identification Steps:
A row in the diagnostic view will look like:
cc: @parthosa @amahussein |
Signed-off-by: cindyyuanjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cindyyuanjiang
LGTME!
…iler tool (NVIDIA#1548)" This reverts commit 97cd249. PR NVIDIA#1548 is reverted because it has 14% negative impact on the performance of the tools which is a considerable amount. For now, the PR is being reverted until the implementation is revisited. Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
…iler tool (#1548)" (#1602) This reverts commit 97cd249. PR #1548 is reverted because it has 14% negative impact on the performance of the tools which is a considerable amount. For now, the PR is being reverted until the implementation is revisited. Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Contributes to #1374
Changes
filtered_diagnostic_metrics.csv
generateFilteredDiagnosticAccums
to compute filtered diagnostic metrics incore/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSQLPlanAnalyzer.scala
FilteredAccumDiagnosticMetrics
to store selected filtered related metric names and methodsFilteredDiagnosticResult
to represent filtered diagnostic metrics resultTesting
test filtered diagnostic metrics
incore/src/test/scala/com/nvidia/spark/rapids/tool/profiling/AnalysisSuite.scala
Example Output
Note:
For details about the logic of this PR, please see comment below.