Adds aggregation across metrics for failed/succeeded and non completed stages (#1558)
Fixes #1552
Currently, stage information is stored through the `stageModelManager`
class, which maps incoming stage information during the following events:
1. doSparkListenerStageCompleted
https://github.com/NVIDIA/spark-rapids-tools/blob/1f037fa867e4df0952e29d82164cc7fc507c9b4e/core/src/main/scala/org/apache/spark/sql/rapids/tool/EventProcessorBase.scala#L475
2. doSparkListenerStageSubmitted
https://github.com/NVIDIA/spark-rapids-tools/blob/1f037fa867e4df0952e29d82164cc7fc507c9b4e/core/src/main/scala/org/apache/spark/sql/rapids/tool/EventProcessorBase.scala#L464
So the information for a stage is updated twice: once when the stage is
submitted and once when it completes.
A StageCompleted event is emitted for every attempt of a stage (e.g., a
stage that fails on its first attempt and succeeds on its second produces
two StageSubmitted and two StageCompleted events).
This PR changes that behavior to aggregate metrics across all attempts of
a stage (failed + succeeded).
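
To make the event sequence concrete, here is a minimal sketch of a stage that fails on its first attempt and succeeds on its second. The `StageEvent` types and the `completedAttempts` helper are hypothetical stand-ins for illustration, not the tool's actual classes or Spark's listener events.

```scala
// Hypothetical, simplified event model (assumption: not the tool's real types).
sealed trait StageEvent { def stageId: Int; def attemptId: Int }
final case class StageSubmitted(stageId: Int, attemptId: Int) extends StageEvent
final case class StageCompleted(
    stageId: Int,
    attemptId: Int,
    failureReason: Option[String]) extends StageEvent

object StageEventDemo {
  // Stage 3 fails on attempt 0 and succeeds on attempt 1, so each event
  // type appears twice for the same stage ID.
  val events: Seq[StageEvent] = Seq(
    StageSubmitted(3, 0),
    StageCompleted(3, 0, Some("FetchFailed")),
    StageSubmitted(3, 1),
    StageCompleted(3, 1, None))

  // Collect every completed attempt per stage, failed or not.
  def completedAttempts(evts: Seq[StageEvent]): Map[Int, Seq[StageCompleted]] =
    evts.collect { case c: StageCompleted => c }.groupBy(_.stageId)

  def main(args: Array[String]): Unit = {
    completedAttempts(events).foreach { case (stageId, attempts) =>
      val failed = attempts.count(_.failureReason.isDefined)
      println(s"stage $stageId: ${attempts.size} attempts, $failed failed")
    }
  }
}
```

Any per-stage aggregation therefore has to consider both entries for stage 3, rather than only one of them.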
### Changes
This pull request includes several changes to improve the handling of
stage attempts and task metrics in the Spark RAPIDS tool. The most
important changes include adding logic to handle multiple stage
attempts, modifying methods to aggregate metrics for these attempts, and
updating the `AccumManager` to simplify task accumulation.
Handling multiple stage attempts:
* [`core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala`](diffhunk://#diff-4b0aab10a86746bb7480cc3bde4e013c04707758c61782934c07604443160b40L450-R455):
  Added logic to handle multiple stage attempts by aggregating metrics for
  each attempt.
* [`core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala`](diffhunk://#diff-8d5819c9445c1489d61ee8d03fd2b1ee1e0cb33896f402f4ceb7782c35deed69R688-R746):
  Introduced `aggregateStageProfileMetric` method to combine metrics for
  multiple attempts of the same stage.
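
As a rough illustration of combining per-attempt rows, the sketch below merges metrics for attempts that share a stage ID. The `StageMetrics` case class, its fields, and the choice to sum additive counters while taking the maximum of the peak value are all assumptions made for this example; the real `aggregateStageProfileMetric` may use different fields and rules.

```scala
// Hypothetical per-attempt metrics row (assumption: not the real profile class).
final case class StageMetrics(
    stageId: Int,
    numTasks: Int,
    durationMs: Long,
    executorRunTimeMs: Long,
    peakExecutionMemory: Long)

object StageMetricsAggregator {
  // Combine two attempts of the same stage: additive counters are summed,
  // while the peak value keeps the larger of the two.
  def combine(a: StageMetrics, b: StageMetrics): StageMetrics = {
    require(a.stageId == b.stageId, "can only combine attempts of the same stage")
    StageMetrics(
      stageId = a.stageId,
      numTasks = a.numTasks + b.numTasks,
      durationMs = a.durationMs + b.durationMs,
      executorRunTimeMs = a.executorRunTimeMs + b.executorRunTimeMs,
      peakExecutionMemory = math.max(a.peakExecutionMemory, b.peakExecutionMemory))
  }

  // Produce one aggregated row per stage from rows covering all attempts.
  def aggregateByStage(rows: Seq[StageMetrics]): Map[Int, StageMetrics] =
    rows.groupBy(_.stageId).map { case (id, attempts) =>
      id -> attempts.reduce(combine)
    }
}

object StageMetricsDemo {
  val rows: Seq[StageMetrics] = Seq(
    StageMetrics(3, 10, 1200L, 8000L, 512L), // failed first attempt
    StageMetrics(3, 10, 900L, 7000L, 256L),  // successful second attempt
    StageMetrics(4, 5, 300L, 1000L, 128L))   // single-attempt stage

  def main(args: Array[String]): Unit =
    StageMetricsAggregator.aggregateByStage(rows).toSeq.sortBy(_._1).foreach {
      case (_, m) => println(m)
    }
}
```

A stage with a single attempt passes through unchanged, since `reduce` on a one-element sequence returns that element.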
Simplifying task accumulation:
* [`core/src/main/scala/org/apache/spark/sql/rapids/tool/EventProcessorBase.scala`](diffhunk://#diff-9b551b7ad326fd9175e0c5b0ba69e947058ee2587922f1fe059e85623604e9c1L372-R372):
  Modified `addAccToTask` method to remove the `taskId` parameter and
  simplify task accumulation.
* [`core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumInfo.scala`](diffhunk://#diff-2cdf5cec29c5cfc15962382b2134c8e88b6623afdfd7cc6a81ec3dfc5663b4a1L87-R89):
  Updated `addAccumToTask` method to remove the `taskId` parameter.
* [`core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumManager.scala`](diffhunk://#diff-ff390301f53c6470012e1c36878c1987f176c7eeaa52e30e18f93f76e58587b3L43-R45):
  Simplified `addAccToTask` method by removing the `taskId` parameter.
### Testing
This change has been tested against internal event logs, and the
integration tests have been updated to cover this behavior going forward.
---------
Signed-off-by: Sayed Bilal Bari <[email protected]>