Adds aggregation across metrics for failed/succeeded and non completed stages #1558


Merged
merged 14 commits into from
Mar 10, 2025

Conversation

sayedbilalbari
Collaborator

@sayedbilalbari sayedbilalbari commented Feb 21, 2025

Fixes #1552

Currently we store stage info using the StageModelManager class, where we map incoming stage information during the following events:

  1. doSparkListenerStageCompleted
  2. doSparkListenerStageSubmitted

So stage information is updated once when a stage is submitted and once when it completes.
A StageCompleted event arrives for every attempt of a stage (e.g., there will be two StageSubmitted and StageCompleted events for a stage that fails on its first attempt and succeeds on its second).
This PR changes that behavior to aggregate all attempts for a stage (failed + succeeded).
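The aggregation described above can be sketched as follows. This is an illustrative example only; the names (StageAttempt, aggregateByStage) are hypothetical and not the tool's actual classes. The idea is to group all attempts by stage ID and sum a metric across them, instead of letting the latest attempt overwrite earlier ones.

```scala
// Hypothetical sketch: combine a metric across all attempts of each stage.
case class StageAttempt(stageId: Int, attemptId: Int, durationMs: Long)

def aggregateByStage(attempts: Seq[StageAttempt]): Map[Int, Long] =
  attempts.groupBy(_.stageId).map { case (id, atts) =>
    // Failed and succeeded attempts are combined, not overwritten.
    id -> atts.map(_.durationMs).sum
  }

val attempts = Seq(
  StageAttempt(1, 0, 100L), // first attempt failed
  StageAttempt(1, 1, 250L), // retry succeeded
  StageAttempt(2, 0, 400L))

val agg = aggregateByStage(attempts)
```

With this grouping, stage 1 reports the total duration of both attempts rather than only the successful retry.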

Changes -

This pull request includes several changes to improve the handling of stage attempts and task metrics in the Spark RAPIDS tool. The most important changes include adding logic to handle multiple stage attempts, modifying methods to aggregate metrics for these attempts, and updating the AccumManager to simplify task accumulation.

Handling multiple stage attempts:

Simplifying task accumulation:

Testing

This change has been tested against internal event logs, and integration tests have been updated to ensure this behavior stays covered going forward.

Collaborator

@amahussein amahussein left a comment


Did you see any change in the output as an impact of that change? If you run a diff between the outputs before/after, is there any change in the metrics?

@amahussein amahussein added bug Something isn't working core_tools Scope the core module (scala) labels Feb 21, 2025
@eordentlich
Collaborator

@leewyang can you take a look at this wrt qualx?

Collaborator

@amahussein amahussein left a comment


Thanks @sayedbilalbari

It is interesting that this change did not trigger any change in the expected results of the unit tests. If that's the case, it would be great if we could add a test that shows this version actually generates something different compared to the previous version.
Otherwise, it might be possible that the code had no impact at all.

@leewyang
Collaborator

leewyang commented Feb 28, 2025

For qualx purposes, would there be a way to get the metrics associated with the failed stages/tasks, or will these be dropped entirely? For customers who care about the total time/resources used, it would be useful to get the metrics associated with successful stages/tasks and the failed stages/tasks, so we could tell that a job spent 90% of its time/resources on failed stages/tasks. Otherwise, the job would appear to only take 10% of the actual time/resources used in reality.

Signed-off-by: Sayed Bilal Bari <[email protected]>
Signed-off-by: Sayed Bilal Bari <[email protected]>
@sayedbilalbari
Collaborator Author

Thanks @amahussein for reviewing.
For the killed/failed tasks, as mentioned before, the way accumulables come in is different from how we are currently parsing them. The logic is as follows:

  1. For killed tasks, the accumUpdates show up in the object event.reason.accumUpdates, but not all of them are relevant. On further analysis, the majority of them are zeros; they just denote the accumulables this task was supposed to deal with.
  2. The accumulables where the task actually made updates show up in taskInfo.accumulables (as mentioned above, e.g. peakExecutionMemory, jvmGCTime, etc.).
  3. So no extra logic is needed to parse the accumUpdates that come in with killed tasks, since we already parse the TaskInfo and use that to store the relevant information.
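The three points above can be illustrated with a small sketch. The event shapes here are deliberately simplified and the names (TaskEndEvent, relevantUpdates) are hypothetical, not the tool's actual classes: the point is that only taskInfo.accumulables carries the values a killed task actually updated, so reason.accumUpdates can be ignored.

```scala
// Illustrative sketch of why parsing taskInfo.accumulables suffices.
case class AccumulableInfo(name: String, update: Long)
case class TaskEndEvent(
    reasonAccumUpdates: Seq[AccumulableInfo],   // mostly zeros for killed tasks
    taskInfoAccumulables: Seq[AccumulableInfo]) // the real updates

def relevantUpdates(e: TaskEndEvent): Map[String, Long] =
  // Only the taskInfo-side accumulables carry meaningful values.
  e.taskInfoAccumulables.map(a => a.name -> a.update).toMap

val killed = TaskEndEvent(
  reasonAccumUpdates = Seq(
    AccumulableInfo("shuffleWrite", 0L),       // placeholder zero
    AccumulableInfo("jvmGCTime", 0L)),         // placeholder zero
  taskInfoAccumulables = Seq(
    AccumulableInfo("jvmGCTime", 42L),
    AccumulableInfo("peakExecutionMemory", 1024L)))

val updates = relevantUpdates(killed)
```

The zero-valued entries under reason.accumUpdates never reach the aggregation, matching the behavior described in points 1-3.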

As for the absence of failure scenarios in the tests, this change tackles two major scenarios:

  1. Multiple successful attempts for a stage - will update the test case for this (have already verified that this gives correct and different results for event logs).
  2. Failed or non-completed stages - the test event logs did not have scenarios where a stage does not complete at all. There are scenarios where a primary attempt fails, but the secondary attempt overrides the primary failure entry, and hence no change in the aggregated files is seen.

We can discuss offline what would be the best way to generate event logs like this for the test case.

Signed-off-by: Sayed Bilal Bari <[email protected]>
@leewyang
Collaborator

For qualx purposes, would there be a way to get the metrics associated with the failed stages/tasks, or will these be dropped entirely? For customers who care about the total time/resources used, it would be useful to get the metrics associated with successful stages/tasks and the failed stages/tasks, so we could tell that a job spent 90% of its time/resources on failed stages/tasks. Otherwise, the job would appear to only take 10% of the actual time/resources used in reality.

FYI, we have existing code that is trying to look at metrics for failed stages.

    // TODO: this has stage attempts. we should handle different attempts
    app.stageManager.getAllStages.map { sm =>

    // TODO: Should we only consider successful tasks?
    app.stageManager.getAllSuccessfulStageAttempts.map { sm =>


For sm.duration, can we output both all stages and successfulStageAttempts? It's quite common to have failed stages in production runs, and for duration we want to see the cost there.

Collaborator Author


I think a better way of adding this would be another view that incorporates both failed and successful tasks/stages. Doing that segregation at the metric level does not seem like a good idea, and it would require creating another view. We can take care of this in another issue/PR.

@wjxiz1992
Collaborator

wjxiz1992 commented Mar 4, 2025

Found a case where no "sql_level_agg_task_metrics.csv" is produced with this change, while the 24.12.3 release version does produce it. I'll file an internal issue.

@sayedbilalbari
Collaborator Author

@wjxiz1992 Thanks for testing this out. This is happening because the event logs had no successful attempt, so some of the files that dealt with the stage-aggregated results were not generated.
However, after discussion with @amahussein, I will change this behavior to produce a combined output, as the current one can lead to confusion. Will update the PR.

@sayedbilalbari sayedbilalbari changed the title Adds filter for failed and non completed stages Adds aggregation across metrics for failed/succeeded and non completed stages Mar 5, 2025
@leewyang
Collaborator

leewyang commented Mar 5, 2025

After discussion with @sayedbilalbari and @amahussein, I think this PR is fine for qualx purposes. 👍

Note: there may be changes to the models after this PR is merged, since the stage metrics will be different than before (but theoretically more correct).

Collaborator

@amahussein amahussein left a comment


Thanks @sayedbilalbari !
QQ: since this PR fixes the bug caused by iterating over all stages without regard to the attempt,
are there any other calls to stageManager.getAllStages that might exhibit the same bug? If so, can you enumerate them so we can target them in follow-ups?

@sayedbilalbari
Collaborator Author

sayedbilalbari commented Mar 6, 2025

Thanks @amahussein for reviewing. I see three more usages of getAllStages:

  • One in generateTimeline - I am assuming this will be taken care of in the issue for dead-code removal.
  • getStagesWithMlFunctions - this gets all stages and checks for ML operations. With this change it will now see duplicate entries of the same stage. There is no breaking change related to aggregation, but it might be a good idea to look into the function to check its duplicate handling.
  • getFailedStages - this needs no change, as it filters for the attempts of a stage that failed. So it should be good.
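For the getStagesWithMlFunctions concern above, deduplication could look roughly like the following sketch. The names here (StageModel, latestAttempts) are illustrative placeholders, not the tool's actual API: the idea is to keep only one attempt per stage ID before checking for ML operations, so retried stages are not counted twice.

```scala
// Hypothetical sketch: collapse multiple attempts of a stage to the latest one.
case class StageModel(stageId: Int, attemptId: Int)

def latestAttempts(stages: Seq[StageModel]): Seq[StageModel] =
  stages.groupBy(_.stageId)        // group all attempts of each stage together
    .values
    .map(_.maxBy(_.attemptId))     // keep only the highest attempt per stage
    .toSeq

val stages = Seq(StageModel(1, 0), StageModel(1, 1), StageModel(2, 0))
val dedup  = latestAttempts(stages)
```

Whether "latest attempt" or "successful attempt" is the right dedup key would depend on the caller, which is why this is better handled in a follow-up.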

Signed-off-by: Sayed Bilal Bari <[email protected]>
Collaborator

@parthosa parthosa left a comment


Thanks @sayedbilalbari. LGTM.

@sayedbilalbari sayedbilalbari merged commit 634bd96 into NVIDIA:dev Mar 10, 2025
13 checks passed
Labels
bug Something isn't working core_tools Scope the core module (scala)
Development

Successfully merging this pull request may close these issues.

[BUG] Aggregate metric per stage is missing filter for stage attempts
7 participants