Adds aggregation across metrics for failed/succeeded and non completed stages #1558


Merged
merged 14 commits into from
Mar 10, 2025

Conversation

sayedbilalbari
Collaborator

@sayedbilalbari sayedbilalbari commented Feb 21, 2025

Fixes #1552

Currently we store stage info using the StageModelManager class, where we map incoming stage information during the following events:

  1. doSparkListenerStageCompleted
  2. doSparkListenerStageSubmitted

So stage information is updated once when a stage is submitted and once when it completes.
A StageCompleted event arrives for every attempt of a stage (e.g., there will be two StageSubmitted and StageCompleted events for a stage that fails on its first attempt and succeeds on its second).
This PR changes that behavior to aggregate all attempts for a stage (failed + succeeded).
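The aggregation described above can be sketched as follows. This is an illustrative example only; the names (StageAttempt, aggregateByStage) are hypothetical and not the tool's actual classes. The idea is to group all attempts by stage ID and sum a metric across them, instead of letting the latest attempt overwrite earlier ones.

```scala
// Hypothetical sketch: combine a metric across all attempts of each stage.
case class StageAttempt(stageId: Int, attemptId: Int, durationMs: Long)

def aggregateByStage(attempts: Seq[StageAttempt]): Map[Int, Long] =
  attempts.groupBy(_.stageId).map { case (id, atts) =>
    // Failed and succeeded attempts are combined, not overwritten.
    id -> atts.map(_.durationMs).sum
  }

val attempts = Seq(
  StageAttempt(1, 0, 100L), // first attempt failed
  StageAttempt(1, 1, 250L), // retry succeeded
  StageAttempt(2, 0, 400L))

val agg = aggregateByStage(attempts)
```

With this grouping, stage 1 reports the total duration of both attempts rather than only the successful retry.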

Changes -

This pull request includes several changes to improve the handling of stage attempts and task metrics in the Spark RAPIDS tool. The most important changes include adding logic to handle multiple stage attempts, modifying methods to aggregate metrics for these attempts, and updating the AccumManager to simplify task accumulation.

Handling multiple stage attempts:

Simplifying task accumulation:

Testing

This change has been tested against internal event logs, and integration tests have been updated to ensure this behavior stays covered going forward.

Collaborator

@amahussein amahussein left a comment


Did you see any change in the output as an impact of that change? If you run a diff between the outputs before/after, is there any change in the metrics?

@amahussein amahussein added bug Something isn't working core_tools Scope the core module (scala) labels Feb 21, 2025
@eordentlich
Collaborator

@leewyang can you take a look at this wrt qualx?

Collaborator

@amahussein amahussein left a comment


Thanks @sayedbilalbari

It is interesting that this change did not trigger any change in the expected results of the unit tests. If that's the case, it would be great if we could add a test that shows this version actually generates something different compared to the previous version.
Otherwise, it might be possible that the code had no impact at all.

@leewyang
Collaborator

leewyang commented Feb 28, 2025

For qualx purposes, would there be a way to get the metrics associated with the failed stages/tasks, or will these be dropped entirely? For customers who care about the total time/resources used, it would be useful to get the metrics associated with successful stages/tasks and the failed stages/tasks, so we could tell that a job spent 90% of its time/resources on failed stages/tasks. Otherwise, the job would appear to only take 10% of the actual time/resources used in reality.

Signed-off-by: Sayed Bilal Bari <[email protected]>
Signed-off-by: Sayed Bilal Bari <[email protected]>
@sayedbilalbari
Collaborator Author

Thanks @amahussein for reviewing.
For the killed/failed tasks, as mentioned before, the way accumulables come in is different from how we are currently parsing them. The logic is as follows:

  1. For killed tasks, the accumUpdates show up in the object event.reason.accumUpdates, but not all of them are relevant. On further analysis, the majority of them are zeros; they just denote the accumulables this task was supposed to deal with.
  2. The accumulables where the task actually made updates show up in taskInfo.accumulables (as mentioned above, e.g. peakExecutionMemory, jvmGCTime, etc.).
  3. So no extra logic is needed to parse the accumUpdates that come in with killed tasks, since we already parse the TaskInfo and use that to store the relevant information.
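The three points above can be illustrated with a small sketch. The event shapes here are deliberately simplified and the names (TaskEndEvent, relevantUpdates) are hypothetical, not the tool's actual classes: the point is that only taskInfo.accumulables carries the values a killed task actually updated, so reason.accumUpdates can be ignored.

```scala
// Illustrative sketch of why parsing taskInfo.accumulables suffices.
case class AccumulableInfo(name: String, update: Long)
case class TaskEndEvent(
    reasonAccumUpdates: Seq[AccumulableInfo],   // mostly zeros for killed tasks
    taskInfoAccumulables: Seq[AccumulableInfo]) // the real updates

def relevantUpdates(e: TaskEndEvent): Map[String, Long] =
  // Only the taskInfo-side accumulables carry meaningful values.
  e.taskInfoAccumulables.map(a => a.name -> a.update).toMap

val killed = TaskEndEvent(
  reasonAccumUpdates = Seq(
    AccumulableInfo("shuffleWrite", 0L),       // placeholder zero
    AccumulableInfo("jvmGCTime", 0L)),         // placeholder zero
  taskInfoAccumulables = Seq(
    AccumulableInfo("jvmGCTime", 42L),
    AccumulableInfo("peakExecutionMemory", 1024L)))

val updates = relevantUpdates(killed)
```

The zero-valued entries under reason.accumUpdates never reach the aggregation, matching the behavior described in points 1-3.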

As for the absence of failure scenarios in the tests, this change tackles two major scenarios:

  1. Multiple successful attempts for a stage - will update the test case for this (have already verified that this gives correct and different results for event logs).
  2. Failed or non-completed stages - the test event logs did not have scenarios where a stage does not complete at all. There are scenarios where a primary attempt fails, but the secondary attempt overrides the primary failure entry, and hence no change in the aggregated files is seen.

We can discuss offline what would be the best way to generate event logs like this for the test case.

Signed-off-by: Sayed Bilal Bari <[email protected]>
@leewyang
Collaborator

For qualx purposes, would there be a way to get the metrics associated with the failed stages/tasks, or will these be dropped entirely? For customers who care about the total time/resources used, it would be useful to get the metrics associated with successful stages/tasks and the failed stages/tasks, so we could tell that a job spent 90% of its time/resources on failed stages/tasks. Otherwise, the job would appear to only take 10% of the actual time/resources used in reality.

FYI, we have existing code that is trying to look at metrics for failed stages.

    // TODO: this has stage attempts. we should handle different attempts
    app.stageManager.getAllStages.map { sm =>

    // TODO: Should we only consider successful tasks?
    app.stageManager.getAllSuccessfulStageAttempts.map { sm =>


For sm.duration, can we output both all stages and successfulStageAttempts? It's quite common to have failed stages in production runs, and for duration we want to see the cost there.

Collaborator Author


I think a better way of adding this would be another view that incorporates both failed and successful tasks/stages. Doing that segregation at the metric level does not seem like a good idea, and it would require creating another view. We can take care of this in another issue/PR.

@wjxiz1992
Collaborator

wjxiz1992 commented Mar 4, 2025

Found a case where no "sql_level_agg_task_metrics.csv" is produced with this change, while the 24.12.3 release version does produce it. I'll file an internal issue.

@sayedbilalbari
Collaborator Author

@wjxiz1992 Thanks for testing this out. This is happening because the event logs had no successful attempt, so some of the files that dealt with the stage-aggregated results were not generated.
However, after discussion with @amahussein, I will change this behavior to produce a combined output, as the current one can lead to confusion. Will update the PR.

@sayedbilalbari sayedbilalbari changed the title Adds filter for failed and non completed stages Adds aggregation across metrics for failed/succeeded and non completed stages Mar 5, 2025
@leewyang
Collaborator

leewyang commented Mar 5, 2025

After discussion with @sayedbilalbari and @amahussein, I think this PR is fine for qualx purposes. 👍

Note: there may be changes to the models after this PR is merged, since the stage metrics will be different than before (but theoretically more correct).

Collaborator

@amahussein amahussein left a comment


Thanks @sayedbilalbari !
QQ: since this PR fixes the bug caused by iterating over all stages without regard to the attempt,
are there any other calls to stageManager.getAllStages that might exhibit the same bug? If so, can you enumerate them so we can target them in follow-ups?

@sayedbilalbari
Collaborator Author

sayedbilalbari commented Mar 6, 2025

Thanks @amahussein for reviewing. I see three more usages of getAllStages:

  • One in generateTimeline - I am assuming this will be taken care of in the issue for dead-code removal.
  • getStagesWithMlFunctions - this gets all stages and checks for ML operations. With this change it will now see duplicate entries of the same stage. There is no breaking change related to aggregation, but it might be a good idea to look into the function to check its duplicate handling.
  • getFailedStages - this needs no change, as it filters for the attempts of a stage that failed. So it should be good.
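For the getStagesWithMlFunctions concern above, deduplication could look roughly like the following sketch. The names here (StageModel, latestAttempts) are illustrative placeholders, not the tool's actual API: the idea is to keep only one attempt per stage ID before checking for ML operations, so retried stages are not counted twice.

```scala
// Hypothetical sketch: collapse multiple attempts of a stage to the latest one.
case class StageModel(stageId: Int, attemptId: Int)

def latestAttempts(stages: Seq[StageModel]): Seq[StageModel] =
  stages.groupBy(_.stageId)        // group all attempts of each stage together
    .values
    .map(_.maxBy(_.attemptId))     // keep only the highest attempt per stage
    .toSeq

val stages = Seq(StageModel(1, 0), StageModel(1, 1), StageModel(2, 0))
val dedup  = latestAttempts(stages)
```

Whether "latest attempt" or "successful attempt" is the right dedup key would depend on the caller, which is why this is better handled in a follow-up.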

Signed-off-by: Sayed Bilal Bari <[email protected]>
Collaborator

@parthosa parthosa left a comment


Thanks @sayedbilalbari. LGTM.

@sayedbilalbari sayedbilalbari merged commit 634bd96 into NVIDIA:dev Mar 10, 2025
13 checks passed
Labels
bug Something isn't working core_tools Scope the core module (scala)
Development

Successfully merging this pull request may close these issues.

[BUG] Aggregate metric per stage is missing filter for stage attempts
7 participants