Adds aggregation across metrics for failed/succeeded and non completed stages #1558
Signed-off-by: Sayed Bilal Bari <[email protected]>
Did you see any change in the output as a result of this change? If you diff the outputs from before and after, is there any change in the metrics?
@leewyang can you take a look at this wrt qualx?
Thanks @sayedbilalbari
It is interesting that this change did not trigger any change in the expected results of the unit tests. If that's the case, it would be great to add a test showing that this version actually generates something different from the previous version.
Otherwise, it is possible that the code change had no impact at all.
For qualx purposes, would there be a way to get the metrics associated with the failed stages/tasks, or will these be dropped entirely? For customers who care about the total time/resources used, it would be useful to get the metrics for both the successful and the failed stages/tasks, so we could tell that a job spent 90% of its time/resources on failed stages/tasks. Otherwise, the job would appear to take only 10% of the time/resources it actually used.
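The concern above can be illustrated with a small sketch. The `StageAttempt` shape below is hypothetical, not the tool's actual model: if totals are computed only from successful attempts, the cost of failed attempts disappears from the report.

```scala
// Hypothetical, simplified stage-attempt records; the real tool models
// these via StageModel/StageModelManager.
case class StageAttempt(stageId: Int, attemptId: Int, durationMs: Long, failed: Boolean)

val attempts = Seq(
  StageAttempt(0, 0, durationMs = 9000L, failed = true),  // first attempt fails
  StageAttempt(0, 1, durationMs = 1000L, failed = false)  // retry succeeds
)

val totalMs      = attempts.map(_.durationMs).sum
val successfulMs = attempts.filterNot(_.failed).map(_.durationMs).sum

// Reporting only successful attempts hides 90% of the wall-clock cost here.
val failedFraction = (totalMs - successfulMs).toDouble / totalMs
```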
Thanks @amahussein for reviewing.
As for no failure scenarios in the test, this change tackles two major scenarios -
We can discuss offline what would be the best way to generate event logs like this for the test case.
FYI, we have existing code that is trying to look at metrics for failed stages.
- // TODO: this has stage attempts. we should handle different attempts
- app.stageManager.getAllStages.map { sm =>
+ // TODO: Should we only consider successful tasks?
+ app.stageManager.getAllSuccessfulStageAttempts.map { sm =>
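A rough sketch of what the successful-attempt filter in the diff above could look like. The types and method bodies here are illustrative assumptions, not the actual `StageModelManager` implementation:

```scala
// Hypothetical simplified model: the real StageModelManager keys stages by
// (stageId, attemptId) and exposes getAllSuccessfulStageAttempts.
case class StageModel(stageId: Int, attemptId: Int, failureReason: Option[String]) {
  def hasFailed: Boolean = failureReason.isDefined
}

val stages = Seq(
  StageModel(1, 0, Some("FetchFailed")), // failed first attempt
  StageModel(1, 1, None),                // successful retry
  StageModel(2, 0, None)
)

// Analogue of getAllStages: every attempt, including failed ones.
def getAllStages: Seq[StageModel] = stages

// Analogue of getAllSuccessfulStageAttempts: drop failed attempts, so each
// stage contributes only its successful attempt(s).
def getAllSuccessfulStageAttempts: Seq[StageModel] = stages.filterNot(_.hasFailed)
```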
For sm.duration, can we output both all stages and successFulStageAttempts? It's quite common to have failed stages in production runs, and for duration we want to see that cost.
I think a better way of adding this would be another view that incorporates both failed and successful tasks/stages. Doing that segregation at the metric level does not seem like a good idea, and it would require creating another view. We can take care of this in another issue/PR.
Found a case where no "sql_level_agg_task_metrics.csv" is produced with this change, while the 24.12.3 release version does produce it. I'll file an internal issue.
@wjxiz1992 Thanks for testing this out. This is happening because the event logs had no successful stage attempt, so some of the files that deal with the stage-aggregated results were not generated.
After discussion with @sayedbilalbari and @amahussein, I think this PR is fine for qualx purposes. 👍 Note: there may be changes to the models after this PR is merged, since the stage metrics will be different from before (but theoretically more correct).
Thanks @sayedbilalbari !
QQ: since this PR fixes the bug caused by iterating over all stages without consideration of the attempt, are there any other calls to stageManager.getAllStages that might exhibit the same bug? If so, can you enumerate them so we can target them in follow-ups?
Thanks @amahussein for reviewing. I see three more usages of getAllStages -
Thanks @sayedbilalbari. LGTM.
Fixes #1552
Currently we store stage info using the StageModelManager class, where we map incoming stage information during the following events -
spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/EventProcessorBase.scala
Line 475 in 1f037fa
spark-rapids-tools/core/src/main/scala/org/apache/spark/sql/rapids/tool/EventProcessorBase.scala
Line 464 in 1f037fa
So stage information is updated once when a stage is submitted and once when it completes.
A StageCompleted event arrives for every attempt of a stage (e.g., there will be two StageSubmitted and StageCompleted events for a stage that fails on its first attempt and succeeds on attempt 2).
This PR changes that behavior to aggregate all attempts for a stage (failed + succeeded).
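The per-stage aggregation described above can be sketched roughly as follows (illustrative names only; the actual implementation lives in AppSparkMetricsAnalyzer and ProfileClassWarehouse):

```scala
// Illustrative only: aggregate a per-attempt metric across all attempts
// (failed + succeeded) of each stage, keyed by stageId.
case class StageAttemptMetrics(stageId: Int, attemptId: Int, shuffleWriteBytes: Long)

val attempts = Seq(
  StageAttemptMetrics(3, 0, 500L), // failed attempt still moved data
  StageAttemptMetrics(3, 1, 700L), // successful retry
  StageAttemptMetrics(4, 0, 200L)
)

// Sum the metric across every attempt of the same stage.
val perStageTotals: Map[Int, Long] =
  attempts.groupBy(_.stageId).view.mapValues(_.map(_.shuffleWriteBytes).sum).toMap
```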
Changes -
This pull request includes several changes to improve the handling of stage attempts and task metrics in the Spark RAPIDS tool. The most important changes include adding logic to handle multiple stage attempts, modifying methods to aggregate metrics for these attempts, and updating the AccumManager to simplify task accumulation.

Handling multiple stage attempts:
- core/src/main/scala/com/nvidia/spark/rapids/tool/analysis/AppSparkMetricsAnalyzer.scala: Added logic to handle multiple stage attempts by aggregating metrics for each attempt.
- core/src/main/scala/com/nvidia/spark/rapids/tool/profiling/ProfileClassWarehouse.scala: Introduced the aggregateStageProfileMetric method to combine metrics for multiple attempts of the same stage.

Simplifying task accumulation:
- core/src/main/scala/org/apache/spark/sql/rapids/tool/EventProcessorBase.scala: Modified the addAccToTask method to remove the taskId parameter and simplify task accumulation.
- core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumInfo.scala: Updated the addAccumToTask method to remove the taskId parameter.
- core/src/main/scala/org/apache/spark/sql/rapids/tool/store/AccumManager.scala: Simplified the addAccToTask method by removing the taskId parameter.

Testing
This change has been tested against internal event logs, and integration tests have been updated to cover this behavior going forward.
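The AccumManager simplification above (dropping the taskId parameter) can be sketched roughly like this. The class shapes here are hypothetical stand-ins, not the tool's exact classes:

```scala
import scala.collection.mutable

// Hypothetical sketch of the simplified accumulator bookkeeping: the update
// no longer needs a taskId, only the accumulator id and the task's value.
class AccumInfoSketch(val accId: Long) {
  private val taskValues = mutable.ArrayBuffer.empty[Long]
  def addAccumToTask(value: Long): Unit = taskValues += value
  def total: Long = taskValues.sum
}

class AccumManagerSketch {
  private val accums = mutable.Map.empty[Long, AccumInfoSketch]
  // Simplified signature: (accId, value) instead of (accId, taskId, value).
  def addAccToTask(accId: Long, value: Long): Unit =
    accums.getOrElseUpdate(accId, new AccumInfoSketch(accId)).addAccumToTask(value)
  def totalFor(accId: Long): Long = accums.get(accId).map(_.total).getOrElse(0L)
}

val mgr = new AccumManagerSketch
mgr.addAccToTask(accId = 10L, value = 5L)
mgr.addAccToTask(accId = 10L, value = 7L)
```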