[BUG] Use a stub to store Spark StageInfo #1524


Closed
amahussein opened this issue Feb 3, 2025 · 0 comments · Fixed by #1525
Assignees
Labels
core_tools Scope the core module (scala) performance performance and scalability of tools

Comments

@amahussein
Collaborator

Describe the bug

The StageModel references a StageInfo field to get the details of the stage.
The problem with that design is that it keeps a deep pointer to data that the core tools do not currently need.

@DeveloperApi
class StageInfo(
    val stageId: Int,
    private val attemptId: Int,
    val name: String,
    val numTasks: Int,
    val rddInfos: Seq[RDDInfo],
    val parentIds: Seq[Int],
    val details: String,
    val taskMetrics: TaskMetrics = null,
    private[spark] val taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty,
    private[spark] val shuffleDepId: Option[Int] = None,
    val resourceProfileId: Int,
    private[spark] var isPushBasedShuffleEnabled: Boolean = false,
    private[spark] var shuffleMergerCount: Int = 0) {
  /** When this stage was submitted from the DAGScheduler to a TaskScheduler. */
  var submissionTime: Option[Long] = None
  /** Time when the stage completed or when the stage was cancelled. */
  var completionTime: Option[Long] = None
  /** If the stage failed, the reason why. */
  var failureReason: Option[String] = None

  /**
   * Terminal values of accumulables updated during this stage, including all the user-defined
   * accumulators.
   */
  val accumulables = HashMap[Long, AccumulableInfo]()
  // ... (remainder of class elided)
}

Ideally, we should have a stub class that copies only what we need.
We did that before in #1206, but we had to roll it back in #1260 for compatibility with the various Spark implementations.
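A minimal sketch of the stub idea, assuming we only need the scalar stage fields (the name `StageInfoStub` and its field list are illustrative here, not necessarily what the actual fix in #1525 implements). The key point is that no reference to `rddInfos`, `taskMetrics`, or the `accumulables` map is retained, so the full Spark `StageInfo` can be garbage collected:

```scala
// Hypothetical stub: copy only the fields the core tools read, and drop
// the deep references (rddInfos, taskMetrics, accumulables) entirely.
case class StageInfoStub(
    stageId: Int,
    attemptNumber: Int,
    name: String,
    numTasks: Int,
    details: String,
    submissionTime: Option[Long] = None,
    completionTime: Option[Long] = None,
    failureReason: Option[String] = None) {
  // Wall-clock duration of the stage, when both timestamps are present.
  def duration: Option[Long] =
    for (s <- submissionTime; c <- completionTime) yield c - s
}
```

Because the stub is built from stable accessors rather than the `StageInfo` constructor, it sidesteps the constructor-signature differences between Spark versions that forced the #1260 rollback.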

@amahussein amahussein added ? - Needs Triage bug Something isn't working core_tools Scope the core module (scala) labels Feb 3, 2025
@amahussein amahussein self-assigned this Feb 3, 2025
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue Feb 4, 2025
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1524

This commit uses a smaller class `StageInfoStub` to store Spark's
StageInfo. The StageInfo class is common to all the Spark implementations,
but its constructor takes different fields across versions.

Currently we only use a subset of the class fields. The remaining fields
represent overhead and redundant storage, especially the accumulables and
taskMetrics stored for each stage.

To evaluate the memory optimization, a new `Checkpoint` mechanism was
added to allow gathering information at separate stages of the
execution.
The `checkpoint` design and implementation can be further improved and
extended to build a performance profile to compare different tradeoffs.
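The checkpoint mechanism described above could look roughly like the following (a hypothetical sketch; `MemCheckpoint`, `mark`, and `report` are illustrative names, not the actual implementation): record JVM heap usage at named points during execution so the memory footprints of different designs can be compared.

```scala
// Hypothetical checkpoint helper: snapshot used heap at labeled points.
object MemCheckpoint {
  // LinkedHashMap preserves the order in which checkpoints were taken.
  private val records = scala.collection.mutable.LinkedHashMap[String, Long]()

  def mark(label: String): Unit = {
    val rt = Runtime.getRuntime
    // Approximate used heap: total allocated minus free.
    records(label) = rt.totalMemory() - rt.freeMemory()
  }

  // Ordered (label, usedHeapBytes) pairs for building a comparison report.
  def report(): Seq[(String, Long)] = records.toSeq
}
```

Calling `MemCheckpoint.mark("afterParsing")` at the relevant stages of execution, then `report()` at the end, gives the kind of per-phase memory profile the commit message alludes to.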
@amahussein amahussein added performance performance and scalability of tools and removed bug Something isn't working labels Feb 4, 2025