[FEA] Split qualification output to per-app basis #1643

Open
1 of 3 tasks
amahussein opened this issue Apr 17, 2025 · 0 comments
Assignees
Labels
feature request New feature or request performance performance and scalability of tools

Comments

Collaborator

amahussein commented Apr 17, 2025

Is your feature request related to a problem? Please describe.

Currently, the qualification cmd in core-tools dumps all the estimate files into the root directory in a combined fashion.
This implementation means that all the results are kept in memory until the end, even though there is not much need for a combined report.

Below is a list of all problems that are caused by that design choice:

  1. It limits the scalability of core-tools (a huge number of eventlogs cannot be processed in a single run): as described in [FEA] Scale: Support huge number of eventlogs in single qualification process run #1377, keeping all the summaries in memory will eventually cause the JVM to OOM.
  2. The generated CSV files are huge, especially the ones listing operators and expressions. This causes overhead downstream, where consumers have to load those huge files and then filter per App-ID.
  3. No incremental runs. A runtime crash or a halt implies that no output is generated. Resuming the run means starting from scratch.
  4. Report generation is a bottleneck because it becomes single-threaded at the end. This slows down execution and does not take advantage of parallelism in generating the reports.
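
To make item 2 concrete, here is a minimal sketch of what a downstream consumer has to do with a combined report. The column names and rows are made up for illustration and are not the actual core-tools schema; the point is only that every row must be scanned to extract a single app.

```python
import csv
import io

# Hypothetical combined report. Column names and values are illustrative
# only and do not reflect the real core-tools output schema.
COMBINED_CSV = """App ID,Operator,Count
app-001,Filter,3
app-002,Project,5
app-001,Sort,1
"""


def rows_for_app(combined, app_id):
    """Scan every row of the combined report and keep those for one app."""
    reader = csv.DictReader(combined)
    scanned = 0
    matches = []
    for row in reader:
        scanned += 1
        if row["App ID"] == app_id:
            matches.append(row)
    return matches, scanned


matches, scanned = rows_for_app(io.StringIO(COMBINED_CSV), "app-001")
print(f"scanned {scanned} rows to find {len(matches)} rows for app-001")
```

The cost of the scan grows with the total number of apps in the run, not with the size of the app being inspected.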

Describe the solution you'd like

Change the directory structure of the core-tools output. Break the reports up per app, so that each app has its own directory.

  • A summary CSV file remains in the top directory.
  • A directory named apps contains the second level.
  • For each attempt, there is a subdirectory apps/<appId-attemptID> that contains the relevant output.
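
As a rough illustration of the layout described in the bullets above, the sketch below builds the proposed tree in a temporary directory. The file names (`summary.csv`, `operators.csv`) and the helper functions are placeholders invented for this example, not names used by core-tools.

```python
from pathlib import Path
import tempfile


def app_output_dir(root: Path, app_id: str, attempt_id: int) -> Path:
    """Return the apps/<appId-attemptID> directory under the output root."""
    return root / "apps" / f"{app_id}-{attempt_id}"


def write_app_report(root: Path, app_id: str, attempt_id: int,
                     file_name: str, content: str) -> Path:
    """Write one report file into the directory of a single app attempt."""
    out_dir = app_output_dir(root, app_id, attempt_id)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / file_name
    target.write_text(content)
    return target


root = Path(tempfile.mkdtemp())
# Hypothetical top-level summary file.
(root / "summary.csv").write_text("App ID,Attempt ID\napp-001,1\napp-002,1\n")
# Hypothetical per-app report files.
write_app_report(root, "app-001", 1, "operators.csv", "Operator,Count\nFilter,3\n")
write_app_report(root, "app-002", 1, "operators.csv", "Operator,Count\nProject,5\n")

layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
for entry in layout:
    print(entry)
```

A downstream consumer interested in one app would then open only the files under that app's directory instead of filtering a combined file.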

We want to use appId-attemptID as a unique key in order to handle multiple attempts per app.
This way, directories won't be overwritten by multiple attempts.
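
Because each attempt gets its own directory, writers never share a path, which is one way the layout could support parallel report generation and resumable runs. The sketch below is an assumption about how that might look, not the implementation in core-tools: the `_DONE` marker file, the `report.csv` name, and both functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile

DONE_MARKER = "_DONE"  # hypothetical marker file name, not part of core-tools


def process_attempt(root: Path, app_id: str, attempt_id: int) -> str:
    """Write the output of one attempt unless a previous run already finished it."""
    out_dir = root / "apps" / f"{app_id}-{attempt_id}"
    if (out_dir / DONE_MARKER).exists():
        return "skipped"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder for the real per-app report content.
    (out_dir / "report.csv").write_text(f"{app_id},{attempt_id}\n")
    # The marker is created last, so a crash mid-write leaves the attempt unfinished.
    (out_dir / DONE_MARKER).touch()
    return "written"


def run(root: Path, attempts):
    """Process all attempts in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda a: process_attempt(root, *a), attempts))


root = Path(tempfile.mkdtemp())
attempts = [("app-001", 1), ("app-001", 2), ("app-002", 1)]
first = run(root, attempts)
# A second run over the same output root only processes the new attempt.
second = run(root, attempts + [("app-003", 1)])
print(first)
print(second)
```

Note that the two attempts of `app-001` land in separate directories, so neither overwrites the other.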

Describe alternatives you've considered

In #1377, a solution is suggested to store QualificationSummary objects in a RocksDB file. While this could be a good way to offload the objects to disk, there would be considerable overhead from writing to and reading from the disk. In addition, a significant amount of work would be needed to wrap all the objects so that they can be serialized into RocksDB.

Additional context

Downstream changes are needed in:

  • user_tools
  • unit-tests
  • qualX
  • CI/CD tests
  • Aether

Sub Tasks

@amahussein amahussein added ? - Needs Triage feature request New feature or request performance performance and scalability of tools labels Apr 17, 2025
@amahussein amahussein self-assigned this Apr 17, 2025
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue May 8, 2025
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1643

Splits the output of qualification core module to be per-app.
This is a subtask toward the final goal of working with qualification on
per-app basis end-to-end.
amahussein added a commit to amahussein/spark-rapids-tools that referenced this issue May 9, 2025
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to NVIDIA#1643

Splits the output of qualification core module to be per-app.
This is a subtask toward the final goal of working with qualification on
per-app basis end-to-end.
amahussein added a commit that referenced this issue May 9, 2025
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Contributes to #1643

Splits the output of qualification core module to be per-app. This is a
subtask toward the final goal of working with qualification on per-app
basis end-to-end.

---------

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>