[FEA] Split qualification output to per-app basis #1643

amahussein · 2025-04-17T19:36:57Z

Is your feature request related to a problem? Please describe.

Currently, the qualification command in core-tools dumps all the estimate files into the root directory in a combined fashion.
This implementation means that all the results are kept in memory until the end of the run, even though not much of that data is needed for a combined report.

Below is a list of all problems that are caused by that design choice:

It limits the scalability of core-tools (a huge number of eventlogs cannot be processed in a single run). As described in [FEA] Scale: Support huge number of eventlogs in single qualification process run #1377, keeping all the summaries in memory will eventually cause the JVM to run out of memory (OOM).
The generated CSV files are huge, especially the ones listing operators and expressions. This adds overhead downstream, where consumers have to load those huge files and then filter them per App-ID.
There are no incremental runs. A runtime crash or a halt means that no output is generated, and resuming the run means starting from scratch.
Report generation is a bottleneck because it becomes single-threaded at the end. This slows down execution and does not take advantage of parallelism when generating the reports.

Describe the solution you'd like

Change the directory structure of the core-tools output. Break up the reports per app, so that each app has its own directory.

A summary CSV file remains in the top-level directory.
A directory named apps will contain the second level.
For each attempt there will be a subdirectory apps/<appId-attemptID> that contains the relevant output.
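
The layout above can be sketched roughly as follows. This is only an illustration in Python, not the core-tools implementation (core-tools is a JVM module); the helper names `app_output_dir` and `write_app_report` and the file name `execs.csv` are assumptions made up for this example.

```python
import csv
from pathlib import Path


def app_output_dir(root: Path, app_id: str, attempt_id: int) -> Path:
    """Return <root>/apps/<appId-attemptID> (hypothetical helper)."""
    return root / "apps" / f"{app_id}-{attempt_id}"


def write_app_report(root: Path, app_id: str, attempt_id: int, rows: list[dict]) -> Path:
    """Write one app attempt's rows into its own directory and return the file path."""
    if not rows:
        raise ValueError("rows must not be empty")
    out_dir = app_output_dir(root, app_id, attempt_id)
    out_dir.mkdir(parents=True, exist_ok=True)
    report = out_dir / "execs.csv"  # hypothetical file name, not from the issue
    with report.open("w", newline="") as f:
        # Column names are taken from the first row of this app's data.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return report
```

With a layout like this, a downstream consumer could open only the directory of the app it cares about instead of loading a combined CSV and filtering by App-ID.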

We want to use appId-attemptID as a unique key in order to handle multiple attempts per app.
This way, directories won't be overwritten by multiple attempts.
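
As a rough sketch of how a per-attempt key could help with incremental runs and parallel report generation, the Python below writes each app's output independently and skips keys that already finished. It is an assumption-laden illustration, not the core-tools design: the `_SUCCESS` marker file, the `report.txt` file name, and the functions `process_app` and `run` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DONE_MARKER = "_SUCCESS"  # hypothetical marker file; not specified in the issue


def process_app(root: Path, key: str) -> str:
    """Write the output for one appId-attemptID key unless it was already completed."""
    out_dir = root / "apps" / key
    marker = out_dir / DONE_MARKER
    if marker.exists():
        return "skipped"  # a previous run finished this attempt
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.txt").write_text(f"report for {key}\n")  # placeholder content
    marker.touch()  # mark the attempt as complete only after its output is written
    return "written"


def run(root: Path, keys: list[str], workers: int = 4) -> dict[str, str]:
    """Process all keys in parallel and return the status of each key."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(lambda k: process_app(root, k), keys))
    return dict(zip(keys, statuses))
```

Because each key maps to its own directory, workers never write to the same path, and a rerun after a crash only redoes the attempts that lack a marker.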

Describe alternatives you've considered

In #1377, a solution is suggested to store QualificationSummary objects in a RocksDB file. While this could be a good way to offload the objects to disk, there would be considerable overhead from writing to and reading from the disk. In addition, it would take a significant amount of work to wrap all the objects so that they can be serialized into RocksDB.

Additional context

Downstream changes are needed in:

user_tools
unit-tests
qualX
CI/CD tests
Aether

Sub Tasks

Split qualification output to per-app basis in core-tools #1666
Benchmark the core-tools performance before and after splitting the output
Update the user-tools module to handle the new output structure.


Contributes to NVIDIA#1643. Splits the output of the qualification core module to be per-app. This is a subtask toward the final goal of working with qualification on a per-app basis end-to-end. Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>


amahussein added the "? - Needs Triage", "feature request" (New feature or request), and "performance" (performance and scalability of tools) labels Apr 17, 2025

amahussein self-assigned this Apr 17, 2025

amahussein removed the ? - Needs Triage label Apr 17, 2025

This was referenced Apr 18, 2025

[BUG] Incorrect Action value in qual Execs table #1646

Open

[BUG] Qualification stages report is not accurate #1648

Open

amahussein mentioned this issue May 8, 2025

Split qualification output to per-app basis in core-tools #1666

Merged
