Skip to content

Disable Per-SQL summary text output #1530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 10, 2025

Conversation

amahussein
Copy link
Collaborator

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #1527

Disable the text format output generated per-sql. The target is to reduce the noise of the stdout and improve the performance of the core-tools

Impact on the output:

  • rapids_4_spark_qualification_output_persql.log is not generated anymore by the qualTool
  • remove column AppName from rapids_4_spark_qualification_output_persql.csv
  • the rapids_4_spark_qualification_output_persql.log can still be generated by the RunningQualificationApp
  • The order of the SQls in the CSV file has changed. Sorted Desc based on (GPU opportunity, and DF duration) whithin each app. Perviously, the SQLs were sorted globally which might cause considerable overhead for a large number of eventogs.

Impact on Performance and usability:

  • Improve readability of the stdout/log generated by the tools.
  • Reduce the size of lines consumed by the python wrapper.
  • Sorting Sqls per-app implies less memory requirements since it is only required to maintain the list of SQL for the current iteration.
  • Improve the string construction by avoiding filling Buffer<String, Int>
  • Improve the performance by skipping generating and writing the log file to the disk.

Details

This pull request to core/src/main/scala/com/nvidia/spark/rapids/tool/qualification/QualOutputWriter.scala includes changes to improve the performance and structure of the SQL CSV report generation and related functionality. The most important changes include the removal of the application name from the per-SQL report, the optimization of string concatenation and object allocation, and updates to the test expectations to reflect these changes.

Improvements to report generation:

Changes to related classes:

Updates to test expectations

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Fixes NVIDIA#1527

Disable the text format output generated per-sql. The target is to
reduce the noise of the stdout and improve the performance of the
core-tools

Impact on the output:

- `rapids_4_spark_qualification_output_persql.log` is not generated
  anymore by the qualTool
- remove column `AppName` from `rapids_4_spark_qualification_output_persql.csv`
- the `rapids_4_spark_qualification_output_persql.log` can still be
  generated by the RunningQualificationApp
- The order of the SQls in the CSV file has changed. Sorted Desc based
  on (GPU opportunity, and DF duration) whithin each app. Perviously,
the SQLs were sorted globally which might cause considerable overhead
for a large number of eventogs.
@amahussein amahussein self-assigned this Feb 5, 2025
@amahussein amahussein added core_tools Scope the core module (scala) API change A change affecting the output (add/remove/rename files, add/remove/rename columns) performance performance and scalability of tools labels Feb 5, 2025
Copy link
Collaborator

@parthosa parthosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amahussein. Tested this. LGTME.

Q: Why are we removing the App Name from the CSV?

@amahussein
Copy link
Collaborator Author

Thanks @amahussein. Tested this. LGTME.

Q: Why are we removing the App Name from the CSV?

It was not clear why the the app-name was included in the SQL CSV file in first place.
I guess it was their to make the stdout easier to understand.
Since the stdout is removed, I don't see a reason to keep that column around.

@amahussein amahussein merged commit 38aad12 into NVIDIA:dev Feb 10, 2025
13 checks passed
@amahussein amahussein deleted the rapids-tools-1527 branch February 10, 2025 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API change A change affecting the output (add/remove/rename files, add/remove/rename columns) core_tools Scope the core module (scala) performance performance and scalability of tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Disable Per-SQL summary text output
2 participants