
[FEA] Qualification : Uniquely identify apps when app names are the same #1590


Open
kuhushukla opened this issue Mar 13, 2025 · 10 comments
Labels
core_tools Scope the core module (scala) feature request New feature or request

Comments

@kuhushukla
Collaborator

Is your feature request related to a problem? Please describe.
There are scenarios where different application code can have the same app name (with distinct app IDs). In such a scenario it is hard to know which apps are good candidates to test and how we can proceed with moving that query to the test environment.
Since the app name is typically the key we group by, we need a different key that identifies job runs uniquely.
Describe the solution you'd like
A hash associated with the SQL plan (the physical plan is readily available in event logs), along with more identifying metadata such as the tables being read, the columns read from those tables, and their durations, would help augment the existing grouping key.


@kuhushukla kuhushukla changed the title [FEA] Uniquely identify apps when app names are the same [FEA] Qualification : Uniquely identify apps when app names are the same Mar 13, 2025
@amahussein
Collaborator

  • Physical plans can change due to AQE.
  • The logical plan is not saved to the event logs. It is possible to do so, though, with custom builds and configurations.
  • This would treat each app attempt as a separate entry. I don't think we want that if we plan to use multiple attempts as an input to diagnose app behaviors (i.e., analyze why specific attempts fail, etc.).

@viadea
Collaborator

viadea commented Mar 13, 2025

Logical plan is not directly in the event log, correct. Can we parse out a virtual layer on top of the physical plan that looks like a logical plan?
For example:

Eventlog1:
GpuScan parquet tableA->GpuCustomShuffleReader->GpuShuffleCoalesce
GpuScan parquet tableB->GpuCoalesceBatches->GpuColumnarExchange->...->GpuBroadcastExchange
GpuBroadcastHashJoin of tableA+tableB

Eventlog2:
GpuScan parquet tableA->GpuCustomShuffleReader->GpuShuffleCoalesce
GpuScan parquet tableB->GpuCoalesceBatches->GpuColumnarExchange
GpuShuffledSymmetricHashJoin of tableA+tableB

The abstract layer of above physical plan should be:

Scan tableA, Scan tableB -> Join of tableA+tableB
and it should be the same for the above 2 event logs.

As long as the abstract layer is the same, we treat them the same job IMO.
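A minimal sketch of what such an abstract layer could look like, assuming we normalize plan node names ourselves (all mappings, node lists, and function names below are illustrative, not from the tools codebase): strip the Gpu prefix, collapse all join strategies to a generic Join, keep scans (they carry table identity), and drop shuffle/exchange plumbing.

```python
import re

# Illustrative node classification; the actual set of plumbing/join nodes
# in Spark plans is larger than what is listed here.
JOIN_NODES = ("BroadcastHashJoin", "ShuffledHashJoin",
              "ShuffledSymmetricHashJoin", "SortMergeJoin")
PLUMBING = ("Exchange", "CoalesceBatches", "ShuffleCoalesce",
            "CustomShuffleReader", "ColumnarExchange", "BroadcastExchange")

def abstract_node(node):
    """Return the abstract op for one physical node, or None to drop it."""
    name = re.sub(r"^Gpu", "", node.strip())          # GpuScan -> Scan
    if name.startswith("Scan"):
        return name                                    # keep table identity
    if any(name.startswith(j) for j in JOIN_NODES):
        # collapse every join strategy to a generic Join, keeping operands
        return "Join" + name[name.find(" of "):] if " of " in name else "Join"
    if any(p in name for p in PLUMBING):
        return None                                    # drop shuffle plumbing
    return name

def abstract_plan(nodes):
    return [a for a in map(abstract_node, nodes) if a is not None]

# The two event logs from the comment above: BHJ vs. SHJ
plan1 = ["GpuScan parquet tableA", "GpuCustomShuffleReader",
         "GpuShuffleCoalesce", "GpuScan parquet tableB",
         "GpuCoalesceBatches", "GpuColumnarExchange",
         "GpuBroadcastHashJoin of tableA+tableB"]
plan2 = ["GpuScan parquet tableA", "GpuCustomShuffleReader",
         "GpuShuffleCoalesce", "GpuScan parquet tableB",
         "GpuCoalesceBatches", "GpuColumnarExchange",
         "GpuShuffledSymmetricHashJoin of tableA+tableB"]
assert abstract_plan(plan1) == abstract_plan(plan2)
```

Under these assumptions, both event logs reduce to the same abstract layer (Scan tableA, Scan tableB, Join of tableA+tableB), so they would be treated as the same job.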

@amahussein
Collaborator

The abstract layer of above physical plan should be

Thanks @viadea
I am not quite sure I understand the part related to the "abstract layer/virtual layer". How are those defined?

@amahussein amahussein added the core_tools Scope the core module (scala) label Mar 14, 2025
@viadea
Collaborator

viadea commented Mar 14, 2025

@leewyang is working on this, and his current approach is hashing the AdaptiveSparkPlan isFinalPlan=false physical plan, which is closer to the logical plan than the final physical plan. That is what we have so far in the event log.
At least this hashing works for my generated GPU event logs for the same SQL code doing BHJ vs SHJ: the hash value is the same.
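A rough sketch of what hashing the initial (isFinalPlan=false) plan string could involve, assuming some normalization of per-run tokens first; the regex rules and function name here are assumptions for illustration, not the actual approach:

```python
import hashlib
import re

def plan_hash(plan_text):
    """Hash a physical-plan string after stripping tokens that vary
    between runs (attribute ids, plan ids). Illustrative only."""
    normalized = plan_text
    # drop per-run numeric ids like "#123" and "plan_id=45"
    normalized = re.sub(r"#\d+[A-Z]*", "#", normalized)
    normalized = re.sub(r"plan_id=\d+", "plan_id=", normalized)
    # collapse whitespace so formatting differences don't change the hash
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two runs of the same query whose plans differ only in attribute ids
a = "AdaptiveSparkPlan isFinalPlan=false\n+- SortMergeJoin [id#12], [id#34]"
b = "AdaptiveSparkPlan isFinalPlan=false\n+- SortMergeJoin [id#99], [id#7]"
assert plan_hash(a) == plan_hash(b)
```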

@amahussein
Collaborator

@leewyang is working on this, and his current approach is hashing the AdaptiveSparkPlan isFinalPlan=false physical plan, which is closer to the logical plan than the final physical plan. That is what we have so far in the event log. At least this hashing works for my generated GPU event logs for the same SQL code doing BHJ vs SHJ: the hash value is the same.

I see. I will discuss more details with @leewyang offline.
FWIW, a planInfo has multiple versions due to AQE. So AdaptiveSparkPlan isFinalPlan=false would hold true for multiple plans with the same SQL ID.
Those plans depend on AQE and the Spark version, which may take us back to the original problem.
Perhaps this algorithm assumes the same Spark version in order to work.
Which one are we picking for the comparison? Plan version 0?

@viadea
Collaborator

viadea commented Mar 14, 2025

That is the reason why I still prefer the logical plan over any type of physical plan.
Maybe this AdaptiveSparkPlan isFinalPlan=false approach is a baby step. But I am not sure whether it will work for all users/customers. We need to test it with some real event logs to confirm.

@mattahrens
Collaborator

mattahrens commented Mar 24, 2025

Proposal scope for the effort to allow for SQL query matching:

  • Generate a new output file sql_plan_details.csv from the profiling tool (to also be generated in the qualification output for qualx).
  • The new output file should contain the following fields:
    • appID
    • sqlID
    • sqlPlan: full text of query plan from first org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart event for a given SQL query
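The proposed file could be emitted with the standard csv module; the field names below follow the proposal, while the sample row values and file handling are purely illustrative:

```python
import csv

# Hypothetical sample row; appID/sqlID/sqlPlan are the proposed fields.
rows = [
    {"appID": "app-20250313-0001", "sqlID": 1,
     "sqlPlan": "AdaptiveSparkPlan isFinalPlan=false\n+- Scan parquet tableA"},
]

with open("sql_plan_details.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["appID", "sqlID", "sqlPlan"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # csv quotes the embedded newlines in sqlPlan
```

Quoting matters here because the full plan text is multi-line; a proper CSV writer/reader round-trips it without corrupting the row structure.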

@leewyang
Collaborator

@mattahrens the sqlPlan would just provide the raw plan information, so downstream users would still need to align similar sqlIDs/queries by whatever means. Also, uniquely identifying appIds would require another level of matching (of multiple sqlIDs) across different appIds.

@mattahrens
Collaborator

@mattahrens the sqlPlan would just provide the raw plan information, so downstream users would still need to align similar sqlIDs/queries by whatever means. Also, uniquely identifying appIds would require another level of matching (of multiple sqlIDs) across different appIds.

Yes, that is the intent. There can then be follow-up work in qualx (or a common utility) to do the actual alignment with hashing logic, etc. Let me know if this sounds reasonable.
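That second level of matching (multiple sqlIDs across appIds) could be sketched as comparing the multisets of per-SQL plan hashes between two app runs; the similarity measure and names below are hypothetical, not qualx APIs:

```python
from collections import Counter

def app_similarity(hashes_a, hashes_b):
    """Jaccard-style overlap between two apps' plan-hash multisets.
    Illustrative only: 1.0 means identical SQL workloads, 0.0 disjoint."""
    ca, cb = Counter(hashes_a), Counter(hashes_b)
    inter = sum((ca & cb).values())   # hashes present in both runs
    union = sum((ca | cb).values())   # all distinct hash occurrences
    return inter / union if union else 0.0

# Hypothetical plan hashes for a CPU run and a GPU run of "the same" app
cpu_run = ["h1", "h2", "h3"]
gpu_run = ["h1", "h2", "h4"]
assert app_similarity(cpu_run, gpu_run) == 0.5
```

Two runs whose similarity exceeds some threshold could then be treated as the same application even when their app names collide with unrelated jobs.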

@sayedbilalbari
Collaborator

The expectation for this issue has changed from uniquely identifying apps based on different SQLs between CPU and GPU to just generating downstream JSON-based SQLPlanInfo files for consumption by QualX, which will then parse them and create a normalised version of the plan to compare between the two. Creating a different issue for this.
The progress of the second task, wherein the normalisation code is added to either qualX or core tools, can be tracked here
