
[FEA] Qualification : Uniquely identify apps when app names are the same #1590


Open
kuhushukla opened this issue Mar 13, 2025 · 10 comments
Labels
core_tools Scope the core module (scala) feature request New feature or request

Comments

@kuhushukla
Collaborator

Is your feature request related to a problem? Please describe.
There are scenarios where different application code can have the same app name (with distinct app IDs). In such a scenario it is hard to know which apps are good candidates to test and how we can proceed with moving that query to the test environment.
Since the app name is typically the key we group by, we need a different key that identifies job runs uniquely.
Describe the solution you'd like
A hash associated with the SQL plan (the physical plan is readily available in event logs), along with more identifying metadata such as the tables being read, the columns read from those tables, and their durations, would help augment the existing grouping key.


@kuhushukla kuhushukla changed the title [FEA] Uniquely identify apps when app names are the same [FEA] Qualification : Uniquely identify apps when app names are the same Mar 13, 2025
@amahussein
Collaborator

  • Physical plans can change due to AQE.
  • The logical plan is not saved to the event logs. It is possible to do so, though, with custom builds and configurations.
  • This would treat each app attempt as a separate entry. I don't think we want that if we plan to use multiple attempts as an input to diagnose app behaviors (i.e., analyze why specific attempts fail, etc.).

@viadea
Collaborator

viadea commented Mar 13, 2025

Logical plan is not directly in the event log, correct. Can we parse out a virtual layer on top of the physical plan that looks like a logical plan?
For example:

Eventlog1:
GpuScan parquet tableA->GpuCustomShuffleReader->GpuShuffleCoalesce
GpuScan parquet tableB->GpuCoalesceBatches->GpuColumnarExchange->...->GpuBroadcastExchange
GpuBroadcastHashJoin of tableA+tableB

Eventlog2:
GpuScan parquet tableA->GpuCustomShuffleReader->GpuShuffleCoalesce
GpuScan parquet tableB->GpuCoalesceBatches->GpuColumnarExchange
GpuShuffledSymmetricHashJoin of tableA+tableB

The abstract layer of above physical plan should be:

Scan tableA, Scan tableB -> Join of tableA+tableB
and it should be the same for the above 2 event logs.

As long as the abstract layer is the same, we treat them the same job IMO.
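A minimal sketch of what such an abstract layer could look like, assuming we normalize plan node names ourselves (all mappings, node lists, and function names below are illustrative, not from the tools codebase): strip the Gpu prefix, collapse all join strategies to a generic Join, keep scans (they carry table identity), and drop shuffle/exchange plumbing.

```python
import re

# Illustrative node classification; the actual set of plumbing/join nodes
# in Spark plans is larger than what is listed here.
JOIN_NODES = ("BroadcastHashJoin", "ShuffledHashJoin",
              "ShuffledSymmetricHashJoin", "SortMergeJoin")
PLUMBING = ("Exchange", "CoalesceBatches", "ShuffleCoalesce",
            "CustomShuffleReader", "ColumnarExchange", "BroadcastExchange")

def abstract_node(node):
    """Return the abstract op for one physical node, or None to drop it."""
    name = re.sub(r"^Gpu", "", node.strip())          # GpuScan -> Scan
    if name.startswith("Scan"):
        return name                                    # keep table identity
    if any(name.startswith(j) for j in JOIN_NODES):
        # collapse every join strategy to a generic Join, keeping operands
        return "Join" + name[name.find(" of "):] if " of " in name else "Join"
    if any(p in name for p in PLUMBING):
        return None                                    # drop shuffle plumbing
    return name

def abstract_plan(nodes):
    return [a for a in map(abstract_node, nodes) if a is not None]

# The two event logs from the comment above: BHJ vs. SHJ
plan1 = ["GpuScan parquet tableA", "GpuCustomShuffleReader",
         "GpuShuffleCoalesce", "GpuScan parquet tableB",
         "GpuCoalesceBatches", "GpuColumnarExchange",
         "GpuBroadcastHashJoin of tableA+tableB"]
plan2 = ["GpuScan parquet tableA", "GpuCustomShuffleReader",
         "GpuShuffleCoalesce", "GpuScan parquet tableB",
         "GpuCoalesceBatches", "GpuColumnarExchange",
         "GpuShuffledSymmetricHashJoin of tableA+tableB"]
assert abstract_plan(plan1) == abstract_plan(plan2)
```

Under these assumptions, both event logs reduce to the same abstract layer (Scan tableA, Scan tableB, Join of tableA+tableB), so they would be treated as the same job.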

@amahussein
Collaborator

The abstract layer of above physical plan should be

Thanks @viadea
I am not quite sure I understand the part related to the "abstract layer/virtual layer". How are those defined?

@amahussein amahussein added the core_tools Scope the core module (scala) label Mar 14, 2025
@viadea
Collaborator

viadea commented Mar 14, 2025

@leewyang is working on this, and his current approach is hashing the AdaptiveSparkPlan isFinalPlan=false physical plan, which is closer to the logical plan than the final physical plan. That is what we have so far in the event log.
At least this hashing works for my generated GPU event logs for the same SQL code doing BHJ vs SHJ: the hash value is the same.
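A rough sketch of what hashing the initial (isFinalPlan=false) plan string could involve, assuming some normalization of per-run tokens first; the regex rules and function name here are assumptions for illustration, not the actual approach:

```python
import hashlib
import re

def plan_hash(plan_text):
    """Hash a physical-plan string after stripping tokens that vary
    between runs (attribute ids, plan ids). Illustrative only."""
    normalized = plan_text
    # drop per-run numeric ids like "#123" and "plan_id=45"
    normalized = re.sub(r"#\d+[A-Z]*", "#", normalized)
    normalized = re.sub(r"plan_id=\d+", "plan_id=", normalized)
    # collapse whitespace so formatting differences don't change the hash
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two runs of the same query whose plans differ only in attribute ids
a = "AdaptiveSparkPlan isFinalPlan=false\n+- SortMergeJoin [id#12], [id#34]"
b = "AdaptiveSparkPlan isFinalPlan=false\n+- SortMergeJoin [id#99], [id#7]"
assert plan_hash(a) == plan_hash(b)
```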

@amahussein
Collaborator

@leewyang is working on this, and his current approach is hashing the AdaptiveSparkPlan isFinalPlan=false physical plan, which is closer to the logical plan than the final physical plan. That is what we have so far in the event log. At least this hashing works for my generated GPU event logs for the same SQL code doing BHJ vs SHJ: the hash value is the same.

I see. I will discuss more details with @leewyang offline.
FWIW, a planInfo has multiple versions due to AQE. So AdaptiveSparkPlan isFinalPlan=false would hold true for multiple plans with the same SQL ID.
Those plans depend on AQE and the Spark version, which may take us back to the original problem.
Perhaps this algorithm assumes the same Spark version in order to work.
Which one are we picking for the comparison? Plan version 0?

@viadea
Collaborator

viadea commented Mar 14, 2025

That is the reason why I still prefer the logical plan over any type of physical plan.
Maybe this AdaptiveSparkPlan isFinalPlan=false approach is a baby step. But I am not sure whether it will work for all users/customers. We need to test it with some real event logs to confirm.

@mattahrens
Collaborator

mattahrens commented Mar 24, 2025

Proposal scope for the effort to allow for SQL query matching:

  • Generate a new output file sql_plan_details.csv from the profiling tool (to also be generated in the qualification output for qualx).
  • The new output file should contain the following fields:
    • appID
    • sqlID
    • sqlPlan: full text of query plan from first org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart event for a given SQL query
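The proposed file could be emitted with the standard csv module; the field names below follow the proposal, while the sample row values and file handling are purely illustrative:

```python
import csv

# Hypothetical sample row; appID/sqlID/sqlPlan are the proposed fields.
rows = [
    {"appID": "app-20250313-0001", "sqlID": 1,
     "sqlPlan": "AdaptiveSparkPlan isFinalPlan=false\n+- Scan parquet tableA"},
]

with open("sql_plan_details.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["appID", "sqlID", "sqlPlan"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # csv quotes the embedded newlines in sqlPlan
```

Quoting matters here because the full plan text is multi-line; a proper CSV writer/reader round-trips it without corrupting the row structure.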

@leewyang
Collaborator

@mattahrens the sqlPlan would just provide the raw plan information, so downstream users would still need to align similar sqlIDs/queries by whatever means. Also, uniquely identifying appIds would require another level of matching (of multiple sqlIDs) across different appIds.

@mattahrens
Collaborator

@mattahrens the sqlPlan would just provide the raw plan information, so downstream users would still need to align similar sqlIDs/queries by whatever means. Also, uniquely identifying appIds would require another level of matching (of multiple sqlIDs) across different appIds.

Yes, that is the intent. There can then be follow-up work in qualx (or a common utility) to do the actual alignment with hashing logic, etc. Let me know if this sounds reasonable.
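That second level of matching (multiple sqlIDs across appIds) could be sketched as comparing the multisets of per-SQL plan hashes between two app runs; the similarity measure and names below are hypothetical, not qualx APIs:

```python
from collections import Counter

def app_similarity(hashes_a, hashes_b):
    """Jaccard-style overlap between two apps' plan-hash multisets.
    Illustrative only: 1.0 means identical SQL workloads, 0.0 disjoint."""
    ca, cb = Counter(hashes_a), Counter(hashes_b)
    inter = sum((ca & cb).values())   # hashes present in both runs
    union = sum((ca | cb).values())   # all distinct hash occurrences
    return inter / union if union else 0.0

# Hypothetical plan hashes for a CPU run and a GPU run of "the same" app
cpu_run = ["h1", "h2", "h3"]
gpu_run = ["h1", "h2", "h4"]
assert app_similarity(cpu_run, gpu_run) == 0.5
```

Two runs whose similarity exceeds some threshold could then be treated as the same application even when their app names collide with unrelated jobs.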

@sayedbilalbari
Collaborator

The expectation for this issue has changed from uniquely identifying apps based on different SQLs between CPU and GPU to just generating downstream JSON-based SQLPlanInfo files for consumption by QualX, which will then parse them and create a normalised version of the plan to compare between the two. Creating a different issue for this.
The progress of the second task, wherein the normalisation code is added to either qualX or core tools, can be tracked here
