Skip to content

Enforce JOIN plan to require condition #15334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 25, 2025
Merged

Conversation

goldmedal
Copy link
Contributor

@goldmedal goldmedal commented Mar 20, 2025

Which issue does this PR close?

Rationale for this change

When working on unparsing the plan optimized by ScalarSubqueryToJoin, I noticed it could generate a LEFT JOIN without the join condition.

Projection: customer.c_custkey [c_custkey:Int64]
  Filter: customer.c_custkey = __scalar_sq_1.max(orders.o_custkey) [c_custkey:Int64, c_name:Utf8, max(orders.o_custkey):Int64;N]
    Left Join:  [c_custkey:Int64, c_name:Utf8, max(orders.o_custkey):Int64;N]
      TableScan: customer [c_custkey:Int64, c_name:Utf8]
      SubqueryAlias: __scalar_sq_1 [max(orders.o_custkey):Int64;N]
        Projection: max(orders.o_custkey) [max(orders.o_custkey):Int64;N]
          Aggregate: groupBy=[[]], aggr=[[max(orders.o_custkey)]] [max(orders.o_custkey):Int64;N]
            TableScan: orders [o_orderkey:Int64, o_custkey:Int64, o_orderstatus:Utf8, o_totalprice:Float64;N]

The unparsing result will look like

SELECT 
  customer.c_custkey 
FROM 
  customer 
  LEFT JOIN (
    SELECT 
      max(orders.o_custkey) 
    FROM 
      orders
  ) AS __scalar_sq_1
WHERE 
  (
    customer.c_custkey = __scalar_sq_1."max(orders.o_custkey)"
  )

Most databases don't allow the join without the condition besides the cross join.

DuckDB

D select * from c1 left join c2;
Parser Error:
syntax error at or near ";"

Postgres

test=# select * from t1 left join t2;
ERROR:  syntax error at or near ";"
LINE 1: select * from t1 left join t2;

DataFusion

> select * from t1 left join t2;
+----+----+
| c1 | c1 |
+----+----+
| 1  | 1  |
+----+----+
1 row(s) fetched. 
Elapsed 0.005 seconds.

I prefer to align this behavior with others. It makes the logical plan created by DataFusion reliable.
#13486 also mentioned that join without condition isn't a valid SQL.

Because the cross-join has the same IR as the inner join without the condition, #13486 should be prevented from the SQL layer. Maybe we can consider to reopen apache/datafusion-sqlparser-rs#1552

What changes are included in this PR?

  • Enforce JOIN plan to require conditions when built
  • ScalarSubqueryToJoin will generate the LEFT JOIN plan with the condition True.

Are these changes tested?

yes

Are there any user-facing changes?

Besides INNER JOIN, the join syntax without condition is not allowed anymore.

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 20, 2025
@@ -861,167 +849,6 @@ mod test {
assert_optimized_plan_equal(outer_query, expected)
}

#[test]
fn limit_should_push_down_join_without_condition() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case has never occurred now.

@goldmedal goldmedal marked this pull request as ready for review March 20, 2025 15:32
@goldmedal goldmedal marked this pull request as draft March 20, 2025 16:15
@goldmedal goldmedal marked this pull request as ready for review March 20, 2025 16:33
@github-actions github-actions bot added the sql SQL Planner label Mar 20, 2025
@goldmedal goldmedal marked this pull request as draft March 22, 2025 06:19
@goldmedal goldmedal marked this pull request as ready for review March 22, 2025 07:39
@@ -65,7 +65,7 @@ logical_plan
07)------------Projection: customer.c_phone, customer.c_acctbal
08)--------------LeftAnti Join: customer.c_custkey = __correlated_sq_1.o_custkey
09)----------------Filter: substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")])
10)------------------TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")])]
10)------------------TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")]), Boolean(true)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I benchmarked to ensure the plan change won't impact the performance.

--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃    main ┃ fix_join-check ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 68.97ms │        70.14ms │    no change │
│ QQuery 2     │ 18.72ms │        18.55ms │    no change │
│ QQuery 3     │ 30.47ms │        29.46ms │    no change │
│ QQuery 4     │ 21.45ms │        21.77ms │    no change │
│ QQuery 5     │ 43.53ms │        41.56ms │    no change │
│ QQuery 6     │ 14.39ms │        14.88ms │    no change │
│ QQuery 7     │ 55.03ms │        55.30ms │    no change │
│ QQuery 8     │ 41.24ms │        39.45ms │    no change │
│ QQuery 9     │ 50.84ms │        53.19ms │    no change │
│ QQuery 10    │ 44.85ms │        44.83ms │    no change │
│ QQuery 11    │ 13.42ms │        13.22ms │    no change │
│ QQuery 12    │ 28.49ms │        28.20ms │    no change │
│ QQuery 13    │ 29.56ms │        29.38ms │    no change │
│ QQuery 14    │ 23.03ms │        24.40ms │ 1.06x slower │
│ QQuery 15    │ 33.59ms │        34.98ms │    no change │
│ QQuery 16    │ 13.07ms │        13.08ms │    no change │
│ QQuery 17    │ 58.00ms │        58.67ms │    no change │
│ QQuery 18    │ 76.00ms │        75.06ms │    no change │
│ QQuery 19    │ 39.99ms │        38.40ms │    no change │
│ QQuery 20    │ 29.04ms │        29.79ms │    no change │
│ QQuery 21    │ 62.14ms │        64.14ms │    no change │
│ QQuery 22    │ 13.82ms │        14.01ms │    no change │
└──────────────┴─────────┴────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary             ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (main)             │ 809.65ms │
│ Total Time (fix_join-check)   │ 812.45ms │
│ Average Time (main)           │  36.80ms │
│ Average Time (fix_join-check) │  36.93ms │
│ Queries Faster                │        0 │
│ Queries Slower                │        1 │
│ Queries with No Change        │       21 │
└───────────────────────────────┴──────────┘

@@ -723,14 +723,14 @@ statement ok
create table testSubQueryLimit (a int, b int) as values (1,2), (2,3), (3,4);

query IIII
select * from testSubQueryLimit as t1 join (select * from testSubQueryLimit limit 1) limit 10;
select * from testSubQueryLimit as t1, (select * from testSubQueryLimit limit 1) limit 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that It was an invalid SQL mentioned by #13486. But indeed, we can modify it after #13486 is fixed. I revert this change here.


# join condition is required
query error join condition should not be empty
SELECT * FROM t1 FULL JOIN t2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include also a cross join without filters like

SELECT * FROM t1 CROSS JOIN t2

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @goldmedal

@alamb alamb merged commit 3e56ed2 into apache:main Mar 25, 2025
27 checks passed
@goldmedal goldmedal deleted the fix/join-check branch March 26, 2025 00:52
@goldmedal
Copy link
Contributor Author

Thanks @comphead and @alamb 👍

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 27, 2025
* add check for missing join condition

* modify the tests

* handle the cross join case

* remove invald sql case

* fix sqllogictests

* fix for cross join case

* revert the change for limit.slt

* add cross join test
@linhr linhr mentioned this pull request Apr 8, 2025
39 tasks
@niebayes
Copy link
Contributor

niebayes commented Apr 21, 2025

@goldmedal I don't understand this change.
Previously, we have a sql:

select * from t
where (select sid from t) = (select a from t limit 1)
order by ts

which actually corresponds to the following:

select * from t where sid = some_value order by ts

The generated plan has a Left Join with a Filter: Boolean(true). I think this sql should not be impacted by this change.

@goldmedal
Copy link
Contributor Author

@goldmedal I don't understand this change. Previously, we have a sql:

select * from t
where (select sid from t) = (select a from t limit 1)
order by ts

The ScalarSubquery will be transformed by ScalarSubqueryToJoin to a left join, and then this PR forces the left join to have a condition. So, in this PR, I added the True expression for the left join in ScalarSubqueryToJoin.

I think the result is expected.

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* add check for missing join condition

* modify the tests

* handle the cross join case

* remove invald sql case

* fix sqllogictests

* fix for cross join case

* revert the change for limit.slt

* add cross join test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants