[SPARK-38075][SQL] Fix `hasNext` in `HiveScriptTransformationExec`'s process output iterator #35368

bersprockets · 2022-01-30T23:13:31Z

What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with order by and limit. For example:

create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;

This returns:

NULL
NULL
NULL
1
2
3
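The failure mode can be reproduced outside Spark with a small model. The sketch below is not the actual `HiveScriptTransformationExec` code: `Reader`, `ProcessOutputIterator`, `useCompletedFlag`, and `takeInTwoPhases` are hypothetical names, `current` stands in for `scriptOutputWritable`, and the two-loop consumer is only an assumed stand-in for a caller that consults `hasNext` again after it has already returned false.

```scala
import scala.collection.mutable.ListBuffer

object HasNextSketch {
  // Hypothetical stand-in for the record reader: fills `buf` and returns a
  // positive count for each line, then 0 once the input is exhausted.
  final class Reader(lines: List[String]) {
    private var rest = lines
    def next(buf: StringBuilder): Int = {
      buf.clear() // modeling assumption: the buffer holds nothing usable at EOF
      rest match {
        case head :: tail =>
          buf.append(head)
          rest = tail
          1
        case Nil => 0
      }
    }
  }

  // `current` plays the role of scriptOutputWritable. With
  // useCompletedFlag = false the iterator has the flaw described above;
  // with useCompletedFlag = true it mirrors the proposed fix.
  final class ProcessOutputIterator(reader: Reader, useCompletedFlag: Boolean)
      extends Iterator[Option[String]] {
    private val reused = new StringBuilder
    private var current: StringBuilder = null
    private var completed = false

    override def hasNext: Boolean = {
      if (useCompletedFlag && completed) return false
      if (current == null) {
        current = reused
        if (reader.next(current) <= 0) {
          completed = true
          return false // note: `current` stays non-null here
        }
      }
      true
    }

    override def next(): Option[String] = {
      if (!hasNext) throw new NoSuchElementException("end of process output")
      // An empty buffer models a row with no data (a NULL row).
      val row = if (current.isEmpty) None else Some(current.toString)
      current = null
      row
    }
  }

  // Hypothetical consumer: the second loop calls hasNext again even when the
  // first loop ended because hasNext returned false.
  def takeInTwoPhases[A](it: Iterator[A], k: Int): List[A] = {
    val out = ListBuffer.empty[A]
    while (out.size < k && it.hasNext) out += it.next()
    while (it.hasNext) out += it.next()
    out.toList
  }
}
```

In this model, draining three lines with `k = 10` and `useCompletedFlag = false` yields one extra `None` after the three real rows; with `useCompletedFlag = true` only the three real rows come back. In the SQL output above the NULL rows appear first because of the `order by`.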

Does this PR introduce any user-facing change?

No, other than removing the correctness issue.

How was this patch tested?

New unit test.

bersprockets · 2022-01-30T23:16:26Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala

        try {
          if (scriptOutputWritable == null) {
            scriptOutputWritable = reusedWritableObject

            if (scriptOutputReader != null) {
              if (scriptOutputReader.next(scriptOutputWritable) <= 0) {
                checkFailureAndPropagate(writerThread, null, proc, stderrBuffer)
+                completed = true


Alternatively, I could just set scriptOutputWritable to null here, and I wouldn't need the if statement at the top. That seemed to work. However, it feels a little unhygienic to read from an inputStream that has already returned EOF, so I added the completed flag instead.
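To make the trade-off concrete, here is a rough model of the two options. It is only a sketch under assumptions: `CountingReader`, `OutputIterator`, and `resetAtEof` are hypothetical names, and the reader simply counts how often it is polled after it has already reported end of input.

```scala
object EofRereadSketch {
  // Hypothetical reader that counts polls made after it first reported EOF.
  final class CountingReader(lines: List[String]) {
    private var rest = lines
    private var eofSeen = false
    var readsAfterEof = 0

    def next(buf: StringBuilder): Int = rest match {
      case head :: tail =>
        buf.clear()
        buf.append(head)
        rest = tail
        1
      case Nil =>
        if (eofSeen) readsAfterEof += 1
        eofSeen = true
        0
    }
  }

  // resetAtEof = true models the alternative (null out the buffer at EOF);
  // resetAtEof = false models the `completed` flag.
  final class OutputIterator(reader: CountingReader, resetAtEof: Boolean)
      extends Iterator[String] {
    private val reused = new StringBuilder
    private var current: StringBuilder = null
    private var completed = false

    override def hasNext: Boolean = {
      if (completed) return false
      if (current == null) {
        current = reused
        if (reader.next(current) <= 0) {
          if (resetAtEof) current = null else completed = true
          return false
        }
      }
      true
    }

    override def next(): String = {
      if (!hasNext) throw new NoSuchElementException("end of process output")
      val row = current.toString
      current = null
      row
    }
  }
}
```

Both variants keep returning false once the output is exhausted. The difference in this model is that every extra `hasNext` call in the reset variant polls the reader again after EOF, while the flag variant never touches it again.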

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala

dongjoon-hyun

Thank you, @bersprockets . Do you know when this bug started?

cc @viirya , too.

bersprockets · 2022-01-31T02:10:04Z

> Thank you, @bersprockets . Do you know when this bug started?

Not sure how far back, but I can reproduce it in 2.4.8, 3.1.2, 3.2.0, 3.2.1, and master.

dongjoon-hyun

~~It seems this is an ORDER BY clause issue and is unrelated to the LIMIT clause. Could you confirm the following?~~

scala> spark.version
res19: String = 3.2.1

scala> sql("SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a").show
+----+
|   a|
+----+
|null|
|   1|
+----+

dongjoon-hyun · 2022-01-31T03:38:48Z

Oh, never mind. I realized that spark-shell adds a limit automatically. You're right. This is an ORDER BY ... LIMIT issue.

spark-sql> SELECT version();
3.2.1 4f25b3f71238a00508a356591553f2dfa89f8290
Time taken: 0.208 seconds, Fetched 1 row(s)

spark-sql> SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a;
1
Time taken: 0.282 seconds, Fetched 1 row(s)

spark-sql> SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a LIMIT 3;
NULL
1
Time taken: 0.09 seconds, Fetched 2 row(s)

dongjoon-hyun · 2022-01-31T03:42:20Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala

@@ -64,7 +64,7 @@ private[hive] case class HiveScriptTransformationExec(
      outputSoi: StructObjectInspector,
      hadoopConf: Configuration): Iterator[InternalRow] = {
    new Iterator[InternalRow] with HiveInspectors {
-      var curLine: String = null


Although this is irrelevant to this correctness issue, the clean-up looks okay.

dongjoon-hyun

+1, LGTM.

viirya · 2022-01-31T04:04:30Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala

@@ -64,7 +64,7 @@ private[hive] case class HiveScriptTransformationExec(
      outputSoi: StructObjectInspector,
      hadoopConf: Configuration): Iterator[InternalRow] = {
    new Iterator[InternalRow] with HiveInspectors {
-      var curLine: String = null
+      var completed = false


nit: private

dongjoon-hyun · 2022-01-31T17:08:42Z

Thank you for the updates, @bersprockets . After compilation, I'll merge this.

…process output iterator

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```

This returns:

```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes #35368 from bersprockets/script_transformation_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 46885be)
Signed-off-by: Dongjoon Hyun <[email protected]>

dongjoon-hyun · 2022-01-31T18:50:29Z

Merged to master/3.2. For 3.1, there is a conflict.

bersprockets · 2022-01-31T19:03:41Z

> For 3.1, there is a conflict.

I will take a look at what needs to be done.

…c`'s process output iterator

Backport #35368 to 3.1.

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```

This returns:

```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes #35375 from bersprockets/SPARK-38075_3.1.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

bersprockets added 3 commits January 29, 2022 14:15

Test

39f6768

Proposed fix

e3586cb

Update test

db1b8b6

github-actions bot added the SQL label Jan 30, 2022

bersprockets commented Jan 30, 2022

dongjoon-hyun reviewed Jan 30, 2022

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala (outdated)

dongjoon-hyun reviewed Jan 30, 2022

Update test name

b2918c9

dongjoon-hyun reviewed Jan 31, 2022

dongjoon-hyun changed the title ~~[SPARK-38075][SQL] Fix hasNext in HiveScriptTransformationExec's process output iterator~~ [SPARK-38075][SQL] Fix `hasNext` in `HiveScriptTransformationExec`'s process output iterator Jan 31, 2022

dongjoon-hyun reviewed Jan 31, 2022

dongjoon-hyun approved these changes Jan 31, 2022

viirya approved these changes Jan 31, 2022

viirya reviewed Jan 31, 2022

Review feedback

8f84823

dongjoon-hyun closed this in 46885be Jan 31, 2022

GulajavaMinistudio mentioned this pull request Jan 31, 2022

Create a new pull request by comparing changes GulajavaMinistudio/spark#1177

Merged

bersprockets mentioned this pull request Jan 31, 2022

[SPARK-38075][SQL][3.1] Fix hasNext in HiveScriptTransformationExec's process output iterator #35375

Closed

bersprockets deleted the script_transformation_issue branch August 10, 2022 18:28
