-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-38075][SQL] Fix hasNext
in HiveScriptTransformationExec
's process output iterator
#35368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38075][SQL] Fix hasNext
in HiveScriptTransformationExec
's process output iterator
#35368
Conversation
try { | ||
if (scriptOutputWritable == null) { | ||
scriptOutputWritable = reusedWritableObject | ||
|
||
if (scriptOutputReader != null) { | ||
if (scriptOutputReader.next(scriptOutputWritable) <= 0) { | ||
checkFailureAndPropagate(writerThread, null, proc, stderrBuffer) | ||
completed = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, I could just set scriptOutputWritable
to null
here, and I wouldn't need the if statement at the top. That seemed to work. However, it feels a little unhygienic to read from an inputStream that has already returned EOF, so I added the completed flag instead.
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @bersprockets . Do you know when this bug started?
cc @viirya , too.
Not sure how far back, but at I can reproduce in 2.4.8, 3.1.2, 3.2.0, 3.2.1, and master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that ORDER BY
clause issue and irrelevant to LIMIT
clause. Could you confirm the following?
scala> spark.version
res19: String = 3.2.1
scala> sql("SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a").show
+----+
| a|
+----+
|null|
| 1|
+----+
Oh, never mind. I realized that the
|
hasNext
in HiveScriptTransformationExec
's process output iterator
@@ -64,7 +64,7 @@ private[hive] case class HiveScriptTransformationExec( | |||
outputSoi: StructObjectInspector, | |||
hadoopConf: Configuration): Iterator[InternalRow] = { | |||
new Iterator[InternalRow] with HiveInspectors { | |||
var curLine: String = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is irrelevant to this correctness issue, the clean-up looks okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
@@ -64,7 +64,7 @@ private[hive] case class HiveScriptTransformationExec( | |||
outputSoi: StructObjectInspector, | |||
hadoopConf: Configuration): Iterator[InternalRow] = { | |||
new Iterator[InternalRow] with HiveInspectors { | |||
var curLine: String = null | |||
var completed = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: private
Thank you for updates, @bersprockets . After compilation, I'll merge this. |
…process output iterator ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes #35368 from bersprockets/script_transformation_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 46885be) Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to master/3.2. For 3.1, there is a conflict. |
I will take a look at what needs to be done. |
…process output iterator Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` No, other than removing the correctness issue. New unit test. Closes apache#35368 from bersprockets/script_transformation_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…c`'s process output iterator Backport #35368 to 3.1. ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes #35375 from bersprockets/SPARK-38075_3.1. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…c`'s process output iterator Backport apache#35368 to 3.1. ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes apache#35375 from bersprockets/SPARK-38075_3.1. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…process output iterator ### What changes were proposed in this pull request? Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false. ### Why are the changes needed? When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true. The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example: ``` create or replace temp view t as select * from values (1), (2), (3) as t(a); select transform(a) USING 'cat' AS (a int) FROM t order by a limit 10; ``` This returns: ``` NULL NULL NULL 1 2 3 ``` ### Does this PR introduce _any_ user-facing change? No, other than removing the correctness issue. ### How was this patch tested? New unit test. Closes apache#35368 from bersprockets/script_transformation_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 46885be) Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit babde31)
What changes were proposed in this pull request?
Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.
Why are the changes needed?
When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.
The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with
order by
andlimit
. For example:This returns:
Does this PR introduce any user-facing change?
No, other than removing the correctness issue.
How was this patch tested?
New unit test.