Skip to content

[SPARK-38075][SQL] Fix hasNext in HiveScriptTransformationExec's process output iterator #35368

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with order by and limit. For example:

create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;

This returns:

NULL
NULL
NULL
1
2
3

Does this PR introduce any user-facing change?

No, other than removing the correctness issue.

How was this patch tested?

New unit test.

@github-actions github-actions bot added the SQL label Jan 30, 2022
try {
if (scriptOutputWritable == null) {
scriptOutputWritable = reusedWritableObject

if (scriptOutputReader != null) {
if (scriptOutputReader.next(scriptOutputWritable) <= 0) {
checkFailureAndPropagate(writerThread, null, proc, stderrBuffer)
completed = true
Copy link
Contributor Author

@bersprockets bersprockets Jan 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, I could just set scriptOutputWritable to null here, and I wouldn't need the if statement at the top. That seemed to work. However, it feels a little unhygienic to read from an inputStream that has already returned EOF, so I added the completed flag instead.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @bersprockets . Do you know when this bug started?

cc @viirya , too.

@bersprockets
Copy link
Contributor Author

Thank you, @bersprockets . Do you know when this bug started?

Not sure how far back, but at I can reproduce in 2.4.8, 3.1.2, 3.2.0, 3.2.1, and master.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that ORDER BY clause issue and irrelevant to LIMIT clause. Could you confirm the following?

scala> spark.version
res19: String = 3.2.1

scala> sql("SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a").show
+----+
|   a|
+----+
|null|
|   1|
+----+

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 31, 2022

Oh, never mind. I realized that the spark-shell add limit automatically. You're right. This is ORDER BY .. LIMIT issue.

spark-sql> SELECT version();
3.2.1 4f25b3f71238a00508a356591553f2dfa89f8290
Time taken: 0.208 seconds, Fetched 1 row(s)

spark-sql> SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a;
1
Time taken: 0.282 seconds, Fetched 1 row(s)

spark-sql> SELECT transform(a) USING 'cat' AS (a int) FROM VALUES (1) t(a) ORDER BY a LIMIT 3;
NULL
1
Time taken: 0.09 seconds, Fetched 2 row(s)

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38075][SQL] Fix hasNext in HiveScriptTransformationExec's process output iterator [SPARK-38075][SQL] Fix hasNext in HiveScriptTransformationExec's process output iterator Jan 31, 2022
@@ -64,7 +64,7 @@ private[hive] case class HiveScriptTransformationExec(
outputSoi: StructObjectInspector,
hadoopConf: Configuration): Iterator[InternalRow] = {
new Iterator[InternalRow] with HiveInspectors {
var curLine: String = null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is irrelevant to this correctness issue, the clean-up looks okay.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@@ -64,7 +64,7 @@ private[hive] case class HiveScriptTransformationExec(
outputSoi: StructObjectInspector,
hadoopConf: Configuration): Iterator[InternalRow] = {
new Iterator[InternalRow] with HiveInspectors {
var curLine: String = null
var completed = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: private

@dongjoon-hyun
Copy link
Member

Thank you for updates, @bersprockets . After compilation, I'll merge this.

dongjoon-hyun pushed a commit that referenced this pull request Jan 31, 2022
…process output iterator

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes #35368 from bersprockets/script_transformation_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 46885be)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/3.2. For 3.1, there is a conflict.

@bersprockets
Copy link
Contributor Author

For 3.1, there is a conflict.

I will take a look at what needs to be done.

bersprockets added a commit to bersprockets/spark that referenced this pull request Jan 31, 2022
…process output iterator

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

No, other than removing the correctness issue.

New unit test.

Closes apache#35368 from bersprockets/script_transformation_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Feb 1, 2022
…c`'s process output iterator

Backport #35368 to 3.1.

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes #35375 from bersprockets/SPARK-38075_3.1.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Feb 15, 2022
…c`'s process output iterator

Backport apache#35368 to 3.1.

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes apache#35375 from bersprockets/SPARK-38075_3.1.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@bersprockets bersprockets deleted the script_transformation_issue branch August 10, 2022 18:28
kazuyukitanimura pushed a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…process output iterator

### What changes were proposed in this pull request?

Fix hasNext in HiveScriptTransformationExec's process output iterator to always return false if it had previously returned false.

### Why are the changes needed?

When hasNext on the process output iterator returns false, it leaves the iterator in a state (i.e., scriptOutputWritable is not null) such that the next call returns true.

The Guava Ordering used in TakeOrderedAndProjectExec will call hasNext on the process output iterator even after an earlier call had returned false. This results in fake rows when script transform is used with `order by` and `limit`. For example:

```
create or replace temp view t as
select * from values
(1),
(2),
(3)
as t(a);

select transform(a)
USING 'cat' AS (a int)
FROM t order by a limit 10;
```
This returns:
```
NULL
NULL
NULL
1
2
3
```

### Does this PR introduce _any_ user-facing change?

No, other than removing the correctness issue.

### How was this patch tested?

New unit test.

Closes apache#35368 from bersprockets/script_transformation_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 46885be)
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit babde31)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants