fix(trino,pyspark): improve null handling in array filter #10448

stephen-bowser · 2024-11-06T17:51:06Z

Description of changes

This fixes an issue with the pyspark array filter function. The original implementation does not handle NULL elements in the input array correctly.

I'm not too familiar with SqlGlot, but by copying the implementation from duckdb, I was able to get all the test cases passing. Happy to take feedback if there's something I've missed though.

See this issue for further details

Issues closed

Resolves #10201

ibis/backends/tests/test_array.py

stephen-bowser · 2024-11-07T09:43:06Z

Looks like there is also an issue in the trino implementation for the same reason as there was in pyspark.
I had a go at fixing that one as well, but I'm not familiar enough with SqlGlot to work through it.
That being said, I think a better implementation for this backend could be something like:

- use zip to combine array elements with their indices into a struct
- apply the filter function
- return the values that passed the filter function, without their associated indices
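
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not the SqlGlot expression or the Trino SQL that was eventually merged; the helper name `filter_with_index` and the two-argument predicate shape are assumptions made for the example, with `None` standing in for NULL.

```python
from typing import Callable, List, Optional


def filter_with_index(
    values: List[Optional[int]],
    predicate: Callable[[Optional[int], int], bool],
) -> List[Optional[int]]:
    """Hypothetical helper: filter values by (value, index), keeping None elements."""
    # Step 1: zip each element with its position, giving (value, index) pairs.
    pairs = list(zip(values, range(len(values))))
    # Step 2: apply the filter function to each pair.
    kept = [pair for pair in pairs if predicate(pair[0], pair[1])]
    # Step 3: return only the values, dropping the indices again.
    return [value for value, _ in kept]


# Keep the elements at odd positions; both happen to be None and both survive.
odd_positions = filter_with_index([1, None, 3, None], lambda value, index: index % 2 == 1)

# Keep None elements and values greater than 1.
nulls_or_large = filter_with_index([1, None, 3], lambda value, index: value is None or value > 1)
```

Because the pair, rather than the bare value, is what gets filtered, a None element is treated like any other element and is only dropped when the predicate rejects it.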

gforsyth · 2024-11-09T22:13:33Z

Hey @stephen-bowser -- thanks for putting this together!
It's fine to only fix the pyspark backend, you can add a marker to xfail Trino with a TODO for us to handle redoing the implementation.

I can see that you copied the values check from the previous test -- that would work if we weren't dealing with Pandas NULL/NaN nonsense, so you're getting a test failure because Pandas makes things nan and also coerces columns to float because of that.

You might be better served by using to_pyarrow() to trigger execution and then comparing values that way, with something that has a proper notion of NULL
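
A rough plain-Python illustration of why the comparison matters. It does not call Pandas or PyArrow; the two result lists are hand-written stand-ins (assumptions for the example) for a NaN-based representation that has been coerced to float and for a representation that keeps a real null.

```python
import math

expected = [1, None, 3]

# Stand-in for a null-aware result: the missing element stays None, the rest stay ints.
null_aware_result = [1, None, 3]

# Stand-in for a NaN-based result: values coerced to float, missing element becomes NaN.
nan_based_result = [1.0, float("nan"), 3.0]

# The null-aware result compares equal to the expected values directly.
null_aware_matches = null_aware_result == expected

# The NaN-based result does not, because NaN is not equal to None.
nan_based_matches = nan_based_result == expected

# NaN is not even equal to itself, so element-wise checks need special handling.
nan_equals_itself = nan_based_result[1] == nan_based_result[1]
nan_detected = math.isnan(nan_based_result[1])
```

With a representation that has a proper NULL, the test can compare against the expected list as-is instead of special-casing NaN.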

cpcloud · 2024-11-10T14:25:04Z

Went ahead and fixed the trino backend here.

cpcloud · 2024-11-10T14:43:57Z

Fix on the way for using pyarrow to test

gforsyth

Thanks for putting this in @stephen-bowser !

github-actions bot added the tests and sql labels Nov 6, 2024

cpcloud added this to the 10.0 milestone Nov 10, 2024

cpcloud added the bug, pyspark, and trino labels Nov 10, 2024

cpcloud force-pushed the fix/pyspark_array_filter branch 2 times, most recently from dff4ff7 to 4e851e5 Compare November 10, 2024 14:24

cpcloud force-pushed the fix/pyspark_array_filter branch from 4e851e5 to e202d66 Compare November 10, 2024 14:25

gforsyth changed the title ~~fix(pyspark): fix array filter function in pyspark~~ fix(trino,pyspark): improve null handling in array filter Nov 10, 2024

fix(pyspark): preserve NULLs in array filter

ac5b9a1

cpcloud force-pushed the fix/pyspark_array_filter branch from e202d66 to de92f72 Compare November 10, 2024 14:44

fix(trino): ensure that NULLs are preserved in array filter

d7e2551

cpcloud force-pushed the fix/pyspark_array_filter branch from de92f72 to bcc57fb Compare November 10, 2024 14:44

test(arrays): use to_pyarrow to make null handling sane

62ed89f

cpcloud force-pushed the fix/pyspark_array_filter branch from bcc57fb to 62ed89f Compare November 10, 2024 14:45

gforsyth approved these changes Nov 10, 2024

cpcloud enabled auto-merge (squash) November 10, 2024 15:23

cpcloud merged commit 860b9ca into ibis-project:main Nov 10, 2024
76 checks passed

stephen-bowser deleted the fix/pyspark_array_filter branch November 11, 2024 10:33
