Skip to content

fix(trino,pyspark): improve null handling in array filter #10448

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 10, 2024

Conversation

stephen-bowser
Copy link
Contributor

@stephen-bowser stephen-bowser commented Nov 6, 2024

Description of changes

This fixes an issue with the pyspark array filter function. The original implementation does not account for handling nulls correctly in the input array.

I'm not too faimiliar with SqlGlot, but by copying the implementation from duckdb, I was able to get all the test cases passing. Happy to take feedback if there's something I've missed though.

See this issue for further details

Issues closed

Resolves #10201

@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Nov 6, 2024
@stephen-bowser
Copy link
Contributor Author

Looks like there is also an issue in the trino implementation for the same reason as there was in pyspark.
I had a go at fixing that one as well, but I'm not familiar enough with SqlGlot to work through it.
That being said, I think a better implementation for this backend could be something like:

  • use zip to combine array elements with their indices into a struct
  • apply the filter function
  • return back the values that passed the filter function, but without their associated indexes

@gforsyth
Copy link
Member

gforsyth commented Nov 9, 2024

Hey @stephen-bowser -- thanks for putting this together!
It's fine to only fix the pyspark backend, you can add a marker to xfail Trino with a TODO for us to handle redoing the implementation.

I can see that you copied the values check from the previous test -- that would work if we weren't dealing with Pandas NULL/NaN nonsense, so you're getting a test failure because Pandas makes things nan and also coerces columns to float because of that.

You might be better served by using to_pyarrow() to trigger execution and then comparing values that way, with something that has a proper notion of NULL

@cpcloud cpcloud added this to the 10.0 milestone Nov 10, 2024
@cpcloud cpcloud added bug Incorrect behavior inside of ibis pyspark The Apache PySpark backend trino The Trino backend labels Nov 10, 2024
@cpcloud cpcloud force-pushed the fix/pyspark_array_filter branch 2 times, most recently from dff4ff7 to 4e851e5 Compare November 10, 2024 14:24
@cpcloud
Copy link
Member

cpcloud commented Nov 10, 2024

Went ahead and fixed the trino backend here.

@cpcloud cpcloud force-pushed the fix/pyspark_array_filter branch from 4e851e5 to e202d66 Compare November 10, 2024 14:25
@gforsyth gforsyth changed the title fix(pyspark): fix array filter function in pyspark fix(trino,pyspark): improve null handling in array filter Nov 10, 2024
@cpcloud
Copy link
Member

cpcloud commented Nov 10, 2024

Fix on the way for using pyarrow to test

@cpcloud cpcloud force-pushed the fix/pyspark_array_filter branch from e202d66 to de92f72 Compare November 10, 2024 14:44
@cpcloud cpcloud force-pushed the fix/pyspark_array_filter branch from de92f72 to bcc57fb Compare November 10, 2024 14:44
@cpcloud cpcloud force-pushed the fix/pyspark_array_filter branch from bcc57fb to 62ed89f Compare November 10, 2024 14:45
Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Thanks for putting this in @stephen-bowser !

@cpcloud cpcloud enabled auto-merge (squash) November 10, 2024 15:23
@cpcloud cpcloud merged commit 860b9ca into ibis-project:main Nov 10, 2024
76 checks passed
@stephen-bowser stephen-bowser deleted the fix/pyspark_array_filter branch November 11, 2024 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis pyspark The Apache PySpark backend sql Backends that generate SQL tests Issues or PRs related to tests trino The Trino backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: allow the index as an argument to list functions such as filter and map
3 participants