feat(datafusion): add ArrayDistinct operation #10336

Merged
1 commit merged on Oct 20, 2024

Conversation

IndexSeek (Member) commented Oct 19, 2024

Description of changes

Adds support for the ArrayDistinct operation on the DataFusion backend.

https://datafusion.apache.org/user-guide/expressions.html#array-expressions

I ran into an issue where, if a NaN was present, the returned row count differed from the input. It raised the following:

Exception: Internal error: UDF returned a different number of rows than expected. Expected: 6, Got: 5.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I hope I marked this correctly in the test; this may require an upstream issue.
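To make the intended semantics concrete, here is a minimal pure-Python sketch of what ArrayDistinct is expected to do, based on the output shown below: duplicates removed, empty arrays left empty, a NULL array staying NULL, and NaN treated as a single distinct value even though `nan != nan` under IEEE comparison. This is an illustration of the semantics only, not the DataFusion implementation, and element order is not guaranteed by the backend.

```python
import math


def array_distinct(values):
    """Sketch of ArrayDistinct semantics (not DataFusion's implementation).

    Removes duplicates from a list, treating all NaNs as one value
    even though nan != nan under IEEE floating-point comparison.
    """
    if values is None:
        return None  # a NULL array stays NULL
    out = []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            # Compare NaN by "is NaN" rather than by equality.
            if any(isinstance(s, float) and math.isnan(s) for s in out):
                continue
        elif v in out:
            continue
        out.append(v)
    return out
```

Without the explicit NaN branch, a naive equality-based dedup would keep every NaN, since each `float('nan')` compares unequal to all others.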

In [1]: from ibis.interactive import *

In [2]: t = ibis.memtable({"a": [[1, 3, 3], [], [42, 42], [], [None], None]})

In [3]: con = ibis.connect("datafusion://")

In [4]: expr = t.select("a", uniqued=_.a.unique())

In [5]: con.execute(expr.filter(~_.a.isnull()))
Out[5]: 
                 a     uniqued
0  [1.0, 3.0, 3.0]  [1.0, 3.0]
1               []          []
2     [42.0, 42.0]      [42.0]
3               []          []
4            [nan]       [nan]

In [6]: con.execute(expr)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 con.execute(expr)

File ~/ibis/ibis/backends/datafusion/__init__.py:565, in Backend.execute(self, expr, **kwargs)
    562 def execute(self, expr: ir.Expr, **kwargs: Any):
    563     batch_reader = self.to_pyarrow_batches(expr, **kwargs)
    564     return expr.__pandas_result__(
--> 565         batch_reader.read_pandas(timestamp_as_object=True)
    566     )

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/ipc.pxi:617, in pyarrow.lib._ReadPandasMixin.read_pandas()

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/ipc.pxi:762, in pyarrow.lib.RecordBatchReader.read_all()

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/pyarrow/error.pxi:89, in pyarrow.lib.check_status()

File ~/ibis/ibis/backends/datafusion/__init__.py:542, in Backend.to_pyarrow_batches.<locals>.make_gen()
    541 def make_gen():
--> 542     yield from (
    543         # convert the renamed + casted columns into a record batch
    544         pa.RecordBatch.from_struct_array(
    545             # rename columns to match schema because datafusion lowercases things
    546             pa.RecordBatch.from_arrays(batch.to_pyarrow().columns, names=names)
    547             # cast the struct array to the desired types to work around
    548             # https://github.com/apache/arrow-datafusion-python/issues/534
    549             .to_struct_array()
    550             .cast(struct_schema, safe=False)
    551         )
    552         for batch in frame.execute_stream()
    553     )

File ~/ibis/ibis/backends/datafusion/__init__.py:552, in <genexpr>(.0)
    541 def make_gen():
    542     yield from (
    543         # convert the renamed + casted columns into a record batch
    544         pa.RecordBatch.from_struct_array(
    545             # rename columns to match schema because datafusion lowercases things
    546             pa.RecordBatch.from_arrays(batch.to_pyarrow().columns, names=names)
    547             # cast the struct array to the desired types to work around
    548             # https://github.com/apache/arrow-datafusion-python/issues/534
    549             .to_struct_array()
    550             .cast(struct_schema, safe=False)
    551         )
--> 552         for batch in frame.execute_stream()
    553     )

File /nix/store/h6dzdmg2hy4mcgry6r2y45nzcdqn5z7h-python3-3.12.6-env/lib/python3.12/site-packages/datafusion/record_batch.py:71, in RecordBatchStream.__next__(self)
     69 def __next__(self) -> RecordBatch:
     70     """Iterator function."""
---> 71     next_batch = next(self.rbs)
     72     return RecordBatch(next_batch)

Exception: Internal error: UDF returned a different number of rows than expected. Expected: 6, Got: 5.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
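The internal error above reflects an invariant that a scalar UDF must return exactly one output row per input row, including NULL rows. A hedged toy model (pure Python, not DataFusion internals) of how dropping NULL rows produces the "Expected: 6, Got: 5" style mismatch:

```python
def apply_elementwise(udf, column):
    """Toy model of the row-count invariant behind the error above.

    A scalar UDF must produce one output row per input row; any
    mismatch is reported like DataFusion's internal error.
    """
    out = udf(column)
    if len(out) != len(column):
        raise RuntimeError(
            f"UDF returned a different number of rows than expected. "
            f"Expected: {len(column)}, Got: {len(out)}"
        )
    return out


def buggy_udf(column):
    # Hypothetical buggy kernel: silently skips NULL (None) rows.
    return [sorted(set(a)) for a in column if a is not None]


def correct_udf(column):
    # Correct kernel: propagates NULL rows instead of dropping them.
    return [None if a is None else sorted(set(a)) for a in column]
```

With a six-row column containing one NULL array, `buggy_udf` returns five rows and trips the check, while `correct_udf` preserves the row count.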

@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Oct 19, 2024
@IndexSeek IndexSeek force-pushed the datafusion-array-distinct branch 2 times, most recently from f31ce66 to 4632230 Compare October 19, 2024 23:28
@github-actions github-actions bot added the polars The polars backend label Oct 19, 2024
@IndexSeek IndexSeek force-pushed the datafusion-array-distinct branch from 4632230 to b6f900d Compare October 19, 2024 23:30
cpcloud (Member) left a comment

LGTM, thanks!

@cpcloud cpcloud force-pushed the datafusion-array-distinct branch from ee1fc46 to 1996ff1 Compare October 20, 2024 13:57
@cpcloud cpcloud added this to the 10.0 milestone Oct 20, 2024
@cpcloud cpcloud enabled auto-merge (squash) October 20, 2024 14:06
@cpcloud cpcloud merged commit 4491a89 into ibis-project:main Oct 20, 2024
77 checks passed
@IndexSeek IndexSeek deleted the datafusion-array-distinct branch October 20, 2024 16:38
Labels
polars The polars backend sql Backends that generate SQL tests Issues or PRs related to tests
2 participants