Description
Describe the bug
When running a query like
SELECT * FROM table WHERE RANDOM() < 0.1;
I get different results depending on the value of "datafusion.execution.parquet.pushdown_filters"
. When this setting is turned off, I get the results I expect, roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.
I suspect I'm seeing these results because pushdown with TableProviderFilterPushDown::Inexact
is applying this filter at both the parquet level and a FilterExec: random() <= 0.1
. This results in the RANDOM()
filter being evaluated twice, which causes fewer rows to be sampled.
To Reproduce
This can be reproduced with datafusion-cli
version 42.2.0:
Without pushdown_filters
> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 605572 |
+----------+
1 row(s) fetched.
Elapsed 0.043 seconds.
With pushdown_filters
(note that you must re-create the table with the updated setting):
> set datafusion.execution.parquet.pushdown_filters=true;
0 row(s) fetched.
Elapsed 0.002 seconds.
> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
0 row(s) fetched.
Elapsed 0.007 seconds.
> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 60152 |
+----------+
1 row(s) fetched.
Elapsed 0.045 seconds.
Expected behavior
I would expect that a filter on RANDOM()
would be applied only once, so that RANDOM() < 0.1
means that only 10% of all rows will be sampled.
It would be acceptable if RANDOM()
was no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.
It feels like the "right" solution is to somehow guarantee that RANDOM()
always returns the same value for a given row and query evaluation, perhaps by "caching" its values.
Additional context
In my custom TableProvider, I tried using ``TableProviderFilterPushDown::Exact` for these filters, and I get the results that I expect. However, it seems that this is only because my filter is really simple.