Skip to content

Filters on RANDOM() are applied incorrectly when pushdown_filters is enabled. #13268

Closed
@adamfaulkner-at

Description

@adamfaulkner-at

Describe the bug

When running a query like

SELECT * FROM table WHERE RANDOM() < 0.1;

I get different results depending on the value of "datafusion.execution.parquet.pushdown_filters". When this setting is turned off, I get the results I expect, roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.

I suspect I'm seeing these results because pushdown with TableProviderFilterPushDown::Inexact is applying this filter at both the parquet level and a FilterExec: random() <= 0.1. This results in the RANDOM() filter being evaluated twice, which causes fewer rows to be sampled.

To Reproduce

This can be reproduced with datafusion-cli version 42.2.0:

Without pushdown_filters

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 605572   |
+----------+
1 row(s) fetched.
Elapsed 0.043 seconds.

With pushdown_filters (note that you must re-create the table with the updated setting):

> set datafusion.execution.parquet.pushdown_filters=true;
0 row(s) fetched.
Elapsed 0.002 seconds.

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
0 row(s) fetched.
Elapsed 0.007 seconds.

> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 60152    |
+----------+
1 row(s) fetched.
Elapsed 0.045 seconds.

Expected behavior

I would expect that a filter on RANDOM() would be applied only once, so that RANDOM() < 0.1 means that only 10% of all rows will be sampled.

It would be acceptable if RANDOM() was no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.

It feels like the "right" solution is to somehow guarantee that RANDOM() always returns the same value for a given row and query evaluation, perhaps by "caching" its values.

Additional context

In my custom TableProvider, I tried using ``TableProviderFilterPushDown::Exact` for these filters, and I get the results that I expect. However, it seems that this is only because my filter is really simple.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions