Support zero copy hash repartitioning for Hash Aggregate #15383

Dandandan · 2025-03-24T10:22:00Z

Is your feature request related to a problem or challenge?

Currently, RepartitionExec: partitioning=Hash is added for aggregates in FinalPartitioned and SinglePartitioned mode.

The benefit is increased parallelism, but at the cost of copying the entire table (in a not-so-efficient way).

We should consider lowering the cost of repartitioning by not having to copy the input.

Dependencies

Add selection vector repartitioning #15420

Describe the solution you'd like

Instead of repartitioning the input in RepartitionExec, support repartitioning the inputs based on a selection vector.

Instead of taking (copying) rows from the RecordBatch, we can consider doing the following:

Add a (boolean) selection vector as an output column for each output partition, i.e. true means the row is selected for that partition.
The rest of the RecordBatch remains unchanged (i.e. no copy).
CoalesceBatchesExec is no longer needed for the output (saving another copy).
Handle the selection vector in the hash aggregate code.
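
A minimal sketch of the idea in the steps above, using only the Rust standard library rather than the actual arrow or DataFusion types. The names SelectionPartitioned and partition_by_selection are hypothetical, a plain i64 slice stands in for the key column of a RecordBatch, and DefaultHasher stands in for whatever hash function the real repartitioning uses. Each output partition receives a boolean mask over the same, uncopied rows:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a hash-partitioned batch: the key column is
/// borrowed as-is (no copy) and each output partition gets its own mask.
pub struct SelectionPartitioned<'a> {
    /// The original key column, shared by every output partition.
    pub keys: &'a [i64],
    /// masks[p][row] is true when `row` belongs to output partition `p`.
    pub masks: Vec<Vec<bool>>,
}

/// Builds one boolean selection vector per output partition.
/// Only the masks are allocated; the input rows are never moved.
pub fn partition_by_selection(keys: &[i64], num_partitions: usize) -> SelectionPartitioned<'_> {
    assert!(num_partitions > 0, "need at least one output partition");
    let mut masks = vec![vec![false; keys.len()]; num_partitions];
    for (row, key) in keys.iter().enumerate() {
        // Assumption: any deterministic hash works for this illustration.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let partition = (hasher.finish() % num_partitions as u64) as usize;
        masks[partition][row] = true;
    }
    SelectionPartitioned { keys, masks }
}
```

Calling partition_by_selection(&keys, 4) returns four masks of the same length as keys. Every row is set in exactly one mask, and rows with equal keys always land in the same mask, which is the property the final aggregate relies on.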

Describe alternatives you've considered

The partitioning could be done inside the hash aggregate (at the cost of more complexity inside it).

Additional context

No response


goldmedal · 2025-03-25T13:41:09Z

take

goldmedal · 2025-03-29T10:56:08Z

@Dandandan
I have a draft goldmedal#3 based on #15423 for HashAggregate. Could you check if it's heading in the right direction?

When the selection vector mode is enabled:

CoalesceBatchesExec is not added for FinalPartitioned.
The selection vector is used to filter the required rows before merging batches.

The plan looks like this:

> create table t(c int) as values (1), (1), (1), (1), (2), (2), (3), (3)
> explain select count(distinct c) from t;
+---------------+--------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                             |
+---------------+--------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: count(alias1) AS count(DISTINCT t.c)                                                 |
|               |   Aggregate: groupBy=[[]], aggr=[[count(alias1)]]                                                |
|               |     Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]]                                              |
|               |       TableScan: t projection=[c]                                                                |
| physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT t.c)]                                    |
|               |   AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]                                        |
|               |     CoalescePartitionsExec                                                                       |
|               |       AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]                                  |
|               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]                  |
|               |           RepartitionExec: partitioning=HashSelectionVector([alias1@0], 12), input_partitions=12 |
|               |             RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                |
|               |               AggregateExec: mode=Partial, gby=[c@0 as alias1], aggr=[]                          |
|               |                 DataSourceExec: partitions=1, partition_sizes=[1]                                |
|               |                                                                                                  |
+---------------+--------------------------------------------------------------------------------------------------+

I'll review more aggregation patterns and add additional tests.
Thanks.

goldmedal · 2025-03-31T15:12:23Z

Based on goldmedal#3, I ran some benchmarks (clickbench_1, h2o_medium) for it.
feat_zero-copy-hash-agg-false is the branch with the configuration disabled.
feat_zero-copy-hash-agg is the branch with the configuration enabled.

In conclusion, HashAggregate is slower in the selection vector mode.

Comparing feat_zero-copy-hash-agg-false and feat_zero-copy-hash-agg
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ feat_zero-copy-hash-agg-false ┃ feat_zero-copy-hash-agg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │                        0.24ms │                  0.32ms │  1.33x slower │
│ QQuery 1     │                       26.98ms │                 24.56ms │ +1.10x faster │
│ QQuery 2     │                       55.89ms │                 52.55ms │ +1.06x faster │
│ QQuery 3     │                       48.20ms │                 45.62ms │ +1.06x faster │
│ QQuery 4     │                      313.79ms │                347.08ms │  1.11x slower │
│ QQuery 5     │                      490.80ms │                471.41ms │     no change │
│ QQuery 6     │                       25.06ms │                 25.46ms │     no change │
│ QQuery 7     │                       28.17ms │                 27.29ms │     no change │
│ QQuery 8     │                      353.53ms │                406.58ms │  1.15x slower │
│ QQuery 9     │                      514.71ms │                478.99ms │ +1.07x faster │
│ QQuery 10    │                      132.73ms │                130.81ms │     no change │
│ QQuery 11    │                      142.59ms │                143.29ms │     no change │
│ QQuery 12    │                      475.75ms │                493.83ms │     no change │
│ QQuery 13    │                      569.90ms │                630.60ms │  1.11x slower │
│ QQuery 14    │                      435.30ms │                444.02ms │     no change │
│ QQuery 15    │                      361.60ms │                406.62ms │  1.12x slower │
│ QQuery 16    │                      825.41ms │                856.13ms │     no change │
│ QQuery 17    │                      752.13ms │                766.95ms │     no change │
│ QQuery 18    │                     1813.04ms │               1934.07ms │  1.07x slower │
│ QQuery 19    │                       40.67ms │                 41.49ms │     no change │
│ QQuery 20    │                      621.14ms │                625.89ms │     no change │
│ QQuery 21    │                      769.98ms │                749.81ms │     no change │
│ QQuery 22    │                     1544.70ms │               1560.61ms │     no change │
│ QQuery 23    │                     4471.51ms │               4356.12ms │     no change │
│ QQuery 24    │                      257.77ms │                265.81ms │     no change │
│ QQuery 25    │                      268.53ms │                273.24ms │     no change │
│ QQuery 26    │                      294.19ms │                307.36ms │     no change │
│ QQuery 27    │                      983.41ms │                987.90ms │     no change │
│ QQuery 28    │                     7514.46ms │               7533.94ms │     no change │
│ QQuery 29    │                      346.70ms │                344.54ms │     no change │
│ QQuery 30    │                      387.65ms │                405.92ms │     no change │
│ QQuery 31    │                      390.81ms │                427.40ms │  1.09x slower │
│ QQuery 32    │                     1597.45ms │               1987.50ms │  1.24x slower │
│ QQuery 33    │                     1753.56ms │               1863.63ms │  1.06x slower │
│ QQuery 34    │                     1950.84ms │               1945.21ms │     no change │
│ QQuery 35    │                      510.78ms │                560.47ms │  1.10x slower │
│ QQuery 36    │                      105.22ms │                110.02ms │     no change │
│ QQuery 37    │                       56.69ms │                 53.63ms │ +1.06x faster │
│ QQuery 38    │                       74.69ms │                 77.84ms │     no change │
│ QQuery 39    │                      189.59ms │                193.83ms │     no change │
│ QQuery 40    │                       24.37ms │                 24.50ms │     no change │
│ QQuery 41    │                       23.01ms │                 23.45ms │     no change │
│ QQuery 42    │                       27.48ms │                 27.82ms │     no change │
└──────────────┴───────────────────────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (feat_zero-copy-hash-agg-false)   │ 31570.98ms │
│ Total Time (feat_zero-copy-hash-agg)         │ 32434.14ms │
│ Average Time (feat_zero-copy-hash-agg-false) │   734.21ms │
│ Average Time (feat_zero-copy-hash-agg)       │   754.28ms │
│ Queries Faster                               │          5 │
│ Queries Slower                               │         10 │
│ Queries with No Change                       │         28 │
└──────────────────────────────────────────────┴────────────┘
--------------------
Benchmark h2o.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃ feat_zero-copy-hash-agg-false ┃ feat_zero-copy-hash-agg ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │                     1053.42ms │               1044.36ms │    no change │
│ QQuery 2     │                     2155.01ms │               2317.51ms │ 1.08x slower │
│ QQuery 3     │                     2275.93ms │               2611.06ms │ 1.15x slower │
│ QQuery 4     │                     1236.64ms │               1256.03ms │    no change │
│ QQuery 5     │                     1608.77ms │               1892.13ms │ 1.18x slower │
│ QQuery 6     │                     1369.68ms │               1382.62ms │    no change │
│ QQuery 7     │                     2258.63ms │               2548.21ms │ 1.13x slower │
│ QQuery 8     │                     3876.11ms │               3985.41ms │    no change │
│ QQuery 9     │                     5989.38ms │               6721.88ms │ 1.12x slower │
│ QQuery 10    │                     3064.89ms │               3677.05ms │ 1.20x slower │
└──────────────┴───────────────────────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                            ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (feat_zero-copy-hash-agg-false)   │ 24888.45ms │
│ Total Time (feat_zero-copy-hash-agg)         │ 27436.27ms │
│ Average Time (feat_zero-copy-hash-agg-false) │  2488.85ms │
│ Average Time (feat_zero-copy-hash-agg)       │  2743.63ms │
│ Queries Faster                               │          0 │
│ Queries Slower                               │          6 │
│ Queries with No Change                       │          4 │
└──────────────────────────────────────────────┴────────────┘

I tried to profile ClickBench query 4.
With the selection vector enabled: [profile screenshot]

With the selection vector disabled: [profile screenshot]

In the current implementation, the CPU time of filter_record_batch (3.45%) is significantly greater than that of take_arrays (0.35%).
Does Arrow have a more efficient way to filter a record batch by a boolean array?

goldmedal · 2025-03-31T16:23:50Z

I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters all the columns iteratively. Instead, I should filter the rows when the accumulator runs merge_batch 🤔
I'll draft another PR for this approach.

zebsme · 2025-03-31T16:36:56Z

> I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters all the columns iteratively. Instead, I should filter the rows when the accumulator runs merge_batch 🤔 I'll draft another PR for this approach.

I agree; we should try to avoid operating directly on the record batch.

Dandandan · 2025-03-31T19:54:32Z

> I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters all the columns iteratively. Instead, I should filter the rows when the accumulator runs merge_batch 🤔

Yes, doing so will copy the entire batch (which is what we are trying to avoid) and will be slower than take (in the end it does the same thing).
I think what we probably want is to get the indices via https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html#method.set_indices so that it only aggregates the values at those indices.
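
To illustrate the difference between the two strategies, here is a small std-only sketch. It does not call arrow: a plain bool slice stands in for the BooleanBuffer, the local set_indices function only mimics the role of BooleanBuffer::set_indices, and sum_selected and sum_after_filter are hypothetical names. The first aggregate walks the set indices and reads values in place; the second materializes a filtered copy first, which is roughly what filtering the whole batch amounts to:

```rust
/// Yields the positions of `true` entries. This plays the role that
/// `BooleanBuffer::set_indices` plays in arrow, here over a plain slice.
pub fn set_indices(selection: &[bool]) -> impl Iterator<Item = usize> + '_ {
    selection
        .iter()
        .enumerate()
        .filter_map(|(i, &keep)| keep.then_some(i))
}

/// Sums `values` at the selected positions only, reading them in place
/// without building a filtered copy of the column.
pub fn sum_selected(values: &[i64], selection: &[bool]) -> i64 {
    assert_eq!(values.len(), selection.len());
    set_indices(selection).map(|i| values[i]).sum()
}

/// For contrast: filter first (allocates a new column), then aggregate.
pub fn sum_after_filter(values: &[i64], selection: &[bool]) -> i64 {
    assert_eq!(values.len(), selection.len());
    let filtered: Vec<i64> = values
        .iter()
        .zip(selection)
        .filter(|&(_, &keep)| keep)
        .map(|(&v, _)| v)
        .collect();
    filtered.iter().sum()
}
```

Both functions return the same result; the difference is that sum_after_filter pays for an intermediate allocation per column, while sum_selected only touches the selected rows.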

Rachelint · 2025-04-02T00:52:21Z

> I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters all the columns iteratively. Instead, I should filter the rows when the accumulator runs merge_batch 🤔

I think we also need to filter rows in GroupValues::intern.

goldmedal · 2025-04-20T13:23:35Z

I have another implementation for this issue: goldmedal#4
The idea is to fetch rows by the indices set in the selection vector instead of going through all the rows in the batch.

Because it may involve many changes, I want to check whether the implementation makes sense.
Currently, I have only implemented GroupValuesPrimitive::intern for the group-by values. For the aggregation, I have only implemented count and some aggregations that use GroupsAccumulator.
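
A rough sketch of what selection-aware interning and counting could look like, written against the standard library only. It is not the code in goldmedal#4 or the real GroupValuesPrimitive: PrimitiveGroups, intern_selected and count_selected are hypothetical names, and a HashMap stands in for the actual hash table. Group ids are assigned only for rows whose selection bit is set, and the count accumulator is updated only for those rows:

```rust
use std::collections::HashMap;

/// Hypothetical, simplified group-value table for one primitive key column.
#[derive(Default)]
pub struct PrimitiveGroups {
    map: HashMap<i64, usize>,
    /// Distinct keys in first-seen order; the index is the group id.
    pub values: Vec<i64>,
}

impl PrimitiveGroups {
    /// Assigns a group id to each selected row and skips the rest.
    /// Returns (row index, group id) pairs for the selected rows.
    pub fn intern_selected(&mut self, keys: &[i64], selection: &[bool]) -> Vec<(usize, usize)> {
        assert_eq!(keys.len(), selection.len());
        let mut assignments = Vec::new();
        for (row, &keep) in selection.iter().enumerate() {
            if !keep {
                continue; // row belongs to another partition
            }
            let key = keys[row];
            let next_id = self.values.len();
            let id = *self.map.entry(key).or_insert(next_id);
            if id == next_id {
                // First time this key is seen: record it as a new group.
                self.values.push(key);
            }
            assignments.push((row, id));
        }
        assignments
    }
}

/// COUNT(*) per group, updated only for the selected rows.
pub fn count_selected(counts: &mut Vec<u64>, assignments: &[(usize, usize)]) {
    for &(_, group) in assignments {
        if group >= counts.len() {
            counts.resize(group + 1, 0);
        }
        counts[group] += 1;
    }
}
```

Unselected rows are never hashed or counted, so the work done by one partition scales with the number of rows it owns rather than with the full batch size.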

I also did some optimization for the sv-mode repartition https://github.com/apache/datafusion/pull/15423/files#r2051721176.

However, I found the performance won't be better for Clickbench queries 4 and 7.

Query 4: SELECT COUNT(DISTINCT "UserID") FROM hits;
Query 7: SELECT "AdvEngineID", COUNT(*) FROM hits WHERE "AdvEngineID" <> 0 GROUP BY "AdvEngineID" ORDER BY COUNT(*) DESC;

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃ feat_hash-agg-sv-disable ┃ feat_hash-agg-sv ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 4     │                 311.74ms │         320.72ms │ no change │
│ QQuery 7     │                  30.28ms │          29.01ms │ no change │
└──────────────┴──────────────────────────┴──────────────────┴───────────┘

I'm not sure if I'm on the right track 🤔

@Dandandan @Rachelint Do you have any suggestions for it?

Rachelint · 2025-04-21T03:43:31Z

> However, I found the performance won't be better for Clickbench queries 4 and 7.

I think it's possible that these test queries don't reflect the improvement well.
I may try to put together some test cases this evening.

Dandandan added the enhancement New feature or request label Mar 24, 2025

Dandandan changed the title ~~Support zero copy hash repartitioning inside Hash Aggregate~~ Support zero copy hash repartitioning for Hash Aggregate Mar 24, 2025

Dandandan mentioned this issue Mar 24, 2025

Speed up hash partitioning #6822

Open

Dandandan added the performance Make DataFusion faster label Mar 24, 2025

github-actions bot assigned goldmedal Mar 25, 2025

This was referenced Mar 25, 2025

Support zero copy hash repartitioning for Hash Join #15382

Open

Introduce selection vector repartitioning #15423

Open

Rachelint mentioned this issue Apr 28, 2025

[DISCUSSION] DataFusion Road Map: Q3-Q4 2025 #15878

Open
