RFC: Exclude datetime columns from distribution keys #3255

Open
fuyufjh opened this issue Jun 15, 2022 · 4 comments
Labels
type/feature Type: New feature.

Comments


fuyufjh commented Jun 15, 2022

Background

Datetime columns are very useful in real-time applications, especially when combined with time window functions. Unlike other columns from the user’s business domain, datetime columns usually have these properties:

  • In streaming, if data is distributed by a datetime column, the traffic at any given moment always goes to a single node.
  • It’s very common to scan multiple days to see a trend or to aggregate at a higher level.

Sounds abstract? Let’s look at some examples.

Case 1: Grouped by datetime columns only. This MV answers: “how many orders per hour?”

CREATE MATERIALIZED VIEW mv AS
  SELECT window_start AS order_date, count(*) AS count
  FROM TUMBLE(orders, order_date, interval '1 hour')
  GROUP BY window_start

As we know, the streaming HashAgg operator distributes data by the group key, i.e. window_start here. As a result:

  • During 00:00:00 ~ 00:59:59, all traffic goes to the 1st HashAggExecutor instance
  • During 01:00:00 ~ 01:59:59, all traffic goes to the 2nd HashAggExecutor instance
  • During 02:00:00 ~ 02:59:59, all traffic goes to the 3rd HashAggExecutor instance
  • etc.
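
The skew above can be reproduced with a minimal sketch (not RisingWave code; the partition count and hash function are assumptions for illustration):

```python
# Illustrative sketch: every row arriving in the same hour carries the same
# window_start, so a hash shuffle on window_start sends the whole hour's
# traffic to one partition. NUM_PARTITIONS and the hash are assumptions.
import zlib
from datetime import datetime, timedelta

NUM_PARTITIONS = 4

def partition_for(key) -> int:
    # Deterministic stand-in for the shuffle's hash function.
    return zlib.crc32(str(key).encode()) % NUM_PARTITIONS

def tumble_start(ts: datetime) -> datetime:
    # 1-hour tumbling window, mirroring TUMBLE(..., interval '1 hour').
    return ts.replace(minute=0, second=0, microsecond=0)

# Orders arriving between 00:00:00 and 00:59:59 all share one window_start...
base = datetime(2022, 6, 22, 0, 0, 0)
orders = [base + timedelta(minutes=m) for m in range(60)]
partitions = {partition_for(tumble_start(ts)) for ts in orders}

# ...so the whole hour's traffic lands on a single HashAggExecutor instance.
assert len(partitions) == 1
```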

Another problem is that users may scan the results of multiple hours to see a trend or to aggregate at a higher level.

-- Get the history of one day
SELECT * FROM mv WHERE order_date >= '2022-06-22' AND order_date < '2022-06-23'

-- Get the aggregated count by day
SELECT order_date::date, sum(count) FROM mv GROUP BY order_date::date

If the results of each hour are hashed to different nodes, the scan becomes a distributed scan, which is much more expensive than a point select.
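
The fan-out can also be sketched (same assumed partition count and stand-in hash as above, not RisingWave code):

```python
# Illustrative sketch: the 24 hourly results of one day hash to many
# different partitions, so a one-day range scan fans out across nodes
# instead of being a point select on one node.
import zlib
from datetime import datetime, timedelta

NUM_PARTITIONS = 4

def partition_for(key) -> int:
    # Deterministic stand-in for the shuffle's hash function.
    return zlib.crc32(str(key).encode()) % NUM_PARTITIONS

# The 24 hourly rows of 2022-06-22 stored in mv.
day = datetime(2022, 6, 22)
hourly_rows = [day + timedelta(hours=h) for h in range(24)]
touched = {partition_for(ts) for ts in hourly_rows}

# The range scan must visit several partitions, not just one.
assert len(touched) > 1
```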

Case 2: Grouped by datetime columns and other columns. This MV answers: “how many orders per hour per customer?”

CREATE MATERIALIZED VIEW mv AS
  SELECT window_start AS order_date, customer_id, count(*)
  FROM TUMBLE(orders, order_date, interval '1 hour')
  GROUP BY window_start, customer_id

In this case, the streaming HashAgg operator distributes data by window_start together with customer_id. Luckily, with the help of customer_id, the traffic will be distributed to all nodes, but this also suggests that window_start contributes nothing to distributing the data.
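
A minimal sketch of this observation (assumed partition count and stand-in hash, not RisingWave code): hashing by customer_id alone already balances the load, while adding window_start scatters one customer’s hourly rows.

```python
# Illustrative sketch: customer_id alone spreads many customers across all
# partitions; (window_start, customer_id) scatters a single customer's rows.
import zlib

NUM_PARTITIONS = 4

def partition_for(*key_parts) -> int:
    # Deterministic stand-in for hashing a (possibly compound) group key.
    return zlib.crc32("|".join(map(str, key_parts)).encode()) % NUM_PARTITIONS

# Hashing by customer_id alone distributes 1000 customers over every partition...
by_customer = {partition_for(c) for c in range(1000)}
assert len(by_customer) == NUM_PARTITIONS

# ...whereas including window_start scatters customer 42's 24 hourly rows,
# so a per-customer lookup has to touch many nodes.
hours = [f"2022-06-22 {h:02d}:00:00" for h in range(24)]
scattered = {partition_for(h, 42) for h in hours}
assert len(scattered) > 1
```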

Similarly, imagine this MV serves a dashboard for each customer, so the queries may look like:

-- Get the history of one day for one customer
SELECT * FROM mv
WHERE order_date >= '2022-06-22' AND order_date < '2022-06-23'
  AND customer_id = 42;

-- Get the aggregated count by day for one customer
SELECT order_date::date, sum(count) FROM mv 
WHERE customer_id = 42
GROUP BY order_date::date;

Design

Simply exclude all datetime columns from the distribution keys of HashAgg in the optimizer.
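
A hypothetical sketch of the rule (function and type names here are invented for illustration, not RisingWave’s actual optimizer API):

```python
# Hypothetical sketch of the proposed rule: drop datetime-typed columns from
# the distribution key; if nothing remains, fall back to Single distribution.
DATETIME_TYPES = {"date", "time", "timestamp", "timestamptz"}

def choose_distribution(group_key):
    """group_key: list of (column_name, data_type) pairs of a HashAgg."""
    dist_key = [name for name, dtype in group_key if dtype not in DATETIME_TYPES]
    if not dist_key:
        # Grouped by datetime columns only: use Single distribution.
        return ("Single", [])
    return ("Hash", dist_key)

# Case 1: grouped by window_start only.
assert choose_distribution([("window_start", "timestamp")]) == ("Single", [])

# Case 2: window_start is excluded; distribute by customer_id only.
assert choose_distribution(
    [("window_start", "timestamp"), ("customer_id", "int")]
) == ("Hash", ["customer_id"])
```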

A subtle case is that users may also convert datetime columns to strings, e.g.

CREATE MATERIALIZED VIEW mv AS
  SELECT TO_CHAR(order_time, 'dd-mm-yyyy') AS order_date, count(*) AS count
  FROM orders
  GROUP BY 1

These “implicit” datetime columns can be identified by the optimizer with some expression derivation.
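
One possible shape of that derivation, as a hypothetical sketch (the expression representation is invented for illustration, not RisingWave’s actual expression IR): an expression counts as an implicit datetime column if every column it references is datetime-typed.

```python
# Hypothetical sketch: detect "implicit" datetime group-key expressions such
# as TO_CHAR(order_time, 'dd-mm-yyyy') by walking the expression tree.
DATETIME_TYPES = {"date", "timestamp", "timestamptz"}

def is_derived_from_datetime(expr, schema):
    """expr: nested tuples, e.g. ("to_char", ("col", "order_time"), ("lit", "dd-mm-yyyy"))."""
    kind = expr[0]
    if kind == "col":
        return schema[expr[1]] in DATETIME_TYPES
    if kind == "lit":
        return False
    # A function call is datetime-derived iff it references at least one
    # column and every non-literal argument is datetime-derived.
    args = [a for a in expr[1:] if a[0] != "lit"]
    return bool(args) and all(is_derived_from_datetime(a, schema) for a in args)

schema = {"order_time": "timestamp", "customer_id": "int"}

# TO_CHAR(order_time, 'dd-mm-yyyy') is an "implicit" datetime column.
assert is_derived_from_datetime(
    ("to_char", ("col", "order_time"), ("lit", "dd-mm-yyyy")), schema
)
# A plain customer_id reference is not.
assert not is_derived_from_datetime(("col", "customer_id"), schema)
```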

@fuyufjh fuyufjh added the type/feature Type: New feature. label Jun 15, 2022

st1page commented Jun 16, 2022

Simply exclude all datetime columns from distribution keys of HashAgg in the optimizer.

So when the agg is grouped by datetime columns only, what should we do? Convert it to a SimpleAgg with Single distribution?


fuyufjh commented Jun 16, 2022

Simply exclude all datetime columns from distribution keys of HashAgg in the optimizer.

So when the agg is grouped by datetime columns only, what should we do? Convert it to a SimpleAgg with Single distribution?

Yes. Convert it to HashAgg with Single distribution.


fuyufjh commented Jul 1, 2022

cc. @TennyZhuang @BugenZhao


fuyufjh commented Sep 8, 2022

What's the current state 👀 @st1page
