Background
Datetime columns are very useful in real-time applications, especially when combined with time window functions. Unlike other columns from users' business scenarios, datetime columns usually have these properties:
- If you distribute data by a datetime column in streaming, the traffic always goes to a single node at any given time.
- It's very common to scan multiple days to see a trend, or to aggregate at a higher level.
Sounds abstract? Let’s look at some examples.
Case 1: Grouped by datetime columns only. This MV answers: “how many orders per hour?”
CREATE MATERIALIZED VIEW mv AS
SELECT window_start AS order_date, count(*) AS count
FROM TUMBLE(orders, order_date, interval '1 hour')
GROUP BY window_start;
As we know, the streaming HashAgg operator distributes data by its group key, i.e. window_start here. As a result:
- During 00:00:00 ~ 00:59:59, all traffic goes to the 1st HashAggExecutor instance
- During 01:00:00 ~ 01:59:59, all traffic goes to the 2nd HashAggExecutor instance
- During 02:00:00 ~ 02:59:59, all traffic goes to the 3rd HashAggExecutor instance
- etc.
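To make this concrete, here is a minimal Rust sketch of hash-based routing (hypothetical: the instance_for helper and the use of DefaultHasher are illustrations, not the engine's actual vnode mapping). Every row in the same hour carries the same group key, so the whole hour's traffic lands on one instance:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for hash-based shuffling.
fn instance_for(key: impl Hash, parallelism: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % parallelism
}

fn main() {
    for window_start in ["2022-06-22 00:00:00", "2022-06-22 01:00:00", "2022-06-22 02:00:00"] {
        // Every row within this hour has the same group key, so the
        // entire hour's traffic goes to a single HashAggExecutor.
        println!("{window_start} -> instance {}", instance_for(window_start, 4));
    }
}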
Another problem is that users may scan the results of multiple hours to see a trend, or to aggregate at a higher level:
-- Get the history of one day
SELECT * FROM mv WHERE order_date >= '2022-06-22' AND order_date < '2022-06-23';

-- Get the aggregated count by day
SELECT order_date::date, sum(count) FROM mv GROUP BY order_date::date;
If the results of different hours are located on different nodes by hash, the scan becomes a distributed scan, which is much more expensive than a point select.
Case 2: Grouped by datetime columns and other columns. This MV answers: “how many orders per hour per customer?”
CREATE MATERIALIZED VIEW mv AS
SELECT window_start AS order_date, customer_id, count(*)
FROM TUMBLE(orders, order_date, interval '1 hour')
GROUP BY window_start, customer_id;
In this case, the streaming HashAgg operator distributes data by window_start together with customer_id. Luckily, with the help of customer_id, the traffic is distributed across all nodes, but this also shows that window_start contributes little to distributing the data.
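Reusing the hypothetical instance_for helper from the Case 1 sketch, the compound key spreads a single hour's traffic over customers:

fn main() {
    // Same hypothetical instance_for helper as in the Case 1 sketch.
    // Within one hour, different customer_ids hash to different
    // instances; window_start itself contributes no spreading.
    for customer_id in [1_i64, 2, 3, 4] {
        let key = ("2022-06-22 00:00:00", customer_id);
        println!("customer {customer_id} -> instance {}", instance_for(key, 4));
    }
}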
Similarly, imagine this MV serves a per-customer dashboard, so the queries may look like:
-- Get the history of one day for one customer
SELECT * FROM mv
WHERE order_date >= '2022-06-22' AND order_date < '2022-06-23' AND customer_id = 42;

-- Get the aggregated count by day for one customer
SELECT order_date::date, sum(count) FROM mv
WHERE customer_id = 42 GROUP BY order_date::date;
Design
Simply exclude all datetime columns from the distribution keys of HashAgg in the optimizer.
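A minimal Rust sketch of the rule (the types and names below are hypothetical, not the actual optimizer code):

// Hypothetical sketch: derive HashAgg's distribution key from its group
// keys, skipping datetime-typed columns.
#[derive(Clone, Copy, PartialEq)]
enum DataType { Int, Varchar, Date, Time, Timestamp, Timestamptz }

fn is_datetime(ty: DataType) -> bool {
    matches!(ty, DataType::Date | DataType::Time | DataType::Timestamp | DataType::Timestamptz)
}

// Given group keys as (column index, type) pairs, keep only the
// non-datetime columns as the distribution key.
fn derive_dist_key(group_keys: &[(usize, DataType)]) -> Vec<usize> {
    group_keys
        .iter()
        .filter(|(_, ty)| !is_datetime(*ty))
        .map(|(idx, _)| *idx)
        .collect()
}

fn main() {
    // Case 2: GROUP BY window_start (Timestamp), customer_id (Int)
    let keys = [(0, DataType::Timestamp), (1, DataType::Int)];
    assert_eq!(derive_dist_key(&keys), vec![1]); // distribute by customer_id only
}

Note that in Case 1 the only group key is window_start, so this rule leaves an empty distribution key; presumably some fallback (e.g. a singleton or two-phase aggregation) is needed there, which is out of scope for this sketch.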
A subtle case is that users may also convert datetime columns to strings, e.g.
CREATE MATERIALIZED VIEW mv AS
SELECT TO_CHAR(order_time, 'dd-mm-yyyy') AS order_date, count(*) AS count
FROM orders
GROUP BY 1;
These “implicit” datetime columns may be found with some derivation in the optimizer, e.g. by propagating a “depends only on datetime columns” property through expressions.
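For instance (again a hypothetical Rust sketch, not the actual optimizer code), the property can be derived bottom-up over an expression tree:

// Hypothetical sketch: mark an expression as "implicitly datetime" if it
// reads at least one input column and every column it reads is datetime,
// e.g. TO_CHAR(order_time, 'dd-mm-yyyy').
enum Expr {
    InputRef { index: usize },
    Literal,
    FuncCall { args: Vec<Expr> },
}

fn input_refs(expr: &Expr, out: &mut Vec<usize>) {
    match expr {
        Expr::InputRef { index } => out.push(*index),
        Expr::Literal => {}
        Expr::FuncCall { args } => args.iter().for_each(|a| input_refs(a, out)),
    }
}

fn is_datetime_derived(expr: &Expr, datetime_inputs: &[usize]) -> bool {
    let mut refs = Vec::new();
    input_refs(expr, &mut refs);
    !refs.is_empty() && refs.iter().all(|i| datetime_inputs.contains(i))
}

fn main() {
    // TO_CHAR(order_time /* column 3 */, 'dd-mm-yyyy')
    let to_char = Expr::FuncCall {
        args: vec![Expr::InputRef { index: 3 }, Expr::Literal],
    };
    assert!(is_datetime_derived(&to_char, &[3])); // exclude from distribution key
}

The skew argument holds for any pure function of datetime columns: at any given moment, nearly all incoming rows share the same timestamp, hence the same derived key.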