RFC: Exclude datetime columns from distribution keys #3255

Open
fuyufjh opened this issue Jun 15, 2022 · 4 comments
Labels
type/feature Type: New feature.

Comments


fuyufjh commented Jun 15, 2022

Background

Datetime columns are very useful in real-time applications, especially when combined with time window functions. Unlike other columns from the user’s business domain, datetime columns usually have these properties:

  • In streaming, if data is distributed by a datetime column, the traffic at any given moment always goes to a single node.
  • It’s very common to scan multiple days to see a trend or to aggregate at a higher level.

Sounds abstract? Let’s look at some examples.

Case 1: Grouped by datetime columns only. This MV answers: “how many orders per hour?”

CREATE MATERIALIZED VIEW mv AS
  SELECT window_start AS order_date, count(*) AS count
  FROM TUMBLE(orders, order_date, interval '1 hour')
  GROUP BY window_start

As we know, the streaming HashAgg operator distributes data by the group key, i.e. window_start here. As a result:

  • During 00:00:00 ~ 00:59:59, all traffic goes to the 1st HashAggExecutor instance
  • During 01:00:00 ~ 01:59:59, all traffic goes to the 2nd HashAggExecutor instance
  • During 02:00:00 ~ 02:59:59, all traffic goes to the 3rd HashAggExecutor instance
  • etc.
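
The skew above can be reproduced with a minimal sketch (not RisingWave code; the partition count and hash function are assumptions for illustration):

```python
# Illustrative sketch: every row arriving in the same hour carries the same
# window_start, so a hash shuffle on window_start sends the whole hour's
# traffic to one partition. NUM_PARTITIONS and the hash are assumptions.
import zlib
from datetime import datetime, timedelta

NUM_PARTITIONS = 4

def partition_for(key) -> int:
    # Deterministic stand-in for the shuffle's hash function.
    return zlib.crc32(str(key).encode()) % NUM_PARTITIONS

def tumble_start(ts: datetime) -> datetime:
    # 1-hour tumbling window, mirroring TUMBLE(..., interval '1 hour').
    return ts.replace(minute=0, second=0, microsecond=0)

# Orders arriving between 00:00:00 and 00:59:59 all share one window_start...
base = datetime(2022, 6, 22, 0, 0, 0)
orders = [base + timedelta(minutes=m) for m in range(60)]
partitions = {partition_for(tumble_start(ts)) for ts in orders}

# ...so the whole hour's traffic lands on a single HashAggExecutor instance.
assert len(partitions) == 1
```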

Another problem is that users may scan the results of multiple hours to see a trend or to aggregate at a higher level.

-- Get the history of one day
SELECT * FROM mv WHERE order_date >= '2022-06-22' AND order_date < '2022-06-23'

-- Get the aggregated count by day
SELECT order_date::date, sum(count) FROM mv GROUP BY order_date::date

If the results of each hour are hashed to different nodes, the scan becomes a distributed scan, which is much more expensive than a point select.
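
The fan-out can also be sketched (same assumed partition count and stand-in hash as above, not RisingWave code):

```python
# Illustrative sketch: the 24 hourly results of one day hash to many
# different partitions, so a one-day range scan fans out across nodes
# instead of being a point select on one node.
import zlib
from datetime import datetime, timedelta

NUM_PARTITIONS = 4

def partition_for(key) -> int:
    # Deterministic stand-in for the shuffle's hash function.
    return zlib.crc32(str(key).encode()) % NUM_PARTITIONS

# The 24 hourly rows of 2022-06-22 stored in mv.
day = datetime(2022, 6, 22)
hourly_rows = [day + timedelta(hours=h) for h in range(24)]
touched = {partition_for(ts) for ts in hourly_rows}

# The range scan must visit several partitions, not just one.
assert len(touched) > 1
```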

Case 2: Grouped by datetime columns and other columns. This MV answers: “how many orders per hour per customer?”

CREATE MATERIALIZED VIEW mv AS
  SELECT window_start AS order_date, customer_id, count(*)
  FROM TUMBLE(orders, order_date, interval '1 hour')
  GROUP BY window_start, customer_id

In this case, the streaming HashAgg operator distributes data by window_start together with customer_id. Luckily, with the help of customer_id, the traffic will be distributed to all nodes, but this also suggests that window_start contributes nothing to distributing the data.
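
A minimal sketch of this observation (assumed partition count and stand-in hash, not RisingWave code): hashing by customer_id alone already balances the load, while adding window_start scatters one customer’s hourly rows.

```python
# Illustrative sketch: customer_id alone spreads many customers across all
# partitions; (window_start, customer_id) scatters a single customer's rows.
import zlib

NUM_PARTITIONS = 4

def partition_for(*key_parts) -> int:
    # Deterministic stand-in for hashing a (possibly compound) group key.
    return zlib.crc32("|".join(map(str, key_parts)).encode()) % NUM_PARTITIONS

# Hashing by customer_id alone distributes 1000 customers over every partition...
by_customer = {partition_for(c) for c in range(1000)}
assert len(by_customer) == NUM_PARTITIONS

# ...whereas including window_start scatters customer 42's 24 hourly rows,
# so a per-customer lookup has to touch many nodes.
hours = [f"2022-06-22 {h:02d}:00:00" for h in range(24)]
scattered = {partition_for(h, 42) for h in hours}
assert len(scattered) > 1
```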

Similarly, imagine this MV serves a dashboard for each customer, so the queries may look like:

-- Get the history of one day for one customer
SELECT * FROM mv
WHERE order_date >= '2022-06-22' AND order_date < '2022-06-23'
  AND customer_id = 42;

-- Get the aggregated count by day for one customer
SELECT order_date::date, sum(count) FROM mv 
WHERE customer_id = 42
GROUP BY order_date::date;

Design

Simply exclude all datetime columns from the distribution keys of HashAgg in the optimizer.
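
A hypothetical sketch of the rule (function and type names here are invented for illustration, not RisingWave’s actual optimizer API):

```python
# Hypothetical sketch of the proposed rule: drop datetime-typed columns from
# the distribution key; if nothing remains, fall back to Single distribution.
DATETIME_TYPES = {"date", "time", "timestamp", "timestamptz"}

def choose_distribution(group_key):
    """group_key: list of (column_name, data_type) pairs of a HashAgg."""
    dist_key = [name for name, dtype in group_key if dtype not in DATETIME_TYPES]
    if not dist_key:
        # Grouped by datetime columns only: use Single distribution.
        return ("Single", [])
    return ("Hash", dist_key)

# Case 1: grouped by window_start only.
assert choose_distribution([("window_start", "timestamp")]) == ("Single", [])

# Case 2: window_start is excluded; distribute by customer_id only.
assert choose_distribution(
    [("window_start", "timestamp"), ("customer_id", "int")]
) == ("Hash", ["customer_id"])
```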

A subtle case is that users may also convert datetime columns to strings, e.g.

CREATE MATERIALIZED VIEW mv AS
  SELECT TO_CHAR(order_time, 'dd-mm-yyyy') AS order_date, count(*) AS count
  FROM orders
  GROUP BY 1

These “implicit” datetime columns can be identified by the optimizer with some expression derivation.
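
One possible shape of that derivation, as a hypothetical sketch (the expression representation is invented for illustration, not RisingWave’s actual expression IR): an expression counts as an implicit datetime column if every column it references is datetime-typed.

```python
# Hypothetical sketch: detect "implicit" datetime group-key expressions such
# as TO_CHAR(order_time, 'dd-mm-yyyy') by walking the expression tree.
DATETIME_TYPES = {"date", "timestamp", "timestamptz"}

def is_derived_from_datetime(expr, schema):
    """expr: nested tuples, e.g. ("to_char", ("col", "order_time"), ("lit", "dd-mm-yyyy"))."""
    kind = expr[0]
    if kind == "col":
        return schema[expr[1]] in DATETIME_TYPES
    if kind == "lit":
        return False
    # A function call is datetime-derived iff it references at least one
    # column and every non-literal argument is datetime-derived.
    args = [a for a in expr[1:] if a[0] != "lit"]
    return bool(args) and all(is_derived_from_datetime(a, schema) for a in args)

schema = {"order_time": "timestamp", "customer_id": "int"}

# TO_CHAR(order_time, 'dd-mm-yyyy') is an "implicit" datetime column.
assert is_derived_from_datetime(
    ("to_char", ("col", "order_time"), ("lit", "dd-mm-yyyy")), schema
)
# A plain customer_id reference is not.
assert not is_derived_from_datetime(("col", "customer_id"), schema)
```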

@fuyufjh fuyufjh added the type/feature Type: New feature. label Jun 15, 2022

st1page commented Jun 16, 2022

Simply exclude all datetime columns from distribution keys of HashAgg in the optimizer.

So when the agg is grouped by datetime columns only, what should we do? Convert it to a SimpleAgg with Single distribution?


fuyufjh commented Jun 16, 2022

Simply exclude all datetime columns from distribution keys of HashAgg in the optimizer.

So when the agg is grouped by datetime columns only, what should we do? Convert it to a SimpleAgg with Single distribution?

Yes. Convert it to HashAgg with Single distribution.


fuyufjh commented Jul 1, 2022

cc. @TennyZhuang @BugenZhao


fuyufjh commented Sep 8, 2022

What's the current state 👀 @st1page
