Commit 291d07c

add performance case in readme (#489)
* add performance case in readme
* refactor case
1 parent e14948e commit 291d07c

13 files changed: +62 -65 lines changed

README.md

+19-21
@@ -55,36 +55,34 @@ GeaFlow supports two sets of programming interfaces: DSL and API. You can develo
5555
* DSL application development: [DSL Application Development](docs/docs-en/source/5.application-development/2.dsl/1.overview.md)
5656
* API application development: [API Application Development](docs/docs-en/source/5.application-development/1.api/1.overview.md)
5757

58-
## Real-time Capabilities
58+
## Performance
5959

60-
Compared with traditional stream processing engines such as Flink and Storm, which use tables as their data model for real-time processing, GeaFlow's graph-based data model has significant performance advantages when handling join relationship operations, especially complex multi-hops relationship operations like those involving 3 or more hops of join and complex loop searches.
60+
### Dynamic Graph Computation Acceleration
6161

62-
[![total_time](docs/static/img/vs_join_total_time_en.jpg)](docs/docs-en/source/reference/vs_join.md)
62+
GeaFlow supports incremental graph computation: it can run continuous, streaming, incremental iterative computations or traversals on dynamic graphs (graphs that change constantly). When GeaFlow consumes messages from real-time middleware, the vertices touched by the real-time data in the current window are activated, which triggers iterative graph computation. In each iteration, only updated vertices notify their neighbors, and unchanged vertices are not recomputed, which significantly improves the timeliness of the computation.
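The activation idea can be illustrated with a small, self-contained sketch. This is not GeaFlow's API or implementation; the class `IncrementalWCC` and its method `apply_window` are hypothetical names chosen for illustration, and the algorithm is a plain min-label propagation that assumes integer vertex ids and an insert-only graph.

```python
from collections import defaultdict, deque


class IncrementalWCC:
    """Toy incremental weakly-connected-components via min-label propagation.

    Hypothetical illustration only; not GeaFlow code.
    """

    def __init__(self):
        self.adj = defaultdict(set)  # undirected adjacency (WCC ignores direction)
        self.label = {}              # vertex -> smallest vertex id in its component

    def apply_window(self, edges):
        """Ingest one window of new edges; return how many vertices were activated."""
        active = deque()
        for src, dst in edges:
            for v in (src, dst):
                self.label.setdefault(v, v)
            self.adj[src].add(dst)
            self.adj[dst].add(src)
            active.extend((src, dst))  # only endpoints of new edges start active

        touched = set(active)
        while active:
            v = active.popleft()
            for n in self.adj[v]:
                # A vertex notifies a neighbor only when it can lower that label.
                if self.label[v] < self.label[n]:
                    self.label[n] = self.label[v]
                    active.append(n)
                    touched.add(n)
        return len(touched)


wcc = IncrementalWCC()
activated = [
    wcc.apply_window([(1, 2), (3, 4)]),
    wcc.apply_window([(5, 6)]),
    wcc.apply_window([(2, 3)]),
]
```

In this toy run the three windows activate 4, 2, and 3 vertices respectively: the third window merges two components but never touches vertices 5 and 6, which are unaffected by the new edge.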
6363

64-
[Why using graphs for relational operations is more appealing than table joins?](docs/docs-en/source/reference/vs_join.md)
64+
Earlier industry systems, such as Spark GraphX, performed distributed offline graph computation. To offer a comparable streaming capability, Spark had to rely on the Spark Streaming framework. Although this combined approach can consume vertex and edge data as a stream, it still runs a full graph computation every time a calculation is triggered, which makes it hard to meet business expectations for timeliness. This approach is also known as snapshot-based graph computation.
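For contrast, a snapshot-based scheme can be sketched as follows. This is a simplified model rather than Spark GraphX code: the helper names `snapshot_wcc` and `run_windows` are hypothetical, and the "cost" counted here is simply the number of vertices scanned per window.

```python
from collections import defaultdict


def snapshot_wcc(all_edges):
    """Recompute components from scratch over every edge seen so far."""
    adj = defaultdict(set)
    for src, dst in all_edges:
        adj[src].add(dst)
        adj[dst].add(src)

    label = {}
    for start in sorted(adj):
        if start in label:
            continue
        # Depth-first flood fill: every vertex in the component gets `start`.
        stack = [start]
        label[start] = start
        while stack:
            v = stack.pop()
            for n in adj[v]:
                if n not in label:
                    label[n] = start
                    stack.append(n)
    return label


def run_windows(windows):
    """Return the number of vertices scanned per window under full recomputation."""
    history, scanned = [], []
    for window in windows:
        history.extend(window)  # the snapshot always contains all history
        scanned.append(len(snapshot_wcc(history)))
    return scanned


scanned = run_windows([[(1, 2), (3, 4)], [(5, 6)], [(2, 3)]])
```

With the same three windows, the per-window work is 4, 6, and 6 vertices: it tracks the size of the accumulated graph rather than the size of the change.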
6565

66-
Association Analysis Demo Based on GQL:
66+
Using the WCC (Weakly Connected Components) algorithm as an example, we compared the algorithmic execution time of GeaFlow and Spark solutions, with specific performance results as follows:
67+
![total_time](docs/static/img/vs_dynamic_graph_compute_perf_en.jpg)
6768

68-
```roomsql
69-
--GQL Style
70-
Match (s:student)-[sc:selectCource]->(c:cource)
71-
Return c.name
72-
;
73-
```
69+
Because GeaFlow activates only the vertices and edges involved in the current window for incremental computation, the computation completes within seconds, and the computation time for each window remains fairly stable. As the data volume increases, the amount of historical data that Spark must backtrack through during computation also grows. As long as machine capacity has not reached its limit, Spark's computation delay is positively correlated with the data volume. Under the same conditions, GeaFlow's computation time may increase slightly but can generally still be kept at the level of seconds.
7470

75-
Association Analysis Demo Based on SQL:
7671

77-
```roomsql
78-
--SQL Style
79-
SELECT c.name
80-
FROM course c JOIN selectCourse sc
81-
ON c.id = sc.targetId
82-
JOIN student s ON sc.srcId = s.id
83-
;
84-
```
72+
### Stream Computation Acceleration
73+
74+
Compared to traditional stream processing engines (such as Flink and Storm, which are based on table models), GeaFlow utilizes a graph as its data model (using a vertex-edge storage format), offering significant performance advantages in handling Join operations, especially for complex multi-hop relationships (like joins exceeding 3 hops and complex cycle searches).
75+
76+
To make a comparison, we analyzed the performance of Flink and GeaFlow using the K-Hop algorithm. K-Hop relationships refer to chains of relationships in which individuals can know each other through K intermediaries. For example, in social networks, K-Hop indicates user relationships connected through K intermediaries. In transaction analysis, K-Hop refers to the path of funds transferred consecutively K times.
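As a minimal sketch of what a K-Hop query computes, the function below expands a frontier K times over a directed edge list. The function name `k_hop` and the sample transfer data are illustrative assumptions, not part of GeaFlow or Flink.

```python
from collections import defaultdict


def k_hop(edges, start, k):
    """Return the vertices reachable from `start` in exactly k directed hops."""
    adj = defaultdict(set)
    for src, dst in edges:
        adj[src].add(dst)

    frontier = {start}
    for _ in range(k):
        # One hop: replace the frontier with all of its out-neighbors.
        frontier = {n for v in frontier for n in adj[v]}
    return frontier


# Hypothetical fund transfers: a pays b and c, b pays c, c pays d.
transfers = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
one_hop = k_hop(transfers, "a", 1)
two_hop = k_hop(transfers, "a", 2)
three_hop = k_hop(transfers, "a", 3)
```

Here funds leaving `a` can sit at `b` or `c` after one transfer, at `c` or `d` after two, and only at `d` after three.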
77+
78+
The figure below compares the time consumption of the K-Hop algorithm in Flink and GeaFlow:
79+
![total_time](docs/static/img/vs_multi_hops_en.jpg)
80+
81+
As shown in the figure above, Flink performs slightly better than GeaFlow in one-hop and two-hop scenarios. This is because, in these cases, the data volume involved in the Join calculations is relatively small, and both the left and right tables are compact, resulting in shorter traversal times. Additionally, Flink's computation framework can cache the historical results of Join operations.
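The role of table size can be made concrete with a toy model of a table-based K-Hop, where each extra hop is one more join against the edge table. This is a simplified sketch and not how Flink executes joins; `k_hop_by_join` is a hypothetical helper that only reports the row count of each intermediate result.

```python
def k_hop_by_join(edges, k):
    """Compute k-hop paths by joining the edge table with itself k-1 times.

    Returns the row count of each intermediate result, one entry per hop.
    """
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)

    paths = [(src, dst) for src, dst in edges]  # the 1-hop result is the edge table
    sizes = [len(paths)]
    for _ in range(k - 1):
        # Join: extend every path by every edge whose source matches its last vertex.
        paths = [p + (nxt,) for p in paths for nxt in by_src.get(p[-1], [])]
        sizes.append(len(paths))
    return sizes


# Hypothetical dense graph: 4 vertices, every vertex points to every other vertex.
dense = [(a, b) for a in range(4) for b in range(4) if a != b]
sizes = k_hop_by_join(dense, 4)
```

On this small dense graph the intermediate results hold 12, 36, 108, and 324 rows for one to four hops. The one-hop and two-hop inputs stay compact, which is consistent with the short traversal times described above.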
82+
8583

8684
## Contribution
87-
Thank you very much for contributing to GeaFlow, whether bug reporting, documentation improvement, or major feature development, we warmly welcome all contributions.
85+
Thank you very much for contributing to GeaFlow, whether bug reporting, documentation improvement, or major feature development, we warmly welcome all contributions.
8886

8987
For more information: [Contribution](docs/docs-en/source/9.contribution.md).
9088

README_cn.md

+20-21
@@ -40,7 +40,7 @@ GeaFlow设计论文参考:[GeaFlow: A Graph Extended and Accelerated Dataflow
4040
1. Prepare the Git, JDK8, Maven, and Docker environment.
4141
2. Download the source code: `git clone https://github.com/TuGraph-family/tugraph-analytics geaflow`
4242
3. Build the project: `./build.sh --module=geaflow --output=package`
43-
4. Run the test job: `./bin/gql_submit.sh --gql geaflow/geaflow-examples/gql/loop_detection_file_demo.sql`
43+
4. Run the test job: `./bin/gql_submit.sh --gql geaflow/geaflow-examples/gql/loop_detection_file_demo.sql`
4444

4545
Step 2: Start the console and try submitting the quick start job through the web UI
4646

@@ -55,33 +55,32 @@ GeaFlow支持DSL和API两套编程接口,您既可以通过GeaFlow提供的类
5555
* DSL application development: [DSL Development Documentation](docs/docs-cn/source/5.application-development/2.dsl/1.overview.md)
5656
* API application development: [API Development Documentation](docs/docs-cn/source/5.application-development/1.api/guid.md)
5757

58-
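## Real-time Capabilities
58+
## Performance Advantages
5959

60-
Compared with traditional stream processing engines such as Flink and Storm, which are real-time processing systems that use tables as their model, GeaFlow uses graphs as its data model and has a great performance advantage when handling Join operations, especially complex multi-hop operations such as Joins of more than 3 hops and complex loop searches.
60+
### Incremental Graph Computation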
6161

62-
[![total_time](docs/static/img/vs_join_total_time_cn.jpg)](docs/docs-cn/source/reference/vs_join.md)
62+
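GeaFlow supports incremental graph computation: on a dynamic graph (a graph that changes constantly), it can continuously run streaming, incremental iterative graph computations or traversals. When GeaFlow consumes messages from real-time middleware, the vertices involved in the real-time data of the current window are activated, which triggers iterative graph computation. In each iteration, only updated vertices need to notify their neighbors, and vertices without updates are not triggered for computation, which significantly improves the timeliness of the computation.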
6363

64-
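[Why is using graphs for relational operations more appealing than table joins?](docs/docs-cn/source/reference/vs_join.md)
64+
In the early days of the industry, there were distributed offline graph computation systems such as Spark GraphX. To support similar engine capabilities, Spark needed to rely on the Spark Streaming framework. However, although this combined approach can consume vertex and edge data as a stream, it still requires a full graph computation each time a calculation is triggered, which makes it hard for the timeliness of the computation to meet business expectations (this approach is also called the snapshot-based graph computation scheme).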
6565

66-
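Association analysis demo based on GQL:
66+
Taking the WCC algorithm as an example, we compared the algorithm execution time of the GeaFlow and Spark solutions. The performance results are as follows: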
67+
![total_time](docs/static/img/vs_dynamic_graph_compute_perf_cn.jpg)
6768

68-
```roomsql
69-
--GQL Style
70-
Match (s:student)-[sc:selectCource]->(c:cource)
71-
Return c.name
72-
;
73-
```
69+
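Because GeaFlow activates only the vertices and edges involved in the current window for incremental computation, the computation can complete within seconds, and the computation time of each window is basically stable. As the data volume grows, the historical data that Spark must backtrack through during computation grows as well. As long as machine capacity has not reached its limit, its computation delay is positively correlated with the data volume. Under the same conditions, GeaFlow's computation time may increase slightly, but it can still basically complete within seconds.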
7470

75-
Association analysis demo based on SQL:
71+
### Stream Computation Acceleration
72+
73+
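Compared with traditional stream processing engines (real-time processing systems based on the table model, such as Flink and Storm), GeaFlow uses a graph as its data model (a vertex-edge storage format) and has a significant performance advantage when handling Join operations, especially complex multi-hop operations (such as Joins of more than 3 hops and complex loop searches).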
74+
75+
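For comparison, we analyzed the performance of Flink and GeaFlow using the K-Hop algorithm. A K-Hop relationship is a chain of relationships in which people can know each other through K intermediaries. For example, in a social network, K-Hop refers to user relationships connected through K intermediaries. In transaction analysis, K-Hop refers to a path of K consecutive fund transfers.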
76+
77+
Comparison of the time consumption of the K-Hop algorithm in Flink and GeaFlow:
78+
![total_time](docs/static/img/vs_multi_hops_cn.jpg)
79+
80+
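As shown in the figure above, in the one-hop and two-hop scenarios, Flink performs slightly better than GeaFlow. This is because in these scenarios the data volume involved in the Join computation is small, and both the left and right tables are small, so traversal takes little time. In addition, Flink's computation framework can cache the historical results of Join operations.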
81+
82+
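However, in the three-hop and four-hop scenarios, the rising computational complexity causes the tables that the Join operator must traverse to expand rapidly, so computation performance drops sharply; in the four-hop scenario, the computation cannot even finish after more than a day. In contrast, GeaFlow uses an incremental algorithm based on the streaming graph, so its computation time depends only on the incremental paths and not on the historical results of the relationship computation. As a result, its performance is clearly better than Flink's.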
7683

77-
```roomsql
78-
--SQL Style
79-
SELECT c.name
80-
FROM course c JOIN selectCourse sc
81-
ON c.id = sc.targetId
82-
JOIN student s ON sc.srcId = s.id
83-
;
84-
```
8584

8685
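## Contribution
8786
Thank you very much for contributing to GeaFlow. Whether it is bug feedback, documentation improvement, or a major feature contribution, we warmly welcome it.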

community/CONTRIBUTING.md

+4-4
@@ -1,6 +1,6 @@
1-
# Contributing to TuGraph Analytics
1+
# Contributing to GeaFlow
22

3-
Thank you for considering contributing to TuGraph Analytics! We welcome contributions from the community and are grateful for your support.
3+
Thank you for considering contributing to GeaFlow! We welcome contributions from the community and are grateful for your support.
44

55
## How to Contribute
66

@@ -10,8 +10,8 @@ Fork the repository to your own GitHub account by clicking the "Fork" button at
1010
### 2. Clone the Repository
1111
Clone the forked repository to your local machine:
1212
```bash
13-
git clone https://github.com/TuGraph-family/tugraph-analytics.git
14-
cd tugraph-analytics
13+
git clone https://github.com/TuGraph-family/tugraph-analytics.git geaflow
14+
cd geaflow
1515
```
1616

1717
### 3. Create a Branch

docs/docs-cn/source/1.guide.md

+2-2
@@ -1,8 +1,8 @@
11
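# Documentation Map
2-
Here is the documentation map to help users quickly learn and use TuGraph Analytics
2+
Here is the documentation map to help users quickly learn and use GeaFlow
33

44
## Introduction
5-
**TuGraph Analytics** (alias: GeaFlow) is an OLAP graph database with [**world-class performance**](https://ldbcouncil.org/benchmarks/snb-bi/), open-sourced by Ant Group. It supports core capabilities such as trillion-scale graph storage, hybrid graph and table processing, real-time graph computation, and interactive graph analysis. It is currently widely used in scenarios such as data warehouse acceleration, financial risk control, knowledge graphs, and social networks.
5+
**GeaFlow** is an OLAP graph database with [**world-class performance**](https://ldbcouncil.org/benchmarks/snb-bi/), open-sourced by Ant Group. It supports core capabilities such as trillion-scale graph storage, hybrid graph and table processing, real-time graph computation, and interactive graph analysis. It is currently widely used in scenarios such as data warehouse acceleration, financial risk control, knowledge graphs, and social networks.
66

77
For more information about GeaFlow: [GeaFlow Introduction](2.introduction.md)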
88

docs/docs-cn/source/7.deploy/4.collaborate_with_g6vp.md

+5-5
@@ -12,7 +12,7 @@
1212
bin/socket.sh 9003 GI
1313
```
1414

15-
When the following is output, Tugraph Analytics is ready to establish a connection
15+
When the following is output, GeaFlow is ready to establish a connection
1616

1717
<img width="610" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/a25ed6ba-4fb9-4db1-9325-ee2f26a4337f">
1818

@@ -40,7 +40,7 @@ bin/socket.sh 9003 GI
4040

4141
<img width="328" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/5246536b-ddb0-4c3c-91fb-e941101e272a">
4242

43-
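The Tugraph Analytics side also outputs the following after the connection is established:
43+
The GeaFlow side also outputs the following after the connection is established: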
4444

4545
<img width="616" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/46be1e88-9c93-430e-92cc-db8024691095">
4646

@@ -51,19 +51,19 @@ Tugraph Analytics 端建立连接后同样会输出以下内容:
5151
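* Method 1: Enter vertex and edge information in the input box
5252
* Method 2: Demonstrate using built-in data
5353

54-
> Both methods essentially call Tugraph Analytics for real-time computation, but Method 2 skips the manual input.
54+
> Both methods essentially call GeaFlow for real-time computation, but Method 2 skips the manual input.
5555
5656
Here we use the built-in data for a quick demonstration. Click [Options] and select `Add Points`; 7 points appear on the canvas. Then select `Add Edges`. The added records are shown in the dialog above.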
5757

5858
<img width="332" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/7ca76607-41a1-4afe-9427-cf7599de6889">
5959

60-
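Similarly, the Tugraph Analytics terminal outputs operation information in real time and automatically starts the computation task.
60+
Similarly, the GeaFlow terminal outputs operation information in real time and automatically starts the computation task.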
6161

6262
<img width="611" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/d8d0d73a-4c07-4ecd-bcac-4633a742933a">
6363

6464
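### 5. Result Presentation
6565

66-
After Tugraph Analytics completes the loop detection computation task, it automatically returns the detection result.
66+
After GeaFlow completes the loop detection computation task, it automatically returns the detection result.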
6767

6868
<img width="324" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/ba343acf-812a-4df5-8da4-ff70e0b2531d">
6969

docs/docs-cn/source/reference/vs_join.md

+3-3
@@ -51,11 +51,11 @@ GeaFlow提供融合GQL和SQL样式的查询语言,这是一种图表一体的
5151

5252
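The GeaFlow DSL engine layer will also support automatically converting Joins in SQL into GQL for execution. Users can freely mix SQL-style and GQL-style queries and perform graph matching, graph algorithms, and table queries at the same time.
5353

54-
## Streaming Graph Computing Engine TuGraph-Analytics
54+
## Streaming Graph Computing Engine GeaFlow
5555

56-
GeaFlow (brand name TuGraph-Analytics) is a distributed streaming graph computing engine open-sourced by Ant Group. Within Ant Group, it is already widely used in many scenarios such as data warehouse acceleration, financial risk control, knowledge graphs, and social networks.
56+
GeaFlow is a distributed streaming graph computing engine open-sourced by Ant Group. Within Ant Group, it is already widely used in many scenarios such as data warehouse acceleration, financial risk control, knowledge graphs, and social networks.
5757

58-
TuGraph-Analytics was officially open-sourced in June 2023, opening up its core capability of unified stream and batch computation with the graph as the data model. Compared with traditional stream processing engines such as Flink and Storm, which are real-time processing systems that use tables as their model, GeaFlow takes its self-developed graph storage as the foundation, the unified stream-batch computing engine as its spear, and the fused GQL/SQL DSL language as its banner, giving it a great advantage in complex multi-degree relational operations.
58+
GeaFlow was officially open-sourced in June 2023, opening up its core capability of unified stream and batch computation with the graph as the data model. Compared with traditional stream processing engines such as Flink and Storm, which are real-time processing systems that use tables as their model, GeaFlow takes its self-developed graph storage as the foundation, the unified stream-batch computing engine as its spear, and the fused GQL/SQL DSL language as its banner, giving it a great advantage in complex multi-degree relational operations.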
5959

6060

6161
![insert_throuput](../../../static/img/query_throuput_cn.jpg)

docs/docs-en/source/1.guide.md

+1-1
@@ -2,7 +2,7 @@
22
Here is the documentation map to help users quickly learn and use geaflow.
33

44
## Introduction
5-
**TuGraph Analytics** (alias: GeaFlow) is the [**fastest**](https://ldbcouncil.org/benchmarks/snb-bi/) open-source OLAP graph database developed by Ant Group. It supports core capabilities such as trillion-level graph storage, hybrid graph and table processing, real-time graph computation, and interactive graph analysis. Currently, it is widely used in scenarios such as data warehousing acceleration, financial risk control, knowledge graph, and social networks.
5+
**GeaFlow** is the [**fastest**](https://ldbcouncil.org/benchmarks/snb-bi/) open-source OLAP graph database developed by Ant Group. It supports core capabilities such as trillion-level graph storage, hybrid graph and table processing, real-time graph computation, and interactive graph analysis. Currently, it is widely used in scenarios such as data warehousing acceleration, financial risk control, knowledge graph, and social networks.
66

77
For more information about GeaFlow: [GeaFlow Introduction](2.introduction.md)
88

docs/docs-en/source/7.deploy/4.collaborate_with_g6vp.md

+5-5
@@ -12,7 +12,7 @@ Reference [Quick Start](https://github.com/TuGraph-family/tugraph-analytics/blob
1212
bin/socket.sh 9003 GI
1313
```
1414

15-
When the terminal outputs the following, Tugraph Analytics is ready to establish a connection.
15+
When the terminal outputs the following, GeaFlow is ready to establish a connection.
1616

1717
<img width="610" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/a25ed6ba-4fb9-4db1-9325-ee2f26a4337f">
1818

@@ -40,7 +40,7 @@ By default, a connection is automatically established after the Loop Detection D
4040

4141
<img width="328" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/5246536b-ddb0-4c3c-91fb-e941101e272a">
4242

43-
Tugraph Analytics will also output the following after the connection is established:
43+
GeaFlow will also output the following after the connection is established:
4444

4545
<img width="616" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/46be1e88-9c93-430e-92cc-db8024691095">
4646

@@ -51,19 +51,19 @@ Loop detection Demo provides two ways to interact:
5151
* Method 1: Enter the vertex and edge information in the input box
5252
* Method 2: Demonstrate using built-in data
5353

54-
> Both methods essentially call Tugraph Analytics for real-time calculations, but Method 2 omits the manual input process.
54+
> Both methods essentially call GeaFlow for real-time calculations, but Method 2 omits the manual input process.
5555
5656
Here we use the built-in data for a quick demonstration. Click [Options] and select 'Add Points'; 7 points appear on the canvas. Then select 'Add Edges'. The added records are shown in the dialog above.
5757

5858
<img width="332" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/7ca76607-41a1-4afe-9427-cf7599de6889">
5959

60-
Similarly, the Tugraph Analytics terminal outputs operational information in real time and automatically starts computation tasks.
60+
Similarly, the GeaFlow terminal outputs operational information in real time and automatically starts computation tasks.
6161

6262
<img width="611" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/d8d0d73a-4c07-4ecd-bcac-4633a742933a">
6363

6464
### 5. Result Presentation
6565

66-
After the loop detection calculation task is completed, Tugraph Analytics automatically returns the detection results.
66+
After the loop detection calculation task is completed, GeaFlow automatically returns the detection results.
6767

6868
<img width="324" alt="image" src="https://github.com/TuGraph-family/tugraph-analytics/assets/25787943/ba343acf-812a-4df5-8da4-ff70e0b2531d">
6969
