Skip to content

[Java] Provide light-weight arrow APIs #21675

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
asfimport opened this issue Apr 23, 2019 · 5 comments
Closed

[Java] Provide light-weight arrow APIs #21675

asfimport opened this issue Apr 23, 2019 · 5 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Apr 23, 2019

We are trying to incorporate Apache Arrow to Apache Flink runtime. We find Arrow an amazing library, which greatly simplifies the support of columnar data format.

However, for many scenarios, we find the performance unacceptable. Our investigation shows the reason is that, there are too many redundant checks and computations in Arrow API.

For example, the following figures shows that in a single call to Float8Vector.get(int) method (this is one of the most frequently used APIs in Flink computation),  there are 20+ method invocations.

image-2019-04-23-15-19-34-187.png

 

There are many other APIs with similar problems. We believe that these checks will make sure of the integrity of the program. However, it also impacts performance severely. For our evaluation, the performance may degrade by two or three orders of magnitude slower, compared to access data on heap memory. 

We think at least for some scenarios, we can give the responsibility of integrity check to application owners. If they can be sure all the checks have been passed, we can provide some light-weight APIs and the inherent high performance, to them.

In the light-weight APIs, we only provide minimum checks, or avoid checks at all. The application owner can still develop and debug their code using the original heavy-weight APIs. Once all bugs have been fixed, they can switch to light-weight APIs in their products and enjoy the consequent high performance.

 

Reporter: Liya Fan / @liyafan82
Assignee: Liya Fan / @liyafan82

Related issues:

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-5200. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
This is related to at least a few other JIRAs. It would be great to start to set up some microbenchmarks so that it is possible to quantify the improvements in various use cases, otherwise we're shooting in the dark. I think having additional "unsafe" methods that skip checks that prevent Java off-heap memory-related segfaults is a good idea

@asfimport
Copy link
Collaborator Author

Liya Fan / @liyafan82:
Sounds reasonable. Thanks a lot for your comments. 

We have opened a new Jira (ARROW-5209) to setup some performance benchmarks from our SQL engine, which is going to be made open source. The benchmarks are extracted by running an open SQL benchmark TPC-H. 

@asfimport
Copy link
Collaborator Author

Liya Fan / @liyafan82:
This is the (source/byte code/assembly) code generated by the original Arrow API for Float8Vector. safe_nocheck.jpg

And this is the code generated by the unsafe API for Float8Vector. unsafe.jpg

It can be observed that the amount of (source/byte code/assembly) code generated by unsafe API is smaller.

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
[~fan_li_ya] I think this can be closed as won't fix and so can the corresponding pull request?

@asfimport
Copy link
Collaborator Author

Liya Fan / @liyafan82:
[~[email protected]] Closed. Thanks for your kind reminder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants