Short-term plan for this binding #2

Open
1 of 4 tasks
waynexia opened this issue Apr 8, 2024 · 11 comments

@waynexia
Collaborator

waynexia commented Apr 8, 2024

Task list for the first milestone.

The first milestone should:

  • Be generally available for common SQL queries.
  • Have fundamental API docs and be released to npm automatically.
  • Have the basic config API for essential configs.
  • Define an (unstable) data API (maybe on top of https://arrow.apache.org/docs/js/ ?)

Tasks:

  • Expose configuration APIs (SessionContext)
  • Define and expose RecordBatch 1368bc5
  • Support GCS and Azure Blob
  • Support more kinds of queries (fix them in the "upstream" datafusion)

waynexia pinned this issue Apr 8, 2024

@waynexia
Collaborator Author

In addition to the to-do list above, I also want to gradually bring this project to a relatively stable release rhythm, such as regularly synchronizing upstream updates and publishing releases. This can ensure that the project keeps a basic level of vitality.

My personal goal for this project is to make it an official part of DataFusion (see apache/datafusion#13815), just like the Python binding, so that people can easily use DataFusion for computations in more scenarios. At the same time, this project already has several application scenarios (datafusion-wasm-playground, parquet-viewer, or the ongoing apache/datafusion#13818), so a basic release rhythm would also make it easier for existing projects to use it.

@qstommyshu
Collaborator

Hi @waynexia ,

This is Tommy. I’m a master's student at the Georgia Institute of Technology. My interests span DBMS, OS, compilers, etc., and DataFusion really caught my eye. I would like to work on Robust WASM Support for GSoC 2025, and I’ve expressed my interest under "[EPIC] A collection of tickets for improved WASM support in DataFusion". Currently, I’m resolving issues in DataFusion to familiarize myself with the codebase, and I plan to look more closely at datafusion-wasm-bindings (or other WASM-related items) after my current set of PRs is completed. I think this issue may be one of the things I need to tackle for GSoC 2025.

I’ve put together a rough draft of my proposal. Just a heads-up: GitHub shows the state of the repo at the time of commenting, so it might not reflect my latest updates. I’d really appreciate any feedback or suggestions when you get the chance!

(It’s still an early version, so the goals and timeline might shift a bit as I get a better handle on the project.)

I also reached out to @alamb—he was super kind, but mentioned he’s not too familiar with this area and might not be able to give detailed feedback.

Thanks a lot for your time, and I’d love to hear any thoughts you have!

@alamb
Contributor

alamb commented Apr 1, 2025

Maybe @XiangpengHao has some idea of what would be useful

@XiangpengHao

Hi @qstommyshu, your plan looks good to me! I'd suggest making it more concrete and more focused, though.

Here are my two cents:

  1. My biggest hope for datafusion-wasm is to have a datafusion playground directly on the front page of the datafusion website:
Image

Making this demo is already quite a lot of work and fun, making it work smoothly with different storages, file formats, browsers can be especially challenging.

  2. Study the web-specific nuances like binary size and browser compatibility. Right now we have a working demo, but it's not well studied what is supported on which browser. The deliverable is more like a blog post to teach people more about using DataFusion on the web.

  3. Better documentation. We have a few working demos, but mostly from insiders. It would be great if we had a user-friendly onboarding guide for JavaScript programmers to use datafusion without knowledge of how datafusion works.

@qstommyshu
Collaborator

Thank you @XiangpengHao for your reply,

Making this demo is already quite a lot of work and fun, making it work smoothly with different storages, file formats, browsers can be especially challenging.

Got it. I was thinking since there's already a DuckDB version to reference, it probably wouldn't take too long to build. But I definitely underestimated how much work goes into making a WASM shell playground — especially with all the compatibility stuff across different browsers and storage options. I think I’ll need to take another look at how much time this part might take.

The reason I want to start with the live playground is that I feel it can be built somewhat independently from the current WASM bindings. My idea is to get a basic shell working with what we have now, and once the bindings are updated, we can plug those changes into the shell pretty easily. Is my assumption about the WASM playground and the WASM binding correct?

Study the web-specific nuances like binary size and browser compatibility. Right now we have a working demo, but it's not well studied what is supported on which browser. The deliverable is more like a blog post to teach people more about using DataFusion on the web.

Can you please point me to a working GitHub repo so that I can play around with it and see how I can tweak it? Having something to look at would be very helpful!

Better documentation. We have a few working demos, but mostly from insiders. It would be great if we had a user-friendly onboarding guide for JavaScript programmers to use datafusion without knowledge of how datafusion works.

Definitely! I also agree that good documentation is key to letting newcomers learn about the project. Documentation will be an important part of my GSoC proposal.

@XiangpengHao

Is my assumption about the wasm playground and the wasm binding correct?

point me to a working GitHub repo so that I can play around with it and see how I can tweak it?

Yes I think the wasm playground is a good start: https://github.com/datafusion-contrib/datafusion-wasm-playground

@qstommyshu
Collaborator

qstommyshu commented Apr 4, 2025

Hi @XiangpengHao, @alamb, and @waynexia,

I've been tinkering with @waynexia's code and datafusion. Now I have a pretty good grasp of how things work and how to get started on the live playground.

I got something working locally:
Image

I realize my previous proposal might've skipped a few details. I'll update it tomorrow and submit a first draft (the GSoC site says I can resubmit as many times as I want before the deadline), and I'll keep refining it as I dive deeper into each topic.

@pranavJibhakate
Contributor

Hi @waynexia and @alamb, which of the two should I focus more of my time on: adding support for compiling to wasm-wasi or adding support for WASM UDFs?

@waynexia
Collaborator Author

waynexia commented Apr 9, 2025

Hi @qstommyshu, sorry for the late reply.

As @XiangpengHao mentioned, the datafusion playground can be a great component to get people involved in the datafusion project. The two major parts covered in your proposal (the WASM binding and a frontend playground) look good (thanks for your proposal ❤️). I suggest further subdividing them as follows:

  • Binding
    • Wrap and expose datafusion APIs. Datafusion is a flexible engine with various APIs at different layers. To my understanding, we'll mainly focus on the uppermost APIs like configuration, querying from a SQL string or DataFrame, catalog operations, etc. (cc @alamb, please check if I'm missing anything). The WASM binding can be used directly by end users as a library, similar to the Python binding. For our tier-1 target, I recommend wasm32-unknown-unknown.
    • Develop experience and toolchains, as @XiangpengHao mentioned earlier. Compiling and binding a large project like datafusion to WASM lacks a mature SOP compared to Python. It takes newcomers (like me) a huge amount of time, which raises the contribution bar significantly. I wish we could make it much easier to get started with this project.
    • Tests. These are more akin to behavior tests—ensuring they always compile and work for simple SQLs since logic and results are covered within datafusion itself. The OpenDAL project's approach is quite similar; we could learn much from them.
  • Playground
    • A live playground in https://datafusion.apache.org. That screenshot is what I (and certainly many others) long for.
    • A JS/TS binding. This might closely resemble the WASM bindings, since wasm-bindgen will generate a JavaScript project from the Rust code that can be published to NPM (we do have one). This deserves its own point because (1) it's not essential for using datafusion in WASM (@XiangpengHao's parquet-viewer, for example, is in pure Rust), and (2) we could create a friendlier API atop the raw bindings, like a React component <DataFusion data="https://some.example/file.parquet" />.
    • UI for the playground (in https://datafusion.apache.org). This is derived from the two points above. As a frontend project (I mean the playground), we definitely need to do some UI work.

I want to clarify that not all the tasks listed above need to be included in this GSoC project. I'm just posting my thoughts and we'll pick some to implement. Maybe @XiangpengHao @alamb or others also have some points.
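
To make the "friendlier API atop raw bindings" idea a bit more concrete, here is a minimal TypeScript sketch of the layering. Everything in it is an assumption for illustration: `RawBinding`, `FriendlyDataFusion`, `queryFile`, and `FakeBinding` are hypothetical names, not the actual API of the published package, and the calls are synchronous only to keep the sketch short (a real binding would most likely be async).

```typescript
// Hypothetical shape of the raw, wasm-bindgen-style surface.
// This is an assumption for illustration, not the real binding API.
interface RawBinding {
  registerTable(tableName: string, url: string): void;
  sql(query: string): Array<Record<string, unknown>>;
}

// Friendlier layer: one call that registers a file and queries it,
// so callers never deal with table registration themselves.
class FriendlyDataFusion {
  private raw: RawBinding;
  private nextId = 0;

  constructor(raw: RawBinding) {
    this.raw = raw;
  }

  queryFile(
    url: string,
    buildQuery: (tableName: string) => string,
  ): Array<Record<string, unknown>> {
    const tableName = `t${this.nextId++}`;
    this.raw.registerTable(tableName, url);
    return this.raw.sql(buildQuery(tableName));
  }
}

// In-memory stand-in so the sketch runs without any WASM module.
// It only records what it was asked to do.
class FakeBinding implements RawBinding {
  tables = new Map<string, string>();

  registerTable(tableName: string, url: string): void {
    this.tables.set(tableName, url);
  }

  sql(query: string): Array<Record<string, unknown>> {
    return [{ query, registeredTables: this.tables.size }];
  }
}

const fakeBinding = new FakeBinding();
const friendly = new FriendlyDataFusion(fakeBinding);
const friendlyRows = friendly.queryFile(
  "https://some.example/file.parquet",
  (t) => `SELECT count(*) FROM ${t}`,
);
console.log(friendlyRows);
```

A React component like the one mentioned above could then be a thin wrapper around something like `queryFile`, which is one reason to keep the friendly layer separate from the raw bindings.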

@waynexia
Collaborator Author

waynexia commented Apr 9, 2025

Hi @waynexia and @alamb, which of the two should I focus more of my time on: adding support for compiling to wasm-wasi or adding support for WASM UDFs?

From my perspective, both have similar priorities. Running a WASM UDF requires a WASM runtime, and wasm-wasi is a major target for running WASM functions in such a runtime. To make it easier to write WASM functions, we may need to expose some low-level util functions or core structs to wasm-wasi so they can be referenced in function implementations.
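
As a rough illustration of "running a WASM UDF requires a WASM runtime", the sketch below uses the standard WebAssembly JavaScript API as the host runtime (not a WASI runtime) and a tiny hand-assembled module that exports `add(i32, i32) -> i32`. The names `BinaryUdf` and `applyBinaryUdf` are hypothetical, and how such a function would actually be registered with DataFusion is not shown.

```typescript
// Hand-assembled WASM module exporting add(i32, i32) -> i32.
const ADD_WASM_BYTES = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // code: local.get 0, local.get 1, i32.add
]);

// Hypothetical host-side view of a two-argument scalar UDF.
type BinaryUdf = (a: number, b: number) => number;

// The host runtime compiles and instantiates the module synchronously,
// which is fine for a module this small.
const wasmModule = new WebAssembly.Module(ADD_WASM_BYTES);
const wasmInstance = new WebAssembly.Instance(wasmModule, {});
const exportedAdd = wasmInstance.exports["add"];
if (typeof exportedAdd !== "function") {
  throw new Error("expected the module to export a function named add");
}
const addUdf = exportedAdd as BinaryUdf;

// Apply the UDF element-wise over two equally long columns.
// Note: the WASM function works on 32-bit integers, so large values wrap.
function applyBinaryUdf(udf: BinaryUdf, left: number[], right: number[]): number[] {
  if (left.length !== right.length) {
    throw new Error("column length mismatch");
  }
  return left.map((value, i) => udf(value, right[i]));
}

const udfResult = applyBinaryUdf(addUdf, [1, 2, 3], [10, 20, 30]);
console.log(udfResult); // [11, 22, 33]
```

A real UDF would need to exchange Arrow buffers rather than single numbers, which is where exposing the low-level util functions or core structs mentioned above would come in.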

@qstommyshu
Collaborator

qstommyshu commented Apr 9, 2025

Hi @waynexia ,

Thanks for your thoughtful reply on my proposal ❤️. I submitted an updated version to GSoC before the deadline yesterday. I think the updated proposal covers everything you mentioned.

I want to clarify that not all the tasks listed above need to be included in this GSoC project.

Of course, it would be hard to provide full support for all these items in just one summer.

My core goal is to:

A live playground in datafusion.apache.org.

  1. Create a well-designed playground for newcomers to get involved in datafusion (with unit tests written alongside my development). Before starting this part, I think the key point is to communicate the playground design: what styles/features do we want when embedding the live playground in the datafusion doc page? I think the terminal style used for duckdb has some room to improve in terms of UX. I also told @alamb that I will create an issue/discussion in datafusion to gather some ideas, since I think posting over there will get more attention. Will update this comment once the issue is created.
    Here is the discussion: Gathering Ideas for WASM web playground design apache/datafusion#15660

    @XiangpengHao's parquet-viewer, for example, is in pure Rust

    If using pure Rust is preferred, I can definitely do it with dioxus or leptos. I've been waiting for a good reason to try them.

    we could create a friendlier API atop the raw bindings, like a React component <DataFusion data="https://some.example/file.parquet" />

    I didn't think about this before, but it would certainly be a great idea! Maybe we can do it as a stretch goal.

I gained some frontend development experience through my previous internships, so the coding part should not be hard for me.

Wrap and expose datafusion APIs.

  1. Wrap and expose the datafusion APIs that should be prioritized (e.g. parquet and csv support), as it is probably impossible to wrap all datafusion APIs in a short period of time.
    I will also provide documentation so that future contributors can get involved.
    I also plan to write a series of blog posts about my contribution experience in GSoC, which should also serve as introductory material for new contributors.

Develop experience and toolchains

  1. Set up a standardized WASM compilation SOP to lower the bar for developers to build datafusion into WASM.

Tests. These are more akin to behavior tests—ensuring they always compile and work for simple SQLs since logic and results are covered within datafusion itself. The OpenDAL project's approach is quite similar; we could learn much from them.

  1. Create GitHub Actions workflows and add some basic tests to make sure the code always compiles and works for simple SQL queries. Also set up a test contribution standard to ensure future test contributions follow a good pattern.
    Thanks for the OpenDAL reference. I'll check out their testing approach; it looks like they also support different bindings, so we can learn from their compilation SOP as well.
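
A behavior (smoke) test of this kind could look roughly like the TypeScript sketch below. It is only a sketch under assumptions: `SqlEngine`, `SmokeCase`, `runSmoke`, and `stubEngine` are hypothetical names, and the stub stands in for the real WASM binding so the example runs on its own. The point is that the tests only check that simple queries run and return the expected number of rows, since query logic is already covered within datafusion itself.

```typescript
// Hypothetical minimal view of the engine under test.
interface SqlEngine {
  sql(query: string): unknown[][];
}

interface SmokeCase {
  query: string;
  expectRows: number;
}

// Run every case and collect human-readable failures instead of
// stopping at the first one, so a CI log shows the full picture.
function runSmoke(engine: SqlEngine, cases: SmokeCase[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    try {
      const got = engine.sql(c.query).length;
      if (got !== c.expectRows) {
        failures.push(`${c.query}: expected ${c.expectRows} rows, got ${got}`);
      }
    } catch (e) {
      failures.push(`${c.query}: threw ${String(e)}`);
    }
  }
  return failures;
}

// Stub engine for the sketch: it only understands `SELECT <integer>`.
const stubEngine: SqlEngine = {
  sql(query: string): unknown[][] {
    const match = /^SELECT (\d+)$/.exec(query);
    if (!match) {
      throw new Error("unsupported query");
    }
    return [[Number(match[1])]];
  },
};

const smokeFailures = runSmoke(stubEngine, [
  { query: "SELECT 1", expectRows: 1 },
  { query: "SELECT * FROM missing", expectRows: 0 },
]);
console.log(smokeFailures);
```

In CI, a non-empty failure list would fail the job; the case list itself can then become the place where the "test contribution standard" is enforced.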

There are also some stretch goals in my proposal; I will worry about them after these goals are achieved.

5 participants