
feat(databricks): add the databricks backend #10223

Merged
cpcloud merged 2 commits into ibis-project:main from databricks on Nov 3, 2024

Conversation

@cpcloud (Member) commented Sep 25, 2024

Description of changes

Add support for the databricks backend.

Notes

  • The PySpark compiler is almost entirely reused. Naturally there are a couple
    of cases where things differ, and those get overridden in the Databricks compiler.
  • Databricks seems to be aggressive about turning SQL NULLs into NaNs, which defeats
    a number of array and map tests that expect None in the output from
    to_pandas/execute (a small illustrative sketch follows this list).
  • Naturally, Databricks pins pyarrow to <17 (one version behind the latest) and numpy to <2. It's not as bad as it was with the snowflake connector, but we shouldn't merge this until we can figure out a sane workaround to avoid the pin in CI. I worked around this by using databricks-sql-connector-core, which only pins pyarrow if you depend on it with its pyarrow extra. Despite the version bounds, the latest version of pyarrow (18 at the time of writing) passes our backend test suite with the databricks backend.
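
For illustration only (not part of this PR's diff), here is a minimal connection-and-execution sketch of the NULL-vs-NaN note above; the connect parameter names are assumptions modeled on databricks-sql-connector's sql.connect() and may not match the final ibis API exactly.

import os

import ibis

# Hedged sketch: the connect parameters below are assumptions modeled on
# databricks-sql-connector; the environment variable names are illustrative.
con = ibis.databricks.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)

# An array column containing a SQL NULL: most backends round-trip the NULL as None
# in to_pandas()/execute() output, while Databricks may surface it as NaN instead.
t = ibis.memtable({"xs": [[1.0, None, 3.0]]})
print(con.execute(t))  # assumes memtables are supported on this backend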

Issues closed

Resolves #9248.

@cpcloud cpcloud added this to the 10.0 milestone Sep 25, 2024
@cpcloud cpcloud added the feature and new backend labels Sep 25, 2024
@github-actions github-actions bot added the tests, ci, and dependencies labels Sep 25, 2024
@cpcloud (Member, Author) commented Sep 25, 2024

Tests won't run until merge due to cloudiness.

I'll post the results from a local run, and then fix the CI (if needed) once this is merged.

@github-actions github-actions bot added the sql label Sep 27, 2024
@cpcloud cpcloud force-pushed the databricks branch 4 times, most recently from d229f75 to 279322d on September 28, 2024 11:09
@github-actions github-actions bot added the flink label Sep 28, 2024
@github-actions github-actions bot added the clickhouse and mssql labels Sep 28, 2024
@cpcloud cpcloud force-pushed the databricks branch 3 times, most recently from b14d84e to c833f9b on September 29, 2024 11:34
@cpcloud cpcloud force-pushed the databricks branch 2 times, most recently from ab6762b to 518e3b1 on October 7, 2024 14:07
@cpcloud (Member, Author) commented Oct 7, 2024

This is passing locally:

…/ibis on  databricks is 📦 v9.5.0 via 🐍 v3.12.5 via ❄️  impure (ibis-3.12.5-env)
❯ pytest -m databricks -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
1569 passed, 130 skipped, 210 xfailed, 2 xpassed in 273.60s (0:04:33)

@gforsyth (Member) commented Oct 7, 2024

We probably need a policy in place for what happens to backends with no local CI option if financial sponsorship for those CI runs goes away.

@cpcloud (Member, Author) commented Oct 8, 2024

The only reasonable policy I can think of right now is that if funding dries up, the backend goes into backend purgatory, where the CI stops running and support becomes best effort.

Thoughts?

@gforsyth (Member) commented Oct 8, 2024

That seems reasonable to me. And I think, if that happens, we add badges to that effect in the readme and on the backend docs page, effectively severing the backend from any semantic versioning guarantees (while still making a best effort to maintain them).

@cpcloud (Member, Author) commented Oct 17, 2024

@gforsyth Where should we put this policy?

@gforsyth (Member) replied:

How about a page in the docs in the Backends dropdown alongside the operations support matrix?

@cpcloud cpcloud force-pushed the databricks branch 4 times, most recently from 2662993 to 47a7830 on October 20, 2024 13:22
@deepyaman deepyaman mentioned this pull request Oct 22, 2024
@cpcloud cpcloud force-pushed the databricks branch 4 times, most recently from 0976530 to 7968c3d on October 31, 2024 13:36
@github-actions github-actions bot added the docs label Oct 31, 2024
@cpcloud (Member, Author) commented Oct 31, 2024

@gforsyth Finally got around to adding that policy page.

@cpcloud cpcloud force-pushed the databricks branch 4 times, most recently from 71fe2ef to f9bf147 on October 31, 2024 15:15
@cpcloud (Member, Author) commented Oct 31, 2024

Databricks backend passes the backend test suite:

❯ pytest -m databricks -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
1584 passed, 130 skipped, 211 xfailed, 2 xpassed in 209.01s (0:03:29)

@cpcloud cpcloud requested a review from gforsyth October 31, 2024 16:00
@cpcloud (Member, Author) commented Oct 31, 2024

Snowflake and BigQuery both passing:

…/ibis on  databricks is 📦 v9.5.0 via 🐍 v3.12.7 via ❄️  impure (ibis-3.12-env)
❯ pytest -m snowflake -n 8 --dist loadgroup --snapshot-update -q && pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
1785 passed, 10 skipped, 226 xfailed in 232.21s (0:03:52)
bringing up nodes...
1757 passed, 132 skipped, 356 xfailed, 4 xpassed in 388.78s (0:06:28)

@gforsyth (Member) commented:

I'll try to take a look over this tomorrow!

@gforsyth (Member) left a comment

Yeah, this looks good to me, modulo the insanity of Databricks calling something void -- just fix the conflict on poetry.lock and then this can go in.

@@ -50,6 +50,7 @@
"trino": "unknown",
"postgres": "null",
"risingwave": "null",
"databricks": "void",

void!?
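
For context on the name, a hedged illustration (not from this PR): Databricks SQL, like Spark SQL, reports the type of a bare NULL literal as "void", which appears to be what the mapping above records. The connection parameters repeat the assumptions from the earlier sketch.

import os

import ibis

# Hedged sketch: connect parameters are assumptions modeled on databricks-sql-connector.
con = ibis.databricks.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)

# typeof(NULL) is expected to report "void" under Spark/Databricks SQL semantics.
print(con.sql("SELECT typeof(NULL) AS null_type").to_pandas())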

@cpcloud cpcloud added the databricks label Nov 3, 2024
@cpcloud cpcloud merged commit 0733705 into ibis-project:main Nov 3, 2024
93 checks passed
@cpcloud cpcloud deleted the databricks branch November 3, 2024 12:50
Labels
ci (Continuous Integration issues or PRs), clickhouse (The ClickHouse backend), databricks (The Databricks backend), dependencies (Issues or PRs related to dependencies), docs (Documentation related issues or PRs), feature (Features or general enhancements), flink (Issues or PRs related to Flink), mssql (The Microsoft SQL Server backend), new backend (PRs or issues related to adding new backends), sql (Backends that generate SQL), tests (Issues or PRs related to tests)
Development

Successfully merging this pull request may close these issues:

feat: add databricks support (#9248)
2 participants