Commit b408227

[DOCS] Improve the Databricks setup guide (#1582)
1 parent a66d4e7

4 files changed: +40 −15 lines changed


docs/setup/databricks.md

Lines changed: 28 additions & 12 deletions
@@ -1,4 +1,4 @@
-Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](../maven-coordinates). Databricks Spark and Apache Spark's compatibility can be found here: https://docs.databricks.com/en/release-notes/runtime/index.html
+Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](maven-coordinates.md). Databricks Spark and Apache Spark's compatibility can be found [here](https://docs.databricks.com/en/release-notes/runtime/index.html).
 
 ## Community edition (free-tier)
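The version-postfix advice in the line above can be made concrete with a small sketch. This is a hypothetical helper, not part of Sedona; the artifact naming simply follows the Maven coordinates shown elsewhere in this diff:

```python
# Hypothetical helper illustrating how the Spark and Scala version postfixes
# combine into a Sedona Maven coordinate. Not part of Sedona itself.
def sedona_coordinate(spark_version: str, scala_version: str, sedona_version: str) -> str:
    # Sedona artifacts are published per Spark minor line (e.g. "3.4")
    # and Scala line (e.g. "2.12").
    spark_line = ".".join(spark_version.split(".")[:2])
    return f"org.apache.sedona:sedona-spark-shaded-{spark_line}_{scala_version}:{sedona_version}"

# A Databricks runtime on Spark 3.4.x with Scala 2.12 would need:
print(sedona_coordinate("3.4.1", "2.12", "1.5.1"))
# → org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.1
```

The point is that the coordinate encodes the Spark *minor* line and Scala line, which is why a Databricks runtime upgrade can require a different jar.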

@@ -8,18 +8,18 @@ You just need to install the Sedona jars and Sedona Python on Databricks using D
 
 1) From the Libraries tab install from Maven Coordinates
 
-```
-org.apache.sedona:sedona-spark-shaded-3.0_2.12:{{ sedona.current_version }}
-org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
-```
+```
+org.apache.sedona:sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}
+org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
+```
 
 2) For enabling python support, from the Libraries tab install from PyPI
 
-```
-apache-sedona
-keplergl==0.3.2
-pydeck==0.8.0
-```
+```
+apache-sedona=={{ sedona.current_version }}
+keplergl==0.3.2
+pydeck==0.8.0
+```
 
 ### Initialize
 
@@ -66,10 +66,15 @@ curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/geotools-wrapper-{
 curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/sedona-spark-shaded-3.4_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.4_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.4_2.12-{{ sedona.current_version }}.jar"
 ```
 
+Of course, you can also do the steps above manually.
+
 ### Create an init script
 
 !!!warning
-    Starting from December 2023, Databricks has disabled all DBFS based init script (/dbfs/XXX/<script-name>.sh). So you will have to store the init script from a workspace level (`/Users/<user-name>/<script-name>.sh`) or Unity Catalog volume (`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`). Please see https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui
+    Starting from December 2023, Databricks has disabled all DBFS based init script (/dbfs/XXX/<script-name>.sh). So you will have to store the init script from a workspace level (`/Workspace/Users/<user-name>/<script-name>.sh`) or Unity Catalog volume (`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`). Please see [Databricks init scripts](https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui) for more information.
+
+!!!note
+    If you are creating a Shared cluster, you won't be able to use init scripts and jars stored under `Workspace`. Please instead store them in `Volumes`. The overall process should be the same.
 
 Create an init script in `Workspace` that loads the Sedona jars into the cluster's default jar directory. You can create that from any notebook by running:
 
@@ -86,13 +91,14 @@ cat > /Workspace/Shared/sedona/sedona-init.sh <<'EOF'
 # File: sedona-init.sh
 #
 # On cluster startup, this script will copy the Sedona jars to the cluster's default jar directory.
-# In order to activate Sedona functions, remember to add to your spark configuration the Sedona extensions: "spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
 
 cp /Workspace/Shared/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
 
 EOF
 ```
 
+Of course, you can also do the steps above manually.
+
 ### Set up cluster config
 
 From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Spark`) activate the Sedona functions and the kryo serializer by adding to the Spark Config
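The concrete config values sit outside this hunk, but based on the keys the guide names (`spark.sql.extensions`, `spark.serializer`, `spark.kryo.registrator`) and the extensions string visible in the removed init-script comment, the Spark Config box typically looks something like this sketch (verify the class names against your Sedona version):

```
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```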
@@ -120,3 +126,13 @@ pydeck==0.8.0
 
 !!!tips
     You need to install the Sedona libraries via init script because the libraries installed via UI are installed after the cluster has already started, and therefore the classes specified by the config `spark.sql.extensions`, `spark.serializer`, and `spark.kryo.registrator` are not available at startup time.
+
+### Verify installation
+
+After you have started the cluster, you can verify that Sedona is correctly installed by running the following code in a notebook:
+
+```python
+spark.sql("SELECT ST_Point(1, 1)").show()
+```
+
+Note that you don't need to run `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)` in the advanced edition because `org.apache.sedona.sql.SedonaSqlExtensions` in the Cluster Config will take care of that.

docs/setup/emr.md

Lines changed: 10 additions & 0 deletions
@@ -52,3 +52,13 @@ When you create an EMR cluster, in the software configuration, add the following
 
 !!!note
     If you use Sedona 1.3.1-incubating, please use the `sedona-python-adapter-3.0_2.12` jar in the content above, instead of `sedona-spark-shaded-3.0_2.12`.
+
+## Verify installation
+
+After the cluster is created, you can verify the installation by running the following code in a Jupyter notebook:
+
+```python
+spark.sql("SELECT ST_Point(0, 0)").show()
+```
+
+Note that you don't need to run `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)` because `org.apache.sedona.sql.SedonaSqlExtensions` in the config will take care of that.

docs/tutorial/benchmark.md

Lines changed: 0 additions & 1 deletion
@@ -3,5 +3,4 @@
 We welcome people to use Sedona for benchmark purpose. To achieve the best performance or enjoy all features of Sedona,
 
 * Please always use the latest version or state the version used in your benchmark so that we can trace back to the issues.
-* Please consider using Sedona core instead of Sedona SQL. Due to the limitation of SparkSQL (for instance, not support clustered index), we are not able to expose all features to SparkSQL.
 * Please open Sedona kryo serializer to reduce the memory footprint.
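The kryo tip above corresponds to two Spark properties. A minimal spark-defaults-style sketch, with values assumed from the Sedona setup guides touched elsewhere in this commit (check the registrator class for your Sedona version):

```
spark.serializer       org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
```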

docs/tutorial/sql.md

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL API](../api/sql/Overview.
 
 ## Create Sedona config
 
-Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please skip this step and can use `spark` directly.
+Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please ==skip this step==.
 
 ==Sedona >= 1.4.1==
 
@@ -147,7 +147,7 @@ The following method has been deprecated since Sedona 1.4.1. Please use the meth
 
 ## Initiate SedonaContext
 
-Add the following line after creating Sedona config. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please call `SedonaContext.create(spark)` instead.
+Add the following line after creating Sedona config. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please call `sedona = SedonaContext.create(spark)` instead. For ==Databricks==, the situation is more complicated, please refer to [Databricks setup guide](../setup/databricks.md), but generally you don't need to create SedonaContext.
 
 ==Sedona >= 1.4.1==
 
0 commit comments
