
Commit 649cafc

Merge branch 'apache:master' into master
2 parents 3e4d8e2 + fcea411

37 files changed: +1611 −41 lines changed
.github/workflows/license-templates/LICENSE.txt

Lines changed: 16 additions & 0 deletions

@@ -0,0 +1,16 @@
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.

.pre-commit-config.yaml

Lines changed: 12 additions & 5 deletions

@@ -10,15 +10,22 @@ repos:
     hooks:
       - id: identity
       - id: check-hooks-apply
+  - repo: https://github.com/Lucas-C/pre-commit-hooks
+    rev: v1.5.5
+    hooks:
+      - id: insert-license
+        name: Add license for all TOML files
+        files: \.toml$
+        args:
+          - --comment-style
+          - "|#|"
+          - --license-filepath
+          - .github/workflows/license-templates/LICENSE.txt
+          - --fuzzy-match-generates-todo
   - repo: https://github.com/psf/black-pre-commit-mirror
     rev: 24.10.0
     hooks:
       - id: black-jupyter
-  # - repo: https://github.com/pycqa/isort
-  #   rev: 5.13.2
-  #   hooks:
-  #     - id: isort
-  #       name: isort (python)
   - repo: https://github.com/pre-commit/mirrors-clang-format
     rev: v19.1.1
     hooks:
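A note on usage: with this hook in place, the new Apache license header can be back-filled onto existing TOML files by running the hook directly, e.g. `pre-commit run insert-license --all-files` (standard pre-commit CLI usage, not a command taken from this commit).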

R/vignettes/articles/apache-sedona.Rmd

Lines changed: 2 additions & 2 deletions

@@ -296,7 +296,7 @@ file in a supported geospatial format (`sedona_read_*` functions), or by extract
 Spark SQL query.

 For example, the following code will import data from
-[arealm-small.csv](https://github.com/apache/sedona/blob/master/binder/data/arealm-small.csv)
+[arealm-small.csv](https://github.com/apache/sedona/blob/master/docs/usecases/data/arealm-small.csv)
 into a `SpatialRDD`:

 ```{r}
@@ -311,7 +311,7 @@ pt_rdd <- sedona_read_dsv_to_typed_rdd(
 ```

 Records from the example
-[arealm-small.csv](https://github.com/apache/sedona/blob/master/binder/data/arealm-small.csv)
+[arealm-small.csv](https://github.com/apache/sedona/blob/master/docs/usecases/data/arealm-small.csv)
 file look like the following:

     testattribute0,-88.331492,32.324142,testattribute1,testattribute2

README.md

Lines changed: 9 additions & 7 deletions

@@ -34,25 +34,26 @@ Join the Sedona monthly community office hour: [Google Calendar](https://calenda

 ## What is Apache Sedona?

-Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as Apache Spark and Apache Flink. Sedona developers can express their spatial data processing tasks in Spatial SQL, Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.
+Apache Sedona™ is a spatial computing engine that enables developers to easily process spatial data at any scale within modern cluster computing systems such as [Apache Spark](https://spark.apache.org/) and [Apache Flink](https://flink.apache.org/).
+Sedona developers can express their spatial data processing tasks in [Spatial SQL](https://carto.com/spatial-sql), Spatial Python or Spatial R. Internally, Sedona provides spatial data loading, indexing, partitioning, and query processing/optimization functionality that enable users to efficiently analyze spatial data at any scale.

 ![Sedona Ecosystem](docs/image/sedona-ecosystem.png "Sedona Ecosystem")

 ### Features

 Some of the key features of Apache Sedona include:

-* Support for a wide range of geospatial data formats, including GeoJSON, WKT, and ESRI Shapefile.
+* Support for a wide range of geospatial data formats, including [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON), [WKT](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry), and [ESRI](https://www.esri.com) [Shapefile](https://en.wikipedia.org/wiki/Shapefile).
 * Scalable distributed processing of large vector and raster datasets.
 * Tools for spatial indexing, spatial querying, and spatial join operations.
-* Integration with popular geospatial python tools such as GeoPandas.
-* Integration with popular big data tools, such as Spark, Hadoop, Hive, and Flink for data storage and querying.
-* A user-friendly API for working with geospatial data in the SQL, Python, Scala and Java languages.
+* Integration with popular geospatial Python tools such as [GeoPandas](https://geopandas.org).
+* Integration with popular big data tools, such as Spark, [Hadoop](https://hadoop.apache.org/), [Hive](https://hive.apache.org/), and Flink for data storage and querying.
+* A user-friendly API for working with geospatial data in the [SQL](https://en.wikipedia.org/wiki/SQL), [Python](https://www.python.org/), [Scala](https://www.scala-lang.org/) and [Java](https://www.java.com) languages.
 * Flexible deployment options, including standalone, local, and cluster modes.

 These are some of the key features of Apache Sedona, but it may offer additional capabilities depending on the specific version and configuration.

-Click [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/apache/sedona/HEAD?filepath=docs/usecases) and play the interactive Sedona Python Jupyter Notebook immediately!
+Click [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/apache/sedona/HEAD?filepath=docs/usecases) and play the interactive Sedona Python [Jupyter](https://jupyter.org/) Notebook immediately!

 ## When to use Sedona?

@@ -150,5 +151,6 @@ Please visit [Apache Sedona website](http://sedona.apache.org/) for detailed inf
 ## Powered by

 <a href="https://www.apache.org/">
-  <img alt="The Apache Software Foundation" src="https://www.apache.org/foundation/press/kit/asf_logo_wide.png" width="500" class="center">
+  <img alt="The Apache Software Foundation" class="center" src="https://www.apache.org/foundation/press/kit/asf_logo_wide.png"
+       title="The Apache Software Foundation" width="500">
 </a>

docker/sedona-spark-jupyterlab/requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 attrs
 descartes
-fiona==1.8.22
+fiona==1.10.1
 geopandas==0.14.4
 ipykernel
 ipywidgets

docker/sedona-spark-jupyterlab/sedona-jupyterlab.dockerfile

Lines changed: 2 additions & 4 deletions

@@ -19,7 +19,6 @@ FROM ubuntu:22.04

 ARG shared_workspace=/opt/workspace
 ARG spark_version=3.4.1
-ARG hadoop_version=3
 ARG hadoop_s3_version=3.3.4
 ARG aws_sdk_version=1.12.402
 ARG spark_xml_version=0.16.0

@@ -29,8 +28,7 @@ ARG spark_extension_version=2.11.0

 # Set up envs
 ENV SHARED_WORKSPACE=${shared_workspace}
-ENV SPARK_HOME /opt/spark
-RUN mkdir ${SPARK_HOME}
+ENV SPARK_HOME /usr/local/lib/python3.10/dist-packages/pyspark
 ENV SEDONA_HOME /opt/sedona
 RUN mkdir ${SEDONA_HOME}

@@ -44,7 +42,7 @@ COPY ./ ${SEDONA_HOME}/

 RUN chmod +x ${SEDONA_HOME}/docker/spark.sh
 RUN chmod +x ${SEDONA_HOME}/docker/sedona.sh
-RUN ${SEDONA_HOME}/docker/spark.sh ${spark_version} ${hadoop_version} ${hadoop_s3_version} ${aws_sdk_version} ${spark_xml_version}
+RUN ${SEDONA_HOME}/docker/spark.sh ${spark_version} ${hadoop_s3_version} ${aws_sdk_version} ${spark_xml_version}

 # Install Python dependencies
 COPY docker/sedona-spark-jupyterlab/requirements.txt /opt/requirements.txt

docker/spark.sh

Lines changed: 3 additions & 10 deletions

@@ -19,20 +19,16 @@ set -e

 # Define variables
 spark_version=$1
-hadoop_version=$2
-hadoop_s3_version=$3
-aws_sdk_version=$4
-spark_xml_version=$5
+hadoop_s3_version=$2
+aws_sdk_version=$3
+spark_xml_version=$4

 # Set up OS libraries
 apt-get update
 apt-get install -y openjdk-19-jdk-headless curl python3-pip maven
 pip3 install --upgrade pip && pip3 install pipenv

 # Download Spark jar and set up PySpark
-curl https://archive.apache.org/dist/spark/spark-"${spark_version}"/spark-"${spark_version}"-bin-hadoop"${hadoop_version}".tgz -o spark.tgz
-tar -xf spark.tgz && mv spark-"${spark_version}"-bin-hadoop"${hadoop_version}"/* "${SPARK_HOME}"/
-rm spark.tgz && rm -rf spark-"${spark_version}"-bin-hadoop"${hadoop_version}"
 pip3 install pyspark=="${spark_version}"

 # Add S3 jars
@@ -42,9 +38,6 @@ curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/"${aws_sdk

 # Add spark-xml jar
 curl https://repo1.maven.org/maven2/com/databricks/spark-xml_2.12/"${spark_xml_version}"/spark-xml_2.12-"${spark_xml_version}".jar -o "${SPARK_HOME}"/jars/spark-xml_2.12-"${spark_xml_version}".jar

-# Set up master IP address and executor memory
-cp "${SPARK_HOME}"/conf/spark-defaults.conf.template "${SPARK_HOME}"/conf/spark-defaults.conf
-
 # Install required libraries for GeoPandas on Apple chip mac
 apt-get install -y gdal-bin libgdal-dev

docs/api/stats/sql.md

Lines changed: 67 additions & 0 deletions
@@ -49,3 +49,70 @@ names in parentheses are python variable names
 - geometry - name of the geometry column
 - handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
 - useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+The output is the input DataFrame with the lof added to each row.
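As a point of reference for the LOF documentation added above, a minimal Python sketch of computing outlier scores; the `sedona.stats.outlier_detection.local_outlier_factor` import path and its `k` parameter are assumptions based on the surrounding naming conventions, not shown in this commit.

```python
# Minimal sketch (assumed API) of Sedona's LOF helper from Python.
from sedona.spark import SedonaContext
from sedona.stats.outlier_detection import local_outlier_factor  # assumed module path

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Tiny demo frame with a single geometry column.
df = sedona.createDataFrame(
    [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (5.0, 5.0)], ["lon", "lat"]
).selectExpr("ST_Point(lon, lat) AS geometry")

# Score each point against its 2 nearest neighbors; use_spheroid as documented above.
result = local_outlier_factor(df, k=2, use_spheroid=False)
result.show()
```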
+
+## Using Getis-Ord Gi(*)
+
+The G Local function is provided at `org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal` in Scala/Java and `sedona.stats.hotspot_detection.getis_ord.g_local` in Python.
+
+It computes the Gi or Gi* statistic on the x column of the dataframe.
+
+Weights should be the neighbors of this row. The members of the weights column should be structs containing a value column and a neighbor column. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). Reference the _Using the Distance Weighting Function_ header for instructions on generating this column. To calculate the Gi* statistic, ensure the focal observation is in the neighbors array (i.e. the row is in the weights column) and set `star=true`. Significance is calculated with a z score.
+
+### Parameters
+
+- dataframe - the dataframe to perform the G statistic on
+- x - the column name we want to perform hotspot analysis on
+- weights - the column name containing the neighbors array. The neighbor column should be the contents of the neighbors with the same types as the parent row (minus neighbors). You can use the `Weighting` class functions to achieve this.
+- star - whether the focal observation is in the neighbors array. If true this calculates Gi*, otherwise Gi
+
+The output is the input DataFrame with the following columns added: G, E[G], V[G], Z, P.
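To make this flow concrete, here is a short Python sketch that builds a binary distance band weights column and runs `g_local`; the `add_binary_distance_band_column` name and both call signatures are inferred from the names documented in this file, so treat them as assumptions.

```python
# Minimal sketch (assumed signatures): Gi* hotspot detection on a numeric column.
from sedona.spark import SedonaContext
from sedona.stats.hotspot_detection.getis_ord import g_local
from sedona.stats.weighting import add_binary_distance_band_column  # assumed python name

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Demo frame: one geometry column plus the value column to analyze.
df = sedona.createDataFrame(
    [(1.0, 1.0, 10.0), (1.1, 1.0, 12.0), (1.0, 1.1, 11.0), (5.0, 5.0, 3.0)],
    ["lon", "lat", "price"],
).selectExpr("ST_Point(lon, lat) AS geometry", "price")

# include_self puts the focal observation in the weights array -> Gi* (star=True).
weighted = add_binary_distance_band_column(df, threshold=0.5, include_self=True)

hotspots = g_local(weighted, x="price", weights="weights", star=True)
hotspots.select("price", "G", "Z", "P").show()
```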
+
+## Using the Distance Weighting Function
+
+The Weighting functions are provided at `org.apache.sedona.stats.Weighting` in Scala/Java and `sedona.stats.weighting` in Python.
+
+The functions generate a column containing an array of structs containing a value column and a neighbor column.
+
+The generic `addDistanceBandColumn` (`add_distance_band_column` in Python) function annotates a dataframe with a weights column containing the other records within the threshold and their weight.
+
+The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one geometry column is present it will be used automatically. If two are present, the one named 'geometry' will be used. If more than one are present and none is named 'geometry', the column name must be provided. The new column will be named 'weights'.
+
+### Parameters
+
+#### addDistanceBandColumn
+
+Names in parentheses are Python variable names.
+
+- dataframe - DataFrame with geometry column
+- threshold - distance threshold for considering neighbors
+- binary - whether to use binary weights or inverse distance weights for neighbors (dist^alpha)
+- alpha - alpha to use for inverse distance weights; ignored when binary is true
+- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance. If 0 distance neighbors are included and binary is false, values are infinity as per the floating point spec (divide by 0)
+- includeSelf (include_self) - whether to include self in the list of neighbors
+- selfWeight (self_weight) - the value to use for the self weight
+- geometry - name of the geometry column
+- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+#### addBinaryDistanceBandColumn
+
+Names in parentheses are Python variable names.
+
+- dataframe - DataFrame with geometry column
+- threshold - distance threshold for considering neighbors
+- includeZeroDistanceNeighbors (include_zero_distance_neighbors) - whether to include neighbors that are 0 distance
+- includeSelf (include_self) - whether to include self in the list of neighbors
+- selfWeight (self_weight) - the value to use for the self weight
+- geometry - name of the geometry column
+- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+In both cases the output is the input DataFrame with the weights column added to each row.
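And a companion sketch of the generic inverse-distance variant, again assuming the documented Python names map onto this signature.

```python
# Minimal sketch (assumed signature): inverse distance weights (dist^alpha) in a band.
from sedona.stats.weighting import add_distance_band_column

# Reusing the demo frame df from the Gi* sketch above.
weighted = add_distance_band_column(
    df,                # DataFrame with one geometry column; rows must be unique
    threshold=0.5,     # only records within this distance become neighbors
    binary=False,      # inverse distance weights instead of 0/1 weights
    alpha=-1.0,        # weight = dist^alpha
    include_self=False,
    use_spheroid=False,
)
weighted.select("weights").show(truncate=False)
```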

docs/community/contributor.md

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ The PMC regularly adds new committers from the active contributors, based on the

 * Sustained contributions to Sedona: Committers should have a history of major contributions to Sedona.
 * Quality of contributions: Committers more than any other community member should submit simple, well-tested, and well-designed patches. In addition, they should show sufficient expertise to be able to review patches.
-* Community involvement: Committers should have a constructive and friendly attitude in all community interactions. They should also be active on the dev mailing list & Gitter, and help mentor newer contributors and users.
+* Community involvement: Committers should have a constructive and friendly attitude in all community interactions. They should also be active on the dev mailing list & Discord, and help mentor newer contributors and users.

 The PMC also adds new PMC members. PMC members are expected to carry out PMC responsibilities as described in Apache Guidance, including helping vote on releases, enforce Apache project trademarks, take responsibility for legal and license issues, and ensure the project follows Apache project mechanics. The PMC periodically adds committers to the PMC who have shown they understand and can help with these activities.
