This repo documents a process for converting the static SpatioTemporal Asset Catalog (STAC) describing the Satellogic EarthView dataset into a GeoParquet file. This dataset contains ~7.1 million high-resolution (~1m) satellite images released under a CC-BY 4.0 license and made available via AWS's Open Data Registry.
Static SpatioTemporal Asset Catalogs (STAC) are made up of a large number of JSON files containing GeoJSON Features annotated with metadata (STAC Items) that describe the images. Due to this storage structure, static STACs are difficult to query and, therefore, explore. For example, searching for all of the images that intersect an area of interest would require reading every single STAC Item--potentially gigabytes of data. Aggregating metadata (e.g., cloud cover, temporal coverage) would be similarly laborious. These use cases can also be onerous with a dynamic STAC API because they require paging over results, which can be slow and susceptible to request failures, rate-limiting, etc.
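To make that cost concrete, here is a minimal sketch of the brute-force search described above. The two Item documents and the helper names (`bboxes_intersect`, `scan_items`) are made up for illustration and are not part of this repo; the point is only that every document has to be read and parsed before a single match can be returned.

```python
import json


def bboxes_intersect(a, b):
    """Return True if two [xmin, ymin, xmax, ymax] boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def scan_items(item_json_docs, aoi_bbox):
    """Parse every Item document and keep the IDs whose bbox hits the AOI."""
    hits = []
    for doc in item_json_docs:  # in a static STAC, each doc is a separate file
        item = json.loads(doc)
        if bboxes_intersect(item["bbox"], aoi_bbox):
            hits.append(item["id"])
    return hits


# Two hypothetical, heavily trimmed Items standing in for millions of files.
docs = [
    json.dumps({"id": "item-a", "bbox": [10.0, 10.0, 10.1, 10.1]}),
    json.dumps({"id": "item-b", "bbox": [50.0, 50.0, 50.1, 50.1]}),
]
matches = scan_items(docs, aoi_bbox=[9.5, 9.5, 10.5, 10.5])
print(matches)
```

With two Items this is instant; with millions of separate JSON files, the same loop means millions of reads.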
Enter STAC GeoParquet! Packaging STAC Items in GeoParquet format makes these tasks trivial by providing a cloud-friendly, queryable representation of a STAC Collection.
- Parquet is a columnar storage file format optimized for efficient storage and retrieval (querying) of large datasets
- GeoParquet adds interoperable geospatial types
- STAC GeoParquet specifies how to map STAC Items into GeoParquet format
- STAC GeoParquet enables efficient bulk-access to large STAC Collections for analytic workflows
The EarthView STAC comprises 7.1 million JSON files in AWS S3 object storage (catalog root). I converted these files to STAC GeoParquet in three steps:
- Scrape all STAC Items in the Catalog to `ndjson` (newline-delimited JSON) file(s)
- Use the `stac-geoparquet` Python library to create a STAC GeoParquet file containing all of the Items
- Apply spatial sorting to enable efficient spatial queries and compression to reduce overall file size
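For readers unfamiliar with the intermediate format, the sketch below shows what "newline-delimited JSON" means in practice: one compact JSON document per line. The Items are hypothetical and heavily trimmed, and `write_ndjson` / `read_ndjson` are illustrative helpers, not functions from this repo's `cli.py`.

```python
import json
import tempfile
from pathlib import Path


def write_ndjson(items, path):
    """Write one compact JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, separators=(",", ":")) + "\n")


def read_ndjson(path):
    """Yield one parsed Item per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical Items; real STAC Items carry many more fields.
items = [
    {"type": "Feature", "id": "example-1", "bbox": [0, 0, 1, 1]},
    {"type": "Feature", "id": "example-2", "bbox": [1, 1, 2, 2]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "2022-07-01.ndjson"
    write_ndjson(items, path)
    round_tripped = list(read_ndjson(path))

print(len(round_tripped))
```

Because each line is an independent document, files like this can be appended to, concatenated, and streamed without loading everything into memory.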
I performed the conversion in AWS EC2 on a `t3.large` instance in the `us-west-2` region, where the Satellogic data are stored, to minimize network latency. Applying the spatial sorting and compression required significantly more memory, so I used a larger instance (`r7i.8xlarge`) for that step.
- Clone this repo:
$ git clone git@github.com:bmcandr/satellogic-earthview-stac-parquet.git
$ cd satellogic-earthview-stac-parquet
- Create a Python virtual environment and install dependencies:
$ pyenv local 3.11
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
- Install `duckdb` per instructions (only required to apply spatial sort and compression)
  - Install required extensions:
$ duckdb
D install spatial;
D install lindel;
The EarthView STAC is organized into nested Collections for year, month, and day:
.
└── 2022/
    ├── 2022-07/
    │   └── 2022-07-01/
    │       ├── 20220701_085711_SN18_36N_359407_5809328
    │       └── ...
    ├── 2022-08
    ├── 2022-09
    ├── 2022-10
    ├── 2022-11
    └── 2022-12
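Walking a nested layout like this comes down to following each catalog's `rel="child"` links, which is the standard STAC relation for sub-catalogs. The sketch below shows that one step on a hypothetical in-memory catalog with an `example.com` URI; `child_uris` is an illustrative helper and is not necessarily how this repo's `cli.py` does it.

```python
from urllib.parse import urljoin


def child_uris(catalog: dict, catalog_uri: str) -> list[str]:
    """Resolve every rel="child" link against the catalog's own URI."""
    return [
        urljoin(catalog_uri, link["href"])
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]


# Hypothetical year-level catalog with two month-level children.
root_uri = "https://example.com/stac/2022/catalog.json"
root = {
    "type": "Catalog",
    "id": "2022",
    "links": [
        {"rel": "self", "href": root_uri},
        {"rel": "child", "href": "./2022-07/catalog.json"},
        {"rel": "child", "href": "./2022-08/catalog.json"},
    ],
}

months = child_uris(root, root_uri)
print(months)
```

Applying the same function to each month-level catalog would yield the day-level catalogs, which is the level at which the Items live in the tree above.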
`cli.py` contains some Python utilities to traverse and scrape Catalogs, audit the scraped results, and convert the results to Parquet (see `python cli.py --help` for more info). Using these tools, I scraped the Catalogs to produce an `ndjson` file for each terminal Collection by running:
$ python cli.py list-catalog-children-uris https://satellogic-earthview.s3.us-west-2.amazonaws.com/stac/2022/catalog.json \
| xargs -n1 python cli.py list-catalog-children-uris \
| xargs -n1 -P5 python cli.py scrape-catalog
Note: the `scrape-catalog` command uses `aiohttp` to make asynchronous requests to read the Items and `aiofiles` to write the responses to file. Through trial and error, I found that running 5 concurrent processes via `xargs -n1 -P5` avoided intermittent request failures (rate-limiting? 🤷).
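The command above caps concurrency at the process level with `xargs -P5`. A related, in-process way to keep request pressure bounded is an `asyncio.Semaphore`. The sketch below is an illustration of that alternative, not what `scrape-catalog` does: `fetch_item` is a stand-in that performs no network I/O, and the URIs and the `max_in_flight` parameter are hypothetical.

```python
import asyncio


async def fetch_item(uri: str) -> dict:
    """Stand-in for an HTTP GET; a real scraper would call an HTTP client here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"id": uri.rsplit("/", 1)[-1]}


async def fetch_all(uris, max_in_flight=5):
    """Fetch every URI while allowing at most `max_in_flight` requests at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(uri):
        async with semaphore:  # blocks when the limit is already reached
            return await fetch_item(uri)

    # gather() returns results in the same order as the input URIs.
    return await asyncio.gather(*(bounded(u) for u in uris))


uris = [f"https://example.com/items/item-{i}.json" for i in range(20)]
results = asyncio.run(fetch_all(uris))
print(len(results), results[0])
```

Lowering `max_in_flight` trades throughput for fewer simultaneous requests, which is the same knob the `-P5` setting turns at the process level.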
After about an hour I had a pile of `ndjson` files:
.
└── data/
    ├── 2022-07-01.ndjson
    ├── ...
    └── 2022-12-31.ndjson
These files contained ~7.1 million lines and weighed in at 22GB.
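A quick sanity check on scraped output like this is to count lines, count unique Item IDs, and flag lines that fail to parse. The sketch below is a minimal version of such an audit run against throwaway files; `audit_ndjson_dir` is a hypothetical helper, not the audit command shipped in `cli.py`.

```python
import json
import tempfile
from pathlib import Path


def audit_ndjson_dir(directory):
    """Return (total_lines, unique_ids, bad_lines) across all *.ndjson files."""
    total, ids, bad = 0, set(), 0
    for path in sorted(Path(directory).glob("*.ndjson")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # ignore blank lines
                total += 1
                try:
                    ids.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    bad += 1  # malformed JSON or no usable "id"
    return total, len(ids), bad


# Throwaway files: one duplicate ID and one malformed line.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2022-07-01.ndjson").write_text('{"id": "a"}\n{"id": "b"}\n')
    Path(tmp, "2022-07-02.ndjson").write_text('{"id": "b"}\nnot json\n')
    report = audit_ndjson_dir(tmp)

print(report)
```

If the total line count and the unique ID count disagree, some Items were scraped more than once; a nonzero bad-line count points at truncated or failed writes.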
At this point, combining the `ndjson` files into a GeoParquet file is straightforward:
$ python cli.py parse-stac-ndjson-to-parquet data/*.ndjson all-items.parquet
Use `duckdb` to check that the resulting file is readable and contains all the expected data:
$ duckdb
D load spatial;
D SELECT COUNT(*) items FROM 'all-items.parquet';
┌─────────┐
│ items │
│ int64 │
├─────────┤
│ 7095985 │
└─────────┘
D DESCRIBE SELECT * FROM 'all-items.parquet';
┌─────────────────────┬────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────────────┼────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ assets │ STRUCT(analytic STRUCT("eo:bands… │ YES │ NULL │ NULL │ NULL │
│ bbox │ STRUCT(xmin DOUBLE, ymin DOUBLE,… │ YES │ NULL │ NULL │ NULL │
│ geometry │ GEOMETRY │ YES │ NULL │ NULL │ NULL │
│ id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
...
If the `column_type` of the `geometry` column is `BLOB` instead of `GEOMETRY`, make sure the `spatial` extension is installed and loaded.
This file occupied about 1.5GB of disk space.
Using `duckdb`, run `sort-and-compress.sql`:
$ duckdb
D .read sort-and-compress.sql
Remember, this may require a large amount of memory and temp directory space on disk.
This took about 10-20 minutes and produced a file named `sorted.level22.parquet`. The size? Just 275MB.
(s/o to @marklit for the `sort-and-compress.sql` script!)
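For intuition about what a spatial sort does, the sketch below orders points by a Z-order (Morton) key, one common space-filling curve. This is an illustration only: the contents of `sort-and-compress.sql` are not shown here, so the curve it uses may differ, and the point names, coordinates, and helper functions are all made up.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def morton_key(lon: float, lat: float, bits: int = 16) -> int:
    """Quantize lon/lat onto a 2^bits grid and return the Z-order key."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return interleave_bits(x, y, bits)


# Hypothetical image centroids (lon, lat), deliberately interleaved by city.
points = {
    "bcn-1": (2.17, 41.39),
    "syd-1": (151.21, -33.87),
    "bcn-2": (2.18, 41.40),
    "syd-2": (151.22, -33.86),
}
sorted_names = sorted(points, key=lambda name: morton_key(*points[name]))
print(sorted_names)  # names from the same city end up next to each other
```

When rows that are close on the ground are also close in the file, neighboring values tend to be similar, which generally helps compression, and a spatial filter can skip large stretches of the file instead of touching rows scattered throughout it.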
The spatially sorted, compressed STAC GeoParquet file containing the entirety of the Satellogic EarthView STAC resulting from this process is available on S3 at:
s3://satellogic-earthview-stac-geoparquet/satellogic-earthview-stac-items.parquet
Check out the included notebook that demonstrates how to use `duckdb`, `h3`, `geopandas`, `lonboard`, `stacrs`, and `ipyleaflet` to explore this metadata and display the images it describes without downloading a single file!
Here's a sneak peek at images rendered on a slippy map using a tile server that reads the data directly from S3:
(Images: visual | analytic: true color | analytic: ndvi)
- EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision
  @inproceedings{earthview2025,
    author={Velázquez, Diego and Rodríguez, Pau and Alonso, Sergio and Gonfaus, Josep M. and González, Jordi and Richarte, Gerardo and Marín, Javier and Bengio, Yoshua and Lacoste, Alexandre},
    booktitle={2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)},
    title={EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision},
    year={2025},
    url={https://arxiv.org/abs/2501.08111}
  }
- @marklit's blog post Satellogic's Open Satellite Feed
- The STAC, GeoParquet, and Arrow communities
Disclaimer: I am not affiliated with Satellogic and all opinions are my own.