Conversation

@isaaccorley (Collaborator) commented Dec 1, 2025

Summary

See reasoning in #3160.

This PR implements the following:

  • abstracts the spatial filtering ops in VectorDataset into a VectorDataset.filter_index method that can be overridden for new backends.
  • implements torchgeo.datasets.SedonaDBDataset
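As a rough illustration of the override pattern described above, here is a minimal sketch (with made-up index records and bounds, not the actual TorchGeo code): the base class owns filter_index, and a backend-specific subclass replaces only that one method.

```python
# Sketch of the filter_index override pattern; class/record shapes are
# illustrative stand-ins, not TorchGeo's real VectorDataset.
class VectorDatasetSketch:
    def __init__(self, index):
        # index: list of ((xmin, ymin, xmax, ymax), record) pairs
        self.index = index

    def filter_index(self, query_bounds):
        # Default backend: brute-force bounding-box intersection test.
        xmin, ymin, xmax, ymax = query_bounds
        return [
            rec
            for (bxmin, bymin, bxmax, bymax), rec in self.index
            if bxmin <= xmax and bxmax >= xmin and bymin <= ymax and bymax >= ymin
        ]


class SedonaDBDatasetSketch(VectorDatasetSketch):
    def filter_index(self, query_bounds):
        # A real SedonaDB backend would push this predicate into SQL
        # (e.g. SELECT ... WHERE ST_Intersects(...)); this fallback just
        # reuses the base implementation to keep the sketch runnable.
        return super().filter_index(query_bounds)
```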

Benchmarking

I added a benchmarking script that simply iteratively queries the filter_index method of both datasets; the results are below. The data I used is the Washington state buildings from the Microsoft Open Buildings dataset (converted to Parquet):

VectorDataset Performance:
Initialization time: 2.27 seconds
Filter time (50 slices): 105.54 seconds
Time per slice: 2.11 seconds
Total geometries found: 21,235
Geometries per second: 201.20

SedonaDBDataset Performance:
Initialization time: 2.06 seconds
Filter time (50 slices): 12.89 seconds
Time per slice: 0.26 seconds
Total geometries found: 21,235
Geometries per second: 1,647.26

Speedup: 8.19x

SedonaDBDataset is about 8.19x faster than VectorDataset for filtering operations, processing ~1,647 geometries/second vs ~201 geometries/second. Both found the same 21,235 geometries, confirming correctness.

TODO

Possibly need to implement separate SedonaDBVectorDataset and SedonaDBRasterDataset classes that can be used with IntersectionDataset and UnionDataset (could be a follow-up PR, though)

cc: @jiayuasu @rbavery @paleolimbot

@isaaccorley added this to the 1.0.0 milestone Dec 1, 2025
@isaaccorley self-assigned this Dec 1, 2025
github-actions bot added the datasets (Geospatial or benchmark datasets), testing (Continuous integration testing), and dependencies (Packaging and dependencies) labels Dec 1, 2025
Copilot AI left a comment


Pull request overview

This PR introduces SedonaDB support for vector datasets to improve performance on large geospatial datasets by replacing GeoPandas operations with SedonaDB equivalents.

Key changes:

  • Abstracts spatial filtering operations in VectorDataset into a new filter_index method that can be overridden by subclasses
  • Implements SedonaDBDataset class that uses SedonaDB for geospatial query operations
  • Adds apache-sedona[db] as an optional dependency

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • torchgeo/datasets/sedonadb.py — New SedonaDB dataset implementation with custom spatial filtering
  • torchgeo/datasets/geo.py — Refactored to extract filtering logic into filter_index method
  • torchgeo/datasets/__init__.py — Added SedonaDBDataset to exports
  • tests/datasets/test_sedonadb.py — Comprehensive test suite for SedonaDBDataset functionality
  • requirements/datasets.txt — Added apache-sedona[db] pinned to version 1.8.0
  • pyproject.toml — Added apache-sedona[db]>=1.8.0 to optional dependencies
Comments suppressed due to low confidence (1)

torchgeo/datasets/sedonadb.py:1

  • Lines 126-130 build an options dict but never use it. Line 131 passes layer=self.layer directly to read_file instead of using the options dict. Either remove the unused options dict (lines 126-130) or use it in the read_file call.
# Copyright (c) TorchGeo Contributors. All rights reserved.


@paleolimbot left a comment

Cool!

I'm new to TorchGeo so take this all with a grain of salt! The high-level suggestion is that to get the most out of SedonaDB (or any spatial db-based dataset), you may want to do something like:

self._sd.read_pyogrio(files).to_view("index_raw", overwrite=True)
self._sd.sql(f"""SELECT "{t_field}" as t, "{label_field}" as label, wkb_geometry as geometry FROM index_raw""").to_view("index", overwrite=True)
self._sd.sql("""SELECT row_number() as i, t, label, geometry FROM (SELECT * FROM index ORDER BY t)""").to_view("index", overwrite=True)

...or in other words, self._sd.view("index") is never materialized until it is requested. Of course, this is a bit of a shift in thinking for most GeoPandas users, so feel free not to do this (I can help with this if you get a basic implementation with tests that I can use to make the expectations of the implementation clear!).

SELECT * FROM index_df
WHERE datetime_start <= {t_stop_ts}
AND datetime_end >= {t_start_ts}
AND ST_Intersects(geometry, ST_SetSRID(ST_GeomFromWKT('{query_wkt}'), {epsg}))


You can use `ST_SetCrs(..., '{self.crs.to_json()}')`, which will work even if there's no EPSG code.

Comment on lines 67 to 70
sedona_db = lazy_import('sedona.db')
super().__init__(paths, crs, res, transforms, label_name, task, layer)

self._sd = sedona_db.connect()


To really get the benefit of SedonaDB here, I think you may want to use sd.read_parquet() (if paths points to 1 or more Parquet files) or sd.read_pyogrio() (otherwise). Perhaps index can be a property? (in other words, to really get a benefit here you probably need to defer the materialization of the index as long as possible).
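To illustrate the deferred-materialization point with something runnable here, stdlib SQLite can serve as a stand-in: a view stores its defining query, and rows are only produced when the view is actually selected from. (The suggestion above targets SedonaDB's to_view()/sql() API; SQLite is used purely as an analogy.)

```python
# SQLite stand-in illustrating deferred view materialization.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE index_raw (i INTEGER, t TEXT, label TEXT)")
con.executemany(
    "INSERT INTO index_raw VALUES (?, ?, ?)",
    [(2, "2020-01-02", "b"), (1, "2020-01-01", "a")],
)
# Creating the view is cheap: it stores the query, not its result.
con.execute("CREATE VIEW idx AS SELECT t, label FROM index_raw")
# Only this SELECT actually evaluates the projection and ordering.
rows = con.execute("SELECT label FROM idx ORDER BY t").fetchall()
```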

@isaaccorley (Collaborator, Author) replied:

I refactored to use this. This actually uncovered a bug in TorchGeo where we used gpd.read_file everywhere, which doesn't let us read Parquet files or directories (!!!) cc @adamjstewart

Comment on lines 117 to 119
if t.step != 1:
filtered_index_gdf = filtered_index_gdf.iloc[:: t.step]
if len(filtered_index_gdf) == 0:


Perhaps the initial creation of the index could be like:

self._sd.read_pyogrio(files).to_view("index_raw", overwrite=True)
self._sd.sql("""SELECT row_number() as i, t, geometry FROM (SELECT * FROM index_raw ORDER BY t)""")

...which would possibly let you do something like self._sd.sql("""SELECT * FROM filtered_index WHERE i % {t.step} = 0""").to_view("filtered_index", overwrite=True) here.
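A runnable stand-in for this SQL-level step filtering, again using stdlib SQLite as the analogy (assuming SedonaDB's SQL similarly supports the ROW_NUMBER() window function used in the suggestion):

```python
# Decimate every Nth row in SQL instead of materializing a DataFrame
# and slicing with .iloc[::step].
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE filtered_index (t INTEGER)")
con.executemany("INSERT INTO filtered_index VALUES (?)", [(k,) for k in range(10)])

step = 3
rows = con.execute(
    "SELECT i, t FROM ("
    "  SELECT ROW_NUMBER() OVER (ORDER BY t) - 1 AS i, t FROM filtered_index"
    ") WHERE i % ? = 0",
    (step,),
).fetchall()
# Equivalent to filtered_index_gdf.iloc[::3] in the GeoPandas code above.
```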

github-actions bot added the scripts (Training and evaluation scripts) label Dec 2, 2025
@isaaccorley (Collaborator, Author) commented:

@paleolimbot I appreciate the review for creating the index. I did some initial benchmarks and sedonadb ABSOLUTELY SLAPS at spatial intersection filtering (see PR description above) cc: @calebrob6 @adamjstewart

@isaaccorley isaaccorley marked this pull request as draft December 2, 2025 04:23
@isaaccorley (Collaborator, Author) commented:

This PR is still a WIP at the moment, but so far it's looking verrrry good. Most of the remaining work is modifying the constructor to use sedonadb to create the index.

Basically the following happens:

  • glob all the files and create an index containing the filepath, datetime, and bounds (still needs to be sedonadb-ified)
  • in __getitem__, query the index using the slice to find the files of interest, then read and filter the geometries within those files using the query (mostly done, but could probably use some optimization)

@calebrob6 (Collaborator) commented:

Would a "driver" flag in VectorDataset (that could be "geopandas" or "sedonadb") be possible? E.g., as a user, if I want to use an existing VectorDataset but want the sick gains of sedonadb, it'd be cool if I could just switch somehow.

Also, I like this benchmarking script and am curious how the old backend performs on it.
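A hypothetical sketch of what such a driver flag could look like; the names driver, _filter_geopandas, and _filter_sedonadb are illustrative, not part of TorchGeo's API, and the backend bodies are stubs.

```python
# Dispatch-on-driver sketch: one dataset class, two interchangeable
# filtering backends selected by a constructor flag.
class VectorDatasetWithDriver:
    def __init__(self, driver="geopandas"):
        if driver not in ("geopandas", "sedonadb"):
            raise ValueError(f"unknown driver: {driver!r}")
        self.driver = driver

    def filter_index(self, query):
        if self.driver == "sedonadb":
            return self._filter_sedonadb(query)
        return self._filter_geopandas(query)

    def _filter_geopandas(self, query):
        # Stub: a real backend would run a GeoPandas spatial filter here.
        return f"geopandas filtering for {query}"

    def _filter_sedonadb(self, query):
        # Stub: a real backend would issue a SedonaDB SQL query here.
        return f"sedonadb filtering for {query}"
```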

@adamjstewart (Member) commented:

Would like to continue the discussion in #3160 once Isaac returns from paternity leave, but some minor comments on the proposed implementation:

I'm not opposed to supporting multiple backends, but note:

  • This logic is not specific to VectorDataset; the same approach could be applied to all GeoDatasets
  • As mentioned in Vector Dataset Backends #3160, this is only a minor fraction of the places we use geopandas, see GeoDataset: rtree -> geopandas #2747 for a full list of locations required to truly make TorchGeo backend-agnostic
  • The benchmarking script here is cool, but note that we already have a benchmarking script for RasterDataset. We should either port this to IOBench or replace IOBench with this. It would be nice to integrate this with our torchgeo script so that it works post-installation
  • The benchmarking done here doesn't tell me much about SedonaDB's performance for insertion, intersection, union, caching, pickling, etc. It also doesn't tell me how SedonaDB scales compared to R-tree, shapely, geopandas, or STAC. I also don't know what features Sedona DB supports. We should consider adding SedonaDB to our literature review for TorchGeo 1.0 (currently private git repo, but I can give people access).

More generally, the reason we didn't consider database backends like PostGIS/DuckDB/SedonaDB is my fear that this would require users to know about these technologies, install non-Python deps, and manually set up their own databases. Also, no one suggested them when I asked around during our 6-month backend search. For reference, I tried many times to get the GeoTorchAI unit tests running on my laptop, but this didn't work as I didn't have SedonaDB installed.

If this has changed and the setup process is now much easier, we can revisit these, as I expect them to be quite performant, depending entirely on I/O speeds. However, note that not all users may have write access on the systems they run TorchGeo on, so we may not be able to switch to a file-based DB as the default.

@jiayuasu commented Dec 8, 2025

@adamjstewart Hi Adam,

Thank you again for all your contributions to TorchGeo. Since Isaac is currently on leave, I wanted to clarify a few things here.

SedonaDB and Apache Sedona

SedonaDB is a new subproject under Apache Sedona, but it is not the same as SedonaSpark, SedonaFlink, or the other distributed Sedona engines. SedonaDB is a single-machine data processing tool that requires zero installation beyond a simple pip install apache-sedona[db]. It is designed for embedded and self-contained environments. The wheel files for SedonaDB are available on PyPI at
https://pypi.org/project/sedonadb/

SedonaDB and PostGIS

SedonaDB is designed specifically for embedded and self-contained use cases. It requires zero database setup and, unlike PostGIS, no data ingestion process. It works directly on your data files, identical to the GeoPandas user experience.

PostGIS is more of a transactional database. For analytical workloads such as filtering, joins, unions, and aggregations, SedonaDB is orders of magnitude faster. Because of how large the performance gap is, we did not even include PostGIS in our SpatialBench comparison since it would be an extremely unfair comparison.

SedonaDB and GeoTorchAI

GeoTorchAI was a research prototype that relied on Apache Sedona through SedonaSpark rather than SedonaDB. I agree with you that SedonaSpark can be complicated to operate in certain environments.

SedonaDB functionalities

SedonaDB was released in September 2025 and is positioned as a GeoPandas alternative. You can find the full list of supported functions here
https://sedona.apache.org/sedonadb/latest/reference/sql/

We also conducted a comprehensive benchmark comparing SedonaDB, GeoPandas, and DuckDB using SpatialBench
https://sedona.apache.org/spatialbench/single-node-benchmarks/

SpatialBench is designed to evaluate geospatial analytical performance across different systems. It examines performance from multiple angles including individual spatial functions such as

  • filtering, intersection, and union
  • complex and heavy spatial joins
  • automatic query optimization across combined operations

As for inserting new rows, I do not believe GeoPandas currently supports this, and neither do SedonaDB or DuckDB.

Hope this helps clarify things and addresses your concerns.

@adamjstewart (Member) commented Dec 9, 2025

Thanks, this actually helps a lot. So we didn't consider SedonaDB during our search because it didn't exist at the time we did our literature review. Glad to know that it's easier to install than Apache Sedona and doesn't require any setup.

SpatialBench is actually of great interest to me. We have our own benchmarks comparing R-tree, Shapely, GeoPandas, and STAC. It would be interesting to add some of those to the comparison, although they all obviously have very different features.

I still need to think about the best way to do this. I really don't want a new SedonaDBDataset, as we would have to duplicate all 30+ existing GeoDataset subclasses to actually take advantage of this. Maybe a backend='sedonadb' parameter to GeoDataset and friends would help. Again, there are hundreds of places we use the geopandas representation, not just filtering. We don't have to replace all locations, but if you really want speedups, that may be necessary.

I'll add this to the agenda for our monthly Technical Steering Committee meeting. Not sure if we'll get to it in January or February but I'm more open to this idea now that I understand it better.

@isaaccorley (Collaborator, Author) commented Dec 9, 2025

Part of TorchGeo's API design is to use inheritance by providing base classes. For this reason, I don't think it's necessary in this PR to completely redo the backend and add it to every single dataset that inherits from GeoDataset. I was hoping to scope this to just creating an experimental SedonaDBDataset, as an optional dependency, that users can start to experiment with. As mentioned, the gains in spatial intersection are quite large, and I don't think we should delay getting these kinds of speedups into the library.

It also helps the Wherobots developers begin to consider optimizations that would be particularly useful for improving geospatial sampling in ML training workflows.

@adamjstewart (Member) commented:

Well this won't make it into a release for quite some time regardless of if we merge today or in a few months. I want to speed up the release cycle after 1.0, but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.

@isaaccorley (Collaborator, Author) commented Dec 9, 2025

Well this won't make it into a release for quite some time regardless of if we merge today or in a few months. I want to speed up the release cycle after 1.0,

We should discuss this at our monthly meeting, since it seems you are prescribing a release schedule without input from the rest of the maintainers. My opinion is that we should really increase our release frequency, particularly for certain types of features, and I imagine others would agree. Adding new UNet model weights, for example, should not take 6 months to become available on PyPI. 1.0 or not isn't really relevant to this PR.

but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.

These types of large breaking changes spanning multiple PRs should be moved to a dev or time-series branch that can be fully merged at some point in the future, rather than pushed directly to main. These features don't really have anything to do with the SedonaDBDataset in this PR, considering what's currently on main, and shouldn't hold up other features before they are merged.

My overall 2c is that features like this should get merged sooner rather than later instead of being stalled for reasons that aren't relevant to what's proposed in the PR.

