Conversation

@isaaccorley (Collaborator) commented Dec 1, 2025

Summary

See reasoning in #3160.

This PR implements the following:

  • abstracts the spatial filtering ops in VectorDataset into a VectorDataset.filter_index method that can be overridden for new backends.
  • implements torchgeo.datasets.SedonaDBDataset
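As a rough illustration of the override pattern described above, here is a minimal sketch (with made-up index records and bounds, not the actual TorchGeo code): the base class owns filter_index, and a backend-specific subclass replaces only that one method.

```python
# Sketch of the filter_index override pattern; class/record shapes are
# illustrative stand-ins, not TorchGeo's real VectorDataset.
class VectorDatasetSketch:
    def __init__(self, index):
        # index: list of ((xmin, ymin, xmax, ymax), record) pairs
        self.index = index

    def filter_index(self, query_bounds):
        # Default backend: brute-force bounding-box intersection test.
        xmin, ymin, xmax, ymax = query_bounds
        return [
            rec
            for (bxmin, bymin, bxmax, bymax), rec in self.index
            if bxmin <= xmax and bxmax >= xmin and bymin <= ymax and bymax >= ymin
        ]


class SedonaDBDatasetSketch(VectorDatasetSketch):
    def filter_index(self, query_bounds):
        # A real SedonaDB backend would push this predicate into SQL
        # (e.g. SELECT ... WHERE ST_Intersects(...)); this fallback just
        # reuses the base implementation to keep the sketch runnable.
        return super().filter_index(query_bounds)
```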

Benchmarking

I added a benchmarking script that simply iteratively queries the filter_index method of both datasets; the results are below. The data I used is the Washington state buildings from the Microsoft Open Buildings dataset (converted to Parquet):

VectorDataset Performance:
Initialization time: 2.27 seconds
Filter time (50 slices): 105.54 seconds
Time per slice: 2.11 seconds
Total geometries found: 21,235
Geometries per second: 201.20

SedonaDBDataset Performance:
Initialization time: 2.06 seconds
Filter time (50 slices): 12.89 seconds
Time per slice: 0.26 seconds
Total geometries found: 21,235
Geometries per second: 1,647.26

Speedup: 8.19x

SedonaDBDataset is about 8.19x faster than VectorDataset for filtering operations, processing ~1,647 geometries/second vs ~201 geometries/second. Both found the same 21,235 geometries, confirming correctness.

TODO

Possibly need to implement separate SedonaDBVectorDataset and SedonaDBRasterDataset classes that can be used with IntersectionDataset and UnionDataset (could be a follow-up PR, though)

cc: @jiayuasu @rbavery @paleolimbot

@isaaccorley added this to the 1.0.0 milestone Dec 1, 2025
@isaaccorley self-assigned this Dec 1, 2025
github-actions bot added the datasets (Geospatial or benchmark datasets), testing (Continuous integration testing), and dependencies (Packaging and dependencies) labels Dec 1, 2025
Copilot AI left a comment


Pull request overview

This PR introduces SedonaDB support for vector datasets to improve performance on large geospatial datasets by replacing GeoPandas operations with SedonaDB equivalents.

Key changes:

  • Abstracts spatial filtering operations in VectorDataset into a new filter_index method that can be overridden by subclasses
  • Implements SedonaDBDataset class that uses SedonaDB for geospatial query operations
  • Adds apache-sedona[db] as an optional dependency

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • torchgeo/datasets/sedonadb.py — New SedonaDB dataset implementation with custom spatial filtering
  • torchgeo/datasets/geo.py — Refactored to extract filtering logic into filter_index method
  • torchgeo/datasets/__init__.py — Added SedonaDBDataset to exports
  • tests/datasets/test_sedonadb.py — Comprehensive test suite for SedonaDBDataset functionality
  • requirements/datasets.txt — Added apache-sedona[db] pinned to version 1.8.0
  • pyproject.toml — Added apache-sedona[db]>=1.8.0 to optional dependencies
Comments suppressed due to low confidence (1)

torchgeo/datasets/sedonadb.py:1

  • Lines 126-130 build an options dict but never use it. Line 131 passes layer=self.layer directly to read_file instead of using the options dict. Either remove the unused options dict (lines 126-130) or use it in the read_file call.
# Copyright (c) TorchGeo Contributors. All rights reserved.


@paleolimbot left a comment

Cool!

I'm new to TorchGeo so take this all with a grain of salt! The high-level suggestion is that to get the most out of SedonaDB (or any spatial db-based dataset), you may want to do something like:

self._sd.read_pyogrio(files).to_view("index_raw", overwrite=True)
self._sd.sql(f"""SELECT "{t_field}" as t, "{label_field}" as label, wkb_geometry as geometry FROM index_raw""").to_view("index", overwrite=True)
self._sd.sql("""SELECT row_number() as i, t, label, geometry FROM (SELECT * FROM index ORDER BY t)""").to_view("index", overwrite=True)

...or in other words, self._sd.view("index") is never materialized until it is requested. Of course, this is a bit of a shift in thinking for most GeoPandas users, so feel free not to do this (I can help with this if you get a basic implementation with tests that I can use to make the expectations of the implementation clear!).

SELECT * FROM index_df
WHERE datetime_start <= {t_stop_ts}
AND datetime_end >= {t_start_ts}
AND ST_Intersects(geometry, ST_SetSRID(ST_GeomFromWKT('{query_wkt}'), {epsg}))


You can use `ST_SetCrs(..., '{self.crs.to_json()}')`, which will work even if there's no EPSG code.

Comment on lines 67 to 70
sedona_db = lazy_import('sedona.db')
super().__init__(paths, crs, res, transforms, label_name, task, layer)

self._sd = sedona_db.connect()


To really get the benefit of SedonaDB here, I think you may want to use sd.read_parquet() (if paths points to 1 or more Parquet files) or sd.read_pyogrio() (otherwise). Perhaps index can be a property? (in other words, to really get a benefit here you probably need to defer the materialization of the index as long as possible).
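To illustrate the deferred-materialization point with something runnable here, stdlib SQLite can serve as a stand-in: a view stores its defining query, and rows are only produced when the view is actually selected from. (The suggestion above targets SedonaDB's to_view()/sql() API; SQLite is used purely as an analogy.)

```python
# SQLite stand-in illustrating deferred view materialization.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE index_raw (i INTEGER, t TEXT, label TEXT)")
con.executemany(
    "INSERT INTO index_raw VALUES (?, ?, ?)",
    [(2, "2020-01-02", "b"), (1, "2020-01-01", "a")],
)
# Creating the view is cheap: it stores the query, not its result.
con.execute("CREATE VIEW idx AS SELECT t, label FROM index_raw")
# Only this SELECT actually evaluates the projection and ordering.
rows = con.execute("SELECT label FROM idx ORDER BY t").fetchall()
```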

@isaaccorley (Collaborator, Author) replied:

I refactored to use this. This actually uncovered a bug in TorchGeo where we used gpd.read_file everywhere, which doesn't let us read Parquet files or directories (!!!) cc @adamjstewart

Comment on lines 117 to 119
if t.step != 1:
filtered_index_gdf = filtered_index_gdf.iloc[:: t.step]
if len(filtered_index_gdf) == 0:


Perhaps the initial creation of the index could be like:

self._sd.read_pyogrio(files).to_view("index_raw", overwrite=True)
self._sd.sql("""SELECT row_number() as i, t, geometry FROM (SELECT * FROM index_raw ORDER BY t)""")

...which would possibly let you do something like self._sd.sql("""SELECT * FROM filtered_index WHERE i % {t.step} = 0""").to_view("filtered_index", overwrite=True) here.
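A runnable stand-in for this SQL-level step filtering, again using stdlib SQLite as the analogy (assuming SedonaDB's SQL similarly supports the ROW_NUMBER() window function used in the suggestion):

```python
# Decimate every Nth row in SQL instead of materializing a DataFrame
# and slicing with .iloc[::step].
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE filtered_index (t INTEGER)")
con.executemany("INSERT INTO filtered_index VALUES (?)", [(k,) for k in range(10)])

step = 3
rows = con.execute(
    "SELECT i, t FROM ("
    "  SELECT ROW_NUMBER() OVER (ORDER BY t) - 1 AS i, t FROM filtered_index"
    ") WHERE i % ? = 0",
    (step,),
).fetchall()
# Equivalent to filtered_index_gdf.iloc[::3] in the GeoPandas code above.
```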

github-actions bot added the scripts (Training and evaluation scripts) label Dec 2, 2025
@isaaccorley (Collaborator, Author) commented:

@paleolimbot I appreciate the review for creating the index. I did some initial benchmarks and sedonadb ABSOLUTELY SLAPS at spatial intersection filtering (see PR description above) cc: @calebrob6 @adamjstewart

@isaaccorley isaaccorley marked this pull request as draft December 2, 2025 04:23
@isaaccorley (Collaborator, Author) commented:

This PR is still a WIP at the moment, but so far it's looking verrrry good. Most of the remaining work is modifying the constructor to use sedonadb to create the index.

Basically the following happens:

  • glob all the files and create an index containing the filepath, datetime, and bounds (still needs to be sedonadb-ified)
  • in __getitem__, query the index using the slice to find the files of interest, then read and filter the geometries within those files using the query (mostly done, but could probably use some optimization)

@calebrob6 (Collaborator) commented:

Would a "driver" flag in VectorDataset (that could be "geopandas" or "sedonadb") be possible? E.g., as a user, if I want to use an existing VectorDataset but want the sick gains of sedonadb, it'd be cool if I could just switch somehow.

Also, I like this benchmarking script and am curious how the old backend performs on it.
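A hypothetical sketch of what such a driver flag could look like; the names driver, _filter_geopandas, and _filter_sedonadb are illustrative, not part of TorchGeo's API, and the backend bodies are stubs.

```python
# Dispatch-on-driver sketch: one dataset class, two interchangeable
# filtering backends selected by a constructor flag.
class VectorDatasetWithDriver:
    def __init__(self, driver="geopandas"):
        if driver not in ("geopandas", "sedonadb"):
            raise ValueError(f"unknown driver: {driver!r}")
        self.driver = driver

    def filter_index(self, query):
        if self.driver == "sedonadb":
            return self._filter_sedonadb(query)
        return self._filter_geopandas(query)

    def _filter_geopandas(self, query):
        # Stub: a real backend would run a GeoPandas spatial filter here.
        return f"geopandas filtering for {query}"

    def _filter_sedonadb(self, query):
        # Stub: a real backend would issue a SedonaDB SQL query here.
        return f"sedonadb filtering for {query}"
```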

@adamjstewart (Member) commented:

Would like to continue the discussion in #3160 once Isaac returns from paternity leave, but some minor comments on the proposed implementation:

I'm not opposed to supporting multiple backends, but note:

  • This logic is not specific to VectorDataset; the same approach could be applied to all GeoDatasets
  • As mentioned in Vector Dataset Backends #3160, this is only a minor fraction of the places we use geopandas, see GeoDataset: rtree -> geopandas #2747 for a full list of locations required to truly make TorchGeo backend-agnostic
  • The benchmarking script here is cool, but note that we already have a benchmarking script for RasterDataset. We should either port this to IOBench or replace IOBench with this. It would be nice to integrate this with our torchgeo script so that it works post-installation
  • The benchmarking done here doesn't tell me much about SedonaDB's performance for insertion, intersection, union, caching, pickling, etc. It also doesn't tell me how SedonaDB scales compared to R-tree, shapely, geopandas, or STAC. I also don't know what features Sedona DB supports. We should consider adding SedonaDB to our literature review for TorchGeo 1.0 (currently private git repo, but I can give people access).

More generally, the reason we didn't consider database backends like PostGIS/DuckDB/SedonaDB is my fear that this would require users to know about these technologies, install non-Python deps, and manually set up their own databases. Also, no one suggested them when I asked around during our 6-month backend search. For reference, I tried many times to get the GeoTorchAI unit tests running on my laptop, but this didn't work as I didn't have SedonaDB installed.

If this has changed and the setup process is now much easier, we can revisit these, as I expect them to be quite performant, depending entirely on I/O speeds. However, note that not all users may have write access on the systems they run TorchGeo on, so we may not be able to switch to a file-based DB as the default.

@jiayuasu commented Dec 8, 2025

@adamjstewart Hi Adam,

Thank you again for all your contributions to TorchGeo. Since Isaac is currently on leave, I wanted to clarify a few things here.

SedonaDB and Apache Sedona

SedonaDB is a new subproject under Apache Sedona, but it is not the same as SedonaSpark, SedonaFlink, or the other distributed Sedona engines. SedonaDB is a single-machine data processing tool that requires zero installation beyond a simple pip install apache-sedona[db]. It is designed for embedded and self-contained environments. The wheel files for SedonaDB are available on PyPI at
https://pypi.org/project/sedonadb/

SedonaDB and PostGIS

SedonaDB is designed specifically for embedded and self-contained use cases. It requires zero database setup and, unlike PostGIS, no data ingestion process. It works directly on your data files, identical to the GeoPandas user experience.

PostGIS is more of a transactional database. For analytical workloads such as filtering, joins, unions, and aggregations, SedonaDB is orders of magnitude faster. Because of how large the performance gap is, we did not even include PostGIS in our SpatialBench comparison since it would be an extremely unfair comparison.

SedonaDB and GeoTorchAI

GeoTorchAI was a research prototype that relied on Apache Sedona through SedonaSpark rather than SedonaDB. I agree with you that SedonaSpark can be complicated to operate in certain environments.

SedonaDB functionalities

SedonaDB was released in September 2025 and is positioned as a GeoPandas alternative. You can find the full list of supported functions here
https://sedona.apache.org/sedonadb/latest/reference/sql/

We also conducted a comprehensive benchmark comparing SedonaDB, GeoPandas, and DuckDB using SpatialBench
https://sedona.apache.org/spatialbench/single-node-benchmarks/

SpatialBench is designed to evaluate geospatial analytical performance across different systems. It examines performance from multiple angles including individual spatial functions such as

  • filtering, intersection, and union
  • complex and heavy spatial joins
  • automatic query optimization across combined operations

As for inserting new rows, I do not believe GeoPandas currently supports this, and neither do SedonaDB or DuckDB.

Hope this helps clarify things and addresses your concerns.

@adamjstewart (Member) commented Dec 9, 2025

Thanks, this actually helps a lot. So we didn't consider SedonaDB during our search because it didn't exist at the time we did our literature review. Glad to know that it's easier to install than Apache Sedona and doesn't require any setup.

SpatialBench is actually of great interest to me. We have our own benchmarks comparing R-tree, Shapely, GeoPandas, and STAC. It would be interesting to add some of those to the comparison, although they all obviously have very different features.

I still need to think about the best way to do this. I really don't want a new SedonaDBDataset, as we would have to duplicate all 30+ existing GeoDataset subclasses to actually take advantage of this. Maybe a backend='sedonadb' parameter to GeoDataset and friends would help. Again, there are hundreds of places we use the geopandas representation, not just filtering. We don't have to replace all locations, but if you really want speedups, that may be necessary.

I'll add this to the agenda for our monthly Technical Steering Committee meeting. Not sure if we'll get to it in January or February but I'm more open to this idea now that I understand it better.

@isaaccorley (Collaborator, Author) commented Dec 9, 2025

Part of TorchGeo's API design is to use inheritance by providing base classes. For this reason, I don't think it's necessary in this PR to completely redo the backend and add it to every single dataset that inherits from GeoDataset. I was hoping to scope this to just creating an experimental SedonaDBDataset, as an optional dependency, that users can start to experiment with. As mentioned, the gains in spatial intersection are quite large, and I don't think we should delay getting these kinds of speedups into the library.

It also helps the Wherobots developers begin to consider optimizations that would be particularly useful for improving geospatial sampling in ML training workflows.

@adamjstewart (Member) commented:

Well this won't make it into a release for quite some time regardless of if we merge today or in a few months. I want to speed up the release cycle after 1.0, but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.

@isaaccorley (Collaborator, Author) commented Dec 9, 2025

Well this won't make it into a release for quite some time regardless of if we merge today or in a few months. I want to speed up the release cycle after 1.0,

We should discuss this at our monthly meeting, since it seems you are prescribing a release schedule without input from the rest of the maintainers. My opinion is that we should really increase our release frequency, particularly for certain types of features, and I imagine others would agree. Adding new UNet model weights, for example, should not take 6 months to become available on PyPI. 1.0 or not isn't really relevant to this PR.

but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.

These types of large breaking changes spanning multiple PRs should be moved to a dev or time-series branch that can be fully merged at some point in the future, rather than pushed directly to main. These features don't really have anything to do with the SedonaDBDataset in this PR, considering what's currently on main, and shouldn't hold up other features before they are merged.

My overall 2c is that features like this should get merged sooner rather than later instead of being stalled for reasons that aren't relevant to what's proposed in the PR.

