SedonaDB Dataset #3161
Conversation
Pull request overview
This PR introduces SedonaDB support for vector datasets to improve performance on large geospatial datasets by replacing GeoPandas operations with SedonaDB equivalents.
Key changes:
- Abstracts spatial filtering operations in `VectorDataset` into a new `filter_index` method that can be overridden by subclasses
- Implements a `SedonaDBDataset` class that uses SedonaDB for geospatial query operations
- Adds `apache-sedona[db]` as an optional dependency
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| torchgeo/datasets/sedonadb.py | New SedonaDB dataset implementation with custom spatial filtering |
| torchgeo/datasets/geo.py | Refactored to extract filtering logic into filter_index method |
| torchgeo/datasets/__init__.py | Added SedonaDBDataset to exports |
| tests/datasets/test_sedonadb.py | Comprehensive test suite for SedonaDBDataset functionality |
| requirements/datasets.txt | Added apache-sedona[db] pinned to version 1.8.0 |
| pyproject.toml | Added apache-sedona[db]>=1.8.0 to optional dependencies |
Comments suppressed due to low confidence (1)
torchgeo/datasets/sedonadb.py:1
- Lines 126-130 build an `options` dict but never use it. Line 131 passes `layer=self.layer` directly to `read_file` instead of using the `options` dict. Either remove the unused `options` dict (lines 126-130) or use it in the `read_file` call.
# Copyright (c) TorchGeo Contributors. All rights reserved.
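A minimal sketch of the second option (passing the collected options through), assuming the hypothetical `options`/`read_file` names from the comment above rather than the actual PR code:

```python
# Illustrative only: collect optional keyword arguments and forward them to
# read_file instead of leaving the dict unused. Names here are assumptions.
import geopandas as gpd


def _read_index(filepath: str, layer: str | None = None) -> gpd.GeoDataFrame:
    options = {}
    if layer is not None:
        options['layer'] = layer
    return gpd.read_file(filepath, **options)
```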
paleolimbot left a comment:
Cool!
I'm new to TorchGeo so take this all with a grain of salt! The high-level suggestion is that to get the most out of SedonaDB (or any spatial db-based dataset), you may want to do something like:
```python
self._sd.read_pyogrio(files).to_view("index_raw", overwrite=True)
self._sd.sql("""SELECT "{t_field}" as t, "{label_field}" as label, wkb_geometry as geometry FROM index_raw""").to_view("index", overwrite=True)
self._sd.sql("""SELECT row_number() as i, t, label, geometry FROM (SELECT * FROM index ORDER BY t)""").to_view("index", overwrite=True)
```

...or in other words, `self._sd.view("index")` is never materialized until it is requested. Of course, this is a bit of a shift in thinking for most GeoPandas users, so feel free to not do this (I can help with this if you get a basic implementation with tests that I can use to make the expectations of the implementation clear!).
torchgeo/datasets/sedonadb.py (outdated diff):
```sql
SELECT * FROM index_df
WHERE datetime_start <= {t_stop_ts}
  AND datetime_end >= {t_start_ts}
  AND ST_Intersects(geometry, ST_SetSRID(ST_GeomFromWKT('{query_wkt}'), {epsg}))
```
You can use `ST_SetCrs(..., '{self.crs.to_json()}')`, which will work even if there's no EPSG code.
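A rough sketch of what the filter query above might look like with that change, written as a hypothetical helper method; the table and column names mirror the diff, and the `ST_SetCrs` argument order is taken from the suggestion rather than checked against the SedonaDB docs:

```python
# Illustrative sketch: tag the query geometry with the dataset CRS as PROJJSON
# via ST_SetCrs instead of requiring an EPSG code. Method name is hypothetical.
def _build_filter_query(self, query_wkt: str, t_start_ts: float, t_stop_ts: float) -> str:
    return f"""
        SELECT * FROM index_df
        WHERE datetime_start <= {t_stop_ts}
          AND datetime_end >= {t_start_ts}
          AND ST_Intersects(
                geometry,
                ST_SetCrs(ST_GeomFromWKT('{query_wkt}'), '{self.crs.to_json()}')
              )
    """
```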
torchgeo/datasets/sedonadb.py (outdated diff):
```python
sedona_db = lazy_import('sedona.db')
super().__init__(paths, crs, res, transforms, label_name, task, layer)

self._sd = sedona_db.connect()
```
To really get the benefit of SedonaDB here, I think you may want to use `sd.read_parquet()` (if `paths` points to one or more Parquet files) or `sd.read_pyogrio()` (otherwise). Perhaps `index` can be a property? (In other words, to really get a benefit here you probably need to defer the materialization of the index as long as possible.)
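A minimal sketch of that deferred index, using only the `connect`/`read_parquet`/`read_pyogrio`/`to_view`/`view` calls mentioned in this thread; whether these accept a list of paths, and the class shape itself, are assumptions rather than the PR's actual implementation:

```python
from functools import cached_property


class SedonaDBDataset:  # simplified stand-in for the real class
    def __init__(self, sd, files: list[str]) -> None:
        self._sd = sd
        self.files = files

    @cached_property
    def index(self):
        # Register the source files as a view; nothing is materialized yet.
        if all(str(f).endswith('.parquet') for f in self.files):
            self._sd.read_parquet(self.files).to_view('index', overwrite=True)
        else:
            self._sd.read_pyogrio(self.files).to_view('index', overwrite=True)
        # Downstream queries (filter_index, samplers, ...) build on this view,
        # so work only happens when results are actually requested.
        return self._sd.view('index')
```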
I refactored to use this. It actually uncovered a bug in TorchGeo where we used `gpd.read_file` everywhere, which doesn't let us read Parquet files or directories (!!!) cc @adamjstewart
torchgeo/datasets/sedonadb.py (outdated diff):
```python
if t.step != 1:
    filtered_index_gdf = filtered_index_gdf.iloc[:: t.step]
if len(filtered_index_gdf) == 0:
```
Perhaps the initial creation of the index could be like:
```python
self._sd.read_pyogrio(files).to_view("index_raw", overwrite=True)
self._sd.sql("""SELECT row_number() as i, t, geometry FROM (SELECT * FROM index_raw ORDER BY t)""")
```

...which would possibly let you do something like `self._sd.sql("""SELECT * FROM filtered_index WHERE i % {t.step} = 0""").to_view("filtered_index", overwrite=True)` here.
@paleolimbot I appreciate the review comments about creating the index. I did some initial benchmarks and SedonaDB ABSOLUTELY SLAPS at spatial intersection filtering (see PR description above) cc: @calebrob6 @adamjstewart
This PR is still a WIP at the moment, but so far it's looking verrrry good. Most of the remaining work is modifying the constructor to use SedonaDB to create the index. Basically the following happens:
Would a "driver" flag in VectorDataset (that could be "geopandas" or "sedonadb") be possible? E.g., as a user, if I want to use an existing VectorDataset but want the sick gains of SedonaDB, it'd be cool if I could just switch somehow. Also, I like this benchmarking script and am curious how the old backend performs on it.
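A hypothetical sketch of what such a flag could look like; nothing here is actual TorchGeo API, and the method names are illustrative only:

```python
class VectorDataset:
    def __init__(self, paths: str, driver: str = 'geopandas') -> None:
        if driver not in ('geopandas', 'sedonadb'):
            raise ValueError(f'Unsupported driver: {driver}')
        self.paths = paths
        self.driver = driver

    def filter_index(self, query):
        # Dispatch to a backend-specific implementation based on the flag.
        if self.driver == 'sedonadb':
            return self._filter_index_sedonadb(query)
        return self._filter_index_geopandas(query)

    def _filter_index_geopandas(self, query):
        raise NotImplementedError

    def _filter_index_sedonadb(self, query):
        raise NotImplementedError
```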
Would like to continue the discussion in #3160 once Isaac returns from paternity leave, but some minor comments on the proposed implementation: I'm not opposed to supporting multiple backends, but note:
More generally, the reason we didn't consider database backends like PostGIS/DuckDB/SedonaDB is my fear that this would require users to know about these technologies, install non-Python deps, and manually set up their own databases. Also, no one suggested them when I asked around during our 6 month backend search. For reference, I tried many times to get the GeoTorchAI unit tests running on my laptop, but this didn't work as I didn't have SedonaDB installed.

If this has changed and the setup process is now much easier, we can revisit these, as I expect them to be quite performant depending entirely on I/O speeds. However, note that not all users may have write access on the systems they run TorchGeo on, so we may not be able to switch to a file-based DB as the default.
@adamjstewart Hi Adam, thank you again for all your contributions to TorchGeo. Since Isaac is currently on leave, I wanted to clarify a few things here.

SedonaDB and Apache Sedona

SedonaDB is a new subproject under Apache Sedona, but it is not the same as SedonaSpark, SedonaFlink, or other distributed Sedona engines. SedonaDB is a single-machine data processing tool that requires zero installation beyond a simple `pip install`.

SedonaDB and PostGIS

SedonaDB is designed specifically for embedded and self-contained use cases. It requires zero database setup and no data ingestion process like PostGIS. It works directly on your data files, identical to the GeoPandas user experience. PostGIS is more of a transactional database. For analytical workloads such as filtering, joins, unions, and aggregations, SedonaDB is orders of magnitude faster. Because of how large the performance gap is, we did not even include PostGIS in our SpatialBench comparison, since it would be an extremely unfair comparison.

SedonaDB and GeoTorchAI

GeoTorchAI was a research prototype that relied on Apache Sedona through SedonaSpark rather than SedonaDB. I agree with you that SedonaSpark can be complicated to operate in certain environments.

SedonaDB functionalities

SedonaDB was released in September 2025 and is positioned as a GeoPandas alternative. You can find the full list of supported functions here. We also conducted a comprehensive benchmark comparing SedonaDB, GeoPandas, and DuckDB using SpatialBench. SpatialBench is designed to evaluate geospatial analytical performance across different systems. It examines performance from multiple angles, including individual spatial functions.
As for inserting new rows into a GeoPandas DataFrame, I do not believe GeoPandas currently supports this, and neither does SedonaDB nor DuckDB. Hope this helps clarify things and addresses your concerns.
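A small sketch of the zero-setup workflow described above, using only the `sedona.db` calls that appear elsewhere in this PR (`connect`, `read_parquet`, `to_view`, `sql`); the file path is a placeholder and the exact API should be checked against the SedonaDB docs:

```python
import sedona.db as sedona_db

# Connect to an in-process SedonaDB instance: no server, no ingestion step.
sd = sedona_db.connect()

# Register a (Geo)Parquet file as a view, queried in place.
sd.read_parquet('buildings.parquet').to_view('buildings', overwrite=True)

# Run spatial SQL directly against the file-backed view.
counts = sd.sql('SELECT COUNT(*) AS n FROM buildings')
```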
Thanks, this actually helps a lot. So we didn't consider SedonaDB during our search because it didn't exist at the time we did our literature review. Glad to know that it's easier to install than Apache Sedona and doesn't require any setup.

SpatialBench is actually of great interest to me. We have our own benchmarking comparing R-tree, Shapely, GeoPandas, and STAC. It would be interesting to add some of those to the comparison, although they all obviously have very different features.

I still need to think about the best way to do this. I really don't want a new SedonaDBDataset, as we would have to duplicate all 30+ existing GeoDataset subclasses to actually take advantage of this. Maybe a

I'll add this to the agenda for our monthly Technical Steering Committee meeting. Not sure if we'll get to it in January or February, but I'm more open to this idea now that I understand it better.
Part of TorchGeo's API design is to use inheritance by providing base classes. For this reason I don't think it's necessary in this PR to completely redo the backend and add it to every single dataset that inherits from GeoDataset. I was hoping to scope this to just creating an experimental SedonaDBDataset, as an optional dependency, that users can start to experiment with. As mentioned, the gains in spatial intersection are quite large and I don't think we should delay getting these kinds of speedups into the library. It also helps our Wherobots developers begin to consider optimizations that would particularly help with improving geospatial sampling for ML training workflows.
Well, this won't make it into a release for quite some time regardless of whether we merge today or in a few months. I want to speed up the release cycle after 1.0, but at the moment we're busy breaking GeoDataset and GeoSampler to add time series support. These features need time to test and mature before making it into a release, especially because they are backwards-incompatible.
We should discuss this at our monthly meeting, since it seems you are prescribing a release schedule without input from the rest of the maintainers. My opinion is that we should really increase our release frequency, particularly for certain types of features. I imagine others would agree with this as well. Adding new UNet model weights, for example, should not take 6 months to become available on PyPI. 1.0 or not isn't really relevant to this PR.
These types of large breaking changes over multiple PRs should consider being moved to a

My overall 2c is that features like this should get merged sooner rather than later instead of being stalled by reasons that aren't relevant to what's proposed in this PR.
Summary
See reasoning in #3160.
This PR implements the following:

- Abstracts the spatial filtering in `VectorDataset` into a method `VectorDataset.filter_index`, which can be overridden for new backends (a minimal sketch of the override pattern follows this list).
- Adds `torchgeo.datasets.SedonaDBDataset`.
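A minimal sketch of that override pattern; the method signatures are hypothetical, and the `index_df`/`self._sd` names are taken from elsewhere in this PR and may differ from the final code:

```python
class VectorDataset:
    def filter_index(self, query):
        # Default GeoPandas backend: spatial filter on the in-memory index.
        return self.index[self.index.geometry.intersects(query)]


class SedonaDBDataset(VectorDataset):
    def filter_index(self, query):
        # SedonaDB backend: push the intersection down into spatial SQL.
        sql = f"""
            SELECT * FROM index_df
            WHERE ST_Intersects(geometry, ST_GeomFromWKT('{query.wkt}'))
        """
        return self._sd.sql(sql)
```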
Benchmarking

I added a benchmarking script that simply queries the `filter_index` method of both datasets iteratively, and these are the results. The data I used is the Washington state buildings from the Microsoft Open Buildings dataset (converted to Parquet).

VectorDataset Performance:
Initialization time: 2.27 seconds
Filter time (50 slices): 105.54 seconds
Time per slice: 2.11 seconds
Total geometries found: 21,235
Geometries per second: 201.20
SedonaDBDataset Performance:
Initialization time: 2.06 seconds
Filter time (50 slices): 12.89 seconds
Time per slice: 0.26 seconds
Total geometries found: 21,235
Geometries per second: 1,647.26
Speedup: 8.19x
SedonaDBDataset is about 8.19x faster than VectorDataset for filtering operations, processing ~1,647 geometries/second vs ~201 geometries/second. Both found the same 21,235 geometries, confirming correctness.
TODO
Possibly need to implement separate `SedonaDBVectorDataset` and `SedonaDBRasterDataset` that can use `IntersectionDataset` and `UnionDataset` (could be in a follow-up PR, though)
cc: @jiayuasu @rbavery @paleolimbot