High-performance OpenStreetMap data extraction tool powered by DuckDB
Extract and filter OpenStreetMap data from PBF files to GeoParquet or DuckDB format. Built with Rust and DuckDB for maximum performance and minimal dependencies.
- Fast: Optimized SQL queries with MATERIALIZED CTEs
- Smart: Auto-detects CPU cores and memory - zero configuration needed
- Simple: Single 38 MB binary with zero runtime dependencies
- Powerful: Full OSM data model support (nodes, ways, relations, multipolygons)
- Flexible filtering: Tag filters, geometry filters, and custom SQL
- GIS-ready output: GeoParquet v1.1.0 metadata infrastructure (WKB encoding)
- Data quality validation: Optional validation to detect missing references and data issues
- DuckDB-powered: Leverages DuckDB's spatial extension for processing
Download the latest release for your platform from GitHub Releases:
# Linux x86_64
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-linux-amd64
chmod +x osmextract-linux-amd64
sudo mv osmextract-linux-amd64 /usr/local/bin/osmextract
# Linux ARM64
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-linux-arm64
chmod +x osmextract-linux-arm64
sudo mv osmextract-linux-arm64 /usr/local/bin/osmextract
# macOS Apple Silicon
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-macos-arm64
chmod +x osmextract-macos-arm64
sudo mv osmextract-macos-arm64 /usr/local/bin/osmextract
# Windows x86_64 (PowerShell)
# Download from: https://github.com/tobilg/osmextract/releases/latest/download/osmextract-windows-amd64.exe
Verify installation:
osmextract --version
Available platforms:
- Linux x86_64 (amd64)
- Linux ARM64 (aarch64)
- macOS Apple Silicon (ARM64)
- Windows x86_64 (amd64)
# Prerequisites: Rust 1.75+ (https://rustup.rs)
# Clone repository
git clone https://github.com/tobilg/osmextract
cd osmextract
# Build release binary
cargo build --release
# Binary location
./target/release/osmextract
# Extract all features from a PBF file
osmextract input.osm.pbf -o output.parquet
# Extract from URL
osmextract https://download.geofabrik.de/europe/monaco-latest.osm.pbf -o monaco.parquet
# Extract buildings only
osmextract city.pbf --tags-filter '{"building": true}' -o buildings.parquet
# Extract within bounding box
osmextract city.pbf --geom-filter-bbox "7.41,43.73,7.44,43.75" -o area.parquet
# Combine filters
osmextract city.pbf \
--tags-filter '{"amenity": ["restaurant", "cafe"]}' \
--geom-filter-bbox "7.41,43.73,7.44,43.75" \
-o cafes.parquet
| Option | Description | Example |
|---|---|---|
| <INPUT> | Input PBF file or URL | city.pbf or https://... |
| -o, --output <PATH> | Output file (.parquet or .duckdb) | -o output.parquet |
| -v, --verbose | Show progress information | -v |
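When scripting extractions, it can be easier to assemble the command line from Python values than to hand-quote JSON in a shell. The following is a minimal sketch and not part of osmextract: `build_osmextract_command` is a hypothetical helper, and it only emits flags that appear in this README (`-o`, `--tags-filter`, `--geom-filter-bbox`, `--verbose`).

```python
import json
import shlex


def build_osmextract_command(input_path, output_path, tags_filter=None,
                             bbox=None, verbose=False):
    """Hypothetical helper: build an argument list for the osmextract CLI."""
    cmd = ["osmextract", input_path, "-o", output_path]
    if tags_filter is not None:
        # json.dumps turns Python True into JSON true, as the tag filter expects
        cmd += ["--tags-filter", json.dumps(tags_filter)]
    if bbox is not None:
        # bbox is (minx, miny, maxx, maxy), joined into the comma-separated form
        cmd += ["--geom-filter-bbox", ",".join(str(v) for v in bbox)]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_osmextract_command(
    "city.pbf",
    "cafes.parquet",
    tags_filter={"amenity": ["restaurant", "cafe"]},
    bbox=(7.41, 43.73, 7.44, 43.75),
)
# Print a shell-quoted version for inspection
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the extraction without going through a shell, which avoids quoting problems with the JSON argument.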
Filter features by OSM tags using JSON format.
| Option | Description |
|---|---|
| --tags-filter <JSON> | Filter tags (inline JSON) |
| --tags-filter-file <PATH> | Filter tags from file |
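The matching rules for the filter JSON documented in this section (key presence with `true`, exact value with a string, any-of with a list, and OR across keys) can be sketched in a few lines of Python. This is an illustration of the documented semantics only; `matches_tag_filter` is a hypothetical function, and osmextract itself evaluates filters in DuckDB SQL.

```python
def matches_tag_filter(tags, tag_filter):
    """Return True if a feature's tags satisfy any entry of the filter (OR)."""
    for key, wanted in tag_filter.items():
        if key not in tags:
            continue
        if wanted is True:
            # Key presence: any value matches
            return True
        if isinstance(wanted, str) and tags[key] == wanted:
            # Exact value
            return True
        if isinstance(wanted, list) and tags[key] in wanted:
            # Multiple values (OR)
            return True
    return False


feature_tags = {"amenity": "cafe", "name": "Example"}
print(matches_tag_filter(feature_tags, {"amenity": ["restaurant", "cafe"]}))
print(matches_tag_filter(feature_tags, {"building": True}))
```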
Tag Filter Formats:
# Key presence: any value
--tags-filter '{"building": true}'
# Exact value
--tags-filter '{"amenity": "restaurant"}'
# Multiple values (OR)
--tags-filter '{"highway": ["primary", "secondary"]}'
# Multiple keys (OR)
--tags-filter '{"building": true, "amenity": true}'
Examples:
# All buildings
osmextract city.pbf --tags-filter '{"building": true}' -o buildings.parquet
# Restaurants and cafes
osmextract city.pbf \
--tags-filter '{"amenity": ["restaurant", "cafe"]}' \
-o food.parquet
# Major roads
osmextract region.pbf \
--tags-filter '{"highway": ["motorway", "trunk", "primary"]}' \
-o roads.parquet
# From file
cat > filters.json << EOF
{
"building": true,
"amenity": ["school", "hospital"],
"shop": true
}
EOF
osmextract city.pbf --tags-filter-file filters.json -o pois.parquet
Filter features by spatial location.
| Option | Description |
|---|---|
| --geom-filter-bbox <BBOX> | Bounding box: minx,miny,maxx,maxy |
| --geom-filter-wkt <WKT> | WKT geometry (POINT, POLYGON, etc.) |
| --geom-filter-geojson <JSON> | GeoJSON geometry |
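The bounding box string uses the order minx,miny,maxx,maxy; in the Monaco examples in this README that means longitude first, then latitude. A small sketch of parsing and sanity-checking such a string before passing it to the CLI is shown here. `parse_bbox` and `bbox_contains` are hypothetical helpers, and the point-in-box test is only an illustration: how osmextract treats features that partly overlap the box is not specified here.

```python
def parse_bbox(text):
    """Parse 'minx,miny,maxx,maxy' into a tuple of four floats."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError("expected minx,miny,maxx,maxy")
    minx, miny, maxx, maxy = parts
    if minx > maxx or miny > maxy:
        raise ValueError("min values must not exceed max values")
    return minx, miny, maxx, maxy


def bbox_contains(bbox, x, y):
    """Check whether the point (x, y) lies inside the box (edges included)."""
    minx, miny, maxx, maxy = bbox
    return minx <= x <= maxx and miny <= y <= maxy


bbox = parse_bbox("7.41,43.73,7.44,43.75")
print(bbox_contains(bbox, 7.42, 43.74))   # longitude, latitude
print(bbox_contains(bbox, 43.74, 7.42))   # swapped axes fall outside
```

Swapped axes are a common reason for empty output, so a check like this is worth running first.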
Examples:
# Bounding box (most common)
osmextract city.pbf \
--geom-filter-bbox "7.41,43.73,7.44,43.75" \
-o area.parquet
# WKT polygon
osmextract city.pbf \
--geom-filter-wkt "POLYGON((7.4 43.7, 7.5 43.7, 7.5 43.8, 7.4 43.8, 7.4 43.7))" \
-o polygon_area.parquet
# GeoJSON
osmextract city.pbf \
--geom-filter-geojson '{"type":"Point","coordinates":[7.42,43.74]}' \
-o point_area.parquet
Apply advanced filters using DuckDB SQL expressions.
| Option | Description |
|---|---|
| --custom-sql-filter <SQL> | SQL WHERE clause condition |
Available variables:
- tags - Tag map (use map_keys(), map_extract())
- geometry - Geometry object (use ST_* functions)
- feature_id - OSM ID
Examples:
# Features with names
osmextract city.pbf \
--custom-sql-filter "list_contains(map_keys(tags), 'name')" \
-o named.parquet
# Buildings with >5 tags (complex features)
osmextract city.pbf \
--tags-filter '{"building": true}' \
--custom-sql-filter "cardinality(tags) > 5" \
-o complex_buildings.parquet
# Features with address
osmextract city.pbf \
--custom-sql-filter "list_contains(map_keys(tags), 'addr:street')" \
-o with_address.parquet
# Large polygons (area > 10000 sq meters)
osmextract city.pbf \
--tags-filter '{"building": true}' \
--custom-sql-filter "ST_Area(geometry) > 10000" \
-o large_buildings.parquet
osmextract automatically detects and optimizes for your system, but you can override settings:
| Option | Description | Default |
|---|---|---|
| --threads <N> | Number of processing threads | Auto-detected CPU cores |
| --memory-limit <SIZE> | Memory limit (e.g., "8GB") | Auto-detected (50% of RAM) |
| --max-temp-directory-size <SIZE> | Max temp disk usage | Auto-detected (10x RAM, max 100GB) |
| --temp-directory <PATH> | Temp directory location | System default |
| --checkpoint-threshold <SIZE> | DuckDB checkpoint size | 256MB |
| --compression <TYPE> | Parquet compression | zstd |
| --row-group-size <N> | Parquet row group size | 100000 |
Compression types: zstd (default), snappy, gzip, brotli, uncompressed
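To estimate which defaults will apply on a given machine, the rules in the table (threads equal to CPU cores, memory limit at 50% of RAM, temp directory size at 10x RAM capped at 100GB) can be written out as arithmetic. This is a sketch of the documented rules only; `default_resource_limits` is a hypothetical function, and the tool's actual detection and rounding may differ.

```python
def default_resource_limits(total_ram_gb, cpu_cores):
    """Apply the documented default rules to a machine's RAM and core count."""
    return {
        "threads": cpu_cores,
        # 50% of RAM
        "memory_limit_gb": total_ram_gb * 0.5,
        # 10x RAM, but never more than 100 GB
        "max_temp_directory_size_gb": min(total_ram_gb * 10, 100),
    }


# A machine with 16 GB RAM and 8 cores
print(default_resource_limits(16, 8))
# A small machine with 4 GB RAM and 2 cores
print(default_resource_limits(4, 2))
```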
Examples:
# Auto-detected settings (recommended for most cases)
osmextract large.pbf -o output.parquet --verbose
# Output shows: "Threads: 8 (auto-detected)", "Memory limit: 8GB (auto-detected)"
# Override specific settings
osmextract large.pbf -o output.parquet \
--threads 16 \
--memory-limit "16GB"
# Faster compression (larger files)
osmextract large.pbf -o output.parquet --compression snappy
# Better compression (slower)
osmextract large.pbf -o output.parquet --compression brotli
# Constrained memory
osmextract large.pbf -o output.parquet --memory-limit "4GB"
# Large file processing with custom temp directory
osmextract huge.pbf -o output.parquet \
--temp-directory "/nvme/tmp" \
--max-temp-directory-size "200GB"
| Option | Description | Default |
|---|---|---|
| -o <PATH> | Output file path | Required |
| --table-name <NAME> | DuckDB table name | osm_features |
| --add-index | Create R-tree spatial index (DuckDB only) | false |
Output formats:
- .parquet or .geoparquet - GeoParquet file with WKB geometry (v1.1.0 metadata infrastructure ready)
- .duckdb or .db - DuckDB database file
# GeoParquet output (most common)
osmextract city.pbf -o city.parquet
# DuckDB database
osmextract city.pbf -o city.duckdb --table-name features
# DuckDB with R-tree spatial index (10-1000x faster spatial queries)
osmextract city.pbf -o city.duckdb --add-index
# Custom table name with index
osmextract city.pbf -o data.duckdb --table-name buildings \
--tags-filter '{"building": true}' \
--add-index
R-tree Spatial Index Benefits:
- 10-1000x faster spatial queries (ST_Intersects, ST_Contains, etc.)
- Bounding box queries: Dramatically faster spatial filtering
- Trade-off: Adds ~5-15% to file size, small increase in processing time
- Note: Index is created on geom column (GEOMETRY type), while geometry column (WKB) is kept for compatibility
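A query that can benefit from the index filters on the geom column with a spatial predicate against a constant geometry. The sketch below builds such a query in Python. It assumes the default table name osm_features and that the DuckDB output exposes a feature_id column like the Parquet output does; `bbox_query` and `run_query` are hypothetical helpers, and `run_query` needs the `duckdb` Python package plus an existing database file.

```python
def bbox_query(table, minx, miny, maxx, maxy):
    """Build a bounding-box query against the indexed geom column."""
    # The table name is interpolated directly, so only pass trusted values
    return (
        f"SELECT feature_id, tags FROM {table} "
        f"WHERE ST_Intersects(geom, ST_MakeEnvelope({minx}, {miny}, {maxx}, {maxy}))"
    )


def run_query(db_path, sql):
    """Run the query against a DuckDB file created with --add-index."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect(db_path, read_only=True)
    con.execute("INSTALL spatial; LOAD spatial;")
    return con.execute(sql).fetchall()


sql = bbox_query("osm_features", 7.41, 43.73, 7.44, 43.75)
print(sql)
# rows = run_query("city.duckdb", sql)  # uncomment once city.duckdb exists
```

Prefixing the statement with `EXPLAIN` in the DuckDB shell is one way to check whether the index is actually used for a given query.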
Use --validate to check for data quality issues before processing:
# Check for missing node/way references
osmextract region.pbf -o output.parquet --validate --verbose
# Example output:
# === Data Quality Report ===
# Total features processed: 10446
#
# Issues found:
# ⚠ 86 relations reference ways not present in the dataset
#
# Note: These issues are common with filtered/clipped extracts.
# Consider using a larger bbox or processing the full region.
What validation checks:
- Missing node references: Ways that reference nodes not in the dataset
- Incomplete ways: Ways that resolve to fewer than 2 nodes (invalid geometry)
- Missing way references: Relations that reference ways not in the dataset
Common causes:
- Clipped/filtered extracts from Geofabrik or other sources
- Bbox filters that cut through features
- Tag filters that exclude needed nodes/ways
Validation runs before processing and doesn't prevent output - it just warns about potential issues.
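The three checks can be illustrated on toy in-memory data. This is a conceptual sketch, assuming nodes are given as a set of IDs and ways and relations as dictionaries of references; `validate` is a hypothetical function and does not reflect how osmextract implements validation internally.

```python
def validate(nodes, ways, relations):
    """Report reference problems.

    nodes: set of node IDs present in the dataset
    ways: dict mapping way ID -> list of referenced node IDs
    relations: dict mapping relation ID -> list of referenced way IDs
    """
    report = {
        "ways_missing_nodes": [],
        "incomplete_ways": [],
        "relations_missing_ways": [],
    }
    for way_id, refs in ways.items():
        resolved = [n for n in refs if n in nodes]
        if len(resolved) < len(refs):
            # At least one referenced node is not in the dataset
            report["ways_missing_nodes"].append(way_id)
        if len(resolved) < 2:
            # Fewer than 2 resolved nodes cannot form a valid geometry
            report["incomplete_ways"].append(way_id)
    for rel_id, way_refs in relations.items():
        if any(w not in ways for w in way_refs):
            report["relations_missing_ways"].append(rel_id)
    return report


nodes = {1, 2, 3}
ways = {10: [1, 2, 3], 11: [3, 4], 12: [5, 6]}
relations = {100: [10, 11], 101: [10, 99]}
print(validate(nodes, ways, relations))
```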
osmextract city.pbf \
--tags-filter '{"building": true}' \
--geom-filter-bbox "7.415,43.73,7.435,43.75" \
-o downtown_buildings.parquet \
--verbose
# Roads
osmextract region.pbf \
--tags-filter '{"highway": true}' \
-o roads.parquet
# Railways
osmextract region.pbf \
--tags-filter '{"railway": true}' \
-o railways.parquet
# Public transit stops
osmextract region.pbf \
--tags-filter '{"highway": "bus_stop", "railway": ["station", "halt"]}' \
-o transit_stops.parquet
# Restaurants and cafes with names
osmextract city.pbf \
--tags-filter '{"amenity": ["restaurant", "cafe", "bar"]}' \
--custom-sql-filter "list_contains(map_keys(tags), 'name')" \
-o food_named.parquet
# Healthcare facilities
osmextract city.pbf \
--tags-filter '{"amenity": ["hospital", "clinic", "pharmacy"]}' \
-o healthcare.parquet
# Educational institutions
osmextract city.pbf \
--tags-filter '{"amenity": ["school", "university", "college"]}' \
-o education.parquet
#!/bin/bash
# Process multiple countries
regions=(
"monaco"
"andorra"
"liechtenstein"
)
for region in "${regions[@]}"; do
echo "Processing $region..."
osmextract \
"https://download.geofabrik.de/europe/${region}-latest.osm.pbf" \
-o "${region}.parquet" \
--verbose
done
#!/bin/bash
PBF="city.pbf"
# Buildings
osmextract "$PBF" --tags-filter '{"building": true}' -o buildings.parquet &
# Roads
osmextract "$PBF" --tags-filter '{"highway": true}' -o roads.parquet &
# Water
osmextract "$PBF" --tags-filter '{"natural": "water", "waterway": true}' -o water.parquet &
# Wait for all
wait
echo "All extractions complete!"
# Install spatial extension and query
duckdb -c "
INSTALL spatial;
LOAD spatial;
-- Count features by type
SELECT
ST_GeometryType(geometry) as type,
COUNT(*) as count
FROM read_parquet('output.parquet')
GROUP BY type;
"
# Get feature statistics
duckdb -c "
INSTALL spatial;
LOAD spatial;
SELECT
COUNT(*) as total_features,
COUNT(DISTINCT feature_id) as unique_ids,
MIN(ST_Area(geometry)) as min_area,
MAX(ST_Area(geometry)) as max_area
FROM read_parquet('buildings.parquet');
"
# Extract specific tags
duckdb -c "
INSTALL spatial;
LOAD spatial;
SELECT
feature_id,
map_extract(tags, 'name') as name,
map_extract(tags, 'building') as building_type,
ST_Area(geometry) as area_sqm
FROM read_parquet('buildings.parquet')
WHERE list_contains(map_keys(tags), 'name')
LIMIT 10;
"
duckdb -c "
INSTALL spatial;
LOAD spatial;
COPY (
SELECT
feature_id,
tags,
ST_AsGeoJSON(geometry) as geometry
FROM read_parquet('output.parquet')
) TO 'output.geojson' (FORMAT JSON);
"
- Open QGIS
- Layer → Add Layer → Add Vector Layer
- Select your .parquet file
- QGIS automatically recognizes GeoParquet format
- Layer loads with all attributes and geometry
import duckdb
import geopandas as gpd
# Read with DuckDB
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")
df = conn.execute("""
SELECT * FROM read_parquet('output.parquet')
""").fetchdf()
# Or use GeoPandas directly
gdf = gpd.read_parquet('output.parquet')
print(gdf.head())
- DuckDB-centric: Leverage DuckDB's spatial extension for all processing
- Minimal dependencies: Only 6 runtime crates, everything bundled
- Zero-copy: Direct PBF → DuckDB → Parquet pipeline
- Type-safe: Rust's type system prevents runtime errors
- Auto-optimizing: Detects system resources and tunes performance automatically
OSM PBF → DuckDB ST_READOSM → Filtering → Geometry Processing → GeoParquet
↓ ↓ ↓ ↓
Input Tag/Spatial Ways→Polygons Output
Filters Relations→MPs
| Component | Technology | Version |
|---|---|---|
| Core | Rust | 2021 edition |
| Database | DuckDB | 1.4.2 |
| CLI | clap | 4.5.52 |
| Serialization | serde/serde_json | 1.0.228/1.0.145 |
| Error handling | thiserror | 2.0.17 |
| System detection | num_cpus | 1.16.0 |
| Feature | Status | Notes |
|---|---|---|
| Nodes → Points | ✅ | Full support |
| Ways → LineStrings | ✅ | Non-closed ways |
| Ways → Polygons | ✅ | OSM polygon detection |
| Relations → MultiPolygons | ✅ | With hole cutting |
| Tags | ✅ | All tags preserved in map |
| Metadata | ✅ | OSM IDs preserved |
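The split between LineStrings and Polygons in the table depends on whether a way is closed. The sketch below shows only that basic distinction, under the assumption that a closed ring has at least four node references with identical first and last entries, and that an explicit area=no tag forces a line. `classify_way` is a hypothetical function; the tool's actual polygon detection rules are not described in this README and are likely more tag-aware.

```python
def classify_way(node_refs, tags):
    """Simplified heuristic: closed rings become Polygons, others LineStrings."""
    # A ring needs at least 4 references: 3 distinct nodes plus the repeated first
    closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    if closed and tags.get("area") != "no":
        return "Polygon"
    return "LineString"


print(classify_way([1, 2, 3, 1], {"building": "yes"}))
print(classify_way([1, 2, 3], {"highway": "residential"}))
```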
| Filter Type | Status | Performance |
|---|---|---|
| Tag presence | ✅ | Fast |
| Tag exact value | ✅ | Fast |
| Tag multiple values | ✅ | Fast |
| Bounding box | ✅ | Very fast |
| WKT geometry | ✅ | Medium |
| GeoJSON geometry | ✅ | Medium |
| Custom SQL | ✅ | Varies |
| Format | Status | Features |
|---|---|---|
| GeoParquet | ✅ | v1.1.0 metadata, WKB encoding |
| DuckDB | ✅ | Direct database creation |
| Compression | ✅ | zstd, snappy, gzip, brotli |
# Prerequisites
# - Rust 1.75+ (https://rustup.rs)
# - ~500 MB disk space for dependencies
# Clone and build
git clone https://github.com/tobilg/osmextract
cd osmextract
cargo build --release
# Binary location
./target/release/osmextract
# Run tests
cargo test --release
# All tests (53 tests)
cargo test
# With output
cargo test -- --nocapture
# Specific test
cargo test test_tag_filter
# Release mode (faster)
cargo test --release
Problem: Long first build time (2-4 minutes)
- Cause: DuckDB C++ compilation from source
- Solution: This is expected; subsequent builds take ~2 seconds
Problem: Out of memory during build
- Solution: Close other applications, or build without --release first
Problem: "Cannot open file" error
- Check: File path is correct and file exists
- Check: URL is accessible (try with curl/wget first)
Problem: "Out of memory" error
- Note: osmextract auto-detects memory and sets conservative limits (50% of RAM)
- Solution: Manually constrain with --memory-limit "4GB" if needed
- Solution: Process smaller regions or use bbox filter
Problem: Empty output file
- Check: Input file contains data in the filtered area/tags
- Check: Filters are correct (test without filters first)
- Solution: Use --verbose to see what's happening
Problem: Slow processing
- Note: osmextract auto-detects CPU cores and sets optimal thread count
- Solution: Override with --threads N if needed (e.g., to limit resource usage)
- Solution: Apply tag filters before geometry filters
- Solution: Use bbox instead of WKT/GeoJSON filters when possible
Problem: Slow spatial queries on DuckDB output
- Solution: Use --add-index flag to create R-tree spatial index
- Note: Index provides 10-1000x speedup for queries using ST_Intersects, ST_Contains, etc.
- Example: osmextract city.pbf -o city.duckdb --add-index
Q: How is this different from osmium or ogr2ogr?
A: osmextract uses DuckDB's spatial extension for processing, offering simpler filtering syntax and direct GeoParquet output. It has zero runtime dependencies.
Q: What about the Python QuackOSM library?
A: QuackOSM is excellent for Python workflows and for integration into existing data processing libraries.
Q: Does it support OSM XML files?
A: No, currently only the PBF format is supported. PBF is smaller, faster, and the standard distribution format.
Q: Can I contribute?
A: Contributions welcome! See GitHub issues for ideas.
Q: What license?
A: Apache 2.0 (permissive, commercial-friendly)
Apache License 2.0 - see LICENSE file for details.