osmextract

High-performance OpenStreetMap data extraction tool powered by DuckDB

Extract and filter OpenStreetMap data from PBF files to GeoParquet or DuckDB format. Built with Rust and DuckDB for maximum performance and minimal dependencies.

Features

  • Fast: Optimized SQL queries with MATERIALIZED CTEs
  • Smart: Auto-detects CPU cores and memory - zero configuration needed
  • Simple: Single 38 MB binary with zero runtime dependencies
  • Powerful: Full OSM data model support (nodes, ways, relations, multipolygons)
  • Flexible filtering: Tag filters, geometry filters, and custom SQL
  • GIS-ready output: GeoParquet v1.1.0 metadata infrastructure (WKB encoding)
  • Data quality validation: Optional validation to detect missing references and data issues
  • DuckDB-powered: Leverages DuckDB's spatial extension for processing

Quick Start

Installation

Option 1: Download Pre-built Binary (Recommended)

Download the latest release for your platform from GitHub Releases:

# Linux x86_64
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-linux-amd64
chmod +x osmextract-linux-amd64
sudo mv osmextract-linux-amd64 /usr/local/bin/osmextract

# Linux ARM64
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-linux-arm64
chmod +x osmextract-linux-arm64
sudo mv osmextract-linux-arm64 /usr/local/bin/osmextract

# macOS Apple Silicon
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-macos-arm64
chmod +x osmextract-macos-arm64
sudo mv osmextract-macos-arm64 /usr/local/bin/osmextract

# Windows x86_64 (PowerShell)
# Download from: https://github.com/tobilg/osmextract/releases/latest/download/osmextract-windows-amd64.exe

Verify installation:

osmextract --version

Available platforms:

  • Linux x86_64 (amd64)
  • Linux ARM64 (aarch64)
  • macOS Apple Silicon (ARM64)
  • Windows x86_64 (amd64)

Option 2: Build from Source

# Prerequisites: Rust 1.75+ (https://rustup.rs)

# Clone repository
git clone https://github.com/tobilg/osmextract
cd osmextract

# Build release binary
cargo build --release

# Binary location
./target/release/osmextract

Basic Usage

# Extract all features from a PBF file
osmextract input.osm.pbf -o output.parquet

# Extract from URL
osmextract https://download.geofabrik.de/europe/monaco-latest.osm.pbf -o monaco.parquet

# Extract buildings only
osmextract city.pbf --tags-filter '{"building": true}' -o buildings.parquet

# Extract within bounding box
osmextract city.pbf --geom-filter-bbox "7.41,43.73,7.44,43.75" -o area.parquet

# Combine filters
osmextract city.pbf \
  --tags-filter '{"amenity": ["restaurant", "cafe"]}' \
  --geom-filter-bbox "7.41,43.73,7.44,43.75" \
  -o cafes.parquet

Command Reference

Basic Options

| Option | Description | Example |
| --- | --- | --- |
| <INPUT> | Input PBF file or URL | city.pbf or https://... |
| -o, --output <PATH> | Output file (.parquet or .duckdb) | -o output.parquet |
| -v, --verbose | Show progress information | -v |

Tag Filtering

Filter features by OSM tags using JSON format.

| Option | Description |
| --- | --- |
| --tags-filter <JSON> | Filter tags (inline JSON) |
| --tags-filter-file <PATH> | Filter tags from file |

Tag Filter Formats:

# Key presence: any value
--tags-filter '{"building": true}'

# Exact value
--tags-filter '{"amenity": "restaurant"}'

# Multiple values (OR)
--tags-filter '{"highway": ["primary", "secondary"]}'

# Multiple keys (OR)
--tags-filter '{"building": true, "amenity": true}'
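
The four formats above can be read as one small matching rule. The sketch below illustrates that rule in plain Python as this README describes it (key presence, exact value, any value in a list, OR across keys). `matches_tag_filter` is a hypothetical helper, not part of osmextract, and the tool's actual SQL-based implementation may differ in edge cases.

```python
def matches_tag_filter(tags, tag_filter):
    """Return True if an OSM tag dict satisfies a --tags-filter style dict.

    Hypothetical helper for illustration only; it mirrors the documented
    semantics, not osmextract's internal implementation.
    """
    for key, wanted in tag_filter.items():
        if key not in tags:
            continue
        if wanted is True:
            return True  # key presence: any value matches
        if isinstance(wanted, str) and tags[key] == wanted:
            return True  # exact value
        if isinstance(wanted, list) and tags[key] in wanted:
            return True  # any of the listed values
    return False  # no key matched; multiple keys are OR-ed


if __name__ == "__main__":
    cafe = {"amenity": "cafe", "name": "Example"}
    print(matches_tag_filter(cafe, {"building": True, "amenity": True}))
    print(matches_tag_filter(cafe, {"amenity": ["restaurant", "bar"]}))
```

Under these assumptions a feature is kept as soon as any single key in the filter matches, which is why adding keys widens the result rather than narrowing it.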

Examples:

# All buildings
osmextract city.pbf --tags-filter '{"building": true}' -o buildings.parquet

# Restaurants and cafes
osmextract city.pbf \
  --tags-filter '{"amenity": ["restaurant", "cafe"]}' \
  -o food.parquet

# Major roads
osmextract region.pbf \
  --tags-filter '{"highway": ["motorway", "trunk", "primary"]}' \
  -o roads.parquet

# From file
cat > filters.json << EOF
{
  "building": true,
  "amenity": ["school", "hospital"],
  "shop": true
}
EOF
osmextract city.pbf --tags-filter-file filters.json -o pois.parquet
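
Before passing a filter file to `--tags-filter-file`, it can help to check that it only uses the value shapes documented above. The following is a minimal sketch with a hypothetical `load_tag_filter` helper; it accepts `true`, a string, or a list of strings, and rejects everything else (including `false`, which this README does not document), which is an assumption rather than the tool's confirmed behavior.

```python
import json


def load_tag_filter(path):
    """Load a tag filter JSON file and check it uses only documented shapes.

    Hypothetical pre-flight check; osmextract performs its own parsing.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError("tag filter must be a JSON object")
    for key, value in data.items():
        is_valid = (
            value is True  # key presence
            or isinstance(value, str)  # exact value
            or (isinstance(value, list) and all(isinstance(v, str) for v in value))
        )
        if not is_valid:
            raise ValueError(f"unsupported filter value for {key!r}: {value!r}")
    return data
```

Calling `load_tag_filter("filters.json")` on the file created above would return the parsed dictionary, or raise `ValueError` if a value has an unexpected type.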

Geometry Filtering

Filter features by spatial location.

| Option | Description |
| --- | --- |
| --geom-filter-bbox <BBOX> | Bounding box: minx,miny,maxx,maxy |
| --geom-filter-wkt <WKT> | WKT geometry (POINT, POLYGON, etc.) |
| --geom-filter-geojson <JSON> | GeoJSON geometry |

Examples:

# Bounding box (most common)
osmextract city.pbf \
  --geom-filter-bbox "7.41,43.73,7.44,43.75" \
  -o area.parquet

# WKT polygon
osmextract city.pbf \
  --geom-filter-wkt "POLYGON((7.4 43.7, 7.5 43.7, 7.5 43.8, 7.4 43.8, 7.4 43.7))" \
  -o polygon_area.parquet

# GeoJSON
osmextract city.pbf \
  --geom-filter-geojson '{"type":"Point","coordinates":[7.42,43.74]}' \
  -o point_area.parquet
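
When the filter values come from a script, the bounding box string and the equivalent WKT polygon can be generated rather than typed by hand. This sketch assumes x is longitude and y is latitude, as in the examples above; `bbox_arg` and `bbox_to_wkt` are hypothetical helpers, and boxes crossing the antimeridian are deliberately rejected for simplicity.

```python
def bbox_arg(minx, miny, maxx, maxy):
    """Format a bounding box as 'minx,miny,maxx,maxy' (lon/lat order assumed)."""
    if not (-180 <= minx < maxx <= 180 and -90 <= miny < maxy <= 90):
        raise ValueError(
            "expected minx < maxx within [-180, 180] and miny < maxy within [-90, 90]"
        )
    return f"{minx},{miny},{maxx},{maxy}"


def bbox_to_wkt(minx, miny, maxx, maxy):
    """Return the same box as a closed WKT polygon ring."""
    ring = [(minx, miny), (maxx, miny), (maxx, maxy), (minx, maxy), (minx, miny)]
    return "POLYGON((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"


if __name__ == "__main__":
    print(bbox_arg(7.41, 43.73, 7.44, 43.75))
    print(bbox_to_wkt(7.4, 43.7, 7.5, 43.8))
```

The first result can be passed to `--geom-filter-bbox` and the second to `--geom-filter-wkt`.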

Custom SQL Filtering

Apply advanced filters using DuckDB SQL expressions.

| Option | Description |
| --- | --- |
| --custom-sql-filter <SQL> | SQL WHERE clause condition |

Available variables:

  • tags - Tag map (use map_keys(), map_extract())
  • geometry - Geometry object (use ST_* functions)
  • feature_id - OSM ID

Examples:

# Features with names
osmextract city.pbf \
  --custom-sql-filter "list_contains(map_keys(tags), 'name')" \
  -o named.parquet

# Buildings with >5 tags (complex features)
osmextract city.pbf \
  --tags-filter '{"building": true}' \
  --custom-sql-filter "cardinality(tags) > 5" \
  -o complex_buildings.parquet

# Features with address
osmextract city.pbf \
  --custom-sql-filter "list_contains(map_keys(tags), 'addr:street')" \
  -o with_address.parquet

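# Large polygons by area. Caution: ST_Area returns squared coordinate units, so on unprojected lon/lat data this threshold is in square degrees, not square meters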
osmextract city.pbf \
  --tags-filter '{"building": true}' \
  --custom-sql-filter "ST_Area(geometry) > 10000" \
  -o large_buildings.parquet
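
OSM coordinates are longitude/latitude in degrees, so a planar area computed directly on them comes out in square degrees. The sketch below shows how far apart the two units are for a small polygon. It assumes osmextract does not reproject geometries (this README does not say either way), and it uses a simple equirectangular scaling that is only reasonable for small shapes; the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def shoelace_area(ring):
    """Planar area of a closed ring of (x, y) pairs, in squared coordinate units."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def approx_area_m2(ring):
    """Approximate area in square meters of a small closed (lon, lat) ring."""
    # Average latitude over the distinct vertices (the last point repeats the first).
    mean_lat = sum(lat for _, lat in ring[:-1]) / (len(ring) - 1)
    m_per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(mean_lat))
    return shoelace_area(ring) * m_per_deg_lat * m_per_deg_lon


if __name__ == "__main__":
    # A 0.001 x 0.001 degree square near Monaco.
    ring = [(7.42, 43.74), (7.421, 43.74), (7.421, 43.741), (7.42, 43.741), (7.42, 43.74)]
    print(shoelace_area(ring), approx_area_m2(ring))
```

For this square the planar value is about one millionth of a square degree while the metric estimate is several thousand square meters, so a threshold chosen in one unit is meaningless in the other.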

Performance Tuning

osmextract automatically detects and optimizes for your system, but you can override settings:

| Option | Description | Default |
| --- | --- | --- |
| --threads <N> | Number of processing threads | Auto-detected CPU cores |
| --memory-limit <SIZE> | Memory limit (e.g., "8GB") | Auto-detected (50% of RAM) |
| --max-temp-directory-size <SIZE> | Max temp disk usage | Auto-detected (10x RAM, max 100GB) |
| --temp-directory <PATH> | Temp directory location | System default |
| --checkpoint-threshold <SIZE> | DuckDB checkpoint size | 256MB |
| --compression <TYPE> | Parquet compression | zstd |
| --row-group-size <N> | Parquet row group size | 100000 |

Compression types: zstd (default), snappy, gzip, brotli, uncompressed

Examples:

# Auto-detected settings (recommended for most cases)
osmextract large.pbf -o output.parquet --verbose
# Output shows: "Threads: 8 (auto-detected)", "Memory limit: 8GB (auto-detected)"

# Override specific settings
osmextract large.pbf -o output.parquet \
  --threads 16 \
  --memory-limit "16GB"

# Faster compression (larger files)
osmextract large.pbf -o output.parquet --compression snappy

# Better compression (slower)
osmextract large.pbf -o output.parquet --compression brotli

# Constrained memory
osmextract large.pbf -o output.parquet --memory-limit "4GB"

# Large file processing with custom temp directory
osmextract huge.pbf -o output.parquet \
  --temp-directory "/nvme/tmp" \
  --max-temp-directory-size "200GB"
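
As a rough illustration of the auto-detected defaults listed in the options table of this section, the following sketch derives them from a RAM size and core count. `default_settings` is a hypothetical helper: the 50% and 10x/100GB rules come from the table, while the units and the absence of rounding are assumptions, and osmextract's own detection code may behave differently.

```python
import os


def default_settings(total_ram_gb, cpu_cores=None):
    """Compute the documented defaults for a machine with the given resources."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    return {
        "threads": cores,  # one thread per detected core
        "memory_limit_gb": total_ram_gb * 0.5,  # 50% of RAM
        "max_temp_directory_size_gb": min(total_ram_gb * 10, 100),  # 10x RAM, capped
    }


if __name__ == "__main__":
    print(default_settings(total_ram_gb=16, cpu_cores=8))
```

With 16 GB of RAM and 8 cores this yields 8 threads and an 8 GB memory limit, which is consistent with the verbose output quoted in the first example above.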

Output Options

| Option | Description | Default |
| --- | --- | --- |
| -o <PATH> | Output file path | Required |
| --table-name <NAME> | DuckDB table name | osm_features |
| --add-index | Create R-tree spatial index (DuckDB only) | false |

Output formats:

  • .parquet or .geoparquet - GeoParquet file with WKB geometry (v1.1.0 metadata infrastructure ready)
  • .duckdb or .db - DuckDB database file

# GeoParquet output (most common)
osmextract city.pbf -o city.parquet

# DuckDB database
osmextract city.pbf -o city.duckdb --table-name features

# DuckDB with R-tree spatial index (10-1000x faster spatial queries)
osmextract city.pbf -o city.duckdb --add-index

# Custom table name with index
osmextract city.pbf -o data.duckdb --table-name buildings \
  --tags-filter '{"building": true}' \
  --add-index

R-tree Spatial Index Benefits:

  • 10-1000x faster spatial queries (ST_Intersects, ST_Contains, etc.)
  • Bounding box queries: Dramatically faster spatial filtering
  • Trade-off: Adds ~5-15% to file size, small increase in processing time
  • Note: Index is created on geom column (GEOMETRY type), while geometry column (WKB) is kept for compatibility
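
The output format is chosen from the file extension, and `--add-index` applies to DuckDB output only. A wrapper script could check both before launching a long extraction. The sketch below uses a hypothetical `plan_output` helper; treating extensions case-insensitively and raising an error when an index is requested for Parquet output are assumptions, not documented osmextract behavior.

```python
from pathlib import Path

# Extension-to-format mapping as listed under "Output formats".
OUTPUT_FORMATS = {
    ".parquet": "geoparquet",
    ".geoparquet": "geoparquet",
    ".duckdb": "duckdb",
    ".db": "duckdb",
}


def plan_output(path, add_index=False):
    """Return the output format implied by the path, validating --add-index."""
    suffix = Path(path).suffix.lower()  # assumption: case-insensitive matching
    if suffix not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    fmt = OUTPUT_FORMATS[suffix]
    if add_index and fmt != "duckdb":
        raise ValueError("--add-index applies to DuckDB output only")
    return fmt


if __name__ == "__main__":
    print(plan_output("city.parquet"))
    print(plan_output("city.duckdb", add_index=True))
```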

Data Quality Validation

Use --validate to check for data quality issues before processing:

# Check for missing node/way references
osmextract region.pbf -o output.parquet --validate --verbose

# Example output:
# === Data Quality Report ===
# Total features processed: 10446
#
# Issues found:
#   ⚠ 86 relations reference ways not present in the dataset
#
# Note: These issues are common with filtered/clipped extracts.
#       Consider using a larger bbox or processing the full region.

What validation checks:

  • Missing node references: Ways that reference nodes not in the dataset
  • Incomplete ways: Ways that resolve to fewer than 2 nodes (invalid geometry)
  • Missing way references: Relations that reference ways not in the dataset

Common causes:

  • Clipped/filtered extracts from Geofabrik or other sources
  • Bbox filters that cut through features
  • Tag filters that exclude needed nodes/ways

Validation runs before processing and doesn't prevent output - it just warns about potential issues.
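
The three checks listed above amount to simple reference lookups. The sketch below reproduces them on in-memory data to show what each one flags; the data layout (a set of node IDs plus dictionaries of way and relation members) and the `validate` function are hypothetical and do not reflect how osmextract stores or queries the data.

```python
def validate(nodes, ways, relations):
    """Report reference problems in a small OSM-like dataset.

    nodes:     set of node IDs present in the dataset
    ways:      {way_id: [node_id, ...]}
    relations: {relation_id: [way_id, ...]}
    """
    report = {"missing_node_refs": [], "incomplete_ways": [], "missing_way_refs": []}
    for way_id, refs in sorted(ways.items()):
        resolved = [n for n in refs if n in nodes]
        if len(resolved) < len(refs):
            report["missing_node_refs"].append(way_id)  # some nodes are absent
        if len(resolved) < 2:
            report["incomplete_ways"].append(way_id)  # cannot form a line
    for rel_id, way_refs in sorted(relations.items()):
        if any(w not in ways for w in way_refs):
            report["missing_way_refs"].append(rel_id)  # some member ways are absent
    return report


if __name__ == "__main__":
    nodes = {1, 2, 3}
    ways = {10: [1, 2, 3], 11: [3, 4], 12: [5, 6]}
    relations = {100: [10, 11], 101: [10, 99]}
    print(validate(nodes, ways, relations))
```

In this example way 11 keeps only one resolvable node and way 12 keeps none, so both are reported as incomplete, and relation 101 is flagged because way 99 is not in the dataset.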

Real-World Examples

Extract Buildings in City Center

osmextract city.pbf \
  --tags-filter '{"building": true}' \
  --geom-filter-bbox "7.415,43.73,7.435,43.75" \
  -o downtown_buildings.parquet \
  --verbose

Extract Complete Transportation Network

# Roads
osmextract region.pbf \
  --tags-filter '{"highway": true}' \
  -o roads.parquet

# Railways
osmextract region.pbf \
  --tags-filter '{"railway": true}' \
  -o railways.parquet

# Public transit stops
osmextract region.pbf \
  --tags-filter '{"highway": "bus_stop", "railway": ["station", "halt"]}' \
  -o transit_stops.parquet

Extract Points of Interest

# Restaurants and cafes with names
osmextract city.pbf \
  --tags-filter '{"amenity": ["restaurant", "cafe", "bar"]}' \
  --custom-sql-filter "list_contains(map_keys(tags), 'name')" \
  -o food_named.parquet

# Healthcare facilities
osmextract city.pbf \
  --tags-filter '{"amenity": ["hospital", "clinic", "pharmacy"]}' \
  -o healthcare.parquet

# Educational institutions
osmextract city.pbf \
  --tags-filter '{"amenity": ["school", "university", "college"]}' \
  -o education.parquet

Batch Processing Multiple Regions

#!/bin/bash
# Process multiple countries

regions=(
  "monaco"
  "andorra"
  "liechtenstein"
)

for region in "${regions[@]}"; do
  echo "Processing $region..."
  osmextract \
    "https://download.geofabrik.de/europe/${region}-latest.osm.pbf" \
    -o "${region}.parquet" \
    --verbose
done

Extract Different Features from One Source

#!/bin/bash
PBF="city.pbf"

# Buildings
osmextract "$PBF" --tags-filter '{"building": true}' -o buildings.parquet &

# Roads
osmextract "$PBF" --tags-filter '{"highway": true}' -o roads.parquet &

# Water
osmextract "$PBF" --tags-filter '{"natural": "water", "waterway": true}' -o water.parquet &

# Wait for all
wait
echo "All extractions complete!"
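
In the shell script above, `wait` without arguments returns success even if one of the background jobs failed, so the final message is printed regardless. When failures matter, the same fan-out can be written so that exit codes are collected. This is a minimal sketch using only the flags shown in this README; `build_argv` and `run_all` are hypothetical names, and the `runner` parameter exists only so the logic can be exercised without the binary installed.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_argv(pbf, tag_filter, output):
    """Build the osmextract command line for one extraction job."""
    return ["osmextract", pbf, "--tags-filter", json.dumps(tag_filter), "-o", output]


def run_all(jobs, runner=subprocess.run):
    """Run all jobs concurrently and return the outputs whose process failed."""

    def run_one(job):
        pbf, tag_filter, output = job
        result = runner(build_argv(pbf, tag_filter, output))
        return output, result.returncode

    with ThreadPoolExecutor(max_workers=len(jobs) or 1) as pool:
        results = list(pool.map(run_one, jobs))
    return [output for output, code in results if code != 0]


if __name__ == "__main__":
    for argv in (
        build_argv("city.pbf", {"building": True}, "buildings.parquet"),
        build_argv("city.pbf", {"highway": True}, "roads.parquet"),
    ):
        print(" ".join(argv))
```

Calling `run_all(jobs)` with a list of `(pbf, tag_filter, output)` tuples starts one process per job and returns an empty list when every extraction exits with status 0.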

Working with Output

Query with DuckDB CLI

# Install spatial extension and query
duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  -- Count features by type
  SELECT
    ST_GeometryType(geometry) as type,
    COUNT(*) as count
  FROM read_parquet('output.parquet')
  GROUP BY type;
"

# Get feature statistics
duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  -- ST_Area is in squared coordinate units (square degrees for lon/lat data)
  SELECT
    COUNT(*) as total_features,
    COUNT(DISTINCT feature_id) as unique_ids,
    MIN(ST_Area(geometry)) as min_area,
    MAX(ST_Area(geometry)) as max_area
  FROM read_parquet('buildings.parquet');
"

# Extract specific tags
duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  SELECT
    feature_id,
    map_extract(tags, 'name') as name,
    map_extract(tags, 'building') as building_type,
    ST_Area(geometry) as area  -- squared coordinate units (square degrees for lon/lat data), not square meters
  FROM read_parquet('buildings.parquet')
  WHERE list_contains(map_keys(tags), 'name')
  LIMIT 10;
"

Convert to GeoJSON

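# Note: FORMAT JSON writes one JSON object per row (newline-delimited), not a GeoJSON FeatureCollection.
# For a FeatureCollection, the spatial extension's GDAL writer (FORMAT GDAL, DRIVER 'GeoJSON') is the usual route.
duckdb -c "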
  INSTALL spatial;
  LOAD spatial;

  COPY (
    SELECT
      feature_id,
      tags,
      ST_AsGeoJSON(geometry) as geometry
    FROM read_parquet('output.parquet')
  ) TO 'output.geojson' (FORMAT JSON);
"

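The `COPY ... (FORMAT JSON)` export above produces one JSON object per row rather than a single GeoJSON document. If a FeatureCollection is needed, the rows can be wrapped afterwards. The sketch below assumes each row has the `feature_id`, `tags`, and `geometry` fields selected in the query, and it accepts the geometry either as a JSON string or as an already nested object, since the exact encoding is not confirmed here; `rows_to_feature_collection` is a hypothetical helper.

```python
import json


def rows_to_feature_collection(lines):
    """Wrap newline-delimited JSON rows into a GeoJSON FeatureCollection dict."""
    features = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        geometry = row.pop("geometry")
        if isinstance(geometry, str):
            geometry = json.loads(geometry)  # geometry stored as a JSON string
        # All remaining columns become feature properties.
        features.append({"type": "Feature", "geometry": geometry, "properties": row})
    return {"type": "FeatureCollection", "features": features}
```

A typical use would be to open the exported file, pass the file object to `rows_to_feature_collection`, and write the result with `json.dump`.
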
Load in QGIS

  1. Open QGIS
  2. Layer → Add Layer → Add Vector Layer
  3. Select your .parquet file
  4. QGIS automatically recognizes GeoParquet format
  5. Layer loads with all attributes and geometry

Load in Python

import duckdb
import geopandas as gpd

# Read with DuckDB
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")

df = conn.execute("""
    SELECT * FROM read_parquet('output.parquet')
""").fetchdf()

# Or use GeoPandas directly
gdf = gpd.read_parquet('output.parquet')
print(gdf.head())

Architecture

Design Philosophy

  • DuckDB-centric: Leverage DuckDB's spatial extension for all processing
  • Minimal dependencies: Only 6 runtime crates, everything bundled
  • Zero-copy: Direct PBF → DuckDB → Parquet pipeline
  • Type-safe: Rust's type system catches many classes of errors at compile time
  • Auto-optimizing: Detects system resources and tunes performance automatically

Data Flow

OSM PBF → DuckDB ST_READOSM → Filtering → Geometry Processing → GeoParquet
   ↓                              ↓              ↓                  ↓
 Input                      Tag/Spatial     Ways→Polygons      Output
                            Filters         Relations→MPs      
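
The "Ways→Polygons" step in the diagram requires deciding which ways represent areas. The sketch below shows one common convention from the wider OSM ecosystem: a way is a polygon candidate only if it is closed, with an explicit `area` tag taking precedence and a few linear feature keys treated as lines. This is a simplified heuristic for illustration; the key list and the `way_geometry_type` function are assumptions, and osmextract's actual polygon detection rules are not described in this README.

```python
# Keys whose closed ways are usually linear features (assumed, non-exhaustive).
LINEAR_KEYS = ("highway", "barrier")


def way_geometry_type(node_refs, tags):
    """Classify a way as 'Polygon' or 'LineString' using a simple heuristic."""
    # A closed ring needs at least 4 references with identical first and last node.
    closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    if not closed:
        return "LineString"
    if tags.get("area") == "no":
        return "LineString"  # explicit opt-out
    if tags.get("area") == "yes":
        return "Polygon"  # explicit opt-in
    if any(key in tags for key in LINEAR_KEYS):
        return "LineString"  # e.g. roundabouts, closed fences
    return "Polygon"


if __name__ == "__main__":
    print(way_geometry_type([1, 2, 3, 1], {"building": "yes"}))
    print(way_geometry_type([1, 2, 3, 1], {"highway": "residential"}))
```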

Technology Stack

| Component | Technology | Version |
| --- | --- | --- |
| Core | Rust | 2021 edition |
| Database | DuckDB | 1.4.2 |
| CLI | clap | 4.5.52 |
| Serialization | serde/serde_json | 1.0.228/1.0.145 |
| Error handling | thiserror | 2.0.17 |
| System detection | num_cpus | 1.16.0 |

Features Support

OpenStreetMap Data Model

| Feature | Status | Notes |
| --- | --- | --- |
| Nodes → Points | | Full support |
| Ways → LineStrings | | Non-closed ways |
| Ways → Polygons | | OSM polygon detection |
| Relations → MultiPolygons | | With hole cutting |
| Tags | | All tags preserved in map |
| Metadata | | OSM IDs preserved |

Filtering

| Filter Type | Status | Performance |
| --- | --- | --- |
| Tag presence | | Fast |
| Tag exact value | | Fast |
| Tag multiple values | | Fast |
| Bounding box | | Very fast |
| WKT geometry | | Medium |
| GeoJSON geometry | | Medium |
| Custom SQL | | Varies |

Output Formats

| Format | Status | Features |
| --- | --- | --- |
| GeoParquet | | v1.1.0 metadata, WKB encoding |
| DuckDB | | Direct database creation |
| Compression | | zstd, snappy, gzip, brotli |

Development

Building from Source

# Prerequisites
# - Rust 1.75+ (https://rustup.rs)
# - ~500 MB disk space for dependencies

# Clone and build
git clone https://github.com/tobilg/osmextract
cd osmextract
cargo build --release

# Binary location
./target/release/osmextract

# Run tests
cargo test --release

Running Tests

# All tests (53 tests)
cargo test

# With output
cargo test -- --nocapture

# Specific test
cargo test test_tag_filter

# Release mode (faster)
cargo test --release

Troubleshooting

Build Issues

Problem: Long first build time (2-4 minutes)

  • Cause: DuckDB C++ compilation from source
  • Solution: Normal, subsequent builds are ~2 seconds

Problem: Out of memory during build

  • Solution: Close other applications, or build without --release first

Runtime Issues

Problem: "Cannot open file" error

  • Check: File path is correct and file exists
  • Check: URL is accessible (try with curl/wget first)

Problem: "Out of memory" error

  • Note: osmextract auto-detects memory and sets conservative limits (50% of RAM)
  • Solution: Manually constrain with --memory-limit "4GB" if needed
  • Solution: Process smaller regions or use bbox filter

Problem: Empty output file

  • Check: Input file contains data in the filtered area/tags
  • Check: Filters are correct (test without filters first)
  • Solution: Use --verbose to see what's happening

Performance Issues

Problem: Slow processing

  • Note: osmextract auto-detects CPU cores and sets optimal thread count
  • Solution: Override with --threads N if needed (e.g., to limit resource usage)
  • Solution: Apply tag filters before geometry filters
  • Solution: Use bbox instead of WKT/GeoJSON filters when possible

Problem: Slow spatial queries on DuckDB output

  • Solution: Use --add-index flag to create R-tree spatial index
  • Note: Index provides 10-1000x speedup for queries using ST_Intersects, ST_Contains, etc.
  • Example: osmextract city.pbf -o city.duckdb --add-index

FAQ

Q: How is this different from osmium or ogr2ogr? A: osmextract uses DuckDB's spatial extension for processing, offering simpler filtering syntax and direct GeoParquet output. It has zero runtime dependencies.

Q: What about the Python QuackOSM library? A: QuackOSM is excellent for Python workflows and for integration into existing data processing libraries.

Q: Does it support OSM XML files? A: Currently only PBF format. PBF is smaller, faster, and the standard distribution format.

Q: Can I contribute? A: Contributions welcome! See GitHub issues for ideas.

Q: What license? A: Apache 2.0 (permissive, commercial-friendly)

License

Apache License 2.0 - see LICENSE file for details.

Credits

  • Built with DuckDB spatial extension
  • Inspired by QuackOSM Python library
