osmextract

High-performance OpenStreetMap data extraction tool powered by DuckDB

Extract and filter OpenStreetMap data from PBF files to GeoParquet or DuckDB format. Built with Rust and DuckDB for maximum performance and minimal dependencies.

Features

Fast: Optimized SQL queries with MATERIALIZED CTEs
Smart: Auto-detects CPU cores and memory - zero configuration needed
Simple: Single 38 MB binary with zero runtime dependencies
Powerful: Full OSM data model support (nodes, ways, relations, multipolygons)
Flexible filtering: Tag filters, geometry filters, and custom SQL
GIS-ready output: GeoParquet v1.1.0 metadata infrastructure (WKB encoding)
Data quality validation: Optional validation to detect missing references and data issues
DuckDB-powered: Leverages DuckDB's spatial extension for processing

Quick Start

Installation

Option 1: Download Pre-built Binary (Recommended)

Download the latest release for your platform from GitHub Releases:

# Linux x86_64
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-linux-amd64
chmod +x osmextract-linux-amd64
sudo mv osmextract-linux-amd64 /usr/local/bin/osmextract

# Linux ARM64
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-linux-arm64
chmod +x osmextract-linux-arm64
sudo mv osmextract-linux-arm64 /usr/local/bin/osmextract

# macOS Apple Silicon
curl -LO https://github.com/tobilg/osmextract/releases/latest/download/osmextract-macos-arm64
chmod +x osmextract-macos-arm64
sudo mv osmextract-macos-arm64 /usr/local/bin/osmextract

# Windows x86_64 (PowerShell)
# Download from: https://github.com/tobilg/osmextract/releases/latest/download/osmextract-windows-amd64.exe

Verify installation:

osmextract --version

Available platforms:

Linux x86_64 (amd64)
Linux ARM64 (aarch64)
macOS Apple Silicon (ARM64)
Windows x86_64 (amd64)

Option 2: Build from Source

# Prerequisites: Rust 1.75+ (https://rustup.rs)

# Clone repository
git clone https://github.com/tobilg/osmextract
cd osmextract

# Build release binary
cargo build --release

# Binary location
./target/release/osmextract

Basic Usage

# Extract all features from a PBF file
osmextract input.osm.pbf -o output.parquet

# Extract from URL
osmextract https://download.geofabrik.de/europe/monaco-latest.osm.pbf -o monaco.parquet

# Extract buildings only
osmextract city.pbf --tags-filter '{"building": true}' -o buildings.parquet

# Extract within bounding box
osmextract city.pbf --geom-filter-bbox "7.41,43.73,7.44,43.75" -o area.parquet

# Combine filters
osmextract city.pbf \
  --tags-filter '{"amenity": ["restaurant", "cafe"]}' \
  --geom-filter-bbox "7.41,43.73,7.44,43.75" \
  -o cafes.parquet

Command Reference

Basic Options

Option	Description	Example
`<INPUT>`	Input PBF file or URL	`city.pbf` or `https://...`
`-o, --output <PATH>`	Output file (.parquet or .duckdb)	`-o output.parquet`
`-v, --verbose`	Show progress information	`-v`

Tag Filtering

Filter features by OSM tags using JSON format.

Option	Description
`--tags-filter <JSON>`	Filter tags (inline JSON)
`--tags-filter-file <PATH>`	Filter tags from file

Tag Filter Formats:

# Key presence: any value
--tags-filter '{"building": true}'

# Exact value
--tags-filter '{"amenity": "restaurant"}'

# Multiple values (OR)
--tags-filter '{"highway": ["primary", "secondary"]}'

# Multiple keys (OR)
--tags-filter '{"building": true, "amenity": true}'

Examples:

# All buildings
osmextract city.pbf --tags-filter '{"building": true}' -o buildings.parquet

# Restaurants and cafes
osmextract city.pbf \
  --tags-filter '{"amenity": ["restaurant", "cafe"]}' \
  -o food.parquet

# Major roads
osmextract region.pbf \
  --tags-filter '{"highway": ["motorway", "trunk", "primary"]}' \
  -o roads.parquet

# From file
cat > filters.json << EOF
{
  "building": true,
  "amenity": ["school", "hospital"],
  "shop": true
}
EOF
osmextract city.pbf --tags-filter-file filters.json -o pois.parquet
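
The filter formats above can be read as a small matching rule: `true` means the key is present, a string means an exact value, a list means any of the listed values, and several keys are combined with OR. The following Python sketch models that rule on a plain tag dictionary. It only illustrates the documented semantics; `matches_tag_filter` is a hypothetical helper, not part of osmextract, and the tool itself evaluates filters inside DuckDB.

```python
import json


def matches_tag_filter(tags, tag_filter):
    """Return True if the tags satisfy at least one filter entry (OR)."""
    for key, expected in tag_filter.items():
        if key not in tags:
            continue
        if expected is True:              # key presence: any value
            return True
        if isinstance(expected, str):     # exact value
            if tags[key] == expected:
                return True
        elif isinstance(expected, list):  # multiple values (OR)
            if tags[key] in expected:
                return True
    return False


# Same content as the filters.json example above.
tag_filter = json.loads(
    '{"building": true, "amenity": ["school", "hospital"], "shop": true}'
)

print(matches_tag_filter({"building": "yes"}, tag_filter))        # True
print(matches_tag_filter({"amenity": "school"}, tag_filter))      # True
print(matches_tag_filter({"amenity": "restaurant"}, tag_filter))  # False
```

Under this reading, a feature tagged only `amenity=restaurant` is dropped because none of the three keys matches it.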

Geometry Filtering

Filter features by spatial location.

Option	Description
`--geom-filter-bbox <BBOX>`	Bounding box: `minx,miny,maxx,maxy`
`--geom-filter-wkt <WKT>`	WKT geometry (POINT, POLYGON, etc.)
`--geom-filter-geojson <JSON>`	GeoJSON geometry

Examples:

# Bounding box (most common)
osmextract city.pbf \
  --geom-filter-bbox "7.41,43.73,7.44,43.75" \
  -o area.parquet

# WKT polygon
osmextract city.pbf \
  --geom-filter-wkt "POLYGON((7.4 43.7, 7.5 43.7, 7.5 43.8, 7.4 43.8, 7.4 43.7))" \
  -o polygon_area.parquet

# GeoJSON
osmextract city.pbf \
  --geom-filter-geojson '{"type":"Point","coordinates":[7.42,43.74]}' \
  -o point_area.parquet
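
A bounding box in `minx,miny,maxx,maxy` order describes the same rectangle as a five-point WKT polygon ring. The sketch below converts one into the other and rejects malformed input, which can help when switching between `--geom-filter-bbox` and `--geom-filter-wkt`. `bbox_to_wkt` is a hypothetical helper; whether osmextract treats both filter forms identically (for example intersects versus contains) is an assumption not confirmed here.

```python
def bbox_to_wkt(bbox):
    """Convert 'minx,miny,maxx,maxy' into a closed WKT POLYGON ring."""
    parts = [float(p) for p in bbox.split(",")]
    if len(parts) != 4:
        raise ValueError("expected four comma-separated numbers")
    minx, miny, maxx, maxy = parts
    if minx >= maxx or miny >= maxy:
        raise ValueError("min values must be smaller than max values")
    corners = [
        (minx, miny),
        (maxx, miny),
        (maxx, maxy),
        (minx, maxy),
        (minx, miny),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON(({ring}))"


print(bbox_to_wkt("7.41,43.73,7.44,43.75"))
# POLYGON((7.41 43.73, 7.44 43.73, 7.44 43.75, 7.41 43.75, 7.41 43.73))
```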

Custom SQL Filtering

Apply advanced filters using DuckDB SQL expressions.

Option	Description
`--custom-sql-filter <SQL>`	SQL WHERE clause condition

Available variables:

tags - Tag map (use map_keys(), map_extract())
geometry - Geometry object (use ST_* functions)
feature_id - OSM ID

Examples:

# Features with names
osmextract city.pbf \
  --custom-sql-filter "list_contains(map_keys(tags), 'name')" \
  -o named.parquet

# Buildings with >5 tags (complex features)
osmextract city.pbf \
  --tags-filter '{"building": true}' \
  --custom-sql-filter "cardinality(tags) > 5" \
  -o complex_buildings.parquet

# Features with address
osmextract city.pbf \
  --custom-sql-filter "list_contains(map_keys(tags), 'addr:street')" \
  -o with_address.parquet

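# Large polygons by area (ST_Area is in the geometry's coordinate units;
# 10000 means square meters only if the geometry is in a metric CRS)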
osmextract city.pbf \
  --tags-filter '{"building": true}' \
  --custom-sql-filter "ST_Area(geometry) > 10000" \
  -o large_buildings.parquet
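
Since `--custom-sql-filter` takes a single WHERE clause condition, several conditions can be joined with `AND` before being passed to the tool. The sketch below does this in Python and builds the argument list without shell quoting, which avoids juggling the single quotes inside the SQL. `combine_conditions` and the output file name are hypothetical; only the flags shown in this section are used.

```python
import shlex


def combine_conditions(*conditions):
    """Join SQL conditions with AND, parenthesizing each one."""
    return " AND ".join(f"({c})" for c in conditions)


sql = combine_conditions(
    "list_contains(map_keys(tags), 'name')",
    "cardinality(tags) > 5",
)

# Passing a list to subprocess.run(cmd, check=True) would execute it;
# here the command is only printed (dry run).
cmd = ["osmextract", "city.pbf", "--custom-sql-filter", sql,
       "-o", "named_complex.parquet"]
print(shlex.join(cmd))
```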

Performance Tuning

osmextract automatically detects and optimizes for your system, but you can override settings:

Option	Description	Default
`--threads <N>`	Number of processing threads	Auto-detected CPU cores
`--memory-limit <SIZE>`	Memory limit (e.g., "8GB")	Auto-detected (50% of RAM)
`--max-temp-directory-size <SIZE>`	Max temp disk usage	Auto-detected (10x RAM, max 100GB)
`--temp-directory <PATH>`	Temp directory location	System default
`--checkpoint-threshold <SIZE>`	DuckDB checkpoint size	`256MB`
`--compression <TYPE>`	Parquet compression	`zstd`
`--row-group-size <N>`	Parquet row group size	`100000`

Compression types: zstd (default), snappy, gzip, brotli, uncompressed

Examples:

# Auto-detected settings (recommended for most cases)
osmextract large.pbf -o output.parquet --verbose
# Output shows: "Threads: 8 (auto-detected)", "Memory limit: 8GB (auto-detected)"

# Override specific settings
osmextract large.pbf -o output.parquet \
  --threads 16 \
  --memory-limit "16GB"

# Faster compression (larger files)
osmextract large.pbf -o output.parquet --compression snappy

# Better compression (slower)
osmextract large.pbf -o output.parquet --compression brotli

# Constrained memory
osmextract large.pbf -o output.parquet --memory-limit "4GB"

# Large file processing with custom temp directory
osmextract huge.pbf -o output.parquet \
  --temp-directory "/nvme/tmp" \
  --max-temp-directory-size "200GB"
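
The defaults in the table above follow simple rules: one thread per CPU core, half of RAM as the memory limit, and ten times RAM (capped at 100 GB) for temp storage. The sketch below applies those rules to a given machine so you can estimate what auto-detection would choose. `default_limits` is a hypothetical helper, and RAM is passed in as a parameter because the Python standard library has no portable way to query it; the real detection logic in osmextract may differ in detail.

```python
import os


def default_limits(total_ram_gb, cpu_cores):
    """Apply the documented default rules to a machine's resources."""
    return {
        "threads": cpu_cores,                        # one thread per core
        "memory_limit_gb": total_ram_gb * 0.5,       # 50% of RAM
        "max_temp_gb": min(total_ram_gb * 10, 100),  # 10x RAM, max 100 GB
    }


print(default_limits(total_ram_gb=16, cpu_cores=8))
print(default_limits(total_ram_gb=4, cpu_cores=os.cpu_count() or 1))
```

For a 16 GB, 8-core machine this gives 8 threads and an 8 GB memory limit, matching the verbose output shown above.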

Output Options

Option	Description	Default
`-o <PATH>`	Output file path	Required
`--table-name <NAME>`	DuckDB table name	`osm_features`
`--add-index`	Create R-tree spatial index (DuckDB only)	`false`

Output formats:

.parquet or .geoparquet - GeoParquet file with WKB geometry (v1.1.0 metadata infrastructure ready)
.duckdb or .db - DuckDB database file

# GeoParquet output (most common)
osmextract city.pbf -o city.parquet

# DuckDB database
osmextract city.pbf -o city.duckdb --table-name features

# DuckDB with R-tree spatial index (10-1000x faster spatial queries)
osmextract city.pbf -o city.duckdb --add-index

# Custom table name with index
osmextract city.pbf -o data.duckdb --table-name buildings \
  --tags-filter '{"building": true}' \
  --add-index

R-tree Spatial Index Benefits:

10-1000x faster spatial queries (ST_Intersects, ST_Contains, etc.)
Bounding box queries: Dramatically faster spatial filtering
Trade-off: Adds ~5-15% to file size, small increase in processing time
Note: The index is created on the geom column (GEOMETRY type); the geometry column (WKB) is kept for compatibility

Data Quality Validation

Use --validate to check for data quality issues before processing:

# Check for missing node/way references
osmextract region.pbf -o output.parquet --validate --verbose

# Example output:
# === Data Quality Report ===
# Total features processed: 10446
#
# Issues found:
#   ⚠ 86 relations reference ways not present in the dataset
#
# Note: These issues are common with filtered/clipped extracts.
#       Consider using a larger bbox or processing the full region.

What validation checks:

Missing node references: Ways that reference nodes not in the dataset
Incomplete ways: Ways that resolve to fewer than 2 nodes (invalid geometry)
Missing way references: Relations that reference ways not in the dataset

Common causes:

Clipped/filtered extracts from Geofabrik or other sources
Bbox filters that cut through features
Tag filters that exclude needed nodes/ways

Validation runs before processing and doesn't prevent output - it just warns about potential issues.
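
The three checks can be illustrated on a toy dataset. The Python sketch below counts ways with missing node references, ways that resolve to fewer than two nodes, and relations that reference missing ways. The data structures and the `validate` function are assumptions made for illustration; osmextract performs its validation inside DuckDB and its report format is the one shown above.

```python
def validate(nodes, ways, relations):
    """Count the three kinds of reference problems described above.

    nodes: set of node IDs
    ways: {way_id: [node_id, ...]}
    relations: {relation_id: [way_id, ...]}
    """
    report = {
        "ways_missing_nodes": 0,
        "incomplete_ways": 0,
        "relations_missing_ways": 0,
    }
    for refs in ways.values():
        resolved = [n for n in refs if n in nodes]
        if len(resolved) < len(refs):
            report["ways_missing_nodes"] += 1
        if len(resolved) < 2:  # cannot form a valid line
            report["incomplete_ways"] += 1
    for members in relations.values():
        if any(w not in ways for w in members):
            report["relations_missing_ways"] += 1
    return report


# Toy extract: nodes 4, 5, 6 and way 99 were clipped away.
nodes = {1, 2, 3}
ways = {10: [1, 2, 3], 11: [3, 4], 12: [5, 6]}
relations = {100: [10, 11], 101: [10, 99]}
print(validate(nodes, ways, relations))
```

Here ways 11 and 12 are flagged by both way checks, and relation 101 is flagged because way 99 is absent.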

Real-World Examples

Extract Buildings in City Center

osmextract city.pbf \
  --tags-filter '{"building": true}' \
  --geom-filter-bbox "7.415,43.73,7.435,43.75" \
  -o downtown_buildings.parquet \
  --verbose

Extract Complete Transportation Network

# Roads
osmextract region.pbf \
  --tags-filter '{"highway": true}' \
  -o roads.parquet

# Railways
osmextract region.pbf \
  --tags-filter '{"railway": true}' \
  -o railways.parquet

# Public transit stops
osmextract region.pbf \
  --tags-filter '{"highway": "bus_stop", "railway": ["station", "halt"]}' \
  -o transit_stops.parquet

Extract Points of Interest

# Restaurants and cafes with names
osmextract city.pbf \
  --tags-filter '{"amenity": ["restaurant", "cafe", "bar"]}' \
  --custom-sql-filter "list_contains(map_keys(tags), 'name')" \
  -o food_named.parquet

# Healthcare facilities
osmextract city.pbf \
  --tags-filter '{"amenity": ["hospital", "clinic", "pharmacy"]}' \
  -o healthcare.parquet

# Educational institutions
osmextract city.pbf \
  --tags-filter '{"amenity": ["school", "university", "college"]}' \
  -o education.parquet

Batch Processing Multiple Regions

#!/bin/bash
# Process multiple countries

regions=(
  "monaco"
  "andorra"
  "liechtenstein"
)

for region in "${regions[@]}"; do
  echo "Processing $region..."
  osmextract \
    "https://download.geofabrik.de/europe/${region}-latest.osm.pbf" \
    -o "${region}.parquet" \
    --verbose
done
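
The same batch can be driven from Python, which is convenient when the tag filter is built programmatically. The sketch below assembles the argument list from the options documented in the Command Reference and prints each command as a dry run. `build_command` is a hypothetical helper, not an osmextract API.

```python
import json
import shlex


def build_command(source, output, tags_filter=None, bbox=None, verbose=False):
    """Build an osmextract argument list from documented options."""
    cmd = ["osmextract", source, "-o", output]
    if tags_filter is not None:
        # json.dumps turns Python True into JSON true
        cmd += ["--tags-filter", json.dumps(tags_filter)]
    if bbox is not None:
        cmd += ["--geom-filter-bbox", ",".join(str(v) for v in bbox)]
    if verbose:
        cmd.append("--verbose")
    return cmd


regions = ["monaco", "andorra", "liechtenstein"]
for region in regions:
    url = f"https://download.geofabrik.de/europe/{region}-latest.osm.pbf"
    cmd = build_command(url, f"{region}.parquet", verbose=True)
    # Dry run: print the command. To execute it, use
    # subprocess.run(cmd, check=True) instead.
    print(shlex.join(cmd))
```

Passing the arguments as a list means the JSON filter needs no shell escaping.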

Extract Different Features from One Source

#!/bin/bash
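PBF="city.pbf"

# Note: each background job auto-detects threads and memory on its own;
# consider --threads and --memory-limit to avoid oversubscribing the machine.

# Buildings
osmextract "$PBF" --tags-filter '{"building": true}' -o buildings.parquet &

# Roads
osmextract "$PBF" --tags-filter '{"highway": true}' -o roads.parquet &

# Water
osmextract "$PBF" --tags-filter '{"natural": "water", "waterway": true}' -o water.parquet &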

# Wait for all
wait
echo "All extractions complete!"

Working with Output

Query with DuckDB CLI

# Install spatial extension and query
duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  -- Count features by type
  SELECT
    ST_GeometryType(geometry) as type,
    COUNT(*) as count
  FROM read_parquet('output.parquet')
  GROUP BY type;
"

# Get feature statistics
duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  SELECT
    COUNT(*) as total_features,
    COUNT(DISTINCT feature_id) as unique_ids,
    MIN(ST_Area(geometry)) as min_area,
    MAX(ST_Area(geometry)) as max_area
  FROM read_parquet('buildings.parquet');
"

# Extract specific tags
duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  SELECT
    feature_id,
    map_extract(tags, 'name') as name,
    map_extract(tags, 'building') as building_type,
    ST_Area(geometry) as area_sqm  -- coordinate units; square meters only for a metric CRS
  FROM read_parquet('buildings.parquet')
  WHERE list_contains(map_keys(tags), 'name')
  LIMIT 10;
"

Convert to GeoJSON

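duckdb -c "
  INSTALL spatial;
  LOAD spatial;

  -- Note: FORMAT JSON writes one JSON object per row (newline-delimited),
  -- not a single GeoJSON FeatureCollection
  COPY (
    SELECT
      feature_id,
      tags,
      ST_AsGeoJSON(geometry) as geometry
    FROM read_parquet('output.parquet')
  ) TO 'output.geojson' (FORMAT JSON);
"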

Load in QGIS

Open QGIS
Layer → Add Layer → Add Vector Layer
Select your .parquet file
QGIS recognizes the GeoParquet format automatically (this requires a QGIS build whose GDAL includes the Parquet driver)
Layer loads with all attributes and geometry

Load in Python

import duckdb
import geopandas as gpd

# Read with DuckDB
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")

df = conn.execute("""
    SELECT * FROM read_parquet('output.parquet')
""").fetchdf()

# Or use GeoPandas directly
gdf = gpd.read_parquet('output.parquet')
print(gdf.head())

Architecture

Design Philosophy

DuckDB-centric: Leverage DuckDB's spatial extension for all processing
Minimal dependencies: Only 6 direct crate dependencies, all compiled into the binary
Zero-copy: Direct PBF → DuckDB → Parquet pipeline
Type-safe: Rust's type system rules out whole classes of runtime errors
Auto-optimizing: Detects system resources and tunes performance automatically

Data Flow

OSM PBF → DuckDB ST_READOSM → Filtering → Geometry Processing → GeoParquet
   ↓                              ↓              ↓                  ↓
 Input                      Tag/Spatial     Ways→Polygons      Output
                            Filters         Relations→MPs

Technology Stack

Component	Technology	Version
Core	Rust	2021 edition
Database	DuckDB	1.4.2
CLI	clap	4.5.52
Serialization	serde/serde_json	1.0.228/1.0.145
Error handling	thiserror	2.0.17
System detection	num_cpus	1.16.0

Features Support

OpenStreetMap Data Model

Feature	Status	Notes
Nodes → Points	✅	Full support
Ways → LineStrings	✅	Non-closed ways
Ways → Polygons	✅	OSM polygon detection
Relations → MultiPolygons	✅	With hole cutting
Tags	✅	All tags preserved in map
Metadata	✅	OSM IDs preserved
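
To make the LineString versus Polygon distinction in the table concrete, here is a deliberately simplified Python sketch. It treats a way as a polygon when its node list forms a closed ring and it is not tagged `area=no`. This is an assumption based on common OSM conventions, not osmextract's actual detection rule, which may also consider the kind of tags present (closed highways such as roundabouts are usually lines, for example). `way_geometry_type` is a hypothetical name.

```python
def way_geometry_type(node_refs, tags):
    """Classify a way as 'Polygon' or 'LineString' (simplified heuristic)."""
    # A ring needs at least 4 refs: three distinct nodes plus the repeated first.
    is_closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    if is_closed and tags.get("area") != "no":
        return "Polygon"
    return "LineString"


print(way_geometry_type([1, 2, 3, 1], {"building": "yes"}))   # Polygon
print(way_geometry_type([1, 2, 3], {"highway": "primary"}))   # LineString
print(way_geometry_type([1, 2, 3, 1], {"area": "no"}))        # LineString
```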

Filtering

Filter Type	Status	Performance
Tag presence	✅	Fast
Tag exact value	✅	Fast
Tag multiple values	✅	Fast
Bounding box	✅	Very fast
WKT geometry	✅	Medium
GeoJSON geometry	✅	Medium
Custom SQL	✅	Varies

Output Formats

Format	Status	Features
GeoParquet	✅	v1.1.0 metadata, WKB encoding
DuckDB	✅	Direct database creation
Compression	✅	zstd, snappy, gzip, brotli

Development

Building from Source

# Prerequisites
# - Rust 1.75+ (https://rustup.rs)
# - ~500 MB disk space for dependencies

# Clone and build
git clone https://github.com/tobilg/osmextract
cd osmextract
cargo build --release

# Binary location
./target/release/osmextract

# Run tests
cargo test --release

Running Tests

# All tests (53 tests)
cargo test

# With output
cargo test -- --nocapture

# Specific test
cargo test test_tag_filter

# Release mode (faster)
cargo test --release

Troubleshooting

Build Issues

Problem: Long first build time (2-4 minutes)

Cause: DuckDB C++ compilation from source
Solution: This is normal; subsequent builds take ~2 seconds

Problem: Out of memory during build

Solution: Close other applications, or build without --release first

Runtime Issues

Problem: "Cannot open file" error

Check: File path is correct and file exists
Check: URL is accessible (try with curl/wget first)

Problem: "Out of memory" error

Note: osmextract auto-detects memory and sets conservative limits (50% of RAM)
Solution: Manually constrain with --memory-limit "4GB" if needed
Solution: Process smaller regions or use bbox filter

Problem: Empty output file

Check: Input file contains data in the filtered area/tags
Check: Filters are correct (test without filters first)
Solution: Use --verbose to see what's happening

Performance Issues

Problem: Slow processing

Note: osmextract auto-detects CPU cores and sets optimal thread count
Solution: Override with --threads N if needed (e.g., to limit resource usage)
Solution: Apply tag filters before geometry filters
Solution: Use bbox instead of WKT/GeoJSON filters when possible

Problem: Slow spatial queries on DuckDB output

Solution: Use --add-index flag to create R-tree spatial index
Note: Index provides 10-1000x speedup for queries using ST_Intersects, ST_Contains, etc.
Example: osmextract city.pbf -o city.duckdb --add-index

FAQ

Q: How is this different from osmium or ogr2ogr? A: osmextract uses DuckDB's spatial extension for processing, offering simpler filtering syntax and direct GeoParquet output. It has zero runtime dependencies.

Q: What about the Python QuackOSM library? A: QuackOSM is excellent for Python workflows and for integration into existing data processing libraries. osmextract targets the command line as a single standalone binary.

Q: Does it support OSM XML files? A: No, currently only the PBF format is supported. PBF is smaller, faster, and the standard distribution format.

Q: Can I contribute? A: Contributions welcome! See GitHub issues for ideas.

Q: What license? A: Apache 2.0 (permissive, commercial-friendly)

License

Apache License 2.0 - see LICENSE file for details.

Credits

Built with DuckDB spatial extension
Inspired by QuackOSM Python library
