paddymul/near-bike

closest-bike

Find the closest bike share station to you, anywhere in the world.

Scrapes GBFS (General Bikeshare Feed Specification) feeds from 1,200+ systems across 50+ countries.

Setup

pnpm install

Scripts

1. Fetch the systems catalog

Downloads the master list of all GBFS systems from MobilityData/gbfs/systems.csv and writes it to data/systems.json.

pnpm fetch-systems

Output includes a summary:

Fetching GBFS systems catalog…
  Found 1245 systems
  Top 10 countries:
    DE: 217
    US: 172
    FR: 139
    ...
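
The summary boils down to parsing the CSV and tallying rows by country code. A minimal sketch of that step, assuming the catalog has a header row with a "Country Code" column and may contain quoted fields with embedded commas. The helper names (splitCsvLine, parseCsv, countByCountry) and the sample rows are hypothetical; the repo's fetch-systems-catalog.ts may work differently.

```typescript
// Hypothetical helpers for illustration; not taken from the repo.

/** Split one CSV line into fields, honoring double-quoted fields and "" escapes. */
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"';
        i++; // skip the second quote of an escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

/** Parse CSV text into one record per row, keyed by the header row.
 *  Quoted fields that span several lines are not handled in this sketch. */
function parseCsv(text: string): Record<string, string>[] {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return [];
  const header = splitCsvLine(lines[0]);
  return lines.slice(1).map((line) => {
    const cells = splitCsvLine(line);
    const row: Record<string, string> = {};
    header.forEach((name, i) => (row[name] = cells[i] ?? ""));
    return row;
  });
}

/** Tally systems per country, sorted by descending count. */
function countByCountry(rows: Record<string, string>[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const cc = row["Country Code"] ?? "";
    counts.set(cc, (counts.get(cc) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample rows; the URLs and IDs are placeholders.
const sampleCsv = [
  "Country Code,Name,Location,System ID,URL,Auto-Discovery URL",
  'US,"Citi Bike","New York, NY",citi_bike,https://example.invalid,https://example.invalid/gbfs.json',
  "FR,Velib,Paris,velib,https://example.invalid,https://example.invalid/gbfs.json",
  "US,Divvy,Chicago,divvy,https://example.invalid,https://example.invalid/gbfs.json",
].join("\n");

console.log(countByCountry(parseCsv(sampleCsv)));
```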

2. Fetch station data for a system

Reads data/systems.json, resolves a system's auto-discovery endpoint, and fetches its station/vehicle data. This includes static info (locations, capacity) and a snapshot of current availability.

# Search for systems
npx tsx src/scripts/fetch-station.ts --list "paris"
npx tsx src/scripts/fetch-station.ts --list "us"

# Fetch by index (from --list output)
npx tsx src/scripts/fetch-station.ts 753

# Fetch by system_id
npx tsx src/scripts/fetch-station.ts dublin

# Fetch by name (substring match)
npx tsx src/scripts/fetch-station.ts "Citi Bike"

Output is written to data/stations/<system_id>.json and includes:

  • station_information — station locations (lat/lon), names, capacity
  • station_status — real-time availability per station
  • free_bike_status — dockless vehicle locations (lat/lon)

3. Refresh availability for a system

Re-fetches only the real-time availability feeds (station_status + free_bike_status/vehicle_status) using the cached discovery URLs from a previous fetch-station run. Much faster than a full fetch since it skips discovery and station_information.

# List systems that have been fetched
npx tsx src/scripts/fetch-availability.ts --list
npx tsx src/scripts/fetch-availability.ts --list "dub"

# Refresh by system_id
npx tsx src/scripts/fetch-availability.ts dublin

# Refresh by name
npx tsx src/scripts/fetch-availability.ts "Citi Bike"

# Refresh by index
npx tsx src/scripts/fetch-availability.ts 0

Output is written to data/availability/<system_id>.json.

Typical workflow

pnpm fetch-systems                                       # once: get all 1,200+ systems
npx tsx src/scripts/fetch-station.ts "Citi Bike"         # once: get stations + first snapshot
npx tsx src/scripts/fetch-availability.ts "Citi Bike"    # repeat: refresh availability

4. Build geo-index tiles

Reads all fetched station data from data/stations/*.json, extracts every geo point (stations + free-floating vehicles), and builds a two-tier spatial index of binary kdbush tiles.

pnpm build-tiles

# Custom target size (default 5000 points per box)
pnpm build-tiles -- --target-size 3000

Output:

  • data/tiles/box-index.bin — routing index (~5 KB, KDBush of ~120 box centers)
  • data/tiles/box-NNN.bin — one data tile per box (~340 KB each, ~5000 points)

The tiling groups points into bounding boxes:

  • Points are grouped by GBFS system first, keeping systems together in boxes
  • Large systems (>6500 points) are recursively split along the median latitude or longitude
  • Small nearby systems are greedily merged into the same box (within 500 km)
  • Bounding boxes are expanded by 10% for overlap, with closest-center tiebreaking
  • Result: ~120 boxes covering 598K+ stations and vehicles worldwide

Binary tile format

Both the box-index and data tiles use the same self-contained binary format:

┌──────────────────────────────────────────────────────────┐
│ HEADER (12 bytes)                                        │
│   magic: 0x4742 ("GB")                    [2 bytes]      │
│   version: 1                              [2 bytes]      │
│   point_count: N                          [4 bytes]      │
│   metadata_offset: M                      [4 bytes]      │
├──────────────────────────────────────────────────────────┤
│ KDBUSH INDEX (variable)                                  │
│   Raw kdbush ArrayBuffer — zero-copy restore             │
│   with KDBush.from()                                     │
├──────────────────────────────────────────────────────────┤
│ METADATA (starts at byte M)                              │
│   JSON-encoded metadata (UTF-8)                          │
│                                                          │
│   Box-index tiles:                                       │
│     BoxIndexMeta[] — { box, bbox, n }                    │
│                                                          │
│   Data tiles (compact string-table format):              │
│     { systems: ["velib", ...],                           │
│       types: ["station", "vehicle"],                     │
│       points: [{ i, s, t, name, cap? }, ...] }           │
│     s/t are indices into the systems/types tables         │
│     i is a sequential integer (point identity)           │
└──────────────────────────────────────────────────────────┘
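
To make the layout concrete, here is a sketch of writing and reading a tile with a DataView. The field widths and order come from the diagram above. The byte order (little-endian here, which stores the magic as bytes 0x42 0x47) and the names encodeTile and decodeTile are assumptions, and the index section is treated as opaque bytes rather than a real kdbush buffer.

```typescript
// Sketch only: byte order and helper names are assumptions, not the repo's API.
const MAGIC = 0x4742; // "GB"
const HEADER_BYTES = 12; // 2 (magic) + 2 (version) + 4 (point_count) + 4 (metadata_offset)
const LITTLE_ENDIAN = true; // assumed; the source does not state the byte order

interface TileHeader {
  version: number;
  pointCount: number;
  metadataOffset: number;
}

/** Concatenate header + raw index bytes + UTF-8 JSON metadata into one buffer. */
function encodeTile(indexBytes: Uint8Array, pointCount: number, metadata: unknown): Uint8Array {
  const metaBytes = new TextEncoder().encode(JSON.stringify(metadata));
  const metadataOffset = HEADER_BYTES + indexBytes.byteLength;
  const out = new Uint8Array(metadataOffset + metaBytes.byteLength);
  const view = new DataView(out.buffer);
  view.setUint16(0, MAGIC, LITTLE_ENDIAN);
  view.setUint16(2, 1, LITTLE_ENDIAN); // version
  view.setUint32(4, pointCount, LITTLE_ENDIAN);
  view.setUint32(8, metadataOffset, LITTLE_ENDIAN);
  out.set(indexBytes, HEADER_BYTES);
  out.set(metaBytes, metadataOffset);
  return out;
}

/** Validate the magic, read the header, and slice out the index and metadata. */
function decodeTile(tile: Uint8Array): {
  header: TileHeader;
  indexBytes: Uint8Array;
  metadata: unknown;
} {
  const view = new DataView(tile.buffer, tile.byteOffset, tile.byteLength);
  if (view.getUint16(0, LITTLE_ENDIAN) !== MAGIC) throw new Error("bad magic");
  const header: TileHeader = {
    version: view.getUint16(2, LITTLE_ENDIAN),
    pointCount: view.getUint32(4, LITTLE_ENDIAN),
    metadataOffset: view.getUint32(8, LITTLE_ENDIAN),
  };
  // subarray() creates views over the same memory, so nothing is copied here.
  const indexBytes = tile.subarray(HEADER_BYTES, header.metadataOffset);
  const metaText = new TextDecoder().decode(tile.subarray(header.metadataOffset));
  return { header, indexBytes, metadata: JSON.parse(metaText) };
}

// Round trip with placeholder index bytes.
const demoTile = encodeTile(new Uint8Array([1, 2, 3, 4]), 2, { systems: ["velib"] });
console.log(decodeTile(demoTile).header);
```

Because metadata_offset is stored in the header, a reader can find the JSON section without knowing anything about the index's internal size.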

5. Query nearest bikes

Given a latitude and longitude, finds the closest bike stations and vehicles using the two-tier KDBush index and geokdbush (haversine-aware kNN).

# Find 5 nearest (default)
pnpm query-nearest -- 48.8566 2.3522

# Find 10 nearest
pnpm query-nearest -- 48.8566 2.3522 --k 10

Output:

Querying nearest to (48.8566, 2.3522) with k=5…
Loaded box-index: 120 boxes

    #  Distance    System                Name                            Type      Cap
  ──────────────────────────────────────────────────────────────────────────────────────────
    1  80 m        velib-paris           Rue de Rivoli - Châtelet        station   34
    2  120 m       dott-paris            (dockless: #47)                 vehicle   —
    3  150 m       velib-paris           Place du Châtelet               station   28
    4  190 m       lime-paris            (dockless: #102)                vehicle   —
    5  230 m       velib-paris           Quai de la Mégisserie           station   42

The query loads exactly one data tile per request:

  1. Loads the box-index (~5 KB, cached) and finds the 5 nearest box centers
  2. Picks the box whose bounding box contains the query point (closest-center tiebreaking)
  3. Loads that single box tile and runs kNN within it

Full workflow

pnpm fetch-systems                                       # once: get all 1,200+ systems
npx tsx src/scripts/fetch-station.ts 0                   # fetch systems one by one (or batch)
pnpm build-tiles                                         # build geo-index from all fetched data
pnpm query-nearest -- 48.8566 2.3522                     # find nearest bikes to central Paris

Ingest pipeline

The ingest pipeline automates fetching, scheduling, and compacting GBFS data across all 1,200+ systems. It replaces manual per-system fetching with a continuous batch process that tracks changes and stores time-series snapshots.

Data flow:

pnpm ingest (batch fetcher)
  ├─ data/stations/{id}.json                              latest cached station data
  ├─ data/snapshots/YYYY-MM-DD/HH-MM/{id}.{type}.json    time-series snapshots
  └─ data/ingest.db                                       fetch logs + scheduling state

pnpm compact
  └─ data/parquet/availability/{date}.parquet              compressed history

pnpm rebuild-index
  └─ data/tiles/box-index.bin + box-NNN.bin               spatial index

Each system is polled on a schedule with three interval tiers — base (300s), rush (120s), quiet (900s). Feeds are hashed with SHA-256 and only written when content changes. Failing systems use exponential backoff.
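
A rough sketch of the change detection and backoff described above, using Node's built-in crypto module. The interval values come from this README. The doubling factor, the one-hour cap, and all names are assumptions. The sketch hashes the raw response body; the source does not say whether volatile fields such as last_updated are excluded before hashing.

```typescript
import { createHash } from "node:crypto";

// Hypothetical names; the repo's content-hash.ts and ingest-scheduler.ts may differ.

/** Hex-encoded SHA-256 of a feed body. */
function sha256Hex(body: string): string {
  return createHash("sha256").update(body).digest("hex");
}

interface FeedState {
  lastHash?: string;
  consecutiveFailures: number;
}

/** After a successful fetch: report whether content changed and reset failures. */
function recordSuccess(state: FeedState, body: string): { changed: boolean; state: FeedState } {
  const hash = sha256Hex(body);
  return {
    changed: hash !== state.lastHash, // only write a snapshot when this is true
    state: { lastHash: hash, consecutiveFailures: 0 },
  };
}

/** After a failed fetch: keep the old hash and count the failure. */
function recordFailure(state: FeedState): FeedState {
  return { ...state, consecutiveFailures: state.consecutiveFailures + 1 };
}

// Interval tiers in seconds, as listed in this README.
const INTERVALS = { base: 300, rush: 120, quiet: 900 } as const;
type Tier = keyof typeof INTERVALS;

/** Seconds until the next poll. Doubling per failure and the cap are assumptions. */
function nextDelaySeconds(tier: Tier, consecutiveFailures: number, maxDelay = 3600): number {
  const interval = INTERVALS[tier];
  if (consecutiveFailures === 0) return interval;
  return Math.min(interval * 2 ** consecutiveFailures, maxDelay);
}

let demoState: FeedState = { consecutiveFailures: 0 };
const first = recordSuccess(demoState, '{"bikes":3}');
const second = recordSuccess(first.state, '{"bikes":3}');
console.log(first.changed, second.changed);
demoState = recordFailure(recordFailure(second.state));
console.log(nextDelaySeconds("base", demoState.consecutiveFailures));
```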

See operations.md for how to run the ingest pipeline locally and monitor it.

Batch scripts

  • fetch-all-stations.bash — fetches station data for all ~1,200 systems in parallel (10 concurrent jobs via xargs)
  • fetch-all-availability.bash — sequentially refreshes availability for all fetched systems

Analysis scripts

  • scripts/analyze-station-density.py — reports station/vehicle counts per system, bounding boxes, and geohash cell distribution
  • scripts/identify-hotspots.py — maps the top 40 densest geohash-4 cells to their contributing systems

Cloudflare Worker

The entire system — ingest, tile building, and serving — runs on Cloudflare Workers. No external VPS needed.

  • operations.md — running locally, monitoring, troubleshooting
  • deploy.md — deployment instructions, KV layout, architecture

Routes

Route                                  Description
GET /                                  Landing page with geolocation — redirects to /nearby
GET /nearby?lat=&lon=                  Server-rendered HTML page showing nearest bikes
GET /nearest?lat=&lon=&k=              JSON API: nearest bikes (default k=5)
GET /systems                           Return cached systems catalog from KV
GET /systems/refresh                   Re-fetch systems.csv from GitHub, store in KV
GET /systems/status?format=html        Systems directory — every system with stats
GET /station/:system_id                Fetch full station data (cached 24h in KV)
GET /availability/:system_id           Refresh just availability using cached discovery
GET /ingest/init                       Seed scheduling state for all systems in KV
GET /ingest/status?format=html&n=30    Ingest dashboard — JSON (default) or HTML, n=rows
GET /ingest/run?sync=true&limit=N      Fetch N due systems inline (local dev)
GET /planner/run?sync=true             Run planner + build tiles inline (local dev)
GET /api                               List available API routes

Tests

pnpm test            # single run
pnpm test:watch      # watch mode

Tests use fixture files in src/test/fixtures/ with mock fetch — no network calls.

Project structure

src/
  types/gbfs.ts                          # TypeScript types for GBFS feeds + geo-index
  lib/
    gbfs-fetch.ts                        # Shared fetch helpers (fetchJson, findFeedUrl)
    fetch-systems-catalog.ts             # Pure fn: systems.csv → SystemsCatalog
    fetch-station-data.ts                # Pure fn: system → station/vehicle data
    fetch-availability.ts                # Pure fn: cached discovery → fresh availability
    box-assign.ts                        # Pure fn: assign points to bounding boxes
    box-assign-meta.ts                   # Pure fn: metadata-based box assignment (for Workers)
    geo-tile.ts                          # Pure fn: tile building + binary serialization
    geo-query.ts                         # Pure fn: two-tier kNN query with box routing
    content-hash.ts                      # Pure fn: content hashing for change detection
    kv-scheduling.ts                     # KV-based scheduling state (for Workers)
    ingest-db.ts                         # SQLite metadata DB for CLI ingest pipeline
    ingest-scheduler.ts                  # Batch ingest scheduling logic
    compact-parquet.ts                   # Parquet compaction for historical data
  scripts/
    fetch-systems.ts                     # CLI: fetches catalog → data/systems.json
    fetch-station.ts                     # CLI: fetches station data → data/stations/
    fetch-availability.ts                # CLI: refreshes availability → data/availability/
    build-tiles.ts                       # CLI: builds geo-index tiles → data/tiles/
    query-nearest.ts                     # CLI: queries nearest bikes from tiles
    ingest.ts                            # CLI: batch ingest with scheduling
    ingest-init.ts                       # CLI: initialize ingest database
    ingest-status.ts                     # CLI: show ingest status
    compact.ts                           # CLI: compact snapshots to parquet
    rebuild-index.ts                     # CLI: rebuild index with change detection
  test/
    fixtures/                            # Small example JSON/CSV for tests
    fetch-systems-catalog.test.ts
    fetch-station-data.test.ts
    fetch-availability.test.ts
    box-assign.test.ts
    box-assign-meta.test.ts
    geo-tile.test.ts
    geo-query.test.ts
    content-hash.test.ts
    kv-scheduling.test.ts
    ingest-db.test.ts
    ingest-scheduler.test.ts
    compact-parquet.test.ts
worker/
  index.ts                               # Cloudflare Worker (routes + cron + queues)
  ingest.ts                              # Queue consumer: per-system GBFS fetches
  planner.ts                             # Box assignment planner (metadata-based)
  tile-builder.ts                        # Queue consumer: per-box tile building
  pages.ts                               # Server-rendered HTML pages
  index.test.ts                          # Worker tests with mock KV
  tsconfig.json                          # Worker-specific TS config
wrangler.toml                            # Wrangler configuration (KV + Queues + crons)
deploy.md                                # Full deployment guide
data/                                    # Fetched data (gitignored)
  stations/                              #   Per-system station data JSON files
  availability/                          #   Per-system availability JSON files
  tiles/                                 #   Geo-index tiles (box-index.bin + box-NNN.bin)

Architecture

The library functions in src/lib/ are pure — they take an optional fetch parameter and do no file I/O. This makes them callable from:

  • Local CLI scripts (src/scripts/) — use Node's global fetch, write to disk
  • Cloudflare Workers — pass the Worker's fetch binding, store in KV/R2/D1
  • Unit tests — pass a mock fetch, assert on returned objects
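
The pattern can be illustrated with a small function that accepts its fetch implementation as a parameter. This is a sketch under assumptions: FetchLike, fetchJsonSketch, and mockFetch are hypothetical names, and the real fetchJson helper in gbfs-fetch.ts may have a different signature.

```typescript
// Minimal structural type so the sketch does not depend on DOM or Node typings.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

/** Fetch and parse JSON with an injectable fetch; defaults to the global fetch. */
async function fetchJsonSketch(
  url: string,
  fetchFn: FetchLike = (u) => fetch(u),
): Promise<unknown> {
  const res = await fetchFn(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.json();
}

// In a unit test, pass a mock that never touches the network.
const mockFetch: FetchLike = async (url) => ({
  ok: true,
  status: 200,
  json: async () => ({ requested: url }),
});

fetchJsonSketch("https://example.invalid/gbfs.json", mockFetch).then((data) =>
  console.log(data),
);
```

Since the function returns data instead of writing files, the caller decides where the result goes: disk for the CLI scripts, KV for the Worker, or an assertion in a test.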

How GBFS discovery works

systems.csv ──→ gbfs.json ──→ station_information.json   (static, fetch once)
(master list)   (per system)   station_status.json        (real-time, refresh)
                               free_bike_status.json      (real-time, refresh)

The scraper handles both GBFS v2.x (feeds nested under language keys like data.en.feeds) and v3.0 (flat data.feeds array), and the v3 rename of free_bike_status → vehicle_status.
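
One way to handle both layouts is to normalize the discovery document into a flat feed list and then look feeds up by a list of acceptable names. The sketch below assumes each feed entry has name and url fields, as in the GBFS spec. The function names are hypothetical, and preferring the "en" language key (then the first key) for v2.x is an assumption.

```typescript
// Hypothetical helpers; the repo's findFeedUrl in gbfs-fetch.ts may differ.
interface Feed { name: string; url: string }

/** Return the feed list from a v3.0 (data.feeds) or v2.x (data.<lang>.feeds) document. */
function listFeeds(discovery: unknown): Feed[] {
  const data = (discovery as { data?: Record<string, unknown> }).data ?? {};
  // v3.0: flat array directly under data
  if (Array.isArray(data["feeds"])) return data["feeds"] as Feed[];
  // v2.x: one feed list per language key
  const byLang = data as Record<string, { feeds?: Feed[] } | undefined>;
  const chosen = byLang["en"] ?? Object.values(byLang)[0];
  return chosen?.feeds ?? [];
}

/** Return the URL of the first feed matching any of the given names, in order. */
function findFeedUrlSketch(discovery: unknown, names: string[]): string | undefined {
  const feeds = listFeeds(discovery);
  for (const name of names) {
    const hit = feeds.find((f) => f.name === name);
    if (hit) return hit.url;
  }
  return undefined;
}

// Placeholder documents in the two layouts.
const v2Doc = {
  data: {
    en: {
      feeds: [
        { name: "station_status", url: "https://example.invalid/en/station_status.json" },
        { name: "free_bike_status", url: "https://example.invalid/en/free_bike_status.json" },
      ],
    },
  },
};
const v3Doc = {
  data: {
    feeds: [{ name: "vehicle_status", url: "https://example.invalid/vehicle_status.json" }],
  },
};

// Ask for the v3 name first, then fall back to the v2 name.
const vehicleFeedNames = ["vehicle_status", "free_bike_status"];
console.log(findFeedUrlSketch(v2Doc, vehicleFeedNames));
console.log(findFeedUrlSketch(v3Doc, vehicleFeedNames));
```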

Two-phase fetch design

  1. fetch-station-data — full fetch: discovery → resolve all feed URLs → fetch station_information + station_status + free_bike_status. Caches the discovery response in the output file.
  2. fetch-availability — lightweight refresh: reads cached discovery → fetches only station_status + free_bike_status/vehicle_status. Skips the discovery round-trip entirely.

This separation means you can poll availability every few minutes without redundantly re-fetching static station info or re-doing discovery.

Geo-index architecture

The geo-index is built from scraped station data and uses two key libraries:

  • kdbush — static KD-tree for 2D points. 2× less memory than flatbush for point data. Serializes to/from ArrayBuffer with zero-copy KDBush.from() restore.
  • geokdbush — haversine-aware kNN queries on kdbush indexes. around(index, lon, lat, k) returns indices sorted by distance. ~0.025ms per query on 500K points.

Two-tier KDBush architecture

The spatial index has two tiers, both using the same binary tile format and KDBush:

┌─────────────────────────────────────────────────────────────────────┐
│  Tier 1: box-index.bin (~5 KB)                                      │
│    KDBush of ~120 box center-points                                 │
│    Metadata: [{ box: "box-001", bbox: {...}, n: 4832 }, ...]        │
│                                                                     │
│  Tier 2: box-NNN.bin (~340 KB each, ~120 tiles)                     │
│    KDBush of ~5000 station/vehicle points                           │
│    Metadata: { systems: [...], types: [...], points: [...] }        │
│    Compact format — system_id and type stored as string-table refs  │
└─────────────────────────────────────────────────────────────────────┘
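
The compact data-tile metadata can be expanded back into readable records by resolving the s and t indices against the string tables. A small sketch, assuming the field names shown in the tile format ({ i, s, t, name, cap? }); the expanded field names and the function name are hypothetical.

```typescript
// Compact shapes follow the tile format described in this README.
interface CompactPoint { i: number; s: number; t: number; name: string; cap?: number }
interface DataTileMeta { systems: string[]; types: string[]; points: CompactPoint[] }

// Hypothetical expanded record for display or API output.
interface ExpandedPoint {
  id: number;
  system: string;
  type: string;
  name: string;
  capacity?: number;
}

/** Resolve string-table references so each point carries its own system and type. */
function expandPoints(meta: DataTileMeta): ExpandedPoint[] {
  return meta.points.map((p) => ({
    id: p.i,
    system: meta.systems[p.s],
    type: meta.types[p.t],
    name: p.name,
    // cap is optional; vehicles typically have none
    ...(p.cap !== undefined ? { capacity: p.cap } : {}),
  }));
}

const demoMeta: DataTileMeta = {
  systems: ["velib-paris", "dott-paris"],
  types: ["station", "vehicle"],
  points: [
    { i: 0, s: 0, t: 0, name: "Rue de Rivoli - Châtelet", cap: 34 },
    { i: 1, s: 1, t: 1, name: "(dockless: #47)" },
  ],
};
console.log(expandPoints(demoMeta));
```

Storing each system_id and type once per tile and referencing them by index keeps the JSON small when thousands of points share a handful of systems.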

Box assignment strategy

Points are grouped into boxes by the assignPointsToBoxes() algorithm:

  1. System-first grouping — all points from a GBFS system start in one group
  2. Recursive median-split — groups with >6500 points are split along the median lat or lon
  3. Greedy merge — small groups are merged with their nearest neighbor (within 500 km)
  4. Bbox expansion — bounding boxes are expanded by 10% to create overlap zones

This keeps systems together (no fragmentation across arbitrary grid cells), handles dense cities naturally via overlap + closest-center tiebreaking, and eliminates the need for hardcoded city zones or geohash cells.
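
Steps 2 and 4 can be sketched in a few lines. This is not the repo's assignPointsToBoxes(); it is an illustration under assumptions: the split axis is chosen as the one with the wider extent (the source only says "median lat or lon"), and the 10% expansion is applied as padding of 10% of the box's width and height on each side.

```typescript
// Hypothetical names and shapes, for illustration only.
interface Point { lon: number; lat: number }
interface BBox { minLon: number; minLat: number; maxLon: number; maxLat: number }

/** Extent (max minus min) of the points along one axis. */
function extent(points: Point[], axis: keyof Point): number {
  let min = Infinity;
  let max = -Infinity;
  for (const p of points) {
    if (p[axis] < min) min = p[axis];
    if (p[axis] > max) max = p[axis];
  }
  return max - min;
}

/** Recursively halve a group at the median until every part has <= maxSize points. */
function medianSplit(points: Point[], maxSize: number): Point[][] {
  if (maxSize < 1) throw new Error("maxSize must be at least 1");
  if (points.length <= maxSize) return [points];
  // Assumed heuristic: split along the axis with the wider spread.
  const axis: keyof Point = extent(points, "lon") >= extent(points, "lat") ? "lon" : "lat";
  const sorted = [...points].sort((a, b) => a[axis] - b[axis]);
  const mid = Math.floor(sorted.length / 2);
  return [
    ...medianSplit(sorted.slice(0, mid), maxSize),
    ...medianSplit(sorted.slice(mid), maxSize),
  ];
}

/** Grow a bbox by `fraction` of its width/height on every side (assumed meaning of "10%"). */
function expandBBox(b: BBox, fraction = 0.1): BBox {
  const padLon = (b.maxLon - b.minLon) * fraction;
  const padLat = (b.maxLat - b.minLat) * fraction;
  return {
    minLon: b.minLon - padLon,
    minLat: b.minLat - padLat,
    maxLon: b.maxLon + padLon,
    maxLat: b.maxLat + padLat,
  };
}

// Ten points along a line, split into groups of at most three.
const demoPoints: Point[] = Array.from({ length: 10 }, (_, i) => ({ lon: i, lat: 0 }));
const demoGroups = medianSplit(demoPoints, 3);
console.log(demoGroups.map((g) => g.length));
console.log(expandBBox({ minLon: 0, minLat: 0, maxLon: 10, maxLat: 20 }));
```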

Query flow

lat,lon → around(boxIndex, lon, lat, 5) → check bbox containment → closest center wins
       → load single box-NNN.bin → around(tileIndex, lon, lat, k) → results

A query always loads exactly 1 data tile. The box-index (~5 KB) is loaded once and cached. Total: 2 KV reads on first request, 1 KV read on subsequent requests. CPU budget is well under 5ms — within Cloudflare Workers free tier limits.
