paddymul/near-bike

closest-bike

Find the closest bike share station to you, anywhere in the world.

Scrapes GBFS (General Bikeshare Feed Specification) feeds from 1,200+ systems across 50+ countries.

Setup

pnpm install

Scripts

1. Fetch the systems catalog

Downloads the master list of all GBFS systems from MobilityData/gbfs/systems.csv and writes it to data/systems.json.

pnpm fetch-systems

Output includes a summary:

Fetching GBFS systems catalog…
  Found 1245 systems
  Top 10 countries:
    DE: 217
    US: 172
    FR: 139
    ...
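A summary like the one above could be produced roughly as follows. This is a minimal sketch, not the project's code: the helper names are hypothetical, the "Country Code" column header is assumed to match the MobilityData systems.csv, and the parser does not handle line breaks inside quoted fields.

```typescript
// Hypothetical helpers; the "Country Code" header is an assumption
// about the systems.csv layout.

/** Split one CSV line into fields, honoring double-quoted fields. */
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

/** Parse CSV text into records keyed by the header row. */
function parseCsv(text: string): Record<string, string>[] {
  const lines = text.split(/\r?\n/).filter((l) => l.trim() !== "");
  if (lines.length === 0) return [];
  const header = splitCsvLine(lines[0]);
  return lines.slice(1).map((line) => {
    const cells = splitCsvLine(line);
    const row: Record<string, string> = {};
    header.forEach((name, idx) => {
      row[name] = cells[idx] ?? "";
    });
    return row;
  });
}

/** Count systems per country code, most common first. */
function countByCountry(rows: Record<string, string>[]): [string, number][] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const cc = row["Country Code"];
    counts.set(cc, (counts.get(cc) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

Taking the first ten entries of countByCountry(parseCsv(text)) would give a "Top 10 countries" list.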

2. Fetch station data for a system

Reads data/systems.json, resolves a system's auto-discovery endpoint, and fetches its station/vehicle data. This includes static info (locations, capacity) and a snapshot of current availability.

# Search for systems
npx tsx src/scripts/fetch-station.ts --list "paris"
npx tsx src/scripts/fetch-station.ts --list "us"

# Fetch by index (from --list output)
npx tsx src/scripts/fetch-station.ts 753

# Fetch by system_id
npx tsx src/scripts/fetch-station.ts dublin

# Fetch by name (substring match)
npx tsx src/scripts/fetch-station.ts "Citi Bike"

Output is written to data/stations/<system_id>.json and includes:

  • station_information: station locations (lat/lon), names, capacity
  • station_status: real-time availability per station
  • free_bike_status: dockless vehicle locations (lat/lon)

3. Refresh availability for a system

Re-fetches only the real-time availability feeds (station_status + free_bike_status/vehicle_status) using the cached discovery URLs from a previous fetch-station run. Much faster than a full fetch since it skips discovery and station_information.

# List systems that have been fetched
npx tsx src/scripts/fetch-availability.ts --list
npx tsx src/scripts/fetch-availability.ts --list "dub"

# Refresh by system_id
npx tsx src/scripts/fetch-availability.ts dublin

# Refresh by name
npx tsx src/scripts/fetch-availability.ts "Citi Bike"

# Refresh by index
npx tsx src/scripts/fetch-availability.ts 0

Output is written to data/availability/<system_id>.json.

Typical workflow

pnpm fetch-systems                                       # once: get all 1,200+ systems
npx tsx src/scripts/fetch-station.ts "Citi Bike"         # once: get stations + first snapshot
npx tsx src/scripts/fetch-availability.ts "Citi Bike"    # repeat: refresh availability

4. Build geo-index tiles

Reads all fetched station data from data/stations/*.json, extracts every geo point (stations + free-floating vehicles), and builds a two-tier spatial index of binary kdbush tiles.

pnpm build-tiles

# Custom target size (default 5000 points per box)
pnpm build-tiles -- --target-size 3000

Output:

  • data/tiles/box-index.bin: routing index (~5 KB, KDBush of ~120 box centers)
  • data/tiles/box-NNN.bin: one data tile per box (~340 KB each, ~5000 points)

The tiling groups points into bounding boxes:

  • Points are grouped by GBFS system first, keeping systems together in boxes
  • Large systems (>6500 points) are recursively split along the median latitude or longitude
  • Small nearby systems are greedily merged into the same box (within 500 km)
  • Bounding boxes are expanded by 10% for overlap, with closest-center tiebreaking
  • Result: ~120 boxes covering 598K+ stations and vehicles worldwide

Binary tile format

Both the box-index and data tiles use the same self-contained binary format:

┌──────────────────────────────────────────────────────────┐
│ HEADER (12 bytes)                                        │
│   magic: 0x4742 ("GB")                    [2 bytes]      │
│   version: 1                              [2 bytes]      │
│   point_count: N                          [4 bytes]      │
│   metadata_offset: M                      [4 bytes]      │
├──────────────────────────────────────────────────────────┤
│ KDBUSH INDEX (variable)                                  │
│   Raw kdbush ArrayBuffer, zero-copy restore              │
│   with KDBush.from()                                     │
├──────────────────────────────────────────────────────────┤
│ METADATA (starts at byte M)                              │
│   JSON-encoded metadata (UTF-8)                          │
│                                                          │
│   Box-index tiles:                                       │
│     BoxIndexMeta[]: { box, bbox, n }                     │
│                                                          │
│   Data tiles (compact string-table format):              │
│     { systems: ["velib", ...],                           │
│       types: ["station", "vehicle"],                     │
│       points: [{ i, s, t, name, cap? }, ...] }           │
│     s/t are indices into the systems/types tables        │
│     i is a sequential integer (point identity)           │
└──────────────────────────────────────────────────────────┘
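The layout above can be read and written with a DataView. The following is a minimal sketch under stated assumptions: the byte order (little-endian here) is not specified in this README, the function names are hypothetical, and the index bytes are treated as an opaque blob rather than passed through KDBush.

```typescript
// Sketch only: little-endian byte order is an assumption, and
// encodeTile/decodeTile are hypothetical names.
const MAGIC = 0x4742; // "GB"
const HEADER_BYTES = 12;

interface TileHeader {
  version: number;
  pointCount: number;
  metadataOffset: number;
}

/** Assemble header + raw index bytes + JSON metadata into one buffer. */
function encodeTile(indexBytes: Uint8Array, pointCount: number, metadata: unknown): ArrayBuffer {
  const metaBytes = new TextEncoder().encode(JSON.stringify(metadata));
  const metadataOffset = HEADER_BYTES + indexBytes.byteLength;
  const out = new Uint8Array(metadataOffset + metaBytes.byteLength);
  const view = new DataView(out.buffer);
  view.setUint16(0, MAGIC, true); // magic       [2 bytes]
  view.setUint16(2, 1, true); // version         [2 bytes]
  view.setUint32(4, pointCount, true); // N      [4 bytes]
  view.setUint32(8, metadataOffset, true); // M  [4 bytes]
  out.set(indexBytes, HEADER_BYTES);
  out.set(metaBytes, metadataOffset);
  return out.buffer;
}

/** Split a tile back into header, index bytes, and parsed metadata. */
function decodeTile(buf: ArrayBuffer): { header: TileHeader; indexBytes: ArrayBuffer; metadata: unknown } {
  const view = new DataView(buf);
  if (view.getUint16(0, true) !== MAGIC) throw new Error("bad magic");
  const header: TileHeader = {
    version: view.getUint16(2, true),
    pointCount: view.getUint32(4, true),
    metadataOffset: view.getUint32(8, true),
  };
  // slice() copies the index bytes into a standalone ArrayBuffer.
  const indexBytes = buf.slice(HEADER_BYTES, header.metadataOffset);
  const metaText = new TextDecoder().decode(new Uint8Array(buf, header.metadataOffset));
  return { header, indexBytes, metadata: JSON.parse(metaText) };
}
```

In this sketch the indexBytes buffer is what would be handed to KDBush.from(); because slice() copies, it does not demonstrate the zero-copy restore itself.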

5. Query nearest bikes

Given a latitude and longitude, finds the closest bike stations and vehicles using the two-tier KDBush index and geokdbush (haversine-aware kNN).

# Find 5 nearest (default)
pnpm query-nearest -- 48.8566 2.3522

# Find 10 nearest
pnpm query-nearest -- 48.8566 2.3522 --k 10

Output:

Querying nearest to (48.8566, 2.3522) with k=5…
Loaded box-index: 120 boxes

    #  Distance    System                Name                            Type      Cap
  ──────────────────────────────────────────────────────────────────────────────────────────
    1  80 m        velib-paris           Rue de Rivoli - Châtelet        station   34
    2  120 m       dott-paris            (dockless: #47)                 vehicle   -
    3  150 m       velib-paris           Place du Châtelet               station   28
    4  190 m       lime-paris            (dockless: #102)                vehicle   -
    5  230 m       velib-paris           Quai de la Mégisserie           station   42

The query loads exactly one data tile per request:

  1. Loads the box-index (~5 KB, cached) and finds the 5 nearest box centers
  2. Picks the box whose bounding box contains the query point (closest-center tiebreaking)
  3. Loads that single box tile and runs kNN within it

Full workflow

pnpm fetch-systems                                       # once: get all 1,200+ systems
npx tsx src/scripts/fetch-station.ts 0                   # fetch systems one by one (or batch)
pnpm build-tiles                                         # build geo-index from all fetched data
pnpm query-nearest -- 48.8566 2.3522                     # find nearest bikes to central Paris

Ingest pipeline

The ingest pipeline automates fetching, scheduling, and compacting GBFS data across all 1,200+ systems. It replaces manual per-system fetching with a continuous batch process that tracks changes and stores time-series snapshots.

Data flow:

pnpm ingest (batch fetcher)
  ├─ data/stations/{id}.json                              latest cached station data
  ├─ data/snapshots/YYYY-MM-DD/HH-MM/{id}.{type}.json    time-series snapshots
  └─ data/ingest.db                                       fetch logs + scheduling state

pnpm compact
  └─ data/parquet/availability/{date}.parquet              compressed history

pnpm rebuild-index
  └─ data/tiles/box-index.bin + box-NNN.bin               spatial index

Each system is polled on a schedule with three interval tiers: base (300 s), rush (120 s), and quiet (900 s). Feeds are hashed with SHA-256 and written only when their content changes. Failing systems are retried with exponential backoff.
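The scheduling rules in the previous paragraph can be sketched as below. Only the three intervals and the use of SHA-256 come from this README; the doubling backoff formula, the six-hour cap, the function names, and the use of Node's crypto module are assumptions for illustration. How a system is assigned to the rush or quiet tier is not covered here.

```typescript
import { createHash } from "node:crypto";

type Tier = "base" | "rush" | "quiet";

// Intervals from the README, in seconds.
const INTERVAL_SECONDS: Record<Tier, number> = { base: 300, rush: 120, quiet: 900 };

// Assumed cap on the backoff delay (not taken from the project).
const MAX_BACKOFF_SECONDS = 6 * 60 * 60;

/** Delay until the next poll: the tier interval, doubled per consecutive failure. */
function nextDelaySeconds(tier: Tier, consecutiveFailures: number): number {
  const interval = INTERVAL_SECONDS[tier];
  if (consecutiveFailures <= 0) return interval;
  return Math.min(interval * 2 ** consecutiveFailures, MAX_BACKOFF_SECONDS);
}

/** Hex-encoded SHA-256 of a feed body. */
function sha256Hex(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

/** Decide whether a freshly fetched feed differs from the last stored one. */
function shouldWrite(previousHash: string | undefined, body: string): { write: boolean; hash: string } {
  const hash = sha256Hex(body);
  return { write: hash !== previousHash, hash };
}
```

Storing the returned hash next to the scheduling state lets the following poll skip the snapshot write when the feed is unchanged.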

See operations.md for how to run the ingest pipeline locally and monitor it.

Batch scripts

  • fetch-all-stations.bash: fetches station data for all 1,200+ systems in parallel (10 concurrent jobs via xargs)
  • fetch-all-availability.bash: sequentially refreshes availability for all fetched systems

Analysis scripts

  • scripts/analyze-station-density.py: reports station/vehicle counts per system, bounding boxes, and geohash cell distribution
  • scripts/identify-hotspots.py: maps the top 40 densest geohash-4 cells to their contributing systems

Cloudflare Worker

The entire system (ingest, tile building, and serving) runs on Cloudflare Workers. No external VPS is needed.

  • operations.md: running locally, monitoring, troubleshooting
  • deploy.md: deployment instructions, KV layout, architecture

Routes

Route                                 Description
GET /                                 Landing page with geolocation; redirects to /nearby
GET /nearby?lat=&lon=                 Server-rendered HTML page showing nearest bikes
GET /nearest?lat=&lon=&k=             JSON API: nearest bikes (default k=5)
GET /systems                          Return cached systems catalog from KV
GET /systems/refresh                  Re-fetch systems.csv from GitHub, store in KV
GET /systems/status?format=html       Systems directory: every system with stats
GET /station/:system_id               Fetch full station data (cached 24h in KV)
GET /availability/:system_id          Refresh just availability using cached discovery
GET /ingest/init                      Seed scheduling state for all systems in KV
GET /ingest/status?format=html&n=30   Ingest dashboard: JSON (default) or HTML, n=rows
GET /ingest/run?sync=true&limit=N     Fetch N due systems inline (local dev)
GET /planner/run?sync=true            Run planner + build tiles inline (local dev)
GET /api                              List available API routes

Tests

pnpm test            # single run
pnpm test:watch      # watch mode

Tests use fixture files in src/test/fixtures/ with a mock fetch, so they make no network calls.

Project structure

src/
  types/gbfs.ts                          # TypeScript types for GBFS feeds + geo-index
  lib/
    gbfs-fetch.ts                        # Shared fetch helpers (fetchJson, findFeedUrl)
    fetch-systems-catalog.ts             # Pure fn: systems.csv → SystemsCatalog
    fetch-station-data.ts                # Pure fn: system → station/vehicle data
    fetch-availability.ts                # Pure fn: cached discovery → fresh availability
    box-assign.ts                        # Pure fn: assign points to bounding boxes
    box-assign-meta.ts                   # Pure fn: metadata-based box assignment (for Workers)
    geo-tile.ts                          # Pure fn: tile building + binary serialization
    geo-query.ts                         # Pure fn: two-tier kNN query with box routing
    content-hash.ts                      # Pure fn: content hashing for change detection
    kv-scheduling.ts                     # KV-based scheduling state (for Workers)
    ingest-db.ts                         # SQLite metadata DB for CLI ingest pipeline
    ingest-scheduler.ts                  # Batch ingest scheduling logic
    compact-parquet.ts                   # Parquet compaction for historical data
  scripts/
    fetch-systems.ts                     # CLI: fetches catalog → data/systems.json
    fetch-station.ts                     # CLI: fetches station data → data/stations/
    fetch-availability.ts                # CLI: refreshes availability → data/availability/
    build-tiles.ts                       # CLI: builds geo-index tiles → data/tiles/
    query-nearest.ts                     # CLI: queries nearest bikes from tiles
    ingest.ts                            # CLI: batch ingest with scheduling
    ingest-init.ts                       # CLI: initialize ingest database
    ingest-status.ts                     # CLI: show ingest status
    compact.ts                           # CLI: compact snapshots to parquet
    rebuild-index.ts                     # CLI: rebuild index with change detection
  test/
    fixtures/                            # Small example JSON/CSV for tests
    fetch-systems-catalog.test.ts
    fetch-station-data.test.ts
    fetch-availability.test.ts
    box-assign.test.ts
    box-assign-meta.test.ts
    geo-tile.test.ts
    geo-query.test.ts
    content-hash.test.ts
    kv-scheduling.test.ts
    ingest-db.test.ts
    ingest-scheduler.test.ts
    compact-parquet.test.ts
worker/
  index.ts                               # Cloudflare Worker (routes + cron + queues)
  ingest.ts                              # Queue consumer: per-system GBFS fetches
  planner.ts                             # Box assignment planner (metadata-based)
  tile-builder.ts                        # Queue consumer: per-box tile building
  pages.ts                               # Server-rendered HTML pages
  index.test.ts                          # Worker tests with mock KV
  tsconfig.json                          # Worker-specific TS config
wrangler.toml                            # Wrangler configuration (KV + Queues + crons)
deploy.md                                # Full deployment guide
data/                                    # Fetched data (gitignored)
  stations/                              #   Per-system station data JSON files
  availability/                          #   Per-system availability JSON files
  tiles/                                 #   Geo-index tiles (box-index.bin + box-NNN.bin)

Architecture

The library functions in src/lib/ are pure: they take an optional fetch parameter and do no file I/O. This makes them callable from:

  • Local CLI scripts (src/scripts/): use Node's global fetch, write to disk
  • Cloudflare Workers: pass the Worker's fetch binding, store in KV/R2/D1
  • Unit tests: pass a mock fetch, assert on returned objects
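The pattern described above, an optional fetch parameter that callers can swap out, might look like the following sketch. The names fetchJsonSketch and mockFetch are hypothetical and the FetchLike type is a simplification, so this is not the signature of the project's own fetchJson helper.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

/** Fetch and parse JSON; the fetch implementation is injectable. */
async function fetchJsonSketch(
  url: string,
  fetchFn: FetchLike = (globalThis as any).fetch, // defaults to the runtime's fetch
): Promise<unknown> {
  const res = await fetchFn(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.json();
}

/** Test double: serves canned JSON bodies by URL, 404 for anything else. */
function mockFetch(responses: Record<string, unknown>): FetchLike {
  return async (url) => {
    const found = url in responses;
    return {
      ok: found,
      status: found ? 200 : 404,
      json: async () => responses[url],
    };
  };
}
```

A unit test can then call fetchJsonSketch(url, mockFetch({ [url]: fixture })) and assert on the returned object without touching the network.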

How GBFS discovery works

systems.csv ──→ gbfs.json ──→ station_information.json   (static, fetch once)
(master list)   (per system)   station_status.json        (real-time, refresh)
                               free_bike_status.json      (real-time, refresh)

The scraper handles both GBFS v2.x (feeds nested under language keys like data.en.feeds) and v3.0 (flat data.feeds array), as well as the v3 rename of free_bike_status to vehicle_status.
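Handling both discovery shapes could be sketched as follows. The function names are hypothetical (the project's findFeedUrl may differ), and falling back to the first available language when the preferred one is missing is an assumption.

```typescript
interface Feed {
  name: string;
  url: string;
}

/** Return the feed list from a gbfs.json document, v2.x or v3.0. */
function extractFeeds(gbfs: any, preferredLang = "en"): Feed[] {
  const data = gbfs?.data;
  if (!data) return [];
  // v3.0: flat data.feeds array
  if (Array.isArray(data.feeds)) return data.feeds;
  // v2.x: feeds nested under a language key, e.g. data.en.feeds
  // (falling back to the first language present is an assumption)
  const byLang: any = data[preferredLang] ?? Object.values(data)[0];
  return Array.isArray(byLang?.feeds) ? byLang.feeds : [];
}

/** Find the URL of the first feed matching any of the given names, in order. */
function findFeedUrlSketch(gbfs: unknown, ...names: string[]): string | undefined {
  const feeds = extractFeeds(gbfs);
  for (const name of names) {
    const hit = feeds.find((f) => f.name === name);
    if (hit) return hit.url;
  }
  return undefined;
}
```

Passing both names, as in findFeedUrlSketch(doc, "vehicle_status", "free_bike_status"), covers the v3 rename with a single call.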

Two-phase fetch design

  1. fetch-station-data: full fetch. Discovery → resolve all feed URLs → fetch station_information + station_status + free_bike_status. Caches the discovery response in the output file.
  2. fetch-availability: lightweight refresh. Reads cached discovery → fetches only station_status + free_bike_status/vehicle_status. Skips the discovery round-trip entirely.

This separation means you can poll availability every few minutes without redundantly re-fetching static station info or re-doing discovery.

Geo-index architecture

The geo-index is built from scraped station data and uses two key libraries:

  • kdbush: static KD-tree for 2D points. Uses half the memory of flatbush for point data. Serializes to/from ArrayBuffer with zero-copy KDBush.from() restore.
  • geokdbush: haversine-aware kNN queries on kdbush indexes. around(index, lon, lat, k) returns indices sorted by distance. ~0.025 ms per query on 500K points.
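For reference, the great-circle distance that "haversine-aware" refers to can be written out directly. The sketch below is a brute-force baseline with hypothetical names, useful for checking index results on small inputs; it is not how geokdbush searches the tree, and the mean Earth radius of 6371 km is a conventional choice.

```typescript
const EARTH_RADIUS_KM = 6371; // mean Earth radius (conventional value)

/** Great-circle distance between two lat/lon points, in kilometers. */
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(Math.min(1, a)));
}

interface GeoPoint {
  lat: number;
  lon: number;
}

/** O(n log n) baseline: indices of the k points nearest to (lat, lon). */
function nearestBruteForce(points: GeoPoint[], lat: number, lon: number, k: number): number[] {
  return points
    .map((p, idx) => ({ idx, d: haversineKm(lat, lon, p.lat, p.lon) }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k)
    .map((entry) => entry.idx);
}
```

Comparing nearestBruteForce against the index output on a small fixture is one way to sanity-check a tile.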

Two-tier KDBush architecture

The spatial index has two tiers, both using the same binary tile format and KDBush:

┌─────────────────────────────────────────────────────────────────────┐
│  Tier 1: box-index.bin (~5 KB)                                      │
│    KDBush of ~120 box center-points                                 │
│    Metadata: [{ box: "box-001", bbox: {...}, n: 4832 }, ...]        │
│                                                                     │
│  Tier 2: box-NNN.bin (~340 KB each, ~120 tiles)                     │
│    KDBush of ~5000 station/vehicle points                           │
│    Metadata: { systems: [...], types: [...], points: [...] }        │
│    Compact format: system_id and type stored as string-table refs   │
└─────────────────────────────────────────────────────────────────────┘
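The string-table idea used in the tier 2 metadata can be illustrated as below. The field names (systems, types, points, i, s, t, name, cap) follow the format description in this README; the helper names are hypothetical, and treating a point's position in the points array as its lookup key is an assumption.

```typescript
interface CompactPoint {
  i: number; // sequential point identity
  s: number; // index into systems
  t: number; // index into types
  name: string;
  cap?: number;
}

interface DataTileMeta {
  systems: string[];
  types: string[];
  points: CompactPoint[];
}

interface PointInput {
  system: string;
  type: string;
  name: string;
  cap?: number;
}

/** Return the table index for a value, appending it on first sight. */
function intern(table: string[], lookup: Map<string, number>, value: string): number {
  let idx = lookup.get(value);
  if (idx === undefined) {
    idx = table.length;
    table.push(value);
    lookup.set(value, idx);
  }
  return idx;
}

/** Replace repeated system/type strings with small integer references. */
function compactPoints(input: PointInput[]): DataTileMeta {
  const systems: string[] = [];
  const types: string[] = [];
  const sysLookup = new Map<string, number>();
  const typeLookup = new Map<string, number>();
  const points = input.map((p, i) => ({
    i,
    s: intern(systems, sysLookup, p.system),
    t: intern(types, typeLookup, p.type),
    name: p.name,
    ...(p.cap !== undefined ? { cap: p.cap } : {}),
  }));
  return { systems, types, points };
}

/** Expand one compact point back into readable form. */
function resolvePoint(meta: DataTileMeta, index: number): PointInput & { i: number } {
  const p = meta.points[index];
  return { i: p.i, system: meta.systems[p.s], type: meta.types[p.t], name: p.name, cap: p.cap };
}
```

Since a tile holds thousands of points from a handful of systems, storing each system_id once and referencing it by index keeps the JSON metadata small.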

Box assignment strategy

Points are grouped into boxes by the assignPointsToBoxes() algorithm:

  1. System-first grouping β€” all points from a GBFS system start in one group
  2. Recursive median-split β€” groups with >6500 points are split along the median lat or lon
  3. Greedy merge β€” small groups are merged with their nearest neighbor (within 500 km)
  4. Bbox expansion β€” bounding boxes are expanded by 10% to create overlap zones

This keeps systems together (no fragmentation across arbitrary grid cells), handles dense cities naturally via overlap + closest-center tiebreaking, and eliminates the need for hardcoded city zones or geohash cells.
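Steps 2 and 4 of the list above can be sketched as follows. This is an illustration, not the assignPointsToBoxes() implementation: the names are hypothetical, splitting along the axis with the larger extent is an assumption, and "expanded by 10%" is interpreted here as growing each dimension by 10% in total (half on each side). The sketch also ignores antimeridian wrap-around.

```typescript
interface Pt {
  lat: number;
  lon: number;
}

interface BBox {
  minLat: number;
  maxLat: number;
  minLon: number;
  maxLon: number;
}

/** Tight bounding box of a non-empty point list. */
function bboxOf(points: Pt[]): BBox {
  const box: BBox = { minLat: Infinity, maxLat: -Infinity, minLon: Infinity, maxLon: -Infinity };
  for (const p of points) {
    box.minLat = Math.min(box.minLat, p.lat);
    box.maxLat = Math.max(box.maxLat, p.lat);
    box.minLon = Math.min(box.minLon, p.lon);
    box.maxLon = Math.max(box.maxLon, p.lon);
  }
  return box;
}

/** Recursively halve a group at the median until every part fits maxSize. */
function splitByMedian(points: Pt[], maxSize: number): Pt[][] {
  if (maxSize < 1) throw new Error("maxSize must be at least 1");
  if (points.length <= maxSize) return [points];
  const box = bboxOf(points);
  // Assumption: split along whichever axis spans more degrees.
  const axis: keyof Pt = box.maxLat - box.minLat >= box.maxLon - box.minLon ? "lat" : "lon";
  const sorted = [...points].sort((a, b) => a[axis] - b[axis]);
  const mid = Math.floor(sorted.length / 2);
  return [
    ...splitByMedian(sorted.slice(0, mid), maxSize),
    ...splitByMedian(sorted.slice(mid), maxSize),
  ];
}

/** Grow a box so neighboring boxes overlap (fraction of total extent). */
function expandBbox(box: BBox, fraction = 0.1): BBox {
  const padLat = ((box.maxLat - box.minLat) * fraction) / 2;
  const padLon = ((box.maxLon - box.minLon) * fraction) / 2;
  return {
    minLat: box.minLat - padLat,
    maxLat: box.maxLat + padLat,
    minLon: box.minLon - padLon,
    maxLon: box.maxLon + padLon,
  };
}
```

With the 6500-point threshold from step 2, a 20,000-point system would end up in four groups of 5,000 under this scheme.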

Query flow

lat,lon → around(boxIndex, lon, lat, 5) → check bbox containment → closest center wins
       → load single box-NNN.bin → around(tileIndex, lon, lat, k) → results

A query always loads exactly 1 data tile. The box-index (~5 KB) is loaded once and cached. Total: 2 KV reads on first request, 1 KV read on subsequent requests. CPU time is well under 5 ms, which is within Cloudflare Workers free tier limits.
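The routing step (bbox containment, then closest center) could be sketched as below. The candidate list stands in for the nearest box centers returned by the tier 1 lookup. The names, the approximate distance used for tiebreaking, and the fallback to the nearest center when no bounding box contains the point are all assumptions.

```typescript
interface BBox {
  minLat: number;
  maxLat: number;
  minLon: number;
  maxLon: number;
}

interface BoxCandidate {
  box: string; // e.g. "box-001"
  bbox: BBox;
  centerLat: number;
  centerLon: number;
}

function contains(b: BBox, lat: number, lon: number): boolean {
  return lat >= b.minLat && lat <= b.maxLat && lon >= b.minLon && lon <= b.maxLon;
}

/** Squared equirectangular distance in degrees; adequate for ranking nearby centers. */
function approxDist2(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const dLat = lat2 - lat1;
  const dLon = (lon2 - lon1) * Math.cos((lat1 * Math.PI) / 180);
  return dLat * dLat + dLon * dLon;
}

/** Pick the single tile to load for a query point. */
function pickBox(candidates: BoxCandidate[], lat: number, lon: number): BoxCandidate | undefined {
  const containing = candidates.filter((c) => contains(c.bbox, lat, lon));
  // Assumed fallback: if no box contains the point, use the nearest center.
  const pool = containing.length > 0 ? containing : candidates;
  let best: BoxCandidate | undefined;
  let bestD = Infinity;
  for (const c of pool) {
    const d = approxDist2(lat, lon, c.centerLat, c.centerLon);
    if (d < bestD) {
      bestD = d;
      best = c;
    }
  }
  return best;
}
```

The box name returned here identifies the one box-NNN.bin tile to fetch before running the kNN query inside it.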
