A comprehensive data pipeline for fetching, processing, and analyzing Polymarket trading data. This system collects market information, order-filled events, and processes them into structured trade data.
First-time users: Download the latest data snapshot and extract it in the main repository directory before your first run. This will save you over 2 days of initial data collection time.
This pipeline performs three main operations:
- Market Data Collection - Fetches all Polymarket markets with metadata
- Order Event Scraping - Collects order-filled events from the Goldsky subgraph
- Trade Processing - Transforms raw order events into structured trade data
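For orientation, here is a minimal sketch of what the orchestrator amounts to, assuming the module layout and function names shown in the usage examples below; the actual `update_all.py` may differ in details such as logging and error handling.

```python
# Hypothetical condensation of update_all.py: run the three stages in order.
from update_utils.update_markets import update_markets
from update_utils.update_goldsky import update_goldsky
from update_utils.process_live import process_live

if __name__ == "__main__":
    update_markets()   # 1. refresh markets.csv from the Polymarket API
    update_goldsky()   # 2. append new order-filled events from Goldsky
    process_live()     # 3. transform new events into processed/trades.csv
```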
This project uses UV for fast, reliable package management.
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or with pip
pip install uv
```

Then install the project dependencies:

```bash
# Install all dependencies
uv sync

# Install with development dependencies (Jupyter, etc.)
uv sync --extra dev
```

Run the full pipeline:

```bash
# Run with UV (recommended)
uv run python update_all.py

# Or activate the virtual environment first
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
python update_all.py
```

This will sequentially run all three pipeline stages:
- Update markets from Polymarket API
- Update order-filled events from Goldsky
- Process new orders into trades
```
poly_data/
├── update_all.py             # Main orchestrator script
├── update_utils/             # Data collection modules
│   ├── update_markets.py     # Fetch markets from Polymarket API
│   ├── update_goldsky.py     # Scrape order events from Goldsky
│   └── process_live.py       # Process orders into trades
├── poly_utils/               # Utility functions
│   └── utils.py              # Market loading and missing token handling
├── markets.csv               # Main markets dataset
├── missing_markets.csv       # Markets discovered from trades (auto-generated)
├── goldsky/                  # Order-filled events (auto-generated)
│   └── orderFilled.csv
└── processed/                # Processed trade data (auto-generated)
    └── trades.csv
```
`markets.csv` contains market metadata, including:
- Market question, outcomes, and tokens
- Creation/close times and slugs
- Trading volume and condition IDs
- Negative risk indicators
Fields: createdAt, id, question, answer1, answer2, neg_risk, market_slug, token1, token2, condition_id, volume, ticker, closedTime
`goldsky/orderFilled.csv` contains raw order-filled events, with:
- Maker/taker addresses and asset IDs
- Fill amounts and transaction hashes
- Unix timestamps
Fields: timestamp, maker, makerAssetId, makerAmountFilled, taker, takerAssetId, takerAmountFilled, transactionHash
`processed/trades.csv` contains structured trade data, including:
- Market ID mapping and trade direction
- Price, USD amount, and token amount
- Maker/taker roles and transaction details
Fields: timestamp, market_id, maker, taker, nonusdc_side, maker_direction, taker_direction, price, usd_amount, token_amount, transactionHash
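Since trades reference markets by ID, a typical first step is joining the two files. A small sketch, assuming the column names listed above:

```python
import polars as pl

markets = pl.read_csv("markets.csv")
trades = pl.read_csv("processed/trades.csv")

# Attach the market question to each trade via market_id -> id
labeled = trades.join(
    markets.select(["id", "question"]),
    left_on="market_id",
    right_on="id",
    how="left",
)
```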
update_markets.py fetches all markets from the Polymarket API in chronological order.
Features:
- Automatic resume from last offset (idempotent)
- Rate limiting and error handling
- Batch fetching (500 markets per request)
Usage:
uv run python -c "from update_utils.update_markets import update_markets; update_markets()"Scrapes order-filled events from Goldsky subgraph API.
update_goldsky.py scrapes order-filled events from the Goldsky subgraph API.
Features:
- Resumes from last timestamp automatically
- Handles GraphQL queries with pagination
- Deduplicates events
Usage:
uv run python -c "from update_utils.update_goldsky import update_goldsky; update_goldsky()"Processes raw order events into structured trades.
process_live.py processes raw order events into structured trades.
Features:
- Maps asset IDs to markets using token lookup
- Calculates prices and trade directions
- Identifies BUY/SELL sides
- Handles missing markets by discovering them from trades
- Incremental processing from last checkpoint
Usage:
uv run python -c "from update_utils.process_live import process_live; process_live()"Processing Logic:
Processing Logic:
- Identifies non-USDC asset in each trade
- Maps to market and outcome token (token1/token2)
- Determines maker/taker directions (BUY/SELL)
- Calculates price as USDC amount per outcome token
- Converts amounts from raw units (divides by 10^6)
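Put together, the per-event transformation looks roughly like the sketch below. Field names follow the orderFilled.csv schema above; the helper itself is illustrative, not the actual process_live.py code.

```python
USDC = "0"  # makerAssetId/takerAssetId of "0" represents USDC

def process_event(ev: dict, token_to_market: dict) -> dict:
    if ev["makerAssetId"] == USDC:
        # Maker pays USDC, so the maker BUYs the outcome token; taker SELLs.
        maker_direction, taker_direction = "BUY", "SELL"
        usd_raw, token_raw = ev["makerAmountFilled"], ev["takerAmountFilled"]
        token_id = ev["takerAssetId"]
    else:
        maker_direction, taker_direction = "SELL", "BUY"
        usd_raw, token_raw = ev["takerAmountFilled"], ev["makerAmountFilled"]
        token_id = ev["makerAssetId"]
    usd_amount = int(usd_raw) / 10**6      # raw units -> USDC
    token_amount = int(token_raw) / 10**6  # raw units -> outcome tokens
    return {
        "market_id": token_to_market[token_id],  # token1/token2 lookup
        "maker_direction": maker_direction,
        "taker_direction": taker_direction,
        "price": usd_amount / token_amount,  # USDC per outcome token
        "usd_amount": usd_amount,
        "token_amount": token_amount,
    }
```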
Dependencies are managed via `pyproject.toml` and installed automatically with `uv sync`.
Key Libraries:
- `polars` - Fast DataFrame operations
- `pandas` - Data manipulation
- `gql` - GraphQL client for Goldsky
- `requests` - HTTP requests to Polymarket API
- `flatten-json` - JSON flattening for nested responses
Development Dependencies (optional, installed with `--extra dev`):
- `jupyter` - Interactive notebooks
- `notebook` - Jupyter notebook interface
- `ipykernel` - Python kernel for Jupyter
All stages automatically resume from where they left off:
- Markets: Counts existing CSV rows to set offset
- Goldsky: Reads last timestamp from orderFilled.csv
- Processing: Finds last processed transaction hash
- Automatic retries on network failures
- Rate limit detection and backoff
- Server error (500) handling
- Graceful fallbacks for missing data
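As an illustration of the pattern (not the repo's actual helper), a retry wrapper with exponential backoff might look like this:

```python
import time
import requests

def get_with_backoff(url, params=None, max_tries=5):
    """Retry GET requests on rate limits (429) and server errors (500)."""
    delay = 1.0
    for attempt in range(max_tries):
        try:
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code in (429, 500):
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2  # exponential backoff
```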
The processing stage automatically discovers markets that weren't in the initial `markets.csv` (e.g., markets created after the last update) and fetches them via the Polymarket API, saving them to `missing_markets.csv`.
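A sketch of that discovery flow, assuming a token-ID query parameter on the Polymarket API (the parameter name here is a guess for illustration):

```python
import requests

def discover_missing(token_ids, known_tokens):
    """Fetch metadata for tokens seen in trades but absent from markets.csv."""
    rows = []
    for tid in (t for t in token_ids if t not in known_tokens):
        resp = requests.get(
            "https://gamma-api.polymarket.com/markets",
            params={"clob_token_ids": tid},  # assumed parameter name
            timeout=30,
        )
        resp.raise_for_status()
        rows.extend(resp.json())
    # ... append the rows to missing_markets.csv ...
    return rows
```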
- Taker Direction: BUY if paying USDC, SELL if receiving USDC
- Maker Direction: Opposite of taker direction
- Price: Always expressed as USDC per outcome token (see the worked example after these notes)
- `makerAssetId`/`takerAssetId` of "0" represents USDC
- Non-zero IDs are outcome token IDs (token1/token2 from markets)
- Each trade involves USDC and one outcome token
- All amounts are normalized to standard decimal format (divided by 10^6)
- Timestamps are converted from Unix epoch to datetime
- Platform wallets (`0xc5d563a36ae78145c45a50134d48a1215220f80a`, `0x4bfb41d5b3570defd03c39a9a4d8de6bd8b8982e`) are tracked in `poly_utils/utils.py`
- Negative risk markets are flagged in the market data
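A short worked example of the direction and price conventions above, with illustrative numbers:

```python
event = {
    "makerAssetId": "0",             # maker pays USDC
    "makerAmountFilled": 450_000,    # 0.45 USDC after dividing by 10^6
    "takerAssetId": "1234567890",    # an outcome token (token1 or token2)
    "takerAmountFilled": 1_000_000,  # 1.0 outcome token
}
usd = event["makerAmountFilled"] / 10**6     # 0.45
tokens = event["takerAmountFilled"] / 10**6  # 1.0
price = usd / tokens                         # 0.45 USDC per outcome token
taker_direction = "SELL"  # taker receives USDC
maker_direction = "BUY"   # opposite of the taker
```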
Issue: Markets not found during processing
Solution: Run `update_markets()` first, or let `process_live()` auto-discover them

Issue: Duplicate trades
Solution: Deduplication is automatic; re-run processing from scratch if needed

Issue: Rate limiting
Solution: The pipeline handles this automatically with exponential backoff
```python
import pandas as pd
import polars as pl
from poly_utils import get_markets, PLATFORM_WALLETS

# Load markets
markets_df = get_markets()

# Load trades
df = pl.scan_csv("processed/trades.csv").collect(streaming=True)
df = df.with_columns(
    pl.col("timestamp").str.to_datetime().alias("timestamp")
)
```

Important: When filtering for a specific user's trades, filter by the maker column. Although this looks like it only captures trades where the user acted as maker, it is how Polymarket generates events at the contract level: the maker column shows each trade from that user's perspective, including price.
```python
USERS = {
    'domah': '0x9d84ce0306f8551e02efef1680475fc0f1dc1344',
    '50pence': '0x3cf3e8d5427aed066a7a5926980600f6c3cf87b3',
    'fhantom': '0x6356fb47642a028bc09df92023c35a21a0b41885',
    'car': '0x7c3db723f1d4d8cb9c550095203b686cb11e5c6b',
    'theo4': '0x56687bf447db6ffa42ffe2204a05edaa20f55839'
}

# Get all trades for a specific user
trader_df = df.filter(pl.col("maker") == USERS['domah'])
```

Go wild with it.