You need test data. Realistic, multi-table, relational test data. But:
- Copying production data is a compliance nightmare
- Mocking by hand is tedious and produces flat, fake-looking datasets
- Existing generators are schema-limited or produce obviously synthetic output

DataPop generates statistically realistic synthetic datasets (multi-table, relational, with realistic distributions) from a simple Python API or a single YAML config file.

from datapop import Dataset
# 3-line setup
schema = Dataset.from_yaml("schema.yml")
schema.generate(rows=50_000)
schema.to_sqlite("synthetic.db")

Or use it programmatically:
from datapop import Schema, Column
from datapop.generators import Normal, UniformChoice, Sequence, ForeignKey
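schema = Schema(name="ecommerce")
# Example values only; substitute your own list of countries
countries = ["US", "DE", "JP", "BR"]
# "orders" references "users" below, so define users as its own table
# (the row count here is illustrative)
schema.add_table("users", rows=50_000)
schema.tables["users"].add_column("user_id", Sequence(start=1))
schema.tables["users"].add_column("email", UniformChoice(patterns=["user{}@example.com", "test{}@mail.com"]))
schema.tables["users"].add_column("age", Normal(mean=32, std=8))
schema.tables["users"].add_column("country", UniformChoice(countries))
schema.tables["users"].add_column("signup_date", Sequence(start="2024-01-01", freq="1d"))
schema.tables["users"].add_column("plan", UniformChoice(["free", "pro", "enterprise"]))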
schema.add_table("orders", rows=200_000)
schema.tables["orders"].add_column("order_id", Sequence(start=1))
schema.tables["orders"].add_column("user_id", ForeignKey("users"))
schema.add_table("products", rows=5_000)
# ... configure products table
dataset = schema.generate()
dataset.to_parquet("./synthetic_data/")
dataset.to_sqlite("./ecommerce.db")

- Distribution-aware: Normal, exponential, log-normal, uniform, and Zipfian distributions
- Relational integrity: Foreign keys, unique constraints, referential consistency
- Multi-format export: CSV, Parquet, SQLite, PostgreSQL, DuckDB
- Schema validation: Catch constraint violations before generation
- Seedable: Reproducible output with random.seed()
- CLI tool: datapop generate schema.yml --rows 100k --format sqlite
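
To make the feature list concrete, here is a minimal standard-library sketch of three of the ideas above: seeded reproducibility, distribution-aware columns, and foreign keys that always resolve. It does not use DataPop and is not how DataPop is implemented; every function name (`generate_users`, `generate_orders`, `build_dataset`, `to_sqlite`) and every parameter value is a hypothetical chosen for illustration.

```python
import random
import sqlite3


def generate_users(rng, n):
    """Users with sequential ids and normally distributed ages, clamped to 18..90."""
    users = []
    for user_id in range(1, n + 1):
        age = int(round(rng.gauss(32, 8)))
        users.append((user_id, min(max(age, 18), 90)))
    return users


def generate_orders(rng, users, n):
    """Orders whose user_id always references an existing user.

    Zipf-like weights (1/rank) mean a few users place most of the orders.
    """
    user_ids = [u[0] for u in users]
    weights = [1.0 / rank for rank in range(1, len(user_ids) + 1)]
    chosen = rng.choices(user_ids, weights=weights, k=n)
    # Log-normal order amounts: many small orders, a long tail of large ones
    return [
        (order_id, uid, round(rng.lognormvariate(3.0, 0.75), 2))
        for order_id, uid in enumerate(chosen, start=1)
    ]


def build_dataset(seed, n_users=100, n_orders=500):
    """Generate both tables from one private RNG: same seed, same dataset."""
    rng = random.Random(seed)
    users = generate_users(rng, n_users)
    orders = generate_orders(rng, users, n_orders)
    return users, orders


def to_sqlite(users, orders, path=":memory:"):
    """Write both tables to SQLite with the foreign key enforced."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE users (user_id INTEGER PRIMARY KEY, age INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES users(user_id), "
        "amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", users)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    conn.commit()
    return conn
```

Calling `build_dataset(seed=42)` twice returns identical tables, and because SQLite enforces the foreign key here, `to_sqlite` would raise an error if any order pointed at a missing user. Passing a dedicated `random.Random(seed)` instance around, rather than seeding the global generator, keeps the output reproducible even when other code also draws random numbers.
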
pip install datapop
# or from source
pip install .

# Generate from a schema file
datapop generate examples/shop.yml --rows 10000 --format csv
# Validate a schema
datapop validate schema.yml

datapop/
├── core/ # Schema, Dataset, Column definitions
├── generators/ # Distribution generators (Normal, Uniform, Zipf, etc.)
├── exporters/ # Output formatters (CSV, Parquet, SQLite, DuckDB)
└── validators/ # Schema and constraint validators

See docs/blog-post.md for a draft article on how DataPop works, including the statistical methods used to match real data distributions.