
AmSach/datapop

DataPop — Synthetic Dataset Generator

**Generate realistic synthetic tabular data with a single line of Python.**

Python | License: MIT

The Problem

You need test data. Realistic, multi-table, relational test data. But:

  • Copying production data is a compliance nightmare
  • Mocking by hand is tedious and produces flat, fake-looking datasets
  • Existing generators are schema-limited or produce obviously synthetic output

What DataPop Does

DataPop generates statistically realistic synthetic datasets — multi-table, relational, with real distributions — from a simple Python API or a single YAML config file.

from datapop import Dataset

# 3-line setup
schema = Dataset.from_yaml("schema.yml")
schema.generate(rows=50_000)
schema.to_sqlite("synthetic.db")

Or use it programmatically:

from datapop import Schema, Column
from datapop.generators import Normal, UniformChoice, Sequence, ForeignKey

# Example values; substitute your own list of countries
countries = ["US", "DE", "IN", "BR", "JP"]

# Columns describing each user
schema = Schema(name="ecommerce")
schema.add_column("user_id", Sequence(start=1))
schema.add_column("email", UniformChoice(patterns=["user{}@example.com", "test{}@mail.com"]))
schema.add_column("age", Normal(mean=32, std=8))
schema.add_column("country", UniformChoice(countries))
schema.add_column("signup_date", Sequence(start="2024-01-01", freq="1d"))
schema.add_column("plan", UniformChoice(["free", "pro", "enterprise"]))

# Related tables; orders.user_id references users
schema.add_table("orders", rows=200_000)
schema.tables["orders"].add_column("order_id", Sequence(start=1))
schema.tables["orders"].add_column("user_id", ForeignKey("users"))
schema.add_table("products", rows=5_000)
# ... configure products table

# Generate every table, then export to Parquet and SQLite
dataset = schema.generate()
dataset.to_parquet("./synthetic_data/")
dataset.to_sqlite("./ecommerce.db")
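
After exporting to SQLite, it can be useful to confirm that every foreign key resolves. The sketch below uses only the standard-library `sqlite3` module and is not part of DataPop. The helper name `orphaned_rows` is hypothetical, and it assumes the exported tables are named `users` and `orders` with a shared `user_id` column. An in-memory database stands in for the generated file so the example runs on its own.

```python
import sqlite3

def orphaned_rows(conn, child, fk_column, parent, pk_column):
    """Count child rows whose foreign key has no matching parent row."""
    # Identifiers are interpolated directly, so pass trusted names only.
    query = (
        f"SELECT COUNT(*) FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk_column} = p.{pk_column} "
        f"WHERE p.{pk_column} IS NULL"
    )
    return conn.execute(query).fetchone()[0]

# Stand-in for a generated database (assumed table and column names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (2, 2), (3, 2)])

# Every order points at an existing user, so no orphans are found.
print(orphaned_rows(conn, "orders", "user_id", "users", "user_id"))  # 0
```

To check a real export, replace the in-memory connection with `sqlite3.connect("./ecommerce.db")`, adjusting the table and column names to match the file.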

Features

  • Distribution-aware: Normal, exponential, log-normal, uniform, zipfian distributions
  • Relational integrity: Foreign keys, unique constraints, referential consistency
  • Multi-format export: CSV, Parquet, SQLite, PostgreSQL, DuckDB
  • Schema validation: Catch constraint violations before generation
  • Seedable: Reproducible output with random.seed()
  • CLI tool: datapop generate schema.yml --rows 100k --format sqlite
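
To make the distribution and seeding features concrete, here is a minimal standard-library sketch of seeded sampling from the distributions listed above. It is not DataPop's implementation: `sample_column` and its parameter names are hypothetical, and the zipfian case is approximated over a finite set of ranks.

```python
import random

def sample_column(kind, n, rng, **params):
    """Draw n values from a named distribution (illustrative helper, not DataPop's API)."""
    if kind == "normal":
        return [rng.gauss(params["mean"], params["std"]) for _ in range(n)]
    if kind == "exponential":
        return [rng.expovariate(params["rate"]) for _ in range(n)]
    if kind == "lognormal":
        return [rng.lognormvariate(params["mu"], params["sigma"]) for _ in range(n)]
    if kind == "uniform":
        return [rng.uniform(params["low"], params["high"]) for _ in range(n)]
    if kind == "zipf":
        # Rank k is drawn with probability proportional to 1 / k**s.
        ranks = list(range(1, params["n_ranks"] + 1))
        weights = [1.0 / k ** params["s"] for k in ranks]
        return rng.choices(ranks, weights=weights, k=n)
    raise ValueError(f"unknown distribution: {kind}")

# A dedicated Random instance keeps the seed local instead of touching global state.
ages_a = sample_column("normal", 5, random.Random(42), mean=32, std=8)
ages_b = sample_column("normal", 5, random.Random(42), mean=32, std=8)
assert ages_a == ages_b  # same seed, same values
```

Calling `random.seed()` seeds Python's module-level generator instead; a per-run `random.Random(seed)` gives the same reproducibility without affecting other code that uses `random`.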

Installation

pip install datapop
# or from source
pip install .

Quick Start

# Generate from a schema file
datapop generate examples/shop.yml --rows 10000 --format csv

# Validate a schema
datapop validate schema.yml
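
As an illustration of the kind of problem validation can catch before any rows are generated, the sketch below checks that every foreign key points at a table that exists. The dict layout and the `find_schema_errors` helper are hypothetical and do not reflect DataPop's schema format or its validator.

```python
def find_schema_errors(tables):
    """Return human-readable problems found in a dict-based schema description.

    `tables` maps table name -> {column name -> spec}. A spec of the form
    ("fk", "other_table") marks a foreign key. This layout is hypothetical.
    """
    errors = []
    for table, columns in tables.items():
        if not columns:
            errors.append(f"{table}: table has no columns")
        for column, spec in columns.items():
            if isinstance(spec, tuple) and spec[0] == "fk":
                target = spec[1]
                if target not in tables:
                    errors.append(f"{table}.{column}: references unknown table '{target}'")
    return errors

schema = {
    "users": {"user_id": "sequence"},
    "orders": {"order_id": "sequence", "user_id": ("fk", "users")},
    "reviews": {"product_id": ("fk", "products")},  # "products" is never defined
}
for problem in find_schema_errors(schema):
    print(problem)
# reviews.product_id: references unknown table 'products'
```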

Architecture

datapop/
├── core/          # Schema, Dataset, Column definitions
├── generators/    # Distribution generators (Normal, Uniform, Zipf, etc.)
├── exporters/     # Output formatters (CSV, Parquet, SQLite, DuckDB)
└── validators/    # Schema and constraint validators
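
One way to picture how the generator and exporter layers could fit together is sketched below. It is a simplified, standard-library stand-in and not DataPop's actual code: the `Sequence` and `Normal` classes here only borrow the names used earlier, and `sample`, `generate_table`, and `to_csv` are assumed interfaces.

```python
import csv
import io
import random

class Sequence:
    """Stand-in generator yielding start, start + 1, ..."""
    def __init__(self, start=1):
        self.start = start

    def sample(self, n, rng):
        return list(range(self.start, self.start + n))

class Normal:
    """Stand-in generator drawing from a normal distribution."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def sample(self, n, rng):
        return [rng.gauss(self.mean, self.std) for _ in range(n)]

def generate_table(columns, rows, seed=0):
    """Ask each column's generator for `rows` values, sharing one seeded RNG."""
    rng = random.Random(seed)
    return {name: gen.sample(rows, rng) for name, gen in columns.items()}

def to_csv(table, stream):
    """Write a column-oriented table to a text stream as CSV."""
    writer = csv.writer(stream)
    writer.writerow(table.keys())
    writer.writerows(zip(*table.values()))  # transpose columns into rows

table = generate_table({"user_id": Sequence(start=1), "age": Normal(mean=32, std=8)}, rows=3)
buffer = io.StringIO()
to_csv(table, buffer)
print(buffer.getvalue())
```

Keeping generators and exporters behind small, separate interfaces like these means a new distribution or output format can be added without touching the other layer.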

Blog Post

See docs/blog-post.md for a draft article on how DataPop works, including the statistical methods used to match real data distributions.
