DPSynth is a library for differentially private synthetic tabular data generation. Given a sensitive dataset of records defined w.r.t. a single-table schema, our library can generate a synthetic version of the dataset, preserving the structure and statistical properties of the source data while satisfying differential privacy.
Warning
This library is under active development. APIs may change without notice, and you may encounter bugs, rough edges, or incomplete features. For standard tabular data settings (categorical and numerical attributes, single-table schemas), DPSynth should work well out-of-the-box — but more advanced use cases may hit limitations we haven't smoothed out yet.
We have a long roadmap of features we plan to add. In the meantime, we welcome early adopters to:
- Try it out on your own datasets and use cases.
- Report issues — bugs, confusing behavior, or sharp edges.
- Benchmark it against other DP synthetic data implementations.
- Suggest features that would be valuable for your workflows.
- Contribute — whether it's a GitHub issue, pull request, new mechanism, bug fix, or added functionality, contributions are welcome!
Your feedback directly shapes the library's direction. Thank you for your patience as we build toward a stable release!
DPSynth contains two independent implementations of differentially private synthetic data generation. While both produce synthetic data using the same underlying mathematical principles (marginal measurement + Private-PGM inference), they were developed independently and have different trade-offs:
Entry point: dpsynth.generate() (backed by data_generation_v2.py)
Designed for datasets that fit in memory (e.g., Pandas DataFrames). We have tested this on datasets up to ~100M rows, though performance will depend on the number of attributes and domain sizes. This code path:
- Operates directly on NumPy / JAX arrays and Pandas DataFrames via discrete_mechanisms/.
- Accepts domains and data as Python objects; no DatasetDescriptor required.
- May be more feature-rich, including experimental mechanisms not yet ported to the pipeline mode.
- Has limited scalability compared to the pipeline mode.
CLI binary: bin/main.py
Entry point: data_generation.generate()
Built for large-scale data that may not fit on a single machine. This code path:
- Runs on distributed frameworks (Apache Beam) via pipeline_dp.PipelineBackend.
- Uses pipeline_transformations/ for all DP operations.
- Requires a DatasetDescriptor to bridge format-specific data (CSV, TFRecord) with internal representations.
- Also works in local settings (pipeline_dp.LocalBackend), but follows a different code path than the in-memory mode above.
CLI binary: bin/run_data_generation.py
Entry point: postprocessing.generate_synthetic_data_from_marginals()
For situations where noisy marginals are already computed by an external system (e.g., a SQL pipeline or a custom DP mechanism), you can bypass the measurement step entirely and use DPSynth purely for post-processing. This code path takes pre-computed noisy marginals as Pandas DataFrames. It automatically infers the domain from the marginals, but requires categorical data.
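To make the "infers the domain from the marginals" step concrete, here is a minimal, standard-library sketch of one way such inference could work. It is not DPSynth's implementation: the `infer_domain` helper and the row layout (plain dicts standing in for DataFrame rows, with a `count` field) are assumptions made for illustration only.

```python
from collections import defaultdict


def infer_domain(marginals):
    """Collect the distinct values seen for each attribute across all marginals.

    `marginals` maps a tuple of attribute names to a list of rows. Each row is
    a dict with one key per attribute plus a noisy "count" (hypothetical layout).
    """
    domain = defaultdict(set)
    for attributes, rows in marginals.items():
        for row in rows:
            for attribute in attributes:
                domain[attribute].add(row[attribute])
    # Sort the values so the inferred domain is deterministic.
    return {attribute: sorted(values) for attribute, values in domain.items()}


# Hypothetical noisy marginals, as an external system might produce them.
# Noisy counts can be fractional or negative; only the values matter here.
noisy_marginals = {
    ("color",): [
        {"color": "red", "count": 41.7},
        {"color": "blue", "count": 58.9},
    ],
    ("color", "size"): [
        {"color": "red", "size": "S", "count": 19.2},
        {"color": "red", "size": "L", "count": 22.1},
        {"color": "blue", "size": "S", "count": 30.4},
        {"color": "green", "size": "L", "count": -1.3},
    ],
}

domain = infer_domain(noisy_marginals)
```

In this sketch an attribute value is part of the domain only if it appears in at least one marginal, which is one reason a categorical-only restriction is natural for this kind of inference.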
We have made efforts to align the APIs and behavior between the in-memory and pipeline code paths, but because they were developed independently, there are some differences:
- API surface: The in-memory API accepts domains as a plain dict[str, AttributeType], while the pipeline API uses the DatasetDescriptor abstraction.
- Budget accounting: There may be small differences in how the privacy budget is split across sub-operations (derivation, compression, measurement).
- Feature availability: The in-memory mode may support more experimental mechanisms or features that have not yet been ported to the pipeline mode.
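The effect of a different budget split can be illustrated with a small sketch. It assumes the budget is tracked as zero-concentrated DP (rho-zCDP), which composes additively, and that measurement uses the Gaussian mechanism, for which a noise scale of sigma on a query with L2 sensitivity Delta costs Delta^2 / (2 * sigma^2). The fractions and the helper names (`split_budget`, `gaussian_sigma`) are hypothetical and do not reflect the splits used by either code path.

```python
import math


def split_budget(total_rho, fractions):
    """Split a total zCDP budget across stages according to `fractions`."""
    if not math.isclose(sum(fractions.values()), 1.0):
        raise ValueError("fractions must sum to 1")
    return {stage: total_rho * share for stage, share in fractions.items()}


def gaussian_sigma(rho, l2_sensitivity=1.0):
    """Noise scale at which the Gaussian mechanism satisfies rho-zCDP."""
    return l2_sensitivity / math.sqrt(2.0 * rho)


# Hypothetical split; the actual fractions in either code path may differ.
budget = split_budget(
    0.5, {"derivation": 0.1, "compression": 0.1, "measurement": 0.8}
)
measurement_sigma = gaussian_sigma(budget["measurement"])
```

Because the noise scale grows as the per-stage budget shrinks, even a modest change in how the budget is divided can shift utility between the two code paths.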
Note
If you observe significant differences in behavior or utility between the two code paths on the same dataset and parameters, please open an issue.
These modules are used by both the in-memory and pipeline code paths:
- domain.py: Public API for defining attribute domains (CategoricalAttribute, NumericalAttribute, OpenSetCategoricalAttribute). Users construct these objects to describe their data schema.
- constraints.py: Definition and validation of cross-attribute constraints, provided by users to enforce structural properties.
- transformations.py: Internal logic for encoding, discretization, and mapping values between domains.
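As a rough illustration of what discretization and mapping values between domains involve, the sketch below bins a numerical value into equal-width buckets and maps a bin index back to a representative value. This is a generic technique, not the logic in transformations.py; the function names, the equal-width binning, and the midpoint decoding are all assumptions.

```python
def discretize(value, low, high, num_bins):
    """Map a numerical value to a bin index in [0, num_bins)."""
    # Clip out-of-range values so every input lands in a valid bin.
    clipped = min(max(value, low), high)
    width = (high - low) / num_bins
    # The upper edge `high` belongs to the last bin.
    return min(int((clipped - low) / width), num_bins - 1)


def undiscretize(bin_index, low, high, num_bins):
    """Map a bin index back to the midpoint of its bin."""
    width = (high - low) / num_bins
    return low + (bin_index + 0.5) * width


ages = [3, 17, 42, 65, 130]
bins = [discretize(age, low=0, high=100, num_bins=10) for age in ages]
decoded = [undiscretize(b, low=0, high=100, num_bins=10) for b in bins]
```

The round trip is lossy by design: every value in a bin decodes to the same midpoint, so the number of bins trades resolution against domain size.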
- discrete_mechanisms/: Local, single-machine DP mechanisms (AIM, MST, etc.) and shared mathematical utilities like domain compression.
- data_generation_v2.py: The end-to-end in-memory generation pipeline. This is what dpsynth.generate() calls.
- local_mode/: Locally-optimized DP primitives for quantiles and partition selection (NumPy/SciPy-based).
- pydantic_api.py: API for synthesizing collections of Pydantic models directly.
- dataset_descriptors/: The central orchestration layer. Bridges format-specific data (CSV, TFRecord) with internal mathematical representations.
- pipeline_transformations/: Distributed Beam implementations of DP primitives, derivations, and final sample synthesis.
- data_generation.py: High-level API for generating synthetic data in data pipelines using pipeline_dp.
- diagnostic_info.proto: Proto definition for tracking DP accounting and utility metrics during pipeline execution.
- postprocessing.py: Utilities for post-processing pre-computed noisy marginals into synthetic data via Private-PGM, without running DPSynth's own DP measurement step.
- bin/: Entry points for local prototyping (main.py) and distributed production jobs (run_data_generation.py).
- eval/: Tabular evaluation engine for comparing real and synthetic data distributions.
| Scenario | Recommended |
|---|---|
| Fits in memory, Pandas workflow | In-Memory (dpsynth.generate) |
| Discrete data, precomputed marginals | In-Memory (discrete_mechanisms) |
| Large-scale, distributed processing | Pipeline (data_generation) |
| Marginals from an external system | Post-Processing |
| Prototyping / experimental features | In-Memory (more flexible) |
Both code paths support the following DP mechanisms:
- AIM (Adaptive Iterative Mechanism): An MWEM-style algorithm that iteratively selects and measures low-dimensional marginals (arXiv:2201.12677).
- AIM-GDP: A variant of AIM using Gaussian Differential Privacy accounting for tighter budget composition.
- MST (Maximum Spanning Tree): Computes an approximate maximum spanning tree over pairwise attribute correlations using the exponential mechanism (arXiv:2108.04978).
- SWIFT (Scalable Workload-Informed Factor Tree): An unpublished mechanism that operates on discrete data and improves over AIM for higher-dimensional datasets by supporting denser sets of marginal measurements. Numerical attributes can be handled via the existing discretization wrappers.
- INDEPENDENT: A baseline mechanism that measures 1-way marginals and models all attributes independently.
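The INDEPENDENT baseline is simple enough to sketch end to end. The standard-library example below adds Gaussian noise to 1-way marginal counts and then samples every attribute independently from its noisy marginal. It is an illustration of the idea only: the function names are hypothetical, sigma is chosen arbitrarily rather than calibrated to a privacy budget, and Python's `random` module is not a suitable noise source for a real DP deployment.

```python
import random
from collections import Counter


def noisy_one_way_marginal(column, values, sigma, rng):
    """Count each domain value in `column` and add Gaussian noise."""
    counts = Counter(column)
    return {v: counts.get(v, 0) + rng.gauss(0.0, sigma) for v in values}


def sample_independent(marginals, num_rows, rng):
    """Sample each attribute independently from its noisy 1-way marginal."""
    columns = {}
    for attribute, noisy_counts in marginals.items():
        values = list(noisy_counts)
        # Clip negative noisy counts to zero so they can act as weights.
        weights = [max(count, 0.0) for count in noisy_counts.values()]
        if sum(weights) <= 0:
            # Fall back to a uniform distribution if noise wiped out all mass.
            weights = [1.0] * len(values)
        columns[attribute] = rng.choices(values, weights=weights, k=num_rows)
    # Zip the per-attribute columns back into row dicts.
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


rng = random.Random(0)
data = {
    "color": ["red"] * 60 + ["blue"] * 40,
    "size": ["S"] * 30 + ["L"] * 70,
}
domain = {"color": ["red", "blue"], "size": ["S", "L"]}

marginals = {
    attribute: noisy_one_way_marginal(data[attribute], domain[attribute], 2.0, rng)
    for attribute in data
}
synthetic = sample_independent(marginals, num_rows=5, rng=rng)
```

Because each column is drawn separately, the synthetic rows preserve per-attribute frequencies but none of the correlations between attributes, which is what the other mechanisms add by measuring higher-order marginals.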
Detailed guides are available in the documentation/ directory:
- In-Memory DataFrame API Guide (documentation/in_memory_api.md): Detailed guide to using the Pandas-based API and local CLI.
- Scalable Pipeline API Guide (documentation/scalable_beam_api.md): Guide for distributed data generation.
- Data Model & Terminology (documentation/data_and_terminology.md): Attributes, schema specifications, and domain.yaml format.
- Processing Lifecycle (documentation/processing_lifecycle.md): The 5-stage mathematical lifecycle shared by both code paths.
- Contributor Guide (documentation/contributors_guide.md): Architecture, PipelineBackend programming rules, and evaluation framework.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.