synloc: An Algorithm to Create Synthetic Tabular Data

Overview

synloc is an open-source Python package implementing the Local Resampler (LR) algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.

Two Subsampling Strategies

Both approaches provide effective disclosure control. Choose based on your priorities:

| Approach | Best for | Key advantage |
| --- | --- | --- |
| k-Nearest Neighbors (k-NN) | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |
| Clustering-based | Efficiency & accuracy | Better data utility and computational performance |

Key features:

Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)
Accurate replication of complex distributions, including multimodal and non-convex-support data
Flexible trade-off between data utility and privacy protection
Built-in quality diagnostics, including Kolmogorov-Smirnov distances, Wasserstein distances, summary statistics, and correlation-difference metrics
Compatible with parametric and nonparametric distributions

This implementation aligns with statistical agencies' safe data regulations, including the k-anonymity criterion and the Five Safes framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the paper referenced below.

Data Requirements

synloc expects a numeric pandas.DataFrame.

Categorical variables must be encoded before synthesis, for example with pandas.get_dummies.
Boolean dummy variables are accepted and converted to 0/1.
Missing numeric values are filled with column medians during fitting.
Columns with only missing values, duplicate column names, infinite values, and non-numeric columns raise clear errors.
Integer-like variables can be rounded after synthesis with round_integers.
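
The requirements above can be checked before synthesis with plain pandas. The following is a minimal sketch, assuming a small made-up table (the column names `income`, `age`, and `region` are hypothetical); it uses only pandas and NumPy calls and does not touch synloc itself.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one categorical column and one missing value.
raw = pd.DataFrame({
    "income": [52.0, 61.5, np.nan, 48.2],
    "age": [34, 45, 29, 51],
    "region": ["north", "south", "south", "west"],
})

# Encode the categorical column as dummy variables, as the README suggests.
encoded = pd.get_dummies(raw, columns=["region"])

# Pre-flight checks mirroring the conditions listed above.
numeric_part = encoded.select_dtypes(include="number").to_numpy(dtype=float)
has_infinite = bool(np.isinf(numeric_part).any())
has_duplicate_names = not encoded.columns.is_unique
has_all_missing_column = bool(encoded.isna().all().any())

print(encoded.dtypes)
print(has_infinite, has_duplicate_names, has_all_missing_column)
```

The missing `income` value is left in place here because, per the list above, missing numeric values are filled with column medians during fitting.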

Installation

synloc can be installed through PyPI:

pip install synloc

A Quick Example

Assume that we have a sample with three variables with the following distributions:

$$x \sim \mathrm{Beta}(0.1,\,0.1)$$

$$y \sim \mathrm{Beta}(0.1,\,0.5)$$

$$z \sim 10y + \mathrm{Normal}(0,\,1)$$

A sample from these distributions can be generated with the tools module in synloc:

from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz() # Generates a sample with size 1000 by default.
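
To make the three distributions concrete, here is a minimal NumPy sketch that draws a comparable sample directly. It is not the implementation of `sample_trivariate_xyz`; the function name `sample_xyz_sketch` and the `seed` argument are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def sample_xyz_sketch(n=1000, seed=0):
    """Draw x, y, z as described above (illustrative, not synloc's code)."""
    rng = np.random.default_rng(seed)
    x = rng.beta(0.1, 0.1, size=n)               # x ~ Beta(0.1, 0.1)
    y = rng.beta(0.1, 0.5, size=n)               # y ~ Beta(0.1, 0.5)
    z = 10 * y + rng.normal(0.0, 1.0, size=n)    # z = 10y + standard normal noise
    return pd.DataFrame({"x": x, "y": y, "z": z})

sketch = sample_xyz_sketch()
print(sketch.describe())
```

Both Beta distributions have shape parameters below 1, so most of the mass of x and y sits near 0 and 1, which is what makes this sample a useful test for a synthesizer.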

Initializing the resampler:

from synloc import LocalCov
resampler = LocalCov(data = data, K = 30)

The subsample size is set to K=30. We now estimate multivariate normal distributions locally and draw "synthetic values" from each estimated distribution:

syn_data = resampler.fit()

syn_data is a pandas.DataFrame in which all variables are synthesized. Compare it with the original sample using a 3-D scatter plot:

resampler.comparePlots(['x','y','z'])
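
The local estimation idea behind the k-NN variant can be sketched in a few lines of NumPy. This is a simplified illustration, not synloc's implementation: it assumes that each observation is replaced by one draw from a multivariate normal fitted to its `k` nearest neighbours (Euclidean distance, the observation itself included). The function name `local_resample_sketch` is hypothetical.

```python
import numpy as np

def local_resample_sketch(X, k=30, seed=0):
    """Illustrative k-NN local resampling (an assumption, not synloc's code)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    synthetic = np.empty_like(X)
    for i in range(X.shape[0]):
        # Squared Euclidean distance from observation i to every row.
        dist = np.sum((X - X[i]) ** 2, axis=1)
        # The k closest rows form the local subsample.
        neighbors = X[np.argsort(dist)[:k]]
        # Fit a multivariate normal to the subsample and draw one value.
        mean = neighbors.mean(axis=0)
        cov = np.atleast_2d(np.cov(neighbors, rowvar=False))
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
syn = local_resample_sketch(X, k=30)
print(syn.shape)
```

In this sketch a larger `k` smooths the local estimates toward the global distribution, while a smaller `k` follows the original sample more closely.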

You can also inspect utility diagnostics after fitting:

variable_metrics = resampler.compareStats()
quality = resampler.qualityReport()

print(variable_metrics[["ks_statistic", "wasserstein_distance"]])
print(quality["overall"])
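
For intuition about the two distance metrics printed above, the sketch below computes a Kolmogorov-Smirnov statistic and a Wasserstein distance for a single variable with SciPy. It is an assumption that synloc's diagnostics are comparable quantities; the arrays `original` and `synthetic` are simulated stand-ins, not synloc output.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=1000)    # stand-in for an original column
synthetic = rng.normal(0.1, 1.0, size=1000)   # stand-in for its synthetic copy

# KS statistic: largest gap between the two empirical CDFs (0 = identical).
ks = ks_2samp(original, synthetic).statistic
# Wasserstein distance: average shift needed to match the two distributions.
wd = wasserstein_distance(original, synthetic)

print(f"KS statistic: {ks:.3f}, Wasserstein distance: {wd:.3f}")
```

Smaller values on both metrics indicate that the synthetic marginal distribution is closer to the original one.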

How to cite?

If you use synloc in your research, please cite the following paper:

@article{https://doi.org/10.1111/anzs.70032,
    author = {Kalay, Ali Furkan},
    title = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},
    journal = {Australian \& New Zealand Journal of Statistics},
    volume = {68},
    number = {1},
    pages = {e70032},
    doi = {https://doi.org/10.1111/anzs.70032},
    url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032},
    eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/anzs.70032},
    year = {2026}
}

Replication

For replication materials of the paper, see the replication folder.
