Overview | Data Requirements | Installation | A Quick Example | Documentation | How to cite? | Replication
synloc is an open-source Python package implementing the Local Resampler (LR) algorithm for generating synthetic tabular data while safeguarding privacy. It provides a computationally efficient and flexible approach to synthetic data generation, enabling researchers to work with privacy-preserving datasets that maintain statistical utility.
synloc offers two variants of the LR algorithm, one based on k-nearest neighbors and one based on clustering. Both provide effective disclosure control. Choose based on your priorities:
| Approach | Best for | Key advantage |
|---|---|---|
| k-Nearest Neighbors (k-NN) | Stronger disclosure control | Naturally underrepresents outliers, reducing privacy risks |
| Clustering-based | Efficiency & accuracy | Better data utility and computational performance |
Key features:
- Natural disclosure risk reduction by underrepresenting outliers (k-NN variant)
- Accurate replication of complex distributions, including multimodal and non-convex-support data
- Flexible trade-off between data utility and privacy protection
- Built-in quality diagnostics, including Kolmogorov-Smirnov distances, Wasserstein distances, summary statistics, and correlation-difference metrics
- Compatible with parametric and nonparametric distributions
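The diagnostics named in the list above can also be computed independently of synloc. The following is a minimal sketch using SciPy and pandas, not synloc's own implementation: the helper names `utility_diagnostics` and `max_correlation_difference` are hypothetical, the output column names are chosen here for illustration, and the "synthetic" frame is just a second random sample standing in for real output.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, wasserstein_distance


def utility_diagnostics(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-variable KS and Wasserstein distances between two numeric frames."""
    rows = {}
    for col in original.columns:
        rows[col] = {
            # Two-sample Kolmogorov-Smirnov statistic (0 means identical ECDFs).
            "ks_statistic": ks_2samp(original[col], synthetic[col]).statistic,
            # First Wasserstein (earth mover's) distance between the two samples.
            "wasserstein_distance": wasserstein_distance(original[col], synthetic[col]),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


def max_correlation_difference(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute gap between the two Pearson correlation matrices."""
    diff = original.corr() - synthetic[original.columns].corr()
    return float(diff.abs().to_numpy().max())


# Stand-in data: two independent random samples, not real synloc output.
rng = np.random.default_rng(0)
original = pd.DataFrame(rng.normal(size=(500, 2)), columns=["x", "y"])
synthetic = pd.DataFrame(rng.normal(size=(500, 2)), columns=["x", "y"])

report = utility_diagnostics(original, synthetic)
corr_gap = max_correlation_difference(original, synthetic)
print(report)
print(corr_gap)
```

Smaller values indicate that the synthetic sample tracks the original more closely. Comparing a frame with itself yields zeros for all three measures.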
This implementation aligns with statistical agencies' safe data regulations, including the k-anonymity criterion and the Five Safes framework adopted by organizations such as the Australian Bureau of Statistics. For the full methodology and theoretical foundations, see the paper referenced below.
synloc expects a numeric pandas.DataFrame.
- Categorical variables must be encoded before synthesis, for example with pandas.get_dummies.
- Boolean dummy variables are accepted and converted to 0/1.
- Missing numeric values are filled with column medians during fitting.
- Columns with only missing values, duplicate column names, infinite values, and non-numeric columns raise clear errors.
- Integer-like variables can be rounded after synthesis with round_integers.
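The requirements above can be checked before handing a frame to synloc. The sketch below uses only pandas and NumPy. The `requirement_problems` helper and the toy `raw` frame are hypothetical and written for illustration, so it approximates the listed rules rather than reproducing synloc's own validation.

```python
import numpy as np
import pandas as pd

# Toy input with one categorical column, one boolean column, and a missing value.
raw = pd.DataFrame(
    {
        "age": [34, 51, 27, 45],
        "income": [52000.0, np.nan, 38000.0, 61000.0],
        "region": ["north", "south", "south", "east"],
        "employed": [True, False, True, True],
    }
)

# Encode the categorical column as dummy variables before synthesis.
prepared = pd.get_dummies(raw, columns=["region"])


def requirement_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not df.columns.is_unique:
        problems.append("duplicate column names")
    # Boolean columns count as numeric here, matching the 0/1 conversion rule.
    non_numeric = [str(c) for c, t in df.dtypes.items()
                   if not pd.api.types.is_numeric_dtype(t)]
    if non_numeric:
        problems.append("non-numeric columns: " + ", ".join(non_numeric))
    all_missing = df.columns[df.isna().all().to_numpy()]
    if len(all_missing) > 0:
        problems.append("columns with only missing values")
    numeric = df.select_dtypes(include="number")
    if np.isinf(numeric.to_numpy(dtype=float)).any():
        problems.append("infinite values")
    return problems


print(requirement_problems(raw))       # reports the unencoded "region" column
print(requirement_problems(prepared))  # no problems after encoding
```

The single missing `income` value is not reported, since partially missing numeric columns are allowed and are filled during fitting.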
synloc can be installed through PyPI:
pip install synloc
Assume that we have a sample with three variables with the following distributions:
A sample from this distribution can be generated with the tools module in synloc:
from synloc.tools import sample_trivariate_xyz
data = sample_trivariate_xyz()  # Generates a sample of size 1000 by default.

Initializing the resampler:
from synloc import LocalCov
resampler = LocalCov(data=data, K=30)

The subsample size is set to K = 30. Next, we locally estimate multivariate normal distributions and draw "synthetic values" from each estimated distribution.
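To make the idea of local estimation concrete, here is a minimal NumPy sketch of a k-nearest-neighbor variant. It is an illustration under stated assumptions, not synloc's implementation: the function name `local_resample_sketch` is hypothetical, neighbors are found by brute-force Euclidean distance, and exactly one synthetic row is drawn per neighborhood.

```python
import numpy as np


def local_resample_sketch(data: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Draw one synthetic row per observation from a locally fitted normal."""
    n, d = data.shape
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of rows")
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n, d))
    for i in range(n):
        # Squared Euclidean distance from observation i to every observation.
        dist = np.sum((data - data[i]) ** 2, axis=1)
        # Indices of the k nearest observations (observation i itself included).
        neighbors = np.argpartition(dist, k - 1)[:k]
        local = data[neighbors]
        # Fit a multivariate normal to the neighborhood: mean and covariance.
        mean = local.mean(axis=0)
        cov = np.atleast_2d(np.cov(local, rowvar=False))
        # Draw a single synthetic observation from the local distribution.
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic


# Stand-in data: 200 draws from a 3-dimensional standard normal.
demo = np.random.default_rng(1).normal(size=(200, 3))
demo_synthetic = local_resample_sketch(demo, k=30)
print(demo_synthetic.shape)
```

Because each draw comes from a distribution fitted to a neighborhood rather than from a single record, no synthetic row is a copy of an original row. In synloc, the corresponding step is run by calling fit on the resampler: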
syn_data = resampler.fit()

The result, syn_data, is a pandas.DataFrame in which all variables are synthesized. The synthetic and original samples can be compared in a 3-D scatter plot:
resampler.comparePlots(['x', 'y', 'z'])

You can also inspect utility diagnostics after fitting:
variable_metrics = resampler.compareStats()
quality = resampler.qualityReport()
print(variable_metrics[["ks_statistic", "wasserstein_distance"]])
print(quality["overall"])

If you use synloc in your research, please cite the following paper:
@article{https://doi.org/10.1111/anzs.70032,
author = {Kalay, Ali Furkan},
title = {Generating Synthetic Data With Locally Estimated Distributions for Disclosure Control},
journal = {Australian \& New Zealand Journal of Statistics},
volume = {68},
number = {1},
pages = {e70032},
doi = {https://doi.org/10.1111/anzs.70032},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/anzs.70032},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/anzs.70032},
year = {2026}
}
For replication materials of the paper, see the replication folder.