Pair Correlations Preserving Model in Synthetic Data Generation


Vardan H. Topchyan, Institute for Informatics and Automation Problems of NAS RA


Keywords: Synthetic data, Confidentiality, Disclosure limitation


Statistical organizations face a growing risk of disclosing confidential information because of the large volume of data they release to the public. The most common methods for limiting disclosure risk are synthetic data generation methods. Unfortunately, these methods are heuristic in nature, because they lack a clear theoretical basis. This work presents a formal model of synthetic data generation that preserves pair correlations.


How to Cite

Topchyan, V. H. (2021). Pair Correlations Preserving Model in Synthetic Data Generation. Mathematical Problems of Computer Science, 41, 81–92. Retrieved from