Hierarchical Cluster Analysis for Partially Synthetic Data Generation

Levon H. Aslanyan; Vardan H. Topchyan

Authors

Levon H. Aslanyan, Institute for Informatics and Automation Problems of NAS RA
Vardan H. Topchyan, Institute for Informatics and Automation Problems of NAS RA

Keywords:

Confidentiality, Multiple imputation, Synthetic data, Hierarchical clustering

Abstract

Limiting the risk of information disclosure is now a routine task for statistical agencies. One widespread approach is to release synthetic, public use microdata sets: through multiple imputation, the values of sensitive variables in the original data are replaced by new, synthetic values. This paper introduces a method for partially synthetic data generation based on hierarchical cluster analysis.

