Hierarchical Cluster Analysis for Partially Synthetic Data Generation
Keywords:
Confidentiality, Multiple imputation, Synthetic data, Hierarchical clustering
Abstract
Limiting the risk of information disclosure is now a routine task for statistical agencies. One widespread approach is to release synthetic, public-use microdata sets: the values of sensitive variables in the original data are replaced with synthetic values generated by multiple imputation. This paper introduces a method for partially synthetic data generation based on hierarchical cluster analysis.
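The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes that records are grouped by average-linkage agglomerative clustering on their non-sensitive variables, and that each sensitive value is then replaced by a value resampled from the same cluster. The within-cluster resampling is a simple stand-in for a proper imputation model. The function names, the record fields (`age`, `hours`, `income`) and the toy data are all hypothetical.

```python
import random
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative_clusters(points, n_clusters):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two
    clusters with the smallest mean pairwise distance until only
    n_clusters remain. Returns a list of lists of row indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def synthesize(records, sensitive_key, feature_keys, n_clusters, m, seed=0):
    """Return m partially synthetic copies of records.

    Assumption (not from the paper): each sensitive value is replaced by
    a draw, with replacement, from the observed sensitive values in the
    record's own cluster. Non-sensitive variables are left unchanged.
    """
    rng = random.Random(seed)
    points = [[r[k] for k in feature_keys] for r in records]
    clusters = agglomerative_clusters(points, n_clusters)
    copies = []
    for _ in range(m):
        synthetic = [dict(r) for r in records]  # copy; originals untouched
        for members in clusters:
            donors = [records[i][sensitive_key] for i in members]
            for i in members:
                synthetic[i][sensitive_key] = rng.choice(donors)
        copies.append(synthetic)
    return copies


# Hypothetical toy microdata: income is treated as the sensitive variable.
records = [
    {"age": 25, "hours": 40, "income": 30000},
    {"age": 27, "hours": 38, "income": 32000},
    {"age": 26, "hours": 41, "income": 31000},
    {"age": 58, "hours": 20, "income": 90000},
    {"age": 60, "hours": 22, "income": 85000},
    {"age": 61, "hours": 18, "income": 88000},
]
synthetic_sets = synthesize(records, "income", ["age", "hours"],
                            n_clusters=2, m=3)
```

Each of the `m` released copies keeps the non-sensitive variables as observed and differs only in the sensitive column, which is what makes the data partially rather than fully synthetic. In practice the clustering variables would usually be standardized first, and the number of clusters would need to balance disclosure risk against analytic utility.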
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.