JEROME P. REITER

Mrs. Alexander Hehmeyer Associate Professor of Statistical Science
Department of Statistical Science
Duke University

Summary of research on statistical disclosure limitation


Below are descriptions of my research on various aspects of statistical disclosure limitation, including assessing risk and utility, synthetic data methods, remote access servers, and secure analyses of distributed data.  I also have included papers with links.

Assessing disclosure risk and data utility
When considering the release of data sets to the public, statistical agencies face competing objectives. They seek to provide users with sufficiently detailed data and also to guard the confidentiality of survey respondents. Commonly used methods for meeting these objectives include coarsening or recoding variables and swapping data for selected units. However, these methods can compromise estimation by distorting relationships among variables in the data set.  To determine reasonable strategies, agencies need to quantify the disclosure risks and data utility of proposed releases.  In this research, I examine ways of quantifying risk and utility.

Published Papers
1.  Reiter, J. P. (2005)  Estimating risks of identification disclosure for microdata.  Journal of the American Statistical Association, 100, 1103 - 1113.
2.  Karr, A. F., Kohnen, C. N.,  Oganian, A., Reiter, J. P. and  Sanil, A. P. (2006), "A framework for evaluating the utility of data altered to protect confidentiality," The American Statistician, 60, 224 - 232.
3.  Woo, M., Reiter, J. P., Oganian, A., Karr, A. F.  (2009) "Global measures of data utility for microdata masked for disclosure limitation," Journal of Privacy and Confidentiality, 1.1, 111 - 124.
4.  Reiter, J. P., Oganian, A., and Karr, A. F. (2009), "Verification servers: enabling analysts to assess the quality of inferences from public use data," Computational Statistics and Data Analysis, 53, 1475 - 1482.

Synthetic data methods
Rubin (1993, Journal of Official Statistics) proposed that agencies generate and release synthetic data with characteristics similar to those of the collected data.  That is, agencies release data sets in which the values of some variables are simulated rather than actual, collected values.  This approach can protect confidentiality and, with estimation methods based on the concepts of multiple imputation for missing data, can allow data users to obtain valid and straightforward inferences for a variety of estimands.  In this research, I have been developing theoretical methods for analyzing synthetic datasets.  The next steps involve implementing the approach on genuine data.
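
As a rough illustration of the estimation step, the sketch below combines results from m partially synthetic data sets.  It assumes the analyst has already computed a point estimate and its variance estimate on each data set, and it uses the combining rule commonly given for partial synthesis: the final estimate is the average of the m point estimates, with variance estimate equal to the average within-dataset variance plus the between-dataset variance divided by m.  The function name and arguments are hypothetical and chosen only for this example.  Fully synthetic data and data with both missing values and synthesis call for different variance formulas.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic data sets.

    estimates: point estimates of the same quantity, one per data set
    variances: the corresponding variance estimates, one per data set
    Returns (q_bar, t_p): the combined estimate and its variance estimate.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")

    q_bar = mean(estimates)      # average of the m point estimates
    b = variance(estimates)      # between-dataset sample variance (divisor m - 1)
    u_bar = mean(variances)      # average within-dataset variance
    t_p = u_bar + b / m          # assumed rule for partial synthesis
    return q_bar, t_p


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    q_bar, t_p = combine_partially_synthetic([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(q_bar, t_p)
```

An interval estimate would typically be formed from q_bar and the square root of t_p with a t reference distribution; the degrees of freedom are omitted here to keep the sketch short.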
 
Published Papers
1.  Reiter, J. P. (2002)  Satisfying disclosure restrictions with synthetic data sets.  Journal of Official Statistics, 18, 531-544.
2.  Raghunathan, T. E., Reiter, J. P.,  and Rubin, D. B. (2003)  Multiple imputation for statistical disclosure limitation.  Journal of Official Statistics, 19, 1-16.
3.  Reiter, J. P. (2003)  Inference for partially synthetic, public use microdata sets.  Survey Methodology, 29, 181-188.
4.  Reiter, J. P. (2004) New approaches to data dissemination: A glimpse into the future (?).  Chance, 17:3 (Summer 2004), 12 - 16.
5.  Reiter, J. P. (2004)  Simultaneous use of multiple imputation for missing data and disclosure limitation.  Survey Methodology 30, 235 - 242.
6.  Reiter, J. P. (2005)  Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study.  Journal of the Royal Statistical Society, Series A, 168, 185 - 205.
7.  Reiter, J. P. (2005)  Significance tests for multi-component estimands from multiply-imputed, synthetic microdata.  Journal of Statistical Planning and Inference, 131, 365 - 377.
8.  Reiter, J. P. (2005)  Using CART to generate partially synthetic public use microdata.   Journal of Official Statistics, 21, 441 - 462.
9.  Mitra, R. and Reiter, J. P. (2006), "Adjusting survey weights when altering identifying design variables via synthetic data," in Privacy in Statistical Databases 2006, Lecture Notes in Computer Science, New York: Springer-Verlag, 177 - 188
10.  Reiter, J. P. and Raghunathan, T. E. (2007), "The multiple adaptations of multiple imputation," Journal of the American Statistical Association, 102, 1462 - 1471.
11.  Reiter, J. P. (2008), "Selecting the number of imputed datasets when using multiple imputation for missing data and disclosure limitation," Statistics and Probability Letters, 78, 15 - 20.
12.  Reiter, J. P.  (2008)  "Protecting data confidentiality in public release datasets: Approaches based on multiple imputation," The Imputation Bulletin, 8.2, 1 - 6.
13.  Drechsler, J. and Reiter, J. P. (2008), "Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data," Privacy in Statistical Databases (Lecture Notes in Computer Science 5262), ed. J. Domingo-Ferrer and Y. Saygin, Springer, 227 - 238.
14.  Kohnen, C. N. and Reiter, J. P. (2009), "Multiple imputation for combining confidential data owned by two agencies," Journal of the Royal Statistical Society, Series A, 172, 511 - 528.
15.  Reiter, J. P. (2009), "Using multiple imputation to integrate and disseminate confidential microdata", International Statistical Review, 77, 179 - 195.
16.  Reiter, J. P. and Mitra, R. (2009), "Estimating risks of identification disclosure in partially synthetic data," Journal of Privacy and Confidentiality, 1.1, 99 - 110.
17. Drechsler, J. and Reiter, J. P. (2009) "Disclosure risk and data utility for partially synthetic data: An empirical study using the German IAB Establishment Survey," Journal of Official Statistics, 25, 589 - 603.
18.  Reiter, J. P. and Drechsler, J. (2010), "Two stage multiple imputation to protect confidentiality," Statistica Sinica, 20, 405 - 422.
19. Caiola, G. and Reiter, J. P. (2010), "Random forests for generating partially synthetic, categorical data," Transactions on Data Privacy, 3:1, 27 - 42.
20. Kinney, S. K. and Reiter, J. P. (forthcoming) "Tests of multivariate hypotheses when using multiple imputation for missing data and partial synthesis." Journal of Official Statistics.

Remote access servers
Federal agencies are increasingly concerned that public releases of microdata may allow data users to learn sensitive information about survey respondents.  One approach to data protection is never to let users see the data.  Instead, they would submit analyses to a remote server that would report back output from the fitted model.  Remote servers may still leave opportunities for disclosure.  For example, releasing diagnostics like residuals might allow users to back into sensitive values.  Or, users may be able to learn sensitive values by submitting certain types of models.  In this research, I have been investigating the risks and utility of the remote server approach.

Published Papers
1.  Reiter, J. P. (2003)  Model diagnostics for remote-access regression servers.  Statistics and Computing, 13, 371-380.
2.  Reiter, J. P. (2004)  New approaches to data dissemination: A glimpse into the future (?).  Chance, 17:3 (Summer 2004), 12 - 16.
3.  Gomatam, S., Karr, A. F., Reiter, J. P., Sanil, A. (2005)  Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access servers. Statistical Science, 20, 163 - 177.
4.  Reiter, J. P. and Kohnen, C. N. (2005)  Categorical data regression diagnostics for remote servers.  Journal of Statistical Computation and Simulation, 75, 889 - 903.

Secure analyses of distributed databases

Federal statistical agencies and other data collectors can obtain more accurate inferences by combining the data they collect.  For example, data collected on the same variables by different agencies can be included in one large model, thereby reducing standard errors when fitting models.  Or, different agencies may have different variables, and they seek to merge those variables into one dataset.  These examples of data integration and distribution are complicated by confidentiality concerns.  Some agencies may not want to share exact values of their data with other agencies, although they do want the output from models fit to these shared datasets.   In this research, I have been developing methods for integrating data and distributing analyses safely across multiple agencies.
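
To give a flavor of how agencies holding the same variables might pool summary quantities without revealing their own values, here is a minimal simulation of ring-based secure summation, a generic building block in the secure computation literature.  It is a sketch under stated assumptions, not the protocol from any specific paper listed below: the function name is hypothetical, each agency's value is assumed to be a non-negative integer, the true total is assumed to be smaller than a publicly agreed modulus, and the agencies are assumed to follow the protocol honestly without colluding.

```python
import secrets

# Assumed public modulus; it must exceed any possible true total.
MODULUS = 2**64


def secure_sum(local_values, modulus=MODULUS):
    """Simulate secure summation of one non-negative integer per agency.

    The first agency adds a secret random mask to its value and passes the
    result on.  Each later agency adds its own value to the running total it
    receives.  The first agency finally removes the mask, so only the overall
    total is revealed.
    """
    if not local_values:
        raise ValueError("need at least one agency")

    mask = secrets.randbelow(modulus)              # known only to the first agency
    running = (mask + local_values[0]) % modulus   # first agency's masked value
    for value in local_values[1:]:
        # Each agency sees only a masked partial sum, never a raw value.
        running = (running + value) % modulus
    return (running - mask) % modulus              # first agency removes the mask


if __name__ == "__main__":
    # Made-up counts held by three agencies, for illustration only.
    print(secure_sum([10, 20, 30]))
```

Applied entry by entry to sums such as counts, totals, and cross-products, a routine like this would let agencies assemble the pooled sufficient statistics for a linear regression; real-valued quantities would first need a fixed-point encoding, which is omitted here.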

Published Papers
1.  Karr, A. F.,  Lin, X., Sanil, A. P. and Reiter, J. P. (2004),  "Analysis of integrated data without data integration," Chance, 17:3 (Summer 2004), 27 - 30.
2.  Kohnen, C. N. and Reiter, J. P.  (2004),  "Sharing confidential data among multiple agencies using multiply imputed, synthetic data," ASA Proceedings of the Joint Statistical Meetings.
3.  Reiter, J. P., Kohnen, C. N., Karr, A. F., Lin, X., and Sanil, A. P.  (2004), "Secure regression for vertically partitioned, partially overlapping data,"  ASA Proceedings of the Joint Statistical Meetings.
4.  Sanil, A. P., Karr, A. F., Lin, X., and Reiter, J. P. (2004), "Privacy preserving regression modelling via distributed computation," Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 677-682 (peer reviewed).
5.  Karr, A. F., Lin, X., Sanil, A. P., and Reiter, J. P. (2005), "Secure regressions on distributed databases,"  Journal of Computational and Graphical Statistics, 14, 263 - 279.
6.  Karr, A. F., Feng, J., Lin, X., Sanil, A. P., Young, S. S., and Reiter, J. P. (2005), "Secure analysis of distributed chemical databases without data integration," Journal of Computer-Aided Molecular Design, 19, 739 - 747.
7.  Karr, A. F., Lin, X., Sanil, A. P., and Reiter, J. P. (2006),  "Secure statistical analysis of distributed databases," in Statistical Methods in Counterterrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication.  Edited by A. Wilson, G. Wilson, and D. Olwell.  New York: Springer, 237 - 262.
8.  Karr, A. F., Lin, X., Reiter, J. P.  and Sanil, A. P. (2006),  "Methods of secure computation and data integration," in Monographs of Official Statistics: Work Session on Statistical Data Confidentiality, edited by P. Diaz Munoz and H. Brungger, Eurostat, 217 - 226.
9.  Ghosh, J., Reiter, J. P. and Karr, A. F. (2007), "Secure computation with horizontally partitioned data using adaptive regression splines,"  Computational Statistics and Data Analysis, 51, 5813 - 5820.
10.  Karr, A. F., Fulp, W. J., Vera, F., Young, S. S., Lin, X., and Reiter, J. P.  (2007) "Secure, privacy-preserving analysis of distributed databases,"  Technometrics, 49, 335 - 345.
11.  Karr, A. F., Lin, X.,  Reiter, J. P. and  Sanil, A. P. (2009),  "Privacy preserving analysis of vertically partitioned data using secure matrix protocols," Journal of Official Statistics, 25, 125 - 138.