JEROME P. REITER
Mrs. Alexander Hehmeyer Associate Professor of Statistical Science
Department of Statistical Science
Duke University
Summary of research on statistical disclosure limitation
Below are descriptions of my research on various aspects of statistical
disclosure limitation, including assessing risk and utility, synthetic
data methods, remote access servers, and secure analyses of distributed
data.
I also have included papers with links.
Assessing disclosure risk and data utility
When considering the release of data sets to the public, statistical
agencies face competing objectives. They seek to provide users with
sufficiently detailed data and also to guard the confidentiality of survey
respondents. Commonly used methods for meeting these objectives include
coarsening or recoding variables and swapping data for selected units.
However, these methods can compromise estimation by distorting
relationships among variables in the data set. To determine reasonable
strategies, agencies need to quantify the disclosure risks and data
utility of proposed releases. In this research, I examine ways of
quantifying risk and utility.
Published Papers
1. Reiter, J. P. (2005), "Estimating risks of identification disclosure for
microdata," Journal of the American Statistical Association, 100, 1103 - 1113.
2. Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P., and Sanil, A. P.
(2006), "A framework for evaluating the utility of data altered to protect
confidentiality," The American Statistician, 60, 224 - 232.
3. Woo, M., Reiter, J. P., Oganian, A., and Karr, A. F. (2009), "Global
measures of data utility for microdata masked for disclosure limitation,"
Journal of Privacy and Confidentiality, 1.1, 111 - 124.
4. Reiter, J. P., Oganian, A., and Karr, A. F. (2009), "Verification servers:
enabling analysts to assess the quality of inferences from public use data,"
Computational Statistics and Data Analysis, 53, 1475 - 1482.
Synthetic data methods
Rubin (1993, JOS) proposed that agencies generate and release synthetic
data with characteristics similar to those of the collected data. That is,
agencies release data sets in which the values of some variables are not
actual, collected values. This approach can protect confidentiality and,
with estimation methods based on the concepts of multiple imputation for
missing data, can allow data users to obtain valid and straightforward
inferences for a variety of estimands. In this research, I have been
developing theoretical methods for analyzing synthetic datasets. The next
steps involve implementing the approach on genuine data.
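
To make the idea of multiple-imputation-style inference concrete, here is a
minimal sketch of how a data user might combine results from m partially
synthetic datasets for a scalar estimand. It uses the variance formula for
partial synthesis, T = u_bar + b / m, as developed in the partially synthetic
inference paper listed below. The function and variable names are
illustrative assumptions, and fully synthetic data require a different
variance formula.

```python
from statistics import mean, variance


def combine_partially_synthetic(point_estimates, variance_estimates):
    """Combine results from m partially synthetic datasets (scalar estimand).

    point_estimates[i]    -- estimate computed on synthetic dataset i
    variance_estimates[i] -- estimated variance of that estimate
    Names are hypothetical; this is a sketch, not a reference implementation.
    """
    m = len(point_estimates)
    if m < 2 or m != len(variance_estimates):
        raise ValueError("need at least two datasets and matching lengths")
    q_bar = mean(point_estimates)      # overall point estimate
    b = variance(point_estimates)      # between-dataset variance (divisor m - 1)
    u_bar = mean(variance_estimates)   # average within-dataset variance
    total_var = u_bar + b / m          # variance estimate for partial synthesis
    return q_bar, total_var


# Illustrative numbers only: three synthetic datasets.
q_bar, total_var = combine_partially_synthetic([10.0, 12.0, 11.0],
                                               [4.0, 4.0, 4.0])
```

The user only needs the point estimate and its variance from each released
dataset, so standard complete-data software can be run on each dataset
before combining.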
Published Papers
1. Reiter, J. P. (2002), "Satisfying disclosure restrictions with synthetic
data sets," Journal of Official Statistics, 18, 531 - 544.
2. Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003), "Multiple
imputation for statistical disclosure limitation," Journal of Official
Statistics, 19, 1 - 16.
3. Reiter, J. P. (2003), "Inference for partially synthetic, public use
microdata sets," Survey Methodology, 29, 181 - 188.
4. Reiter, J. P. (2004), "New approaches to data dissemination: A glimpse
into the future (?)," Chance, 17:3 (Summer 2004), 12 - 16.
5. Reiter, J. P. (2004), "Simultaneous use of multiple imputation for
missing data and disclosure limitation," Survey Methodology, 30, 235 - 242.
6. Reiter, J. P. (2005), "Releasing multiply-imputed, synthetic public use
microdata: An illustration and empirical study," Journal of the Royal
Statistical Society, Series A, 168, 185 - 205.
7. Reiter, J. P. (2005), "Significance tests for multi-component estimands
from multiply-imputed, synthetic microdata," Journal of Statistical Planning
and Inference, 131, 365 - 377.
8. Reiter, J. P. (2005), "Using CART to generate partially synthetic public
use microdata," Journal of Official Statistics, 21, 441 - 462.
9. Mitra, R. and Reiter, J. P. (2006), "Adjusting survey weights when
altering identifying design variables via synthetic data," in Privacy in
Statistical Databases 2006, Lecture Notes in Computer Science, New York:
Springer-Verlag, 177 - 188.
10. Reiter, J. P. and Raghunathan, T. E. (2007), "The multiple adaptations
of multiple imputation," Journal of the American Statistical Association,
102, 1462 - 1471.
11. Reiter, J. P. (2008), "Selecting the number of imputed datasets when
using multiple imputation for missing data and disclosure limitation,"
Statistics and Probability Letters, 78, 15 - 20.
12. Reiter, J. P. (2008), "Protecting data confidentiality in public release
datasets: Approaches based on multiple imputation," The Imputation Bulletin,
8.2, 1 - 6.
13. Drechsler, J. and Reiter, J. P. (2008), "Accounting for intruder
uncertainty due to sampling when estimating identification disclosure risks
in partially synthetic data," Privacy in Statistical Databases (Lecture
Notes in Computer Science 5262), ed. J. Domingo-Ferrer and Y. Saygin,
Springer, 227 - 238.
14. Kohnen, C. N. and Reiter, J. P. (2009), "Multiple imputation for
combining confidential data owned by two agencies," Journal of the Royal
Statistical Society, Series A, 172, 511 - 528.
15. Reiter, J. P. (2009), "Using multiple imputation to integrate and
disseminate confidential microdata," International Statistical Review, 77,
179 - 195.
16. Reiter, J. P. and Mitra, R. (2009), "Estimating risks of identification
disclosure in partially synthetic data," Journal of Privacy and
Confidentiality, 1.1, 99 - 110.
17. Drechsler, J. and Reiter, J. P. (2009), "Disclosure risk and data utility for partially
synthetic data: An empirical study using the German IAB Establishment Survey,"
Journal of Official Statistics, 25, 589 - 603.
18. Reiter, J. P. and Drechsler, J. (2010), "Two stage
multiple imputation to protect confidentiality," Statistica Sinica, 20, 405 - 422.
19. Caiola, G. and Reiter, J. P. (2010), "Random forests for generating partially
synthetic, categorical data," Transactions on Data Privacy,
3:1, 27 - 42.
20. Kinney, S. K. and Reiter, J. P. (forthcoming), "Tests of multivariate hypotheses when
using multiple imputation for missing data and partial synthesis," Journal of Official
Statistics.
Remote access servers
Federal agencies are increasingly concerned that public releases of
microdata may allow data users to learn sensitive information about survey
respondents. One approach to data protection is never to allow users to
see the data. Instead, users would submit analyses to a remote server that
would report back output from the fitted model. There still may be
opportunities for disclosures with remote servers. For example, releasing
diagnostics like residuals might allow users to back into sensitive
values. Or, users may be able to learn sensitive values by submitting
certain types of models. In this research, I have been investigating the
risks and utility of the remote server approach.
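
The residual risk mentioned above can be illustrated with a small sketch. It
assumes a hypothetical server that fits a simple linear regression and
returns the coefficients together with the exact residuals, and a user who
already knows the predictor values. All data and function names here are
invented for illustration.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y regressed on a single x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Server side: y is the confidential response (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [30.0, 41.0, 39.0, 58.0, 62.0]
intercept, slope = fit_simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# User side: knows x and receives only intercept, slope, and residuals.
# Fitted value plus exact residual reproduces each confidential response.
recovered = [intercept + slope * xi + r for xi, r in zip(x, residuals)]
```

Because each response equals its fitted value plus its residual, releasing
exact residuals alongside the coefficients discloses the responses whenever
the predictors are known, which is why remote servers need diagnostics that
do not release exact residuals.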
Published Papers
1. Reiter, J. P. (2003), "Model diagnostics for remote-access regression
servers," Statistics and Computing, 13, 371 - 380.
2. Reiter, J. P. (2004), "New approaches to data dissemination: A glimpse
into the future (?)," Chance, 17:3 (Summer 2004), 12 - 16.
3. Gomatam, S., Karr, A. F., Reiter, J. P., and Sanil, A. (2005), "Data
dissemination and disclosure limitation in a world without microdata: A
risk-utility framework for remote access servers," Statistical Science, 20,
163 - 177.
4. Reiter, J. P. and Kohnen, C. N. (2005), "Categorical data regression
diagnostics for remote servers," Journal of Statistical Computation and
Simulation, 75, 889 - 903.
Secure analyses of distributed databases
Federal statistical agencies and other data collectors can obtain more
accurate inferences by combining the data they collect. For example, data
collected on the same variables by different agencies can be combined in
one large model, thereby reducing standard errors. Or, different agencies
may have different variables, and they seek to merge those variables into
one dataset. These examples of data integration and distribution are
complicated by confidentiality concerns. Some agencies may not want to
share exact values of their data with other agencies, although they do
want the output from models fit to these shared datasets. In this
research, I have been developing methods for integrating data and
distributing analyses safely across multiple agencies.
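
As a rough illustration of the first case (the same variables held by
different agencies), the sketch below pools a simple linear regression
without any agency revealing its records. Each agency computes local
sufficient statistics, and the totals are obtained with a masked running sum.
This is a simplified, single-process simulation under the assumption of at
least three agencies that follow the protocol honestly. The function names
and data are hypothetical, and the sketch is not presented as the exact
protocol of the papers listed below.

```python
import random


def local_statistics(x, y):
    """Sufficient statistics for simple linear regression at one agency."""
    return [float(len(x)), sum(x), sum(y),
            sum(xi * xi for xi in x),
            sum(xi * yi for xi, yi in zip(x, y))]


def secure_sum(local_vectors, rng):
    """Masked running sum, simulated here in one process.

    The first agency starts the running total at a random mask, each agency
    in turn adds its own vector, and the first agency removes the mask at
    the end. No agency sees another agency's individual contribution.
    """
    k = len(local_vectors[0])
    mask = [rng.uniform(-1e6, 1e6) for _ in range(k)]
    running = list(mask)
    for vec in local_vectors:  # in practice, one step per agency
        running = [r + v for r, v in zip(running, vec)]
    return [r - m for r, m in zip(running, mask)]


def pooled_ols(stats):
    """Intercept and slope from the pooled sufficient statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical records held by three agencies on the same two variables.
agencies = [([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
            ([4.0, 5.0], [8.1, 9.8]),
            ([6.0, 7.0, 8.0], [12.2, 13.9, 16.1])]
totals = secure_sum([local_statistics(x, y) for x, y in agencies],
                    random.Random(0))
intercept, slope = pooled_ols(totals)
```

The pooled coefficients match, up to floating-point rounding, what a single
analyst would get from the stacked records, yet only summed statistics are
ever shared. Floating-point masks are used for brevity; a careful
implementation would mask with modular integer arithmetic.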
Published Papers
1. Karr, A. F., Lin, X., Sanil, A. P., and Reiter, J. P. (2004), "Analysis
of integrated data without data integration," Chance, 17:3 (Summer 2004),
27 - 30.
2. Kohnen, C. N. and Reiter, J. P. (2004), "Sharing confidential data among
multiple agencies using multiply imputed, synthetic data," ASA Proceedings
of the Joint Statistical Meetings.
3. Reiter, J. P., Kohnen, C. N., Karr, A. F., Lin, X., and Sanil, A. P.
(2004), "Secure regression for vertically partitioned, partially overlapping
data," ASA Proceedings of the Joint Statistical Meetings.
4. Sanil, A. P., Karr, A. F., Lin, X., and Reiter, J. P. (2004), "Privacy
preserving regression modelling via distributed computation," Proceedings of
the Tenth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 677 - 682 (peer reviewed).
5. Karr, A. F., Lin, X., Sanil, A. P., and Reiter, J. P. (2005), "Secure
regressions on distributed databases," Journal of Computational and
Graphical Statistics, 14, 263 - 279.
6. Karr, A. F., Feng, J., Lin, X., Sanil, A. P., Young, S. S., and Reiter,
J. P. (2005), "Secure analysis of distributed chemical databases without
data integration," Journal of Computer-Aided Molecular Design, 19,
739 - 747.
7. Karr, A. F., Lin, X., Sanil, A. P., and Reiter, J. P. (2006), "Secure
statistical analysis of distributed databases," in Statistical Methods in
Counterterrorism: Game Theory, Modeling, Syndromic Surveillance, and
Biometric Authentication, edited by A. Wilson, G. Wilson, and D. Olwell,
New York: Springer, 237 - 262.
8. Karr, A. F., Lin, X., Reiter, J. P., and Sanil, A. P. (2006), "Methods of
secure computation and data integration," in Monographs of Official
Statistics: Work Session on Statistical Data Confidentiality, edited by
P. Diaz Munoz and H. Brungger, Eurostat, 217 - 226.
9. Ghosh, J., Reiter, J. P., and Karr, A. F. (2007), "Secure computation
with horizontally partitioned data using adaptive regression splines,"
Computational Statistics and Data Analysis, 51, 5813 - 5820.
10. Karr, A. F., Fulp, W. J., Vera, F., Young, S. S., Lin, X., and Reiter,
J. P. (2007), "Secure, privacy-preserving analysis of distributed
databases," Technometrics, 49, 335 - 345.
11. Karr, A. F., Lin, X., Reiter, J. P., and Sanil, A. P. (2009), "Privacy
preserving analysis of vertically partitioned data using secure matrix
protocols," Journal of Official Statistics, 25, 125 - 138.