Statistically Valid Inferences from Privacy-Protected Data

Published online by Cambridge University Press: 08 February 2023

GEORGINA EVANS*
Affiliation:
Harvard University, United States
GARY KING*
Affiliation:
Harvard University, United States
MARGARET SCHWENZFEIER*
Affiliation:
Harvard University, United States
ABHRADEEP THAKURTA*
Affiliation:
Google Brain, United States
*
Georgina Evans, PhD Candidate, Department of Government, Harvard University, United States, georgieaevans@gmail.com.
Gary King, Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Harvard University, United States, King@Harvard.edu.
Margaret Schwenzfeier, PhD Candidate, Department of Government, Harvard University, United States, schwenzfeier@g.harvard.edu.
Abhradeep Thakurta, Senior Research Scientist, Google Brain, athakurta@google.com.

Abstract

Unprecedented quantities of data that could help social scientists understand and ameliorate the challenges of human society are presently locked away inside companies, governments, and other organizations, in part because of privacy concerns. We address this problem with a general-purpose data access and analysis system with mathematical guarantees of privacy for research subjects, and statistical validity guarantees for researchers seeking social science insights. We build on the standard of “differential privacy,” correct for biases induced by the privacy-preserving procedures, provide a proper accounting of uncertainty, and impose minimal constraints on the choice of statistical methods and quantities estimated. We illustrate by replicating key analyses from two recent published articles and show how we can obtain approximately the same substantive results while simultaneously protecting privacy. Our approach is simple to use and computationally efficient; we also offer open-source software that implements all our methods.
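
To make the source of the bias mentioned above concrete, here is a minimal sketch of a partition, censor, average, and add-noise mechanism. It is not the authors' algorithm (that is given in Figure 1) and is not taken from their software. The function names, the use of the mean as the estimator, and the choice of Laplace noise are all assumptions made for illustration. The bound `lam` plays the role of $ \Lambda $ in the figure captions, and `num_partitions` the role of P.

```python
import random
import statistics


def clamp(x, lam):
    """Censor x to the interval [-lam, lam]."""
    return max(-lam, min(lam, x))


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_partition_mean(data, num_partitions, lam, epsilon, rng):
    """Hypothetical helper: noisy average of censored per-partition means.

    Assumption: each record falls in exactly one partition, so changing one
    record moves one censored estimate by at most 2 * lam and the average
    by at most 2 * lam / num_partitions.
    """
    if not 0 < num_partitions <= len(data):
        raise ValueError("need between 1 and len(data) partitions")
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Disjoint partitions: every num_partitions-th record, one offset each.
    partitions = [shuffled[i::num_partitions] for i in range(num_partitions)]
    censored = [clamp(statistics.fmean(p), lam) for p in partitions]
    sensitivity = 2.0 * lam / num_partitions
    return statistics.fmean(censored) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.8, 3.0) for _ in range(10_000)]
    print(private_partition_mean(data, 100, 1.0, 1.0, rng))
```

Censoring is what makes the sensitivity bounded, but it also pulls partition estimates that fall outside the bound back to $ \pm \Lambda $, so the uncorrected output is biased toward the interior of the interval whenever the bound binds. That is the kind of bias the article sets out to correct.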

Information

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press on behalf of the American Political Science Association

Figure 1. Differentially Private Algorithm (Before Bias Correction)

Figure 2. Underlying Distributions (Before Estimation). The censored distribution includes the orange area and spikes at $ -\Lambda $ and $ \Lambda $

Figure 3. Monte Carlo Simulations: Bias of the Uncorrected ($ {\widehat{\theta}}^{\mathrm{dp}} $) and Corrected ($ {\overset{\sim }{\theta}}^{\mathrm{dp}} $) Estimates, and (in the Bottom-Right Panel) the Standard Error of the True Uncorrected ($ {\mathrm{SE}}_{{\widehat{\theta}}^{\mathrm{dp}}} $), True Corrected ($ {\mathrm{SE}}_{{\overset{\sim }{\theta}}^{\mathrm{dp}}} $), and Estimated Corrected ($ {\hat{\mathrm{SE}}}_{{\overset{\sim }{\theta}}^{\mathrm{dp}}} $) Estimates (the Latter Two Having Almost Identical Values). Note: “True” SEs refer to the actual standard deviation of a point estimate.

Figure 4. Performance across P (Number of Data Partitions) for Fixed Privacy Budget (ϵ=1) and Sample Size (N=100k)

Figure 5. Original versus Privacy-Protected Data Analyses

Figure 6. Distribution of Estimate across 200 Runs of Our Algorithm, at Varying Levels of $ \Lambda $

Supplementary material

Evans et al. Dataset (link)
Evans et al. supplementary material: Online Appendix (PDF, 252.1 KB)