Extracting Information from Big Data: Issues of Measurement, Inference and Linkage

doi:10.1017/CBO9781107590205.016

Chapter 12

Published online by Cambridge University Press: 05 July 2014

Frauke Kreuter and Roger D. Peng

Edited by Julia Lane, Victoria Stodden, Stefan Bender and Helen Nissenbaum

Frauke Kreuter: University of Maryland
Roger D. Peng: Johns Hopkins Bloomberg School of Public Health
Julia Lane: American Institutes for Research, Washington DC
Victoria Stodden: Columbia University, New York
Stefan Bender: Institute for Employment Research of the German Federal Employment Agency
Helen Nissenbaum: New York University

Summary

Introduction

Big data pose several new and interesting challenges to statisticians and others who want to extract information from data. As Groves pointedly commented, the era is “appropriately called Big Data as opposed to Big Information,” because analysts have a lot of work to do before information can be gained from “auxiliary traces of some process that is going on in the society.” The analytic challenges most often discussed relate to three of the Vs used to characterize big data. The volume of truly massive data requires expanding processing techniques to match modern hardware infrastructure, cloud computing with appropriate optimization mechanisms, and re-engineered storage systems. The velocity of the data calls for algorithms that learn and update on a continuous basis and, of course, for the computing infrastructure to run them. Finally, the variety of data structures requires statistical methods that more easily combine different data types collected at different levels, sometimes with a temporal and geographic structure.

However, when it comes to privacy and confidentiality, the challenges of extracting (meaningful) information from big data are, in our view, similar to those associated with much smaller data, surveys being one example. Any statistician or quantitatively oriented (social) scientist has two main concerns when extracting information from data, which we summarize here as concerns about measurement and concerns about inference. Both can be affected by privacy and confidentiality concerns.

Information

Type: Chapter
Information: Privacy, Big Data, and the Public Good: Frameworks for Engagement, pp. 257–275

DOI: https://doi.org/10.1017/CBO9781107590205.016

Publisher: Cambridge University Press

Print publication year: 2014

References

O’Neil, C. and Schutt, R., Doing Data Science (Sebastopol, CA: O’Reilly Media, 2014)

Groves, R. M. and Lyberg, L., “Total Survey Error,” Public Opinion Quarterly 74, no. 5 (2010): 849–879

Valliant, R., Dever, J. A., and Kreuter, F., Practical Tools for Sampling and Weighting (New York: Springer, 2013)

Rosenbaum, P. R. and Rubin, D. B., “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika 70, no. 1 (April 1983): 41–55

Frangakis, C. and Rubin, D., “Principal Stratification in Causal Inference,” Biometrics 58 (2002): 21–29

Singer, E., “Confidentiality, Risk Perception, and Survey Participation,” Chance 17, no. 3 (2004): 30–34

Singer, E., Mathiowetz, N., and Couper, M. P., “The Role of Privacy and Confidentiality as Factors in Response to the 1990 Census,” Public Opinion Quarterly 57 (1993): 465–482

Groves, R. and Peytcheva, E., “The Impact of Nonresponse Rates on Nonresponse Bias: A Meta-Analysis,” Public Opinion Quarterly 72, no. 2 (2008): 167–189

Groves, R. M., “Three Eras of Survey Research,” Public Opinion Quarterly 75, no. 5 (2011): 861–871

Schmieder, J., von Wachter, T., and Bender, S., “The Effects of Extended Unemployment Insurance over the Business Cycle: Evidence from Regression Discontinuity Estimates over 20 Years,” Quarterly Journal of Economics 127, no. 2 (2012): 701–752

Card, D., Heining, J., and Kline, P., “Workplace Heterogeneity and the Rise of West German Wage Inequality,” Quarterly Journal of Economics 128, no. 3 (2013): 967–1015

Tourangeau, R. and Yan, T., “Sensitive Questions in Surveys,” Psychological Bulletin 133, no. 5 (2007): 859–883

Brown, V. R. and Vaughn, E. D., “The Writing on the (Facebook) Wall: The Use of Social Networking Sites in Hiring Decisions,” Journal of Business and Psychology 26, no. 2 (2011): 219–225

Karl, K., Peluchette, J., and Schlaegel, C., “Who’s Posting Facebook Faux Pas? A Cross-Cultural Examination of Personality Differences,” International Journal of Selection and Assessment 18, no. 2 (2010): 174–186

Kreuter, F., Presser, S., and Tourangeau, R., “Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity,” Public Opinion Quarterly 72, no. 5 (2008): 847–865

Couper, M. P., “Is the Sky Falling? New Technology, Changing Media, and the Future of Surveys,” Survey Research Methods 7, no. 3 (2013): 145–156

Prewitt, K., “The 2012 Morris Hansen Lecture: Thank you Morris, et al., for Westat, et al.,” Journal of Official Statistics 29, no. 2 (2013): 223–231

Yan, T. and Olson, K., “Analyzing Paradata to Investigate Measurement Error,” in Improving Surveys with Paradata: Making Use of Process Information, ed. Kreuter, F. (Hoboken, NJ: Wiley, 2013)

Mayer-Schönberger, V. and Cukier, K., Big Data: A Revolution That Will Transform How We Live, Work and Think (London: John Murray, 2013)

Lessler, J. T. and Kalsbeek, W. D., Nonsampling Error in Surveys (Hoboken, NJ: Wiley, 1992)

Bosnjak, M., Haas, I., Galesic, M., Kaczmirek, L., Bandilla, W., and Couper, M. P., “Sample Composition Discrepancies in Different Stages of a Probability-Based Online Panel,” Field Methods 25, no. 4 (2013): 339–360

Dever, J. A., Rafferty, A., and Valliant, R., “Internet Surveys: Can Statistical Adjustment Eliminate Coverage Bias?” Survey Research Methods 2, no. 2 (2008): 47–62

Couper, M. P., Kapteyn, A., Schonlau, M., and Winter, J., “Noncoverage and Nonresponse in an Internet Survey,” Social Science Research 36, no. 1 (2007): 131–148

Schonlau, M., Van Soest, A., Kapteyn, A., and Couper, M., “Selection Bias in Web Surveys and the Use of Propensity Scores,” Sociological Methods and Research 37, no. 3 (2009): 291–318

Singer, E., “Toward a Benefit-Cost Theory of Survey Participation: Evidence, Further Tests, and Implications,” Journal of Official Statistics 27, no. 2 (2011): 379–392

Zandbergen, P. A., “Accuracy of iPhone Locations: A Comparison of Assisted GPS, WiFi and Cellular Positioning,” Transactions in GIS 13, no. s1 (2009): 5–25

Kosinski, M., Stillwell, D., and Graepel, T., “Private Traits and Attributes are Predictable from Digital Records of Human Behavior,” Proceedings of the National Academy of Sciences 110, no. 15 (2013): 5802–5805

Valliant, R. and Dever, J., “Estimating Propensity Adjustments for Volunteer Web Surveys,” Sociological Methods and Research 40 (2011): 105–137

Dever, J., Rafferty, A., and Valliant, R., “Internet Surveys: Can Statistical Adjustments Eliminate Coverage Bias?” Survey Research Methods 2 (2008): 47–60

Couper, “Is the Sky Falling,” and AAPOR, “Report of the AAPOR Task Force on Non-Probability Sampling,” Journal of Survey Statistics and Methodology 1 (2013): 90–143

Massey, Douglas S. and Tourangeau, Roger, The Nonresponse Challenge to Surveys and Statistics, ANNALS of the American Academy of Political and Social Science Series 645 (Thousand Oaks, CA: Sage, 2013)

Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J., “The Use of Propensity Scores to Assess the Generalizability of Results from Randomized Trials,” Journal of the Royal Statistical Society, Series A 174, no. 2 (2011): 369–386

Cole, S. R. and Stuart, E. A., “Generalizing Evidence from Randomized Clinical Trials to Target Populations: The ACTG-320 Trial,” American Journal of Epidemiology 172 (2010): 107–115

Smith, T., “The Report of the International Workshop on Using Multi-Level Data from Sample Frames, Auxiliary Databases, Paradata and Related Sources to Detect and Adjust for Nonresponse Bias in Surveys,” International Journal of Public Opinion Research 23 (2011): 389–402

Sakshaug, J. and Kreuter, F., “Assessing the Magnitude of Non-Consent Biases in Linked Survey and Administrative Data,” Survey Research Methods 6, no. 2 (2012): 113–122

Singer, E., Hippler, H. J., and Schwarz, N., “Confidentiality Assurances in Surveys: Reassurance or Threat?” International Journal of Public Opinion Research 4, no. 3 (1992): 256–268

Bates, N., Dahlhamer, J., and Singer, E., “Privacy Concerns, Too Busy, or Just Not Interested: Using Doorstep Concerns to Predict Survey Nonresponse,” Journal of Official Statistics 24, no. 4 (2008): 591–612

Couper, M. P., Singer, E., Conrad, F. G., and Groves, R. M., “Experimental Studies of Disclosure Risk, Disclosure Harm, Topic Sensitivity, and Survey Participation,” Journal of Official Statistics 26, no. 2 (2010): 287–300

Sakshaug, J., Tutz, V., and Kreuter, F., “Placement, Wording, and Interviewers: Identifying Correlates of Consent to Link Survey and Administrative Data,” Survey Research Methods 7, no. 2 (2013): 133–144

Schnell, R., “Combining Surveys with Non-Questionnaire Data: Overview and Introduction,” in Improving Surveys Methods: Lessons from Recent Research, ed. Engel, U., Jann, B., Lynn, P., Scherpenzeel, A., and Sturgis, P. (New York: Psychology Press, 2014)

Eckman, S. and English, N., “Creating Housing Unit Frames from Address Databases Geocoding Precision and Net Coverage Rates,” Field Methods 24, no. 4 (2012): 399–408

Groves, R., “Designed Data” and “Organic Data,” Director’s Blog, (accessed January 20, 2014)

Kohavi, R., Longbotham, R., Sommerfield, D., and Henne, R. M., “Controlled Experiments on the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery 18 (2009): 140–181
