Published online by Cambridge University Press: 05 July 2014
Big data pose several interesting and new challenges to statisticians and others who want to extract information from data. As Groves pointedly commented, the era is “appropriately called Big Data as opposed to Big Information,” because there is a lot of work for analysts before information can be gained from “auxiliary traces of some process that is going on in the society.” The analytic challenges most often discussed are those related to three of the Vs that are used to characterize big data. The volume of truly massive data requires expansion of processing techniques that match modern hardware infrastructure, cloud computing with appropriate optimization mechanisms, and re-engineering of storage systems. The velocity of the data calls for algorithms that allow learning and updating on a continuous basis, and of course the computing infrastructure to do so. Finally, the variety of the data structures requires statistical methods that more easily allow for the combination of different data types collected at different levels, sometimes with a temporal and geographic structure.
However, when it comes to privacy and confidentiality, the challenges of extracting (meaningful) information from big data are in our view similar to those associated with data of much smaller size, surveys being one example. For any statistician or quantitative working (social) scientist there are two main concerns when extracting information from data, which we summarize here as concerns about measurement and concerns about inference. Both of these aspects can be implicated by privacy and confidentiality concerns.