Book contents
- Frontmatter
- Contents
- Preface
- Acknowledgments
- 1 Introduction
- 2 Volume: Data Acquisition, Storage, and Retrieval
- 3 Vagueness: Natural Language and Semantics
- 4 Variety: Classification and Clustering
- 5 Virality: Networks and Information Propagation
- 6 Velocity: Online Methods and Data Streams
- 7 Volunteers: Humanitarian Crowdsourcing
- 8 Veracity: Misinformation and Credibility
- 9 Validity: Biases and Pitfalls of Social Media Data
- 10 Visualization: Crisis Maps and Beyond
- 11 Values: Privacy and Ethics
- 12 Conclusions and Outlook
- Bibliography
- Index
- Terms and Acronyms
9 - Validity: Biases and Pitfalls of Social Media Data
Published online by Cambridge University Press: 05 July 2016
Summary
In 2008 Google launched Flu Trends, showing that the search volume of certain terms in a region was strongly correlated with the level of flu activity in that region. Google also found that the increase in the use of flu-related terms happened days before health care authorities were able to report an increase in flu cases. The reasons are twofold: there are delays in the official data collection from hospitals, and people search for symptoms before visiting a doctor. Despite its success, Flu Trends was not beyond criticism. Lazer et al. (2014) highlighted a series of issues with its predictions, including a systematic bias that produced an overestimate in 100 of the 108 weeks analyzed during a two-year period. As a more general criticism, Lazer et al. denounced this as an example of big data hubris: the assumption that a large dataset can be a substitute for, rather than a supplement to, traditional analysis methods. The popular press has lambasted "big data fundamentalism," the idea that larger datasets imply more objective results.
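To make the two claims above concrete, the sketch below shows how one might measure both the correlation between a weekly search-volume signal and official case counts, and the kind of systematic overestimation Lazer et al. reported. The data and the miscalibrated "model" are hypothetical placeholders, not Google's actual method or figures; this is a minimal illustration, assuming simple weekly time series.

```python
# Minimal sketch (hypothetical data): checking how well a search-volume
# signal tracks official flu counts, and whether a model built on it
# systematically overestimates. Not Google's actual method or data.
from statistics import correlation  # Pearson's r; Python 3.10+

# Hypothetical weekly series: official flu cases and flu-related search volume.
official_cases = [120, 150, 210, 340, 500, 460, 380, 260, 180, 140]
search_volume = [1.1, 1.4, 2.0, 3.1, 4.9, 4.6, 3.5, 2.6, 1.9, 1.5]

# Claim 1: search volume is strongly correlated with flu activity.
r = correlation(search_volume, official_cases)
print(f"Pearson r between search volume and cases: {r:.2f}")

# Claim 2 (the criticism): a strongly correlated signal can still feed a
# systematically biased model. Here a toy linear model whose scale factor
# is set too high, so it overestimates in most weeks -- the pattern Lazer
# et al. (2014) reported for 100 of 108 weeks.
predicted = [110 * v for v in search_volume]  # deliberately miscalibrated
overestimated = sum(p > o for p, o in zip(predicted, official_cases))
print(f"Overestimated {overestimated} of {len(official_cases)} weeks")
```

The point of the sketch is that correlation and calibration are separate questions: a high Pearson r says nothing about whether the fitted predictions run consistently above the ground truth.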
Social science researchers have embraced and criticized, sometimes at the same time, the use of large-scale datasets from social media. Social media, as a reflection of social interactions at large scale and in digitally accessible formats, provides a larger quantity of data at a much lower cost than alternative sources, such as surveys or direct observation. However, the infamous "streetlight effect" may be at play here: scientists may be inclined to search for evidence where it is easiest to look, rather than where better evidence is likely to be found.
The representativeness of social media and other types of digital traces, and their lack of context, are often cited as key reasons to distrust conclusions based solely on them. "Just because you see traces of data doesn't mean you always know the intention or cultural logic behind them. And just because you have a big N doesn't mean that it's representative or generalizable." For instance, methods that use trends found in Twitter data as direct predictors of political election results have been to a large extent debunked (Gayo-Avello, 2012).
This chapter warns against a naïve interpretation of results obtained from social media data collected during emergencies. The quality of social media data for this purpose is affected by at least two types of factors.
Big Crisis Data: Social Media in Disasters and Time-Critical Situations, pp. 123-137
Publisher: Cambridge University Press
Print publication year: 2016