Big data: what it can and cannot achieve

  • Peter Schofield (a1) and Jayati Das-Munshi (a2)

This article looks at the use of large datasets of health records, typically linked with other data sources, in mental health research. The most comprehensive examples of this kind of ‘big data’ are found in Scandinavian countries, although there are also many useful sources in the UK. Studies using big data in UK mental health research have produced a number of promising methodological innovations, including hybrid study designs, data linkage and enhanced study recruitment. It is, however, important to be aware of the limitations of research using big data, particularly the various pitfalls in analysis. We therefore caution against abandoning traditional research designs and argue that other data sources are equally valuable; ideally, research should incorporate data from a range of sources.


  • Be aware of major big data resources relevant to mental health research
  • Be aware of key advantages and innovative study designs using these data sources
  • Understand the inherent limitations to studies reliant on big data alone



Corresponding author
Correspondence: Peter Schofield, School of Population Health & Environmental Sciences, Faculty of Life Sciences and Medicine, King’s College London, 3rd Floor, Addison House, Guy’s Campus, London SE1 1UL. Email:
BJPsych Advances
  • ISSN: 2056-4678
  • EISSN: 2056-4686
  • URL: /core/journals/bjpsych-advances