
Shrinking a large dataset to identify variables associated with increased risk of Plasmodium falciparum infection in Western Kenya

Published online by Cambridge University Press: 16 April 2015

M. TREMBLAY*
Affiliation:
Departments of Medicine and Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
J. S. DAHM
Affiliation:
Departments of Medicine and Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
C. N. WAMAE
Affiliation:
Center for Microbiology Research, Kenya Medical Research Institute (KEMRI), Nairobi, Kenya; School of Health Sciences, Mount Kenya University, Thika, Kenya
W. A. DE GLANVILLE
Affiliation:
Centre for Immunity, Infection and Evolution, Institute for Immunology and Infection Research, School of Biological Sciences, University of Edinburgh, Ashworth Laboratories, Edinburgh, UK; International Livestock Research Institute, Nairobi, Kenya
E. M. FÈVRE
Affiliation:
International Livestock Research Institute, Nairobi, Kenya; Institute of Infection and Global Health, University of Liverpool, Leahurst Campus, Neston, UK
D. DÖPFER
Affiliation:
Departments of Medicine and Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
* Author for correspondence: Dr M. Tremblay, Department of Medical Sciences, School of Veterinary Medicine, 2015 Linden Drive, Madison, WI 53706, USA. (Email: mtremblay@wisc.edu)

Summary

Large datasets are often not amenable to analysis using traditional single-step approaches. Our general objective was to apply imputation techniques, principal component analysis (PCA), elastic net and generalized linear models to a large dataset in a systematic approach to extract the most meaningful predictors of a health outcome. We extracted predictors of Plasmodium falciparum infection from a large covariate dataset with a limited number of observations, using data from the People, Animals, and their Zoonoses (PAZ) project to demonstrate these techniques. The data, collected from 415 homesteads in western Kenya, contained over 1500 variables describing the health, environment, and social factors of the humans, the livestock, and the homesteads in which they reside. The wide, sparse dataset was reduced to 42 predictors of P. falciparum malaria infection, and wealth rankings were produced for all homesteads. The 42 predictors make biological sense and are supported by previous studies. The systematic data-mining approach used here would make many large datasets more manageable and informative for decision-making processes and health policy prioritization.
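
To make two of the steps named in the summary concrete, the following is a minimal sketch of mean imputation followed by elastic net variable selection for a binary outcome. It is not the authors' code: the data are synthetic, the function names (`impute_mean`, `standardize`, `elastic_net_logistic`) and the penalty settings (`lam`, `alpha`) are hypothetical, and a simple proximal gradient solver stands in for whatever software the study used. The PCA step and the final generalized linear model refit are omitted.

```python
import math
import random


def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def standardize(rows):
    """Scale columns to zero mean and unit variance so the penalty treats them equally."""
    n, p = len(rows), len(rows[0])
    out = [list(r) for r in rows]
    for j in range(p):
        col = [r[j] for r in rows]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / n) or 1.0
        for i in range(n):
            out[i][j] = (rows[i][j] - mu) / sd
    return out


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_logistic(X, y, lam=0.1, alpha=0.5, step=0.5, iters=300):
    """Logistic regression with penalty lam * (alpha * L1 + (1 - alpha) / 2 * L2^2),
    fitted by proximal gradient descent. The intercept is not penalized."""
    n, p = len(X), len(X[0])
    beta, intercept = [0.0] * p, 0.0
    for _ in range(iters):
        grad, grad0 = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = intercept + sum(b * v for b, v in zip(beta, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad0 += err / n
            for j in range(p):
                grad[j] += err * xi[j] / n
        intercept -= step * grad0
        for j in range(p):
            smooth = grad[j] + lam * (1 - alpha) * beta[j]  # loss + ridge gradient
            beta[j] = soft_threshold(beta[j] - step * smooth, step * lam * alpha)
    return intercept, beta


# Synthetic data (an assumption for illustration): 200 observations, 6 covariates,
# with the outcome driven only by columns 0 and 1.
random.seed(0)
raw, y = [], []
for _ in range(200):
    x = [random.gauss(0, 1) for _ in range(6)]
    logit = 2.0 * x[0] - 2.0 * x[1]
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-logit)) else 0)
    # Knock out about 10% of entries to mimic a sparse dataset.
    raw.append([None if random.random() < 0.1 else v for v in x])

imputed = impute_mean(raw)
X = standardize(imputed)
intercept, beta = elastic_net_logistic(X, y)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 3) for b in beta])
print("selected columns:", selected)
```

Columns whose coefficients are shrunk exactly to zero are dropped from `selected`; in a workflow like the one summarized above, the surviving columns would then be carried forward to an unpenalized generalized linear model.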

Information

Type
Original Papers
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © Cambridge University Press 2015
Table 1. Number of variables per dataset at each step

Table 2. List of asset wealth variables by variable type

Table 3. List of livestock wealth variables by variable type

Table 4. Cross-validation, elastic net and GLM parameters

Table 5. Subset A: Generalized linear model results*

Table 6. Subset B: Generalized linear model results*