Comparing methods for handling missing values in food-frequency questionnaires and proposing k nearest neighbours imputation: effects on dietary intake in the Norwegian Women and Cancer study (NOWAC)

Published online by Cambridge University Press:  01 April 2008

Christine L Parr*
Affiliation:
Institute of Basic Medical Sciences, Department of Biostatistics, University of Oslo, PO Box 1122 Blindern, N-0317 Oslo, Norway
Anette Hjartåker
Affiliation:
Cancer Registry of Norway, Institute of Population-based Cancer Research, Norway
Ida Scheel
Affiliation:
Department of Mathematics, Statistics Division, University of Oslo, Norway
Eiliv Lund
Affiliation:
Institute of Community Medicine, University of Tromsø, Norway
Petter Laake
Affiliation:
Institute of Basic Medical Sciences, Department of Biostatistics, University of Oslo, PO Box 1122 Blindern, N-0317 Oslo, Norway
Marit B Veierød
Affiliation:
Institute of Basic Medical Sciences, Department of Biostatistics, University of Oslo, PO Box 1122 Blindern, N-0317 Oslo, Norway
*Corresponding author. Email: c.l.parr@medisin.uio.no

Abstract

Objective

To investigate item non-response in a postal food-frequency questionnaire (FFQ), and to assess the effect of substituting/imputing missing values on dietary intake levels in the Norwegian Women and Cancer study (NOWAC). We adapted k nearest neighbours (KNN) imputation and, probably for the first time, applied it to FFQ data.

Design

Data from a recent reproducibility study were used. The FFQ was mailed twice (test–retest) about 3 months apart to the same subjects. Missing responses in the test FFQ were imputed using the null value (frequencies = null, amount = smallest), the sample mode, the sample median, KNN, and retest values.
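
The imputation strategies listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the coding of answers as integers (0 for the null frequency), the mean squared distance over jointly answered items, the use of the neighbours' lower median as the imputed value, and the names `impute_constant` and `impute_knn` are all assumptions made for illustration. The sketch also assumes every item has at least one observed answer.

```python
from statistics import median_low, mode

MISSING = None  # assumed marker for an unanswered item


def impute_constant(data, how):
    """Fill each item (column) with the null value, sample mode or sample median."""
    filled = [row[:] for row in data]  # copy so the input is not modified
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not MISSING]
        if how == "null":
            value = 0  # assumed coding: 0 = null frequency
        elif how == "mode":
            value = mode(observed)
        elif how == "median":
            value = median_low(observed)  # stays on an observed category
        else:
            raise ValueError(f"unknown method: {how}")
        for row in filled:
            if row[j] is MISSING:
                row[j] = value
    return filled


def distance(a, b):
    """Root-mean-squared difference over items answered by both respondents."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return float("inf")
    return (sum((x - y) ** 2 for x, y in shared) / len(shared)) ** 0.5


def impute_knn(data, k):
    """Fill each missing answer from the k most similar respondents who answered it."""
    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        missing_items = [j for j, v in enumerate(row) if v is MISSING]
        if not missing_items:
            continue
        # Rank all other respondents by similarity to respondent i.
        ranked = sorted(
            (distance(row, other), idx) for idx, other in enumerate(data) if idx != i
        )
        for j in missing_items:
            donors = [
                data[idx][j]
                for d, idx in ranked
                if d != float("inf") and data[idx][j] is not MISSING
            ][:k]
            if donors:  # leave the gap if no usable neighbour exists
                filled[i][j] = median_low(donors)
    return filled


# Toy data: 5 respondents, 4 items, answers coded as ordinal categories.
ffq = [
    [0, 1, 2, 3],
    [0, 1, 2, None],
    [4, 5, 5, 0],
    [4, None, 5, 0],
    [0, 1, 3, 3],
]
knn_filled = impute_knn(ffq, k=1)
null_filled = impute_constant(ffq, "null")
```

In this toy example the second respondent resembles the first, so KNN with k = 1 borrows that neighbour's answer for the missing item, whereas null value imputation fills in 0 regardless of the rest of the response pattern.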

Setting

A methodological substudy of NOWAC, a national population-based cohort.

Subjects

A random sample of 2000 women aged 46–75 years was drawn from the cohort in 2002 (response rate 75%). The imputation methods were compared for the 1430 women who completed at least 50% of the test FFQ.

Results

We imputed the 16% of values that were missing in the overall test data matrix. Compared with null value imputation, the largest differences in estimated dietary intake were seen for KNN, and for food items with a high proportion of missing values. Imputation with retest values increased total energy intake, indicating that not all missing values are caused by respondents failing to specify no consumption, and that null value imputation may lead to underestimation and misclassification.

Conclusion

Missing values in FFQs present a methodological challenge. We encourage the application and evaluation of newer imputation methods, including KNN, which may reduce imputation errors and give more accurate intake estimates.

Type: Research Paper

Copyright © The Authors 2007
Fig. 1 RMSE for different values of the parameter k in KNN imputation of the test food-frequency questionnaire with non-missing retest data as the reference. The curve is compared to imputation with the null, mode and median values. The RMSE for test vs. retest for the subset of values which was non-missing in both data sets is also included (n = 1430) (KNN – k nearest neighbours; RMSE – root-mean-squared error)
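
The comparison described in this caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function names `rmse_imputed` and `rmse_test_retest`, the use of `None` for missing answers, and the toy matrices are assumptions. Imputation error is evaluated only on cells that were missing in the test FFQ but answered in the retest, and the test vs. retest RMSE uses only cells answered in both.

```python
import math


def rmse_imputed(imputed, test, retest):
    """RMSE of imputed values against retest answers.

    Only cells missing in the test FFQ and non-missing in the retest count.
    """
    squared = [
        (imp - ref) ** 2
        for imp_row, test_row, ref_row in zip(imputed, test, retest)
        for imp, obs, ref in zip(imp_row, test_row, ref_row)
        if obs is None and ref is not None
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmse_test_retest(test, retest):
    """RMSE between test and retest over cells answered in both."""
    squared = [
        (obs - ref) ** 2
        for test_row, ref_row in zip(test, retest)
        for obs, ref in zip(test_row, ref_row)
        if obs is not None and ref is not None
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy matrices (rows = respondents, columns = items); None marks a missing answer.
test = [[0, None, 2], [None, 3, 1]]
retest = [[0, 2, 3], [1, 3, None]]

# Two hypothetical imputations of the same test matrix.
null_imputed = [[0, 0, 2], [0, 3, 1]]
other_imputed = [[0, 2, 2], [0, 3, 1]]

rmse_null = rmse_imputed(null_imputed, test, retest)
rmse_other = rmse_imputed(other_imputed, test, retest)
rmse_repeat = rmse_test_retest(test, retest)
```

Repeating `rmse_imputed` for KNN imputations over a range of k, and plotting the results alongside the values for the null, mode and median imputations, would give a curve of the kind the caption describes.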

Table 1 Median (quartiles) for the proportion of missing values (%) in the food-frequency questionnaire (test and retest) by selected background characteristics of the respondents, n = 1496*

Table 2 The intake of food groups (g day−1) after imputing missing values in the test food-frequency questionnaire. Within-person differences (D) are presented as mean and median (quartiles) for imputation with retest values*, mode, median and KNN, relative to imputation with the null value, n = 1430

Table 3 The daily intake of energy and selected nutrients after imputing missing values in the test food-frequency questionnaire. Within-person differences (D) are presented as mean and median (quartiles) for imputation with retest values*, mode, median and KNN, relative to imputation with the null value, n = 1430