Comparing methods for handling missing values in food-frequency questionnaires and proposing k nearest neighbours imputation: effects on dietary intake in the Norwegian Women and Cancer study (NOWAC)

Christine L Parr; Anette Hjartåker; Ida Scheel; Eiliv Lund; Petter Laake; Marit B Veierød

doi:10.1017/S1368980007000365

Published online by Cambridge University Press: 01 April 2008

Christine L Parr*: Institute of Basic Medical Sciences, Department of Biostatistics, University of Oslo, PO Box 1122 Blindern, N-0317 Oslo, Norway
Anette Hjartåker: Cancer Registry of Norway, Institute of Population-based Cancer Research, Norway
Ida Scheel: Department of Mathematics, Statistics Division, University of Oslo, Norway
Eiliv Lund: Institute of Community Medicine, University of Tromsø, Norway
Petter Laake: Institute of Basic Medical Sciences, Department of Biostatistics, University of Oslo, PO Box 1122 Blindern, N-0317 Oslo, Norway
Marit B Veierød: Institute of Basic Medical Sciences, Department of Biostatistics, University of Oslo, PO Box 1122 Blindern, N-0317 Oslo, Norway
*Corresponding author: Email c.l.parr@medisin.uio.no

Abstract

Objective

To investigate item non-response in a postal food-frequency questionnaire (FFQ), and to assess the effect of substituting/imputing missing values on dietary intake levels in the Norwegian Women and Cancer study (NOWAC). We have adapted k nearest neighbours (KNN) imputation and, probably for the first time, applied it to FFQ data.

Design

Data from a recent reproducibility study were used. The FFQ was mailed twice (test–retest) about 3 months apart to the same subjects. Missing responses in the test FFQ were imputed using the null value (frequencies = null, amount = smallest), the sample mode, the sample median, KNN, and retest values.

Setting

A methodological substudy of NOWAC, a national population-based cohort.

Subjects

A random sample of 2000 women aged 46–75 years was drawn from the cohort in 2002 (response 75%). The imputation methods were compared for 1430 women who completed at least 50% of the test FFQ.

Results

We imputed 16% missing values in the overall test data matrix. Compared to null value imputation, the largest differences in estimated dietary intake were seen for KNN, and for food items with a high proportion of missing values. Imputation with retest values increased total energy intake, indicating that not all missing values are caused by respondents failing to specify no consumption, and that null value imputation may lead to underestimation and misclassification.

Conclusion

Missing values in FFQs present a methodological challenge. We encourage the application and evaluation of newer imputation methods, including KNN, which may reduce imputation errors and give more accurate intake estimates.

Keywords

Food-frequency questionnaire; Missing values; Non-response; Imputation; Data quality; Bias

Information

Type: Research Paper
Information: Public Health Nutrition, Volume 11, Issue 4, April 2008, pp. 361–370

DOI: https://doi.org/10.1017/S1368980007000365
Copyright: Copyright © The Authors 2007

Postal food-frequency questionnaires (FFQs) have long been the standard tool for assessing diet in large-scale nutritional epidemiological studies. The method has many advantages but respondents may return incomplete FFQs, in particular when there is no in-person contact. Missing answers to individual food items, referred to as item non-response, present both computational and conceptual problems in the estimation of dietary intake. One approach to handling missing values is imputation, or the practice of ‘filling in’ plausible values for the skipped items. Imputation is practical because it creates a complete data set at the outset, which can be used as input for dietary intake calculation programs. It prevents loss of statistical power caused by subject exclusion and potential selection bias when the exclusion is related to characteristics of the subjects. On the other hand, it may seem conceptually problematic. Item non-response usually occurs for reasons unknown to the researcher, and imputation may distort estimates, standard errors and P values of tests¹. However, statistical procedures for handling missing data are a developing field and the methods are improving², although there are few examples of more advanced methods being applied to FFQ data. Challenges may include a large number of variables, which must be aggregated to calculate the intake of food groups and nutrients, few respondents with complete data, dependent questions (e.g. frequency and portion size for the same food item) and an underlying missing-data mechanism that is not completely random.
Most studies using FFQs do not describe how item non-response has been handled, as previously pointed out³, but assuming no consumption⁴–⁶, imputing the median value⁷, or a combination of the two⁸ appear to be the most common practices after an initial exclusion of subjects. A few studies have evaluated different single imputation methods, including the null value compared to either the complete data⁴,⁹ or the median value³. Multiple imputation (MI), a more modern procedure, was compared to single imputation and analysis of the complete data in the GISSI-Prevenzione study¹⁰. MI has also been applied in the Nurses’ Health Study¹¹. However, the published literature on the estimation of missing values in FFQs is still small, and there is currently no recommended practice.

The objective of our study was to investigate item non-response in a postal FFQ, and to assess how different methods for imputing missing values affect dietary intake levels using data from the Norwegian Women and Cancer study (NOWAC). We compare the common methods in nutritional epidemiology of imputing the null value, or the median value, to k nearest neighbours (KNN) imputation, a widely used procedure for missing entries in microarray data¹². We have here adapted and applied KNN imputation to FFQ data, which is new to our knowledge. We also impute the mode value and use a repeated measurement of the FFQ on the same subjects to investigate the assumption that missing values imply no consumption.

The missing data problem in dietary intake calculations

The calculation of dietary intake from an FFQ is a chain of arithmetic operations and data aggregation. The first step is generally to estimate the food weights (grams per day) by converting the reported consumption frequencies to intakes per day and multiplying by the usual portion sizes (reported in the FFQ or determined by the investigator). The food weights are then added directly for each individual to create food group variables, e.g. ‘dairy products’. The food weights are also multiplied by the nutrient and energy values per 100 g of food (adjusted for inedible waste) from a food composition table or database. The contribution from each food is subsequently added for each individual to create nutrient and energy intake variables. Non-response to food frequencies or portion sizes will generate missing food weights and missing values in aggregated variables. An illustrative example is energy intake, a key variable in many analyses. Because most foods provide energy, total energy intake from the FFQ will usually be missing for respondents skipping one or more food item(s). If the majority has skipped at least one item, it may seem more reasonable to add the contributions from the available items rather than to report missing for energy intake. But to just add the non-missing items, or to impute missing with the null value, may lead to underestimation and biased results, unless it is certain that the skipped food items were not consumed. Some calculation programs add non-missing items automatically and make it easy to overlook the missing data problem, while others may require a complete data set. Since dietary intake is often calculated with programs that are questionnaire- or study-specific and not commercially available, it is impossible to know how non-response has been handled, unless explicitly stated. The EPIC-Norfolk study sets a good example by describing their computer program¹³.
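The calculation chain and the two ways a program may treat missing food weights can be illustrated with a minimal Python sketch. The frequency categories, portion sizes and energy values below are made up for illustration, and the function names are hypothetical; this is not the NOWAC calculation program.

```python
from typing import Dict, Optional

# Hypothetical frequency categories converted to times per day (illustrative only).
FREQ_PER_DAY = {"never/rarely": 0.0, "1 per week": 1 / 7, "1 per day": 1.0}


def food_weight(freq: Optional[str], portion_g: Optional[float]) -> Optional[float]:
    """Grams per day for one food item; None if either answer is missing."""
    if freq is None or portion_g is None:
        return None
    return FREQ_PER_DAY[freq] * portion_g


def energy_kj(weights: Dict[str, Optional[float]],
              kj_per_100g: Dict[str, float],
              skip_missing: bool) -> Optional[float]:
    """Total energy per day; strict mode returns None if any food weight is missing."""
    total = 0.0
    for item, grams in weights.items():
        if grams is None:
            if skip_missing:
                continue  # same result as imputing the null value
            return None
        total += grams * kj_per_100g[item] / 100.0
    return total


# One respondent who skipped the frequency question for bread.
answers = {"milk": ("1 per day", 200.0), "bread": (None, 30.0)}
kj_per_100g = {"milk": 200.0, "bread": 1000.0}  # made-up composition values

weights = {item: food_weight(f, p) for item, (f, p) in answers.items()}
strict_total = energy_kj(weights, kj_per_100g, skip_missing=False)   # None
lenient_total = energy_kj(weights, kj_per_100g, skip_missing=True)   # 400.0
```

In strict mode the respondent has no energy estimate at all, while the lenient mode silently reports only the energy from milk, which is the underestimation risk described above.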

Subjects and methods

Study design

NOWAC is a national population-based cohort study primarily designed to study risk factors for cancer, with 102 443 women enrolled at age 30–70 years from 1991 to 1997. The cohort has been described in detail elsewhere¹⁴. Part of NOWAC constitutes the Norwegian sub-cohort in the European Prospective Investigation into Cancer and Nutrition (EPIC). Updated information about NOWAC can be found on the website (http://uit.no/kk/nowac/). Exposure information is collected by a self-instructive health and lifestyle questionnaire (eight pages) developed specifically for the cohort. The questionnaire is administered by post and optically read. The FFQ covers four consecutive pages within the larger questionnaire. The present study uses data from the reproducibility study of the FFQ¹⁵. The questionnaire was mailed twice (test and retest) to the same subjects, about 3 months apart in February/March and May/June 2002. In the present study, we have imputed missing values in the test FFQ. The retest was used to study how missing responses in the test were reported 3 months later.

Subjects

A random sample of 2000 women was drawn from the cohort for the reproducibility study. The procedure has been previously described¹⁵. Five women had not given informed consent to further contact and were therefore excluded. The retest questionnaire was returned by 1496 (75%) of the 1995 women, but three had left the entire FFQ section blank. The imputation methods were compared for 1430 (96%) of the 1496 women who had completed at least 50% of the test FFQ. The exclusion was done to study the effects of imputation in a sample likely to be included in a regular epidemiological analysis of e.g. diet and cancer. All subject characteristics were based on self-reported information in the test and retest questionnaires, except for age, which was taken from the national population registry.

The FFQ

The FFQ structure and the dietary intake calculations have been described elsewhere¹⁵, as well as the reproducibility and validity of the questionnaire¹⁵,¹⁶. In short, the FFQ is designed to assess habitual diet over the past year, with emphasis on fish and other traditional food items in the study population. The FFQ is mostly structured as smaller blocks or grids with two to nine similar items grouped together under a question heading, but with some single questions about only one food item. Food quantity is estimated by assigning standard portions, or by separate portion size questions. The response options for consumption frequencies and portion sizes are predefined and listed in increasing order with check boxes to facilitate completion and optical reading. For consumption frequencies, the first alternative is always ‘never/rarely’.

The dietary intake was calculated from a total of 132 questions (consumption frequencies = 91, types of fat used on bread = 7, amounts = 28, time of year for the consumption of different species of fish = 6). The food groups were based on the classification system in the EPIC-SOFT program for conducting 24-hour dietary recalls in the EPIC study¹⁷, but with some modifications¹⁵. The daily intake of food groups, energy, and nutrients was calculated using an analysis program developed at the Institute of Community Medicine, University of Tromsø, for SAS software.

Definition of missing values

Not all non-responses in an FFQ need be considered missing values, e.g. if respondents are directed to skip questions that are not relevant. In the present study, missing portion sizes were permitted if the consumption frequency was ‘never/rarely’. Missing frequencies were permitted for fish if the preceding question about the time of year for consumption was ‘never/rarely’. To identify users of cod liver oil supplements and alcohol, an introductory yes/no question was included, e.g. ‘Do you take cod liver oil (liquid)?’ If the answer was ‘no’, it was permitted to skip further questions about consumption.

The response option ‘Do not use fat on bread’ was listed before types (maximum 7) and the usual amount/layer (e.g. ‘thin’) on a slice of bread could be specified. ‘Do not use fat on bread’ and types of fat composed a group of eight separate 0/1 variables, each with one check box to confirm ‘yes’. This layout presents a challenge because the answer ‘no’ cannot be distinguished from a missing value when the box is left open. However, if both the use of fat and all types of fat had open boxes, this was interpreted as missing information and defined as one missing value because either ‘Do not use fat on bread’ or at least one type should have been marked.

Item non-response in the FFQ was evaluated for 136 questions (132 about frequencies, amounts, types of fat and seasonality, and four yes/no questions about user status). However, because the eight check boxes for fat on bread could give only one missing value, the maximum number of missing values was 129 (136 − 8 + 1).

Methods used for imputing missing values

No consumption and the smallest portion size

The original NOWAC program for calculating dietary intake imputes missing consumption frequencies with the null value (no consumption) and missing portion sizes with the smallest portion, for a conservative intake estimate. Thus, the food quantity will be null for all missing frequencies. When all information about fat on bread is missing, the null value is imputed. If only type is missing, the most common type (soft margarine) is imputed. A factor of 0.5 corresponding to half the year is assigned for missing information about seasons for the consumption of different species of fish. When the initial yes/no questions about the use of cod liver oil supplements and alcohol are missing or create inconsistent responses (e.g. do not take supplements, but the frequency is every day), the frequency response has priority. The method described here was used as the reference and compared to the other methods described below. In all methods, permitted missing values were treated as null intake.
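The core of the reference rule (null value for missing frequencies, smallest portion for missing amounts) can be sketched in a few lines of Python. The function and variable names are hypothetical, None marks a missing answer, and the special cases for fat on bread, fish seasons and user status are left out.

```python
def impute_null(frequencies, portions, smallest_portion):
    """Missing frequency -> 0 (no consumption); missing portion -> smallest portion."""
    freq = {item: (0.0 if value is None else value)
            for item, value in frequencies.items()}
    port = {item: (smallest_portion[item] if value is None else value)
            for item, value in portions.items()}
    return freq, port


# Illustrative answers: times per day and portion size in grams.
frequencies = {"potatoes": 1.0, "instant coffee": None}
portions = {"potatoes": None, "instant coffee": 250.0}
smallest = {"potatoes": 70.0, "instant coffee": 125.0}  # made-up portion sizes

freq, port = impute_null(frequencies, portions, smallest)
# Food quantity is frequency times portion, so a missing frequency always gives 0 g.
quantity = {item: freq[item] * port[item] for item in freq}
```

Because a missing frequency is set to zero, the portion size attached to that item has no influence on the estimated quantity, which is why this method gives the most conservative intake estimate.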

Mode and median

Substitution by the sample mode or median may be described as cross-sectional imputation techniques, since the values are taken from the available data in the same data set. Missing values for user status were imputed by the most frequent answer, i.e. to use fat on bread and drink alcohol, but not take cod liver oil supplements. Missing values in frequencies and amounts were then imputed based on reported or imputed user status (null value for non-users, and the mode/median for users). Most users specified one type of fat on bread. Therefore, the most common type was imputed.
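A small Python sketch shows how the sample mode and the sample median can differ for the same ordered response categories. The category codes (0 = ‘never/rarely’, higher codes = more frequent) and the data are invented for illustration, and using `median_low` so that the result is always an existing category is an assumption of this sketch, not a detail taken from the study.

```python
import statistics


def impute_with(values, statistic):
    """Replace None by the chosen sample statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


# Coded frequency answers from eight users; None marks a missing answer.
answers = [0, 0, 0, 1, 1, 2, 2, None]

mode_imputed = impute_with(answers, statistics.mode)          # fills with 0
median_imputed = impute_with(answers, statistics.median_low)  # fills with 1
```

Here ‘never/rarely’ is the single most frequent answer, yet more than half of the observed answers indicate some consumption, so the median fills in a higher category than the mode.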

Retest values

Missing values in the first FFQ measurement (test) were imputed by non-missing values in the second measurement (retest) of the same individual. This may be regarded as longitudinal imputation, although the time to the retest was relatively short. The consumption frequency and amount questions for a given food item were imputed as a pair, i.e. if one value was missing in the test then both values were taken from the retest. In the case where the pair of retest values (frequency and amount) was incomplete, the missing test value was imputed if the retest value was available. There were retest answers for 50% of the values missing in the test. Residual missing values in frequencies were treated as null intake and residual missing in amounts as the smallest portion size.
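One possible reading of the pairwise rule is sketched below in Python. The function name is hypothetical, None marks a missing value, and the handling of an incomplete retest pair is our interpretation of the description above rather than the authors' program.

```python
def impute_from_retest(test, retest, smallest_portion):
    """test and retest are (frequency, amount) pairs for one food item."""
    t_freq, t_amt = test
    r_freq, r_amt = retest
    if t_freq is not None and t_amt is not None:
        return (t_freq, t_amt)  # nothing to impute
    if r_freq is not None and r_amt is not None:
        return (r_freq, r_amt)  # a complete retest pair replaces both test values
    # Incomplete retest pair: fill only a missing test value the retest can supply.
    freq = t_freq if t_freq is not None else r_freq
    amt = t_amt if t_amt is not None else r_amt
    # Residual missing values: null intake for frequency, smallest portion for amount.
    if freq is None:
        freq = 0
    if amt is None:
        amt = smallest_portion
    return (freq, amt)


pair_a = impute_from_retest((None, 2), (3, 1), smallest_portion=1)        # (3, 1)
pair_b = impute_from_retest((None, 2), (None, 1), smallest_portion=1)     # (0, 2)
pair_c = impute_from_retest((2, None), (None, None), smallest_portion=1)  # (2, 1)
```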

k nearest neighbours imputation

When applying KNN imputation to FFQ data, missing values for each respondent were imputed using values from the k most similar respondents. The idea behind KNN imputation¹² is to take advantage of positive correlations between rows. It is assumed that information about the missing values in row p is best provided by the k rows most similar to row p (the k nearest neighbours). A missing value in row p (here respondent p) in column c (here question c) is therefore imputed by averaging the values that the k nearest neighbours have in column c. The k nearest neighbours must have non-missing entries in column c. The similarity between row p and row p′ is measured by the Euclidean distance between the two rows (omitting the columns for which row p and/or row p′ have missing values), divided by the number of columns where both row p and row p′ have non-missing entries. KNN imputation performs best when there are strong positive correlations between rows.
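The procedure can be sketched in plain Python as follows. This is a minimal illustration that follows the verbal description above; the function names are hypothetical, None marks a missing entry, and the internals of the R function impute.knn may differ in detail.

```python
import math


def distance(row_a, row_b):
    """Euclidean distance over shared non-missing columns, divided by their number."""
    shared = [(a, b) for a, b in zip(row_a, row_b)
              if a is not None and b is not None]
    if not shared:
        return math.inf  # no overlap: treat the rows as maximally distant
    return math.sqrt(sum((a - b) ** 2 for a, b in shared)) / len(shared)


def knn_impute(data, k):
    """Return a copy of data with each missing entry replaced by the mean of
    column c over the k nearest rows that have a non-missing entry in c."""
    imputed = [list(row) for row in data]
    for p, row in enumerate(data):
        for c, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (distance(row, other), q)
                for q, other in enumerate(data)
                if q != p and other[c] is not None
            )
            nearest = candidates[:k]
            if nearest:
                imputed[p][c] = sum(data[q][c] for _, q in nearest) / len(nearest)
    return imputed


# Four respondents (rows) and three questions (columns).
data = [
    [1, 2, None],
    [1, 2, 3],
    [1, 3, 5],
    [5, 5, 0],
]
result = knn_impute(data, k=2)
# Rows 1 and 2 are closest to row 0, so the missing value becomes (3 + 5) / 2 = 4.0.
```

Distances are always computed from the original data, so a value imputed for one respondent is never used when imputing for another.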

To adapt KNN imputation to the FFQ setting, we modified the function impute.knn in the package impute¹⁸ for the statistical software R. First, when computing the distance between two rows, we scaled all columns so that the columns contribute to the distance on the same scale. Second, because the FFQ data are categorical, the imputed values were rounded to the nearest category. Third, the set of seven questions regarding type of fat used on bread were highly dependent (most respondents reported only one type) so that we had to tailor the imputation to handle the dependency. Last, we introduced conditional imputation to prevent permitted missing responses, as defined earlier, from being imputed. A necessary condition for using KNN imputation on categorical variables is that the categories are either ordered or binary, which was the case in this study.
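The first two adaptations can be illustrated with two small Python helpers. The exact scaling used in the study is not stated, so dividing each column by its highest category code is an assumption made here, as is resolving ties towards the lower category; both function names are hypothetical.

```python
def scale_column(values, n_categories):
    """Map category codes 0..n_categories-1 onto the interval [0, 1]."""
    top = n_categories - 1
    return [None if v is None else v / top for v in values]


def nearest_category(value, categories):
    """Round an averaged value to the closest category; ties go to the lower one."""
    return min(categories, key=lambda cat: (abs(cat - value), cat))


# A seven-category frequency question and a binary question end up on the same scale.
scaled_freq = scale_column([0, 3, None, 6], n_categories=7)   # [0.0, 0.5, None, 1.0]
scaled_binary = scale_column([0, 1, 1, None], n_categories=2)  # [0.0, 1.0, 1.0, None]

rounded = nearest_category(2.4, range(7))  # 2
```

Without scaling, a question with many response categories would dominate the distance between two respondents simply because its codes span a wider range.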

When applying KNN imputation, the number of nearest neighbours (k) must be specified. Troyanskaya et al.¹² have shown the optimal k to be between 10 and 20 for microarray data, and the default value in the software is 10. To determine a k value for the FFQ data, we calculated the root-mean-squared error (RMSE) for the KNN imputed data with k between 5 and 135, as well as for the data imputed with the null value and the mode and median (Fig. 1). The RMSE calculation was based on the subset of missing test data for which the retest data were non-missing. The non-missing retest data were used as the reference, since the true values for the responses missing in the test are unknown. Therefore, we also added the RMSE for test vs. retest for the subset of values which was non-missing in both data sets. Based on Fig. 1, three different values of the parameter k (10, 20 and 60) were selected to study the effects on the estimated energy and nutrient intake from the KNN imputed data (Appendix). The intake was stable, although a negligible decrease could be spotted for increasing values of k. In the main tables we used k = 20.
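The RMSE criterion can be sketched in Python as follows. The function name and the toy data are hypothetical; the sketch only shows that the error is computed over values that were missing in the test and non-missing in the retest.

```python
import math


def rmse_against_retest(imputed, retest, was_missing):
    """RMSE over entries missing in the test for which the retest is non-missing."""
    squared = [
        (imp - ref) ** 2
        for imp, ref, missing in zip(imputed, retest, was_missing)
        if missing and ref is not None
    ]
    if not squared:
        return None  # no usable reference values
    return math.sqrt(sum(squared) / len(squared))


# Category codes for four test answers after imputation, with the retest as reference.
imputed = [0, 2, 1, 3]
retest = [1, 2, None, 5]
was_missing = [True, True, True, False]

error = rmse_against_retest(imputed, retest, was_missing)
# Only the first two entries qualify: sqrt((1 + 0) / 2), about 0.71.
```

Repeating this calculation for each candidate k, and for the null, mode and median imputed data, gives a curve of the kind summarised in Fig. 1.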

Fig. 1

RMSE for different values of the parameter k in KNN imputation of the test food-frequency questionnaire with non-missing retest data as the reference. The curve is compared to imputation with the null, mode and median values. The RMSE for test vs. retest for the subset of values which was non-missing in both data sets is also included (n = 1430) (KNN – k nearest neighbours; RMSE – root-mean-squared error)

Statistical analysis

The proportion of missing in the FFQ was calculated for each individual by dividing the number of missing values by the maximum number possible (129 minus the permitted missing values). The distribution was skewed, so the proportion of missing (%) is presented as the median value with lower and upper quartiles (Q1, Q3) by categories of selected background variables (Table 1). Estimated dietary intake is presented as the median value for the null imputed data, and as within-person differences for the other methods relative to the null imputation. The within-person differences did not fulfil the normality assumption and are therefore presented as both mean and median with quartiles (Q1, Q3). Distribution free confidence limits (95%) for the median differences were also calculated, but not included due to the large number of null values, for which the upper confidence limits were generally also null. We used SAS 9.1 for the data analysis and all imputations, except KNN, which was done with R software.
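The two summary measures can be sketched in Python. The function names and intake figures are hypothetical, and the quartile definition of `statistics.quantiles` is not necessarily the one used by SAS, so quartiles from this sketch may differ slightly from those produced by the software used in the study.

```python
import statistics

MAX_MISSING = 129  # maximum number of missing values in the FFQ


def proportion_missing(n_missing, n_permitted):
    """Percentage of missing values among the answers a respondent should have given."""
    return 100.0 * n_missing / (MAX_MISSING - n_permitted)


def summarise_differences(null_intake, other_intake):
    """Mean, median and quartiles of within-person differences relative to null imputation."""
    diffs = [other - null for null, other in zip(null_intake, other_intake)]
    q1, median, q3 = statistics.quantiles(diffs, n=4)
    return {"mean": statistics.mean(diffs), "median": median, "q1": q1, "q3": q3}


share = proportion_missing(n_missing=12, n_permitted=9)  # 12 of 120 possible: 10.0

# Energy intake (kJ) for five respondents under null imputation and another method.
null_kj = [100.0, 200.0, 300.0, 400.0, 500.0]
other_kj = [100.0, 250.0, 300.0, 480.0, 520.0]
summary = summarise_differences(null_kj, other_kj)
```

Because respondents without missing values have a difference of exactly zero, the median difference can be zero even when the mean difference is clearly positive.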

Table 1

Median (quartiles) for the proportion of missing values (%) in the food frequency questionnaire (test and retest) by selected background characteristics of the respondents, n = 1496*

* n may not total to 1496 for each characteristic due to missing values.

† Q1 = lower quartile (25th percentile), Q3 = upper quartile (75th percentile).

‡ Misprint in the questionnaire: the category <150 should have been ≤150. 1000 Norwegian kroner (NOK) ≈ 125 €.

§ Only presented for the retest, due to an optical reading error in the test.

¶ Could only be calculated for the retest.

Results

Rate of missing in the FFQ

The overall data matrix for the 1496 respondents and 136 FFQ variables had 18% missing values for the test and 16% for the retest FFQ, after controlling for permitted missing values. The FFQ was fully completed by an equal proportion of respondents in the test and the retest (6%). After excluding individuals with >50% missing in the test, the test data matrix for the remaining 1430 (96%) respondents had 16% missing values, which were imputed. Respondents had a median value (Q1, Q3) of 13% (5, 25) missing values.

Table 1 shows how the proportion of missing values (%) in the test and retest FFQ varied by categories of selected background characteristics of the respondents. The median proportion increased with age and decreased with household income up to 450 000 NOK. The proportion was ≥16% in the oldest age group (66–75 years), in the lowest household income group (<150 000 NOK), and among widows and teetotallers. Health status, trying to lose weight, smoking status, supplement use and days used to return the FFQ appeared to have little or no effect.

The percentage of imputed values for consumption frequencies ranged from <1% for potatoes to 50% for instant coffee (Table 2). The median value (Q1, Q3) was 12% (6, 25). All items that were not part of question blocks had ≤4% missing, e.g. potatoes, yoghurt, breakfast cereal, shellfish and eggs. Unspecific questions about ‘other’ items included at the end of some blocks, e.g. for fruits, vegetables, meat dishes and fish, had a relatively high percentage of missing values (27–38%). Most items with ≥30% missing were part of question blocks listing several types of the same item, e.g. types of milk, cheese, bread and coffee. Items with a high percentage of missing values in the test also tended to have a high percentage of missing in the retest.

Table 2

The intake of food groups (g day⁻¹) after imputing missing values in the test food-frequency questionnaire. Within-person differences (D) are presented as mean and median (quartiles) for imputation with retest values*, mode, median and KNN, relative to imputation with the null value, n = 1430

KNN – k nearest neighbours imputation.

* Available for 50% of missing values. Residual missing treated as null intake.

† Missing in food frequency questions. Not specified for fat on bread since the intake is calculated from the frequency of bread consumption.

Effects of imputation on dietary intake

Table 2 shows the daily intake of food groups after the missing values in the test FFQ were imputed. Imputation with retest values and the sample mode gave similar results compared to imputation with the null value. The median of the differences was null for all food groups, with some changes in the upper quartiles (Q3). The different imputation methods had little or no effect on the intake of food groups with a low percentage of missing, e.g. ‘potatoes’, ‘cream desserts, milk-based puddings’, ‘pasta and rice’, and ‘red meat and chicken’. The median of the differences was >0 for some food groups when imputed with the sample median and KNN, most of which had >30% missing in one or more food items. KNN tended to give a higher intake than the other imputation methods, with some considerable differences. The median of the differences for the daily intake of ‘milk, yoghurt and cheese’ was 10 g (Q3 = 19) when imputed with the sample median, but 54 g (Q3 = 132) with KNN. For ‘coffee’ the value was 0 g (Q3 = 0) with the sample median, but 60 g (Q3 = 180) with KNN.

The differences in the estimated intake of ‘milk, yoghurt and cheese’ were reflected in the intake of calcium (Table 3). The median of the differences compared to the null value imputation (Q1, Q3) was 84 mg (11, 145) for the sample median, and 176 mg (50, 295) for KNN. For energy intake, the median of the differences was about 200 kJ for imputation with retest values and the sample mode, 455 kJ with the sample median and 743 kJ with KNN. The imputation methods did not change the % energy from protein, fat, carbohydrate, sugar and alcohol.

Table 3

The daily intake of energy and selected nutrients after imputing missing values in the test food-frequency questionnaire. Within-person differences (D) are presented as mean and median (quartiles) for imputation with retest values*, mode, median and KNN, relative to imputation with the null value, n = 1430

KNN – k nearest neighbours imputation; RE – retinol equivalents.

* Available for 50% of missing values. Residual missing treated as null intake.

Discussion

The present study was undertaken to investigate item non-response in the FFQ developed for the NOWAC study and to assess effects of different imputation methods on dietary intake. Compared to null value imputation, the largest differences were seen for KNN, and for food items with a high proportion of missing values. Imputation with retest values increased total energy intake, indicating that not all missing values are caused by respondents failing to specify no consumption.

Item non-response

The proportion of missing within respondents was positively associated with age (data not shown). Likely explanations include impaired cognitive function or a diet with fewer food items. Other factors that seemed to increase non-response, including low income, being a widow or a teetotaller, were also associated with age. Somewhat surprisingly, health status did not seem to affect non-response or to be related to age.

In the present study, we observed a very low percentage of missing values in all items that were not part of a larger question block, or grid. FFQs with a non-grid format may be cognitively easier for respondents to complete, but the format increases the page length and thus the costs for printing, scanning and mailing. One example is the 36-page dietary history questionnaire (DHQ) developed at the National Cancer Institute in the USA. When compared to a shorter FFQ with a traditional format, the DHQ performed better for questions on portion sizes and dietary supplements, but not for consumption frequencies¹⁹. The proportion of missing/uninterpretable responses in the DHQ was low, but similar to the FFQ. In our study, several blocks of items also had relatively low proportions of missing, e.g. cakes (six items with 4–10% missing) and meat (nine items with 4–14% missing, except ‘other meat dishes’ with 27%). It could be that smaller blocks are cognitively not too demanding, or that the consumption awareness or desirability of some foods encourages responses. A complete non-grid format is not practical in all studies. One alternative is to mix single questions with smaller blocks as in the NOWAC questionnaire, and to put key foods as single questions or as the first item in a block.

Effects of imputation on dietary intake

To assume that missing values are due to respondents failing to specify the option for no consumption greatly simplifies dietary intake calculations and may be reasonable for certain food items, e.g. types of milk and fat on bread⁶,²⁰. Some studies have validated this common assumption by follow-up telephone interview³,²⁰,²¹ or a resurvey⁹. In one study, the ‘true’ proportion of null consumption was found to vary greatly between food items from 0% (potatoes) to 96% (roe and fish pâté)³. Using our retest data to estimate the proportion of null consumption (i.e. the proportion of ‘never/rarely’ answers in the retest among the missing test values, for which there was a non-missing retest value), the range was 5% (carrots) to 86% (whole milk) (data not shown). Although the probability of null consumption may be high for some food items, it may be low for others. Food items could be consumed, but skipped due to lack of motivation, fatigue, oversight, difficult or sensitive questions, unclear instructions or other reasons. The increase in absolute intake when missing values are imputed with values from our retest or other resurveys⁹,²¹ supports this.

Compared to imputation with the null value, the sample mode only increased the dietary intake slightly. In our study, 47% of the frequency variables had a mode value of null. The sample mode is the most frequent answer, but mode = 0 or ‘never/rarely’ may not reflect the probability of a food being consumed if the higher response categories add up to a larger proportion. Thus, the mode may be better for imputing variables on a nominal scale. For ordered categories, such as frequencies and portion sizes, the probability is better reflected in the median. Of the frequency variables, 29% had a median value of null. When using the sample median, we observed a higher intake, in particular for food groups with a high proportion of missing values. One study found marginal differences between imputation with the null and the sample median³, but this study also had a low proportion of missing. KNN gave the highest intake. The largest differences were seen for the food groups ‘milk, yoghurt, cheese’ and ‘coffee’. When inspecting the distribution of the consumption frequencies for each food item, the median of the KNN imputed values tended to be one category higher than the median values before imputation, but the highest category was never imputed with KNN.

It is interesting to note that the RMSE value (Fig. 1) was equal for imputation with KNN (k = 20) and the sample median, although there were some clear differences in dietary intake between the methods. Conversely, the energy and nutrient intake was stable for k = 10, 20 and 60, even though the RMSE values were different. A weakness of RMSE is that it does not necessarily measure the effect of imputation on the dietary intake or other outcomes. This has also been found for microarray data²².
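One way to see how equal RMSE values can coexist with different intake estimates is that RMSE ignores the direction of the error. The sketch below uses the usual definition of RMSE (square root of the mean squared difference from a reference) and invented values; it is not a reconstruction of the calculation behind Fig. 1, and the function name rmse is an assumption.

```python
from math import sqrt
from typing import Sequence


def rmse(imputed: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-squared error of imputed values against reference values."""
    squared = [(i - r) ** 2 for i, r in zip(imputed, reference)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical reference (e.g. retest) values and two competing imputations.
reference = [2, 2, 2, 2]
imputed_low = [1, 1, 1, 1]   # every value one category too low
imputed_high = [3, 3, 3, 3]  # every value one category too high

print(rmse(imputed_low, reference), sum(imputed_low))
print(rmse(imputed_high, reference), sum(imputed_high))
```

Both imputations have an RMSE of 1, yet the summed frequencies, and therefore any intake calculated from them, differ by a factor of three.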

When comparing the effects of imputation on dietary intake (food groups and nutrients), we used the null-imputed test data as the reference method for two main reasons: null imputation appears to be common practice, and it gives the most conservative intake estimate. Our reason for not using, for example, the complete retest data as the reference is that it would be difficult to interpret the effects of imputation separately from the underlying differences between the test and retest measurements¹⁵. Also, the reduction in sample size and statistical power would be too large for a meaningful analysis in this study, since the retest data were complete for only 6% of the respondents. The percentage of missing test values in this small subsample (n = 91) was low and not representative of the study sample (n = 1430). Since our reference method can only be used for relative comparisons, we cannot conclude which imputation method is more accurate, only that the choice of method may affect dietary intake. We think the lack of an absolute reference method or a gold standard is a general problem in comparative studies of dietary intake. Therefore, our next step in the evaluation of the imputation methods would be to do a simulation study with a complete data set as the reference.

Imputation uncertainty

All imputation methods used in the present study fall into the category of single imputation methods, implying that each missing value is replaced by a single value. Single imputation methods are usually easy to implement, but ignore any uncertainty about the correct value to impute. This uncertainty can be estimated by doing MI, or repeated simulations of the missing values²³. MI is a model-based approach, relying on specific modelling assumptions, and the method may be difficult to use without proficiency in advanced statistics. To our knowledge, MI has been applied to FFQ data in two recent studies¹⁰,¹¹. KNN is more sophisticated than the other single imputation methods used here, since values are estimated for each individual, but without having to specify a strict model as with MI. In the present study we based the KNN imputation on FFQ variables only, but additional predictors could also have been used.
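The idea of estimating a value for each individual from similar respondents can be illustrated with a generic KNN sketch. It is not the adapted procedure used in the present study: the distance measure (root mean squared difference over items both respondents answered), the use of the neighbours' mean, the function names and the example data are all assumptions made for illustration.

```python
from math import sqrt
from typing import List, Optional

Row = List[Optional[float]]


def distance(a: Row, b: Row) -> Optional[float]:
    """Root mean squared difference over items both respondents answered."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None  # no common items, so no distance can be computed
    return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows: List[Row], k: int) -> List[Row]:
    """Fill each missing value with the mean of that item among the k nearest
    respondents who answered it. Distances use the original, unimputed rows."""
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other respondents with item j observed.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d is not None:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                completed[i][j] = sum(nearest) / len(nearest)
    return completed


# Hypothetical frequency codes: rows are respondents, columns are food items.
data = [
    [3, 3, None],
    [3, 3, 4],
    [3, 2, 4],
    [0, 0, 0],
]
filled = knn_impute(data, k=2)
print(filled[0])
```

With k = 2, the first respondent's missing item is filled from the two respondents with the most similar answers (both coded 4), while the dissimilar fourth respondent is ignored. A sample median, by contrast, would impute the same value for every respondent.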

Both single and multiple imputation rely on a mechanism of missingness known as missing at random (MAR)¹, which requires that the probability that a value is missing is independent of the underlying value that is missing. However, if no consumption (or high consumption) is an important reason for missing values in FFQs, the predominant mechanism may not be MAR, but NMAR (not missing at random). Imputation methods using many predictors, such as MI and KNN, give better protection against departures from the MAR assumption than imputation with e.g. the sample median. The default limit in the function impute.knn¹⁸ for doing imputation based on KNN is 50% missing in rows (here, respondents). For rows with more than 50% missing, a column mean is used. In the present study, we excluded individuals with >50% missing. However, the criteria used to exclude questionnaires vary between investigators²¹. If only a specific food group is of interest (e.g. alcohol), exclusion of subjects with missing values for all or most food group items (e.g. beer, wine and spirits) may be better than imputation. However, it is important to check that the exclusion does not significantly reduce the statistical power, or change the distribution of other exposure variables in the analysis.
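The exclusion rule for respondents with more than 50% missing can be sketched as a simple filter applied before imputation. The function names, the None coding of missing answers and the example rows are hypothetical; the threshold is a parameter because exclusion criteria differ between investigators.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]


def missing_fraction(row: Row) -> float:
    """Proportion of items in one questionnaire that are missing (None)."""
    return sum(v is None for v in row) / len(row)


def split_by_missingness(rows: List[Row],
                         max_missing: float = 0.5) -> Tuple[List[Row], List[Row]]:
    """Return (kept, excluded); a row is excluded if MORE than max_missing
    of its items are missing."""
    kept = [r for r in rows if missing_fraction(r) <= max_missing]
    excluded = [r for r in rows if missing_fraction(r) > max_missing]
    return kept, excluded


# Hypothetical questionnaires with four items each.
questionnaires = [
    [1, None, None, None],  # 75% missing
    [1, 2, None, None],     # 50% missing
    [1, 2, 3, None],        # 25% missing
]
kept, excluded = split_by_missingness(questionnaires)
print(len(kept), len(excluded))
```

Comparing the distribution of other exposure variables in kept and excluded is then a direct way to check whether the exclusion changes the composition of the analysis sample.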

Implications

Imputation may affect absolute intake levels and the ranking of subjects⁹,²¹, which has implications for risk estimation in epidemiological studies, as well as for which subjects are defined as under- and overreporters and excluded from the analysis. When we excluded those in the lower (1%) and upper (99%) percentiles of energy intake after imputing the data (28 subjects with each method), 60% of the subjects were excluded by all methods (details not shown). Our data also show that by using different imputation methods, the median daily energy intake can be increased from 6.4 MJ for the null value to 7.3 MJ for KNN, with a median increase (Q1, Q3) of 11% (3, 24). This can affect the interpretation of FFQ data in general and in validation studies.
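The effect of imputation on who is excluded as an under- or overreporter can be explored with a sketch like the following. The energy intakes are simulated, not NOWAC data, and the rule of dropping the lowest and highest 1% of subjects by rank is an assumption about how such a percentile cut-off might be applied; the function name excluded_ids is hypothetical.

```python
import random
from typing import Dict, Set


def excluded_ids(energy_by_id: Dict[int, float],
                 tail_share: float = 0.01) -> Set[int]:
    """Ids of the lowest and highest tail_share of subjects by energy intake."""
    ranked = sorted(energy_by_id, key=energy_by_id.get)
    n_tail = int(len(ranked) * tail_share)
    if n_tail == 0:
        return set()
    return set(ranked[:n_tail]) | set(ranked[-n_tail:])


# Simulated daily energy intake (MJ) for 1430 subjects under two methods.
rng = random.Random(1)
null_energy = {i: rng.gauss(6.4, 1.5) for i in range(1430)}
# Assumed: the second method raises each subject's intake by 0-30%.
knn_energy = {i: e * (1 + rng.uniform(0.0, 0.3)) for i, e in null_energy.items()}

excluded_null = excluded_ids(null_energy)
excluded_knn = excluded_ids(knn_energy)
shared = excluded_null & excluded_knn
print(len(excluded_null), len(excluded_knn), len(shared))
```

With 1430 subjects, a 1% tail on each side gives 14 + 14 = 28 exclusions per method, and the size of shared shows how far the two methods agree on which subjects are removed.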

In conclusion, the calculation of dietary intake from FFQs is affected by the proportion of missing data and the imputation method used. As an overall imputation strategy, the null value is likely to lead to underestimation of dietary intake and misclassification. However, missing values cannot be estimated without error. We encourage the application and evaluation of more refined imputation methods, which are described in the statistical literature, and which may reduce imputation errors and give more accurate intake estimates. To determine whether KNN performs better than the other methods used, our next step would be to do a simulation study.

Acknowledgements

Sources of funding: The work of C.L.P. was supported by the Norwegian Foundation for Health and Rehabilitation.

Conflict of interest declaration: The authors declare no conflict of interest.

Authorship contributions: A.H., P.L., E.L. and M.B.V. were responsible for the initiation, design and funding of the present study. E.L. is the Principal Investigator of the Norwegian Women and Cancer study, and A.H. developed the food-frequency questionnaire. The imputation of the data was done by I.S. (KNN imputation) and C.L.P. (other methods). C.L.P. calculated the dietary intake, performed the statistical analysis, and drafted the manuscript, except for the section on KNN and Fig. 1, which were prepared by I.S. M.B.V. and P.L. contributed to the statistical analysis. All authors were involved in the interpretation of results and manuscript revision, and all authors approved the final version.

Acknowledgements: The authors are very grateful to Professor Arnoldo Frigessi at the University of Oslo for ideas, including the use of KNN imputation, and for fruitful discussions.

Appendix. The mean and median daily intake of energy and selected nutrients after imputing missing values in the test food-frequency questionnaire by k nearest neighbours imputation for different values of the parameter k. Within-person differences are presented as mean and median (quartiles) for k = 10 and k = 60 relative to k = 20, which is used in the main tables (n = 1430)

RE – retinol equivalents.

References

1. Little, RJA, Rubin, DB. Statistical Analysis with Missing Data, 2nd edn. New Jersey: Wiley, 2002.

2. Schafer, JL, Graham, JW. Missing data: our view of the state of the art. Psychological Methods 2002; 7: 147–177.

3. Hansson, LM, Galanti, MR. Diet-associated risks of disease and self-reported food consumption: how shall we treat partial nonresponse in a food frequency questionnaire? Nutrition and Cancer 2000; 36: 1–6.

4. Holmberg, L, Ohlander, EM, Byers, T, Zack, M, Wolk, A, Bruce, A, et al. A search for recall bias in a case–control study of diet and breast cancer. International Journal of Epidemiology 1996; 25: 235–244.

5. Cade, J, Thompson, R, Burley, V, Warm, D. Development, validation and utilisation of food-frequency questionnaires – a review. Public Health Nutrition 2002; 5: 567–587.

6. Johansson, I, Hallmans, G, Wikman, A, Biessy, C, Riboli, E, Kaaks, R. Validation and calibration of food-frequency questionnaire measurements in the Northern Sweden Health and Disease cohort. Public Health Nutrition 2002; 5: 487–496.

7. Gaard, M, Tretli, S, Loken, EB. Dietary fat and the risk of breast cancer: a prospective study of 25,892 Norwegian women. International Journal of Cancer 1995; 63: 13–17.

8. Goldbohm, RA, van den Brandt, PA, Brants, HA, van’t Veer, P, Al, M, Sturmans, F, et al. Validation of a dietary questionnaire used in a large-scale prospective cohort study on diet and cancer. European Journal of Clinical Nutrition 1994; 48: 253–265.

9. Ahn, Y, Paik, HY, Ahn, YO. Item nonresponses in mailed food frequency questionnaires in a Korean male cancer cohort study. Asia Pacific Journal of Clinical Nutrition 2006; 15: 170–177.

10. Barzi, F, Woodward, M, Marfisi, RM, Tognoni, G, Marchioli, R. Analysis of the benefits of a mediterranean diet in the GISSI-prevenzione study: a case study in imputation of missing values from repeated measurements. European Journal of Epidemiology 2006; 21: 15–24.

11. Michels, KB, Rosner, BA, Chumlea, WC, Colditz, GA, Willett, WC. Preschool diet and adult risk of breast cancer. International Journal of Cancer 2006; 118: 749–754.

12. Troyanskaya, O, Cantor, M, Sherlock, G, Brown, P, Hastie, T, Tibshirani, R, et al. Missing value estimation methods for DNA microarrays. Bioinformatics 2001; 17: 520–525.

13. Welch, AA, Luben, R, Khaw, KT, Bingham, SA. The CAFE computer program for nutritional analysis of the EPIC-Norfolk food frequency questionnaire and identification of extreme nutrient values. Journal of Human Nutrition and Dietetics 2005; 18: 99–116.

14. Lund, E, Kumle, M, Braaten, T, Hjartaker, A, Bakken, K, Eggen, E, et al. External validity in a population-based national prospective study – the Norwegian Women and Cancer Study (NOWAC). Cancer Causes & Control 2003; 14: 1001–1008.

15. Parr, CL, Veierod, MB, Laake, P, Lund, E, Hjartaker, A. Test–retest reproducibility of a food frequency questionnaire (FFQ) and estimated effects on disease risk in the Norwegian Women and Cancer Study (NOWAC). Nutrition Journal 2006; 5: 4.

16. Hjartaker, A, Lund, E, Bjerve, KS. Serum phospholipid fatty acid composition and habitual intake of marine foods registered by a semi-quantitative food frequency questionnaire. European Journal of Clinical Nutrition 1997; 51: 736–742.

17. Slimani, N, Ferrari, P, Ocke, M, Welch, A, Boeing, H, Liere, M, et al. Standardization of the 24-hour diet recall calibration method used in the European Prospective Investigation into Cancer and Nutrition (EPIC): general concepts and preliminary results. European Journal of Clinical Nutrition 2000; 54: 900–917.

18. Hastie, T, Tibshirani, R, Narasimhan, B, Chu, G. The Package Impute for Imputation of Microarray Data With R Software [online], 2006. Available at http://cran.r-project.org/. Accessed 28 March 2007.

19. Subar, AF, Ziegler, RG, Thompson, FE, Johnson, CC, Weissfeld, JL, Reding, D, et al. Is shorter always better? Relative importance of questionnaire length and cognitive ease on response rates and data quality for two dietary questionnaires. American Journal of Epidemiology 2001; 153: 404–409.

20. Kuskowska-Wolk, A, Holte, S, Ohlander, EM, Bruce, A, Holmberg, L, Adami, HO, et al. Effects of different designs and extension of a food frequency questionnaire on response rate, completeness of data and food frequency responses. International Journal of Epidemiology 1992; 21: 1144–1150.

21. Caan, B, Hiatt, RA, Owen, AM. Mailed dietary surveys: response rates, error rates, and the effect of omitted food items on nutrient values. Epidemiology 1991; 2: 430–436.

22. Scheel, I, Aldrin, M, Glad, IK, Sorum, R, Lyng, H, Frigessi, A. The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 2005; 21: 4272–4279.

23. Schafer, JL. Multiple imputation: a primer. Statistical Methods in Medical Research 1999; 8: 3–15.

Fig. 1 RMSE for different values of the parameter k in KNN imputation of the test food-frequency questionnaire with non-missing retest data as the reference. The curve is compared to imputation with the null, mode and median values. The RMSE for test vs. retest for the subset of values which was non-missing in both data sets is also included (n = 1430) (KNN – k nearest neighbours; RMSE – root-mean-squared error)

Table 1 Median (quartiles) for the proportion of missing values (%) in the food frequency questionnaire (test and retest) by selected background characteristics of the respondents, n = 1496*

Table 2 The intake of food groups (g day−1) after imputing missing values in the test food-frequency questionnaire. Within-person differences (D) are presented as mean and median (quartiles) for imputation with retest values*, mode, median and KNN, relative to imputation with the null value, n = 1430

Table 3 The daily intake of energy and selected nutrients after imputing missing values in the test food-frequency questionnaire. Within-person differences (D) are presented as mean and median (quartiles) for imputation with retest values*, mode, median and KNN, relative to imputation with the null value, n = 1430
