Many countries including the USA and Canada have seen dramatic increases in rates of childhood obesity, type 2 diabetes and other diet-related health conditions in recent decades( Reference Patterson, Guariguata and Dahlquist 1 , Reference Ng, Fleming and Robinson 2 ). Researchers have argued that improvements to the wider food environment including the availability, accessibility or affordability of healthy food( Reference Caspi, Sorensen and Subramanian 3 ) could contribute to public health strategies aimed at reducing barriers to healthy eating( Reference Ver Ploeg, Breneman and Farrigan 4 – Reference Zenk, Thatcher and Reina 6 ). Recent studies and policy interventions have focused in particular on measuring and assessing the potential impact of the ‘community nutrition environment’ surrounding schools( Reference Mair, Pierce and Teret 7 ), defined by Glanz et al. as ‘the number, type, and location and accessibility of food outlets’( Reference Glanz, Sallis and Saelens 8 ).
For example, Los Angeles recently banned fast-food outlets from opening in South Los Angeles, in part to reduce children’s access to and intake of minimally nutritious foods( Reference Sturm and Cohen 9 ).
In Canada, the only G8 country without a federal school lunch programme, students may be particularly likely to purchase minimally nutritious foods from food vendors near schools; Héroux et al.( Reference Héroux, Iannotti and Currie 10 ) report that Canadian children are more frequent school-day patrons of food retailers than are American children. However, large gaps remain in the evidence base regarding the ways Canadian children’s dietary choices are shaped by community nutrition environments surrounding schools (or homes), in part due to difficulties associated with the collection of data on community nutrition environments.
The majority of peer-reviewed studies on the community nutrition environment obtain food outlet data from: (i) ‘ground-truthing’, the systematic surveying of a region to identify and classify food retailers; (ii) commercial database providers; or (iii) government sources( Reference Moore and Diez-Roux 11 ). Ground-truthing is considered the gold standard( Reference Fleischhacker, Evenson and Sharkey 12 , Reference Lucan 13 ), but the approach is resource-intensive and infeasible for the assessment of past years. Commercial data sets often require less time and cost to obtain, and many are available for historical periods (e.g. DMTI Spatial, Inc. 2003( 14 ), 2006( 15 ) and 2009( 16 )), but such data sets are constructed for business purposes and may not achieve levels of quality necessary for research( Reference Moore and Diez-Roux 11 ). To date, many Canadian studies of the community nutrition environment surrounding schools have relied on Yellow Pages (commercial) food outlet directories( Reference Héroux, Iannotti and Currie 10 , Reference Seliske, Pickett and Boyce 17 – Reference Laxer and Janssen 19 ). A recent review, however, found that Yellow Pages directories perform poorly in measures of validity compared with more expensive commercial sources( Reference Fleischhacker, Evenson and Sharkey 12 ). Municipal data sets like health inspections listings or business registries are frequently free, and could have fewer missing data points because of the legal requirements associated with government data collection( Reference Hosler and Dharssi 20 , Reference Toft, Erbs-Maibing and Glümer 21 ), but government agencies vary in their efforts to maintain and update registries( Reference Fleischhacker, Evenson and Sharkey 12 ).
A 2013 systematic review identified nineteen studies that tested the validity of commonly used community nutrition environment data sources( Reference Fleischhacker, Evenson and Sharkey 12 ), generally comparing the data source of interest with ground-truthed data. Researchers then rely on validity measures including sensitivity, positive predictive value (PPV) and concordance (Table 1) to characterize levels of overcounting (including stores that have closed or do not exist) and undercounting (failing to include existing stores). Data validation studies also often test for systematic error in secondary data sets, evaluating associations between error rates and neighbourhood characteristics( Reference Fleischhacker, Evenson and Sharkey 12 ). Both random and systematic errors are of interest because random measurement error would add noise that obscures the associations of the community nutrition environment with outcomes of interest, while systematic error would contribute bias that could lead researchers to incorrect results. There is thus a need to understand both the magnitude and the nature of error in commonly used community nutrition environment data sets.
TP, true positive; FN, false negative; FP, false positive.
Systematic error is of particular concern because of its potential to produce misleading results. Most studies have not found evidence of systematic bias according to neighbourhood socio-economic status( Reference Paquet, Daniel and Kestens 22 – Reference Clary and Kestens 28 ) or neighbourhood racial demographics( Reference Bader, Ailshire and Morenoff 24 , Reference Rossen, Pollack and Curriero 26 , Reference Rummo, Gordon-Larsen and Albrecht 29 ), but several studies show evidence of systematic bias in relation to urbanicity or commercial density. Four studies in the USA identified statistically significant differences in validity levels in association with urbanicity or density( Reference Bader, Ailshire and Morenoff 24 , Reference Longacre, Primack and Owens 30 – Reference Powell, Han and Zenk 32 ) although no significant associations were identified in two UK studies( Reference Lake, Burgoine and Stamp 25 , Reference Burgoine and Harrison 27 ) and the direction of the association varies across studies. But the data sets examined in the aforementioned studies are often specific to the USA or Europe. In Canada, data validation research has focused on two targeted geographic areas (the city of Montreal( Reference Paquet, Daniel and Kestens 22 , Reference Clary and Kestens 28 ) and the province of Ontario( Reference Seliske, Pickett and Bates 33 )), limiting generalizability to other regions like Vancouver, where there has been recent interest in food environment research and policy( 34 ). Moreover, to our knowledge, no Canadian study has tested for systematic bias in validity scores according to commercial density. This is an important gap given the evidence from other countries of associations between validity and commercial density( Reference Bader, Ailshire and Morenoff 24 , Reference Longacre, Primack and Owens 30 – Reference Powell, Han and Zenk 32 ) as well as the possibility that error, if systematic, may bias research results.
The present study sought to fill gaps in the literature through an evaluation of food outlet data sources for the city of Vancouver, Canada. The study’s objectives were threefold: (i) to assess the validity of two commercial and two municipal secondary data sources in comparison with ground-truthed data; (ii) to test each data set for evidence of systematic bias in association with neighbourhood socio-economic deprivation or commercial density; and (iii) to compare community nutrition environment measures constructed from secondary commercial and municipal data sets with gold-standard ground-truthed data. The first objective provides results that can be compared with findings from previous data validation research in other countries and cities, while the second and third objectives offer novel methods to help researchers understand how over- or undercounting of outlet listings may be affecting community nutrition environment research.
The present study examined the community nutrition environments surrounding schools in Vancouver, Canada. Vancouver is a coastal city with one of the most densely populated metropolitan areas in North America( 35 ). Food outlet data were obtained from five sources: (i) ground-truthed primary data; (ii) (municipal) Business Licences( 36 ); (iii) (municipal) Vancouver Coastal Health inspections lists( 37 ); (iv) (commercial) Pitney Bowes Software’s Canada Business Points( 38 ); and (v) (commercial) DMTI Spatial, Inc.’s Enhanced Points of Interest( 39 ). An overview of these data sets is provided in Table 2.
GIS, geographic information system; NAICS, North American Industry Classification System; SIC, Standard Industrial Classification.
The ground-truthed data were obtained through systematic surveying between 29 June and 30 September 2015. A purposive sampling approach was used to select twenty-six schools across the Vancouver School Board’s six geographic sectors (detailed in previous papers( Reference Ahmadi, Black and Velazquez 40 , Reference Velazquez, Black and Billette 41 )) located in neighbourhoods with diverse levels of commercial density and socio-economic status.
Following a surveying protocol adapted from similar research( Reference Fleischhacker, Rodriguez and Evenson 42 ) (see online supplementary material, Supplementary File 1), two researchers visited all commercial streets located within an 800 m line-based buffer surrounding schools, a buffer size chosen because it is the distance most frequently examined in research on the community nutrition environment surrounding schools( Reference Williams, Scarborough and Matthews 43 ). The researchers identified, photographed and classified all food outlets; a single researcher also identified, photographed and classified any outlets along each residential street included in the sample. The surveyors collected outlet GPS coordinates with a Garmin eTrex 20x Worldwide Handheld GPS Navigator. One school buffer zone was visited twice by two separate surveying teams, and the results were compared using Cohen’s κ to assess inter-rater reliability in surveyors’ store classifications.
The two municipal data sets – Business Licences and Vancouver Coastal Health inspections lists – were obtained from the Vancouver Open Data Catalogue and from the inspections website for Vancouver Coastal Health, respectively, in October 2015. For the Business Licences, historical records allowed the present study to examine both 2015 and 2012 data to consider the potential impacts of temporality of data on validity measures. The inspections lists included records from health inspections of all restaurants and food facilities conducted by Vancouver Coastal Health, the health authority for the region within which this study was conducted. The organization’s inspections lists comprised food-service establishments, food stores and food processors in the city of Vancouver, classified by ‘service type’. The Business Licences data were similar, although they offered a more fine-grained ‘business sub-type’ classification system for identifying convenience stores, grocery stores and produce outlets.
The most recent commercial data sources to which we had access were Canada Business Points data from 2012 and Enhanced Points of Interest data for 2013. Both data sets included geographic locations, Standard Industrial Classification (SIC) codes and North American Industry Classification System (NAICS) codes – two federal coding systems that classify businesses according to industry. The NAICS codes are a more recent classification system that has replaced SIC codes for many government agencies in Canada, the USA and Mexico( 44 ).
The 2015 Business Licence data( 36 ) were also used to measure commercial density – defined as the total number of businesses of any type located within the 800 m buffer surrounding schools – based on their performance in the validation study (see ‘Results’). Relative socio-economic deprivation was assessed with the Vancouver Area Neighbourhood Deprivation Index (VANDIX), an area-based index of deprivation constructed from seven variables (proportion of the population with less than a high school education, proportion with a university degree, unemployment rate, proportion of lone-parent families, average income, proportion of home owners and labour force participation rate) obtained from the 2006 Census of Canada( Reference Bell, Schuurman and Oliver 45 , Reference Bell and Hayes 46 ). For the current study, the VANDIX was calculated for dissemination areas, 400- to 700-person regions comprising the smallest available census geography( 47 ). The twenty-six schools examined in the study, which were mapped with data from the Vancouver Open Data Catalogue( 48 ), were assigned a ‘high’, ‘medium’ or ‘low’ VANDIX tertile based on the VANDIX scores of the dissemination area directly surrounding the school. ‘High’ scores indicate the most socio-economically deprived and ‘low’ scores indicate the least deprived areas.
Cleaning and classification of food outlets
The secondary data sets were carefully examined and listings that were outdated, duplicated or lacking geographic information were deleted following standard procedures used in similar research( Reference Paquet, Daniel and Kestens 22 , Reference Clary and Kestens 28 , Reference Liese, Colabianchi and Lamichhane 31 , Reference Lucan, Maroko and Bumol 49 ). For the Vancouver Coastal Health inspections lists, which did not include geographic coordinates, an address locator( 50 ) geolocated outlets with 98 % accuracy; manual address matches were identified for the remaining 2 % of outlets. For each of the four secondary community nutrition environment data sets, outlets located within 800 m line-based buffers( Reference Oliver, Schuurman and Hall 51 ) surrounding each of the twenty-six schools of interest were extracted for comparison with ground-truthed outlets located within the same buffers. All geographic data were projected to the NAD83/UTM zone 10N coordinate system with ArcGIS( 52 ).
The present study compared three classes of outlets: (i) limited-service food outlets, restaurants or coffee shops where customers order at a counter and pay before consuming food or beverages; (ii) convenience stores, which included retail stores primarily offering snack foods or beverages, possibly attached to a pharmacy or gas station; and (iii) grocery stores or supermarkets, comprising retail food stores with the departments of a traditional grocer (dairy, bakery, butcher, deli and produce). These three store types were selected because they are the most commonly used store types in the literature on community nutrition environments surrounding schools( Reference Williams, Scarborough and Matthews 43 ), and definitions were adapted from previous research( Reference Fleischhacker, Rodriguez and Evenson 42 , Reference Lucan, Maroko and Bumol 49 , Reference Han, Powell and Zenk 53 ). Outlets were classified following a modification of the flowchart used by Clary and Kestens( Reference Clary and Kestens 28 ) (included in the online supplementary material, Supplementary File 1). For the 2012 and 2015 Business Licences, ‘Business Type’ and ‘Business Subtype’ were used to classify listings. The ‘Facility Type’ classification included in the Vancouver Coastal Health inspections lists was too coarse-grained to identify each of the three outlet classes and the SIC/NAICS codes provided in the commercial Canada Business Points and Enhanced Points of Interest were inadequate for classification (e.g. McDonald’s and other well-known fast-food outlets were listed as full-service restaurants, and the codes often failed to discriminate between convenience stores and small grocery outlets). The present study thus supplemented the ‘Facility Type’ and SIC/NAICS codes with the application of a name-based classification scheme (see online supplementary material, Supplementary File 2) following previous studies( Reference Burgoine and Harrison 27 , Reference Clary and Kestens 28 ).
Outlet matching approach
Two approaches were applied to match outlets in the commercial and municipal data sets with outlets in the ground-truthed data set. First, outlets were compared by address and two outlets were matched if the listings included identical street names and numbers. This approach left some stores unmatched due to small inconsistencies, so an algorithm was encoded in R version 3.2.4( 54 ) to match each store according to name and geographic location, following previous studies( Reference Auchincloss, Moore and Moore 55 , Reference Hoehner and Schootman 56 ). For each store in the ground-truthed data set, geographic coordinates were used to identify all stores in the secondary data set located within 100 m of the ground-truthed store. The Levenshtein similarity, a similarity function based on the Levenshtein distance (the minimum number of edits necessary for one store name to become identical to the other( Reference Winkler 57 )), was calculated for all potential matches within 100 m with the RecordLinkage package for R( Reference Sariyar and Borg 58 ); the ground-truthed store was then matched with the outlet with the highest Levenshtein similarity score. Results from the address- and the name-based matching approaches were compared and, for ground-truthed outlets with different results across the two approaches, the best match was determined manually. For the Canada Business Points, which did not include addresses, the algorithm was applied twice and each entry was reviewed and, if necessary, matched manually.
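The name-and-location matching step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function names, the example stores and their coordinates are hypothetical. It assumes that coordinates are already projected to a metre-based system (such as the UTM projection used here), so that straight-line distance can be computed directly, and that the similarity score is one minus the Levenshtein distance divided by the length of the longer name.

```python
from math import hypot


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; assumed normalisation by the longer name."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def best_match(ground_store, secondary_stores, max_distance_m=100.0):
    """Return (store, similarity) for the secondary listing within
    max_distance_m whose name is most similar, or None if none is in range."""
    candidates = [
        s for s in secondary_stores
        if hypot(s["x"] - ground_store["x"],
                 s["y"] - ground_store["y"]) <= max_distance_m
    ]
    if not candidates:
        return None
    scored = [(s, levenshtein_similarity(ground_store["name"], s["name"]))
              for s in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical stores; x and y are projected coordinates in metres.
ground = {"name": "Joe's Corner Market", "x": 491000.0, "y": 5457000.0}
secondary = [
    {"name": "Joes Corner Mkt", "x": 491030.0, "y": 5457040.0},      # 50 m away
    {"name": "Pizza Palace", "x": 491010.0, "y": 5457005.0},         # nearby, different name
    {"name": "Joe's Corner Market", "x": 491500.0, "y": 5457000.0},  # 500 m away, out of range
]
match = best_match(ground, secondary)
print(match)
```

In this sketch the identically named listing 500 m away is never considered, because the distance filter is applied before names are compared. As in the study, any automated match of this kind would still warrant manual review where the address-based and name-based results disagree.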
First, the validity of all secondary data sets was assessed with the ground-truthed data set serving as the gold standard. For each of the commercial and municipal secondary data sets, a matched store was considered a true positive (TP) if it was listed in both the secondary data set and the ground-truthed data with the same classification, a false positive (FP) if listed in the secondary data but not in the ground-truthed data, and a false negative (FN) if listed in the ground-truthed data but not in the secondary data set. Sensitivity, PPV and concordance (defined in Table 1) were then calculated as measures of the validity of each secondary data source. A listing was considered a true positive even if it had a different name in the secondary data set from that in the ground-truthed data, if the two listings included identical addresses and classifications. As a sensitivity analysis, ‘strict’ true positives were calculated omitting stores with highly dissimilar names.
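To make the three validity measures concrete, the sketch below computes them from counts of true positives, false positives and false negatives. Table 1 is not reproduced in this text, so the formulas are the definitions commonly used in this literature and should be read as an assumption: sensitivity = TP/(TP + FN), PPV = TP/(TP + FP) and concordance = TP/(TP + FP + FN). The function name and the example counts are hypothetical.

```python
def validity_measures(tp: int, fp: int, fn: int) -> dict:
    """Validity of a secondary data set against ground-truthed data.

    tp: listings found in both sources with the same classification
    fp: listings in the secondary data set but not found in the field
    fn: outlets found in the field but missing from the secondary data set
    Formulas are the conventional definitions (assumed, not taken from Table 1).
    """
    def ratio(numerator, denominator):
        # Return None rather than failing when a denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),       # share of real outlets that are listed
        "ppv": ratio(tp, tp + fp),               # share of listings that are real
        "concordance": ratio(tp, tp + fp + fn),  # agreement across all outlets in either source
    }


# Hypothetical counts, not results from the study.
example = validity_measures(tp=60, fp=40, fn=20)
print(example)
```

With these hypothetical counts, a data set that captures most field-observed outlets can still have a modest PPV if many of its listings no longer exist, which is why the measures are reported together.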
Second, logistic regressions examined whether the odds of false positives or false negatives increased in association with neighbourhood socio-economic deprivation or commercial density to assess systematic biases. The sensitivity analyses were fitted for all stores in the ground-truthed data set with the outcome equal to 1 if the store was a false negative and 0 if the store was a true positive; the PPV analyses were fitted for all stores in each secondary data set with the outcome equal to 1 if the store was a false positive and 0 if the store was a true positive. Each model was fitted with either VANDIX score tertile or commercial density (in units of 100 outlets) as the independent variable. As a sensitivity analysis, models were also fitted with population density, measured as the average number of people per hectare located within the 800 m line-based buffers surrounding each school, calculated from dissemination area-level data from the 2006 Census.
Third, community nutrition environment measures (density and proximity of outlets near schools) constructed from the commercial and municipal data sets were compared with measures from the ground-truthed data set using Kendall’s τ, a non-parametric measure of correlation( Reference Newson 59 ). ArcGIS was used to calculate density (the total number of outlets located within each 800 m line-based school buffer) and proximity (the shortest street-based distance from each school to a food outlet). Confidence intervals were calculated with the DescTools package in R( Reference Signorell 60 ) and P<0·05 was used for determining statistical significance for all analyses.
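The rank correlations used in this comparison can be illustrated with a short sketch. It computes point estimates of Kendall’s τa and τb from paired school-level measures by counting concordant and discordant pairs; it does not reproduce the confidence intervals, which the study obtained from the DescTools package in R. The function name and the example counts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(x, y):
    """Return (tau_a, tau_b) for two equal-length sequences of measures.

    tau_a divides (concordant - discordant) by the total number of pairs;
    tau_b adjusts the denominator for pairs tied in x or in y.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    tau_a = (concordant - discordant) / n_pairs
    denominator = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    tau_b = (concordant - discordant) / denominator if denominator else float("nan")
    return tau_a, tau_b


# Hypothetical outlet counts within the buffers of five schools.
ground_truthed_density = [3, 5, 5, 8, 12]
secondary_density = [2, 4, 6, 6, 9]
tau_a, tau_b = kendall_tau(ground_truthed_density, secondary_density)
print(round(tau_a, 3), round(tau_b, 3))
```

Count-based density measures often contain ties (several schools with the same number of outlets), which is the situation τb is designed for; continuous proximity measures rarely tie, so τa and τb then coincide.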
Assessment of data set validity
Table 3 reports the counts of food outlets for each of the municipal and commercial secondary data sets and results from comparisons between ground-truthed and secondary data sources. Ground-truthing identified 267 limited-service food outlets, 124 convenience stores and sixty-four grocery stores or supermarkets located within 800 m of the sample of twenty-six schools. For outlets classified by two surveyors, percentage agreement was 91 % and Cohen’s κ was 0·88, indicating strong inter-rater reliability( Reference McHugh 61 ).
† Number of total unique food outlets listed in each data set located within 800 m of twenty-six schools.
The 2015 Business Licences had the highest overall scores for sensitivity, identifying 69 % of the ground-truthed stores. This data set’s sensitivity was highest for convenience stores (0·75) and limited-service outlets (0·72), and lower for grocery stores (0·42). Nevertheless, the Business Licences generated the highest sensitivity for grocery stores among the secondary data sources examined. The Vancouver Coastal Health inspections list, in contrast, had the highest PPV (0·60) for all outlets combined. The validity estimates for each of the municipal data sets in 2015 were higher than those obtained for either of the two commercial data sets in all cases except for the sensitivity estimates for grocery stores.
With strict name matching, the 2015 Business Licence data lost twenty-eight outlet matches, leading its sensitivity to drop to 0·62 while PPV decreased to 0·50. The 2012 Business Licence data lost thirty-four matches (sensitivity=0·51, PPV=0·42), the Vancouver Coastal Health data lost fifteen matches (sensitivity=0·50, PPV=0·57) and the Enhanced Points of Interest lost twenty-seven matches (sensitivity=0·33, PPV=0·32). Canada Business Points had the fewest matched outlets with different names, with just seven outlets failing the stricter name-based standard (sensitivity=0·40, PPV=0·42). Regardless of the approach to matching store names, the municipal data sets performed better in terms of overall sensitivity, PPV and concordance than did the commercial data sets.
Assessment of systematic bias
Tables 4 and 5 report findings from bivariate logistic regression analyses examining associations of commercial density and socio-economic status with false positive and false negative listings in each secondary data set. Neighbourhood socio-economic deprivation surrounding schools was not consistently associated with the odds of listings being false positives or false negatives. However, commercial density surrounding schools was significantly associated with the proportion of false negative (v. true positive) listings in all secondary data sets except the municipal Business Licences data. An increase of 100 stores within an 800 m buffer zone surrounding schools was associated with a 7 % increase in the odds that a store in the ground-truthed data would be missing from the Vancouver Coastal Health inspections lists (OR=1·07, 95 % CI 1·01, 1·14), 11 % higher odds in the Canada Business Points (OR=1·11, 95 % CI 1·04, 1·18) and 8 % higher odds in the Enhanced Points of Interest (OR=1·08, 95 % CI 1·01, 1·15). Commercial density was not significantly associated with the odds of false positive listings, and no significant associations were observed in models fitted with population density rather than commercial density.
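The reported odds ratios can be read as percentage changes in the odds per 100 additional businesses, as the short sketch below illustrates. The helper names are hypothetical, and the rescaling to other increments assumes the log-linear form of a logistic regression model, in which the odds ratio for k units is the per-unit odds ratio raised to the power k.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


def rescale_odds_ratio(odds_ratio: float, units: float) -> float:
    """Odds ratio for `units` increments of the predictor, assuming a
    log-linear (logistic) model with a constant per-unit odds ratio."""
    return odds_ratio ** units


# Odds ratio reported for the Vancouver Coastal Health inspections lists,
# expressed per 100 businesses within the 800 m buffer.
or_per_100 = 1.07
print(percent_change_in_odds(or_per_100))

# Implied odds ratio when comparing buffers that differ by 300 businesses.
or_per_300 = rescale_odds_ratio(or_per_100, 3)
print(round(or_per_300, 3), round(percent_change_in_odds(or_per_300), 1))
```

Under this assumption, modest per-100 odds ratios compound when school buffers differ by several hundred businesses, so the practical size of the bias depends on the range of commercial density in the sample.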
VANDIX, Vancouver Area Neighbourhood Deprivation Index.
† Calculated in the 800 m region surrounding each school.
‡ Calculated in the dissemination area surrounding each school; ‘high’ indicates most deprived.
§ Number of outlets in each secondary data set; outlets in two buffer zones are counted twice.
* P<0·05, **P<0·01.
Comparison of community nutrition environment measures across data sets
Across all secondary data sources, density measures were highly correlated with measures from the ground-truthed data (Kendall’s τb ≥ 0·87 for all outlets). The strength of the correlations between proximity measures from secondary and ground-truthed data was slightly lower, with Kendall’s τa falling between 0·61 for the 2012 Business Licences (95 % CI 0·37, 0·84) and 0·74 for the Canada Business Points (95 % CI 0·49, 0·99). This suggests that in ranking schools by proximity, measures constructed from the Canada Business Points were 74 percentage points more likely to agree than to disagree with measures constructed from the ground-truthed data; rankings based on measures constructed from the 2012 Business Licences were only 61 percentage points more likely to agree than to disagree with measures constructed from the ground-truthed data.
Table 6 further illustrates differences in the correlations of community nutrition environment measures between data sources depending on the store type of interest. Although both commercial data sets performed comparably to the municipal data sets in estimating the density of limited-service outlets and convenience stores, rank correlations were considerably lower for grocery store densities (0·56 and 0·51, respectively).
*P<0·05, **P<0·01, ***P<0·001.
† Evaluated with τb due to ties.
‡ Evaluated with τa.
The present study assessed the validity of two municipal and two commercial community nutrition environment data sources compared with a gold standard, ground-truthed data set in a large North American city. This research to our knowledge is the first to directly compare two commercial database providers – DMTI Spatial, Inc. and Pitney Bowes Software – which are among the most accessible proprietary sources of commercial food outlet data in Canada. The study adds to the literature by examining how error affects measures of community nutrition environment exposure surrounding schools, illuminating the nature and magnitude of error within secondary data sets, and offering insight from a large Canadian city.
The study found that all data sets were subject to high levels of error: data sets both (i) failed to include at least 20 % of outlets observed in the field and (ii) consisted at minimum of 25 % listings not found in the field. The 2015 Business Licence data and the Vancouver Coastal Health data had sensitivity and PPV values in the range of 0·54–0·69 (for all food outlets), similar to results for local health department listings’ sensitivity (0·66) and PPV (0·49) in North Carolina, USA( Reference Fleischhacker, Rodriguez and Evenson 42 ), and to a sensitivity estimate (0·66) for city council data in Newcastle, UK( Reference Lake, Burgoine and Greenhalgh 62 ). The municipal data sources’ PPV scores were lower, however, than those found in Newcastle city council data (PPV=0·92)( Reference Lake, Burgoine and Greenhalgh 62 ) and for South Carolina Department of Health and Environmental Control data (PPV=0·89)( Reference Liese, Colabianchi and Lamichhane 31 ). These differences suggest that researchers should evaluate the validity of government data on a case-by-case basis, if possible, before choosing to use municipal data sets for research purposes( Reference Fleischhacker, Evenson and Sharkey 12 ).
Overall, the sensitivity, PPV and concordance values for the commercial data sources were lower in Vancouver than reported in previous studies in other regions. For example, examining food outlets in the UK Points of Interest data for 2012, Burgoine and Harrison( Reference Burgoine and Harrison 27 ) obtained a sensitivity value of 0·60 and PPV of 0·75, considerably higher than the values observed for commercial data sources in the present study; Clary and Kestens( Reference Clary and Kestens 28 ) similarly obtained higher PPV and sensitivity estimates (0·64 and 0·55, respectively) for their examination of the 2010 Enhanced Points of Interest data in Montreal. Both sets of researchers, however, had a smaller temporal difference between the last update of the secondary data source and their collection of ground-truthed data in comparison with the present study, suggesting that the difference in results may be explained by the depreciation of data quality over time.
Nevertheless, the current study found that overall both municipal data sets outperformed commercial data sets in measures of validity, even when the 2012, rather than 2015, Business Licence data were used for comparison. Much of the existing literature on the community nutrition environment surrounding schools has relied on commercial data sources such as the two data sets examined here( Reference Williams, Scarborough and Matthews 43 ). Our study suggests that municipal data sets can provide adequate alternatives that may offer higher-quality data than many of the data sets on which the community nutrition environment literature currently relies.
The present study also evaluated associations between neighbourhood socio-economic deprivation and commercial density with the odds of incorrect listings. This examination was valuable because systematic error in data sets could bias research findings: if data sets consistently fail to identify existing food retailers in low-income neighbourhoods, for example, researchers might underestimate low-income communities’ access to food retailers. In the absence of such bias, random error could create ‘noise’ that weakens the magnitude of observed associations (i.e. type 2 error when true associations are not detected). Thus, the results obtained here – of no consistent associations between neighbourhood socio-economic deprivation and the odds of false negative or false positive listings – are reassuring for researchers because they suggest that results regarding socio-economic disparities in food retail access are not subject to systematic bias. This finding is similar to the results of several previous studies that have reported no associations between measures of socio-economic deprivation and levels of commercial data set validity( Reference Paquet, Daniel and Kestens 22 , Reference Cummins and Macintyre 23 , Reference Rossen, Pollack and Curriero 26 – Reference Clary and Kestens 28 ).
The present study did, however, find positive associations between the odds of false negative listings and commercial density in three of four data sets. Similar results were reported in Chicago where more disagreement between secondary and ground-truthed data was found for stores closer to the city’s central business district( Reference Bader, Ailshire and Morenoff 24 ). Areas close to the central business district are among the city’s most commercially dense neighbourhoods, so these results suggest that researchers would obtain lower validity scores in more commercially dense areas. It is worth noting that we conducted a sensitivity analysis using population density as an alternative measure of urbanicity, which did not find evidence of significant associations between that measure and odds of false positives or false negatives in any data set. We did not have access to data regarding business turnover, but hypothesize that more commercially dense Vancouver neighbourhoods (but not necessarily those with higher population densities alone) may have more outlets opening annually and thus more stores that can be missed. Researchers using commercial data to compare areas with higher and lower commercial density should therefore bear in mind potential impacts of such systematic error.
Despite the evidence of low levels of validity, community nutrition environment measures constructed from the commercial and municipal data sets were highly correlated with measures from ground-truthed data. This observation is consistent with findings of two other known studies examining the effect of data set validity on community nutrition environment measures: Ma et al.( Reference Ma, Battersby and Bell 63 ) found that measures of food deserts, which are low-income areas where residents lack access to grocery stores or supermarkets, created from two commercial data sets (InfoUSA and Dun & Bradstreet) had 93·5 % concordance with comparable measures obtained from the US Department of Agriculture and the Centers for Disease Control and Prevention; and Lebel et al.( Reference Lebel, Daepp and Block 64 ) found that estimates of food stores per 1000 people constructed from a commercial data set (InfoUSA) had 86·9 % correlation with estimates calculated from a gold standard data set (Boston Inspectional Services Department). The high levels of undercounting and overcounting estimated with low sensitivity and PPV, respectively, may offset one another, resulting in data that remain representative of the true community nutrition environment. Thus low validity scores did not translate into low validity for measures of relative access to food outlets, which suggests that relying on validity scores alone may lead researchers to underestimate the usefulness of secondary data sets for research on the community nutrition environment( Reference Lebel, Daepp and Block 64 ).
Several notable limitations of the present study should be considered. Foremost, because ground-truthed data were collected in 2015, depreciation of data quality over time may contribute to the lower validity scores the study obtained for commercial data sets (collected in 2012 and 2013) in comparison with the municipal data sets, which were collected immediately after the completion of ground-truthing in 2015. However, the inclusion of both current (2015) and historical (2012) Business Licence data suggests that depreciation explains only part of the difference in validity. The two commercial data sets still performed between 5 and 10 percentage points worse in PPV and nearly 20 percentage points worse in sensitivity scores compared with the municipal Business Licences for 2012. Additionally, findings may not be generalizable to other cities because of variance in municipal data set quality, and the findings may overestimate validity for studies that do not follow the data cleaning and classification protocols used in the current research( Reference Jones, Zenk and Tarlov 65 ). It should also be noted that the gold standard, ground-truthed data, is subject to error that could contribute to the low validity scores estimated for secondary data sets. Although inter-rater reliability in store classification was high, it remains possible that surveyors missed stores or that results were affected by turnover in Vancouver storefronts. Finally, our definition of the community nutrition environment was limited to publicly accessible food outlets; places with restricted access such as office cafeterias or school snack shops were not examined in the study because they are considered to comprise the ‘organizational’ nutrition environment rather than the community nutrition environment( Reference Glanz, Sallis and Saelens 8 ).
Further research is needed to understand why measures of proximity and density from secondary and ground-truthed data remained highly correlated despite low levels of sensitivity and PPV. Researchers also need to continue working on classification schemes that could reduce the over- and undercounting attributable to reliance on industrial classification codes. Finally, studies are needed that examine how error may affect the outcomes ultimately of interest: the associations between diet-related health outcomes and community nutrition environment exposures.
Nevertheless, the present research remains relevant to researchers outside Vancouver in both its methods and its findings. The inclusion of multiple years of municipal data offers researchers insight into the effects of depreciation over time. The finding of an association between error and commercial density joins several studies suggesting that researchers should be concerned with the effects of commercial density on data quality. Furthermore, the method of calculating the correlation between community nutrition environment measures from secondary data sets and ground-truthed data could be replicated with data sets in other geographic and national contexts, an effort that would help bring researchers a step closer to understanding the impact of error on the results obtained in community nutrition environment studies.
All data sets examined in the present study scored relatively poorly across validity measures. Three of the four data sets also had evidence of systematic bias in association with commercial density, although no data sets were systematically more likely to over- or undercount outlets in relation to neighbourhood socio-economic status. Nevertheless, community nutrition environment measures constructed from both municipal and commercial data sources were highly correlated with ground-truthed measures, suggesting that data sets with low validity scores may still offer reliable measures of community nutrition environment exposure.
The City of Vancouver Business Licences outperformed other data sources in measures of sensitivity and in its lack of systematic error in association with neighbourhood characteristics. Furthermore, community nutrition environment measures constructed from the Business Licences and those constructed from ground-truthed data were highly correlated. The present study thus suggests that the Business Licences offer the best available data set for community nutrition environment research in Vancouver. For studies using commercial data providers, the study suggests that researchers should be wary of systematic error in association with commercial density. While such data sets perform reasonably well for studies quantifying relative community nutrition environment exposures, they may be less useful for policy makers or planners seeking to identify specific food outlets.
Acknowledgements: Koharu Chayama and Cayley Velazquez assisted with the ground-truthing of food outlets in Vancouver. The authors would also like to thank Carol McAusland and Nadine Schuurman for guidance and comments. Financial support: This study received funding from the Canadian Institutes of Health Research (grant number FRN 119577). In addition, M.I.G.D. was funded by the University of British Columbia Li Tze Fong Fellowship (grant number #4895). The funding agencies had no role in the design, analysis or writing of this article. Conflict of interest: The authors declare that they have no competing interests. Authorship: Both J.B. and M.I.G.D. contributed to study design. J.B. was the principal investigator and supervised the research. M.I.G.D. developed and led the ground-truthing protocol, sourced secondary data sets, performed data analyses and drafted the manuscript. J.B. contributed to manuscript writing and editing, and both M.I.G.D. and J.B. reviewed the manuscript and approved the final version. Ethics of human subject participation: This study did not include human subjects research.
To view supplementary material for this article, please visit https://doi.org/10.1017/S1368980017001744