A distributed geospatial approach to describe community characteristics for multisite studies

Understanding place-based contributors to health requires geographically and culturally diverse study populations, but sharing location data is a significant challenge to multisite studies. Here, we describe a standardized and reproducible method to perform geospatial analyses for multisite studies. Using census tract-level information, we created software for geocoding and geospatial data linkage that was distributed to a consortium of birth cohorts located throughout the USA. Individual sites performed geospatial linkages and returned tract-level information for 8810 children to a central site for analyses. Our generalizable approach demonstrates the feasibility of geospatial analyses across study sites to promote collaborative translational research.


Introduction
Maintaining patient privacy is a common challenge faced by researchers seeking to understand the relationship of place and health [1][2][3][4]. This issue can be especially problematic in multisite studies due to study protocols and confidentiality concerns that limit the sharing of geographic data. Existing approaches to geospatial analyses in multisite studies include the use of a central site to conduct analyses or the application of spatial techniques (e.g. changing geographic coordinates to protect confidentiality, i.e. "geomasking") to protect patient privacy [4]. However, the former method is limited by data use agreements (DUAs) and the latter may result in exposure misclassification when geomasked locations result in spatial misalignment.
Alternatively, study sites may perform geocoding and geospatial linkages independently before removing identifiable information for joint analyses. This decentralized approach, however, faces challenges in reproducibility and standardization due to geocoding methods, geographic information software (GIS), and expertise that varies across study sites. Here, we describe the application of a novel method to perform reproducible, standardized, and confidential geospatial analyses for multisite studies. Our approach extends a previously developed Decentralized Geomarker Assessment for Multi-site Studies (DeGAUSS) containerization platform to perform geocoding and extraction of polygon feature geospatial data over multiple time periods and large geographic areas [5]. As an example case, we use our approach to ascertaining US census tract-level information for participants enrolled in the Children's Respiratory and Environmental Workgroup (CREW), a network of 12 birth cohorts each studying the development of allergy and asthma in childhood [6].

Study Population
Our approach was motivated by the CREW consortium, a network of birth cohorts recruited from 1980 to 2020. Information regarding participating cohorts, including eligibility criteria, study recruitment, and other methods, are published [6] and in Supplementary Table 1. Due to its large sample size and geographic distribution, CREW provides a unique platform to examine environmental and community factors that contribute to the disproportionately higher burden of asthma-related morbidity and mortality among disadvantaged communities [7][8][9][10]. A CREW data sharing protocol and DUA were approved by the local IRB for each cohort. However, the DUA allowed only limited datasets to be shared and prohibited the distribution of identifying information, including addresses or geocodes.

Distributed Geospatial Analysis
We extended our DeGAUSS software to enable all CREW sites, including those with limited geospatial expertise, to derive spatio-temporal US census tract-level information for their participants at birth. A key advantage to DeGAUSS is the use of a software containerization platform to wrap necessary software, system dependencies, and geospatial data in a stand-alone package that will work the same regardless of its host environment [11]. Previously, we created DeGUASS and applied this tool in the Electronic Medical Records and Genomics (eMERGE) network in a proof-of-concept study [5]. Whereas our prior DeGAUSS software included only a geocoder and code to link geospatial coordinates to nearby roadways and one census tract variable, the CREW consortium required significantly expanded geographical and temporal data to be included for analyses. Therefore, a new custom DeGAUSS container containing decennial US Census data (described below), census tract polygon boundary files for the 1980,1990,2000, and 2010 census, and R code (R Foundation for Statistical Computing: Vienna, Austria; 2014) was created to merge census tract-level data to the geocoded locations of CREW birth addresses. Additional details regarding DeGAUSS and the CREW container are provided in the supplementary materials and online [12,13].
A flow diagram depicting the distributed approach to geospatial analyses for the CREW consortium is provided in Figure 1. The DeGAUSS container image was created by C.B. at a single location and distributed to each cohort. The DeGAUSS software required cohort users to provide an input.csv file containing the geocoded coordinates of their participants' birth record address. Site end users also specified the census year to assign the appropriate tract boundary file and census data based on participants' year of birth. The output data file from the DeGAUSS container contained the original input data, including geocoded locations, and appended census tract information including the census tract FIPS code in which the birth record address was located and census variables. Site end users manually removed identifying information, including the geographic coordinates and census tract FIPS code, prior to returning the de-identified dataset to a central coordinating center.

US Census/American Community Survey Data
Longitudinal US Census data and boundary files for the years 1980, 1990, 2000, and 2010 were downloaded for the entire USA from Social Explorer (www.socialexplorer.com, New York City, NY: Social Explorer 2017; accessed 12/17-1/18) by year, requested variables, and tract level of geography. For 1980, certain census variables were only available from the National Historic GIS data service (Minneapolis, MN: NHGIS; accessed 11/17-1/18) and were updated to reflect variable calculation definitions used in later censuses. As the 2010 census did not include information regarding median household income, median gross rent, or median housing values, these data were downloaded from the 2008-2012 American Community Survey (ACS, https://www.census.gov/programssurveys/acs/data.html; accessed 12/17-1/18). Additional information regarding the census and ACS variables downloaded and included in the DeGAUSS container for linkage to birth record addresses is available in the supplementary material.

Statistical Analyses
After sharing de-identified data with a central site, descriptive statistics and box-and-whisker plots for all census tract variables were calculated for the combined CREW consortium and for each cohort individually. Comparison of census tract-level to selfreported race, ethnicity, and household income was conducted by plotting the distribution of each census tract-level variable according to self-reported variable. Self-reported household income was compared to census tract median household income using each cohorts' income categories as collected by questionnaire. Additional details regarding self-reported race, ethnicity, and income information are available in the supplementary material.

Results
All cohorts (n = 12) completed the distributed analysis and returned de-identified data to the coordinating center. Collectively, 8997 participants were enrolled in CREW cohorts, and 98% (n = 8810) of these had birth record addresses geocoded with sufficient precision for linkage to a census tract.
A summary of population, race, ethnicity, and income data for the census tracts in which participants resided at birth is provided in Table 1. CREW participants resided in both low and high population density regions as reflected in the average tract population density (persons per km 2 ) that ranged from 148 for Wisconsin Infant Study Cohort (WISC) participants in rural Wisconsin to 45,772 for CCCEH participants in New York City (Table 1). Overall, CREW participants lived in tracts that were 67% White, 23% Black, 2% Asian, and 9% Other race. There was, however, variability in tract racial distribution both within and across cohorts (Table 1); eight cohorts enrolled participants from tracts where more than 75% of the population was White, while participants enrolled in URECA, WHEALS, and CCCEH resided in census tracts with populations having a greater proportion of Black or Other race. Most cohorts enrolled participants from tracts having a Hispanic population less than 10%, though the mean Hispanic population in tracts of CCCEH, IIS, TCRS, and URECA (Boston, MA and New York, NY sites) participants ranged from 21 to 57%. The overall mean of households in poverty was 15% in census tracts where CREW participants resided at birth but as shown in Figure 2, there was significant variability within and across cohorts. Additional information on census tract-level median household income, percentage (%) of households in poverty, % occupied housing, and median housing value as indicators of neighborhood income and housing is provided in the supplementary materials. The distribution of tract-level race data (%White, %Black, % Asian, %Other) according to participants self-reported race is presented in Figure 3. Overall, participants who reported being White or Asian race lived in census tracts with majority White populations, whereas participants who reported Black race resided in census tracts with greater variability in shares of White and Black populations (Figure 3). Additional comparisons of individual-level to neighborhood-level income and ethnicity are provided in the supplementary material.

Discussion
Environmental exposures and the community in which they occur are significant causes of human disease, including asthma [7,[14][15][16][17]. Disentangling the environmental and social exposures that contribute to health disparities necessitates geographically and culturally diverse studies. The methods described here make the characterization of community characteristics in a confidential yet standardized and reproducible manner more feasible for researchers and policy-makers. Importantly, our method is generalizable to additional types of geographic data, including polygon and point data, allowing other studies to customize and incorporate geocoding and geospatial analyses into their approach.
Our distributed geospatial approach offers some important advantages compared to existing methods, including reduced exposure misclassification, maintaining participant confidentiality, and reducing the need for geospatial expertise at each study site [1]. One existing and alternative method to maintain subject confidentiality is the alteration of participants' geographic coordinates. Referred to as "geomasking" or "jittering," this approach involves either a random shift in the location of subjects or a systematic transformation of the locations known only to the researchers [4]. However, this method may introduce errors or biases introduced due to the displacement of the participants' actual location, particularly for analyses requiring an exact location. For example, the amount of geomasking required to make geographic datasets de-identified according to HIPAA standards may result in incorrect census tracts being linked to individual subjects resulting in exposure misclassification. An alternative approach to incorporating geospatial information into multisite studies is to obtain IRB approval and DUAs to share subjects' identifiable information with a central site for geocoding and analysis. Challenges with this strategy include hesitation on the part of institutions to share identifiable information (e.g. addresses). Performing geospatial linkages at individual sites is another approach but may produce non-standardized and non-reproducible results due to the use of varying geocoding platforms, software, and dataset.
Our DeGAUSS method overcomes these limitations because the geospatial data and software are developed at a central site, ensuring that all individual study sites run the same software on identically constructed datasets. Importantly, our method also accounts for spatial, temporal, and informational changes in census tract boundaries and census data over the study period.   Fig. 3. Comparison of census tract race to participants' self-reported race.

Journal of Clinical and Translational Science
Participants in CREW cohorts were born over nearly four decades, and, as detailed in the accompanying supplement, linking geocoded addresses with the appropriate census tract at the time of participant birth was critical to our approach. By including census tract boundaries and accompanying data from 1980, 1990, 2000, and 2010 in our DeGAUSS software, we were able to link study participants to the appropriate census tract and data for their date of birth. Of note, the DeGAUSS platform is flexible and amenable to additional geographic data and analyses, including derivation of nearby greenspace, distance to roadways and other locations, estimating gridded air pollution exposures, area-based indices of neighborhood deprivation, and others. Specific details and example uses are available on the DeGAUSS website [12,13].
As part of the NIH Environmental influences on Child Health Outcomes (ECHO) program, the objective of CREW is to understand the etiology of childhood asthma and determine environmental and genetic contributors. As such, CREW offers a unique opportunity to examine the significantly higher rates of asthma prevalence, hospitalization, and morbidity among children residing in households and neighborhoods with lower SES, as compared to White children residing in higher socioeconomic status (SES) households and communities [18][19][20]. Multiple influences contribute to these disparities including disproportionately higher exposure to air pollution, poor housing, limited access to care, and other adverse physical and toxicant contributors, and inherited factors may increase individuals' sensitivity to these [21][22][23][24][25]. However, environmental exposures and inherited factors alone do not fully explain the observed health disparities. For that reason, hardships linked to poverty, including discrimination, stress, family turmoil, violence, instability, and others have been posited to play an important role in the social patterning of disease [26]. These social determinants of health may be separate from access to medical care and are important drivers not only in asthma morbidity, but also in a wide range of both adult and pediatric health outcomes, including mortality [27,28]. The importance of socioeconomic factors is highlighted by observations that disparities in health outcomes across strata of SES remain present within racial/ethnic groups [28].
Our methods and results should be considered, however, in the context of some limitations. Census tract boundaries alone may not accurately describe individuals' neighborhood experience and the use of decennial census data rather than ACS data resulted in reduced temporal precision but allowed us to increase our historic reach to 1980. There are also limitations to the US Census data including a lack of specificity in certain variables such as ethnicity. For example, Hispanic participants in the TCRS and IIS in Tucson, AZ, likely differ in ancestry from Hispanic participants from cohorts in New York City, NY. Finally, our approach requires expertise at individual institutions to obtain patient or electronic health records.
In conclusion, we demonstrated the use of a distributed approach to conduct geospatial analyses for the 12 CREW cohorts that is also applicable to other multisite studies. Future applications of our method will include additional gridded data including land cover, air pollution models, and meteorological information. Future research to understand the etiology of childhood asthma will incorporate longitudinal residential locations throughout childhood and multilevel analyses of individual and neighborhood environments.