The focus of the next generation of gene–environment research will add development into the equation and focus upon gene–environment–development interactions (Rose & Dick, 2010, pp. 1854–1855).
It is clear from twin studies that the relative significance of genetic and environmental factors changes across some stages of development, notably childhood and adolescence (reviewed in Dick, 2011). Indeed, some have suggested that one reason for failures to replicate gene–environment findings may be the different developmental status of the samples. For example, Cole and colleagues (2011, p. 1174) have argued that one explanation for the evolutionary maintenance of genetic features that are maladaptive late in life is that they must be selectively advantageous earlier in life ‘or else they would have been eradicated from the gene pool by adverse selection’. But there are many genetically driven changes with a strong developmental component, such as puberty or menopause, or the timing and synchronization of myelination in the central nervous system (van Ijzendoorn et al., 2011).
Grappling with gene–environment–development in the same study is important not only empirically but also methodologically:
We have known for decades that failure to incorporate both genetic and environmental factors in a joint analysis will weaken the observed associations between a true risk factor and disease occurrence. Because the pools of susceptible and non-susceptible persons are mixed, the observed associations tend to be shifted toward the null . . . Theoretically, if we are able to measure gene–environment interactions, we should sharpen our measurements of effects in subsets of the population and even potentially increase our statistical power in measuring such effects. (Khoury & Wacholder, 2009, p. 228)
The same argument can be applied to the importance of incorporating developmental measures (e.g., Chiang et al., 2011).
Longitudinal studies with repeated, prospective assessments using standardized measures of phenotype and envirotype offer several opportunities for improved data quality compared with the standard methods of genetic case-control studies, which tend to rely on lifetime retrospective assessments using multiple diagnosticians and uncertain control over the uniformity of diagnostic criteria. In a longitudinal study, diagnoses are likely to be more consistent and there is likely to be less recall bias in participants’ psychiatric histories, which means fewer false negatives. Both of these qualities increase statistical power in hypothesis testing (Anastasi, 1950; Luan et al., 2001; Wong et al., 2003, 2004). The timing of environmental events relative to the onset of a disorder is also likely to be more accurate and less vulnerable to ‘seeking after meaning’ (Spatola et al., 2011).
The fourth reason to explore the use of longitudinal studies for genetics is that each participant provides multiple measures of phenotype and envirotype. This increases the total number of observations, even after the corrections necessary to deal with the wave-to-wave correlations within individuals (Dunlap et al., 1996).
Despite theoretical discussions on the importance of gene–environment studies based on Genome-Wide Association Studies (GWAS), sometimes called Gene–Environment-Wide Interaction Studies (GEWIS; Khoury & Wacholder, 2009), and even some discussion of developmental GEWIS (Lenroot & Giedd, 2011; Rose & Dick, 2010), there are few published results of GEWIS analyses in the behavioral sciences. The few studies of developmental effects on gene–environment interplay have focused on individual genes such as BDNF (Casey et al., 2009); none has yet applied a genome-wide approach to psychiatric disorders (Lenroot & Giedd, 2011).
It is not hard to see why developmental GEWIS studies are so rare. The costs of initiating and maintaining large longitudinal studies are extremely high, and it is important that subjects be assessed using similar measures of both phenotype and environment over key developmental stages. Clinical studies rarely cover sufficiently long periods of time, so epidemiological samples are needed. Birth cohorts or other life-course epidemiologic studies, such as the nascent US National Children's Study, are potential sources.
Another solution to this problem is to bring together multiple data sets to conduct joint analyses or meta-analysis. This approach depends crucially on the ability to combine the data across studies. Even before genetic analyses can begin, it is necessary to develop and test methods for harmonizing data across studies (Bookman et al., 2011; Cornelis et al., 2010; Fortier et al., 2011).
The National Institute on Drug Abuse (NIDA) and the National Cancer Institute (NCI) recognized both the promise and the problems of developmental GEWIS when they wrote the following in the Request for Applications:
Over many years, NIDA, other NIH Institutes, and other organizations have funded numerous high-quality longitudinal and developmental studies that contain a wealth of data from individuals who are at risk for, or are in the course of development, progression, and desistance of substance abuse and related phenotypes. . . . The GEDI [Gene-Environment-Development Initiative] seeks to build on this substantial public investment by soliciting applications that integrate environmental and developmental variables with genotypic information in order to permit comprehensive model-building and hypothesis testing for determining genetic, environmental, and developmental contributions to substance abuse and related phenotypes. (NIH/NIDA; R01DA024413)
NIDA and NCI hoped to take already existing materials and see if they could be woven into something that, if created from scratch, would have taken 20 years and untold millions of dollars. If successful, GEDI would be a proof-of-concept that could lead perhaps to an expansion of the collaborative group of studies.
In summary, we report here on a proof-of-concept study to carry out gene–environment, gene–development, and gene–environment–development analyses (both parallel and meta-analytic) using longitudinal, population-based data sets with repeated measures over childhood, adolescence, and early adulthood, with DNA available or obtainable, with comparable measures of drug and alcohol use, abuse, and dependence, and also of key environmental exposures.
Materials and Methods
Common GEDI Study Characteristics
The data sets that make up the consortium have the following characteristics in common: (1) general population samples; (2) multiple waves of data collection across childhood, adolescence, and young adulthood; (3) detailed assessments of drug use, abuse, and dependence (substance use disorders (SUDs)) and drug abuse symptoms; (4) assessments of comorbid psychiatric disorders, diagnosed using the Diagnostic and Statistical Manual (American Psychiatric Association, 1994) and psychiatric symptom scores; (5) measures of a range of environmental exposures, including serious life events. Methods used to collect information on diagnoses, symptoms, and environmental factors are described first, followed by brief descriptions of each study. Table 1 presents a summary of similarities and differences. Further details can be found in study-specific publications cited below.
GSMS = Great Smoky Mountains Study; CCC = Caring for Children in the Community; VTSABD = Virginia Twin Study of Adolescent Behavioral Development; CHDS = Christchurch Health and Development Study.
1. Virginia Twin Study of Adolescent Behavioral Development (VTSABD; Simonoff et al., 1997).
The VTSABD is a cohort-longitudinal study of twins born between 1974 and 1983, ascertained primarily through the state school system and participating private schools in Virginia. Of 1,894 putative twin pairs, 1,412 families (75%, 2,775 children) participated and were included in the first wave of data collection. Three subsequent waves of data collection occurred at approximately 1½-year intervals, and the fifth wave when participants were in their mid-twenties (see Table 1). The study was limited to subjects of European ancestry as insufficient numbers from other ancestry groups were ascertainable. Parents completed a similar assessment on both twins. After age 18, the twins alone were interviewed individually by telephone. Over 8,500 family interview sets (parents and twins 8–17, twins 18+) have been completed. Variable numbers of subjects completed each interview wave, and 2,289 (82%) of the Wave 1 sample have completed the fifth wave.
2. Great Smoky Mountains Study (GSMS; Costello et al., 1996, 1997).
Three cohorts of boys and girls, aged 9, 11, and 13 years at intake in 1993, were selected from a rural population of some 20,000 children using a household equal probability design. A two-phase procedure was used for White and African-American youth to increase power by oversampling children at risk for psychiatric disorders and SUDs. Parents (usually mothers) of the first-stage random population sample completed a questionnaire about their child's behavioral problems. Of 4,195 subjects selected, 95% (N = 3,896) of parents completed the screen. All children scoring above a predetermined threshold (the top 25% of the total scores), plus a 10% random sample of the remaining 75%, were recruited for detailed interviews. Results can be back-weighted to population levels for analyses. Half of the sample consists of females, and 6% are African Americans, reflecting the population of the study area. The interviewed sample of White and African-American subjects was 1,070 (80% of those recruited). American Indian youth were oversampled (100%) because they are an understudied group known to be at high risk for stressful events, substance disorders, and mood disorders. Of 431 age-eligible children, 350 (81%, 49% girls) participated. Thus, the total GSMS sample size is 1,070 + 350 = 1,420. Data collection is complete for ages 9–26, and age 30 interviews are in progress. By age 26 a total of 9,858 interviews had been completed; the average number of interviews per subject was seven, and by age 26, 97.3% of subjects had completed two or more interviews.
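The back-weighting mentioned above can be illustrated with inverse-probability sampling weights. The following is a minimal sketch, assuming the second-phase selection fractions described in the text (all high scorers, 10% of the remainder); the function names and the toy records are hypothetical, and the weights actually used in GSMS may incorporate further adjustments (for example, for non-response) that are not described here.

```python
def sampling_weight(screen_positive, p_high=1.0, p_low=0.10):
    """Inverse of the (assumed) probability of selection for interview."""
    return 1.0 / (p_high if screen_positive else p_low)


def weighted_prevalence(records):
    """Estimate population prevalence from (screen_positive, has_disorder) pairs."""
    total = sum(sampling_weight(pos) for pos, _ in records)
    cases = sum(sampling_weight(pos) for pos, dis in records if dis)
    return cases / total


# Hypothetical interviewed sample: four screen-positive and two
# screen-negative children; three of the screen-positives have a disorder.
sample = [(True, True), (True, True), (True, True), (True, False),
          (False, False), (False, False)]

unweighted = sum(1 for _, dis in sample if dis) / len(sample)
weighted = weighted_prevalence(sample)
```

In this toy sample the unweighted proportion overstates prevalence because high-risk children are over-represented; each screen-negative child stands in for ten children in the population, which pulls the weighted estimate down.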
3. The Caring for Children in the Community Study (CCC; Angold et al., 2002).
This representative study of psychiatric illness and service use in African-American and White youth took place in four rural counties in the southeastern USA. The two-stage sampling design and methods are similar to those used in the GSMS. Of 4,500 youth randomly selected from the 17,117 9- to 17-year-olds in the public schools' database, 3,613 (80.0%) were successfully contacted and agreed to complete the behavioral screen. Of the 1,302 selected to participate in the study, 920 (70.7%) interviews were completed. Because CCC was the only study in GEDI to contain more than a very few African-American participants, these were omitted from the multi-site analyses.
4. Christchurch Health and Development Study (CHDS; Fergusson & Horwood, 2001).
The CHDS is a longitudinal study of a birth cohort from New Zealand. The cohort was based on an unselected sample of 1,265 consecutive births (635 males; 630 females) occurring in the Christchurch urban region in mid-1977. The cohort has been studied at birth, 4 months and 1 year of age, annual intervals to the age of 16 years, and again at ages 18, 21, 25, and 30 years. Sample retention rates were high throughout the study and at age 30 the study was still able to assess over 80% of the surviving cohort.
Informed Consent in Each Study
Participants in all the studies gave consent for their DNA to be genotyped. However, depositing biological samples and genetic data in controlled-access biorepositories (e.g., dbGaP; Mailman et al., 2007) required a different level of consent. This was obtained for GSMS and VTSABD participants in year 1 of GEDI. Further consents were not required from CCC as the study was closed. CHDS subjects gave consent for genotyping only.
Each Institutional Review Board (IRB) had slightly different requirements for consent forms, but in general study participants were given the opportunity to consent to (1) completing only the assessment instruments; (2) assessment plus DNA collection for internal use only; or (3) assessment and DNA collection, the anonymized data to be put into a repository.
Blood Samples and Genotyping
Nine milliliters of blood were collected from each of the VTSABD participants in the first year of the GEDI study, that is, when subjects were aged 25 to 34. Blood and informed consent for genotyping and storage in dbGaP were obtained from 913 participants, of whom 281 were co-twins.
GSMS and CCC
Blood from the GSMS and CCC samples was collected at each assessment: 10 finger-stick samples were collected on specially prepared paper, dried, and shipped to the study laboratory, where they were stored at –23°C until they were assayed. A pilot study showed that even after 10 years of storage adequate DNA could be extracted from these samples. Most subjects (94%) provided at least one sample; the one collected as close to age 19 as possible was used for genotyping. Because there were so few African American participants for any of the data sets except CCC, the multi-site analyses excluded them, leaving 196 CCC and 784 GSMS participants with adequate genotype data and, in the case of GSMS, consent to deposit data in dbGaP. Since CCC was a closed study, no further consents were needed.
In the CHDS, beginning in 2004 (at age 28), participants were asked for consent to provide a saliva sample for DNA, and 918 (90% of the surviving cohort) consented. Consent for DNA collection was separate from consent for the rest of the study. In 2008–2009, participants were asked for consent for the GEDI multi-site GWAS, and 813 consented. Of these, 86% provided peripheral blood samples, 8% provided saliva, and 6% provided buccal swabs (the latter proved not to provide samples of sufficient quality for genotyping). After quality control checks, good quality data were obtained on 747 participants. The New Zealand government does not permit the data to be deposited in dbGaP.
Blood samples from the VTSABD, GSMS, and CCC samples were sent to the Rutgers University Cell and DNA Repository for DNA extraction, and to the Genotyping Shared Resource at the Mayo Clinic Cancer Center for genotyping. DNA for the CHDS sample was prepared in New Zealand and also sent to Mayo for genotyping. DNA samples were randomized to plates within studies. All samples were genotyped using Illumina Human660W-Quad v1 DNA Analysis BeadChips. Quality control was carried out in the Department of Genetics at the University of North Carolina, Chapel Hill. In each of the four samples, single nucleotide polymorphisms (SNPs) with missing rate > 0.01, minor allele frequency (MAF) < 0.05, or extreme deviation (p < 10⁻⁶) from the Hardy–Weinberg equilibrium (HWE) were removed from further analysis. Subjects with missing rate > 0.01 or unusual genome-wide homozygosity (|normalized homozygosity rate| > 5) were excluded. Sex was investigated using the no-call proportions of chrY SNPs and heterozygosity proportions for chrX SNPs. Mislabeled sex information was corrected after double-checking with the original data, and subjects with unexplainable results were deleted. In addition, pairwise identical-by-descent (IBD) estimation was evaluated to identify unexpected duplicates and relative pairs. We imputed SNP dosages in all samples using MACH (Liu et al., 2010b). The imputation reference was HapMap3 CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) for subjects of European ancestry. All subjects with pc1 < 0 were grouped as ‘white’ and were imputed using HapMap3 CEU as reference; all subjects with pc1 > 0 were grouped together as ‘other’ and imputed using HapMap3 CEU + YRI as reference. Before imputation, the studies had between 496,000 and 515,000 SNPs that passed quality control. After imputation, all studies had over 1,193,000 total SNP values for analysis.
The number of SNPs used in particular studies may differ based on cut-offs used for minor allele frequency.
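The SNP-level filters described above (missing rate, MAF, and HWE deviation) can be sketched for a single marker as follows. This is an illustrative sketch only, not the pipeline used by the study: the function name is hypothetical, genotypes are assumed to be coded as allele counts (0, 1, 2) with None for a no-call, and the HWE test shown is a 1-df chi-square goodness-of-fit test, whereas GWAS software commonly uses an exact test.

```python
import math


def snp_qc(genotypes, max_missing=0.01, min_maf=0.05, min_hwe_p=1e-6):
    """Apply missing-rate, MAF, and HWE filters to one SNP.

    genotypes: list of 0/1/2 allele counts, with None for a no-call.
    Returns (passes, stats).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    missing_rate = 1 - n / len(genotypes)
    if n == 0:
        return False, {"missing_rate": missing_rate, "maf": 0.0, "hwe_p": 1.0}

    p = sum(called) / (2 * n)          # frequency of the counted allele
    maf = min(p, 1 - p)

    # Observed versus expected genotype counts under Hardy-Weinberg equilibrium.
    observed = [called.count(k) for k in (0, 1, 2)]
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function, 1 df

    passes = (missing_rate <= max_missing and maf >= min_maf
              and hwe_p >= min_hwe_p)
    return passes, {"missing_rate": missing_rate, "maf": maf, "hwe_p": hwe_p}


# Hypothetical markers: one in equilibrium, one with every subject heterozygous.
ok_snp, ok_stats = snp_qc([0] * 25 + [1] * 50 + [2] * 25)
bad_snp, bad_stats = snp_qc([1] * 1000)
```

The default thresholds mirror the values quoted in the text; as the following sentence of the article notes, individual analyses may apply different MAF cut-offs.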
Unobserved population admixture due to ancestry is a well-known confound in GWAS. To protect against false positives due to ancestry, we extracted five principal components from each sample to capture ancestral and cryptic population stratification. To improve the efficiency of the principal components analysis (PCA) for control of population stratification, a subset of independent SNPs was selected using the PLINK variance-inflation-factor pruning option (--indep) with the following parameters: window size = 50, number of SNPs to shift the window at each step = 5, and VIF threshold = 2. PCA was applied to the selected SNPs using the smartpca module of EigenSoft (Price et al., 2008). Between 77,155 and 79,517 SNPs were used for each of the samples that we analyzed in the ancestry PCA. All genotyping and quality control (QC) was done blind to phenotype.
The VCU samples included in the analysis were all White, and HapMap3 CEU was used as the reference data to do the imputation for all the subjects from this data set. Most of the New Zealand samples were also White, although some were either Maori or mixed Maori and White. Since the number of Maori was small, we ignored the Maori's Asian genetic background and used HapMap3 CEU samples as the reference data to run the imputation for all subjects. The imputation quality will not be perfect for Maori, but we used Rsq (the imputation quality score from MACH) to remove badly imputed SNPs.
The Duke samples contained a range of populations, including White, African American, American Indian, Hispanic, and Asian, but most were either White or Black. For the imputation purpose, we split the samples into two groups. All the subjects with pc1 > 0 were grouped as ‘Black’ and were imputed using HapMap3 CEU + YRI as reference. All the subjects with pc1 < 0 were grouped as ‘White’ and were imputed using HapMap3 CEU as reference.
Each study collected extensive information on individual, family, and community risk for psychopathology. For the first analyses we selected a measure of exposure to potentially stressful life events (SLE) because several candidate-gene-based gene–environment studies have demonstrated significant gene–environment interplay using this measure (Karg et al., 2011). At each assessment, participants in all studies were asked to indicate whether they had experienced any of several potentially stressful life events, such as losing a friend or moving. Each study provided count variables of the total number of recent stressful life events. The period assessed varied by study from the previous 3 months to the previous 12 months. Stressful life event terms were centered on the study mean to reduce multi-collinearity with interaction terms. Although studies employed different primary periods, the parameter estimates for the association between stressful life events and substance-related outcomes were similar across studies. The study will focus only on effects that are robust to modest between-study differences in measurement or period assessed.
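The centering step can be sketched as follows: life-event counts are centered on the mean of their own study before the SNP × SLE product term is formed. The field names (`study`, `sle`, `dosage`) and the toy rows are hypothetical and chosen only for illustration; the covariates and model form used in the empirical reports are not shown.

```python
from statistics import mean


def center_by_study(rows):
    """Add 'sle_c': the life-event count minus the mean for that row's study."""
    by_study = {}
    for r in rows:
        by_study.setdefault(r["study"], []).append(r["sle"])
    study_means = {s: mean(v) for s, v in by_study.items()}
    return [dict(r, sle_c=r["sle"] - study_means[r["study"]]) for r in rows]


def add_interaction(rows):
    """Add 'snp_x_sle': SNP dosage multiplied by the centered life-event count."""
    return [dict(r, snp_x_sle=r["dosage"] * r["sle_c"]) for r in rows]


# Hypothetical observations from two studies.
rows = [
    {"study": "A", "sle": 0, "dosage": 1.0},
    {"study": "A", "sle": 4, "dosage": 2.0},
    {"study": "B", "sle": 1, "dosage": 0.0},
    {"study": "B", "sle": 3, "dosage": 1.0},
]
model_rows = add_interaction(center_by_study(rows))
```

Centering within study also removes between-study differences in the average count, which is relevant here because the studies assessed different recall periods.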
Data analytic methods for each substance (cannabis, alcohol, and nicotine) varied, and will be described in the empirical reports. However, there are some general principles that we discuss here.
The overall goal is to determine which, if any, of the measured or imputed SNPs contribute to the explained variance in substance involvement after controlling for ancestry, sex, and age. We expect that an SNP may contribute to substance involvement directly, via a main effect of the SNP on substance involvement, but that an SNP may also have a heterogeneous effect across individuals due to differential environmental exposures, including time. The degree to which genetic information influences substance involvement across the lifespan may vary over time as a function of time-specific life circumstances.
One of the issues that strongly influences the success of multi-site analyses is that of identifying measures of phenotypes or environmental factors that are comparable across data sets. For example, in the analyses described below, a factor score measuring alcohol involvement was estimated using Mplus 6.0 (http://www.statmodel.com/) from measures of quantity of use, frequency of use, and symptom counts common to all the data sets.
One of the unique features of GEDI is the rich developmental data inherent in each of the samples that allows us to investigate if there are genetic variants that influence substance use in a key period of development. As mentioned above, each of the samples covers a different age range (although they overlap across some ages) and each empirical paper takes a different approach to data harmonization and handling the developmental piece. As our first step toward data harmonization for GEDI, in consultation with our colleagues at Duke's Social Science Research Institute, we adopted the ‘transform and recode’ procedure most commonly used in harmonization studies (Bath et al., 2010). First, a key member of each study team is tasked to achieve consensus regarding whether it is possible to find variables (and associated response categories) that have the same ‘face value’. Next, a new harmonized variable is created for each ‘comparable’ existing variable set by applying the transform and recode procedure to one or both of the original study measures such that existing codes for categories can be merged and relabeled in each study depending on the precise wording and ordering of the categories (Fortier et al., 2011).
For example, two primary longitudinal measures of alcohol consumption over time were generated to study the main effects of genetic variants on alcohol consumption: (1) a mixed model which explicitly models the developmental alcohol consumption trajectory spanning adolescence and early adulthood (ages 12–30), and (2) a simple mean of alcohol consumption (drinks per week) using repeated measures collected across adolescence (ages 12–21) for each individual. We selected these two specifications because (1) the trajectory outcome was indicated to be the best fitting longitudinal model, taking advantage of all repeated measures, and (2) the mean adolescent consumption outcome provided a simpler summary of individuals’ drinking behavior, and thus provides greater continuity to existing literature (Agrawal et al., 2009, 2012; Grant et al., 2009). The harmonization methods used focused in this first instance on measures that are relatively constant in meaning across development, such as number of drinks per week. With the help of our colleagues from the Data Harmonization team at Duke, we will then tackle measures that may change either content or meaning across development; for example, the content of self-regulation differs from one age to another, as does the developmental significance of an alcoholic drink (Sung et al., 2004).
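The recoding step can be illustrated with a small sketch in which study-specific category codes are merged and relabeled onto one harmonized scale. Everything here is hypothetical: the study labels, the response codes, and the harmonized categories are invented for illustration and do not reproduce any GEDI instrument.

```python
# Hypothetical per-study recode maps onto a shared three-level scale.
RECODE_MAPS = {
    "study_a": {0: "never", 1: "monthly_or_less", 2: "monthly_or_less",
                3: "weekly_or_more", 4: "weekly_or_more"},
    "study_b": {1: "never", 2: "monthly_or_less", 3: "weekly_or_more"},
}


def harmonize(study, code):
    """Return the harmonized category, or None if the code has no agreed mapping."""
    return RECODE_MAPS[study].get(code)


# Two studies with different original codings yield the same harmonized value.
a_value = harmonize("study_a", 2)
b_value = harmonize("study_b", 2)
unmapped = harmonize("study_b", 9)
```

Returning None for an unmapped code keeps unharmonizable responses visible as missing values rather than silently forcing them into a category.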
Power for Analysis of Gene–Environment Interplay
In principle, biostatistical methods for testing for genetic association and gene–environment interplay do not differ from those for testing any other association, interaction, or correlation. The problem, of course, is the vast number of SNPs and environments (van den Oord, 2002) and the importance of controlling for false discoveries; that is, concluding that a marker affects an outcome when in reality it does not. We use an approach to control false discoveries based on the false discovery rate (FDR; Benjamini & Hochberg, 1995). In comparison to controlling a family-wise error rate – for example, the Bonferroni correction – the FDR (1) provides a better balance between the competing goals of finding true effects versus controlling false discoveries, (2) results in comparable standards for declaring significance across studies because it does not directly depend on the number of tests, and (3) is relatively robust against correlated tests (Borden et al., 1987; Fernando et al., 2004; Korn et al., 2004; Sabatti et al., 2003; Tsai et al., 2003). The FDR is commonly used in many high-dimensional applications and has also been applied successfully in the context of GWAS (Beecham et al., 2009; Lei et al., 2009; Liu et al., 2009). We chose the FDR threshold of 0.1 for declaring genome-wide significance (van den Oord & Sullivan, 2003), which means that on average 10% of the SNPs declared significant are expected to be false discoveries.
Operationally (Black, 2004), the FDR is controlled using q-values that are FDRs calculated using the p-values of the markers as thresholds for declaring significance (Storey, 2003). It is important to note that performing many GWAS analyses does not present a problem for the FDR because it controls the expected ratio of false to all discoveries. Thus, when many GWAS are performed the number of false positives will increase and so will the number of true positives. The expected ratio of false positives to all discoveries will therefore remain 0.1, our threshold for declaring genome-wide significance.
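As an illustration of how p-values are converted into FDR-based thresholds, the sketch below computes Benjamini–Hochberg adjusted p-values. This is a simplification and an assumption on our part: Storey's q-values additionally estimate the proportion of true null hypotheses, whereas the version here treats that proportion as 1, which makes it more conservative. The function name and the p-values are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for four markers.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = bh_qvalues(pvals)
significant = [i for i, qv in enumerate(qvals) if qv < 0.1]
```

With the 0.1 threshold used in the text, a marker is declared significant when its q-value falls below 0.1; here all four hypothetical markers qualify.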
Results, Cross-Validation, and Replication
The results of the first set of papers focus on the main effect of the SNP and the interaction term between the SNP and SLE exposure. The first three papers, currently under review, focus on the problem of alcohol use, number of cigarettes per week, and any cannabis use in the past 3 months. In each case, the environmental factor used was a measure of severe life events developed to be the same for all data sets.
The Gene-Environment-Development Initiative takes two approaches for testing the validity of the results: cross-validation and replication. For the former we present the analyses separately for each data set and compare size and direction of effects across studies. This provides a more powerful test than the standard replication study because it involves complete genome-wide comparisons rather than simply comparisons of a few sites selected from one data set. The disadvantage is that individual data sets are necessarily smaller than the combined GEDI data set. We are therefore working to find other data sets with which we can carry out standard replication studies: comparing results on the ‘top hits’ from GEDI. There are few other data sets with the characteristics of the studies included in GEDI (multiple measures across adolescence and early adulthood of both substance use and abuse, and relevant environmental risk factors), but we have identified three with which we are currently working: the Minnesota Twin and Family Study (Derringer et al., 2008), Finn Twin (Pagan et al., 2006), and the Center for Education and Drug Abuse Research (CEDAR) sample (Tarter & Vanyukov, 1994), with other collaborations under development.
The next stage in the program of data analysis is to broaden it to include gene-based analyses (Neale & Sham, 2004), pathway analyses (Wang et al., 2007), and polygenic risk score analyses (Purcell et al., 2009). Gene-based analyses test whether any genes harbor an excess of SNPs with small p-values. Such analyses must account for both gene length and linkage disequilibrium between SNPs (see VEGAS, Liu et al., 2010a, for one example). Pathway analyses similarly test for an enrichment of SNPs with low p-values in genes involved in specific functional pathways (such as those in the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes databases). Optimal approaches must account for varying gene size and SNP density, linkage disequilibrium within and between genes, and overlapping genes with similar annotations (see INRICH; Lee et al., 2012, for one example). Finally, the polygenic risk score analyses test a polygenic basis for the phenotype by looking at the variance accounted for by a given set of top SNPs determined by a p-value threshold (e.g., .005, .01, .10, or .25). In the first step, the sample is partitioned into discovery and replication sets. Parameter estimates, derived in the discovery sample, are used as weights to calculate scores in the replication set. Subsequently, the disease state in the replication set is regressed on the polygenic score, and p-values and pseudo-r² values are presented (see, for example, International Schizophrenia Consortium, 2009). In each case we propose to analyze the individual data sets, and also to perform a meta-analysis of the entire group.
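The scoring step of the polygenic analysis described above can be sketched as a weighted sum of allele dosages. This is a minimal sketch under stated assumptions: the SNP labels, effect estimates, and dosages are invented, the function name is hypothetical, and the subsequent regression of the phenotype on the score is not shown.

```python
def polygenic_scores(discovery, dosages, p_threshold=0.01):
    """Score replication-set individuals using discovery-set weights.

    discovery: {snp: (beta, p_value)} estimated in the discovery sample.
    dosages:   {person: {snp: allele_dosage}} for the replication sample.
    """
    # Keep only SNPs whose discovery p-value is below the chosen threshold.
    weights = {snp: beta for snp, (beta, p) in discovery.items()
               if p < p_threshold}
    return {person: sum(beta * d.get(snp, 0.0) for snp, beta in weights.items())
            for person, d in dosages.items()}


# Hypothetical discovery results and replication dosages.
discovery = {"snp1": (0.5, 0.001), "snp2": (-0.2, 0.2), "snp3": (0.1, 0.004)}
dosages = {
    "person_a": {"snp1": 2, "snp2": 1, "snp3": 0},
    "person_b": {"snp1": 0, "snp2": 2, "snp3": 1},
}
strict = polygenic_scores(discovery, dosages, p_threshold=0.01)
liberal = polygenic_scores(discovery, dosages, p_threshold=0.25)
```

Recomputing the score at several thresholds, as in the series of cut-offs listed in the text, shows how the score changes as more weakly associated SNPs are admitted.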
In these cases we are applying analytic approaches already in use in other GWAS studies to the GEDI program, but by focusing on the model term related to the interaction between the environmental exposure and SNP status, for example, this standard approach allows us to address a novel outcome – genes that moderate the association between the exposure and the outcome.
Next Steps: Candidate Gene Selection for Next-Generation Sequencing
In addition to GWAS, we will employ targeted capture (Gnirke et al., 2009) and massively parallel next-generation sequencing (McKernan et al., 2009) to exhaustively determine all genetic variations at selected genomic loci with evidence for involvement in SUD etiology. This approach uses a solution-based capture method (Gnirke et al., 2009), where genomic DNA from each subject is mixed with a ‘library’ of synthetic oligonucleotides, designed to be complementary to the genomic regions of interest. Molecular tags on these oligonucleotides allow them to be pulled out of solution, bringing the bound, complementary genomic DNA with them. This ‘captured’ DNA from each individual is labeled with a unique identifier and sequenced using next-generation, ultra-high throughput technology (Smith et al., 2010). This approach uses similar methods to exome sequencing (Ng et al., 2009), but here we will sequence the entire genomic region of each gene of interest, rather than just the exons, in order to capture all relevant variation.
We selected regions to include in our targeted capture library as follows:
1. GEDI GWAS findings for smoking and alcohol. For each SNP showing significant association at the genome-wide level (q-value < 0.1), we targeted the genomic region encompassing it (±25 kb) for sequencing, plus any genes that fall within this 50-kb window. For SNPs showing ‘potentially interesting’ associations (q-values 0.1–0.2), we chose only those that fell within 25 kb of a gene, based on the principle that potentially interesting associations are more likely to be real if they are in, or close to, a gene. These criteria led us to select 17 loci covering 3.5 Mb of genomic DNA sequence.
2. Genes identified through published GWAS meta-analyses of alcohol (Schumann et al., Reference Schumann, Coin, Lourdusamy, Charoen, Berger, Stacey and Ellioty2011) and smoking (Liu et al., Reference Liu, McRae, Nyholt, Medland, Wray, Brown and Macgregor2010a; Thorgeirsson et al., Reference Thorgeirsson, Gudbjartsson, Surakka, Vink, Amin, Geller and Stefansson2010), plus gene nominations by expert colleagues (17 loci covering 2.5 Mb).
3. All human alcohol and aldehyde dehydrogenases, the key enzymes involved in alcohol metabolism (Edenberg, Reference Edenberg2007; 28 loci covering 1.3 Mb).
4. Reward system genes, including dopaminergic (Di Chiara & Imperato, Reference Di Chiara and Imperato1988), opioid (Le Merrer et al., Reference Le Merrer, Becker, Befort and Kieffer2009), and cannabinoid (Solinas et al., Reference Solinas, Yasar and Goldberg2007) receptors and related metabolic genes (16 loci covering 1.09 Mb).
5. All remaining human nicotinic acetylcholine receptors not already selected (12 loci covering 0.42 Mb).
6. Additional priority genes close to GWAS hits (4 loci covering 0.35 Mb).
7. Prioritized candidate genes. We compiled three lists of candidate genes: the first based on previous associations in the literature with SUDs, the second comprising all known human genes involved in absorption, distribution, metabolism, and excretion (ADME) of drugs (www.pharmaadme.org), and the third comprising all human neuroactive ligand receptors from the KEGG database (Kanehisa & Goto, Reference Kanehisa and Goto2000; Kanehisa et al., Reference Kanehisa, Goto, Sato, Furumichi and Tanabe2011). We ranked genes by the number of times they co-occurred in the literature with the search terms ‘smoking’, ‘alcohol’, or ‘cannabis’, and filled the remainder of our targeted capture library with the top-ranked genes from this list (31 loci covering 1.1 Mb).
After removing overlap and collapsing neighboring genes into single loci, our selection encompassed 121 unique loci, covering a total of 10.2 Mb. However, human genomic DNA contains repetitive elements, including RNA and DNA transposons, which constitute approximately 45% of the human genome (Lander et al., Reference Lander, Linton, Birren, Nusbaum, Zody and Baldwin2001). These contribute no useful sequence information, because they are difficult to align to unique positions. After elimination of these repetitive elements, our final library encompassed approximately 5.5 Mb. We are currently sequencing this library in 1,000 individuals selected from the VTSABD and CHDS.
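The windowing and collapsing steps described above can be illustrated with a short sketch. It assumes that ‘collapsing’ means merging intervals that overlap or touch on the same chromosome; the actual criteria used for neighboring genes are not specified here, and the function names and coordinates are hypothetical.

```python
def window(position, flank=25_000):
    """Return the (start, end) interval of +/- flank bp around a SNP position."""
    return (max(0, position - flank), position + flank)

def merge_loci(intervals):
    """Collapse overlapping or touching (chrom, start, end) intervals into loci."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and overlapping the previous locus: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def total_length(loci):
    """Total number of base pairs covered by a list of non-overlapping loci."""
    return sum(end - start for _, start, end in loci)

# Hypothetical SNP positions, for illustration only.
snps = [("chr1", 100_000), ("chr1", 130_000), ("chr2", 500_000)]
targets = [(chrom, *window(pos)) for chrom, pos in snps]
loci = merge_loci(targets)
print(loci)
print(total_length(loci))
```

In this toy example the two chr1 windows overlap and are collapsed into a single 80-kb locus, so three SNPs yield two loci.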
As noted earlier, there is an abundance of literature recommending that genomics move in the direction of genome-wide gene–environment–development analyses, but very few data – in fact, we have found no empirical ‘developmental GEWIS’ (Khoury & Wacholder, Reference Khoury and Wacholder2009) studies so far. The few developmental gene–environment studies published have used a candidate gene approach (e.g., Adkins et al., Reference Adkins, Daw, McClay and van den Oord2012; Casey et al., Reference Casey, Glatt, Tottenham, Soliman, Bath, Amso and Lee2009; Cole et al., Reference Cole, Arevalo, Manu, Telzer, Kiang, Bower and Fuligni2011), and carry many of the limitations that have long plagued these studies (Sullivan et al., Reference Sullivan, Eaves, Kendler and Neale2001).
Thus, we undertook GEDI partly as a proof of concept of the feasibility of a developmental GEWIS. Despite the lack of empirical data, most discussions in the literature take a gloomy view of the feasibility of developmental GEWIS, concentrating on the large sample sizes needed and the unreliability of measures of the envirotype (Thomas, Reference Thomas2010). We acknowledge these problems unreservedly. On the other hand, there are other aspects of the situation that should be considered. First, estimates of sample sizes tend to be based on experience with early genetic studies that have collected cases and controls using what are often very unreliable data: lifetime psychiatric histories, or clinical diagnoses from hundreds or thousands of different clinicians. These methods result not only in false positives but also in considerable numbers of false negatives (cases classified as non-cases because subjects have forgotten past episodes of illness). Both of these types of error inflate the sample size needed. In their article on estimating the size of gene–environment interactions in the presence of measurement error (Wong et al., Reference Wong, Day, Luan and Wareham2004), and related articles (Luan et al., Reference Luan, Wong, Day and Wareham2001; Wong et al., Reference Wong, Day, Luan, Chan and Wareham2003), Wong and colleagues (Reference Wong, Day, Luan and Wareham2004) pointed out that accuracy in measuring all the related elements – genotype, phenotype, and exposure – critically affects the sample size needed for a given power. For example, ‘the difference between unreliable (correlation with true score = 0.4) and reliable (r = 0.7) measurements corresponds to a 20-fold difference in sample size’ (Moffitt et al., Reference Moffitt, Caspi and Rutter2005, p. 476). Furthermore, Wong et al. (Reference Wong, Day, Luan and Wareham2003, p. 54) noted that ‘improving the measurement can be achieved by taking repeated measurements’; thus, the longitudinal studies preferred for developmental analyses will also increase their power to test hypotheses.
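The gain from repeated measurement can be illustrated with the Spearman-Brown formula from classical test theory. This sketch is not the calculation used by Wong and colleagues. It assumes parallel measurements with independent errors, and it takes reliability to be the squared correlation between a measure and its true score; the function name is hypothetical.

```python
import math

def reliability_of_mean(single_reliability, n_measurements):
    """Spearman-Brown: reliability of the mean of k parallel measurements."""
    r, k = single_reliability, n_measurements
    return k * r / (1 + (k - 1) * r)

# A single measure correlating 0.4 with the true score has reliability 0.4 ** 2.
single = 0.4 ** 2
for k in (1, 2, 4, 8):
    pooled = reliability_of_mean(single, k)
    # The square root converts reliability back to a correlation with true score.
    print(k, round(pooled, 3), round(math.sqrt(pooled), 3))
```

Under these assumptions, averaging four assessments raises the correlation with the true score from 0.4 to roughly 0.66, close to the ‘reliable’ value of 0.7 in the quotation above.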
Second, the GEDI consortium includes only data sets with longitudinal, repeated assessments of subjects taken across adolescence and young adulthood. We can expect subjects to be less vulnerable to either false remembering or false forgetting than those in studies using lifetime retrospective data. The use of repeated assessments also means that even a relatively small number of subjects yields a large number of person-observations (e.g., the 1,420 GSMS subjects yielded 9,858 person-observations by age 26). Even after controlling for non-independence of observations, this approach substantially increases the effective sample size and therefore power. Third, the studies used reliable assessments of symptoms and diagnoses created using a single taxonomy (DSM-IV) and highly structured diagnostic algorithms. Fourth, the studies used reasonably similar measures of key environmental risk factors.
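A rough sense of how non-independence discounts person-observations can be obtained from the Kish design effect. This is a back-of-the-envelope sketch, not the adjustment used in the GEDI analyses: it assumes equal numbers of observations per subject, and the intraclass correlations shown are hypothetical values chosen for illustration.

```python
def effective_sample_size(n_observations, n_subjects, icc):
    """Kish approximation: n_obs / (1 + (m - 1) * icc), m = mean obs per subject."""
    m = n_observations / n_subjects
    design_effect = 1 + (m - 1) * icc
    return n_observations / design_effect

# GSMS counts from the text; the ICC values are hypothetical.
for icc in (0.0, 0.25, 0.5, 1.0):
    print(icc, round(effective_sample_size(9_858, 1_420, icc)))
```

At an intraclass correlation of 0 the effective sample size equals the number of person-observations, and at 1 it falls back to the number of subjects; any intermediate value gives an effective sample size above the subject count.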
The results of our first analyses (Copeland et al., Reference Copeland, Gottfredson, Adkins, Angold, Clark, Erkanli and Costello2012) appear to provide tentative empirical evidence that the combination of prospective, longitudinal assessment and careful attention to data harmonization can, to some extent, compensate for modest sample sizes. However, regardless of these benefits, there remains an acute need to build a broader and more inclusive consortium of qualifying longitudinal data sets. Our strongest recommendation from this experiment is for the creation of an international ‘developmental dbGaP’ of such data sets to optimize power in future investigations of environmentally and developmentally contingent genetic effects on behavioral outcomes.
Work on this project was carried out with support from NIDA (R01DA024413). Dr Adkins was supported by K01MH093731, and Dr Copeland by K23MH080230. We are grateful to all the study participants who contributed to this work.