Genome-wide association studies (GWAS) and family-based linkage studies have both been widely used to map genes causing variation in complex or quantitative traits. The two approaches have a similar aim, and so it is surprising that their results have been subjected to little systematic comparison, particularly with regard to the size of estimated effects. Both approaches use genetic markers to discover loci but differ in their experimental design. Linkage analysis relies on segregation of alleles within families, whereas association analysis correlates markers with phenotypes across a population. Some studies compare the methods, but these primarily aim to identify influential loci, and sometimes only a selected portion of the genome is investigated (McKenzie et al., 2001; Daetwyler et al., 2008). The equivalence between the estimated effects of loci from the two methods has rarely been explored. When several linkage studies are compared, results are inconsistent (Altmüller et al., 2001), implying false-positive results, systematic differences (such as different alleles segregating in different families) or lack of statistical power (false-negative results). This paper compares linkage and GWAS and shows that the results are in agreement, provided the differences between the methods are taken into consideration.
A key difference between linkage and association mapping is the precision with which they map the location of quantitative trait loci (QTLs). A linkage analysis uses recombination events only within the recorded pedigree and so the confidence interval for the position of the QTL is typically large (Darvasi et al., 1993). In contrast, GWAS rely on linkage disequilibrium (LD) between QTLs and markers to detect polymorphisms. As LD extends only for a short distance (i.e. <80 kb in humans; Clark et al., 2003), the confidence interval for the position of the QTL is generally smaller for a GWAS than for a linkage analysis. Thus, although some GWAS find a QTL in the same region as linkage studies, linkage studies have found QTLs on most chromosomes for extensively studied traits, and regions identified with linkage tend to extend for long distances (Altmüller et al., 2001).
Both GWAS and linkage studies suffer from two deficiencies when carried out using standard procedures. First, the estimated effect sizes of significant QTLs are overestimated (e.g. Beavis, 1998; Goring et al., 2001; Xu, 2003b; Zöllner & Pritchard, 2007; Goddard et al., 2009; Sun et al., 2011; Xiao & Boehnke, 2011). This arises because a single dataset is used for both discovery and parameter estimation, causing a correlation between the test statistic and the estimated effect size of alleles (Goring et al., 2001). Verification of locus effects in an independent population can avoid this bias, provided that the validation results are not conditioned on statistical tests (Goring et al., 2001). Alternatively, Goddard et al. (2009) argue that this bias can be overcome by fitting the effect of a single nucleotide polymorphism (SNP) or chromosome position as a random effect. If the mean of the posterior distribution of effect size for the estimate is $\hat{b}$, then the expectation of the true effect (b) has the desirable property of being the mean of the estimates, i.e. $E(b \mid \hat{b}) = \hat{b}$ (Goddard et al., 2009). This is not the conventional definition of unbiased, but it leads to desirable properties. For instance, if the most significant effects are re-estimated in an independent dataset, then, on average, their effects will not change.
The second problem with both GWAS and linkage analyses as usually practiced is that the effect of one position is estimated ignoring all other positions. In a GWAS, for example, each SNP is tested independently for an association with the trait. Consequently, many nearby SNPs may have significant effects because they are all in LD with the same QTL. Alternatively, a significant SNP may be near several possible causal polymorphisms (e.g. Barrett et al., 2008). This can cause confusion about the number, location and effect size of the QTLs that have been detected. One approach to partially overcome this problem in a GWAS is to fit all positions simultaneously as random effects (Meuwissen et al., 2001), so that the effect of an SNP is estimated conditional on the effect of all other positions.
Multiple QTLs also cause confusion for results from linkage analyses. The simplest interpretation of a significant peak in the likelihood of a linkage analysis is that there is a single QTL near the peak. However, if more than one QTL contributes to the linkage signal (Haley & Knott, 1992; Martínez & Curnow, 1992), this can lead to a wrong conclusion being drawn and possibly a futile attempt to fine map a single locus (i.e. a so-called ‘ghost’ QTL). The effect estimated in a linkage analysis is actually the combined effect of all QTLs on the chromosome after accounting for recombination between QTLs and the position being tested. By design, there is strong linkage between adjacent positions in a linkage analysis and, if there are many QTLs, it is impossible to distinguish between adjacent loci because of inadequate recombination. If the effect of all QTLs detected in a GWAS could be combined along a chromosome, allowing for recombination between the position being tested and all other positions, then this effect should be the same as that estimated by a linkage analysis. Yang et al. (2010) indicate that common SNP markers capture approximately half of the genetic variance for human height. This could cause a discrepancy between linkage analysis and GWAS as imperfect LD would affect the association analysis but not linkage results.
Studies with domesticated species indicate that markers generally capture a higher proportion of the genetic variance (Daetwyler, 2009; Boyko et al., 2010; Aitman et al., 2011; Haile-Mariam et al., 2012), suggesting that this discrepancy should be minimized using a livestock population.
This study tests the hypothesis that effects estimated from a GWAS and from a linkage analysis agree, provided both are estimated appropriately as random effects and that SNPs are fitted simultaneously in both analyses. To test the hypothesis, we needed to conduct a linkage analysis and a GWAS in the same population. We used a sheep population with large half-sib families because this design maximizes power for the linkage analysis and, with appropriate methods, the impact of family structure in the GWAS can be minimized (MacLeod et al., 2010). Our approach first demonstrates the consequence of treating the marker effects as random and of fitting all markers simultaneously. Then we show how the effects observed in the linkage analysis can be predicted by combining the effects estimated from the GWAS and allowing for recombination along a chromosome.
2. Materials and methods
Genotypes and phenotypes were obtained for 1971 Merino sheep from 12 half-sib families from the SheepGenomics project (White et al., 2012). The average family size was 164 animals (range: 68–349). Genotypes consisted of 48 640 SNPs from the Illumina Ovine SNP50 BeadChip, which were quality checked and missing genotypes imputed (see Kemper et al., 2011). The trait analysed was eye muscle depth (mm) corrected for body weight, measured by ultrasound scanning at approximately 10 months of age. This trait was chosen because many records were available and the trait has an approximately normal distribution. Heritability estimates for eye muscle depth range between 0·22 (±0·04) and 0·33 (±0·03) (Safari et al., 2005; Huisman & Brown, 2009; Mortimer et al., 2010). Full details of the data collection and procedures can be found in White et al. (2012). Genotypes for the 48 640 SNPs were available for nine sires, while the genotypes for the remaining three sires were imputed using a rules-based approach from the progeny genotypes and ChromoPhase (Daetwyler et al., 2011). Calculations of LD between pairs of markers ($r^2$) were made using the correlation of genotypes.
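As a minimal sketch of the LD calculation described above, the following computes r² between two markers as the squared Pearson correlation of their genotypes. The 0/1/2 allele-count coding, the function name `ld_r2` and the toy genotype vectors are illustrative assumptions and are not taken from the study.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are assumed (for illustration) to be coded as 0, 1 or 2
    copies of an allele, one entry per animal.
    """
    if len(geno_a) != len(geno_b) or len(geno_a) < 2:
        raise ValueError("genotype vectors must have equal length >= 2")
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # r^2 is undefined for a monomorphic marker
    return cov * cov / (var_a * var_b)


# Toy genotypes for six animals (hypothetical values).
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [2, 1, 0, 1, 2, 0]  # same marker with the alleles swapped
snp3 = [0, 0, 2, 1, 1, 2]  # partially correlated marker

r2_swapped = ld_r2(snp1, snp2)
r2_partial = ld_r2(snp1, snp3)
```

Because r² is a squared correlation, it does not depend on which allele is counted, which is why `snp1` and `snp2` are in complete LD in this toy example.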
(ii) Assigning inheritance of the paternal alleles
Alleles for sires and their progeny were phased into paternal and maternal haplotypes using ChromoPhase (Daetwyler et al., 2011). Although the sire genotypes were phased, there is no information on which haplotype is paternal or maternal, and so they are referred to below as the first and second chromosome of a sire, where the designation of first and second is arbitrary. The paternal alleles of each offspring were assigned to either the first or second chromosome of their sire based on runs of successive alleles that matched one of the two chromosomes of their sire. The algorithm allowed up to one mismatch per section to account for genotyping and map errors. Unassigned SNPs were treated as missing data. Further details of the algorithm are provided in Part A of the supplementary materials (available at http://journals.cambridge.org/grh).
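The idea of assigning paternal alleles to a sire chromosome can be illustrated with a deliberately simplified sketch. This is not the algorithm of Part A of the supplementary materials: the function name, the per-SNP matching rule and the way a single isolated mismatch is tolerated are all assumptions made for illustration.

```python
def assign_paternal_origin(sire_hap1, sire_hap2, paternal_alleles):
    """Label each SNP of an offspring's paternal haplotype as coming from
    the sire's first (1) or second (2) chromosome, or None if unassigned.

    Hypothetical simplification: a SNP is informative only where the two
    sire chromosomes differ; an isolated label that disagrees with both
    informative neighbours is treated as a genotyping/map error.
    """
    origin = []
    for a1, a2, p in zip(sire_hap1, sire_hap2, paternal_alleles):
        if a1 == a2 or p is None:
            origin.append(None)   # uninformative or missing: unassigned
        elif p == a1:
            origin.append(1)
        else:
            origin.append(2)

    # Tolerate a single mismatch inside a run of matching alleles.
    informative = [i for i, o in enumerate(origin) if o is not None]
    for k in range(1, len(informative) - 1):
        left = origin[informative[k - 1]]
        right = origin[informative[k + 1]]
        here = informative[k]
        if left == right and origin[here] != left:
            origin[here] = left
    return origin


# Toy example: the third SNP mismatches an otherwise unbroken run on the
# first chromosome; the fifth SNP is homozygous in the sire.
sire_first = [0, 1, 0, 1, 1, 0]
sire_second = [1, 0, 1, 0, 1, 1]
offspring_paternal = [0, 1, 1, 1, 1, 0]
labels = assign_paternal_origin(sire_first, sire_second, offspring_paternal)
```

In this sketch, positions left as `None` correspond to the unassigned SNPs that the text treats as missing data.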
(iii) Within-family linkage analysis – fixed effect model
A fixed effects model was fitted sequentially for all SNP positions. The model was

$$y = Xb + Zv + W\alpha + e \qquad (1)$$
where y is a vector of phenotypes, X is a design matrix assigning progeny to fixed effects (including covariates), b is a vector of fixed effect solutions, Z is a design matrix allocating phenotypes to sires, v is a vector of sire solutions, W is an incidence matrix assigning progeny to groups according to the allele inherited from their sire, α is a vector of effects contrasting each sire's first and second chromosome and e is a vector of residuals distributed $N(0, I\sigma^2_e)$. Fixed effects in b were year of birth (two levels), a regression coefficient for age (in days, mean age 304 days), birth and rearing type (three levels), sex nested within year (four levels) and four regression coefficients for the first four principal components from the genomic relationship matrix (Yang et al., 2010). Principal components were fitted as covariates to account for population structure within the maternal haplotypes as maternal pedigree was unknown (Patterson et al., 2006). Thus, the estimate of the effect of the sire's allele ($\hat{\alpha}$) is

$$\hat{\alpha} = (W'W)^{-1}W'(y - X\hat{b} - Z\hat{v}) \qquad (2)$$

where $\hat{b}$ and $\hat{v}$ are the estimates for the fixed effects and sire solutions. The false discovery rate was calculated as (1−s)p/[s(1−p)] (Storey, 2002; Bolormaa et al., 2011), where s and p are the realized and expected proportions of significant SNPs.
(iv) Within-family linkage analysis – random effect model
The model is similar to the fixed effect analysis (i.e. (1)) except that α is treated as a vector of random effects distributed $\alpha \sim N(0, I\sigma^2_{\mathrm{sire.snp}})$, where I is an identity matrix and $\sigma^2_{\mathrm{sire.snp}}$ is the sire segregation variance. That is, $\sigma^2_{\mathrm{sire.snp}}$ is the variance in the trait attributed to the segregation of alleles within sire families, estimated across all families. To estimate this variance, we took the average variance component estimated using restricted maximum likelihood over all positions with ASReml (Gilmour et al., 2006). To avoid an upward bias imposed by the default settings in ASReml, both positive and negative estimates of $\sigma^2_{\mathrm{sire.snp}}$ were permitted. This variance component was then fixed and used to calculate the allele effect at each position for each sire. The solutions vector, from Henderson's mixed model equations (Henderson, 1950; Mrode, 2005), was

$$\hat{\alpha} = \left(W'W + I\,\frac{\sigma^2_{\mathrm{error}}}{\sigma^2_{\mathrm{sire.snp}}}\right)^{-1}W'(y - X\hat{b} - Z\hat{v}) \qquad (3)$$
where terms are as described in (1), and $\sigma^2_{\mathrm{error}}$ is the residual variance. This was computed with ASReml for all positions. An alternative cross-validation method to estimate the sire segregation variance relative to the error variance, and therefore the degree of overestimation in the fixed effect model, is given in Part B of the supplementary materials (available at http://journals.cambridge.org/grh).
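The difference between the fixed and random treatment of the allele contrast can be sketched for a single sire at a single position. In the sketch below, the least-squares contrast is obtained with no penalty, and the random-effect solution adds the ratio of the residual variance to the sire segregation variance to the diagonal, which shrinks the estimate towards zero. The ±0·5 coding of the inherited chromosome, the function name and all numerical values (including the variance ratio) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def allele_contrast(corrected_phenotypes, inherited_chromosome, variance_ratio=0.0):
    """Contrast between a sire's first and second chromosome at one position.

    corrected_phenotypes: phenotypes already adjusted for fixed effects
        and the sire mean.
    inherited_chromosome: 1 or 2 for each progeny (which sire chromosome
        was inherited at this position).
    variance_ratio: residual variance / sire segregation variance.
        0.0 reproduces the least-squares (fixed effect) estimate.
    """
    # Assumed coding: +0.5 for the first chromosome, -0.5 for the second,
    # so that with balanced groups the least-squares solution equals the
    # difference between the two group means.
    w = [0.5 if c == 1 else -0.5 for c in inherited_chromosome]
    wty = sum(wi * yi for wi, yi in zip(w, corrected_phenotypes))
    wtw = sum(wi * wi for wi in w)
    return wty / (wtw + variance_ratio)


# Six progeny of one sire (hypothetical corrected phenotypes, in mm).
y_corrected = [0.4, 0.6, 0.5, -0.5, -0.4, -0.6]
chromosome = [1, 1, 1, 2, 2, 2]

fixed_estimate = allele_contrast(y_corrected, chromosome)
random_estimate = allele_contrast(y_corrected, chromosome, variance_ratio=13.5)
```

With so few progeny the information in the data (W′W = 1·5 here) is small relative to the assumed variance ratio, so the random-effect estimate is a small fraction of the fixed-effect estimate. The more progeny that inherit each allele, the less the estimate is shrunk.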
(v) Association analysis – fixed effect model
A regression of phenotype on allele dosage was made at each SNP position. That is, the SNP marker effect was fitted as fixed following a conventional linkage analysis. The model was

$$y = Xb + Zv' + T\gamma + e \qquad (4)$$

where X, Z and e were as defined in (1), v′ is a vector of random sire effects [distributed $N(0, I\sigma^2_{\mathrm{sire}})$], T is a vector assigning progeny to their SNP genotype (i.e. 0, 1 or 2 copies of an SNP allele) and γ is the SNP allele effect (a scalar). The solution for $\hat{\gamma}$ was estimated using ASReml (Gilmour et al., 2006), where the sire variance ($\sigma^2_{\mathrm{sire}}$) was estimated at each position.
(vi) Association analysis – simultaneous effect of all SNPs with random SNP effects
Simultaneous estimates of all SNP effects were obtained using the Bayesian approach (BayesA) of Meuwissen et al. (2001). The model is

$$y' = Zv' + T\gamma + e \qquad (5)$$

where T, Z, v′ and e are as defined above (4), y′ is a vector of phenotypes corrected for fixed effects (i.e. $y' = y - X\hat{b}$, as described in (1)) and γ is a vector of marker effects assumed to be $N(0, I\sigma^2_{\gamma_i})$, where $\sigma^2_{\gamma_i}$ is the variance for the ith SNP. This method assumes that allele effects (γ) come from a t-distribution with 4·012 df following Meuwissen et al. (2001). This model, in contrast to (4), directly accounts for the LD between nearby markers, the overestimation bias in the marker effects and, by extrapolation of Kang et al. (2010) and Yang et al. (2011), spurious results due to population stratification. Fitting all SNPs simultaneously indirectly accounts for population stratification because SNP effects are estimated conditional on all other SNPs, thereby eliminating spurious associations (e.g. associations caused by SNPs in LD with QTLs on different chromosomes). SNP allele effects were estimated as the posterior mean of 10 replicates of a Gibbs chain with 30 000 iterations, with 5000 iterations discarded in each replicate as burn-in.
(vii) Predicting linkage results from the association analysis
The estimates of SNP effects from (5) were used to predict the linkage effects at each position. The predicted effect at position j for sire k ($\eta_{j,k}$) was calculated as

$$\eta_{j,k} = \sum_{i=1}^{M} \hat{\gamma}_i \, p_{i,j} \, (x_{i,k,1} - x_{i,k,2}) \qquad (6)$$

where $\hat{\gamma}_i$ is the estimate of the SNP allele effect at position i, $p_{i,j}$ is the probability of co-inheritance of positions i and j, $x_{i,k,1}$ and $x_{i,k,2}$ are sire k's alleles at position i (i.e. 0 or 1) for the first and second chromosomes and M is the total number of SNP positions on the chromosome. Thus, (6) is the difference between the sum of allele effects for the first and second chromosome at each position, where the sum of allele effects on each chromosome accounts for the probability of recombination events along the chromosome. The probability of co-inheritance of positions i and j was calculated as $p_{i,j} = 1 - 2c_{i,j}$, where $c_{i,j}$ was the recombination fraction from Haldane's mapping function (Haldane, 1919), i.e. $c_{i,j} = \tfrac{1}{2}[1 - \exp(-2m)]$, where m is the distance (in Morgans) between i and j, assuming 1 cM = 1 Mbp (Botstein et al., 1980, citing Renwick, 1969). The regression coefficient of the observed effect on the predicted linkage effect will be one if (1) the association analysis captures all of the genetic information in the linkage analysis, (2) SNP allele effects are additive and (3) Haldane's mapping function is an accurate predictor of recombination.
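The prediction described above can be sketched in a few lines: each SNP effect is weighted by the probability that it is co-inherited with the tested position (from Haldane's mapping function, with 1 cM assumed equal to 1 Mbp as in the text) and by the difference between the alleles on the sire's two chromosomes. The function names, the three-SNP chromosome and the effect sizes are hypothetical and serve only to illustrate the calculation.

```python
from math import exp


def coinheritance(pos_i_mbp, pos_j_mbp, cm_per_mbp=1.0):
    """Probability of co-inheritance p = 1 - 2c, with c from Haldane's
    mapping function and map distance derived from physical distance."""
    morgans = abs(pos_i_mbp - pos_j_mbp) * cm_per_mbp / 100.0
    recombination_fraction = 0.5 * (1.0 - exp(-2.0 * morgans))
    return 1.0 - 2.0 * recombination_fraction


def predicted_linkage_effect(j, positions_mbp, snp_effects, hap1, hap2):
    """Predicted contrast between a sire's first and second chromosome at
    position index j, summing over all SNPs on the chromosome."""
    return sum(
        effect * coinheritance(pos, positions_mbp[j]) * (x1 - x2)
        for pos, effect, x1, x2 in zip(positions_mbp, snp_effects, hap1, hap2)
    )


# Hypothetical chromosome with three SNPs.
positions = [10.0, 20.0, 60.0]       # Mbp
effects = [0.004, -0.002, 0.003]     # SNP allele effects (mm)
first_chromosome = [1, 0, 1]         # sire's alleles, first chromosome
second_chromosome = [0, 1, 1]        # sire's alleles, second chromosome

eta_at_first_snp = predicted_linkage_effect(
    0, positions, effects, first_chromosome, second_chromosome
)
```

In this toy example the third SNP contributes nothing because the sire is homozygous there, and the second SNP contributes with a weight of exp(−0·2) because it lies 10 Mbp (0·1 Morgans) from the tested position.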
(viii) Predicting linkage results from the association analysis with independent data
The data from the association analysis used to predict the linkage effects in (5) are not independent of the data used in the linkage analysis. This is because the segregating alleles from the linkage analysis in the 12 sires also contribute to the association analysis. To achieve complete independence between the association and linkage analyses, it was necessary to exclude, in turn, the offspring of each sire from the association analysis. That is, model (5) was run 12 times. SNP marker effects were then used to predict the linkage results using (6) for the sire excluded from the association analysis. For comparison, an analysis that predicts the between-sire differences using marker effects estimated with data from all sires and excluding the sire to be predicted (i.e. independent data) is described in Part C of the supplementary materials (available at http://journals.cambridge.org/grh).
3. Results
(i) Tracking the paternal alleles
Paternal alleles were assigned to either the 1st or 2nd chromosome of the sire at 92·1% of positions (range per sire: 81·5–95·8%), excluding uninformative positions (Supplementary Fig. S1, available at http://journals.cambridge.org/grh). There was an average of 7·2% unassigned progeny per SNP per sire.
(ii) Linkage analysis and GWAS using conventional methods
Using the conventional fixed effect linkage analysis (1), 3109 positions were identified as significant on 15 of 26 chromosomes at a false discovery rate of 14·8% (P<0·01, Fig. 1). When these significant SNPs were tested using the genome-wide association analysis (4), 132 SNPs were identified as significant with a false discovery rate of 22·8% (P<0·01, SNP details in Supplementary Table S1, available at http://journals.cambridge.org/grh). The false discovery rates suggest many true discoveries, although closer inspection below reveals some ambiguity about the QTLs underlying our trait.
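The reported false discovery rates can be reproduced from the counts above with the formula (1−s)p/[s(1−p)] given in the methods. The sketch below assumes that the linkage analysis tested all 48 640 SNP positions and that the association analysis tested the 3109 positions significant in the linkage analysis. These denominators are our reading of the text rather than values stated explicitly, and the function name is illustrative.

```python
def false_discovery_rate(n_significant, n_tested, p_threshold):
    """FDR = (1 - s) p / [s (1 - p)], where s is the realized proportion
    of significant tests and p is the expected proportion (threshold)."""
    s = n_significant / n_tested
    return (1.0 - s) * p_threshold / (s * (1.0 - p_threshold))


# Linkage analysis: 3109 significant positions, assumed out of 48 640.
fdr_linkage = false_discovery_rate(3109, 48640, 0.01)

# Association analysis: 132 significant SNPs, assumed out of the 3109
# positions carried forward from the linkage analysis.
fdr_association = false_discovery_rate(132, 3109, 0.01)

print(f"{fdr_linkage:.3f} {fdr_association:.3f}")  # prints "0.148 0.228"
```

Under these assumptions the calculation returns 14·8% and 22·8%, matching the values reported in the text.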
Doubts over the results from the conventional analysis arise because some chromosomes suggest discrete QTLs, while for other chromosomes the results are inconsistent. For example, consider chromosomes 3 and 6 (Fig. 2). Chromosome 3 presents seemingly reliable answers, where the 43 positions significant in both analyses appear to cluster near two likely QTLs, one at approximately 30 Mbp and another at 50 Mbp. The effect of the SNP with the highest significance from the association analysis at about 50 Mbp is −0·39 (±0·08) mm and the estimated (absolute) effect ranges from 0·01 (±0·27) to 0·71 (±0·38) mm for the linkage analysis. However, chromosome 6 shows a strong linkage signal from 80 Mbp onwards and 21 SNPs significant in both the linkage and association analyses over a wide region. It is not clear which, if any, of these SNPs are associated with the linkage peak. The linkage analysis suggests possibly three QTLs, while the SNPs also significant in the association analysis suggest four or more QTLs. Also contradictory are the several significant SNPs at about 40 Mbp, which do not have any corresponding linkage signal. Using the two approaches in this form, it is difficult to ascertain which analysis is more reliable, which effects are due to experimental noise, how many QTLs exist and what is the best estimate of the position of each QTL.
(iii) Linkage analysis – impact of the random effects model
The mean maximum likelihood estimate for $\sigma^2_{\mathrm{sire.snp}}$ from all positions was 0·013, and thus the average proportion of phenotypic variance explained by the paternally inherited allele was 0·0037 (i.e. $\sigma^2_{\mathrm{sire.snp}}/\sigma^2_{\mathrm{phen}}$ = 0·013/3·15). Although the likelihood failed to converge at 5407 (11·1% of all) positions, a subsequent restricted (positive definite) maximum likelihood analysis at these positions showed an almost zero variance attributed to $\sigma^2_{\mathrm{sire.snp}}$. This method overestimates the average proportion of phenotypic variance explained by markers because the sum for all markers is much greater than the genetic variance of the trait (i.e. the expected genetic variance is approximately 0·3 $\sigma^2_{\mathrm{phen}}$, but 0·0037 $\sigma^2_{\mathrm{phen}}$ per SNP × 48 640 SNPs ≫ 0·3 $\sigma^2_{\mathrm{phen}}$). The overestimation occurs because of the strong LD between markers in the linkage analysis.
Comparison of the fixed and random effects models for SNP alleles from the linkage analysis (i.e. models (2) and (3)) shows broad agreement for most sires at most positions (Fig. 3a). The regression indicates that the random effects analysis explains 91% of the variation in the fixed effect analysis but that the fixed effect model is estimating the size of the allele effect to be about ten times greater than the random effect model. Adjacent allele effects for a sire are correlated in Fig. 3a (i.e. adjacent SNP positions have correlated effects and form lines in the plot) and this correlation between positions is maintained by the random model. Notably there are several SNP positions with large effects estimated by the fixed model (>±2 mm) for which the random model estimates an effect near zero. This severe regression by the random effects model suggests that there was little support for the large effect estimated by the fixed model. These positions are probably regions where poor tracking of the paternal allele occurred, and consequently, there were few progeny who were recorded to inherit each of the sire's alleles.
(iv) Association study – impact of the random effects model
The regression of the association allele effects from the fixed and random models (i.e. (4) and (5)) shows that the fixed model estimates the effect of alleles to be almost 100 times larger than the random model (Fig. 3b). Similar to the linkage analysis, many SNP alleles estimated with large effects (>±1 mm) from the fixed model were regressed to almost zero using the simultaneous method (Fig. 3b). This occurs because of unreliable estimates of effects from the fixed effect model. For example, of the 23 markers with large effects (>±1 mm) from the fixed effect model and very small effects (<0·001 mm) in the random model, 20 (87%) were not significant (P>0·05). The remaining three markers may represent spurious results from the standard GWAS, presumably caused by LD with other SNPs.
The regression of the fixed effect solutions on the random effects solutions also explains a lower amount of variation compared with the linkage analysis (i.e. $R^2$=0·91 vs. $R^2$=0·58, Fig. 3). The differences between the models and the lower proportion of variance explained by the random effect model are partially due to overestimation of the effects when they are fitted one at a time as fixed effects and partially because the BayesA analysis may spread the effect of each QTL over several adjacent SNPs. For example, Fig. 4 compares the fixed and BayesA analyses over a 20 Mbp region on chromosome 6 where there appears to be a strong QTL signal at around 42 Mbp. The random effects analysis maps this effect to a location slightly further along the chromosome (41·5 Mbp) compared with the fixed effect analysis (40·8 Mbp), but it also shows the spread of QTL effects for SNPs in modest LD ($r^2$>0·5) with this SNP in the region. Further, from the random effects model, it is clearer that there are possibly three QTLs at 30·7, 45·0 and 50·6 Mbp for markers which are not in strong LD with the SNP at 41·5 Mbp. A further SNP at 42·1 Mbp may be associated with the same QTL tracked by the SNP at 41·5 Mbp, or this association could indicate a fourth QTL.
(v) Predicting the linkage results from the association study
Despite the correction for bias in the linkage and association analyses, the magnitude of the association effects is still on the order of 100 times smaller than that estimated from the linkage analysis (Fig. 3). A prediction of the linkage results from the association analysis needs to account for the stronger LD between adjacent positions in the linkage analysis. Using the linkage results from the random model (i.e. (3)), the prediction was the contrast between sire chromosomes for the sum of the association effects accounting for recombination (i.e. models (5) and (6)). For individual sires, the expectation of the linkage effects shows good agreement with the linkage results (Fig. 5, Supplementary Fig. S2, available at http://journals.cambridge.org/grh). To compare the effects across all sires and at all positions, we plotted the estimate from the linkage analysis against that predicted from the association study (Fig. 6a). The regression slope is almost one (slope: 0·975 ± 1·2×10⁻³, intercept: 3·7×10⁻³ ± 6·9×10⁻⁵) and the regression accounts for about half of the variation in the linkage results ($R^2$=0·523). Considering the sampling errors in both estimates, this suggests that the association analysis is capturing the majority of within-family information. There were no data points which showed a notable deviation from the regression slope (Supplementary Fig. S3).
(vi) Predicting the linkage results with independent data
There was a high correlation between the SNP effects estimated with all animals and those estimated excluding progeny from each sire using the random effects model (average $R^2$=0·91, range: 0·85–0·93). However, these analyses predicted the linkage effects for the excluded sire very inaccurately (Fig. 6b, $R^2$=0·002). This contrasts sharply with results when the sire to be predicted is included in the analysis (Fig. 6a). Thus, the sire whose linkage analysis is to be predicted must be included in the association analysis to achieve good agreement between the two approaches. Predictive ability with independent data is slightly improved when predicting between-sire differences ($R^2$=0·04, Part C of the supplementary material, available at http://journals.cambridge.org/grh).
4. Discussion
This study suggests two reasons why there is often little agreement between linkage analysis and GWAS on the same complex trait. First, when the effects are estimated as fixed effects in statistical models, the most significant effects are often grossly overestimated. This is evident in our study for both the linkage and association analysis. Overestimation of fixed effects has been highlighted previously by several authors (e.g. Beavis, 1998) and contributes to the often smaller than expected or perhaps non-significant results for loci when replication is attempted. Naturally, this problem also occurs if one attempts to verify the results of a linkage analysis with a GWAS or vice versa. Our GWAS predicted the linkage results, provided both are estimated as random effects, SNPs are fitted simultaneously in the GWAS, and GWAS effects on a chromosome are combined to account for LD in the linkage analysis. The regression of the observed linkage effect on the effect predicted from the GWAS is close to 1·0, indicating an approximate agreement in size. The proportion of the variance in the linkage results explained by our prediction is high ($R^2$=0·52) considering that both sets of effects are estimated with error.
Second, this study shows that multiple linked QTLs can be the underlying cause of significant linkage results. In contrast to the simulation studies with multiple QTLs tracked by microsatellite markers (e.g. Haley & Knott, 1992), our results in real data suggest that likelihood peaks can be caused by the sum of many QTLs along a chromosome. We do not suggest that all linkage peaks are detecting multiple small QTLs because some studies have been successful in identifying important loci (e.g. Gusella et al., 1983; Tsui et al., 1985; Charlier et al., 1995; Coppieters et al., 1998). However, successful linkage studies involve polymorphisms of large effect and these loci probably overwhelm any interference in the signal caused by multiple linked loci. The effect of the linked loci could be to increase or decrease the apparent effect size of the major loci, depending on the phase of the interacting loci. Here, we demonstrate with real data that the additive effect of multiple loci in strong LD can cause apparent linkage signals. This conclusion is consistent with simulation and theoretical studies (e.g. Dekkers & Dentine, 1991; Visscher & Haley, 1996) and is also supported by mouse studies in which single QTLs fractionate into multiple smaller loci with fine mapping (Flint et al., 2005).
The influence of nearby linked loci cannot be excluded when using association rather than linkage analysis. Even in a conventional GWAS, fitting one SNP at a time, SNPs with significant effects may be influenced by multiple nearby QTLs, some in phase and some out of phase with the tested SNP. However, LD in GWAS probably has less influence than in linkage because LD usually extends for shorter distances, i.e. <1 Mbp in Merino sheep (Kemper et al., 2011). Hence, a large number of significant SNPs most likely indicates a large number of QTLs. This conclusion is made clearer by fitting all SNPs simultaneously. Then SNPs that have no marginal effect after fitting all other SNPs, including SNPs in strong LD with the causal polymorphisms, will show no association with the trait. Figure 4b shows a typical result where there are several positions along the chromosome associated with the trait of interest.
The high degree of agreement (R² = 0·52, regression coefficient ~1·0) between our observed and predicted linkage results is surprising. This consistency suggests that the association analysis is tracking the majority of the linkage information and that imperfect LD (between causal mutations and SNPs) is not a strong influence on the results from our association analysis. This is because the linkage analysis has strong LD within families and imperfect LD is not limiting as it can be in GWAS. Incomplete LD between common SNPs and causative mutations has been hypothesized to be responsible for ~50% of the genetic variation in human populations which is not explained by common SNPs (Yang et al., 2010). Here, we suggest that the importance of incomplete LD between SNPs and causative mutations is influenced strongly by genetic diversity. Our observation is supported by other studies with domestic species where common SNPs capture a high proportion of the genetic variance (e.g. Daetwyler, 2009; Boyko et al., 2010; Haile-Mariam et al., 2012). Thus, as the population's diversity, or effective population size (Nₑ), increases, the ability of common SNPs to capture the genetic variance decreases. Incomplete LD may occur when causative SNPs are at a lower frequency than the genotyped SNPs (Yang et al., 2010), suggesting an increased importance for these mutations in, for example, human compared with livestock populations.
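The two summary statistics quoted above can be computed as in the following minimal sketch, which regresses observed effects on predicted effects and reports the slope and the squared correlation. The four pairs of values are hypothetical and are not results from this study. A slope near 1 indicates that the predictions are on the right scale, while R² measures how much of the variation in the observed effects they explain.

```python
def slope_and_r2(predicted, observed):
    """Regress observed on predicted; return (slope, squared correlation)."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    syy = sum((o - mean_o) ** 2 for o in observed)
    sxy = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

# Hypothetical predicted (from association) and observed (from linkage) effects.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.5, 1.5, 3.5, 3.5]

slope, r2 = slope_and_r2(predicted, observed)
print(slope, r2)
```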
Extensive QTL mapping experiments in many species suggest that alleles with a large effect on quantitative traits are uncommon (e.g. Darvasi & Pisanté-Shalom, 2002). The results of the association analysis reported here suggest many QTLs for our trait but we found no evidence of large-effect QTLs in our sires. For instance, if most important genes had a variant with large effect, we might expect to see at least one sire with a large estimated effect from the linkage analysis and an inaccurate prediction of this effect from the GWAS. However, we never observed any allele from the linkage analysis which substantially differed from the effect predicted from the association analysis (Fig. 6). We sampled only 12 sires but we analysed each sire at thousands of positions. If most of the genetic variance was due to rare large-effect variants then we might expect to observe at least one heterozygous sire in our dataset. The situation of segregating alleles with large effect may occur but it cannot be typical because we predicted our linkage results from an association analysis with moderate accuracy. Further, all of our estimated effects from the association analysis were also very small (<0·008 mm or <0·008/√(3·15) = 0·004 S.D.).
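The conversion of the largest estimated effect from millimetres to standard deviation units can be checked with a short calculation. This sketch reads the quoted expression as 0·008 divided by the square root of 3·15, and treats 3·15 as the phenotypic variance in mm²; both readings are assumptions, and the variable names are hypothetical.

```python
import math

# Values quoted in the text; treating 3.15 as a variance in mm^2 is an assumption.
max_effect_mm = 0.008
phenotypic_variance = 3.15

def standardised_effect(effect, variance):
    """Express an effect in units of the phenotypic standard deviation."""
    return effect / math.sqrt(variance)

effect_sd = standardised_effect(max_effect_mm, phenotypic_variance)
print(f"{effect_sd:.4f} S.D.")
```

Under these assumptions the result is about 0·0045 S.D., in line with the value of 0·004 S.D. quoted in the text.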
Our results show that most of the linkage information was captured in the prediction from the GWAS results. However, the two approaches are not independent because they use the same data, and we also show that when the sire to be predicted is excluded from the association analysis we cannot predict the linkage results. This discrepancy could be explained by high sampling covariance between the effects estimated for SNPs in very strong LD with one another. Thus, a combination of SNP alleles must have been observed in the data for its effect to be predicted accurately. The between-sire differences, which are the sum of all SNP effects, were estimated more precisely using independent data. Prediction of between-sire differences is equivalent to genomic prediction which, given larger datasets, can reach moderate accuracies in sheep for this trait (Daetwyler et al., 2010). The dependency between SNPs when estimating effects of individual markers is not surprising considering that the magnitude of the largest effect was very small (0·004 S.D.) and given the relatively small size of the dataset.
These results suggest that the best analysis is the GWAS in which all SNPs are fitted simultaneously. This method gave us consistent results between linkage and association, and has greater power to discriminate linked QTLs than either the linkage analysis or the standard GWAS fitting one SNP at a time. This is clearly demonstrated in Fig. 4, where numerous GWAS results are consolidated into possibly four QTL signals at 30·7, 41·5, 45·0 and 50·6 Mbp. A potential drawback of this method is that effects may be split between closely linked markers (Xu, 2003a). In Fig. 4, this is potentially occurring for several markers in high LD with the largest estimated effect at 41·5 Mbp. These high LD markers may also be capturing multiple different mutations at the locus. However, the effect of this disadvantage should diminish as markers in higher LD with the causal mutations for traits are included in the SNP marker set.
In summary, this study aimed to reconcile some of the differences between linkage and linkage-disequilibrium mapping. We have demonstrated, using real data, how to correct for the biases in both linkage and association mapping. We show that multiple linked QTLs can combine to be the primary cause of significant linkage results. In our study, the association analysis captured 52% of the within-family information, which is high considering the sampling error of effects from both analyses. The results support the hypothesis that there are many loci of small effect underlying complex traits. We suggest that an improved method for GWAS is to fit statistical models in which all SNPs are analysed simultaneously. This method prevents spurious results caused by population structure and accounts for LD surrounding the analysed SNP.
We thank SheepGenomics and the CRC for Sheep Industry Innovation for the provision of the data for this work. This research was supported under the Australian Research Council's Discovery Projects funding scheme (project DP1093502).
5. Declaration of interest
6. Supplementary material
The online data are available at http://journals.cambridge.org/GRH