BBC Russian
Svoboda | Graniru | BBC Russia | Golosameriki | Facebook

This website requires cookies, and the limited processing of your personal data in order to function. By using the site you are agreeing to this as outlined in our privacy notice and cookie policy.

Abstract 


Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.

Free full text 


Logo of nihpaLink to Publisher's site
Nat Genet. Author manuscript; available in PMC 2011 May 11.
Published in final edited form as:
PMCID: PMC3092069
NIHMSID: NIHMS196426
PMID: 20208533

Variance component model to account for sample structure in genome-wide association studies

Abstract

Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.

GWASs may utilize either case-control cohorts to test for associations with diseases or population cohorts to identify associations with quantitative traits. In both cases, it is assumed that the cohorts consist of unrelated individuals that share the same population background, although this may not hold in practice for cohorts used in many current GWASs. The presence of related individuals within a study sample results in sample structure, a term that encompasses population stratification and hidden relatedness. Population stratification refers to the inclusion of individuals from different populations within the same study sample. Hidden relatedness refers to the presence of unknown genetic relationships between individuals within the study sample1,2. The effects of sample structure present in cohorts used for genetic association studies have been well documented and identified as a cause for some spurious associations3,4.

Although limiting study samples entirely to unrelated individuals may be difficult or impossible, genotype data provides valuable information on the sample structure that can inform genetic association analysis. For example, the STRUCTURE software5 uses genotype data to partition the sample into subpopulations within which there is no sample structure and subsequently carries out association tests within the identified subpopulations. To eliminate the effects of hidden relatedness, one can estimate the proportion of genes identical by descent (IBD) between any pair of individuals in the sample and exclude from the analysis those individuals that appear closely related1,6. Population stratification and hidden relatedness, however, constitute just two extreme manifestations of sample structure, and methods are needed to correct for other forms of sample structure. In the genomic control approach7,8, which has been widely adopted, the distribution of test statistics from the single-marker analysis is used to estimate the inflation factor, λ, with which the test statistics are subsequently rescaled, constraining the risk of false positives. The EIGENSTRAT software9,10 uses principal components analysis (PCA) to detect and describe sample structure and has been widely used in GWASs. Some principal components may represent broad differences across individuals within a given data set, effectively capturing a few major axes of population structure, but it is unclear how to interpret the rest of the principal components as surrogates of sample structure11,12. Currently, association studies typically use a combination of these strategies, first identifying close relatives to remove them from analysis, then correcting for broad sample structure using principal components or spatial information and finally correcting for the residual inflation with genomic control6,13,14.

If we knew the complete genealogy of the population, we could, in principle, apply a variance component method to model the effects of the genetic relationships on the phenotypes; this approach would be similar in spirit to the classical polygenic model15 directly applied to association mapping16. The variance component would capture the complex mixture of both population stratification and hidden relatedness that directly results from the genealogy and would correct for these relationships during the mapping. Although the exact genetic relationships between individuals in the samples are unknown, we could take advantage of the high-density genotype information to empirically estimate the level of relatedness between reportedly unrelated individuals.

We report here an approach for correcting for sample structure within GWASs, based on a linear mixed model (also sometimes referred to as a mixed linear model) with an empirically estimated relatedness matrix to model the correlation between phenotypes of sample subjects. Similar variance component approaches have been used successfully in animal models17-19. However, applying even an efficient implementation of a variance component approach, such as EMMA (ref. 19), is computationally intractable for data sets consisting of thousands of individuals, owing to the heavy computational burden in the estimation of variance parameters. Capitalizing on the characteristics of complex traits in humans, we make a few simplifying assumptions that allow us to markedly increase the speed of computations, making our approach readily applicable to GWASs with tens of thousands of individuals assayed at hundreds of thousands of SNPs. For most genetic association studies in humans, because the effect of any given locus on the trait is very small20, we need to estimate the variance parameters only once for each data set, and we can globally apply them to each marker. Our computational improvements reduce the running time for the analysis of a typical GWAS data set using a variance component model from years to hours. The advantage of the variance component approach is that the empirical relatedness matrix encodes a wide range of sample structures, including both hidden relatedness and population stratification. Principal component–based methods, in contrast, by estimating major axes of the pairwise genetic similarity matrix, capture some, but not all, of the sample structure, as we show below.

We evaluate our method using two human GWAS data sets, from the 1966 Northern Finland Birth Cohort (NFBC66)13,21 and the Wellcome Trust Case Control Consortium (WTCCC)6. The NFBC66 is based on a founder population, which is expected to minimize genetic heterogeneity, increasing the chances of mapping genes underlying traits of interest22. This is an ideal sample to evaluate our method because a detailed study23 of this data set has revealed the presence of substantial population structure that could influence the results of genetic association studies. In addition, we apply our method to the case-control studies for seven common complex diseases conducted by the WTCCC6. In both data sets, our method consistently outperforms both genomic control and principal component analysis. We term our method EMMA eXpedited (EMMAX) because it builds on the previous approach EMMA (ref. 19) and markedly reduces the computational cost.

RESULTS

Revisiting principal component analysis in NFBC66

To more closely examine the extent of sample structure within the NFBC66, we used PCA of the genotype covariance matrix9 and multidimensional scaling analysis (MDS) of the identity-by-state (IBS) matrix from NFBC66 samples. The first two coordinates identified by MDS are known to correlate well with the geographical location of the linguistic groups13. The first two principal components in the current sample correlate well with latitude and longitude of parental birthplaces for the subset of individuals with known ancestry (Fig. 1). Indeed, we noted that PCA of genotypes and classical MDS of the IBS matrix lead to very similar results. There is a correlation coefficient of 0.9993 between the first components from PCA and MDS and a correlation coefficient of 0.9978 between the second components. The first five principal components separate to varying degrees the linguistic and geographic subgroups comprising northern Finland (Supplementary Fig. 1), consistent with the previous analysis using MDS13. Despite the clear correlation between geographical regions of origin and the first two principal components, clustering analyses of the IBS matrix using PLINK software or hierarchical clustering in R did not identify separate subgroups.

An external file that holds a picture, illustration, etc.
Object name is nihms196426f1.jpg

Scatter plots of the first two principal components against latitude and longitude. Only individuals of known ancestry are included in the plot. Latitude and longitude are defined as the average latitude and longitude of the parents’ birthplaces. Colors indicate linguistic or geographic subgroups.

Association analysis

Performing a simple uncorrected association test for each of the nine phenotypes originally examined in ref. 13, we made the following estimates of the genomic control parameters λ: body mass index, 1.031; C-reactive protein (CRP), 1.007; diastolic blood pressure, 1.031; glucose, 1.045; high-density lipoprotein (HDL), 1.052; insulin plasma levels, 1.029; low-density lipoprotein, 1.098; systolic blood pressure, 1.066; triglyceride, 1.023. These values are all higher than the ones obtained previously with a smaller sample size13 and are substantially higher than what one would expect in a sample with no structure. In addition, the height phenotype, which was not analyzed in the previous study13, has a λ value of 1.187. For reference, note that a conservative estimate of the 95% confidence interval of the inflation factor is between 0.992 and 1.008, assuming independence between the markers.

As hidden relatedness is a possible cause of inflated genomic control parameters, we reanalyzed the data after excluding a larger number of possibly related subjects (a genome-wide IBD estimate of >10% was used as a cutoff with PLINK software, excluding an additional 611 individuals). This resulted in a slight reduction of λ for some phenotypes (Table 1).

Table 1

Comparison of genomic control inflation factors obtained with different models

PhenotypeGenomic control inflation factor
UncorrectedIBD < 0.1ES100EMMAX
CRP1.0071.0071.0190.993
TG1.0231.0101.0191.002
INS1.0291.0221.0131.005
DBP1.0311.0191.0281.007
BMI1.0311.0241.0160.995
GLU1.0451.0331.0301.008
HDL1.0521.0561.0361.004
SBP1.0661.0561.0211.006
LDL1.0981.0891.0401.002
Height1.1871.1511.0741.003

ES100, EIGENSOFT correcting for 100 principal components; IBD < 0.1, uncorrected analysis after excluding 611 individuals whose PLINK’s IBD estimates with another individual is greater than 0.1; phenotype abbreviations are CRP, C-reactive protein; TG, triglyceride; INS, insulin plasma levels; DBP, diastolic blood pressure; BMI, body mass index; GLU, glucose; HDL, high-density lipoprotein; SBP, systolic blood pressure; LDL, low density lipoprotein.

As suggested in ref. 9, we explored the effect of including a variable number of principal components in the association tests. Although including two or five principal components are included has a considerable effect on the λ values, further augmenting the number of principal components does not substantially decrease the genomic control parameter (Fig. 2). It is often suggested that only principal components having predictive power for the phenotype should be included in the regression11. We identified principal components for each phenotype that have a t-test P < 0.005 as predictors; the results of their inclusion in the association tests are reported in Figure 2.

An external file that holds a picture, illustration, etc.
Object name is nihms196426f2.jpg

The genomic control parameters for ten traits change with the number of principal components used for adjustment. Sig PC, significant principal components, includes the principal components (PC) that have a t-test P value < 0.005 as predictors for each of the phenotypes. LDL, low density lipoprotein; SBP, systolic blood pressure; HDL, high-density lipoprotein; GLU, glucose; BMI, body mass index; DBP, diastolic blood pressure; INS, insulin plasma levels; TG, triglyceride; CRP, C-reactive protein.

Correcting for sample structure

We analyzed the ten NFBC66 phenotypes with EMMAX using a three-step procedure (see Online Methods). First, we computed a pairwise relatedness matrix from high-density markers, which we used to represent the sample structure. Second, we estimated the contribution of the sample structure to the phenotype using a variance component model, resulting in an estimated covariance matrix of phenotypes that models the effect of genetic relatedness on the phenotypes. Third, we applied a generalized least square (GLS) F-test24, or alternatively a score test25, at each marker to detect associations accounting for the sample structure using the covariance matrix.

The second step also provides us with the fraction of phenotypic variance explained by the empirically estimated relatedness matrix. We call this fraction pseudoheritability because it resembles the heritability estimated from a pedigree26, although this is not directly interchangeable with heritability of the trait because the estimated pairwise relatedness does not correspond exactly to the kinship coefficients. Nonetheless, the pseudoheritability estimates are concordant with the previous heritability estimates from a large family based study of Kosrae and Sardinian populations27,28. Different methods for estimating the pairwise relatedness provide slightly different but highly correlated estimates of pseudoherit-ability across the ten traits. (Supplementary Table 1).

Using the estimated covariance matrix, we proceeded with the GLS F-test to test the effect of each marker on the phenotype and then applied genomic control to quantify the amount of residual inflation. The genomic control λ parameters we obtained with EMMAX are much lower than those obtained using either standard association methods or regression analysis including 100 principal components (Table 1). Figure 3 and Supplementary Figure 2 illustrate the results using quantile-quantile plots of the P value distributions from these three tests. Only one of the ten phenotypes showed λ values within the 95% confidence interval of 0.992–1.008 with uncorrected or principal component analysis, whereas all of them fell in the confidence interval with EMMAX.

An external file that holds a picture, illustration, etc.
Object name is nihms196426f3.jpg

Comparison of P value distributions across different methods with NFBC66 data. (a) Quantile-quantile plot of the height phenotype, which shows the largest inflation of test statistics, before application of genomic control. The shadowed region represents a conservative 95% confidence interval (CI) computed from the beta distribution assuming independence markers. ES100 indicates EIGENSOFT correcting for 100 principal components. (b) Comparison of LDL association P values between uncorrected and EMMAX analysis after application of genomic control in a logarithmic scale.

Unlike genomic control, the EMMAX model alters the ranking of SNPs by their association statistics. This is especially important as many recent GWAS follow-up and multistage design studies take the approach of genotyping all SNPs exceeding some predefined threshold29-31. We examined the extent to which the adoption of the EMMAX model changes the SNP rankings in comparison to the uncorrected and principal component analyses. We took the top k markers from the results of EMMAX, the uncorrected method, and regression including 100 principal components (as implemented in EIGENSOFT software), for k between 10 and 5,000. For each of these sets, we calculated the number of SNPs shared between the lists and the fraction of these shared SNPs relative to the number of unique SNPs in each pair of lists. Although many of the top SNPs reported by each method overlap, a considerable number of highly ranked SNPs differ between the methods (Fig. 4 and Supplementary Table 2). In general, EMMAX results are similar to uncorrected analysis when the inflation of test statistics is small, but they become more similar to the PCA as the inflation increases. Notably, the PCA consistently shows larger departures from the uncorrected analysis than EMMAX does across all ten phenotypes. For example, when the overdispersion of test statistics was negligible, such as in the CRP phenotype, only 66% of the top 2,000 hits were concordant between the principal component and the uncorrected analysis, whereas 89% were concordant between EMMAX and the uncorrected analysis.

An external file that holds a picture, illustration, etc.
Object name is nihms196426f4.jpg

Rank concordance comparison of strongly associated SNPs between different methods. The ten NFBC66 phenotypes (abbreviated as in Fig. 2) are ordered by their genomic control inflation factors. Rank concordance is presented as CAT plots45. The proportion of SNPs shared between sets of the top k SNPs for different methods are shown for 10 ≤ k ≤ 5000. Pairs of sets being compared are indicated in key at bottom; for example, Uncorr-EMMAX, comparison of uncorrected set and EMMAX set. ES100 indicates EIGENSOFT correcting for 100 principal components.

EMMAX prevents the overdispersion of test statistics using a statistical model that explicitly takes into account sample structure, rather than correcting the overdispersed test statistics caused by not taking into account genetic relatedness in the statistical model. Consequently, EMMAX can also prevent the overcorrection that would remove true positive associations. We identified 15 genome-wide significant loci with at least one of the uncorrected, 100 principal components–corrected, or EMMAX, analyses after genomic control at the suggested P threshold32 of 7.2 × 10−8 across the ten phenotypes (Table 2). In 13 of the 15 loci, EMMAX P values become smaller than the uncorrected analysis. The two-sided binomial P value of the observed asymmetry is 9.8 × 10−4 if two methods have the same statistical power. With the 100 principal components–corrected analysis, 10 of the 15 loci show smaller P values than the uncorrected analysis (binomial P value of 0.12). Although 12 of the 15 loci are found by all methods to be genome-wide significant at P < 7.2 × 10−8, two known loci33, APOB (with triglyceride) and HNF4A (with HDL), pass the threshold only with EMMAX. In contrast, the locus NR1H3 (with HDL), which is genome-wide significant only with uncorrected analysis, turns out to be the only locus whose association has not yet been replicated by an independent study among the 15 loci.

Table 2

Fifteen peak associated SNPs with genome-wide significance

TraitrsIDChrBase positionaClosest geneP value
Uncorrected + GCES100 + GCEMMAX + GC
HDLrs37642611655550825CETP7.0 × 10−313.8 × 10−313.7 × 10−32
CRPrs27945201157945440CRP4.8 × 10−233.6 × 10−233.0 × 10−23
LDLrs6467761109620053CELSR25.4 × 10−147.7 × 10−153.8 × 10−15
CRPrs265000012119873345LEF12.1 × 10−127.0 × 10−121.9 × 10−12
HDLrs15320851556470658LIPC4.3 × 10−127.9 × 10−111.0 × 10−11
GLUrs5608872169471394G6PC21.1 × 10−114.1 × 10−123.1 × 10−12
LDLrs693221085700APOB9.6 × 10−111.5 × 10−112.8 × 10−11
TGrs1260326227584444GCKR1.9 × 10−105.9 × 10−111.8 × 10−10
HDLrs2550491666570972LCAT3.9 × 10−91.2 × 10−91.4 × 10−8
LDLrs116684771911056030LDLR1.4 × 10−83.2 × 10−84.1 × 10−9
GLUrs2971671744177862GCK1.8 × 10−81.7 × 10−91.6 × 10−8
HDLrs71201181147242866NR1H3b4.8 × 10−86.6 × 10−51.1 × 10−6
TGrs10096633819875201LPL2.0 × 10−81.1 × 10−81.9 × 10−8
TGrs673548221091049APOB8.0 × 10−81.2 × 10−76.4 × 10−8
HDLrs18009612042475778HNF4A1.5 × 10−79.5 × 10−81.8 × 10−8

These SNPs had P values below the suggested32 genome-wide significance threshold of 7.2 × 108 in the uncorrected, the 100 principal components–corrected (ES100) or the EMMAX analysis after genomic control (+GC). Traits are HDL, high-density lipoprotein; CRP, C-reactive protein; LDL, low density lipoprotein; GLU, glucose; TG, triglyceride. rsID, reference SNP ID assigned by dbSNP; Chr, chromosome; boldface indicates the strongest P values across the three methods; italics indicate P values that did not surpass the significance threshold.

aPositions are based on National Center for Biotechnology Information build 36.1.
bNR1H3 is the locus whose association with HDL that has not yet been replicated by other independent studies.

Because EMMAX estimates the variance parameters under the null hypothesis, one may suspect that it is underpowered compared to the full mixed model, which estimates the variance parameters under the alternative hypothesis. This is comparable to the difference between the score statistic and the efficient score statistic25,34,35. As most genetic variants associated to date with human complex traits are estimated to explain only a small fraction of phenotypic variance20, the difference between the two approaches will be negligible in most cases. To assess the seriousness of this concern, we ran the original EMMA, which uses a full mixed effect model, on the 15 peak SNPs and compared the resulting P values to those estimated with EMMAX using GLS. Overall, as expected, the P values from the full mixed effect model tended to be smaller than the P value from the GLS model, but the magnitude of the difference was very small (Supplementary Fig. 3a). However, the running times for EMMA were substantially longer. Because the original EMMA re-estimates the variance parameters at each marker, given the size of the NFBC data set, it took more than 10 min of CPU time per marker on an Intel Xeon 3-GHz processor, even with an efficient C implementation of EMMA. A simple extrapolation suggests that it would take more than 6 years of CPU time to analyze a single GWAS data set using EMMA, taking a full mixed model approach. The total computational time using EMMAX for this data was 6.6 h in a single CPU, and the procedure could easily be parallelized to speed it up further.

Application to Wellcome Trust Case Control Consortium data

We also applied our method to the WTCCC data set consisting of case-control studies for seven common diseases6. To analyze case-control phenotypes, we applied a linear model to the binary phenotypes, in the spirit of Armitage’s test (see Online Methods). We performed association testing over the seven disease phenotypes using EMMAX, EIGENSTRAT and uncorrected analysis. The values we observed for inflation factors λ were very similar to those in the original study, in which the test statistics were uncorrected: bipolar disease, 1.11; coronary artery disease, 1.06; Crohn’s disease, 1.10; hypertension, 1.06; rheumatoid arthritis, 1.03; type 1 diabetes, 1.04; and type 2 diabetes, 1.07. Consistent with our observations over the NFBC66 data, correcting for 100 principal components only partially reduced the inflation factors (Table 3 and Supplementary Fig. 2). When EMMAX was applied, the estimated inflation factors were below the upper bound of the confidence interval, suggesting that none of the phenotypes show significant inflation of test statistics.

Table 3

Comparison of genomic control inflation factor obtained with different models in seven WTCCC phenotypes

PhenotypeGenomic control inflation factor
UncorrectedES100EMMAX
BD1.1051.0710.998
CAD1.0631.0481.006
CD1.0981.0551.000
HT1.0551.0510.997
RA1.0281.0310.965 (0.989a)
T1D1.0431.0280.946 (0.991a)
T2D1.0651.0420.996

ES100, EIGENSOFT correcting for 100 principal components; BD, bipolar disorder; CAD, coronary artery disease; CD, Crohn’s disease; HT, hypertension; RA, rheumatoid arthritis; T1D, type 1 diabetes; T2D, type 2 diabetes.

aThe variance component parameters (σ2a and σ2e) are estimated by conditioning on the large-sized SNP effects explaining 1% or more phenotypic variance.

However, we noticed that two of the phenotypes, rheumatoid arthritis and type 1 diabetes, show significant deflation of test statistics beyond the 95% confidence interval (λ = 0.965 for rheumatoid arthritis, λ = 0.946 for type 1 diabetes). This is not unexpected, considering that a substantial fraction of the phenotypic variance in these autoimmune diseases is explained by the HLA loci, leading to inaccurate estimation of variance parameters under the null hypothesis when the HLA effect is not accounted for. In fact, the set of genome-wide significant SNPs (P < 7.2 × 10−8; ref. 32) in this region account for 47% and 60% of the phenotypic variance of rheumatoid arthritis and type 1 diabetes, respectively6. We re-estimated the variance parameters by conditioning on the 57 and 134 SNPs within the extended human MHC region36 that explain more than 1% of phenotypic variance of rheumatoid arthritis and type 1 diabetes, respectively (as described in the Supplementary Note). As a result, the genomic control λ increased to 0.989 for rheumatoid arthritis and 0.991 for type 1 diabetes. We performed this conditioning procedure only for estimating variance parameters and not in the SNP association test so that the P values would be consistent with the unconditioned analysis. Conditioning on the SNPs with such a strong effect may further improve the power to identify novel loci. A more sophisticated conditional analysis—for example, one including haplotype effects or epistatic interactions into covariates—may also better account for the strong effects in the autoimmune diseases37.

Marker-specific inflation factors

Under certain conditions, one can expect the variance of the test statistics to be inflated by a constant across the genome7,8. A formal model of hidden relatedness based on the coalescent theory1 also suggests a constant inflation across the genome when the sample structure is entirely due to hidden relatedness7. However, for a more complex genealogical relationship among individuals, it is not clear how the inflation of test statistics will behave.

Using the same variance component framework, we developed a method to estimate the marker-specific inflation of test statistics using the correlation between each marker and the empirically estimated kinship matrix (described in the Supplementary Note). These estimates are concordant with the genome-wide genomic control inflation factor on average but show substantial differences across the SNPs (Fig. 5a). In the height phenotype, for example, the estimated marker-specific inflation factors have a mean of 1.107, s.d. of 0.090 and median value of 1.093. In light of this, we explored the relationship between marker-specific inflation factors and the overdispersion of test statistics with the uncorrected analysis. The distribution of height association P values for SNPs with inflation factors <1.05 shows a less marked departure from uniform distribution than does the distribution for SNPs with inflation factors >1.20 (Fig. 5b,c). Considering that SNPs with a higher inflation factor were identified without consideration of their possible association with the phenotype, it is reasonable to conclude that this excess of small P values reflects overdispersion of test statistics.

An external file that holds a picture, illustration, etc.
Object name is nihms196426f5.jpg

Distribution of the marker-specific inflation factors from NFBC66 data sets. (a) Box plots of the marker-specific inflation factors across ten phenotypes, in addition to the genomic control inflation factor for each phenotype. Abbreviations are as in Figure 2. (b,c) Distributions of P values of the height phenotype association when the estimated per-marker inflation factors are less than 1.05 (35,988 SNPs; b) and when they are greater than 1.2 (15,874 SNPs; c).

These results underscore how correcting the test statistics using a single inflation factor may be inappropriate, possibly reducing power and not sufficiently controlling for false positives. To further demonstrate this point, we ran a simple simulation using the variance component model on which EMMAX is based. Although simulating data under this model puts our method at an advantage, and the approach is therefore less suited for comparison to other models, it does demonstrate that under some circumstances uniformly deflating P values may be inappropriate. We randomly simulated 100 sets of phenotypes solely from the sample structure with no SNP effects and examined the quantile-quantile plots across different methods. Although the inflation for most of the SNPs is corrected by genomic control as expected, we observed substantial fluctuations of the test statistics at the tail of the distribution (Supplementary Fig. 4a,b).

More than 25% of the phenotypes showed inflation or deflation beyond the 95% confidence interval. This is because the SNPs with higher per-marker inflation are not sufficiently corrected by the constant genomic control inflation factor. In contrast, EMMAX results in P values close to the expected distribution (Supplementary Fig. 4c).

The finding that marker-specific inflation factors vary substantially across the genome has notable implications for meta-analyses and multistage analyses. Such studies typically combine the test statistics after correcting for potential inflation using genomic control30,31,38. The disadvantages of using the same global correction rather than a marker-specific one can become more serious when this step is done repeatedly. To better understand these effects in the context of meta-analysis, we first compared the marker-specific inflation factors between the two WTCCC control groups, collected from essentially the same population. We observed a very strong correlation (r = 0.95; Supplementary Fig. 5a). We further compared the inflation factors across different populations and different genotyping platforms using the NFBC66 samples and WTCCC control samples. We observed a strong correlation of r = 0.70 (Supplementary Fig. 5b), suggesting that the marker-specific inflation factors can be correlated across the multiple data sets used in meta-analysis or multistage analysis owing to the shared genetic history. If this is the case, the standard approach that corrects with genomic control before merging the P values from different studies may lead to further inaccuracies: tests at some markers would be excessively, or not sufficiently, deflated multiple times, resulting in an accumulation of errors.

DISCUSSION

We report here the development of the EMMAX program, taking an expedited mixed linear model approach to correct for sample structure within human GWASs. We demonstrate its effectiveness with the analysis of two human GWAS data sets, including quantitative as well as disease traits. The proposed approach differs substantially from genomic control in that it accounts for inflation owing to population structure in a marker-specific manner, resulting in a modified ranking of association results. Accounting for marker-specific effects can reduce both false positives and false negatives. We discuss this issue in more detail in the Supplementary Note.

There are several other methods that take into account pedigree-based or empirically estimated kinship matrices in the statistical test39-42. One of the key differences between these methods and the mixed model methods, including EMMAX, is that the mixed model methods have a procedure for estimating the contribution of the kinship matrix to the phenotypes, whereas the other methods do not. Estimating the phenotypic variance contributed by the sample structure enabled us to avoid undercorrection or overcorrection of the sample structure in the NFBC66 and WTCCC data sets.

The effective application of our method depends on an appropriate estimate of the variance parameters. The IBS or Balding-Nichols matrix43 appears to be better than IBD estimates at capturing the long-distance relationships that result in variations at the population level. However, when the structure of the sample at hand is better described in terms of fairly recent hidden relatedness, methods based on the estimation of IBD may have an advantage. In principle, our approach is also suitable for association testing in a data set including individuals from a heterogeneous population with admixed background. In such cases, it is important to consider SNP ascertainment bias in estimating the degree of relatedness between individuals. Because many SNP probes in genotyping arrays are selected from European populations, the marker-based pairwise distance between two individuals may appear to be larger between unrelated European samples than between unrelated individuals from other populations. To resolve the resulting ascertainment bias, each SNP may be differently weighted when the IBS similarity matrix is computed. A general framework has been presented19 for computing the similarity matrix with a different weight for each marker. Different weighting schemes can also be used to account for heterogeneous distribution of effect size from each marker or each genomic region.

Besides the choice of the kinship matrix, the estimation of variance parameters is also a crucial part of the EMMAX approach. In our analysis of the NFBC data, we show that estimating these parameters under the null hypothesis does not lead to appreciable bias in the association P values. The example of rheumatoid arthritis and type 1 diabetes in the WTCCC data set, in contrast, reveals the difficulties encountered by EMMAX when there are SNPs explaining a large fraction of phenotypic variance. In such cases, we show that estimating variance parameters conditionally on the SNPs with stronger effects alleviates these concerns.

Finally, whereas the analysis presented here relies on decomposing the variance into two terms, a genetic relatedness component and a component representing residual effects, future studies may need to account for additional variance components to more precisely model the heterogeneous phenotypic variance. In expression quantitative trait loci mapping, for example, one may want to add additional variance components to account for technical bias44. When multiple variance components are involved, one would need to make use of algorithms such as PROC MIXED implemented in SAS, as EMMA is developed for two variance components only; this would increase the running time of the first step of our procedure. However, because the same variance components estimated from the null hypothesis would be used across the genome-wide markers, the overall computational time should still be acceptable.

METHODS

Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturegenetics/.

Supplementary Material

Supplementary Data and Text

Acknowledgments

We thank the NFBC66 team for access to phenotype and genotype data used in the analyses presented here. The genotype data were generated at the Broad Institute with support from National Heart, Lung, and Blood Institute grant 6R01HL087679-03. We thank D. Clayton for reading through the manuscript and for providing important suggestions. We acknowledge the WTCCC for allowing us to use their data set. H.M.K., N.A.Z., J.H.S. and E.E. are supported by National Science Foundation grants 0513612, 0731455 and 0729049, and National Institutes of Health (NIH) grants 1K25HL080079 and U01-DA024417. N.A.Z. is supported by the Microsoft Research Fellowship. H.M.K. is supported by the Samsung Scholarship, National Human Genome Research Institute grant HG00521401, National Institute for Mental Health grant NH084698 and GlaxoSmithKline. C.S. is partially supported by NIH grants GM053275-14, HL087679-01, P30 1MH083268, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. N.B.F. and S.K.S. are supported by NIH grants HL087679-03, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences.

Footnotes

AUTHOR CONTRIBUTIONS H.M.K., J.H.S., C.S. and E.E. designed the methods and experiments; H.M.K., J.H.S., S.K.S., S.-y.K., N.B.F., C.S. and E.E. jointly analyzed the NFBC66 data set; H.M.K., J.H.S., N.A.Z., C.S. and E.E. jointly analyzed the WTCCC data set; H.M.K., J.H.S., S.K.S., N.B.F., C.S. and E.E. wrote the manuscript; all authors contributed their critical reviews of the manuscript during its preparation.

Note: Supplementary information is available on the Nature Genetics website.

COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.

References

1. Voight BF, Pritchard JK. Confounding from cryptic relatedness in case-control association studies. PLoS Genet. 2005;1:e32. [Abstract] [Google Scholar]
2. Weir BS, Anderson AD, Hepler AB. Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet. 2006;7:771–780. [Abstract] [Google Scholar]
3. Newman DL, Abney M, McPeek MS, Ober C, Cox NJ. The importance of genealogy in determining genetic associations with complex traits. Am J Hum Genet. 2001;69:1146–1148. [Europe PMC free article] [Abstract] [Google Scholar]
4. Helgason A, Yngvadttir B, Hrafnkelsson B, Gulcher J, Stefnsson K. An Icelandic example of the impact of population structure on association studies. Nat Genet. 2005;37:90–95. [Abstract] [Google Scholar]
5. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet. 2000;67:170–181. [Europe PMC free article] [Abstract] [Google Scholar]
6. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [Europe PMC free article] [Abstract] [Google Scholar]
7. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. [Abstract] [Google Scholar]
8. Bacanu SA, Devlin B, Roeder K. Association studies for quantitative traits in structured populations. Genet Epidemiol. 2002;22:78–93. [Abstract] [Google Scholar]
9. Price AL, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909. [Abstract] [Google Scholar]
10. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:e190. [Abstract] [Google Scholar]
11. Novembre J, Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nat Genet. 2008;40:646–649. [Abstract] [Google Scholar]
12. Novembre J, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. [Europe PMC free article] [Abstract] [Google Scholar]
13. Sabatti C, et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat Genet. 2009;41:35–46. [Europe PMC free article] [Abstract] [Google Scholar]
14. Cho YS, et al. A large-scale genome-wide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 2009;41:527–534. [Abstract] [Google Scholar]
15. Fisher SRA. The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb. 1918;52:399–433. [Google Scholar]
16. Ober C, Abney M, McPeek MS. The genetic dissection of complex traits in a founder population. Am J Hum Genet. 2001;69:1068–1079. [Europe PMC free article] [Abstract] [Google Scholar]
17. Yu J, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38:203–208. [Abstract] [Google Scholar]
18. Zhao K, et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet. 2007;3:e4. [Abstract] [Google Scholar]
19. Kang HM, et al. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723. [Europe PMC free article] [Abstract] [Google Scholar]
20. Manolio TA, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. [Europe PMC free article] [Abstract] [Google Scholar]
21. Rantakallio P. Groups at risk in low birth weight infants and perinatal mortality. Acta Paediatr Scand. 1969;193(suppl.):1–71. [Abstract] [Google Scholar]
22. Varilo T, Peltonen L. Isolates and their potential use in complex gene mapping efforts. Curr Opin Genet Dev. 2004;14:316–323. [Abstract] [Google Scholar]
23. Jakkula E, et al. The genome-wide patterns of variation expose significant substructure in a founder population. Am J Hum Genet. 2008;83:787–794. [Europe PMC free article] [Abstract] [Google Scholar]
24. Kariya T, Kurata H. Generalized Least Squares. John Wiley & Sons; 2004. [Google Scholar]
25. Chen WM, Abecasis GR. Family-based association tests for genomewide association scans. Am J Hum Genet. 2007;81:913–926. [Europe PMC free article] [Abstract] [Google Scholar]
26. Lynch M, Walsh B. Genetics and Analysis of Quantitative Traits. Sinauer; Sunderland, Massachusetts: 1998. [Google Scholar]
27. Lowe JK, et al. Genome-wide association studies in an isolated founder population from the Pacific Island of Kosrae. PLoS Genet. 2009;5:e1000365. [Europe PMC free article] [Abstract] [Google Scholar]
28. Pilia G, et al. Heritability of cardiovascular and personality traits in 6,148 Sardinians. PLoS Genet. 2006;2:e132. [Europe PMC free article] [Abstract] [Google Scholar]
29. Easton DF, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447:1087–1093. [Europe PMC free article] [Abstract] [Google Scholar]
30. Thomas G, et al. A multistage genome-wide association study in breast cancer identifies two new risk alleles at 1p11.2 and 14q24.1 (RAD51L1) Nat Genet. 2009;41:579–584. [Europe PMC free article] [Abstract] [Google Scholar]
31. Ahmed S, et al. Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nat Genet. 2009;41:585–590. [Europe PMC free article] [Abstract] [Google Scholar]
32. Dudbridge F, Gusnanto A. Estimation of significance thresholds for genomewide association scans. Genet Epidemiol. 2008;32:227–234. [Europe PMC free article] [Abstract] [Google Scholar]
33. Kathiresan S, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet. 2008;40:189–197. [Europe PMC free article] [Abstract] [Google Scholar]
34. Hinkley DV. Theoretical Statistics. CRC Press; Boca Raton: 1979. [Google Scholar]
35. Whittemore AS, Tu IP. Simple, robust linkage tests for affected sibs. Am J Hum Genet. 1998;62:1228–1242. [Europe PMC free article] [Abstract] [Google Scholar]
36. de Bakker PIW, et al. A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat Genet. 2006;38:1166–1172. [Europe PMC free article] [Abstract] [Google Scholar]
37. Nejentsev S, et al. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature. 2007;450:887–892. [Europe PMC free article] [Abstract] [Google Scholar]
38. Zeggini E, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet. 2008;40:638–645. [Europe PMC free article] [Abstract] [Google Scholar]
39. Thornton T, McPeek MS. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81:321–337. [Europe PMC free article] [Abstract] [Google Scholar]
40. Guan W, Liang L, Boehnke M, Abecasis GR. Genotype-based matching to correct for population stratification in large-scale case-control genetic association studies. Genet Epidemiol. 2009;33:508–517. [Europe PMC free article] [Abstract] [Google Scholar]
41. Choi Y, Wijsman EM, Weir BS. Case-control association testing in the presence of unknown relationships. Genet Epidemiol. 2009;33:668–678. [Europe PMC free article] [Abstract] [Google Scholar]
42. Rakovski CS, Stram DO. A kinship-based modification of the armitage trend test to address hidden population structure and small differential genotyping errors. PLoS One. 2009;4:e5825. [Europe PMC free article] [Abstract] [Google Scholar]
43. Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. [Abstract] [Google Scholar]
44. Kang HM, Ye C, Eskin E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 2008;180:1909–1925. [Europe PMC free article] [Abstract] [Google Scholar]
45. Irizarry RA, et al. Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005;2:345–350. [Abstract] [Google Scholar]

Citations & impact 


Impact metrics

Jump to Citations

Citations of article over time

Alternative metrics

Altmetric item for https://www.altmetric.com/details/102084005
Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/102084005

Smart citations by scite.ai
Smart citations by scite.ai include citation statements extracted from the full text of the citing article. The number of the statements may be higher than the number of citations provided by EuropePMC if one paper cites another multiple times or lower if scite has not yet processed some of the citing articles.
Explore citation contexts and check if this article has been supported or disputed.
https://scite.ai/reports/10.1038/ng.548

Supporting
Mentioning
Contrasting
23
2675
5

Article citations


Go to all (1,518) article citations

Data 


Data behind the article

This data has been text mined from the article, or deposited into data resources.

Similar Articles 


To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.


Funding 


Funders who supported this work.

NHLBI NIH HHS (6)

NIDA NIH HHS (2)

NIDCR NIH HHS (2)

NIEHS NIH HHS (1)

NIGMS NIH HHS (2)

NIH HHS (1)

NIMH NIH HHS (3)

NINDS NIH HHS (2)