Abstract
Free full text
Comparing strategies to fine map the association of common SNPs on chromosome 9p21 to Type 2 Diabetes and Myocardial Infarction
Abstract
Non-coding variants at human chromosome 9p21 near CDKN2A and CDKN2B are associated with type 2 diabetes (T2D)1-4, myocardial infarction (MI)5-7, aneurysm8, vertical cup disc ratio9, and at least five cancers10-16. We compared approaches to more comprehensively assess genetic variation in the region. We performed targeted sequencing at high coverage in 47 individuals and compared the results to pilot data from the 1000 Genomes Project. We imputed variants into T2D and MI cohorts directly from targeted sequencing, from a genotyped reference panel derived from sequencing, and from 1000 Genomes low-coverage data. Common polymorphisms were captured similarly by all strategies. Imputation of intermediate frequency polymorphisms required a higher density of tag SNPs in disease samples than available on first generation Genome Wide Association Study (GWAS) arrays. Association analyses identified more comprehensive sets of variants demonstrating equivalent statistical association to T2D or MI, but did not identify stronger associations the original GWAS signals.
Following the identification of a disease-associated region by GWAS, comprehensive study of sequence variation in the region is required to identify the full set of variants that might explain the association signal. Since GWAS arrays incompletely capture DNA variation in each region, it has been hypothesized that causal variants partially captured by linkage disequilibrium (LD) – due to location near recombination hotspots or lower minor allele frequency – might, if directly tested, display stronger association to phenotype than the tag SNPs used in GWAS. In particular, because HapMap and GWAS arrays contain primarily variants with minor allele frequency (MAF) >5%, first generation GWAS studies failed to test polymorphisms of somewhat lower frequency that might have larger effects on disease risk. Finally, even in regions where the true association signal is well captured by LD to array SNPs, enumeration of all associated variants is a necessary prerequisite to functional experiments that will identify causal mutation(s). Thus, an important next step following GWAS is to assemble a more complete catalog of variation present in an associated region, and to test it for association to the phenotype of interest.
With the advent of next generation sequencing and the emergence of data from the 1000 Genomes (1000G) Project, investigators must choose between (or combine) multiple strategies for creating and testing a reference panel of polymorphic sites. We re-sequenced ~240kb on chromosome 9p21 (chr9:21936711-22176221, hg18) spanning the T2D and MI associations in 47 unrelated individuals of European ancestry from the HapMap CEU population17 as part of a sequencing project spanning six T2D-associated regions (Supplementary Table 1). Sequencing was performed at the Broad Institute on Illumina Genome Analyzers (Supplementary Note, all data available in the NCBI Short Read Archive). An analytical framework (Supplementary Note, Supplementary Table 2, Supplementary Figs. 1-5), since extended and incorporated in the Genome Analysis Tool Kit18,19, was developed and includes methods to empirically recalibrate Illumina base quality scores, a Bayesian framework to call SNPs, local re-alignment to identify insertions/deletions (and remove clusters of false positive SNPs), and filters to remove false positive SNP calls based on discrepancy between forward and reverse strands.
This targeted sequencing identified 635 high-confidence SNPs on chromosome 9p21 (4,463 across the six regions) (Supplementary Table 3, Supplementary Fig. 6, SNPs available in dbSNP). We evaluated sensitivity against HapMap II17 and the high coverage Pilot 2 data from the 1000 Genomes Project20 (Supplementary Note): at sites in overlapping samples with 10x or greater read coverage (70% of the region), sensitivity was 99% for HapMap variants and 97% for variants found in 1000G Pilot 2 (Supplementary Fig. 7a-c). To evaluate specificity, we genotyped 257 sites found on chromosome 9p21 but not previously genotyped in HapMap (Supplementary Fig. 7d, e). Overall, 96% of variants seen more than once in sequencing validated in the genotyping data (Supplementary Table 4).
We compared these variants to those discovered in the low-coverage Pilot 1 of the 1000 Genomes Project 20, limiting comparison to 32 CEU individuals studied in both projects. Across the six regions, both projects identified similar numbers of variants: 3,897 SNPs in the high coverage targeted sequencing as compared to 4,043 in 1000G Pilot 1. However, the variants found were in fact only partially overlapping. Of variants seen in the high coverage targeted sequencing, 22% were missed by 1000G Pilot 1 (Fig. 1), nearly all of which were rare: 72% of these sites were singletons and 12% were seen twice (Fig. 1, Table 1). (Pilot 1 successfully identified 97% of SNPs seen more than 5 times in high coverage sequencing (Table 1)). Of variants identified in Pilot 1 but not in targeted sequencing (n=998), nearly all were sites at which target capture failed to achieve high coverage: 65% of these sites had zero coverage. Thus, targeted capture and low-pass whole genome had distinct and non-overlapping failure modes.
Table 1
Number of times non-reference allele observed in this study | Number of SNPs called, this study | % Contained in 1000G Pilot 1 | % in dbSNP, build 129 | % Validated on chr9p21 |
---|---|---|---|---|
1X | 941 | 35% | 13% | 91% |
2X | 300 | 68% | 42% | 88% |
3X | 239 | 82% | 55% | 100% |
4X | 154 | 87% | 66% | 86% |
5X | 186 | 91% | 67% | 70% |
>5X | 2077 | 97% | 92% | 98% |
We evaluated methods for testing these variants for association to disease via linkage disequilibrium and haplotype-based imputation. First, we genotyped SNPs found in targeted re-sequencing on chromosome 9p21 in 168 individuals (56 parent offspring trios) from the HapMap extended CEU population21 (Supplementary Note). We used MACH22,23 to impute variants from this reference panel into 1,000 T2D patients and 1,048 controls from the Diabetes Genetics Initiative (DGI) cohort1 and 1,274 MI cases and 1,407 controls from the Myocardial Infarction Genetics (MIGen) Consortium cohort6, each previously genotyped on Affymetrix GWAS arrays (Supplementary Note).
We compared the results of imputation with this augmented reference panel (n=464 variants, Supplementary Table 5) to those obtained when imputing from HapMap II alone (n=238 variants). The addition of genotype data for a more complete collection of common variants provided imputation data for a much larger number of SNPs than was possible with HapMap II, which contains only 50-60% of common variants (Fig. 2a, b and Supplementary Fig. 8a, b). However, even with the augmented reference panel, the tag SNP density characteristic of the first generation GWAS arrays on which our disease samples were typed allowed only 80% of common (MAF > 5%) variants to be captured (either directly typed or imputed with a MACH-estimated r2 ≥ 0.8). Moreover, only a small fraction of intermediate frequency variation (MAF 2-5%) was imputed with an estimated r2 above this stringent threshold (Fig. 2c, d and Supplementary Fig. 8c, d).
To evaluate the impact of tag SNP density on imputation performance, we increased the number of tags across the region to approximately 1 SNP per 1.5kb (the previous density was ~1SNP/5kb in T2D samples and ~1SNP/3kb in MI samples) in the T2D and MI cohorts (Supplementary Note). With this increased density of tag SNPs, nearly all common variants (~98%) were captured with r2 ≥ 0.8 in disease samples. Moreover, performance for intermediate frequency variants was dramatically improved, rising from 2% to 75% with r2 ≥ 0.8 (Fig. 2e, f and Supplementary Fig. 8e, f). This result was not specific to the Affymetrix GWAS arrays, as we observed a similar improvement in imputation ability upon addition of tag SNPs using multiple other GWAS arrays (Supplementary Fig. 9).
We next compared different reference panels, imputing in each case into disease samples with the higher tag SNP density. The reference panels were: (a) the genotyped reference panel of 168 individuals above (112 unrelated individuals), (b) the targeted sequencing data (47 individuals, without genotyping and expansion into a larger sample set), and (c) 1000 Genomes Pilot 1 (55 individuals). We considered both the fraction of variants in each reference panel successfully imputed (which is related to the quality and completeness of SNP genotypes and to the size of the reference panel) and the fraction of all variation captured (which, in addition, depends on the proportion of all known SNPs represented in the reference panel).
The union of the three reference panels contained 582 variants (Fig. 3a). Each panel was partially incomplete, due to genotyping assay failure in the genotyped panel (14% of SNPs missing), sample size and low coverage in 1000 genomes (16% of SNPs missing), and sample size and gaps in coverage in the targeted sequencing (19% of SNPs missing). For common variants, there is little difference in bulk performance between the reference panels. Considering only SNPs contained in each reference panel (Fig. 3b) the genotyped panel has the highest proportion of variants imputed well. However, when all variation is considered (Fig. 3c), a lower proportion of common variation is captured by imputing from the genotyped reference panel, owing to the fact that some SNPs were missing in this panel because they failed assay design or quality control. Notably, the 1000G (freely available) and sequencing (costly) strategies performed equivalently for these common variants.
For intermediate frequency variants, there are more pronounced differences between the panels (Fig. 3b, c). These variants were best imputed from the genotyped reference panel (Fig. 3b), which was the largest and also contained trio information. This was true even when all variation was considered (Fig. 3c), suggesting that the improved imputation quality from genotype data and increased sample size offset the loss of variants in this panel due to genotyping failure. Comparing the high coverage re-sequencing and 1000G reference panels, lower frequency variants were better imputed from the high coverage re-sequencing data both when considering only the SNPs within each reference panel (Fig. 3b) and when considering the overall proportion of low frequency variants captured by imputation from each reference panel (Fig. 3c). This is consistent with the low coverage 1000G pilot 1 data being less complete and accurate for lower frequency variants20.
We tested variants for association to T2D and MI using imputation from all three reference panels to maximize the number of variants captured (Supplementary Note). Overall, we have captured 461 of the 582 polymorphic variants observed across all three reference panels in our T2D and MI samples with a MACH-estimated r2 of at least 0.8: this represents ~ 92% of all known common variants and ~52% of intermediate frequency variants (at a MACH-estimated r2 of 0.5, these figures are 98% of common variants and 83% of intermediate frequency variants). In comparison, only 176 of the 582 variants were previously captured by imputation from HapMap. Despite testing many additional SNPs in partial LD with the index GWAS hits and at allele frequencies not well captured by first generation GWAS arrays and HapMap, we found no example of a SNP with stronger association to T2D or MI than the initial GWAS signals.
However, we did identify multiple additional variants in strong LD with the GWAS hits that might underlie each association. We observed the three-tiered haplotypic association to T2D first reported by the Wellcome Trust Case Control Consortium with protective, risk, and neutral haplotypes (Table 2). The protective alleles of the GWAS SNP (rs10811661) and nine other SNPs in strong LD with this variant tag the protective haplotype (Fig. 4a, Supplementary Table 6). Interestingly, no single SNP yet identified marks the risk haplotype. Association analyses for MI identified 7 SNPs in LD with each other and with equivalent evidence for association (P < 10−4) as well as 54 additional SNPs with only slightly less evidence for association (P < 10−3) (Fig. 4b, Supplementary Table 6). Knockout of the MI-associated region in mouse alters regulation of CDKN2A and CDKN2B24, and two of the associated SNPs have recently been shown to disrupt a STAT1 binding site25. Interestingly, in addition to the SNPs disrupting the STAT1 site, there are other variants with equivalent MI association and with putative functional annotations, including variants overlapping exons of the non-coding transcript CDKN2BAS, highly conserved regions, and predicted, conserved transcription factor binding sites (Supplementary Table 6).
Table 2
Haplotypes defined by rs10757282, rs10811661 | |||
---|---|---|---|
Haplotype | Frequency | OR | P-value |
Overall Evidence | -- | -- | 4.40 × 10−5 |
CT | 0.30 | 1.29 | 3.99 × 10−4 |
TT | 0.54 | 0.96 | 5.24 × 10−1 |
CC | 0.16 | 0.72 | 2.71 × 10−4 |
This study is limited by the investigation of a single region (albeit one with at least eight different disease associations), by the early nature of the sequencing data analyzed, by the small number of samples sequenced in SNP discovery, and by the sample size of our disease cohorts. Nonetheless, the observations on the strengths and weaknesses of different methods for fine mapping GWAS signals are likely general: targeted high coverage sequencing provides high sensitivity for lower frequency variants, but has gaps in coverage; the 1000G Pilot 1 resource offers more even coverage at lower depth, currently sufficient for capture of most common variation; creating a genotyped reference panel improves accuracy and sample size, but is limited by assay conversion failures; tag SNP density characteristic of first generation GWAS is inadequate to maximally extract information with current imputation algorithms. To some extent, these limitations are transient: the growing 1000 Genomes Project resource is sequencing over 2,000 diverse samples with both low-pass whole genome and high coverage targeted exon approaches, increasing the accuracy and completeness of the public reference panel. However, our results suggest that fully exploiting this resource for imputation may require increasing tag density in GWAS disease samples and / or improved algorithms for imputation.
Finally, our study did not find evidence for stronger association at 9p21 to SNPs in moderate LD with the initial tags. While the maximum achievable association signal for lower frequency variants was limited by our sample size, we did not observe lower frequency variants with effect sizes that could individually explain the common variant associations. We do, however, identify additional common variants in LD with the GWAS hits that might underlie each association. Enumeration of all variants on 9p21 that might explain each association signal will be needed as a foundation for systematic functional studies that aim to understand how different non-coding variants in this single genomic interval can lead to such varied and clinically significant phenotypic associations.
Methods
Targeted Re-Sequencing
Six regions associated with T2D were selected for targeted re-sequencing (Supplementary Table 1). Because the goal of this study to was to identify additional SNPs that might explain the initial GWAS signal, region boundaries were selected to encompass all SNPs showing detectable linkage disequilibrium (r2 ≥ 0.2) to the T2D associated SNP with the most significant p-value. DNA was captured for sequencing by long-range PCR with 2-5kb amplicons or by hybrid selection (HS) using 170bp baits tiled across the region on an Agilent microarray26. All sequencing was performed at the Broad Institute in 2008 using Illumina Genome Analyzers. Runs from PCR-based capture generated 36bp reads and runs from HS-based capture generated 46-50bp reads. Methods for alignment, quality score adaptation and recalibration, and variant calling are described in detail in the Supplementary Note.
SNP Genotyping and Quality Control
Genotyping was performed on the Sequenom MassARRAY iPLEX platform. Quality control filters included 1) > 95% genotyping rate, 2) Hardy Weinberg equilibrium (with P > 0.001) and 3) Mendel error rate < 5%.
Phasing and Imputation
We compared several strategies and publicly available tools for phasing and imputation directly from Illumina sequencing data (Supplementary Fig. 10-11). Phased haplotypes for all reference panels were created using the PHASE software package (Version 2.1)27,28. For the genotyped reference panel, trio information was used in phasing (-P1 option). For sequencing reference panels, known phase was specified at HapMap sites (-k option). All other PHASE parameters were default values. Imputation from reference haplotypes was performed using MACH22,23 (Version 1.0.16). 100 rounds were used; all other MACH parameters were default values.
Association Analyses
Variants were tested for association using logistic regression on imputed genotype dosages and individual disease status. EIGENSTRAT29 (DGI) or PLINK30 (MIGen) was used to estimate principal components which track with the ancestry of the study samples1,6; the first ten components were used as covariates in logistic regression to account for population structure. For T2D analyses, additional covariates used were: age, gender, and body mass index. For MI analyses additional covariates used were age, gender, BMI, and smoking. Tests for haplotypic association to T2D were performed using the PLINK30 (Version 1.05) software package.
Supplementary Material
Supplementary 1
Supplementary 2
Supplementary 3
Supplementary 4
Supplementary 5
Supplementary 6
Acknowledgements
Patient collections in the DGI study were funded by grants from the Sigrid Juselius and Folkhälsan foundations as well as the Swedish Research Council (LG). The DGI GWAS study was supported by a grant from Novartis.
The MIGen study was funded by the US National Institutes of Health (NIH) and National Heart, Lung, and Blood Institute’s STAMPEED genomics research program through a grant to D.A (R01 HL087676). S.K. is supported by a Doris Duke Charitable Foundation Clinical Scientist Development Award, a charitable gift from the Fannie E. Rippel Foundation, the Donovan Family Foundation, a career development award from the NIH and the Department of Medicine and Cardiovascular Research Center at Massachusetts General Hospital. DA and JS are supported in part by a Distinguished Clinical Scholar Award from the Doris Duke Charitable Foundation (to D.A.)
Next-generation sequencing for this work was performed by the Broad Institute Sequencing Platform and genotyping was performed by the Broad Institute Genetic Analysis Platform. We acknowledge their excellence and collaboration on this study. Sequencing was supported in part by a grant from NHGRI and by the Broad Institute.
The authors thank Manny Rivas, Andrey Sivachenco, and Kiran Garimella for helpful discussions on sequencing and Benjamin Voight, Stephan Ripke, and Ron Do for helpful discussions on imputation.
Footnotes
Author Contributions Manuscript writing: J.S., V.A., A. A. P., D.A.
Clinical samples: S.K., L.G., D.A., The Myocardial Infarction Genetics Consortium
Next-generation sequencing data generation: C.G., N.P.B., S.G., Broad Institute Sequencing Platform
Sequencing analysis and variant calling: A.A.P, J.M., E.B., M.D., S.G., M.J.D., D.A.
Imputation and association analysis: J.S., V.A., M.J.D., D.A.
Genotyping and analysis: J.S., V.A, B.T., C.G., N.P.B., Broad Institute Genetic Analysis Platform
Competing Interests Statement No competing interests identified.
References
Full text links
Read article at publisher's site: https://doi.org/10.1038/ng.871
Read article for free, from open access legal sources, via Unpaywall: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4096898
Citations & impact
Impact metrics
Citations of article over time
Alternative metrics
Discover the attention surrounding your research
https://www.altmetric.com/details/102422025
Smart citations by scite.ai
Explore citation contexts and check if this article has been
supported or disputed.
https://scite.ai/reports/10.1038/ng.871
Article citations
Deletion of the Murine Ortholog of the Human 9p21.3 Locus Leads to Insulin Resistance and Obesity in Hypercholesterolemic Mice.
Cells, 13(11):983, 05 Jun 2024
Cited by: 0 articles | PMID: 38891115 | PMCID: PMC11171903
The hazards of genotype imputation when mapping disease susceptibility variants.
Genome Biol, 25(1):7, 03 Jan 2024
Cited by: 0 articles | PMID: 38172955 | PMCID: PMC10763476
Molecular insights into hereditary elliptocytosis and pyropoikilocytosis: NGS uncovers multiple potential candidate genes.
Ann Hematol, 102(9):2343-2351, 04 Jul 2023
Cited by: 0 articles | PMID: 37400730
The GEnetic Syntax Score: a genetic risk assessment implementation tool grading the complexity of coronary artery disease-rationale and design of the GESS study.
BMC Cardiovasc Disord, 21(1):284, 08 Jun 2021
Cited by: 9 articles | PMID: 34103005 | PMCID: PMC8186185
Dominant role of CDKN2B/p15INK4B of 9p21.3 tumor suppressor hub in inhibition of cell-cycle and glycolysis.
Nat Commun, 12(1):2047, 06 Apr 2021
Cited by: 22 articles | PMID: 33824349 | PMCID: PMC8024281
Go to all (61) article citations
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
SNPs (3)
- (4 citations) dbSNP - rs10811661
- (2 citations) dbSNP - rs10757282
- (1 citation) dbSNP - rs4977574
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
Six sequence variants on chromosome 9p21.3 are associated with a positive family history of myocardial infarction: a multicenter registry.
BMC Cardiovasc Disord, 11:9, 07 Mar 2011
Cited by: 28 articles | PMID: 21385355 | PMCID: PMC3061953
Common genetic variants on chromosome 9p21 are associated with myocardial infarction and type 2 diabetes in an Italian population.
BMC Med Genet, 11:60, 19 Apr 2010
Cited by: 37 articles | PMID: 20403154 | PMCID: PMC2871267
A common variant on chromosome 9p21 affects the risk of myocardial infarction.
Science, 316(5830):1491-1493, 03 May 2007
Cited by: 1058 articles | PMID: 17478679
[Genetic variants, cardiovascular risk and genome-wide association studies].
Rev Esp Cardiol, 64(6):509-514, 06 May 2011
Cited by: 7 articles | PMID: 21550161
Review
Type 2 diabetes and polymorphisms on chromosome 9p21: a meta-analysis.
Nutr Metab Cardiovasc Dis, 22(8):619-625, 05 Apr 2011
Cited by: 30 articles | PMID: 21315566
Review