Bioinformatics. 2011 Aug 15; 27(16): 2304–2305.
PMCID: PMC3150040
PMID: 21653516

HAPGEN2: simulation of multiple disease SNPs

Abstract

Motivation: Performing experiments with simulated data is an inexpensive approach to evaluating competing experimental designs and analysis methods in genome-wide association studies. Simulation based on resampling known haplotypes is fast and efficient and can produce samples with patterns of linkage disequilibrium (LD), which mimic those in real data. However, the inability of current methods to simulate multiple nearby disease SNPs on the same chromosome can limit their application.

Results: We introduce a new simulation algorithm based on a successful resampling method, HAPGEN, that can simulate multiple nearby disease SNPs on the same chromosome. The new method, HAPGEN2, retains many advantages of resampling methods and expands the range of disease models that current simulators offer.

Availability: HAPGEN2 is freely available from http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html.

Contact: zhan@well.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Genome-wide association studies have become a powerful approach for uncovering the genetic variants that impact human phenotypes. Simulation studies are a popular and inexpensive approach to evaluate new methods for statistical analysis (Su et al., 2009) and to examine the power of different experimental designs (Spencer et al., 2009).

The traditional approaches of simulating a population forwards (Lambert et al., 2008) or backwards (Hudson, 2002) in time ignore the large amount of observed genetic data that are available, can be computationally intensive and can struggle to match real LD patterns. To overcome these problems, Spencer et al. (2009) introduced HAPGEN, which instead resamples known haplotypes. Given a reference panel of haplotypes, this method produces a sample of haplotypes with patterns of LD similar to those in the reference panel. Using the HapMap3 and 1000G haplotype data as reference panels, HAPGEN is able to simulate data for many populations. In addition, it is fast and can simulate a single disease SNP under a general disease model, allowing the user to specify the risk allele and the heterozygote and homozygote relative risks. Other resampling methods also exist (Li and Li, 2008; Wright et al., 2007), but they and HAPGEN can only simulate a single disease SNP on the same haplotype. Many complex diseases have multiple associated loci on the same chromosome, some of them in close proximity (e.g. Strange et al., 2010), so the ability to simulate multiple disease SNPs on the same chromosome would be desirable. To address this issue, we have devised a new approach, extending HAPGEN, to simulate multiple nearby disease SNPs on the same chromosome.

2 METHODS

The HAPGEN2 simulation approach is similar to that of HAPGEN and is based on the Li and Stephens (LS) model (Li and Stephens, 2003) of LD. Briefly, the input is a reference panel of haplotypes, $H_R=\{h_1,\dots,h_r\}$, where each haplotype is typed at $L$ biallelic sites, that is, $h_i=(h_{(i,1)},\dots,h_{(i,L)})$ and $h_{(i,j)}\in\{0,1\}$. The LS model treats each newly simulated haplotype as an imperfect mosaic of the haplotypes in $H_R$ and the haplotypes that have already been simulated (see below for more details). Simulation of case–control data is based on a set of disease SNPs, $D=\{d_k: d_k\in\{1,\dots,L\},\ k=1,\dots,K\}$, with effect sizes $RR=\{(rr_{1k}, rr_{2k})\}$, where $rr_{1k}$ and $rr_{2k}$ are the disease risks of carrying one and two copies of the 1 allele relative to carrying two copies of the 0 allele at $d_k$; these relative risks combine multiplicatively across the $K$ disease SNPs. The haplotypes for the control individuals, $H_P=\{h_{r+1},\dots,h_p\}$, are simulated first, followed by the haplotypes for the case individuals, $H_Q=\{h_{p+1},\dots,h_q\}$.
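As an illustration of the multiplicative disease model described above, the following Python sketch computes the relative risk of a multi-SNP genotype against the baseline of carrying two copies of the 0 allele at every disease SNP. The function name and argument layout are hypothetical and are not part of the HAPGEN2 software.

```python
from typing import Sequence, Tuple


def combined_relative_risk(
    genotypes: Sequence[int],
    rel_risks: Sequence[Tuple[float, float]],
) -> float:
    """Relative risk of a genotype across K disease SNPs.

    genotypes[k] is the number of copies (0, 1 or 2) of the 1 allele at
    disease SNP d_k; rel_risks[k] is the pair (rr1_k, rr2_k). Per-SNP
    relative risks are multiplied together, as stated in the text.
    """
    if len(genotypes) != len(rel_risks):
        raise ValueError("need one (rr1, rr2) pair per disease SNP")
    risk = 1.0
    for g, (rr1, rr2) in zip(genotypes, rel_risks):
        if g not in (0, 1, 2):
            raise ValueError("genotypes must be 0, 1 or 2")
        # Index 0 is the baseline genotype, which has relative risk 1.
        risk *= (1.0, rr1, rr2)[g]
    return risk
```

For example, `combined_relative_risk([1, 0, 2], [(1.3, 1.69)] * 3)` multiplies 1.3 by 1.69, giving approximately 2.197.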

2.1 Simulating control data

We simulate the control data as population controls (so that some of them may be cases) and simulate each additional haplotype, $h_{i+1}\in H_P$, sequentially under the LS model. We use the copying states, $z_{(i+1,j)}\in\{1,\dots,i\}$, which evolve in a Markov manner, to indicate the haplotype that $h_{(i+1,j)}$ copies at site $j$. We simulate each haplotype in three stages. First, the cross-over events, which are locations where $z_{(i+1,j)}\neq z_{(i+1,j-1)}$, are simulated according to the transition probabilities

[Equation image not reproduced] (1)

where $I_z$ is 1 if $z=z_{(i+1,j-1)}$ and 0 otherwise, and $\rho_j$ is the genetic distance between SNPs $(j-1)$ and $j$. Conceptually, the cross-over events mimic the effect of recombination and break up $h_{i+1}$ into independent segments, $\{h_{(i+1,s_1)},\dots,h_{(i+1,s_n)}\}$, where each segment is a haplotype of SNPs between two cross-over events. Second, the copying state for each segment is sampled uniformly from $\{1,\dots,i\}$. Finally, the allele at each SNP is simulated conditional on the copying state and a mutation parameter $\mu_i$:

[Equation image not reproduced] (2)

Spencer et al. (2009) found that [inline formula image btr341i1.jpg], where [inline formula image btr341i2.jpg], simulated amounts of novel haplotype variation similar to data simulated under the coalescent model.
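The three stages above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HAPGEN2 implementation. Equation (1) is assumed to have the usual Li and Stephens form, in which a candidate cross-over between SNPs j−1 and j occurs with probability 1 − exp(−ρ_j/i) and the copying state is then redrawn uniformly. Equation (2) is assumed to flip the copied allele with probability μ, and μ is held fixed rather than depending on i. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def simulate_haplotype(
    pool: Sequence[Sequence[int]],
    rho: Sequence[float],
    mu: float,
    rng: random.Random,
) -> List[int]:
    """Simulate one haplotype as an imperfect mosaic of the haplotypes in pool.

    rho[j] is the genetic distance between SNPs j-1 and j (rho[0] is unused);
    mu is the per-site probability of not copying the allele faithfully.
    """
    i = len(pool)
    if i == 0:
        raise ValueError("pool must contain at least one haplotype")
    n_sites = len(pool[0])
    z = rng.randrange(i)  # copying state for the first segment
    new_hap = []
    for j in range(n_sites):
        # Assumed LS-style transition: candidate cross-over, then uniform redraw.
        if j > 0 and rng.random() < 1.0 - math.exp(-rho[j] / i):
            z = rng.randrange(i)
        allele = pool[z][j]
        # Assumed mutation step: flip the copied allele with probability mu.
        if rng.random() < mu:
            allele = 1 - allele
        new_hap.append(allele)
    return new_hap


def simulate_controls(
    reference: Sequence[Sequence[int]],
    n_new: int,
    rho: Sequence[float],
    mu: float,
    seed: int = 0,
) -> List[List[int]]:
    """Simulate haplotypes sequentially; each may copy from earlier ones."""
    rng = random.Random(seed)
    pool = [list(h) for h in reference]
    for _ in range(n_new):
        pool.append(simulate_haplotype(pool, rho, mu, rng))
    return pool[len(reference):]
```

Appending each simulated haplotype to the pool reflects the sequential scheme in the text, where a new haplotype can copy from both the reference panel and the haplotypes already simulated.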

2.2 Simulating case data

We simulate the case haplotypes in a similar way, but we simulate them sequentially in pairs (with each pair corresponding to a case individual) and oversample haplotypes carrying the risk alleles based on the relative risks.

Simulation of each haplotype pair, $(h_{i+1},h_{i+2})\in H_Q$, proceeds in four stages. First, the cross-over events are simulated in the same way as for the controls, according to (1). Second, the alleles at the disease SNPs are simulated. Let $(h^1_D,h^2_D)$ be the subset of $(h_{i+1},h_{i+2})$ that consists of the alleles at the disease SNPs, so that $h^j_D=(h_{(i+j,d_1)},\dots,h_{(i+j,d_K)})$ for $j=1,2$. The cross-over events separate $h^1_D$ and $h^2_D$ into segments, $\{h^1_{s_{11}},\dots,h^1_{s_{1n_1}}\}$ and $\{h^2_{s_{21}},\dots,h^2_{s_{2n_2}}\}$. We simulate $(h^1_D, h^2_D)$ from its joint distribution, which is calculated from the relative risks and the marginal frequencies of each segment in $H_P$ and $H_R$, using Bayes' theorem:

[Equation image not reproduced]

where $g_{d_k}=h^1_{d_k}+h^2_{d_k}$ is the genotype at $d_k$, and $p(h_s)$ is the frequency of the haplotype segment $h_s$ in $H_R$ and $H_P$. Third, the copying state for each segment, $h_{(i+1,s)}$, is simulated independently. If $s$ does not include any disease SNPs, it is drawn uniformly from $\{1,\dots,i\}$, as for the controls; otherwise it is drawn from

[Equation image not reproduced]

where $I_{d_k}$ is 1 if $h_{(i+1,d_k)}=h_{(z,d_k)}$ and 0 otherwise. Finally, each allele for $h_{(i+1,s)}$ is simulated according to (2). Copying states and alleles for $h_{i+2}$ are simulated in the same way.
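The second stage, drawing the disease-SNP alleles of a case from their joint distribution, can be sketched in Python as follows. This is a simplified illustration and not the HAPGEN2 code: it assumes that all disease SNPs fall in a single segment on each haplotype (no cross-over between them), so that the segment frequency is simply the frequency of the full disease-SNP haplotype among the haplotypes available for copying. The weight of each haplotype pair is the product of the two haplotype frequencies and the multiplicative relative risk of the resulting genotype, normalised over all pairs, which is the Bayes' theorem calculation described in the text. Function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from typing import Dict, Sequence, Tuple

Hap = Tuple[int, ...]


def case_disease_allele_distribution(
    pool_disease_alleles: Sequence[Hap],
    rel_risks: Sequence[Tuple[float, float]],
) -> Dict[Tuple[Hap, Hap], float]:
    """Joint distribution of (h1_D, h2_D) for a case individual.

    pool_disease_alleles holds, for each available haplotype, its alleles
    at the K disease SNPs; rel_risks[k] is (rr1_k, rr2_k).
    """
    n = len(pool_disease_alleles)
    freq = {h: c / n for h, c in Counter(pool_disease_alleles).items()}
    weights = {}
    for h1, h2 in product(freq, repeat=2):
        w = freq[h1] * freq[h2]  # frequency of the pair, ignoring disease status
        for a1, a2, (rr1, rr2) in zip(h1, h2, rel_risks):
            w *= (1.0, rr1, rr2)[a1 + a2]  # relative risk of genotype g = a1 + a2
        weights[(h1, h2)] = w
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


def sample_case_disease_alleles(
    dist: Dict[Tuple[Hap, Hap], float],
    rng: random.Random,
) -> Tuple[Hap, Hap]:
    """Draw one (h1_D, h2_D) pair from the joint distribution."""
    pairs = list(dist)
    return rng.choices(pairs, weights=[dist[p] for p in pairs], k=1)[0]
```

Because pairs carrying risk alleles receive larger weights, haplotypes with the risk alleles are oversampled in cases relative to their frequency in the pool.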

3 RESULTS

To demonstrate HAPGEN2, we simulated 2000 cases and 2000 controls at 880 SNPs across a 700 kb region on chromosome 21, using HapMap2 CEU as the reference panel, with three disease SNPs at positions $d_1$=25 356 790, $d_2$=25 390 071 and $d_3$=25 691 378, each under a log-additive disease model with a heterozygote relative risk of 1.3. The simulation took <10 s on a laptop with a 2.93 GHz processor, and the running time increases linearly with the number of SNPs and individuals.
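As a worked illustration of these settings: under a log-additive model, each additional copy of the risk allele multiplies the risk by the same factor, so a heterozygote relative risk of 1.3 implies a homozygote relative risk of 1.3 squared. Combining the three disease SNPs multiplicatively, as described in Section 2, then gives the following relative risks for two example genotypes (the choice of example genotypes is ours).

```latex
\begin{align*}
rr_{2k} &= rr_{1k}^{2} = 1.3^{2} = 1.69, \qquad k = 1, 2, 3,\\
\mathrm{RR}(g_{d_1}=1,\ g_{d_2}=1,\ g_{d_3}=1) &= 1.3^{3} = 2.197,\\
\mathrm{RR}(g_{d_1}=2,\ g_{d_2}=2,\ g_{d_3}=2) &= 1.69^{3} \approx 4.827.
\end{align*}
```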

Figure 1, produced by HAPLOVIEW (Barrett et al., 2005), shows the similarity between the LD patterns of the reference panel (top) and the simulated haplotypes (bottom). The top plot in Figure 2 shows the $-\log_{10}$(P-values), for the log-additive test, across the region, illustrating the signal of association at the disease SNPs; subsequent plots show the P-values conditioned on the genotypes at $d_1$; at $d_1$ and $d_2$; and at $d_1$, $d_2$ and $d_3$, respectively, confirming that there are indeed three independent disease SNPs.

Figure 1. LD patterns, in terms of $r^2$, in the HapMap reference haplotypes (top) and the simulated haplotypes (bottom).

Figure 2. The top plot shows the $-\log_{10}$(P-values) under the log-additive test at each SNP in the simulated data. The locations of the disease SNPs, $d_1$, $d_2$ and $d_3$, are indicated (from left to right) by the vertical lines. Subsequent plots (from the top) show the P-values conditioned on the genotypes at $d_1$; at $d_1$ and $d_2$; and at $d_1$, $d_2$ and $d_3$.

4 DISCUSSION

We have introduced a new resampling method that can simulate multiple disease SNPs on the same haplotype, which will be particularly useful for investigating disease models involving multiple disease SNPs within close proximity. HAPGEN2 is fast, simple to use and available as a C++ package from http://www.stats.ox.ac.uk/~marchini/software/gwas/gwas.html, along with instructions and supporting resources, such as recombination rates, HapMap and 1000G reference panels.

The model described here can be easily extended to simulate interacting disease SNPs (we currently provide an R package that does this) and admixture (using reference panels from multiple populations), which we hope to implement in the future.

Funding: Wellcome Trust grants 084575/Z/08/Z and 075491/Z/04/B. P.D. was supported in part by a Wolfson Royal Society Merit Award. J.M. was supported by United Kingdom Medical Research Council grant number G0801823.

Conflict of Interest: none declared.

REFERENCES

  • Barrett J.C., et al. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21:263–265.
  • Hudson R.R. Generating samples under a Wright-Fisher neutral model. Bioinformatics. 2002;18:337–338.
  • Lambert B.W., et al. ForSim: a tool for exploring the genetic architecture of complex traits with controlled truth. Bioinformatics. 2008;24:1821–1822.
  • Li C., Li M. GWAsimulator: a rapid whole-genome simulation program. Bioinformatics. 2008;24:140–142.
  • Li N., Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165:2213–2233.
  • Spencer C.C.A., et al. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 2009;5:e1000477.
  • Strange A., et al. A genome-wide association study identifies new psoriasis susceptibility loci and an interaction between HLA-C and ERAP1. Nat. Genet. 2010;42:985–990.
  • Su Z., et al. A Bayesian method for detecting and characterizing allelic heterogeneity and boosting signals in genome-wide association studies. Stat. Sci. 2009;24:430–450.
  • Wright F.A., et al. Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics. 2007;23:2581–2588.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
