The variant call format and VCFtools.

The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.

Availability

http://vcftools.sourceforge.net


Bioinformatics. 2011 Aug 1; 27(15): 2156–2158.

Published online 2011 Jun 7. https://doi.org/10.1093/bioinformatics/btr330

PMCID: PMC3137218

PMID: 21653522


Petr Danecek,¹,† Adam Auton,²,† Goncalo Abecasis,³ Cornelis A. Albers,¹ Eric Banks,⁴ Mark A. DePristo,⁴ Robert E. Handsaker,⁴ Gerton Lunter,² Gabor T. Marth,⁵ Stephen T. Sherry,⁶ Gilean McVean,²,⁷ Richard Durbin,¹,* and 1000 Genomes Project Analysis Group‡


¹Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, ²Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK, ³Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, ⁴Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02141, ⁵Department of Biology, Boston College, MA 02467, ⁶National Institutes of Health National Center for Biotechnology Information, MD 20894, USA and ⁷Department of Statistics, University of Oxford, Oxford OX1 3TG, UK



Contact: rd@sanger.ac.uk

1 INTRODUCTION

One of the main uses of next-generation sequencing is to discover variation among large populations of related samples. Recently, a format for storing next-generation read alignments has been standardized by the SAM/BAM file format specification (Li et al., 2009). This has significantly improved the interoperability of next-generation tools for alignment, visualization and variant calling. We propose the variant call format (VCF) as a standardized format for storing the most prevalent types of sequence variation, including SNPs, indels and larger structural variants, together with rich annotations. The format was developed primarily to represent human genetic variation, but it is not restricted to diploid genomes and can be used in other contexts as well. Its flexibility and user extensibility allow representation of a wide variety of genomic variation with respect to a single reference sequence.

Although the generic feature format (GFF) has recently been extended to standardize storage of variant information in the genome variant format (GVF) (Reese et al., 2010), GVF is not tailored for storing information across many samples. We have designed VCF to be scalable so as to encompass millions of sites with genotype data and annotations from thousands of samples. We have adopted a textual encoding, with complementary indexing, to allow easy generation of the files while maintaining fast data access. In this article, we present an overview of the VCF and briefly introduce the companion VCFtools software package. A detailed format specification and the complete documentation of VCFtools are available at the VCFtools web site.

2 METHODS

2.1 The VCF

2.1.1 Overview of the VCF

A VCF file (Fig. 1a) consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with the characters ‘##’, and a TAB-delimited field definition line, starting with a single ‘#’ character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can also be used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file. The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma-separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon-separated list of additional, user-extensible annotations (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column defines the information contained within each subsequent genotype column, which consists of a colon-separated list of fields. For example, the FORMAT field GT:GQ:DP in the fourth data entry of Figure 1a indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.
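
The layout described above can be illustrated with a minimal Python sketch that splits VCF text into meta-information lines, sample IDs and per-site records. It is not part of VCFtools: the function name parse_vcf, the constant names and the example record are hypothetical, and the example data are made up for illustration rather than taken from Figure 1a.

```python
from typing import Dict, List, Tuple

# The eight mandatory columns named by the field definition line.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Hypothetical example data (not the content of Figure 1a).
EXAMPLE = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2",
    "1\t100\t.\tA\tG,T\t50\tPASS\tAN=4;AC=1,2\tGT:GQ:DP\t0|1:40:12\t2/2:35:9",
])


def parse_vcf(text: str) -> Tuple[List[str], List[str], List[Dict[str, object]]]:
    """Split VCF text into meta lines, sample IDs and per-site records."""
    meta: List[str] = []
    samples: List[str] = []
    records: List[Dict[str, object]] = []
    columns = None
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line: keep the text after the '##' prefix.
            meta.append(line[2:])
        elif line.startswith("#"):
            # Field definition line: eight fixed columns, then FORMAT and samples.
            columns = line[1:].split("\t")
            if columns[:8] != FIXED_COLUMNS:
                raise ValueError("unexpected field definition line")
            samples = columns[9:]
        else:
            if columns is None:
                raise ValueError("data line found before the field definition line")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise ValueError("data line does not match the header line")
            record: Dict[str, object] = dict(zip(FIXED_COLUMNS, fields[:8]))
            record["POS"] = int(fields[1])      # 1-based start position
            record["ALT"] = fields[4].split(",")  # comma-separated alternates
            # FORMAT names the colon-separated fields of every genotype column.
            keys = fields[8].split(":") if samples else []
            record["samples"] = {
                name: dict(zip(keys, value.split(":")))
                for name, value in zip(samples, fields[9:])
            }
            records.append(record)
    return meta, samples, records
```

Calling parse_vcf(EXAMPLE) returns the single meta line, the two sample IDs and one record whose "samples" entry maps each sample ID to its GT, GQ and DP values. The sketch keeps INFO as a raw string and does no validation beyond the column count.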

Fig. 1.

(a) Example of a valid VCF. The header lines ##fileformat and #CHROM are mandatory; the rest are optional but strongly recommended. Each line of the body describes variants present in the sampled population at one genomic position or region. All alternate alleles are listed in the ALT column and referenced from the genotype fields as 1-based indexes to this list; the reference haplotype is designated as 0. For multiploid data, the separator indicates whether the data are phased (|) or unphased (/). Thus, the two alleles C and G at the positions 2 and 5 in this figure occur on the same chromosome in SAMPLE1. The first data line shows an example of a deletion (present in SAMPLE1) and a replacement of two bases by another base (SAMPLE2); the second line shows a SNP and an insertion; the third a SNP; the fourth a large structural variant described by the annotation in the INFO column, where the coordinate is that of the base before the variant. (b–f) Alignments and VCF representations of different sequence variants: SNP, insertion, deletion, replacement and a large deletion. The REF column shows the reference bases replaced by the haplotype in the ALT column. The coordinate refers to the first reference base. (g) Users are advised to use the simplest representation possible and the lowest coordinate in cases where the position is ambiguous.

2.1.2 Conventions and reserved keywords

The VCF specification includes several common keywords with standardized meaning. The following list gives some examples of the reserved tags.

Genotype columns:

GT, genotype, encodes alleles as numbers: 0 for the reference allele, 1 for the first allele listed in the ALT column, 2 for the second allele listed in ALT and so on. The number of alleles indicates the ploidy of the sample and the separator indicates whether the alleles are phased (‘|’) or unphased (‘/’) with respect to other data lines (Fig. 1).
PS, phase set, indicates that the alleles of genotypes with the same PS value are listed in the same order.
DP, read depth at this position.
GL, genotype likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
GQ, genotype quality, probability that the genotype call is wrong given that the site is variant. Note that the QUAL column gives an overall quality score for the assertion made in ALT that the site is or is not variant.
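
As a hedged illustration of the GT encoding, the following Python sketch maps a GT string back to allele sequences and reports whether the call is phased. A second helper converts a phred-scaled score such as QUAL to an error probability using the standard relation Q = -10 log10(P). The function names are hypothetical and are not taken from VCFtools.

```python
import re
from typing import List, Optional, Tuple


def decode_genotype(gt: str, ref: str, alts: List[str]) -> Tuple[List[Optional[str]], bool]:
    """Translate a GT string such as '0|1' into alleles and a phased flag."""
    alleles = [ref] + alts  # index 0 is REF, 1..n follow the ALT order
    phased = "|" in gt
    # A dot marks a missing allele; the number of indexes reflects ploidy.
    calls = [None if idx == "." else alleles[int(idx)]
             for idx in re.split(r"[|/]", gt)]
    return calls, phased


def phred_to_error_probability(q: float) -> float:
    """Convert a phred-scaled quality to the probability of an error."""
    return 10 ** (-q / 10)
```

For a site with REF A and ALT G,T, decode_genotype("0|1", "A", ["G", "T"]) yields the alleles A and G with the phased flag set, and a phred score of 30 corresponds to an error probability of about 0.001.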

INFO column:

DB, dbSNP membership;
H3, membership in HapMap3;
VALIDATED, validated by follow-up experiment;
AN, total number of alleles in called genotypes;
AC, allele count in genotypes, for each ALT allele, in the same order as listed;
SVTYPE, type of structural variant (DEL for deletion, DUP for duplication, INV for inversion, etc. as described in the specification);
END, end position of the variant;
IMPRECISE, indicates that the position of the variant is not known accurately; and
CIPOS/CIEND, confidence interval around POS and END positions for imprecise variants.

Missing values are represented with a dot. For practical reasons, the VCF specification requires that the data lines appear in their chromosomal order. The full format specification is available at the VCFtools web site.
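
The reserved INFO tags can be read with a few lines of code. The sketch below assumes the key=value syntax and value-less flag tags (such as DB) defined in the full specification. It then derives per-allele frequencies from AC and AN as AC/AN. The function names parse_info and allele_frequencies are hypothetical.

```python
from typing import Dict, List


def parse_info(info: str) -> Dict[str, object]:
    """Parse a semicolon-separated INFO string into a dictionary."""
    result: Dict[str, object] = {}
    if info == ".":  # a dot represents a missing value
        return result
    for entry in info.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            result[key] = value.split(",")  # values are kept as string lists
        else:
            result[entry] = True  # flag tag such as DB or IMPRECISE
    return result


def allele_frequencies(info: Dict[str, object]) -> List[float]:
    """Frequency of each ALT allele, in ALT order, computed as AC / AN."""
    an = int(info["AN"][0])
    return [int(ac) / an for ac in info["AC"]]
```

For example, the INFO string DB;AN=4;AC=1,2 parses to a flag DB plus AN and AC lists, and the two ALT alleles have frequencies 0.25 and 0.5 among the called genotypes.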

2.1.3 Variation types

VCF is flexible and allows virtually any type of variation to be expressed by listing both the reference haplotype (the REF column) and the alternate haplotypes (the ALT column). This permits redundancy such that the same event can be expressed in multiple ways by including different numbers of reference bases or by combining two adjacent SNPs into one haplotype (Fig. 1g). Users are advised to follow recommended practice whenever possible: one reference base for SNPs and insertions, and one alternate base for deletions. The lowest possible coordinate should be used in cases where the position is ambiguous. When comparing or merging indel variants, the variant haplotypes should be reconstructed and reconciled, such as in the Figure 1g example, although the exact nature of the reconciliation can be arbitrary. For larger, more complex variants, quoting large sequences becomes impractical, and in these cases the annotations in the INFO column can be used to describe the variant (Fig. 1f). The full VCF specification also includes a set of recommended practices for describing complex variants.
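
One way to approach the recommended minimal representation is to trim bases shared by REF and ALT. The Python sketch below is an assumption-laden illustration, not the reconciliation procedure used by VCFtools: it handles a single ALT allele, does not split adjacent SNPs and does not shift a variant to the lowest coordinate, since that requires the reference sequence. The function name trim_alleles is hypothetical.

```python
from typing import Tuple


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove redundant shared bases, keeping at least one base per allele."""
    # Trim shared trailing bases first; the position is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, advancing the 1-based position.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

With this sketch, an insertion written as REF GTC and ALT GTCT at position 100 becomes REF C and ALT CT at position 102 (one reference base), and a deletion written as REF GTCA and ALT GA becomes REF GTC and ALT G at position 100 (one alternate base).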

2.1.4 Compression and indexing

Given the large number of variant sites in the human genome and the number of individuals the 1000 Genomes Project aims to sequence (Durbin et al., 2010), VCF files are usually stored in a compact binary form, compressed by bgzip, a program which utilizes the zlib-compatible BGZF library (Li et al., 2009). Files compressed by bgzip can be decompressed by the standard gunzip and zcat utilities. Fast random access can be achieved by indexing by genomic position using tabix, a generic indexer for TAB-delimited files. Both programs, bgzip and tabix, are part of the SAMtools software package and can be downloaded from the SAMtools web site (http://samtools.sourceforge.net).
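
Because bgzip output can be decompressed by standard gzip tools, a compressed VCF can be streamed with Python's built-in gzip module, as in the minimal sketch below. The function names are hypothetical, and fetch_region is only a linear-scan stand-in: it reads the whole file and does not use a tabix index, which is exactly the cost that indexing avoids.

```python
import gzip
from typing import Iterator, List


def iter_data_lines(path: str) -> Iterator[List[str]]:
    """Yield the TAB-delimited fields of each data line in a gzip/bgzip VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # skip meta and field definition lines
                continue
            yield line.rstrip("\n").split("\t")


def fetch_region(path: str, chrom: str, start: int, end: int) -> List[List[str]]:
    """Return data lines whose POS lies in [start, end] on one chromosome.

    Linear scan for illustration only; an indexed query would seek directly
    to the relevant compressed blocks instead of reading every line.
    """
    return [fields for fields in iter_data_lines(path)
            if fields[0] == chrom and start <= int(fields[1]) <= end]
```

The match here is on POS alone, so a long variant that starts before the requested interval but extends into it would be missed by this sketch.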

2.2 VCFtools software package

VCFtools is an open-source software package for parsing, analyzing and manipulating VCF files. The software suite is broadly split into two modules. The first module provides a general Perl API, and allows various operations to be performed on VCF files, including format validation, merging, comparing, intersecting, making complements and basic overall statistics. The second module consists of a C++ executable primarily used to analyze SNP data in VCF format, allowing the user to estimate allele frequencies, levels of linkage disequilibrium and various quality control metrics. Further details of VCFtools can be found on the web site (http://vcftools.sourceforge.net/), where the reader can also find links to alternative tools for VCF generation and manipulation, such as the GATK toolkit (McKenna et al., 2010).
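
To make the notion of estimating allele frequencies from genotype data concrete, the following Python sketch counts allele indexes across the GT fields of one site. It is a simplified illustration under our own assumptions and does not reproduce the VCFtools implementation; the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List


def allele_counts(genotypes: List[str]) -> Counter:
    """Count allele indexes over GT strings, ignoring missing alleles."""
    counts: Counter = Counter()
    for gt in genotypes:
        for idx in re.split(r"[|/]", gt):
            if idx != ".":
                counts[int(idx)] += 1
    return counts


def estimate_allele_frequencies(genotypes: List[str], n_alleles: int) -> List[float]:
    """Frequency of each allele (index 0 is REF) among called alleles."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called alleles at this site")
    return [counts[i] / total for i in range(n_alleles)]
```

For the four genotypes 0|1, 1/1, ./. and 0/0 at a biallelic site, six alleles are called, three of each kind, so both frequencies are 0.5.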

3 CONCLUSIONS

We describe a generic format for storing the most prevalent types of sequence variation. The format is highly flexible, and can be adapted to store a wide variety of information. It has already been adopted by a number of large-scale projects, and is supported by an increasing number of software tools.

Funding: Medical Research Council, UK; British Heart Foundation (grant RG/09/012/28096); Wellcome Trust (grants 090532/Z/09/Z and 075491/Z/04); National Human Genome Research Institute (grants 54 HG003067, R01 HG004719 and U01 HG005208); Intramural Research Program of the National Institutes of Health, the National Library of Medicine.

Conflict of Interest: none declared.

REFERENCES

Durbin R.M., et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
Li H., et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
McKenna A.H., et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303.
Reese M.G., et al. A standard variation file format for human genome sequences. Genome Biol. 2010;11:20796305.
