The Sequence Alignment/Map format and SAMtools.

Li H; Handsaker B; Wysoker A; Fennell T; Ruan J; Homer N; Marth G; Abecasis G; Durbin R

doi:10.1093/bioinformatics/btp352

Abstract

Summary

The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.

Availability

http://samtools.sourceforge.net.

Bioinformatics. 2009 Aug 15; 25(16): 2078–2079.

Published online 2009 Jun 8. https://doi.org/10.1093/bioinformatics/btp352

PMCID: PMC2723002

PMID: 19505943

Heng Li,¹,† Bob Handsaker,²,† Alec Wysoker,² Tim Fennell,² Jue Ruan,³ Nils Homer,⁴ Gabor Marth,⁵ Goncalo Abecasis,⁶ Richard Durbin,¹,* and 1000 Genome Project Data Processing Subgroup⁷

¹ Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, ²Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, ³Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, ⁴Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, ⁵Department of Biology, Boston College, Chestnut Hill, MA 02467, ⁶Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and ⁷http://1000genomes.org

Contact: rd@sanger.ac.uk

1 INTRODUCTION

With the advent of novel sequencing technologies such as Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, 2008), a variety of new alignment tools (Langmead et al., 2009; Li et al., 2008) have been designed to realize efficient read mapping against large reference sequences, including the human genome. These tools generate alignments in different formats, however, complicating downstream processing. A common alignment format that supports all sequence types and aligners creates a well-defined interface between alignment and downstream analyses, including variant detection, genotyping and assembly.

The Sequence Alignment/Map (SAM) format is designed to achieve this goal. It supports single- and paired-end reads and combining reads of different types, including color space reads from AB/SOLiD. It is designed to scale to alignment sets of 10¹¹ or more base pairs, which is typical for the deep resequencing of one human individual.

In this article, we present an overview of the SAM format and briefly introduce the companion SAMtools software package. A detailed format specification and the complete documentation of SAMtools are available at the SAMtools web site.

2 METHODS

2.1 The SAM format

2.1.1 Overview of the SAM format

The SAM format consists of one header section and one alignment section. The lines in the header section start with character ‘@’, and lines in the alignment section do not. All lines are TAB delimited. An example is shown in Figure 1b.
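
As a minimal sketch of this layout, the Python snippet below separates SAM text into header lines (those starting with '@') and TAB-delimited alignment records. The function name `split_sam` and the two-line example are illustrative assumptions and are not part of SAMtools.

```python
def split_sam(text):
    """Split SAM text into header lines and TAB-delimited alignment records."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)  # header section: lines start with '@'
        else:
            alignments.append(line.split("\t"))  # alignment section
    return header, alignments


# Hypothetical example: one @SQ header line and one alignment line.
example = (
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*\n"
)
header, alignments = split_sam(example)
print(len(header), "header line;", len(alignments), "alignment line")
```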

Fig. 1.

Example of extended CIGAR and the pileup output. (a) Alignments of one pair of reads and three single-end reads. (b) The corresponding SAM file. The ‘@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1 + 2 + 32 + 128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. (c) Simplified pileup output by SAMtools. Each line consists of reference name, sorted coordinate, reference base, the number of reads covering the position and read bases. In the fifth field, a dot or a comma denotes a base identical to the reference; a dot or a capital letter denotes a base from a read mapped on the forward strand, while a comma or a lowercase letter on the reverse strand.
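
The FLAG arithmetic in the caption can be checked with a short Python sketch. The caption itself only names the bits 1, 2, 32 and 128; the meanings given here for bits 4, 8, 16 and 64 are taken from the SAM format specification rather than from this article, and the names `FLAG_BITS` and `decode_flag` are hypothetical.

```python
# Bit values and their meanings. Bits 1, 2, 32 and 128 are described in the
# Figure 1 caption; the others are assumed from the SAM format specification.
FLAG_BITS = {
    1: "read is paired",
    2: "mapped in a proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first read in pair",
    128: "second read in pair",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(163))
```

For FLAG 163 this lists the four properties named in the caption: paired, properly paired, mate on the reverse strand, and second read in the pair.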

In SAM, each alignment line has 11 mandatory fields and a variable number of optional fields. The mandatory fields are briefly described in Table 1. They must be present, but their values can be ‘*’ or zero (depending on the field) if the corresponding information is unavailable. The optional fields are presented as key-value pairs in the format of TAG:TYPE:VALUE. They store extra information from the platform or aligner. For example, the ‘RG’ tag keeps the ‘read group’ information for each read. In combination with the ‘@RG’ header lines, this tag allows each read to be labeled with metadata about its origin, sequencing center and library. The SAM format specification gives a detailed description of each field and the predefined TAGs.

Table 1.

Mandatory fields in the SAM format

No.	Name	Description
1	`QNAME`	Query NAME of the read or the read pair
2	`FLAG`	Bitwise FLAG (pairing, strand, mate strand, etc.)
3	`RNAME`	Reference sequence NAME
4	`POS`	1-Based leftmost POSition of clipped alignment
5	`MAPQ`	MAPping Quality (Phred-scaled)
6	`CIGAR`	Extended CIGAR string (operations: `MIDNSHP`)
7	`MRNM`	Mate Reference NaMe (‘=’ if same as `RNAME`)
8	`MPOS`	1-Based leftmost Mate POSition
9	`ISIZE`	Inferred Insert SIZE
10	`SEQ`	Query SEQuence on the same strand as the reference
11	`QUAL`	Query QUALity (ASCII-33=Phred base quality)
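
To make the field layout concrete, the following Python sketch parses one alignment line into the 11 mandatory fields of Table 1 plus a dictionary of optional TAG:TYPE:VALUE fields, and decodes a QUAL string using the ASCII-33 convention. The names `parse_alignment` and `phred_qualities` and the example line are illustrative assumptions; treating TYPE `i` as an integer follows the SAM format specification, not this article.

```python
from collections import namedtuple

FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}
Alignment = namedtuple("Alignment", FIELDS + ["tags"])


def parse_alignment(line):
    """Parse one TAB-delimited SAM alignment line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(FIELDS):
        raise ValueError("expected at least 11 TAB-delimited fields")
    values = [int(col) if name in INT_FIELDS else col
              for name, col in zip(FIELDS, cols)]
    tags = {}
    for opt in cols[len(FIELDS):]:
        tag, typ, value = opt.split(":", 2)  # TAG:TYPE:VALUE
        # Assumption: TYPE 'i' marks an integer; other types stay as strings.
        tags[tag] = int(value) if typ == "i" else value
    return Alignment(*values, tags)


def phred_qualities(qual):
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    if qual == "*":  # quality unavailable
        return []
    return [ord(ch) - 33 for ch in qual]


# Hypothetical alignment line with one optional field.
rec = parse_alignment("r003\t16\tref\t29\t30\t6H5M\t*\t0\t0\tTAGGC\t*\tNM:i:0")
print(rec.QNAME, rec.POS, rec.CIGAR, rec.tags)
print(phred_qualities("II5"))
```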

2.1.2 Extended CIGAR

The standard CIGAR description of pairwise alignment defines three operations: ‘M’ for match/mismatch, ‘I’ for insertion compared with the reference and ‘D’ for deletion. The extended CIGAR proposed in SAM added four more operations: ‘N’ for skipped bases on the reference, ‘S’ for soft clipping, ‘H’ for hard clipping and ‘P’ for padding. These support splicing, clipping, multi-part and padded alignments. Figure 1 shows examples of CIGAR strings for different types of alignments.
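
The seven operations can be summarized by which of them advance along the read and which advance along the reference. The Python sketch below encodes that reading of the text (M advances both; I and S only the read; D and N only the reference; H and P neither) and is consistent with the SAM format specification. The function names and the example strings are illustrative assumptions.

```python
import re

CONSUMES_QUERY = set("MIS")  # operations whose bases appear in SEQ
CONSUMES_REF = set("MDN")    # operations that span reference positions


def parse_cigar(cigar):
    """Split an extended CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP])", cigar)]


def query_length(cigar):
    """Number of read bases expected in the SEQ field."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar):
    """Number of reference positions covered, starting at POS."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)


# Hypothetical CIGAR strings: padded, spliced and hard-clipped alignments.
for cigar in ("3S6M1P1I4M", "9M32N8M", "6H5M"):
    print(cigar, query_length(cigar), reference_span(cigar))
```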

2.1.3 Binary Alignment/Map format

To improve the performance, we designed a companion format Binary Alignment/Map (BAM), which is the binary representation of SAM and keeps exactly the same information as SAM. BAM is compressed by the BGZF library, a generic library developed by us to achieve fast random access in a zlib-compatible compressed file. An example alignment of 112 Gbp of Illumina GA data requires 116 GB of disk space (1.0 byte per input base), including sequences, base qualities and all the meta information generated by MAQ. Most of this space is used to store the base qualities.
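
The idea behind random access in a zlib-compatible compressed file can be illustrated with Python's standard `gzip` module: if data are compressed in independent blocks and the block offsets are recorded, a reader can decompress a single block without touching the rest, while the concatenated file remains readable by ordinary gzip tools. This is only a sketch of the principle; it is not the BGZF file layout, and `write_blocks` and `read_block` are hypothetical helper names.

```python
import gzip
import io


def write_blocks(chunks):
    """Compress each chunk as an independent gzip member.

    Returns the concatenated bytes and the start offset of every block.
    """
    out = io.BytesIO()
    offsets = []
    for chunk in chunks:
        offsets.append(out.tell())
        out.write(gzip.compress(chunk))
    return out.getvalue(), offsets


def read_block(data, offsets, i):
    """Decompress only block i, using the recorded offsets."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return gzip.decompress(data[start:end])


chunks = [b"block-0 " * 4, b"block-1 " * 4, b"block-2 " * 4]
data, offsets = write_blocks(chunks)

# Random access: fetch the middle block without decompressing the others.
print(read_block(data, offsets, 1))
# The whole file is still a valid multi-member gzip stream.
print(gzip.decompress(data) == b"".join(chunks))
```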

2.1.4 Sorting and indexing

A SAM/BAM file can be unsorted, but sorting by coordinate is used to streamline data processing and to avoid loading extra alignments into memory. A position-sorted BAM file can be indexed. We combine the UCSC binning scheme (Kent et al., 2002) and simple linear indexing to achieve fast random retrieval of alignments overlapping a specified chromosomal region. In most cases, only one seek call is needed to retrieve alignments in a region.
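
A hedged sketch of the binning part of this index follows. This article names the scheme but does not give its constants; the five-level hierarchy with 16 kbp bins at the finest level, and the two functions below, are assumed from the later SAM format specification. Coordinates are zero-based and half-open, `[beg, end)`.

```python
def reg2bin(beg, end):
    """Smallest bin that fully contains the interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins
    return 0                                       # whole-sequence bin


def reg2bins(beg, end):
    """All bins that may hold alignments overlapping [beg, end)."""
    end -= 1
    bins = [0]
    for shift, offset in ((26, 1), (23, 9), (20, 73), (17, 585), (14, 4681)):
        bins.extend(range(offset + (beg >> shift), offset + (end >> shift) + 1))
    return bins


print(reg2bin(0, 1), reg2bin(0, 16385))
print(reg2bins(0, 1))
```

An alignment is stored under the bin returned by `reg2bin`; a region query then only needs to inspect the bins returned by `reg2bins`, one per level for a small region.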

2.2 SAMtools software package

SAMtools is a library and software package for parsing and manipulating alignments in the SAM/BAM format. It is able to convert from other alignment formats, sort and merge alignments, remove PCR duplicates, generate per-position information in the pileup format (Fig. 1c), call SNPs and short indel variants, and show alignments in a text-based viewer. For the example alignment of 112 Gbp Illumina GA data, SAMtools took about 10 h to convert from the MAQ format and 40 min to index with <30 MB memory. Conversion is slower mainly because compression with zlib is slower than decompression. External sorting writes temporary BAM files and would typically be twice as slow as conversion.
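
The pileup representation described in the caption of Figure 1 can be imitated in a few lines of Python. This is a simplified sketch, not the SAMtools implementation: it emits symbols only for bases inside M operations, does not mark deletions, insertions or read boundaries, and the function name `pileup` and the toy reference and reads are hypothetical.

```python
import re
from collections import defaultdict


def pileup(ref_name, ref_seq, reads):
    """Build simplified pileup lines.

    reads: iterable of (pos, cigar, seq, is_reverse), with 1-based pos.
    Returns tuples (reference name, coordinate, reference base,
    number of covering reads, read bases), sorted by coordinate.
    """
    columns = defaultdict(list)
    for pos, cigar, seq, is_reverse in reads:
        ref_i, query_i = pos, 0
        for n, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
            n = int(n)
            if op == "M":
                for k in range(n):
                    base = seq[query_i + k].upper()
                    ref_base = ref_seq[ref_i + k - 1].upper()
                    if base == ref_base:
                        symbol = "," if is_reverse else "."
                    else:
                        symbol = base.lower() if is_reverse else base
                    columns[ref_i + k].append(symbol)
                ref_i += n
                query_i += n
            elif op in "IS":
                query_i += n  # bases present in the read only
            elif op in "DN":
                ref_i += n    # positions present on the reference only
            # H and P advance neither the read nor the reference
    return [(ref_name, p, ref_seq[p - 1], len(columns[p]), "".join(columns[p]))
            for p in sorted(columns)]


# Toy data: one forward read and one soft-clipped reverse-strand read.
reference = "ACGTACGT"
reads = [(1, "4M", "ACGT", False), (3, "1S3M", "TGAA", True)]
lines = pileup("ref", reference, reads)
for line in lines:
    print(*line, sep="\t")
```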

SAMtools has two separate implementations, one in C and the other in Java, with slightly different functionality.

3 CONCLUSIONS

We designed and implemented a generic alignment format, SAM, which is simple to work with and flexible enough to keep most information from various sequencing platforms and read aligners. The equivalent binary representation, BAM, is compact in size and supports fast retrieval of alignments in specified regions. Using positional sorting and indexing, applications can perform stream-based processing on specific genomic regions without loading the entire file into memory. The SAM/BAM format, together with SAMtools, separates the alignment step from downstream analyses, enabling a generic and modular approach to the analysis of genomic sequencing data.

ACKNOWLEDGEMENTS

We are grateful to James Bonfield for the comments on indexing and to SAMtools users for testing the software as it has matured.

Funding: Wellcome Trust/077192/Z/05/Z; NIH Hapmap/1000 Genomes Project grant (U54HG002750 to B.H.).

Conflict of Interest: none declared.

REFERENCES

Kent WJ, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006.
Langmead B, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
Li H, et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858.
Mardis ER. Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 2008;9:387–402.

Articles from Bioinformatics are provided here courtesy of Oxford University Press

