Samtools Manual Page
SYNOPSIS
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
DESCRIPTION
Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the
SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows reads in any region to be
retrieved quickly.
Samtools is designed to work on a stream. It regards an input file `-' as the standard input (stdin) and an output
file `-' as the standard output (stdout). Several commands can thus be combined with Unix pipes. Samtools
always outputs warning and error messages to the standard error output (stderr).
Samtools is also able to open a BAM (not SAM) file on a remote FTP or HTTP server if the BAM file name
starts with `ftp://' or `http://'. Samtools checks the current working directory for the index file and downloads
the index if it is not found there. Samtools does not retrieve the entire alignment file unless it is asked to do so.
With no options or regions specified, prints all alignments in the specified input alignment file
(in SAM, BAM, or CRAM format) to standard output in SAM format (with no header).
You may specify one or more space-separated region specifications after the input filename
to restrict output to only those alignments which overlap the specified region(s). Use of
region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM
format).
The -b, -C, -1, -u, -h, -H, and -c options change the output format from the default of
headerless SAM, and the -o and -U options set the output file name(s).
The -t and -T options provide additional reference data. One of these two options is required
when SAM input does not contain @SQ headers, and the -T option is required whenever
writing CRAM output.
The -L, -M, -r, -R, -s, -q, -l, -m, -f, -F, and -G options filter the alignments that will be
included in the output to only those alignments that match certain criteria.
The -x and -B options modify the data which is contained in each alignment.
Finally, the -@ option can be used to allocate additional threads to be used for compression,
and the -? option requests a long help message.
REGIONS: Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position coordinates
are 1-based.
Important note: when multiple regions are given, some alignments may be output multiple
times if they overlap more than one of the specified regions.
chr1 Output all alignments mapped to the reference sequence named `chr1'
(i.e. @SQ SN:chr1).
chr2:1000000 The region on chr2 beginning at base position 1,000,000 and ending at the
end of the chromosome.
chr3:1000-2000 The 1001bp region on chr3 beginning at base position 1,000 and ending at
base position 2,000 (including both end positions).
'*' Output the unmapped reads at the end of the file. (This does not include
any unmapped reads placed on a reference sequence alongside their
mapped mates.)
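The region syntax above can be illustrated with a short Python sketch. The function names `parse_region` and `region_length` are hypothetical and are not part of samtools; the sketch assumes reference names that do not themselves contain a colon.

```python
def parse_region(region):
    """Split RNAME[:STARTPOS[-ENDPOS]] into (rname, start, end).

    Positions are 1-based and inclusive; None means "not given".
    Illustrative only: reference names containing ':' are not handled.
    """
    rname, colon, span = region.partition(":")
    if not colon:
        return rname, None, None           # whole reference, e.g. "chr1"
    start_text, dash, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None  # None: up to the end of the reference
    return rname, start, end


def region_length(start, end):
    """Number of bases in a closed, 1-based interval."""
    return end - start + 1
```

Because both end positions are included, `chr3:1000-2000` spans 1001 bases rather than 1000.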
OPTIONS:
-b Output in the BAM format.
-c Instead of printing the alignments, only count them and print the total
number. All filter options, such as -f, -F, and -q, are taken into account.
-U FILE Write alignments that are not selected by the various filter options to FILE.
When this option is used, all alignments (or all alignments intersecting the
regions specified) are written to either the output file or this file, but never
both.
-t FILE A tab-delimited FILE. Each line must contain the reference name in the
first column and the length of the reference in the second column, with one
line for each distinct reference. Any additional fields beyond the second
column are ignored. This file also defines the order of the reference
sequences in sorting. If you run: `samtools faidx <ref.fa>', the resulting
index file <ref.fa>.fai can be used as this FILE.
-L FILE Only output alignments overlapping the input BED FILE [null].
-M Use the multi-region iterator on the union of the BED file and command-line
region arguments. This avoids re-reading the same regions of files, so it
can sometimes be much faster. Note that this also removes duplicate
sequences: without it, a sequence that overlaps multiple regions
specified on the command line is reported multiple times.
-r STR Output alignments in read group STR [null]. Note that records with no RG
tag will also be output when using this option. This behaviour may change
in a future release.
-R FILE Output alignments in read groups listed in FILE [null]. Note that records
with no RG tag will also be output when using this option. This behaviour
may change in a future release.
-m INT Only output alignments with number of CIGAR bases consuming query
sequence ≥ INT [0]
-f INT
Only output alignments with all bits set in INT present in the FLAG field.
INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in
octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
-F INT Do not output alignments with any bits set in INT present in the FLAG
field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-
F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
-G INT Do not output alignments with all bits set in INT present in the FLAG field.
This is the opposite of -f such that -f12 -G12 is the same as no filtering at
all. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
-s FLOAT Output only a proportion of the input alignments. This subsampling acts in
the same way on all of the alignment records in the same template or read
pair, so it never keeps a read but not its mate.
The integer and fractional parts of the -s INT.FRAC option are used
separately: the part after the decimal point sets the fraction of
templates/pairs to be kept, while the integer part is used as a seed that
influences which subset of reads is kept.
-@ INT Number of BAM compression threads to use in addition to main thread [0].
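As a minimal sketch of how the -f, -F and -G rules described in the option list combine, the following Python functions apply them to a single FLAG value. The function and parameter names are hypothetical; this follows the wording of the option descriptions and is not taken from the samtools source.

```python
def parse_flag_int(text):
    """Parse INT as decimal, hex (leading 0x) or octal (leading 0)."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text)


def passes_flag_filters(flag, require_all=0, exclude_any=0, exclude_all=0):
    """Return True if an alignment with this FLAG would be kept.

    require_all models -f, exclude_any models -F, exclude_all models -G.
    All three default to 0, which means "no filtering".
    """
    if (flag & require_all) != require_all:   # -f: every listed bit must be set
        return False
    if flag & exclude_any:                    # -F: no listed bit may be set
        return False
    # -G: drop only when every listed bit is set (0 disables the test)
    if exclude_all and (flag & exclude_all) == exclude_all:
        return False
    return True
```

For example, `passes_flag_filters(flag, exclude_any=parse_flag_int("0x4"))` keeps only alignments whose 0x4 bit is clear.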
sort samtools sort [-l level] [-m maxMem] [-o out.bam] [-O format] [-n] [-t tag] [-T tmpprefix] [-@
threads] [in.sam|in.bam|in.cram]
The sorted output is written to standard output by default, or to the specified file (out.bam)
when -o is used. This command will also create temporary files tmpprefix.%d.bam as
needed when the entire alignment data cannot fit into memory (as controlled via the -m
option).
Options:
-l INT Set the desired compression level for the final output file, ranging from 0
(uncompressed) or 1 (fastest but minimal compression) to 9 (best
compression but slowest to write), similarly to gzip(1)'s compression level
setting.
-m INT Approximately the maximum required memory per thread, specified either
in bytes or with a K, M, or G suffix. [768 MiB]
To prevent sort from creating a huge number of temporary files, it enforces
a minimum value of 1M for this setting.
-n Sort by read names (i.e., the QNAME field) rather than by chromosomal
coordinates.
-t TAG Sort first by the value in the alignment tag TAG, then by position or name
(if also using -n).
-o FILE Write the final sorted output to FILE, rather than to standard output.
By default, any temporary files are written alongside the output file, as
out.bam.tmp.nnnn.bam, or if output is to standard output, in the current
directory as samtools.mmm.mmm.tmp.nnnn.bam.
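The -m setting described in the option list takes a byte count with an optional K, M or G suffix and is subject to a 1M floor. A rough Python sketch of that parsing follows. The name `parse_memory` is hypothetical, and the use of binary multiples (1K = 1024 bytes) is an assumption suggested by the default being quoted in MiB.

```python
MIN_SORT_MEMORY = 1 << 20  # the 1M minimum described for -m


def parse_memory(text):
    """Convert a -m style value such as '768M' into a number of bytes.

    Assumes binary multiples for the K, M and G suffixes and applies
    the 1M floor. Illustrative only, with no error handling.
    """
    multipliers = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    suffix = text[-1].upper()
    if suffix in multipliers:
        value = int(text[:-1]) * multipliers[suffix]
    else:
        value = int(text)
    return max(value, MIN_SORT_MEMORY)
```

A request below the floor, such as `parse_memory("4096")`, is raised to 1M, which limits the number of temporary files a sort would need.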
Ordering Rules
If option -t is in use, records are first sorted by the value of the given alignment tag, and then
by position or name (if using -n). For example, “-t RG” will make read group the primary sort
key. The rules for ordering by tag are:
Records that do not have the tag are sorted before ones that do.
If the types of the tags are different, they will be sorted so that single character tags
(type A) come before array tags (type B), then string tags (types H and Z), then
numeric tags (types f and i).
Numeric tags (types f and i) are compared by value. Note that comparisons of floating-
point values are subject to issues of rounding and precision.
String tags (types H and Z) are compared based on the binary contents of the tag
using the C strcmp(3) function.
Character tags (type A) are compared by binary character value.
No attempt is made to compare tags of other types — notably type B array values will
not be compared.
When the -n option is present, records are sorted by name. Names are compared so as to
give a “natural” ordering — i.e. sections consisting of digits are compared numerically while
all other sections are compared based on their binary representation. This means “a1” will
come before “b1” and “a9” will come before “a10”. Records with the same name will be
ordered according to the values of the READ1 and READ2 flags (see flags).
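The "natural" name ordering can be approximated in Python with a sort key that splits each name into digit and non-digit sections. This is a sketch of the idea only, not the comparison function samtools uses; the name `natural_key` is hypothetical, and the choice to place a digit section before a non-digit section at the same position is an arbitrary assumption.

```python
import re

DIGITS = "0123456789"


def natural_key(name):
    """Build a sort key: digit runs compare as numbers, the rest as text."""
    key = []
    for chunk in re.findall(r"[0-9]+|[^0-9]+", name):
        if chunk[0] in DIGITS:
            key.append((0, int(chunk), ""))   # numeric section
        else:
            key.append((1, 0, chunk))         # compared by code point
    return key


names = ["a10", "b1", "a9", "a1"]
ordered = sorted(names, key=natural_key)
```

With this key, `ordered` places "a9" ahead of "a10", whereas a plain string sort would reverse those two.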
When the -n option is not present, reads are sorted by reference (according to the order of
the @SQ header records), then by position in the reference, and then by the REVERSE flag.
Note
Historically samtools sort also accepted a less flexible way of specifying the final and
temporary output filenames:
This has now been removed. The previous out.prefix argument (and -f option, if any) should
be changed to an appropriate combination of -T PREFIX and -o FILE. The previous -o
option should be removed, as output defaults to standard output.
Index a coordinate-sorted BAM or CRAM file for fast random access. (Note that this does not
work with SAM files even if they are bgzip compressed — to index such files, use tabix(1)
instead.)
This index is needed when region arguments are used to limit samtools view and similar
commands to particular regions of interest.
If an output filename is given, the index file will be written to out.index. Otherwise, for a
CRAM file aln.cram, index file aln.cram.crai will be created; for a BAM file aln.bam, either
aln.bam.bai or aln.bam.csi will be created, depending on the index format selected.
Options:
-b Create a BAI index. This is currently the default when no format options
are used.
-c Create a CSI index. By default, the minimum interval size for the index is
2^14, which is the same as the fixed value used by the BAI format.
Retrieve and print stats in the index file corresponding to the input file. Before calling
idxstats, the input BAM file should be indexed by samtools index.
If run on a SAM or CRAM file or an unindexed BAM file, this command will still produce the
same summary statistics, but does so by reading through the entire file. This is far slower
than using the BAM indices.
The output is TAB-delimited with each line consisting of reference sequence name,
sequence length, # mapped reads and # unmapped reads. It is written to stdout.
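Since the output is plain TAB-delimited text with four columns, it is straightforward to load into a script. The following Python sketch shows one way to do so; the function name `parse_idxstats` and the sample values are made up for illustration.

```python
def parse_idxstats(text):
    """Parse idxstats-style output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


# Hypothetical sample output, not taken from a real file.
sample = "chr1\t1000\t40\t2\nchr2\t500\t10\t0\n"
rows = parse_idxstats(sample)
total_mapped = sum(row[2] for row in rows)
```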
Does a full pass through the input file to calculate and print statistics to stdout.
Provides counts for each of 13 categories based primarily on bit flags in the FLAG field.
Each category in the output is broken down into QC pass and QC fail, which is presented as
"#PASS + #FAIL" followed by a description of the category.
The first row of output gives the total number of reads that are QC pass and fail (according
to flag bit 0x200). For example:
122 + 28 in total (QC-passed reads + QC-failed reads)
Which would indicate that there are a total of 150 reads in the input file, 122 of which are
marked as QC pass and 28 of which are marked as "not passing quality controls".
Following this, additional categories are given for reads which are:
secondary              0x100 bit set
paired in sequencing   0x1 bit set
properly paired        both 0x1 and 0x2 bits set and 0x4 bit not set
singletons             both 0x1 and 0x8 bits set and 0x4 bit not set
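The bit tests listed above, together with the QC pass/fail split on bit 0x200, can be sketched in Python as follows. Only the four categories listed here are modelled; the constant and function names are hypothetical, and this is not the code samtools itself uses.

```python
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
QC_FAIL = 0x200


def flag_categories(flag):
    """Return the set of listed categories that a FLAG value falls into."""
    cats = set()
    if flag & SECONDARY:
        cats.add("secondary")
    if flag & PAIRED:
        cats.add("paired in sequencing")
        if not flag & UNMAPPED:
            if flag & PROPER_PAIR:
                cats.add("properly paired")
            if flag & MATE_UNMAPPED:
                cats.add("singletons")
    return cats


def tally(flags):
    """Count each category as [QC pass, QC fail], like "#PASS + #FAIL"."""
    counts = {}
    for flag in flags:
        column = 1 if flag & QC_FAIL else 0
        for cat in flag_categories(flag):
            counts.setdefault(cat, [0, 0])[column] += 1
    return counts
```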
And finally, two rows are given that additionally filter on the reference name (RNAME), mate
reference name (MRNM), and mapping quality (MAPQ) fields:
samtools stats collects statistics from BAM files and outputs in a text format. The output can
be visualized graphically using plot-bamstats.
Options:
-d, --remove-dups
Exclude from statistics reads marked as duplicates
--GC-depth FLOAT
the size of GC-depth bins (decreasing bin size increases memory
requirement) [2e4]
-I, --id STR Include only listed read group or sample name []
-l, --read-length INT
Include in the statistics only reads with the given read length []
-r, --ref-seq FILE Reference sequence (required for GC-depth and mismatches-per-cycle
calculation). []
-S, --split TAG In addition to the complete statistics, also output categorised statistics
based on the tagged field TAG (e.g., use --split RG to split into read
groups).
Reports the total read base count (i.e. the sum of per base read depths) for each genomic
region specified in the supplied BED file. The regions are output as they appear in the BED
file and are 0-based. Counts for each alignment file supplied are reported in separate
columns.
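To make "the sum of per base read depths" concrete, the Python sketch below computes it in two equivalent ways for one region. It is a simplification under stated assumptions: alignments are reduced to 0-based, half-open (start, end) intervals on a single reference, and CIGAR operations and mapping quality are ignored. The function names are hypothetical.

```python
def per_base_depths(region_start, region_end, alignments):
    """Depth at each position of a 0-based, half-open region."""
    return [
        sum(1 for start, end in alignments if start <= pos < end)
        for pos in range(region_start, region_end)
    ]


def read_base_count(region_start, region_end, alignments):
    """Total read bases in the region: the sum of each alignment's overlap."""
    total = 0
    for start, end in alignments:
        overlap = min(region_end, end) - max(region_start, start)
        if overlap > 0:
            total += overlap
    return total


# Two hypothetical alignments covering positions 0-4 and 3-7.
alignments = [(0, 5), (3, 8)]
depths = per_base_depths(0, 10, alignments)
total = read_base_count(0, 10, alignments)
```

Summing overlaps gives the same figure as summing the per-base depths, without visiting every position.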
Options:
-Q INT Only count reads with mapping quality greater than INT
-j Do not include deletions (D) and ref skips (N) in bedcov computation.
Options:
-a -a, -aa Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence has
coverage outside of the region specified in the BED file.
-f FILE Use the BAM files specified in the FILE (a file of filenames, one file per
line) []
-l INT Ignore reads shorter than INT
-m, -d INT Truncate reported depth at a maximum of INT reads. [8000]. If 0, depth is
set to the maximum integer value, effectively removing any depth limit.
-q INT Only count reads with base quality greater than INT
-Q INT Only count reads with mapping quality greater than INT
-r CHR:FROM-TO
Only report depth in specified region.
merge samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>] <out.bam> <in1.bam> [<in2.bam>
<in3.bam> ... <inN.bam>]
Merge multiple sorted alignment files, producing a single sorted output file that contains all
the input records and maintains the existing sort order.
If -h is specified the @SQ headers of input files will be merged into the specified header,
otherwise they will be merged into a composite header created from the input headers. If in
the process of merging @SQ lines for coordinate sorted input files, a conflict arises as to the
order (for example input1.bam has @SQ for a,b,c and input2.bam has b,a,c) then the
resulting output file will need to be re-sorted back into coordinate order.
Unless the -c or -p flags are specified, when @RG and @PG records are merged into the
output header, any IDs found to be duplicates of existing IDs in the output header will
have a suffix appended to differentiate them from similar header records from other
files, and the read records will be updated to reflect this.
The ordering of the records in the input files must match the usage of the -n and -t
command-line options. If they do not, the output order will be undefined. See sort for
information about record ordering.
OPTIONS:
-h FILE Use the lines of FILE as `@' headers to be copied to out.bam, replacing
any header lines that would otherwise be copied from in1.bam. (FILE is
actually in SAM format, though any alignment records it may contain are
ignored.)
-t TAG The input alignments have been sorted by the value of TAG, then by either
position or name (if -n is given).
-r Attach an RG tag to each alignment. The tag value is inferred from file
names.
-c
When several input files contain @RG headers with the same ID, emit
only one of them (namely, the header line from the first file we find that ID
in) to the merged output file. Combining these similar headers is usually
the right thing to do when the files being merged originated from the same
file.
Without -c, all @RG headers appear in the output file, with random
suffixes added to their IDs where necessary to differentiate them.
-p Similarly, for each @PG ID in the set of files to merge, use the @PG line
of the first file we find that ID in rather than adding a suffix to differentiate
similar IDs.
Index reference sequence in the FASTA format or extract subsequence from indexed
reference sequence. If no region is specified, faidx will index the file and create
<ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and
printed to stdout in the FASTA format.
The sequences in the input file should all have different names. If they do not, indexing will
emit a warning about duplicate sequences and retrieval will only produce subsequences
from the first sequence with the duplicated name.
FASTQ files can be read and indexed by this command. Without using --fastq any extracted
subsequence will be in FASTA format.
Options
-f, --fastq Read FASTQ files and output extracted sequences in FASTQ format.
Same as using samtools fqidx.
-i, --reverse-complement
Output the sequence as the reverse complement. When this option is
used, “/rc” will be appended to the sequence names. To turn this off or
change the string appended, use the --mark-strand option.
--mark-strand TYPE
Append strand indicator to sequence name. TYPE can be one of:
custom,<pos>,<neg>
Append string <pos> to names when writing the forward
strand and <neg> when writing the reverse strand.
Spaces are preserved, so it is possible to move the
indicator into the comment part of the description line by
including a leading space in the strings <pos> and
<neg>.
Index reference sequence in the FASTQ format or extract subsequence from indexed
reference sequence. If no region is specified, fqidx will index the file and create
<ref.fastq>.fai on the disk. If regions are specified, the subsequences will be retrieved and
printed to stdout in the FASTQ format.
The sequences in the input file should all have different names. If they do not, indexing will
emit a warning about duplicate sequences and retrieval will only produce subsequences
from the first sequence with the duplicated name.
samtools fqidx should only be used on fastq files with a small number of entries. Trying to
use it on a file containing millions of short sequencing reads will produce an index that is
almost as big as the original file, and searches using the index will be very slow and use a lot
of memory.
Options
-i, --reverse-complement
Output the sequence as the reverse complement. When this option is
used, “/rc” will be appended to the sequence names. To turn this off or
change the string appended, use the --mark-strand option.
--mark-strand TYPE
Append strand indicator to sequence name. TYPE can be one of:
custom,<pos>,<neg>
Append string <pos> to names when writing the forward
strand and <neg> when writing the reverse strand.
Spaces are preserved, so it is possible to move the
indicator into the comment part of the description line by
including a leading space in the strings <pos> and
<neg>.
tview samtools tview [-p chr:pos] [-s STR] [-d display] <in.sorted.bam> [ref.fasta]
Text alignment viewer (based on the ncurses library). In the viewer, press `?' for help and
press `g' to jump to a region given in a format such as `chr10:10,000,000', or
`=10,000,000' when viewing the same reference sequence.
Options:
-v Verbose output
%% %
%* basename
%# @RG index
%! @RG ID
%. output format filename extension
Quickly check that input files appear to be intact. Checks that the beginning of the file contains a
valid header (all formats) containing at least one target sequence and then seeks to the end
of the file and checks that an end-of-file (EOF) is present and intact (BAM only).
Data in the middle of the file is not read since that would be much more time consuming, so
please note that this command will not detect internal corruption, but is useful for testing that
files are not truncated before performing more intensive tasks on them.
This command will exit with a non-zero exit code if any input files don't have a valid header
or are missing an EOF block. Otherwise it will exit successfully (with a zero exit code).
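The end-of-file part of this check can be sketched in Python for BAM files. The 28-byte block below is the empty BGZF block that the SAM/BAM specification defines as the EOF marker. The function name `has_bam_eof` is hypothetical, and the header check that the command also performs is not shown.

```python
import os

# Empty BGZF block used as the BAM end-of-file marker (28 bytes).
BAM_EOF_BLOCK = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43"
    " 02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bam_eof(path):
    """Return True if the file ends with the BAM EOF marker.

    Only the last 28 bytes are read, so corruption in the middle of
    the file is not detected.
    """
    if os.path.getsize(path) < len(BAM_EOF_BLOCK):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BAM_EOF_BLOCK), os.SEEK_END)
        return handle.read() == BAM_EOF_BLOCK
```

A script could use such a test as a cheap guard before starting a long-running job on a file that may have been truncated during transfer.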
Options:
-v Verbose output: will additionally print the names of all input files that don't
pass the check to stdout. Multiple -v options will cause additional
messages regarding check results to be printed to stderr.
-q
Quiet mode: disables warning messages on stderr about files that fail. If
both -q and -v options are used then the appropriate level of -v takes
precedence.
OPTIONS:
-u, --uri STR Specify the URI for the UR tag. Defaults to the absolute path of ref.fasta
unless reading from stdin.
Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.
OPTIONS:
-m Add ms (mate score) tags. These are used by markdup to select the best
reads to keep.
mpileup samtools mpileup [-EB] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-Q minBaseQ] [-q minMapQ]
in.bam [in2.bam [...]]
Generate pileup for one or multiple BAM files. Alignment records are grouped by sample
(SM) identifiers in @RG header lines. If sample identifiers are absent, each input file is
regarded as one sample.
Samtools mpileup can still produce VCF and BCF output, but this feature is deprecated and
will be removed in a future release. Please use bcftools mpileup for this instead.
(Documentation on the deprecated options has been removed from this manual page, but
older versions are available online at <http://www.htslib.org/doc/>.)
In the pileup format (without -u or -g), each line represents a genomic position, consisting of
chromosome name, 1-based coordinate, reference base, the number of reads covering the
site, read bases, base qualities and alignment mapping qualities. Information on match,
mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the
read base column. At this column, a dot stands for a match to the reference base on the
forward strand, a comma for a match on the reverse strand, a '>' or '<' for a reference skip,
`ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch on the reverse
strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
reference position and the next reference position. The length of the insertion is given by the
integer in the pattern, followed by the inserted sequence. Similarly, a pattern
`-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. The deleted bases will be
presented as `*' in the following lines. Also at the read base column, a symbol `^' marks the
start of a read. The ASCII of the character following `^' minus 33 gives the mapping quality. A
symbol `$' marks the end of a read segment.
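The encoding of the read base column described above can be decoded with a small tokenizer. The Python sketch below counts each kind of symbol and collects the mapping qualities that follow `^'. The function name, the category labels and the sample string are hypothetical, and only the symbols described in this section are handled.

```python
import re
from collections import Counter

SIMPLE_SYMBOLS = {
    ".": "match_forward",
    ",": "match_reverse",
    ">": "ref_skip",
    "<": "ref_skip",
    "*": "deleted_base",
    "$": "read_end",
}


def summarize_read_bases(column):
    """Return (counts, mapping_qualities) for one read base column."""
    counts = Counter()
    mapping_qualities = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # The next character encodes the mapping quality as ASCII - 33.
            mapping_qualities.append(ord(column[i + 1]) - 33)
            counts["read_start"] += 1
            i += 2
        elif ch in "+-":
            digits = re.match(r"[0-9]+", column[i + 1:])
            if digits is None:
                raise ValueError(f"indel without a length at offset {i}")
            length = int(digits.group())
            counts["insertion" if ch == "+" else "deletion"] += 1
            # Skip the sign, the length digits and the indel sequence.
            i += 1 + len(digits.group()) + length
        elif ch in SIMPLE_SYMBOLS:
            counts[SIMPLE_SYMBOLS[ch]] += 1
            i += 1
        elif ch in "ACGTN":
            counts["mismatch_forward"] += 1
            i += 1
        elif ch in "acgtn":
            counts["mismatch_reverse"] += 1
            i += 1
        else:
            raise ValueError(f"unexpected symbol {ch!r} at offset {i}")
    return counts, mapping_qualities


# Contrived column: read start (MAPQ 40), two matches, a mismatch,
# a 2-base insertion, a read end, a deleted base and a reference skip.
counts, quals = summarize_read_bases("^I.,A+2ag$*<")
```

Note that the bases inside an insertion or deletion pattern are skipped rather than counted as mismatches, which is why the length prefix must be read first.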
Note that there are two orthogonal ways to specify locations in the input file; via -r region and
-l file. The former uses (and requires) an index to do random access while the latter streams
through the file contents, keeping only the specified regions, and requires no index. The two may
be used in conjunction. For example a BED file containing locations of genes in
chromosome 20 could be specified using -r 20 -l chr20.bed, meaning that the index is used
to find chromosome 20 and then it is filtered for the regions listed in the bed file.
Input Options:
-A, --count-orphans
Do not skip anomalous read pairs in variant calling.
-B, --no-BAQ Disable base alignment quality (BAQ) computation. See BAQ below.
Note that up to release 1.8, samtools would enforce a minimum value for
this option. This no longer happens and the limit is set exactly as
specified.
-E, --redo-BAQ Recalculate BAQ on the fly, ignore existing BQ tags. See BAQ below.
Supplying a reference file will enable base alignment quality calculation for
all reads aligned to a reference in the file. See BAQ below.
-q, --min-MQ INT Minimum mapping quality for an alignment to be used [0]
-Q, --min-BQ INT Minimum base quality for a base to be considered [13]
-r, --region STR Only generate pileup in region. Requires the BAM files to be indexed. If
used in conjunction with -l then considers the intersection of the two
requests. STR [all sites]
-R, --ignore-RG Ignore RG tags. Treat all reads in one BAM as one sample.
-x, --ignore-overlaps
Disable read-pair overlap detection.
Output Options:
-o, --output FILE Write pileup output to FILE, rather than the default of standard output.
(The same short option is used for both the deprecated --open-prob
option and --output. If -o's argument contains any non-digit characters
other than a leading + or - sign, it is interpreted as --output. Usually the
filename extension will take care of this, but to write to an entirely numeric
filename use -o ./123 or --output 123.)
-a -a, -aa Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence has
coverage outside of the region specified in the BED file.
BAQ is the Phred-scaled probability of a read base being misaligned. It greatly helps to
reduce false SNPs caused by misalignments. BAQ is calculated using the probabilistic
realignment method described in the paper “Improving SNP discovery by base alignment
quality”, Heng Li, Bioinformatics, Volume 27, Issue 8 <https://doi.org/10.1093/bioinformatics/btr076>
BAQ is turned on when a reference file is supplied using the -f option. To disable it, use the
-B option.
It is possible to store pre-calculated BAQ values in a SAM BQ:Z tag. Samtools mpileup will
use the precalculated values if it finds them. The -E option can be used to make it ignore the
contents of the BQ:Z tag and force it to recalculate the BAQ scores by making a new
alignment.
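"Phred-scaled" refers to the usual convention Q = -10 log10(P), where P is the probability of an error (here, of the base being misaligned). A short Python sketch of the conversion in both directions follows. The function names are hypothetical, and this shows only the scale, not how BAQ itself is computed.

```python
import math


def phred_to_probability(quality):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-quality / 10)


def probability_to_phred(probability):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(probability)
```

On this scale a quality of 20 corresponds to a 1 in 100 chance that the base is misaligned, and 30 to 1 in 1000.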
FLAGS:
Converts a BAM or CRAM into either FASTQ or FASTA format depending on the command
invoked. The files will be automatically compressed if the file names have a .gz or .bgzf
extension.
The input to this program must be collated by name. Use samtools collate or samtools
sort -n to ensure this.
For each different QNAME, the input records are categorised according to the state of the
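READ1 and READ2 flag bits. The three categories used are:
1 : Only READ1 is set.
2 : Only READ2 is set.
0 : Either both READ1 and READ2 are set, or neither is set.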
The exact meaning of these categories depends on the sequencing technology used. It is
expected that ordinary single and paired-end sequencing reads will be in categories 1 and 2
(in the case of paired-end reads, one read of the pair will be in category 1, the other in
category 2). Category 0 is essentially a “catch-all” for reads that do not fit into a simple
paired-end sequencing model.
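The categorisation by READ1 and READ2 can be written as a small Python function. The bit values 0x40 (READ1) and 0x80 (READ2) come from the SAM specification; the function name `read_category` is hypothetical.

```python
READ1 = 0x40
READ2 = 0x80


def read_category(flag):
    """Return 1, 2 or 0 according to the READ1 and READ2 flag bits."""
    is_read1 = bool(flag & READ1)
    is_read2 = bool(flag & READ2)
    if is_read1 and not is_read2:
        return 1
    if is_read2 and not is_read1:
        return 2
    return 0  # both bits set, or neither
```

Other bits in the FLAG field, such as the paired or reverse-strand bits, do not affect the category.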
For each category only one sequence will be written for a given QNAME. If more than one
record is available for a given QNAME and category, the first in input file order that has
quality values will be used. If none of the candidate records has quality values, then the first
in input file order will be used instead.
Sequences will be written to standard output unless one of the -1, -2, or -0 options is used, in
which case sequences for that category will be written to the specified file.
If a singleton file is specified using the -s option then only paired sequences will be output for
categories 1 and 2; paired meaning that for a given QNAME there are sequences for both
category 1 and 2. If there is a sequence for only one of categories 1 or 2 then it will be
diverted into the specified singletons file. This can be used to prepare fastq files for
programs that cannot handle a mixture of paired and singleton reads.
The -s option only affects category 1 and 2 records. The output for category 0 will be the
same irrespective of the use of this option.
OPTIONS:
-n By default, either '/1' or '/2' is added to the end of read names where the
corresponding READ1 or READ2 FLAG bit is set. Using -n causes read
names to be left as they are.
-N Always add either '/1' or '/2' to the end of read names even when put into
different files.
-t Copy RG, BC and QT tags to the FASTQ header line, if they exist.
-T TAGLIST Specify a comma-separated list of tags to copy to the FASTQ header line,
if they exist.
-1 FILE Write reads with the READ1 FLAG set (and READ2 not set) to FILE
instead of outputting them. If the -s option is used, only paired reads will
be written to this file.
-2 FILE Write reads with the READ2 FLAG set (and READ1 not set) to FILE
instead of outputting them. If the -s option is used, only paired reads will
be written to this file.
-0 FILE Write reads where the READ1 and READ2 FLAG bits are either both
set or both unset to FILE instead of outputting them.
-f INT Only output alignments with all bits set in INT present in the FLAG field.
INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in
octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
-F INT Do not output alignments with any bits set in INT present in the FLAG
field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-
F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
-G INT Only EXCLUDE reads with all of the bits set in INT present in the FLAG
field. INT can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-
F]+/) or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
--quality-tag TAG
aux tag to find index quality in [default: QT]
--index-format STR
string to describe how to parse the barcode and quality tags. For example:
n*i* ignore the left part of the tag until the separator, then
use the second part
EXAMPLES
BUGS
The default value for the -F option should really be 0x900 so that
secondary and supplementary reads are automatically excluded.
The existing default of 0 is retained for reasons of compatibility.
collate samtools collate [options] in.sam|in.bam|in.cram [<prefix>]
The output from this command should be suitable for any operation that
requires all reads from the same template to be grouped together.
If present, <prefix> is used to name the temporary files that collate uses
when sorting the data. If neither the '-O' nor '-o' options are used, <prefix>
must be present and collate will use it to make an output file name by
appending a suffix depending on the format written (.bam by default).
Using -f for fast mode will output only primary alignments that have either
the READ1 or READ2 flags set (but not both). Any other alignment
records will be filtered out. The collation will only work correctly if there are
no more than two reads for any given QNAME after filtering.
Fast mode keeps a buffer of alignments in memory so that it can write out
most pairs as soon as they are found instead of storing them in temporary
files. This allows collate to avoid some work and so finish more quickly
compared to the standard mode. The number of alignments held can be
changed using -r; storing more alignments uses more memory but
increases the number of pairs that can be written early.
While collate normally randomises the ordering of read pairs, fast mode
does not. Position-dependent biases that would normally be broken up
can remain in the fast collate output. It is therefore not a good idea to use
fast mode when preparing data for programs that expect randomly ordered
paired reads. For example using fast collate instead of the standard mode
may lead to significantly different results from aligners that estimate library
insert sizes on batches of reads.
Options:
-o FILE Write output to FILE. This option cannot be used with '-O'.
By default this command outputs the BAM or CRAM file to standard output
(stdout), but for CRAM format files it has the option to perform an in-place
edit, both reading and writing to the same file. No validity checking is
performed on the header, nor that it is suitable to use with the sequence
data itself.
OPTIONS:
-i, --in-place Perform the header edit in-place, if possible. This only
works on CRAM files and only if there is sufficient room
to store the new header. The amount of space available
will differ for each CRAM file.
cat samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [
... ]
OPTIONS:
-b FOFN Read the list of input BAM or CRAM files from FOFN.
These are concatenated prior to any files specified on
the command line. Multiple -b FOFN options may be
specified to concatenate multiple lists of BAM/CRAM
files.
-h FILE Uses the SAM header from FILE. By default the header
is taken from the first file to be concatenated.
calmd
Generate the MD tag. If the MD tag is already present, this command will
give a warning if the MD tag generated is different from the existing tag.
Output SAM by default.
Calmd can also read and write CRAM files, although in most cases this is
pointless as CRAM recalculates MD and NM tags on the fly. The one
exception is where both input and output CRAM files have
been / are being created with the no_ref option.
OPTIONS:
targetcut
samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2]
[-f ref] <in.bam>
phase samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ]
<in.bam>
OPTIONS:
-1 Enable fastest compression level. Only works for BAM
or CRAM output.
markdup samtools markdup [-l length] [-r] [-s] [-T] [-S] in.algsort.bam out.bam
Mark duplicate alignments from a coordinate sorted file that has been run
through fixmate with the -m option. This program relies on the MC and ms
tags that fixmate provides.
EXAMPLE
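A minimal sketch of the preparation described above, assuming the input is already grouped by read name (which fixmate needs). The wrapper name mark_duplicates and the intermediate file names are hypothetical; only the fixmate -m, sort -o and markdup invocations are samtools commands.

```shell
# mark_duplicates: hypothetical wrapper, not a samtools command.
# $1 = name-collated input BAM (assumed), $2 = output BAM.
mark_duplicates() {
    in_bam=$1
    out_bam=$2
    prefix=${out_bam%.bam}          # placeholder naming for intermediates

    # 1. fixmate -m adds the MC and ms tags that markdup relies on.
    # 2. sort puts the records into coordinate order.
    # 3. markdup marks the duplicates.
    samtools fixmate -m "$in_bam" "$prefix.fixmate.bam" &&
        samtools sort -o "$prefix.possort.bam" "$prefix.fixmate.bam" &&
        samtools markdup "$prefix.possort.bam" "$out_bam"
}
```

It could be invoked as mark_duplicates namecollate.bam markdup.bam; the intermediate files are left in place for inspection.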
help, --help Display a brief usage message listing the samtools commands available. If
the name of a command is also given, e.g., samtools help view, the
detailed usage message for that particular command is displayed.
--version Display the version numbers and copyright information for samtools and
the important libraries used by samtools.
GLOBAL OPTIONS
Several long-options are shared between multiple samtools subcommands: --input-fmt,
--input-fmt-option, --output-fmt, --output-fmt-option, and --reference. The input format is
typically auto-detected so specifying the format is usually unnecessary and the option is
included for completeness. Note that not all subcommands have all options. Consult the
subcommand help for more details.
Format strings recognised are "sam", "bam" and "cram". They may be followed by a
comma-separated list of options as key or key=value. See below for examples.
The fmt-option arguments accept either a single option or option=value. Note that some
options only work on some file formats and only on read or write streams. If value is
unspecified for a boolean option, the value is assumed to be 1. The valid options are as
follows.
nthreads=INT Specifies the number of threads to use during encoding and/or decoding.
For BAM this will be encoding only. In CRAM the threads are dynamically
shared between encoder and decoder.
reference=fasta_file
Specifies a FASTA reference file for use in CRAM encoding or decoding. It
usually is not required for decoding except in the situation of the MD5 not
being obtainable via the REF_PATH or REF_CACHE environment
variables.
decode_md=0|1 CRAM input only; defaults to 1 (on). CRAM does not typically store MD
and NM tags, preferring to generate them on the fly. This option controls
this behaviour. It can be particularly useful when combined with a file
encoded using store_md=1 and store_nm=1.
store_md=0|1 CRAM output only; defaults to 0 (off). CRAM normally only stores MD tags
when the reference is unknown and lets the decoder generate these values
on-the-fly (see decode_md).
store_nm=0|1 CRAM output only; defaults to 0 (off). CRAM normally only stores NM tags
when the reference is unknown and lets the decoder generate these values
on-the-fly (see decode_md).
ignore_md5=0|1 CRAM input only; defaults to 0 (off). When enabled, md5 checksum errors
on the reference sequence and block checksum errors within CRAM are
ignored. Use of this option is strongly discouraged.
required_fields=bit-field
CRAM input only; specifies which SAM columns need to be populated. By
default all fields are used. Limiting the decode to specific columns can
have significant performance gains. The bit-field is a numerical value
constructed from the following table.
0x1 SAM_QNAME
0x2 SAM_FLAG
0x4 SAM_RNAME
0x8 SAM_POS
0x10 SAM_MAPQ
0x20 SAM_CIGAR
0x40 SAM_RNEXT
0x80 SAM_PNEXT
0x100 SAM_TLEN
0x200 SAM_SEQ
0x400 SAM_QUAL
0x800 SAM_AUX
0x1000 SAM_RGAUX
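As a hedged illustration of how such a bit-field could be built, the sketch below ORs together three values from the table. The choice of columns is hypothetical, and the variable names simply mirror the table labels.

```shell
# Flag values copied from the table above.
SAM_QNAME=0x1
SAM_FLAG=0x2
SAM_SEQ=0x200

# Hypothetical selection: read name, flag and sequence only.
mask=$(( $SAM_QNAME | $SAM_FLAG | $SAM_SEQ ))

printf 'required_fields=0x%x\n' "$mask"   # hexadecimal form: 0x203
printf 'required_fields=%d\n' "$mask"     # decimal form: 515
```

The result would be supplied through --input-fmt-option. Whether a given build accepts the hexadecimal spelling as well as the decimal one is an assumption worth checking.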
name_prefix=string
CRAM input only; defaults to output filename. Any sequences with auto-
generated read names will use string as the name prefix.
multi_seq_per_slice=0|1
CRAM output only; defaults to 0 (off). By default CRAM generates one
container per reference sequence, except in the case of many small
references (such as a fragmented assembly).
version=major.minor
CRAM output only. Specifies the CRAM version number. Acceptable
values are "2.1" and "3.0".
seqs_per_slice=INT
CRAM output only; defaults to 10000.
slices_per_container=INT
CRAM output only; defaults to 1. The effect of having multiple slices per
container is to share the compression header block between multiple
slices. This is unlikely to have any significant impact unless the number of
sequences per slice is reduced. (Together these two options control the
granularity of random access.)
embed_ref=0|1 CRAM output only; defaults to 0 (off). If 1, this will store portions of the
reference sequence in each slice, permitting decode without
requiring an external copy of the reference sequence.
use_bzip2=0|1 CRAM output only; defaults to 0 (off). Permits use of bzip2 in CRAM block
compression.
use_lzma=0|1 CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM block
compression.
lossy_names=0|1
CRAM output only; defaults to 0 (off). If 1, templates with all members
within the same CRAM slice will have their read names removed. New
names will be automatically generated during decoding. Also see the
name_prefix option.
For example:
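The sketch below assembles a format string of the comma-separated kind described above. The helper names fmt_string and normalise_opt are hypothetical and not part of samtools. Treating every bare key as a boolean (and so expanding it to key=1) is an assumption made for illustration.

```shell
# normalise_opt and fmt_string are hypothetical helpers, not samtools commands.

# Expand a bare key to key=1, mirroring the documented boolean default.
# Assumes any option given without a value is boolean.
normalise_opt() {
    case "$1" in
        *=*) printf '%s\n' "$1" ;;
        *)   printf '%s=1\n' "$1" ;;
    esac
}

# Join a format name and any number of options with commas.
fmt_string() {
    result=$1
    shift
    for opt in "$@"; do
        result="$result,$(normalise_opt "$opt")"
    done
    printf '%s\n' "$result"
}

fmt_string cram store_md=1 store_nm=1   # prints cram,store_md=1,store_nm=1
fmt_string cram embed_ref               # prints cram,embed_ref=1
```

The resulting string could then be given as the argument to --output-fmt or -O.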
When reading a CRAM the @SQ headers are interrogated to identify the reference
sequence MD5sum (M5: tag) and the local reference sequence filename (UR: tag). Note
that http:// and ftp:// based URLs in the UR: field are not used, but local fasta filenames (with
or without file://) can be used.
To create a CRAM the @SQ headers will also be read to identify the reference sequences,
but M5: and UR: tags may not be present. In this case the -T and -t options of samtools view
may be used to specify the fasta or fasta.fai filenames respectively (provided the .fasta.fai
file is also backed up by a .fasta file).
Use any local file specified by the command line options (e.g. -T).
ENVIRONMENT VARIABLES
HTS_PATH A colon-separated list of directories in which to search for HTSlib plugins.
If $HTS_PATH starts or ends with a colon or contains a double colon (::),
the built-in list of directories is searched at that point in the search.
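The colon rule above can be sketched as follows. The helper expand_hts_path is hypothetical and not part of HTSlib, and its second argument is only a stand-in for the compiled-in directory list, whose real value depends on the installation.

```shell
# expand_hts_path: hypothetical helper, not part of HTSlib.
# $1 = an HTS_PATH-style value, $2 = stand-in for the built-in directory list.
expand_hts_path() {
    rest=$1
    builtin_dirs=$2
    out=
    while :; do
        case "$rest" in
            *:*) entry=${rest%%:*}; rest=${rest#*:}; last=0 ;;
            *)   entry=$rest; last=1 ;;
        esac
        # An empty entry (leading, trailing or double colon) means
        # "search the built-in list at this point".
        if [ -z "$entry" ]; then
            entry=$builtin_dirs
        fi
        out=${out:+$out:}$entry
        if [ "$last" -eq 1 ]; then
            break
        fi
    done
    printf '%s\n' "$out"
}

expand_hts_path "/opt/plugins:" "/usr/libexec/htslib"
# prints /opt/plugins:/usr/libexec/htslib
```

A trailing colon therefore searches the user's directory first and the built-in list second. A leading colon reverses that order.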
EXAMPLES
Import SAM to BAM when @SQ lines are present in the header:
The value in an RG tag is determined by the file name the read is coming from. In this
example, in the merged.bam, reads from ga.bam will have RG:Z:ga attached, while
reads from 454.bam will have RG:Z:454 attached.
Convert a BAM file to a CRAM with NM and MD tags stored verbatim rather than
calculating on the fly during CRAM decode, so that mixed data sets with MD/NM only
on some records, or NM calculated using different definitions of mismatch, can be
decoded without change. The second command demonstrates how to decode such a
file. The request to not decode MD here is turning off auto-generation of both MD and
NM; it will still emit the MD/NM tags on records that had these stored verbatim.
An alternative way of achieving the above is listing multiple options after the
--output-fmt or -O option. The commands below are equivalent to the two above.
The bcftools filter command marks low quality sites and sites with the read depth
exceeding a limit, which should be adjusted to about twice the average read depth
(bigger read depths usually indicate problematic regions which are often enriched for
artefacts). Consider adding -C50 to mpileup if mapping quality is
overestimated for reads containing excessive mismatches. Applying this option
usually helps BWA-short but may not help other mappers.
Individuals are identified from the SM tags in the @RG header lines. Individuals can
be pooled in one alignment file; one individual can also be separated into multiple
files. The -P option specifies that indel candidates should be collected only from read
groups with the @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
sequenced by an indel-prone technology may affect the performance of indel calling.
It adds and corrects the NM and MD tags at the same time. The calmd command also
comes with the -C option, the same as the one in pileup and mpileup. Apply if it
helps.
LIMITATIONS
Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or
ends mapped to different chromosomes). If this is a concern, please use Picard's
MarkDuplicates, which correctly handles these cases, although it is a little slower.
AUTHOR
Heng Li from the Sanger Institute wrote the original C version of samtools. Bob Handsaker
from the Broad Institute implemented the BGZF library. James Bonfield from the Sanger
Institute developed the CRAM implementation. John Marshall and Petr Danecek contribute
to the source code and various people from the 1000 Genomes Project have contributed to
the SAM format specification.
SEE ALSO
bcftools(1), sam(5), tabix(1)
Samtools website: <http://www.htslib.org/>
File format specification of SAM/BAM, CRAM, VCF/BCF: <http://samtools.github.io/hts-specs>
Samtools latest source: <https://github.com/samtools/samtools>
HTSlib latest source: <https://github.com/samtools/htslib>
Bcftools website: <http://samtools.github.io/bcftools>
Copyright © 2018 Genome Research Limited (reg no. 2742969) is a charity registered in England with number
1021457.