Pysam
Pysam
Release 0.19.1
1 Contents 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Working with BAM/CRAM/SAM-formatted files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.4 Using samtools commands within python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.5 Working with tabix-indexed files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.6 Working with VCF/BCF formatted files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
1.7 Extending pysam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.8 Installing pysam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.9 FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.10 Developer’s guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.11 Release notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.12 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.13 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3 References 73
Bibliography 75
Index 77
i
ii
pysam documentation, Release 0.19.1
Contents 1
pysam documentation, Release 0.19.1
2 Contents
CHAPTER 1
Contents
1.1 Introduction
Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in
SAM/BAM files. It is a lightweight wrapper of the htslib C-API.
This page provides a quick introduction in using pysam followed by the API. See Working with BAM/CRAM/SAM-
formatted files for more detailed usage instructions.
To use the module to read a file in BAM format, create a AlignmentFile object:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
Once a file is opened you can iterate over all of the read mapping to a specified region using fetch(). Each iteration
returns a AlignedSegment object which represents a single read along with its fields and optional tags:
samfile.close()
To give:
EAS56_57:6:190:289:82 0 99 <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;
˓→69 CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA 0 192 1
EAS56_57:6:190:289:82 0 99 <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;
˓→137 AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC 73 64 1
EAS51_64:3:190:727:308 0 102 <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844
˓→99 GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG 99 18 1
...
3
pysam documentation, Release 0.19.1
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
pairedreads = pysam.AlignmentFile("allpaired.bam", "wb", template=samfile)
for read in samfile.fetch():
if read.is_paired:
pairedreads.write(read)
pairedreads.close()
samfile.close()
An alternative way of accessing the data in a SAM file is by iterating over each base of a specified region using the
pileup() method. Each iteration returns a PileupColumn which represents all the reads in the SAM file that
map to a single base in the reference sequence. The list of reads are represented as PileupRead objects in the
PileupColumn.pileups property:
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb" )
for pileupcolumn in samfile.pileup("chr1", 100, 120):
print("\ncoverage at base %s = %s" % (pileupcolumn.pos, pileupcolumn.n))
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
# query position is None if is_del or is_refskip is set.
print('\tbase in read %s = %s' %
(pileupread.alignment.query_name,
pileupread.alignment.query_sequence[pileupread.query_position]))
samfile.close()
Commands available in samtools are available as simple function calls. For example:
pysam.sort("-o", "output.bam", "ex1.bam")
Analogous to AlignmentFile, a TabixFile allows fast random access to compressed and tabix indexed tab-
separated file formats with genomic data:
import pysam
tabixfile = pysam.TabixFile("example.gtf.gz")
(continues on next page)
4 Chapter 1. Contents
pysam documentation, Release 0.19.1
TabixFile implements lazy parsing in order to iterate over large tables efficiently.
More detailed usage instructions is at Working with BAM/CRAM/SAM-formatted files.
Note: Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based
coordinates.
Note:
The above examples can be run in the tests directory of the installation directory. Type ‘make’ before running
them.
See also:
https://github.com/pysam-developers/pysam
The pysam code repository, containing source code and download instructions
http://pysam.readthedocs.org/en/latest/
The pysam website containing documentation
1.2 API
1.2. API 5
pysam documentation, Release 0.19.1
f = pysam.AlignmentFile('ex1.bam','rb')
If mode is not specified, the method will try to auto-detect in the order ‘rb’, ‘r’, thus both
the following should work:
f1 = pysam.AlignmentFile('ex1.bam')
f2 = pysam.AlignmentFile('ex1.sam')
6 Chapter 1. Contents
pysam documentation, Release 0.19.1
• index_filename (string) – Explicit path to the index file. Only needed if the index
is not named in the standard manner, not located in the same directory as the BAM/CRAM
file, or is remote. An IOError is raised if the index cannot be found or is invalid.
• filepath_index (string) – Alias for index_filename.
• require_index (bool) – When reading, require that an index file is present and is valid
or raise an IOError. (default=False)
• filename (string) – Alternative to filepath_or_object. Filename of the file to be
opened.
• duplicate_filehandle (bool) – By default, file handles passed either directly or
through File-like objects will be duplicated before passing them to htslib. The duplication
prevents issues where the same stream will be closed by htslib and through destruction of
the high-level python object. Set to False to turn off duplication.
• ignore_truncation (bool) – Issue a warning, instead of raising an error if the current
file appears to be truncated due to a missing EOF marker. Only applies to bgzipped formats.
(Default=False)
• format_options (list) – A list of key=value strings, as accepted by –input-fmt-option
and –output-fmt-option in samtools.
• threads (integer) – Number of threads to use for compressing/decompressing
BAM/CRAM files. Setting threads to > 1 cannot be combined with ignore_truncation.
(Default=1)
check_index(self )
return True if index is present.
Raises
• AttributeError – if htsfile is SAM formatted and thus has no index.
• ValueError – if htsfile is closed or index could not be opened.
close(self )
closes the pysam.AlignmentFile.
count(self, contig=None, start=None, stop=None, region=None, until_eof=False,
read_callback=’nofilter’, reference=None, end=None)
count the number of reads in region
The region is specified by contig, start and stop. reference and end are also accepted for backward com-
patibility as synonyms for contig and stop, respectively. Alternatively, a samtools region string can be
supplied.
A SAM file does not allow random access and if region or contig are given, an exception is raised.
Parameters
• contig (string) – reference_name of the genomic region (chromosome)
• start (int) – start of the genomic region (0-based inclusive)
• stop (int) – end of the genomic region (0-based exclusive)
• region (string) – a region string in samtools format.
• until_eof (bool) – count until the end of the file, possibly including unmapped reads
as well.
• read_callback (string or function) – select a call-back to ignore reads when
counting. It can be either a string with the following values:
1.2. API 7
pysam documentation, Release 0.19.1
all skip reads in which any of the following flags are set: BAM_FUNMAP,
BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilter uses every single read
Alternatively, read_callback can be a function check_read(read) that should return
True only for those reads that shall be included in the counting.
• reference (string) – backward compatible synonym for contig
• end (int) – backward compatible synonym for stop
Raises ValueError – if the genomic coordinates are out of range or invalid.
count_coverage(self, contig, start=None, stop=None, region=None, quality_threshold=15,
read_callback=’all’, reference=None, end=None)
count the coverage of genomic positions by reads in region.
The region is specified by contig, start and stop. reference and end are also accepted for backward com-
patibility as synonyms for contig and stop, respectively. Alternatively, a samtools region string can be
supplied. The coverage is computed per-base [ACGT].
Parameters
• contig (string) – reference_name of the genomic region (chromosome)
• start (int) – start of the genomic region (0-based inclusive). If not given, count from
the start of the chromosome.
• stop (int) – end of the genomic region (0-based exclusive). If not given, count to the
end of the chromosome.
• region (string) – a region string.
• quality_threshold (int) – quality_threshold is the minimum quality score (in
phred) a base has to reach to be counted.
• read_callback (string or function) – select a call-back to ignore reads when
counting. It can be either a string with the following values:
all skip reads in which any of the following flags are set: BAM_FUNMAP,
BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilter uses every single read
Alternatively, read_callback can be a function check_read(read) that should return
True only for those reads that shall be included in the counting.
• reference (string) – backward compatible synonym for contig
• end (int) – backward compatible synonym for stop
Raises ValueError – if the genomic coordinates are out of range or invalid.
Returns four array.arrays of the same length in order A C G T
Return type tuple
fetch(self, contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multi-
ple_iterators=False, reference=None, end=None)
fetch reads aligned in a region.
See parse_region() for more information on how genomic regions can be specified. reference and
end are also accepted for backward compatibility as synonyms for contig and stop, respectively.
8 Chapter 1. Contents
pysam documentation, Release 0.19.1
Without a contig or region all mapped reads in the file will be fetched. The reads will be returned ordered
by reference sequence, which will not necessarily be the order within the file. This mode of iteration still
requires an index. If there is no index, use until_eof=True.
If only contig is set, all reads aligned to contig will be fetched.
A SAM file does not allow random access. If region or contig are given, an exception is raised.
Parameters
• until_eof (bool) – If until_eof is True, all reads from the current file position will be
returned in order as they are within the file. Using this option will also fetch unmapped
reads.
• multiple_iterators (bool) – If multiple_iterators is True, multiple iterators on
the same file can be used at the same time. The iterator returned will receive its own copy
of a filehandle to the file effectively re-opening the file. Re-opening a file creates some
overhead, so beware.
Returns An iterator over a collection of reads.
Return type IteratorRow
Raises ValueError – if the genomic coordinates are out of range or invalid or the file does
not permit random access to genomic coordinates.
find_introns(self, read_iterator)
Return a dictionary {(start, stop): count} Listing the intronic sites in the reads (identified by ‘N’ in the
cigar strings), and their support ( = number of reads ).
read_iterator can be the result of a .fetch(. . . ) call. Or it can be a generator filtering such reads. Example
samfile.find_introns((read for read in samfile.fetch(. . . ) if read.is_reverse)
find_introns_slow(self, read_iterator)
Return a dictionary {(start, stop): count} Listing the intronic sites in the reads (identified by ‘N’ in the
cigar strings), and their support ( = number of reads ).
read_iterator can be the result of a .fetch(. . . ) call. Or it can be a generator filtering such reads. Example
samfile.find_introns((read for read in samfile.fetch(. . . ) if read.is_reverse)
get_index_statistics(self )
return statistics about mapped/unmapped reads per chromosome as they are stored in the index, similarly
to the statistics printed by the samtools idxstats command.
CRAI indexes do not record these statistics, so for a CRAM file with a .crai index the returned statistics
will all be 0.
Returns a list of records for each chromosome. Each record has the attributes ‘contig’,
‘mapped’, ‘unmapped’ and ‘total’.
Return type list
get_reference_length(self, reference)
return reference length corresponding to numerical tid
get_reference_name(self, tid)
return reference name corresponding to numerical tid
get_tid(self, reference)
return the numerical tid corresponding to reference
returns -1 if reference is not known.
1.2. API 9
pysam documentation, Release 0.19.1
getrname(self, tid)
deprecated, use get_reference_name() instead
gettid(self, reference)
deprecated, use get_tid() instead
has_index(self )
return true if htsfile has an existing (and opened) index.
head(self, n, multiple_iterators=True)
return an iterator over the first n alignments.
This iterator is is useful for inspecting the bam-file.
Parameters multiple_iterators (bool) – is set to True by default in order to avoid
changing the current file position.
Returns an iterator over a collection of reads
Return type IteratorRowHead
is_valid_tid(self, int tid)
return True if the numerical tid is valid; False otherwise.
Note that the unmapped tid code (-1) counts as an invalid.
lengths
tuple of the lengths of the reference sequences. This is a read-only attribute. The lengths are in the same
order as pysam.AlignmentFile.references
mapped
int with total number of mapped alignments according to the statistics recorded in the index. This is a
read-only attribute. (This will be 0 for a CRAM file indexed by a .crai index, as that index format does not
record these statistics.)
mate(self, AlignedSegment read)
return the mate of pysam.AlignedSegment read.
Note: Calling this method will change the file position. This might interfere with any iterators that have
not re-opened the file.
Note: This method is too slow for high-throughput processing. If a read needs to be processed with its
mate, work from a read name sorted file or, better, cache reads.
next
nocoordinate
int with total number of reads without coordinates according to the statistics recorded in the index, i.e.,
the statistic printed for “*” by the samtools idxstats command. This is a read-only attribute. (This
will be 0 for a CRAM file indexed by a .crai index, as that index format does not record these statistics.)
nreferences
int with the number of reference sequences in the file. This is a read-only attribute.
10 Chapter 1. Contents
pysam documentation, Release 0.19.1
Note: ‘all’ reads which overlap the region are returned. The first base returned will be the first base of the
first read ‘not’ necessarily the first base of the region used in the query.
Parameters
• truncate (bool) – By default, the samtools pileup engine outputs all reads overlap-
ping a region. If truncate is True and a region is given, only columns in the exact region
specified are returned.
• max_depth (int) – Maximum read depth permitted. The default limit is ‘8000’.
• stepper (string) – The stepper controls how the iterator advances. Possible options
for the stepper are
all skip reads in which any of the following flags are set: BAM_FUNMAP,
BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
nofilter uses every single read turning off any filtering.
samtools same filter and read processing as in samtools pileup. For full compatibility,
this requires a ‘fastafile’ to be given. The following options all pertain to filtering of the
samtools stepper.
• fastafile (FastaFile object.) – This is required for some of the steppers.
• ignore_overlaps (bool) – If set to True, detect if read pairs overlap and only take
the higher quality base. This is the default.
• flag_filter (int) – ignore reads where any of the bits in the flag are set. The default
is BAM_FUNMAP | BAM_FSECONDARY | BAM_FQCFAIL | BAM_FDUP.
• flag_require (int) – only use reads where certain flags are set. The default is 0.
• ignore_orphans (bool) – ignore orphans (paired reads that are not in a proper pair).
The default is to ignore orphans.
• min_base_quality (int) – Minimum base quality. Bases below the minimum qual-
ity will not be output. The default is 13.
• adjust_capq_threshold (int) – adjust mapping quality. The default is 0 for no
adjustment. The recommended value for adjustment is 50.
• min_mapping_quality (int) – only use reads above a minimum mapping quality.
The default is 0.
• compute_baq (bool) – re-alignment computing per-Base Alignment Qualities (BAQ).
The default is to do re-alignment. Realignment requires a reference sequence. If none is
present, no realignment will be performed.
1.2. API 11
pysam documentation, Release 0.19.1
• redo_baq (bool) – recompute per-Base Alignment Quality on the fly ignoring existing
base qualities. The default is False (use existing base qualities).
Returns an iterator over genomic positions.
Return type IteratorColumn
references
tuple with the names of reference sequences. This is a read-only attribute
text
deprecated, use references and lengths instead
unmapped
int with total number of unmapped reads according to the statistics recorded in the index. This number of
reads includes the number of reads without coordinates. This is a read-only attribute. (This will be 0 for a
CRAM file indexed by a .crai index, as that index format does not record these statistics.)
write(self, AlignedSegment read) → int
write a single pysam.AlignedSegment to disk.
Raises ValueError – if the writing failed
Returns the number of bytes written. If the file is closed, this will be 0.
Return type int
class pysam.AlignmentHeader
header information for a AlignmentFile object
Parameters
• header_dict (dict) – build header from a multi-level dictionary. The first level are the
four types (‘HD’, ‘SQ’, . . . ). The second level are a list of lines, with each line being a list
of tag-value pairs. The header is constructed first from all the defined fields, followed by
user tags in alphabetical order. Alternatively, an AlignmentHeader object can be passed
directly.
• text (string) – use the string provided as the header
• reference_names (list) – see reference_lengths
• reference_lengths (list) – build header from list of chromosome names and
lengths. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option
can be changed by unsetting the flag add_sq_text.
• add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permits
construction SAM formatted files without a header.
as_dict(self )
deprecated, use to_dict() instead
copy(self )
from_dict(type cls, header_dict)
from_references(type cls, reference_names, reference_lengths, text=None, add_sq_text=True)
from_text(type cls, text)
get(self, *args)
get_reference_length(self, reference)
get_reference_name(self, tid)
12 Chapter 1. Contents
pysam documentation, Release 0.19.1
get_tid(self, reference)
return the numerical tid corresponding to reference
returns -1 if reference is not known.
is_valid_tid(self, int tid)
return True if the numerical tid is valid; False otherwise.
Note that the unmapped tid code (-1) counts as an invalid.
items(self )
iteritems(self )
keys(self )
lengths
tuple of the lengths of the reference sequences. This is a read-only attribute. The lengths are in the same
order as pysam.AlignmentFile.references
nreferences
int with the number of reference sequences in the file.
This is a read-only attribute.
references
tuple with the names of reference sequences. This is a read-only attribute
to_dict(self )
return two-level dictionary with header information from the file.
The first level contains the record (HD, SQ, etc) and the second level contains the fields (VN, LN, etc).
The parser is validating and will raise an AssertionError if if encounters any record or field tags that are not
part of the SAM specification. Use the pysam.AlignmentFile.text attribute to get the unparsed
header.
The parsing follows the SAM format specification with the exception of the CL field. This option will
consume the rest of a header line irrespective of any additional fields. This behaviour has been added to
accommodate command line options that contain characters that are not valid field separators.
If no @SQ entries are within the text section of the header, this will be automatically added from the
reference names and lengths stored in the binary part of the header.
values(self )
An AlignedSegment represents an aligned segment within a SAM/BAM file.
class pysam.AlignedSegment(AlignmentHeader header=None)
Class representing an aligned segment.
This class stores a handle to the samtools C-structure representing an aligned read. Member read access is
forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the
data needed is converted.
For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries,
as the variable length data is concatenated and thus needs to be resized if a field is updated. Furthermore, the
BAM entry might be in an inconsistent state.
One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence
will also erase any quality scores that were set previously.
Parameters header – AlignmentHeader object to map numerical identifiers to chromosome
names. If not given, an empty header is created.
1.2. API 13
pysam documentation, Release 0.19.1
aend
deprecated, use reference_end instead.
alen
deprecated, use reference_length instead.
aligned_pairs
deprecated, use get_aligned_pairs() instead.
bin
properties bin
blocks
deprecated, use get_blocks() instead.
cigar
deprecated, use cigarstring or cigartuples instead.
cigarstring
the cigar alignment as a string.
The cigar string is a string of alternating integers and characters denoting the length and the type of an
operation.
Note: The order length,operation is specified in the SAM format. It is different from the order of the
cigar property.
M BAM_CMATCH 0
I BAM_CINS 1
D BAM_CDEL 2
N BAM_CREF_SKIP 3
S BAM_CSOFT_CLIP 4
H BAM_CHARD_CLIP 5
P BAM_CPAD 6
= BAM_CEQUAL 7
X BAM_CDIFF 8
B BAM_CBACK 9
Note: The output is a list of (operation, length) tuples, such as [(0, 30)]. This is different from the
SAM specification and the cigarstring property, which uses a (length, operation) order, for example:
30M.
14 Chapter 1. Contents
pysam documentation, Release 0.19.1
flag
properties flag
from_dict(type cls, sam_dict, AlignmentHeader header)
parses a dictionary representation of the aligned segment.
Parameters sam_dict – dictionary of alignment values, keys corresponding to output from
todict().
fromstring(type cls, sam, AlignmentHeader header)
parses a string representation of the aligned segment.
The input format should be valid SAM format.
Parameters sam – SAM formatted string
get_aligned_pairs(self, matches_only=False, with_seq=False)
a list of aligned read (query) and reference positions.
For inserts, deletions, skipping either query or reference position may be None.
For padding in the reference, the reference position will always be None.
Parameters
• matches_only (bool) – If True, only matched bases are returned - no None on either
side.
• with_seq (bool) – If True, return a third element in the tuple containing the reference
sequence. For CIGAR ‘P’ (padding in the reference) operations, the third tuple element
will be None. Substitutions are lower-case. This option requires an MD tag to be present.
Returns aligned_pairs
Return type list of tuples
get_blocks(self )
a list of start and end positions of aligned gapless blocks.
The start and end positions are in genomic coordinates.
Blocks are not normalized, i.e. two blocks might be directly adjacent. This happens if the two blocks are
separated by an insertion in the read.
get_cigar_stats(self )
summary of operations in cigar string.
The output order in the array is “MIDNSHP=X” followed by a field for the NM tag. If the NM tag is not
present, this field will always be 0.
M BAM_CMATCH 0
I BAM_CINS 1
D BAM_CDEL 2
N BAM_CREF_SKIP 3
S BAM_CSOFT_CLIP 4
H BAM_CHARD_CLIP 5
P BAM_CPAD 6
= BAM_CEQUAL 7
X BAM_CDIFF 8
B BAM_CBACK 9
NM NM tag 10
1.2. API 15
pysam documentation, Release 0.19.1
16 Chapter 1. Contents
pysam documentation, Release 0.19.1
If with_value_type is set, the value type as encode in the AlignedSegment record will be returned as well:
[(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]
This method will convert all values in the optional alignment section. When getting only one or few tags,
please see get_tag() for a quicker way to achieve this.
has_tag(self, tag)
returns true if the optional alignment section contains a given tag.
infer_query_length(self, always=False)
infer query length from CIGAR alignment.
This method deduces the query length from the CIGAR alignment but does not include hard-clipped bases.
Returns None if CIGAR alignment is not present.
If always is set to True, infer_read_length is used instead. This is deprecated and only present for backward
compatibility.
infer_read_length(self )
infer read length from CIGAR alignment.
This method deduces the read length from the CIGAR alignment including hard-clipped bases.
Returns None if CIGAR alignment is not present.
inferred_length
deprecated, use infer_query_length() instead.
is_duplicate
true if optical or PCR duplicate
is_forward
true if read is mapped to forward strand (implemented in terms of is_reverse)
is_mapped
true if read itself is mapped (implemented in terms of is_unmapped)
is_paired
true if read is paired in sequencing
is_proper_pair
true if read is mapped in a proper pair
is_qcfail
true if QC failure
is_read1
true if this is read1
is_read2
true if this is read2
is_reverse
true if read is mapped to reverse strand
is_secondary
true if not primary alignment
is_supplementary
true if this is a supplementary alignment
is_unmapped
true if read itself is unmapped
1.2. API 17
pysam documentation, Release 0.19.1
isize
deprecated, use template_length instead.
mapping_quality
mapping quality
mapq
deprecated, use mapping_quality instead.
mate_is_forward
true if the mate is mapped to forward strand (implemented in terms of mate_is_reverse)
mate_is_mapped
true if the mate is mapped (implemented in terms of mate_is_unmapped)
mate_is_reverse
true if the mate is mapped to reverse strand
mate_is_unmapped
true if the mate is unmapped
modified_bases
Modified bases annotations from Ml/Mm tags. The output is Dict[(canonical base, strand, modification)]
-> [ (pos,qual), . . . ] with qual being (256*probability), or -1 if unknown. Strand==0 for forward and 1 for
reverse strand modification
modified_bases_forward
Modified bases annotations from Ml/Mm tags. The output is Dict[(canonical base, strand, modifica-
tion)] -> [ (pos,qual), . . . ] with qual being (256*probability), or -1 if unknown. Strand==0 for for-
ward and 1 for reverse strand modification. The positions are with respect to the original sequence from
get_forward_sequence()
mpos
deprecated, use next_reference_start instead.
mrnm
deprecated, use next_reference_id instead.
next_reference_id
the reference id of the mate/next read.
next_reference_name
reference name of the mate/next read (None if no AlignmentFile is associated)
next_reference_start
the position of the mate/next read.
opt(self, tag)
deprecated, use get_tag() instead.
overlap(self )
deprecated, use get_overlap() instead.
pnext
deprecated, use next_reference_start instead.
pos
deprecated, use reference_start instead.
positions
deprecated, use get_reference_positions() instead.
18 Chapter 1. Contents
pysam documentation, Release 0.19.1
qend
deprecated, use query_alignment_end instead.
qlen
deprecated, use query_alignment_length instead.
qname
deprecated, use query_name instead.
qqual
deprecated, use query_alignment_qualities instead.
qstart
deprecated, use query_alignment_start instead.
qual
deprecated, use query_qualities instead.
query
deprecated, use query_alignment_sequence instead.
query_alignment_end
end index of the aligned query portion of the sequence (0-based, exclusive)
This the index just past the last base in query_sequence that is not soft-clipped.
query_alignment_length
length of the aligned query sequence.
This is equal to query_alignment_end - query_alignment_start
query_alignment_qualities
aligned query sequence quality values (None if not present). These are the quality values that correspond
to query_alignment_sequence, that is, they exclude qualities of soft clipped bases. This is equal
to query_qualities[query_alignment_start:query_alignment_end].
Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded
value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
This property is read-only.
query_alignment_sequence
aligned portion of the read.
This is a substring of query_sequence that excludes flanking
bases that were soft clipped (None if not present). It is equal to
query_sequence[query_alignment_start:query_alignment_end].
SAM/BAM files may include extra flanking bases that are not part of the alignment. These bases may be
the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the
first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality
bases that were not considered for alignment may have been retained.
query_alignment_start
start index of the aligned query portion of the sequence (0-based, inclusive).
This the index of the first base in query_sequence that is not soft-clipped.
query_length
the length of the query/read.
This value corresponds to the length of the sequence supplied in the BAM/SAM file. The length of a query
is 0 if there is no sequence in the BAM/SAM file. In those cases, the read length can be inferred from the
CIGAR alignment, see pysam.AlignedSegment.infer_query_length().
1.2. API 19
pysam documentation, Release 0.19.1
q = read.query_qualities
read.query_sequence = read.query_sequence[5:10]
read.query_qualities = q[5:10]
The sequence is returned as it is stored in the BAM file. (This will be the reverse complement of the
original read sequence if the mapper has aligned the read to the reverse strand.)
reference_end
aligned reference position of the read on the reference genome.
reference_end points to one past the last aligned residue. Returns None if not available (read is unmapped
or no cigar alignment present).
reference_id
reference ID
Note: This field contains the index of the reference sequence in the sequence dictionary. To obtain the
name of the reference sequence, use get_reference_name()
reference_length
aligned length of the read on the reference genome.
This is equal to reference_end - reference_start. Returns None if not available.
reference_name
reference name
reference_start
0-based leftmost coordinate
rlen
deprecated, use query_length instead.
rname
deprecated, use reference_id instead.
20 Chapter 1. Contents
pysam documentation, Release 0.19.1
rnext
deprecated, use next_reference_id instead.
seq
deprecated, use query_sequence instead.
setTag(self, tag, value, value_type=None, replace=True)
deprecated, use set_tag() instead.
set_tag(self, tag, value, value_type=None, replace=True)
sets a particular field tag to value in the optional alignment section.
value_type describes the type of value that is to entered into the alignment record. It can be set explicitly
to one of the valid one-letter type codes. If unset, an appropriate type will be chosen automatically based
on the python type of value.
An existing value of the same tag will be overwritten unless replace is set to False. This is usually not
recommended as a tag may only appear once in the optional alignment section.
If value is None, the tag will be deleted.
This method accepts valid SAM specification value types, which are:
A: printable char
i: signed int
f: float
Z: printable string
H: Byte array in hex format
B: Integer or numeric array
i: python int
f: python float
Z: python str or bytes
B: python array.array, list or tuple
Note that a single character string will be output as ‘Z’ and not ‘A’ as the former is the more general type.
set_tags(self, tags)
sets the fields in the optional alignment section with a list of (tag, value) tuples.
The value type of the values is determined from the python type. Optionally, a type may be given explicitly
as a third value in the tuple, For example:
x.set_tags([(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]
This method will not enforce the rule that the same tag may appear only once in the optional alignment
section.
tags
deprecated, use get_tags() instead.
template_length
the observed query template length
tid
deprecated, use reference_id instead.
1.2. API 21
pysam documentation, Release 0.19.1
tlen
deprecated, use template_length instead.
to_dict(self )
returns a json representation of the aligned segment.
Field names are abbreviated versions of the class attributes.
to_string(self )
returns a string representation of the aligned segment.
The output format is valid SAM format if a header is associated with the AlignedSegment.
tostring(self, htsfile=None)
deprecated, use to_string() instead.
Parameters htsfile – (deprecated) AlignmentFile object to map numerical identifiers to
chromosome names. This parameter is present for backwards compatibility and ignored.
class pysam.PileupColumn
A pileup of reads at a particular reference sequence position (column). A pileup column contains all the reads
that map to a certain target base.
This class is a proxy for results returned by the samtools pileup engine. If the underlying engine iterator ad-
vances, the results of this column will change.
get_mapping_qualities(self )
query mapping quality scores at pileup column position.
Returns a list of quality scores
Return type list
get_num_aligned(self )
return number of aligned bases at pileup column position.
This method applies a base quality filter and the number is equal to the size of
get_query_sequences(), get_mapping_qualities(), etc.
get_query_names(self )
query/read names aligned at pileup column position.
Returns a list of query names at pileup column position.
Return type list
get_query_positions(self )
positions in read at pileup column position.
Returns a list of read positions
Return type list
get_query_qualities(self )
query base quality scores at pileup column position.
Returns a list of quality scores
Return type list
get_query_sequences(self, bool mark_matches=False, bool mark_ends=False, bool
add_indels=False)
query bases/sequences at pileup column position.
Optionally, the bases/sequences can be annotated according to the samtools mpileup format. This is the
format description from the samtools mpileup tool:
22 Chapter 1. Contents
pysam documentation, Release 0.19.1
To reproduce samtools mpileup format, set all of mark_matches, mark_ends and add_indels to True.
Parameters
• mark_matches (bool) – If True, output bases matching the reference as “.” or “,” for
forward and reverse strand, respectively. This mark requires the reference sequence. If no
reference is present, this option is ignored.
• mark_ends (bool) – If True, add markers “^” and “$” for read start and end, respec-
tively.
• add_indels (bool) – If True, add bases for bases inserted into or skipped from the
reference. The latter requires a reference sequence file to have been given, e.g. via
pileup(fastafile = . . . ). If no reference sequence is available, skipped bases are represented
as ‘N’s.
Returns a list of bases/sequences per read at pileup column position.
Return type list
n
deprecated, use nsegments instead.
nsegments
number of reads mapping to this column.
Note that this number ignores the base quality filter.
pileups
list of reads (pysam.PileupRead) aligned to this column
pos
deprecated, use reference_pos instead.
reference_id
the reference sequence number as defined in the header
reference_name
reference name (None if no AlignmentFile is associated)
reference_pos
the position in the reference sequence (0-based).
1.2. API 23
pysam documentation, Release 0.19.1
set_min_base_quality(self, min_base_quality)
set the minimum base quality for this pileup column.
tid
deprecated, use reference_id instead.
class pysam.PileupRead
Representation of a read aligned to a particular position in the reference sequence.
alignment
a pysam.AlignedSegment object of the aligned read
indel
indel length for the position following the current pileup site.
This quantity peeks ahead to the next cigar operation in this alignment. If the next operation is an insertion,
indel will be positive. If the next operation is a deletion, it will be negation. 0 if the next operation is not
an indel.
is_del
1 iff the base on the padded read is a deletion
is_head
1 iff the base on the padded read is the left-most base.
is_refskip
1 iff the base on the padded read is part of CIGAR N op.
is_tail
1 iff the base on the padded read is the right-most base.
level
the level of the read in the “viewer” mode. Note that this value is currently not computed.
query_position
position of the read base at the pileup site, 0-based. None if is_del or is_refskip is set.
query_position_or_next
position of the read base at the pileup site, 0-based.
If the current position is a deletion, returns the next aligned base.
class pysam.IndexedReads(AlignmentFile samfile, int multiple_iterators=True)
Index a Sam/BAM-file by query name while keeping the original sort order intact.
The index is kept in memory and can be substantial.
By default, the file is re-opened to avoid conflicts if multiple operators work on the same file. Set multi-
ple_iterators = False to not re-open samfile.
Parameters
• samfile (AlignmentFile) – File to be indexed.
• multiple_iterators (bool) – Flag indicating whether the file should be reopened.
Reopening prevents existing iterators being affected by the indexing.
build(self )
build the index.
find(self, query_name)
find query_name in index.
Returns Returns an iterator over all reads with query_name.
24 Chapter 1. Contents
pysam documentation, Release 0.19.1
TabixFile opens tabular files that have been indexed with tabix.
class pysam.TabixFile
Random access to bgzf formatted files that have been indexed by tabix.
The file is automatically opened. The index file of file <filename> is expected to be called <filename>.
tbi by default (see parameter index).
Parameters
• filename (string) – Filename of bgzf file to be opened.
• index (string) – The filename of the index. If not set, the default is to assume that the
index is called filename.tbi
• mode (char) – The file opening mode. Currently, only r is permitted.
• parser (pysam.Parser) – sets the default parser for this tabix file. If parser is None,
the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor
that will return parsed data (see for example asTuple and asGTF).
• encoding (string) – The encoding passed to the parser
• threads (integer) – Number of threads to use for decompressing Tabix files. (De-
fault=1)
Raises
• ValueError – if index file is missing.
• IOError – if file could not be opened
close(self )
closes the pysam.TabixFile.
contigs
list of chromosome names
fetch(self, reference=None, start=None, end=None, region=None, parser=None, multi-
ple_iterators=False)
fetch one or more rows in a region using 0-based indexing. The region is specified by reference, start and
end. Alternatively, a samtools region string can be supplied.
Without reference or region all entries will be fetched.
If only reference is set, all reads matching on reference will be fetched.
If parser is None, the default parser will be used for parsing.
Set multiple_iterators to true if you will be using multiple iterators on the same file at the same time.
The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file.
Re-opening a file creates some overhead, so beware.
header
the file header.
The file header consists of the lines at the beginning of a file that are prefixed by the comment character #.
1.2. API 25
pysam documentation, Release 0.19.1
Note: The header is returned as an iterator presenting lines without the newline character.
26 Chapter 1. Contents
pysam documentation, Release 0.19.1
contig = vcf.contig
first_sample_genotype = vcf[0]
second_sample_genotype = vcf[1]
class pysam.asBed
converts a tabix row into a bed record with the following fields:
Only the first three fields are required. Additional fields are optional, but if one is defined, all the preceding need
to be defined as well.
class pysam.asGTF
converts a tabix row into a GTF record with the following fields:
GTF formatted entries also define the following fields that are derived from the attributes field:
1.2. API 27
pysam documentation, Release 0.19.1
Name Content
gene_id the gene identifier
transcript_id the transcript identifier
class pysam.FastaFile
Random access to fasta formatted files that have been indexed by faidx.
The file is automatically opened. The index file of file <filename> is expected to be called <filename>.
fai.
Parameters
• filename (string) – Filename of fasta file to be opened.
• filepath_index (string) – Optional, filename of the index. By default this is the
filename + “.fai”.
• filepath_index_compressed (string) – Optional, filename of the index if fasta
file is. By default this is the filename + “.gzi”.
Raises
• ValueError – if index file is missing
• IOError – if file could not be opened
close(self )
close the file.
closed
bool indicating the current state of the file object. This is a read-only attribute; the close() method changes
the value.
fetch(self, reference=None, start=None, end=None, region=None)
fetch sequences in a region.
A region can either be specified by reference, start and end. start and end denote 0-based, half-open
intervals.
Alternatively, a samtools region string can be supplied.
If any of the coordinates are missing they will be replaced by the minimum (start) or maximum (end)
coordinate.
Note that region strings are 1-based, while start and end denote an interval in python coordinates. The
region is specified by reference, start and end.
Returns string
Return type a string with the sequence specified by the region.
Raises
• IndexError – if the coordinates are out of range
• ValueError – if the region is invalid
filename
filename associated with this object. This is a read-only attribute.
28 Chapter 1. Contents
pysam documentation, Release 0.19.1
get_reference_length(self, reference)
return the length of reference.
is_open(self )
return true if samfile has been opened.
lengths
tuple with the lengths of reference sequences.
nreferences
int with the number of reference sequences in the file. This is a read-only attribute.
references
tuple with the names of reference sequences.
class pysam.FastxFile
Stream access to fasta or fastq formatted files.
The file is automatically opened.
Entries in the file can be both fastq or fasta formatted or even a mixture of the two.
This file object permits iterating over all entries in the file. Random access is not implemented. The iteration
returns objects of type FastqProxy
Parameters
• filename (string) – Filename of fasta/fastq file to be opened.
• persist (bool) – If True (default) make a copy of the entry in the file during iteration. If
set to False, no copy will be made. This will permit much faster iteration, but an entry will
not persist when the iteration continues and an entry is read-only.
Notes
Examples
close(self )
close the file.
1.2. API 29
pysam documentation, Release 0.19.1
closed
bool indicating the current state of the file object. This is a read-only attribute; the close() method changes
the value.
filename
string with the filename associated with this object.
is_open(self )
return true if samfile has been opened.
next
class pysam.FastqProxy
A single entry in a fastq file.
get_quality_array(self, int offset=33) → array
return quality values as integer array after subtracting offset.
name
The name of each entry in the fastq file.
quality
The quality score of each entry in the fastq file, represented as a string.
sequence
The sequence of each entry in the fastq file.
f = pysam.VariantFile('ex1.bcf','r')
If mode is not specified, we will try to auto-detect the file type. All of the following should
work:
f1 = pysam.VariantFile('ex1.bcf')
f2 = pysam.VariantFile('ex1.vcf')
f3 = pysam.VariantFile('ex1.vcf.gz')
30 Chapter 1. Contents
pysam documentation, Release 0.19.1
1.2. API 31
pysam documentation, Release 0.19.1
reset(self )
reset file position to beginning of file just after the header.
subset_samples(self, include_samples)
Read only a subset of samples to reduce processing time and memory. Must be called prior to retrieving
records.
write(self, VariantRecord record) → int
write a single pysam.VariantRecord to disk.
returns the number of bytes written.
class pysam.VariantHeader
header information for a VariantFile object
add_line(self, line)
Add a metadata line to this header
add_meta(self, key, value=None, items=None)
Add metadata to this header
add_record(self, VariantHeaderRecord record)
Add an existing VariantHeaderRecord to this header
add_sample(self, name)
Add a new sample to this header
add_samples(self, *args)
Add several new samples to this header. This function takes multiple arguments, each of which may be
either a sample name or an iterable returning sample names (e.g., a list of sample names).
alts
alt metadata (dict ID->record).
The data returned just a snapshot of alt records, is created every time the property is requested, and modi-
fications will not be reflected in the header metadata and vice versa.
i.e. it is just a dict that reflects the state of alt records at the time it is created.
contigs
contig information (VariantHeaderContigs)
copy(self )
filters
filter metadata (VariantHeaderMetadata)
formats
format metadata (VariantHeaderMetadata)
info
info metadata (VariantHeaderMetadata)
merge(self, VariantHeader header)
new_record(self, contig=None, start=0, stop=0, alleles=None, id=None, qual=None, filter=None,
info=None, samples=None, **kwargs)
Create a new empty VariantRecord.
Arguments are currently experimental. Use with caution and expect changes in upcoming releases.
records
header records (VariantHeaderRecords)
samples
32 Chapter 1. Contents
pysam documentation, Release 0.19.1
version
VCF version
class pysam.VariantRecord(*args, **kwargs)
Variant record
alleles
tuple of reference allele followed by alt alleles
alts
tuple of alt alleles
chrom
chromosome/contig name
contig
chromosome/contig name
copy(self )
return a copy of this VariantRecord object
filter
filter information (see VariantRecordFilter)
format
sample format metadata (see VariantRecordFormat)
id
record identifier or None if not available
info
info data (see VariantRecordInfo)
pos
record start position on chrom/contig (1-based inclusive)
qual
phred scaled quality score or None if not available
ref
reference allele
rid
internal reference id number
rlen
record length on chrom/contig (aka rec.stop - rec.start)
samples
sample data (see VariantRecordSamples)
start
record start position on chrom/contig (0-based inclusive)
stop
record stop position on chrom/contig (0-based exclusive)
translate(self, VariantHeader dst_header)
class pysam.VariantHeaderRecord(*args, **kwargs)
header record from a VariantHeader object
attrs
sequence of additional header attributes
1.2. API 33
pysam documentation, Release 0.19.1
1.2.6 HTSFile
34 Chapter 1. Contents
pysam documentation, Release 0.19.1
compression
File compression.
One of NONE, GZIP, BGZF, CUSTOM.
description
Vaguely human readable description of the file format
format
File format.
One of UNKNOWN, BINARY_FORMAT, TEXT_FORMAT, SAM, BAM, BAI, CRAM, CRAI, VCF,
BCF, CSI, GZI, TBI, BED.
get_reference_name(self, tid)
return contig name corresponding to numerical tid
get_tid(self, contig)
return the numerical tid corresponding to contig
returns -1 if contig is not known.
is_bam
return True if HTSFile is reading or writing a BAM alignment file
is_bcf
return True if HTSFile is reading or writing a BCF variant file
is_closed
return True if HTSFile is closed.
is_cram
return True if HTSFile is reading or writing a BAM alignment file
is_open
return True if HTSFile is open and in a valid state.
is_read
return True if HTSFile is open for reading
is_sam
return True if HTSFile is reading or writing a SAM alignment file
is_valid_reference_name(self, contig)
return True if the contig name contig is valid; False otherwise.
is_valid_tid(self, tid)
return True if the numerical tid is valid; False otherwise.
returns -1 if contig is not known.
is_vcf
return True if HTSFile is reading or writing a VCF variant file
is_write
return True if HTSFile is open for writing
parse_region(self, contig=None, start=None, stop=None, region=None, tid=None, reference=None,
end=None)
parse alternative ways to specify a genomic region. A region can either be specified by contig, start and
stop. start and stop denote 0-based, half-open intervals. reference and end are also accepted for backward
compatibility as synonyms for contig and stop, respectively.
Alternatively, a samtools region string can be supplied.
1.2. API 35
pysam documentation, Release 0.19.1
If any of the coordinates are missing they will be replaced by the minimum (start) or maximum (stop)
coordinate.
Note that region strings are 1-based inclusive, while start and stop denote an interval in 0-based, half-open
coordinates (like BED files and Python slices).
If contig or region or are *, unmapped reads at the end of a BAM file will be returned. Setting either to .
will iterate from the beginning of the file.
Returns a tuple of flag, tid, start and stop. The flag indicates whether no coordinates were
supplied and the genomic region is the complete genomic space.
Return type tuple
Raises ValueError – for invalid or out of bounds regions.
reset(self )
reset file position to beginning of file just after the header.
Returns The file position after moving the file pointer.
Return type pointer
seek(self, uint64_t offset)
move file pointer to position offset, see pysam.HTSFile.tell().
tell(self )
return current file position, see pysam.HTSFile.seek().
version
Tuple of file format version numbers (major, minor)
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
The above command opens the file ex1.bam for reading. The b qualifier indicates that this is a BAM file. To open a
SAM file, type:
import pysam
samfile = pysam.AlignmentFile("ex1.sam", "r")
import pysam
samfile = pysam.AlignmentFile("ex1.cram", "rc")
Reads are obtained through a call to the pysam.AlignmentFile.fetch() method which returns an iterator.
Each call to the iterator will returns a pysam.AlignedSegment object:
36 Chapter 1. Contents
pysam documentation, Release 0.19.1
pysam.AlignmentFile.fetch() returns all reads overlapping a region sorted by the first aligned base in the
reference sequence. Note that it will also return reads that are only partially overlapping with the region. Thus the
reads returned might span a region that is larger than the one queried.
In contrast to fetching, the pileup engine returns for each base in the reference sequence the reads that map to that
particular position. In the typical view of reads stacking vertically on top of the reference sequence similar to a
multiple alignment, fetching iterates over the rows of this implied multiple alignment while a pileup iterates over the
columns.
Calling pileup() will return an iterator over each column (reference base) of a specified region. Each call to the
iterator returns an object of the type pysam.PileupColumn that provides access to all the reads aligned to that
particular reference position as well as some additional information:
The following example shows how a new BAM file is constructed from scratch. The important part here is that the
pysam.AlignmentFile class needs to receive the sequence identifiers. These can be given either as a dictionary
in a header structure, as lists of names and sizes, or from a template file. Here, we use a header dictionary:
Pysam does not support reading and writing from true python file objects, but it does support reading and writing from
stdin and stdout. The following example reads from stdin and writes to stdout:
It will also work with BAM files. The following script converts a BAM formatted file on stdin to a SAM formatted file
on stdout:
Note that the file open mode needs to changed from r to rb.
Commands available in samtools are available as simple function calls. Command line options are provided as argu-
ments. For example:
Or for example:
print(pysam.sort.usage())
pysam.sort()
Messages from samtools on stderr are captured and are available using the getMessages() method:
pysam.sort.getMessage()
Note that only the output from the last invocation of a command is stored.
38 Chapter 1. Contents
pysam documentation, Release 0.19.1
In order for pysam to make the output of samtools commands accessible the stdout stream needs to be redirected. This
is the default behaviour, but can cause problems in environments such as the ipython notebook. A solution is to pass
the catch_stdout keyword argument:
pysam.sort(catch_stdout=False)
Note that this means that output from commands which produce output on stdout will not be available. The only
solution is to run samtools commands through subprocess.
To open a tabular file that has been indexed with tabix, use TabixFile:
import pysam
tbx = pysam.TabixFile("example.bed.gz")
This will return a tuple-like data structure in which columns can be retrieved by numeric index:
By providing a parser to fetch or TabixFile, the data will we presented in parsed form:
Pre-built parsers are available for bed (asBed) formatted files and gtf (asGTF) formatted files. Thus, additional
fields become available through named access, for example:
_pysam.VariantFile.fetch() iterates over VariantRecord objects which provides access to simple vari-
ant attributes such as contig, pos, ref:
but also to complex attributes such as the contents to the info, format and genotype columns. These complex
attributes are views on the underlying htslib data structures and provide dictionary-like access to the data:
The header attribute (VariantHeader) provides access information stored in the vcf header. The complete header
can be printed:
>>> print(bcf_in.header)
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples
With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build
129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##contig=<ID=M>
##contig=<ID=17>
##contig=<ID=20>
##bcftools_viewVersion=1.3+htslib-1.3
##bcftools_viewCommand=view -O b -o example_vcf42.bcf
example_vcf42.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
˓→NA00002 NA0000
Individual contents such as contigs, info fields, samples, formats can be retrieved as attributes from header:
>>> print(bcf_in.header.contigs)
<pysam.cbcf.VariantHeaderContigs object at 0xf250f8>
To convert these views to native python types, iterate through the views:
>>> print(list((bcf_in.header.contigs)))
['M', '17', '20']
>>> print(list((bcf_in.header.filters)))
['PASS', 'q10', 's50']
>>> print(list((bcf_in.header.info)))
['NS', 'DP', 'AF', 'AA', 'DB', 'H2']
(continues on next page)
40 Chapter 1. Contents
pysam documentation, Release 0.19.1
Alternatively, it is possible to iterate through all records in the header returning objects of type
VariantHeaderRecord::
Using pyximport, it is (relatively) straight-forward to access pysam internals and the underlying samtools library. An
example is provided in the tests directory. The example emulates the samtools flagstat command and consists of
three files:
1. The main script pysam_flagstat.py. The important lines in this script are:
import pyximport
pyximport.install()
import _pysam_flagstat
...
flag_counts = _pysam_flagstat.count(pysam_in)
The first part imports, sets up pyximport and imports the cython module _pysam_flagstat. The second
part calls the count method in _pysam_flagstat.
2. The cython implementation _pysam_flagstat.pyx. This script imports the pysam API via:
This statement imports, amongst others, AlignedSegment into the namespace. Speed can be gained from
declaring variables. For example, to efficiently iterate over a file, an AlignedSegment object is declared:
3. A pyxbld providing pyximport with build information. Required are the locations of the samtools and pysam
header libraries of a source installation of pysam plus the csamtools.so shared library. For example:
If the script pysam_flagstat.py is called the first time, pyximport will compile the cython extension
_pysam_flagstat.pyx and make it available to the script. Compilation requires a working compiler and cython
installation. Each time _pysam_flagstat.pyx is modified, a new compilation will take place.
pyximport comes with cython.
Pysam can be installed through conda, pypi and from the repository. The recommended way to install pysam is through
conda/bioconda.
This will install pysam from the bioconda channel and automatically makes sure that dependencies are installed. Also,
compilation flags will be set automatically, which will potentially save a lot of trouble on OS X.
Pysam provides a python interface to the functionality contained within the htslib C library. There are two ways that
these two can be combined, builtin and external.
Builtin
42 Chapter 1. Contents
pysam documentation, Release 0.19.1
This will compile the builtin htslib source code within pysam.
htslib can be configured at compilation to turn on additional features such support using encrypted configurations,
enable plugins, and more. See the htslib project for more information on these.
Pysam will attempt to configure htslib to turn on some advanced features. If these fail, for example due to missing
library dependencies (libcurl, libcrypto), it will fall back to conservative defaults.
Options can be passed to the configure script explicitly by setting the environment variable HT-
SLIB_CONFIGURE_OPTIONS. For example:
export HTSLIB_CONFIGURE_OPTIONS=--enable-plugins
pip install pysam
External
pysam can be combined with an externally installed htslib library. This is a good way to avoid duplication of li-
braries. To link against an externally installed library, set the environment variables HTSLIB_LIBRARY_DIR and
HTSLIB_INCLUDE_DIR before installing:
export HTSLIB_LIBRARY_DIR=/usr/local/lib
export HTSLIB_INCLUDE_DIR=/usr/local/include
pip install pysam
Note that the location of the file libhts.so needs to be known to the linker once you run pysam, for example by
setting the environment-varirable LD_LIBRARY_PATH.
Note that generally the pysam and htslib version need to be compatible. See the release notes for more information.
pysam depends on cython to provide the connectivity to the htslib C library. The installation of the source tarball
(.tar.gz) contains pre-built C-files and cython needs not be present during installation. However, when installing
from the repository, cython needs to be installed beforehand.
To install from repository, type:
1.8.4 Requirements
1.9 FAQ
Pysam has not been published in print. When referring to pysam, please use the github URL: https://github.com/
pysam-developers/pysam. As pysam is a wrapper around htslib and the samtools package, I suggest citing [Li.2009],
[Bonfield.2021], and/or [Danecek.2021], as appropriate.
Pysam is a mix of python and C code. Instructions within python are generally made thread-safe through python’s
global interpreter lock (GIL). This ensures that python data structures will always be in a consistent state.
If an external function outside python is called, the programmer has a choice to keep the GIL in place or to release it.
Keeping the GIL in place will make sure that all python threads wait until the external function has completed. This is
a safe option and ensures thread-safety.
Alternatively, the GIL can be released while the external function is called. This will allow other threads to run
concurrently. This can be beneficial if the external function is expected to halt, for example when waiting for data to
read or write. However, to achieve thread-safety, the external function needs to be implemented with thread-safety in
mind. This means that there can be no shared state between threads, or if there is shared, it needs to be controlled to
prevent any access conflicts.
Pysam generally uses the latter option and aims to release the GIL for I/O intensive tasks. This is generally fine, but
thread-safety of all parts have not been fully tested.
A related issue is when different threads read from the same file object - or the same thread uses two iterators over
a file. There is only a single file-position for each opened file. To prevent this from happening, use the option
multiple_iterators=True when calling a fetch() method. This will return an iterator on a newly opened file.
pysam uses 0-based coordinates and the half-open notation for ranges as does python. Coordinates and intervals
reported from pysam always follow that convention.
Confusion might arise as different file formats might have different conventions. For example, the SAM format is
1-based while the BAM format is 0-based. It is important to remember that pysam will always conform to the python
convention and translate to/from the file format automatically.
The only exception is the region string in the fetch() and pileup() methods. This string follows the convention
of the samtools command line utilities. The same is true for any coordinates passed to the samtools command utilities
directly, such as pysam.mpileup().
iter1 = samfile.fetch("chr1")
print(next(iter1).reference_id)
iter2 = samfile.fetch("chr2")
print(next(iter2).reference_id)
print(next(iter1).reference_id)
44 Chapter 1. Contents
pysam documentation, Release 0.19.1
Note how the second iterator stops as the file pointer has moved to chr2. The correct way to work with multiple
iterators is:
samfile = pysam.AlignmentFile("pysam_ex1.bam", "rb")
The reason for this behaviour is that every iterator needs to keep track of its current position in the file. Within pysam,
each opened file can only keep track of one file position and hence there can only be one iterator per file. Using
the option multiple_iterators=True will return an iterator within a newly opened file. This iterator will not
interfere with existing iterators as it has its own file handle associated with it.
Note that re-opening files incurs a performance penalty which can become severe when calling fetch() often. Thus,
multiple_iterators is set to False by default.
fetch() will only iterate over alignments in the SAM/BAM file. The following thus always works:
bf = pysam.AlignmentFile(fname, "rb")
for r in bf.fetch():
assert not r.is_unmapped
If the SAM/BAM file contains unaligned reads, they can be included in the iteration by adding the until_eof=True
flag:
bf = pysam.AlignmentFile(fname, "rb")
for r in bf.fetch(until_eof=True):
if r.is_unmapped:
print("read is unmapped")
fetch() requires an index when iterating over a SAM/BAM file. To iterate over a file without an index, use
until_eof=True:
1.9. FAQ 45
pysam documentation, Release 0.19.1
bf = pysam.AlignmentFile(fname, "rb")
for r in bf.fetch(until_eof=True):
print(r)
1.9.7 BAM files with a large number of reference sequences are slow
If you have many reference sequences in a BAM file, the following might be slow:
The reason is that track.fetch() will iterate through the BAM file for each reference sequence in the order as it is defined
in the header. This might require a lot of jumping around in the file. To avoid this, use:
Spliced reads are reported within samfile.pileup. To ignore these in your analysis, test the flags is_del == True
and indel == 0 in the PileupRead object.
Editing reads in-place generally works, though there is one quirk to be aware of. Assigning to AlignedSeg-
ment.query_sequence will invalidate any quality scores in AlignedSegment.query_qualities. The reason is that sam-
tools manages the memory of the sequence and quality scores together and thus requires them to always be of the same
length or 0.
Thus, to in-place edit the sequence and quality scores, copies of the quality scores need to be taken. Consider trimming
for example:
quals = read.query_qualities
read.query_sequence = read.query_sequence[5:10]
read.query_qualities = quals[5:10]
SNP calling is highly complex and heavily parameterized. There was a danger that the pysam implementations might
show different behaviour from the samtools implementation, which would have caused a lot of confusion.
The best way to use samtools SNP calling from python is to use the pysam.mpileup() command and parse the
output directly.
46 Chapter 1. Contents
pysam documentation, Release 0.19.1
Pysam works by providing proxy objects to objects defined within the C-samtools package. Thus, some attention must
be paid to the lifetime of objects. The following to code snippets will cause an error:
s = AlignmentFile('ex1.bam')
for p in s.pileup('chr1', 1000,1010):
pass
for pp in p.pileups:
print(pp)
The iteration has finished, thus the contents of p are invalid. Another variation of this:
Again, the iteration finishes as the temporary iterator created by pileup goes out of scope. The solution is to keep a
handle to the iterator that remains alive:
Compiling pysam can be tricky as there are numerous variables that differ between build environments such as OS,
version, python version, and compiler. It is difficult to build software that builds cleanly on all systems and the process
might fail. Please see the pysam user group for common issues.
If there is a build issue, read the generated output carefully - generally the cause of the problem is among the first
errors to be reported. For example, you will need to have the development version of python installed that includes
the header files such as Python.h. If that file is missing, the compiler will report this at the very top of its error
messages but will follow it with any unknown function or variable definition it encounters later on.
General advice is to always use the latest version on python and cython when building pysam. There are some known
incompatibilities:
• Python 3.4 requires cython 0.20.2 or later (see here)
In version 0.10.0 and onwards, all pysam extension modules contain a lib-prefix. This facilates linking against pysam
extension modules with compilers that require to start with lib. As a consequence, all code using pysam extension
modules directly will need to be adapted. For example,:
cimport pysam.csamtools
will become:
cimport pysam.libcsamtools
1.9. FAQ 47
pysam documentation, Release 0.19.1
See instructions in import.py to import the latest versions of htslib, samtools and bcftools.
Unit tests are in the tests directory. To run all unit tests, run:
pytest tests
Most tests use test data from the tests/*_data directories. Some of these test data files are generated from other
files in these directories, which is done by running make in each directory:
make -C tests/pysam_data
# etc
Alternatively if any tests/*_data/all.stamp file is not already present, running the unit tests should generate
that directory’s data files automatically.
1.10.4 Benchmarking
To run the benchmarking suite, make sure that pytest-benchmark is installed. To run all benchmarks, type:
pytest tests/*_bench.py
1.10.5 Contributors
48 Chapter 1. Contents
pysam documentation, Release 0.19.1
This release wraps htslib/samtools/bcftools version 1.13. Corresponding to new samtools commands, pysam.samtools
now has additional functions ampliconclip, ampliconstats, fqimport, and version.
Bugs fixed:
• [#447] The maximum QNAME length is fully restored to 254
• [#506, #958, #1000] Don’t crash the Python interpreter on pysam.bcftools.*() errors
• [#603] count_coverage: ignore reads that have no SEQ field
• [#928] Fix pysam.bcftools.mpileup() segmentation fault
• [#983] Add win32/*.[ch] to MANIFEST.in
• [#994] Raise exception in get_tid() if header could not be parsed
• [#995] Choose TBI/CSI in tabix_index() via both min_shift and csi
• [#996] AlignmentFile.fetch() now works with large chromosomes longer than 229 bases
This release wraps htslib/bcftools version 1.10.2 and samtools version 1.10. The following bugs reported against
pysam are fixed due to this:
• [#447] Writing out QNAME longer than 251 characters corrupts BAM
• [#640, #734, #843] Setting VariantRecord pos or stop raises error
• [#738, #919] FastxFile truncates concatenated plain gzip compressed files
Additional bugfixes:
• [#840] Pileup doesn’t work on python3 when index_filename is used
• [#886] FastqProxy raises ValueError when instantiated from python
• [#904] VariantFile.fetch() throws ValueError on files with no records
• [#909] Fix incorrect quoting in VariantFile contig records
• [#915, #916] Implement pileup() for unindexed files and/or SAM files
Backwards incompatible changes:
• The samtools import command was removed in samtools 1.10, so pysam no longer exports a samimport function.
Use pysam.view() instead.
50 Chapter 1. Contents
pysam documentation, Release 0.19.1
Bugfix release. Principal reason for release is to update cython version in order to fix pip install pysam with python
3.8.
• [#879] Fix add_meta function in libcbcf.pyx, so meta-information lines in header added with this function have
double-quoting rules in accordance to rules specified in VCF4.2 and VCF4.3 specifications
• [#863] Force arg to bytes to support non-ASCII encoding
• [#875] Bump minimum Cython version
• [#868] Prevent segfault on Python 2.7 AlignedSegment.compare(other=None)
• [#867] Fix wheel building on TravisCI
• [#863] Force arg to bytes to support non-ASCII encoding
• [#799] disambiguate interpretation of bcf_read return code
• [#841] Fix silent truncation of FASTQ with bad q strings
• [#846] Prevent segmentation fault on ID, when handling malformed records
• [#829] Run configure with the correct CC/CFLAGS/LDFLAGS env vars
Bugfix release.
• [#824] allow reading of UTF-8 encoded text in VCF/BCF files.
• [#780] close all filehandles before opening new ones in pysam_dispatch
• [#773] do not cache VariantRecord.id to avoid memory leak
• [#781] default of multiple_iterators=True is changed to False for CRAM files.
• [#825] fix collections.abc import
• [#825] use bcf_hdr_format instead of bcf_hdr_fmt_text, fix memcpy bug when setting FORMAT fields.
• [#804] Use HTSlib’s kstring_t, which reallocates and enlarges its memory as needed, rather than a fixed-size
char buffer.
• [#814] Build wheels and upload them to PyPI
• [#755] Allow passing flags and arguments to index methods
• [#763] Strip 0 in header check
• [#761] Test Tabix index contents, not the compression
Bugfix release.
• [#746] catch pileup itorator out-of-scope segfaults
• [#747] fix faixd fetch with region
• [#748] increase max_pos to (1<<31)-1
• [#645] Add missing macOS stub files in MANIFEST.in, @SoapZA
Bugfix release.
• [#716] raise ValueError if tid is out of range when writing
• [#697] release version using cython 0.28.5 for python 3.7 compatibility
This is mostly a bugfix release, though bcftools has now also been upgraded to 1.7.0.
• [#621] Add a warning to count_coverage when an alignment has an empty QUAL field
• [#635] Speed-up of AlignedSegment.find_intro()
• treat border case of all bases in pileup column below quality score
• [#634] Fix access to pileup reference_sequence
52 Chapter 1. Contents
pysam documentation, Release 0.19.1
This release wraps htslib/samtools/bcftools versions 1.6.0 and contains a series of bugfixes.
• [#544] reading header from remote TabixFiles now works.
• [#531] add missing tag types H and A. A python float will now be added as ‘f’ type instead of ‘d’ type.
• [#543] use FastaFile instead of Fastafile in pileup.
• [#546] set is_modified flag in setAttribute so updated attributes are output.
• [#537] allow tabix index files to be created in a custom location.
• [#530] add get_index_statistics() method
This release wraps htslib/samtools/bcftools versions 1.5.0 and contains a series of bugfixes.
• [#473] A new FastxRecord class that can be instantiated from class and modified in-place. Replaces Persistent-
FastqProxy.
• [#521] In AligmentFile, Simplify file detection logic and allow remote index files
– Removed attempts to guess data and index file names; this is magic left to htslib.
– Removed file existence check prior to opening files with htslib
– Better error checking after opening files that raise the appropriate error (IOError for when errno is set,
ValueError otherwise for backward compatibility).
– Report IO errors when loading an index by name.
– Allow remote indices (tested using S3 signed URLs).
header = pysam.AlignmentHeader(
reference_names=["chr1", "chr2"],
reference_lengths=[1000, 1000])
read = pysam.AlignedSegment(header)
This will affect all code that instantiates AlignedSegment objects directly. We have not yet merged to allow users to
provide feed-back. The pull-request is here: https://github.com/pysam-developers/pysam/pull/518 Please comment on
github.
This release wraps htslib/samtools/bcfools versions 1.4.1 in response to a security fix in these libraries. Additionally
the following issues have been fixed:
• [#452] add GFF3 support for tabix parsers
• [#461] Multiple fixes related to VariantRecordInfo and handling of INFO/END
54 Chapter 1. Contents
pysam documentation, Release 0.19.1
• [#447] limit query name to 251 characters (only partially addresses issue)
VariantFile and related object fixes
• Restore VariantFile.__dealloc__
• Correct handling of bcf_str_missing in bcf_array_to_object and bcf_object_to_array
• Added update() and pop() methods to some dict-like proxy objects
• scalar INFO entries could not be set again after being deleted
• VariantRecordInfo.__delitem__ now allows unset flags to be deleted without raising a KeyError
• Multiple other fixes for VariantRecordInfo methods
• INFO/END is now accessible only via VariantRecord.stop and VariantRecord.rlen. Even if present behind the
scenes, it is no longer accessible via VariantRecordInfo.
• Add argument to issue a warning instead of an exception if input appears to be truncated
Other features and fixes:
• Make AlignmentFile __dealloc__ and close more stringent
• Add argument AlignmentFile to issue a warning instead of an exception if input appears to be truncated
Bugfix release
• [#440] add deprecated ‘always’ option to infer_query_length for backwards compatibility.
This release wraps the latest versions of htslib/samtools/bcftools and implements a few bugfixes.
• [#413] Wrap HTSlib/Samtools/BCFtools 1.4
• [#422] Fix missing pysam.sort.usage() message
• [#411] Fix BGZfile initialization bug
• [#412] Add seek support for BGZFile
• [#395] Make BGZfile iterable
• [#433] Correct getQueryEnd
• [#419] Export SAM enums such as pysam.CMATCH
• [#415] Fix access by tid in AlignmentFile.fetch()
• [#405] Writing SAM now outputs a header by default.
• [#332] split infer_query_length(always) into infer_query_length and infer_read_length
This release implements further functionality in the VariantFile API and includes several bugfixes:
• treat special case -c option in samtools view outputs to stdout even if -o given, fixes #315
• permit reading BAM files with CSI index, closes #370
cimport pysam.csamtools
will become:
cimport pysam.libcsamtools
This is a bugfix release addressing some installation problems in pysam 0.9.0, in particular:
• patch included htslib to work with older libcurl versions, fixes #262.
• do not require cython for python 3 install, fixes #260
• FastaFile does not accept filepath_index any more, see #270
• add AlignedSegment.get_cigar_stats method.
• py3 bugfix in VariantFile.subset_samples, fixes #272
• add missing sysconfig import, fixes #278
• do not redirect stdout, but instead write to a separately created file. This should resolve issues when pysam is
used in notebooks or other environments that redirect stdout.
• wrap htslib-1.3.1, samtools-1.3.1 and bcftools-1.3.1
• use bgzf throughout instead of gzip
• allow specifying a fasta reference for CRAM file when opening for both read and write, fixes #280
56 Chapter 1. Contents
pysam documentation, Release 0.19.1
Overview
The 0.9.0 release upgrades htslib to htslib 1.3 and numerous other enhancements and bugfixes. See below for a detailed
list.
Htslib 1.3 comes with additional capabilities for remote file access which depend on the presence of optional system
libraries. As a consequence, the installation script setup.py has become more complex. For an overview, see
Installing pysam. We have tested installation on linux and OS X, but could not capture all variations. It is possible that
a 0.9.1 release might follow soon addressing installation issues.
The VariantFile class provides access to vcf and bcf formatted files. The class is certainly usable and interface is
reaching completion, but the API and the functionality is subject to change.
pysam.samtools.view(self.filename).splitlines(True)
This release contains numerous bugfixes and a first implementation of a pythonic interface to VCF/BCF files. Note
that this code is still incomplete and preliminary, but does offer a nearly complete immutable Pythonic interface to
VCF/BCF metadata and data with reading and writing capability.
Potential isses when upgrading from v0.8.3:
• binary tags are now returned as python arrays
• renamed several methods for pep8 compatibility, old names still retained for backwards compatibility, but should
be considered deprecated.
– gettid() is now get_tid()
– getrname() is now get_reference_name()
– parseRegion() is now parse_region()
• some methods have changed for pep8 compatibility without the old names being present:
– fromQualityString() is now qualitystring_to_array()
– toQualityString() is now qualities_to_qualitystring()
• faidx now returns strings and not binary strings in py3.
• The cython components have been broken up into smaller files with more specific content. This will affect users
using the cython interfaces.
Edited list of commit log changes:
• fixes AlignmentFile.check_index to return True
• add RG/PM header tag - closes #179
• add with_seq option to get_aligned_pairs
• use char * inside reconsituteReferenceSequence
• add soft clipping for get_reference_sequence
• add get_reference_sequence
• queryEnd now computes length from cigar string if no sequence present, closes #176
• tolerate missing space at end of gtf files, closes #162
• do not raise Error when receiving output on stderr
• add docu about fetching without index, closes #170
• FastaFile and FastxFile now return strings in python3, closes #173
• py3 compat: relative -> absolute imports.
• add reference_name and next_reference_name attributes to AlignedSegment
• add function signatures to cvcf cython. Added note about other VCF code.
• add context manager functions to FastaFile
• add reference_name and next_reference_name attributes to AlignedSegment
• PileupColumn also gets a reference_name attribute.
• add context manager functions to FastaFile
• TabixFile.header for remote files raises AttributeError, fixes #157
58 Chapter 1. Contents
pysam documentation, Release 0.19.1
60 Chapter 1. Contents
pysam documentation, Release 0.19.1
• Disabled features
– IteratorColumn.setMask() disabled as htslib does not implement this functionality?
• Not implemented yet:
– reading SAM files without header
Tabix files between version 0.7.8 and 0.8.0 are not compatible and need to be re-indexed.
While version 0.7.8 and 0.8.0 should be mostly compatible, there are some notable exceptions:
• tabix iterators will fail if there are comments in the middle or the end of a file.
• tabix raises always ValueError for invalid intervals. Previously, different types of errors were raised (KeyError,
IndexError, ValueError) depending on the type of invalid intervals (missing chromosome, out-of-range, malfor-
matted interval).
62 Chapter 1. Contents
pysam documentation, Release 0.19.1
64 Chapter 1. Contents
pysam documentation, Release 0.19.1
1.12 Benchmarking
˓→------------------------------------------------------------------------------------
˓→-------
--------------------------------------------------------------------------------------
˓→------------------------------------------------------------------------------------
˓→------------------------------------------------------------------------------------
˓→-------
˓→------------------------------------------------------------------------------------
˓→-------
1.13 Glossary
BAM Binary SAM format. BAM files are binary formatted, indexed and allow random access.
BCF Binary VCF.
BED Browser Extensible Data format. A text file format used to store genomic regions as coordinates and associated
notations.
bgzip Utility in the htslib package to block compress genomic data files.
cigar Stands for Compact Idiosyncratic Gapped Alignment Report and represents a compressed (run-length encoded)
pairwise alignment format. It was first defined by the Exonerate Aligner, but was alter adapted and adopted as
part of the SAM standard and many other aligners. In the Python API, the cigar alignment is presented as a
list of tuples (operation,length). For example, the tuple [ (0,3), (1,5), (0,2) ] refers to an
alignment with 3 matches, 5 insertions and another 2 matches.
column
The portion of reads aligned to a single base in the reference sequence.
contig The sequence that a tid refers to. For example chr1, contig123.
CRAM CRAM is a binary format representing the same sequence alignment information as SAM and BAM, but
offering significantly better lossless compression than BAM.
faidx Utility in the samtools package to index fasta formatted files.
FASTA Simple text format containing sequence data, with only the bare minimum of metadata. Typically used for
reference sequence data.
FASTQ Simple text format containing sequence data and associated base qualities.
fetching Retrieving all mapped reads mapped to a region.
genotype
An individual’s collection of genes. It can also refer to the two alleles inherited for a particular gene.
GTF
The Gene Transfer Format is a file format used to hold information about gene structure.
hard clipping
hard clipped In hard clipped reads, part of the sequence has been removed prior to alignment. That only a subse-
quence is aligend might be recorded in the cigar alignment, but the removed sequence will not be part of the
alignment record, in contrast to soft clipped reads.
pileup Pileup
reference Synonym for contig.
1.13. Glossary 69
pysam documentation, Release 0.19.1
region A genomic region, stated relative to a reference sequence. A region consists of reference name (‘chr1’), start
(15000), and end (20000). Start and end can be omitted for regions spanning a whole chromosome. If end is
missing, the region will span from start to the end of the chromosome. Within pysam, coordinates are 0-
based half-open intervals, i.e., the first base of the reference sequence is numbered zero; and the base at position
start is part of the interval, but the base at end is not.
When a region is written as a single string using samtools-compatible notation, e.g., ‘chr1:15001-20000’, the
string’s coordinates instead represent a 1-based closed interval, i.e., both (1-based) positions 15,001 and 20,000
are part of the interval. (This example denotes the same 5,000-base region as the example in the previous
paragraph.)
SAM A textual format for storing genomic alignment information.
sam file A file containing aligned reads. The sam file can either be a BAM file or a TAM file.
soft clipping
soft clipped In alignments with soft clipping part of the query sequence are not aligned. The unaligned query se-
quence is still part of the alignment record. This is in difference to hard clipped reads.
tabix Utility in the htslib package to index bgzip compressed files.
tabix file A sorted, compressed and indexed tab-separated file created by the command line tool tabix or the com-
mands tabix_compress() and tabix_index(). The file is indexed by chromosomal coordinates.
tabix row A row in a tabix file. Fields within a row are tab-separated.
TAM Text SAM file. TAM files are human readable files of tab-separated fields. TAM files do not allow random
access.
target The sequence that a read has been aligned to. Target sequences have bot a numerical identifier (tid) and an
alphanumeric name (Reference).
tid The target id. The target id is 0 or a positive integer mapping to entries within the sequence dictionary in the
header section of a TAM file or BAM file.
VCF Variant Call Format.
70 Chapter 1. Contents
CHAPTER 2
Contents:
• genindex
• modindex
• search
71
pysam documentation, Release 0.19.1
References
See also:
Information about htslib http://www.htslib.org
The samtools homepage http://samtools.sourceforge.net
The cython C-extensions for python https://cython.org/
The python language https://www.python.org
73
pysam documentation, Release 0.19.1
74 Chapter 3. References
Bibliography
[Li.2009] The Sequence Alignment/Map format and SAMtools. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J,
Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. Bioinfor-
matics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8 btp352. PMID: 19505943.
[Bonfield.2021] HTSlib: C library for reading/writing high-throughput sequencing data. Bonfield JK, Marshall J,
Danecek P, Li H, Ohan V, Whitwham A, Keane T, Davies RM. GigaScience (2021) 10(2) giab007. PMID:
33594436.
[Danecek.2021] Twelve years of SAMtools and BCFtools. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V,
Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. GigaScience (2021) 10(2) giab008.
PMID: 33590861.
75
pysam documentation, Release 0.19.1
76 Bibliography
Index
77
pysam documentation, Release 0.19.1
78 Index
pysam documentation, Release 0.19.1
K O
key (pysam.VariantHeaderRecord attribute), 34 open() (pysam.VariantFile method), 31
keys() (pysam.AlignmentHeader method), 13 opt() (pysam.AlignedSegment method), 18
keys() (pysam.VariantHeaderRecord method), 34 overlap() (pysam.AlignedSegment method), 18
L P
lengths (pysam.AlignmentFile attribute), 10 parse_region() (pysam.HTSFile method), 35
lengths (pysam.AlignmentHeader attribute), 13 pileup, 69
lengths (pysam.FastaFile attribute), 29 pileup() (pysam.AlignmentFile method), 10
level (pysam.PileupRead attribute), 24 PileupColumn (class in pysam), 22
PileupRead (class in pysam), 24
M pileups (pysam.PileupColumn attribute), 23
mapped (pysam.AlignmentFile attribute), 10 pnext (pysam.AlignedSegment attribute), 18
pop() (pysam.VariantHeaderRecord method), 34
Index 79
pysam documentation, Release 0.19.1
80 Index
pysam documentation, Release 0.19.1
U
unmapped (pysam.AlignmentFile attribute), 12
update() (pysam.VariantHeaderRecord method), 34
V
value (pysam.VariantHeaderRecord attribute), 34
values() (pysam.AlignmentHeader method), 13
values() (pysam.VariantHeaderRecord method), 34
VariantFile (class in pysam), 30
VariantHeader (class in pysam), 32
VariantHeaderRecord (class in pysam), 33
VariantRecord (class in pysam), 33
VCF, 70
version (pysam.HTSFile attribute), 36
version (pysam.VariantHeader attribute), 32
W
write() (pysam.AlignmentFile method), 12
write() (pysam.VariantFile method), 32
Index 81