The XAM package offers high-performance tools for the SAM and BAM file formats, which are among the most widely used formats for storing sequence alignments.
If you have questions about the SAM and BAM formats or any of the terminology used when discussing these formats, see the published specification, which is maintained by the samtools group.
A very simple SAM file looks like the following:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
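As a rough illustration of the record layout shown above, the following sketch splits one alignment line into the eleven mandatory SAM fields using plain Julia, without the XAM package. The helper name `split_sam_line` and the `SAM_FIELDS` tuple are hypothetical names introduced for this example, and the line is the r001 record written with explicit tab separators, which is an assumption about how the original file is delimited.

```julia
# Names of the eleven mandatory SAM fields, in file order (per the SAM specification).
const SAM_FIELDS = (:qname, :flag, :rname, :pos, :mapq, :cigar,
                    :rnext, :pnext, :tlen, :seq, :qual)

# Hypothetical helper: split one tab-separated alignment line into named fields.
function split_sam_line(line::AbstractString)
    parts = split(chomp(line), '\t')
    n = length(SAM_FIELDS)
    length(parts) >= n ||
        throw(ArgumentError("expected at least $n tab-separated fields"))
    # Keep only the mandatory fields; anything after them is optional tag data.
    return NamedTuple{SAM_FIELDS}(Tuple(String.(parts[1:n])))
end

# The r001 record from the example, with tabs written explicitly.
line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
rec = split_sam_line(line)

flag = parse(UInt16, rec.flag)  # bitwise flag
pos = parse(Int, rec.pos)       # 1-based leftmost mapping position

println(rec.qname, " maps to ", rec.rname, " at position ", pos,
        " with flag ", Int(flag))
```

This is only meant to show how the columns line up; for real files the reader and record accessors of the package are the better choice.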
@@ -51,7 +51,7 @@ julia> find(header(reader), "SQ")
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=mitochondria LN=366924
-
In the above we can see there were 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.
The XAM package supports the following accessors for SAM.Record types.
flag(record::Record)::UInt16
Get the bitwise flag of record.
ismapped(record::Record)::Bool
Test if record is mapped.
isprimary(record::Record)::Bool
Test if record is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0.
refname(record::Record)::String
Get the reference sequence name of record.
position(record::Record)::Int
Get the 1-based leftmost mapping position of record.
rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record.
isnextmapped(record::Record)::Bool
Test if the mate/next read of record is mapped.
nextrefname(record::Record)::String
Get the reference name of the mate/next read of record.
nextposition(record::Record)::Int
Get the position of the mate/next read of record.
mappingquality(record::Record)::UInt8
Get the mapping quality of record.
cigar(record::Record)::String
Get the CIGAR string of record.
alignment(record::Record)::BioAlignments.Alignment
Get the alignment of record.
alignlength(record::Record)::Int
Get the alignment length of record.
tempname(record::Record)::String
Get the query template name of record.
templength(record::Record)::Int
Get the template length of record.
sequence(record::Record)::BioSequences.LongDNASeq
Get the segment sequence of record.
sequence(::Type{String}, record::Record)::String
Get the segment sequence of record as String.
seqlength(record::Record)::Int
Get the sequence length of record.
quality(record::Record)::Vector{UInt8}
Get the Phred-scaled base quality of record.
quality(::Type{String}, record::Record)::String
Get the ASCII-encoded base quality of record.
auxdata(record::Record)::Dict{String,Any}
Get the auxiliary data (optional fields) of record.
The XAM package supports the following accessors for BAM.Record types.
flag(record::Record)::UInt16
Get the bitwise flag of record.
ismapped(record::Record)::Bool
Test if record is mapped.
isprimary(record::Record)::Bool
Test if record is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0.
refid(record::Record)::Int
Get the reference sequence ID of record.
The ID is 1-based (i.e. the first sequence is 1) and is 0 for a record without a mapping position.
See also: BAM.rname
refname(record::Record)::String
Get the reference sequence name of record.
See also: BAM.refid
reflen(record::Record)::Int
Get the length of the reference sequence this record applies to.
position(record::Record)::Int
Get the 1-based leftmost mapping position of record.
rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record.
isnextmapped(record::Record)::Bool
Test if the mate/next read of record is mapped.
nextrefid(record::Record)::Int
Get the next/mate reference sequence ID of record.
nextrefname(record::Record)::String
Get the reference name of the mate/next read of record.
nextposition(record::Record)::Int
Get the 1-based leftmost mapping position of the next/mate read of record.
mappingquality(record::Record)::UInt8
Get the mapping quality of record.
cigar(record::Record)::String
Get the CIGAR string of record.
Note that in the BAM specification, the field called cigar typically stores the CIGAR string of the record. However, this is not always true: when the true CIGAR string is very long, constraints of the BAM format mean that the actual CIGAR string is stored in an extra tag, CG:B,I, and the cigar field stores a pseudo-CIGAR string.
When called with checkCG set to true (the default), this method always yields the true CIGAR string, because this is probably what you want the vast majority of the time.
If you have a record that stores the true CIGAR string in a CG:B,I tag, but you still want to access the pseudo-CIGAR string that is stored in the cigar field of the BAM record, then you can set checkCG to false.
See also BAM.cigar_rle.
alignment(record::Record)::BioAlignments.Alignment
Get the alignment of record.
alignlength(record::Record)::Int
Get the alignment length of record.
tempname(record::Record)::String
Get the query template name of record.
templength(record::Record)::Int
Get the template length of record.
sequence(record::Record)::BioSequences.LongDNASeq
Get the segment sequence of record.
seqlength(record::Record)::Int
Get the sequence length of record.
quality(record::Record)::Vector{UInt8}
Get the base quality of record.
auxdata(record::Record)::BAM.AuxData
Get the auxiliary data of record.
SAM and BAM records support the storing of optional data fields associated with tags.
Tagged auxiliary data follows a format of TAG:TYPE:VALUE. TAG is a two-letter string, and each tag can only appear once per record. TYPE is a single case-sensitive letter which defines the format of VALUE.
| Type | Description |
|---|---|
| 'A' | Printable character |
| 'i' | Signed integer |
| 'f' | Single-precision floating number |
| 'Z' | Printable string, including space |
| 'H' | Byte array in Hex format |
| 'B' | Integer or numeric array |
For more information about these tags and their types we refer you to the SAM/BAM specification and the additional optional fields specification document.
Some tags are reserved as predefined standard tags for specific uses.
To access optional fields stored in tags, you use getindex indexing syntax on the record object. Note that accessing optional tag fields will result in type instability in Julia. This is because the type of the optional data is not known until run-time, as the tag is being read. This can have a significant impact on performance. To limit this, if the user knows the type of a value in advance, specifying it as a type annotation will alleviate the problem.
Below is an example of looping over records in a BAM file and using indexing syntax to get the data stored in the "NM" tag. Note the UInt8 type assertion to alleviate type instability.
for record in open(BAM.Reader, "data.bam")
+
In the above we can see there were 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.
The XAM package supports the following accessors for SAM.Record types.
flag(record::Record)::UInt16
Get the bitwise flag of record.
ismapped(record::Record)::Bool
Test if record is mapped.
isprimary(record::Record)::Bool
Test if record is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0.
refname(record::Record)::String
Get the reference sequence name of record.
position(record::Record)::Int
Get the 1-based leftmost mapping position of record.
rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record.
isnextmapped(record::Record)::Bool
Test if the mate/next read of record is mapped.
nextrefname(record::Record)::String
Get the reference name of the mate/next read of record.
nextposition(record::Record)::Int
Get the position of the mate/next read of record.
mappingquality(record::Record)::UInt8
Get the mapping quality of record.
cigar(record::Record)::String
Get the CIGAR string of record.
alignment(record::Record)::BioAlignments.Alignment
Get the alignment of record.
alignlength(record::Record)::Int
Get the alignment length of record.
tempname(record::Record)::String
Get the query template name of record.
templength(record::Record)::Int
Get the template length of record.
sequence(record::Record)::BioSequences.LongDNASeq
Get the segment sequence of record.
sequence(::Type{String}, record::Record)::String
Get the segment sequence of record as String.
seqlength(record::Record)::Int
Get the sequence length of record.
quality(record::Record)::Vector{UInt8}
Get the Phred-scaled base quality of record.
quality(::Type{String}, record::Record)::String
Get the ASCII-encoded base quality of record.
auxdata(record::Record)::Dict{String,Any}
Get the auxiliary data (optional fields) of record.
The XAM package supports the following accessors for BAM.Record types.
flag(record::Record)::UInt16
Get the bitwise flag of record.
ismapped(record::Record)::Bool
Test if record is mapped.
isprimary(record::Record)::Bool
Test if record is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0.
refid(record::Record)::Int
Get the reference sequence ID of record.
The ID is 1-based (i.e. the first sequence is 1) and is 0 for a record without a mapping position.
See also: BAM.rname
refname(record::Record)::String
Get the reference sequence name of record.
See also: BAM.refid
reflen(record::Record)::Int
Get the length of the reference sequence this record applies to.
position(record::Record)::Int
Get the 1-based leftmost mapping position of record.
rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record.
isnextmapped(record::Record)::Bool
Test if the mate/next read of record is mapped.
nextrefid(record::Record)::Int
Get the next/mate reference sequence ID of record.
nextrefname(record::Record)::String
Get the reference name of the mate/next read of record.
nextposition(record::Record)::Int
Get the 1-based leftmost mapping position of the next/mate read of record.
mappingquality(record::Record)::UInt8
Get the mapping quality of record.
cigar(record::Record)::String
Get the CIGAR string of record.
Note that in the BAM specification, the field called cigar typically stores the CIGAR string of the record. However, this is not always true: when the true CIGAR string is very long, constraints of the BAM format mean that the actual CIGAR string is stored in an extra tag, CG:B,I, and the cigar field stores a pseudo-CIGAR string.
When called with checkCG set to true (the default), this method always yields the true CIGAR string, because this is probably what you want the vast majority of the time.
If you have a record that stores the true CIGAR string in a CG:B,I tag, but you still want to access the pseudo-CIGAR string that is stored in the cigar field of the BAM record, then you can set checkCG to false.
See also BAM.cigar_rle.
alignment(record::Record)::BioAlignments.Alignment
Get the alignment of record.
alignlength(record::Record)::Int
Get the alignment length of record.
tempname(record::Record)::String
Get the query template name of record.
templength(record::Record)::Int
Get the template length of record.
sequence(record::Record)::BioSequences.LongDNASeq
Get the segment sequence of record.
seqlength(record::Record)::Int
Get the sequence length of record.
quality(record::Record)::Vector{UInt8}
Get the base quality of record.
auxdata(record::Record)::BAM.AuxData
Get the auxiliary data of record.
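The isprimary entries above state the equivalence flag(record) & 0x900 == 0. The sketch below spells out that bit test on raw flag values in plain Julia, without the XAM package. The constant and function names are hypothetical and chosen for this example only; the bit values (0x4 unmapped, 0x100 secondary, 0x800 supplementary) come from the SAM specification.

```julia
# Flag bits from the SAM specification (names are illustrative, not XAM's).
const FLAG_UNMAPPED      = 0x0004
const FLAG_SECONDARY     = 0x0100
const FLAG_SUPPLEMENTARY = 0x0800

# A record is mapped when the "unmapped" bit is clear.
is_mapped_flag(flag::UInt16) = flag & FLAG_UNMAPPED == 0

# A record is primary when neither the secondary nor the supplementary bit
# is set; the combined mask is 0x0900, matching the docstring above.
is_primary_flag(flag::UInt16) =
    flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY) == 0

primary_flag   = UInt16(99)                   # flag of r001 in the SAM example
secondary_flag = primary_flag | FLAG_SECONDARY
unmapped_flag  = UInt16(4)

println(is_primary_flag(primary_flag))
println(is_primary_flag(secondary_flag))
println(is_mapped_flag(unmapped_flag))
```

In Julia, `&` binds more tightly than `==`, so `flag & 0x900 == 0` is read as `(flag & 0x900) == 0` and needs no extra parentheses.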
SAM and BAM records support the storing of optional data fields associated with tags.
Tagged auxiliary data follows a format of TAG:TYPE:VALUE. TAG is a two-letter string, and each tag can only appear once per record. TYPE is a single case-sensitive letter which defines the format of VALUE.
| Type | Description |
|---|---|
| 'A' | Printable character |
| 'i' | Signed integer |
| 'f' | Single-precision floating number |
| 'Z' | Printable string, including space |
| 'H' | Byte array in Hex format |
| 'B' | Integer or numeric array |
For more information about these tags and their types we refer you to the SAM/BAM specification and the additional optional fields specification document.
Some tags are reserved as predefined standard tags for specific uses.
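To make the TAG:TYPE:VALUE layout concrete, here is a minimal sketch that parses the textual (SAM) form of a single optional field in plain Julia. The function `parse_aux` is a hypothetical helper written for this example and is not part of the XAM API; it converts the 'A', 'i', 'f' and 'Z' types and, as a simplification, leaves 'H' and 'B' values as raw strings.

```julia
# Hypothetical helper: parse one "TAG:TYPE:VALUE" field into a tag => value pair.
function parse_aux(field::AbstractString)
    # Limit the split to three parts because a 'Z' value may itself contain ':'.
    parts = split(field, ':'; limit = 3)
    length(parts) == 3 ||
        throw(ArgumentError("expected TAG:TYPE:VALUE, got \"$field\""))
    tag, typ, raw = parts
    length(tag) == 2 ||
        throw(ArgumentError("tag must be two characters, got \"$tag\""))
    value = if typ == "A"
        only(raw)             # single printable character
    elseif typ == "i"
        parse(Int, raw)       # signed integer
    elseif typ == "f"
        parse(Float32, raw)   # single-precision float
    else
        String(raw)           # 'Z', and (unparsed in this sketch) 'H' and 'B'
    end
    return String(tag) => value
end

println(parse_aux("NM:i:1"))
println(parse_aux("XT:A:U"))
println(parse_aux("RG:Z:lane:1"))
```

The return type of `parse_aux` depends on the TYPE letter read at run time, which is the same source of type instability that affects tag access on records.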
To access optional fields stored in tags, you use getindex indexing syntax on the record object. Note that accessing optional tag fields will result in type instability in Julia. This is because the type of the optional data is not known until run-time, as the tag is being read. This can have a significant impact on performance. To limit this, if the user knows the type of a value in advance, specifying it as a type annotation will alleviate the problem.
Below is an example of looping over records in a BAM file and using indexing syntax to get the data stored in the "NM" tag. Note the UInt8 type assertion to alleviate type instability.
for record in open(BAM.Reader, "data.bam")
nm = record["NM"]::UInt8
# do something
end
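The effect of the type assertion in the loop above can be seen without a BAM file. The sketch below stands in for a record with a `Dict{String,Any}` (an assumption made purely for illustration; the function names are hypothetical) and asks Julia what return type it infers with and without the assertion.

```julia
# A stand-in for a record's optional fields: values are stored as Any,
# so the compiler cannot know the concrete type of a lookup.
aux = Dict{String,Any}("NM" => 0x01)

# Without an assertion, the result of the lookup is inferred loosely.
nm_untyped(d::Dict{String,Any}) = d["NM"]

# With an assertion, downstream code can be specialised for UInt8.
# The assertion throws a TypeError if the stored value is not a UInt8.
nm_typed(d::Dict{String,Any}) = d["NM"]::UInt8

# Print the return types Julia infers for each version.
println(Base.return_types(nm_untyped, (Dict{String,Any},)))
println(Base.return_types(nm_typed, (Dict{String,Any},)))

println(nm_typed(aux))
```

Because the assertion is checked at run time, it is only safe when the value's type really is known in advance; BAM files can encode integer tags with several integer widths, so a different writer may store the same tag with another type.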
The XAM package supports the BAI index to fetch records in a specific range from a BAM file. Samtools provides the index subcommand to create an index file (.bai) from a sorted BAM file.
$ samtools index -b SRR1238088.sort.bam
@@ -103,4 +103,4 @@ SAM.Writer(IOStream(<file my-data.sam>))
julia> bamw = BAM.Writer(BGZFStream(open("my-data.bam", "w"), "w"))
BAM.Writer(BGZFStreams.BGZFStream{IOStream}(<mode=write>))
Once you have a BAM or SAM writer, you can use the write method to write BAM.Records or SAM.Records to file:
julia> write(bamw, rec) # Here rec is a `BAM.Record`
-330780