mirror of
https://github.com/MillironX/XAM.jl.git
synced 2024-12-23 21:28:18 +00:00
Documentation roundup
This commit is contained in:
parent
bf835d9ff0
commit
30a9911e58
1 changed files with 313 additions and 0 deletions
313
docs/src/man/hts-files.md
Normal file
313
docs/src/man/hts-files.md
Normal file
|
@ -0,0 +1,313 @@
|
|||
# SAM and BAM
|
||||
|
||||
|
||||
## Introduction
|
||||
|
||||
High-throughput sequencing (HTS) technologies generate a large amount of data in the form of a large number of nucleotide sequencing reads.
|
||||
One of the most common tasks in bioinformatics is to align these reads against known reference genomes, chromosomes, or contigs.
|
||||
BioAlignments provides several data formats commonly used for this kind of task.
|
||||
|
||||
BioAlignments offers high-performance tools for SAM and BAM file formats, which are the most popular file formats.
|
||||
|
||||
If you have questions about the SAM and BAM formats or any of the terminology used when discussing these formats, see the published [specification][samtools-spec], which is maintained by the [samtools group][samtools].
|
||||
|
||||
A very very simple SAM file looks like the following:
|
||||
|
||||
```
|
||||
@HD VN:1.6 SO:coordinate
|
||||
@SQ SN:ref LN:45
|
||||
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
|
||||
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
|
||||
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
|
||||
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
|
||||
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
|
||||
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
|
||||
```
|
||||
|
||||
Where the first two lines are part of the "header", and the following lines are "records".
|
||||
Each record describes how a read aligns to some reference sequence.
|
||||
Sometimes one record describes one read, but there are other cases like chimeric reads and split alignments, where multiple records apply to one read.
|
||||
In the example above, `r003` is a chimeric read, and `r004` is a split alignment, and `r001` are mate pair reads.
|
||||
Again, we refer you to the official [specification][samtools-spec] for more details.
|
||||
|
||||
A BAM file stores this same information but in a binary and compressible format that does not make for pretty printing here!
|
||||
|
||||
## Reading SAM and BAM files
|
||||
|
||||
A typical script iterating over all records in a file looks like below:
|
||||
|
||||
```julia
|
||||
using BioAlignments
|
||||
|
||||
# Open a BAM file.
|
||||
reader = open(BAM.Reader, "data.bam")
|
||||
|
||||
# Iterate over BAM records.
|
||||
for record in reader
|
||||
# `record` is a BAM.Record object.
|
||||
if BAM.ismapped(record)
|
||||
# Print the mapped position.
|
||||
println(BAM.refname(record), ':', BAM.position(record))
|
||||
end
|
||||
end
|
||||
|
||||
# Close the BAM file.
|
||||
close(reader)
|
||||
```
|
||||
|
||||
The size of a BAM file is often extremely large.
|
||||
The iterator interface demonstrated above allocates an object for each record and that may be a bottleneck of reading data from a BAM file.
|
||||
In-place reading reuses a pre-allocated object for every record and less memory allocation happens in reading:
|
||||
|
||||
```julia
|
||||
reader = open(BAM.Reader, "data.bam")
|
||||
record = BAM.Record()
|
||||
while !eof(reader)
|
||||
read!(reader, record)
|
||||
# do something
|
||||
end
|
||||
```
|
||||
|
||||
## SAM and BAM Headers
|
||||
|
||||
Both `SAM.Reader` and `BAM.Reader` implement the `header` function, which returns a `SAM.Header` object.
|
||||
To extract certain information out of the headers, you can use the `find` method on the header to extract information according to SAM/BAM tag.
|
||||
Again we refer you to the [specification][samtools-spec] for full details of all the different tags that can occur in headers, and what they mean.
|
||||
|
||||
Below is an example of extracting all the info about the reference sequences from the BAM header.
|
||||
In SAM/BAM, any description of a reference sequence is stored in the header, under a tag denoted `SQ` (think `reference SeQuence`!).
|
||||
|
||||
```jlcon
|
||||
julia> reader = open(SAM.Reader, "data.sam");
|
||||
|
||||
julia> find(header(reader), "SQ")
|
||||
7-element Array{Bio.Align.SAM.MetaInfo,1}:
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=Chr1 LN=30427671
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=Chr2 LN=19698289
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=Chr3 LN=23459830
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=Chr4 LN=18585056
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=Chr5 LN=26975502
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=chloroplast LN=154478
|
||||
Bio.Align.SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=mitochondria LN=366924
|
||||
|
||||
```
|
||||
|
||||
In the above we can see there were 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.
|
||||
|
||||
## SAM and BAM Records
|
||||
|
||||
BioAlignments supports the following accessors for `SAM.Record` types.
|
||||
|
||||
```@docs
|
||||
XAM.SAM.flag
|
||||
XAM.SAM.ismapped
|
||||
XAM.SAM.isprimary
|
||||
XAM.SAM.refname
|
||||
XAM.SAM.position
|
||||
XAM.SAM.rightposition
|
||||
XAM.SAM.isnextmapped
|
||||
XAM.SAM.nextrefname
|
||||
XAM.SAM.nextposition
|
||||
XAM.SAM.mappingquality
|
||||
XAM.SAM.cigar
|
||||
XAM.SAM.alignment
|
||||
XAM.SAM.alignlength
|
||||
XAM.SAM.tempname
|
||||
XAM.SAM.templength
|
||||
XAM.SAM.sequence
|
||||
XAM.SAM.seqlength
|
||||
XAM.SAM.quality
|
||||
XAM.SAM.auxdata
|
||||
```
|
||||
|
||||
BioAlignments supports the following accessors for `BAM.Record` types.
|
||||
|
||||
```@docs
|
||||
XAM.BAM.flag
|
||||
XAM.BAM.ismapped
|
||||
XAM.BAM.isprimary
|
||||
XAM.BAM.refid
|
||||
XAM.BAM.refname
|
||||
XAM.BAM.reflen
|
||||
XAM.BAM.position
|
||||
XAM.BAM.rightposition
|
||||
XAM.BAM.isnextmapped
|
||||
XAM.BAM.nextrefid
|
||||
XAM.BAM.nextrefname
|
||||
XAM.BAM.nextposition
|
||||
XAM.BAM.mappingquality
|
||||
XAM.BAM.cigar
|
||||
XAM.BAM.alignment
|
||||
XAM.BAM.alignlength
|
||||
XAM.BAM.tempname
|
||||
XAM.BAM.templength
|
||||
XAM.BAM.sequence
|
||||
XAM.BAM.seqlength
|
||||
XAM.BAM.quality
|
||||
XAM.BAM.auxdata
|
||||
```
|
||||
|
||||
## Accessing auxiliary data
|
||||
|
||||
SAM and BAM records support the storing of optional data fields associated with tags.
|
||||
|
||||
Tagged auxiliary data follows a format of `TAG:TYPE:VALUE`.
|
||||
`TAG` is a two-letter string, and each tag can only appear once per record.
|
||||
`TYPE` is a single case-sensetive letter which defined the format of `VALUE`.
|
||||
|
||||
| Type | Description |
|
||||
|------|-----------------------------------|
|
||||
| 'A' | Printable character |
|
||||
| 'i' | Signed integer |
|
||||
| 'f' | Single-precision floating number |
|
||||
| 'Z' | Printable string, including space |
|
||||
| 'H' | Byte array in Hex format |
|
||||
| 'B' | Integer of numeric array |
|
||||
|
||||
For more information about these tags and their types we refer you to the [SAM/BAM specification][samtools-spec] and the additional [optional fields specification][samtags] document.
|
||||
|
||||
There are some tags that are reserved, predefined standard tags, for specific uses.
|
||||
|
||||
To access optional fields stored in tags, you use `getindex` indexing syntax on the record object.
|
||||
Note that accessing optional tag fields will result in type instability in Julia.
|
||||
This is because the type of the optional data is not known until run-time, as the tag is being read.
|
||||
This can have a significant impact on performance.
|
||||
To limit this, if the user knows the type of a value in advance, specifying it as a type annotation will alleviate the problem:
|
||||
|
||||
Below is an example of looping over records in a bam file and using indexing syntax to get the data stored in the "NM" tag.
|
||||
Note the `UInt8` type assertion to alleviate type instability.
|
||||
|
||||
```julia
|
||||
for record in open(BAM.Reader, "data.bam")
|
||||
nm = record["NM"]::UInt8
|
||||
# do something
|
||||
end
|
||||
```
|
||||
|
||||
## Getting records in a range
|
||||
|
||||
BioAlignments supports the BAI index to fetch records in a specific range from a BAM file.
|
||||
[Samtools][samtools] provides `index` subcommand to create an index file (.bai) from a sorted BAM file.
|
||||
|
||||
```console
|
||||
$ samtools index -b SRR1238088.sort.bam
|
||||
$ ls SRR1238088.sort.bam*
|
||||
SRR1238088.sort.bam SRR1238088.sort.bam.bai
|
||||
```
|
||||
|
||||
`eachoverlap(reader, chrom, range)` returns an iterator of BAM records overlapping the query interval:
|
||||
|
||||
```julia
|
||||
reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai")
|
||||
for record in eachoverlap(reader, "Chr2", 10000:11000)
|
||||
# `record` is a BAM.Record object
|
||||
# ...
|
||||
end
|
||||
close(reader)
|
||||
```
|
||||
|
||||
## Getting records overlapping genomic features
|
||||
|
||||
`eachoverlap` also accepts the `Interval` type defined in [GenomicFeatures.jl][genomicfeatures].
|
||||
|
||||
This allows you to do things like first read in the genomic features from a GFF3 file, and then for each feature, iterate over all the BAM records that overlap with that feature.
|
||||
|
||||
```julia
|
||||
# Load GFF3 module.
|
||||
using GenomicFeatures
|
||||
using BioAlignments
|
||||
|
||||
# Load genomic features from a GFF3 file.
|
||||
features = open(collect, GFF3.Reader, "TAIR10_GFF3_genes.gff")
|
||||
|
||||
# Keep mRNA features.
|
||||
filter!(x -> GFF3.featuretype(x) == "mRNA", features)
|
||||
|
||||
# Open a BAM file and iterate over records overlapping mRNA transcripts.
|
||||
reader = open(BAM.Reader, "SRR1238088.sort.bam", index = "SRR1238088.sort.bam.bai")
|
||||
for feature in features
|
||||
for record in eachoverlap(reader, feature)
|
||||
# `record` overlaps `feature`.
|
||||
# ...
|
||||
end
|
||||
end
|
||||
close(reader)
|
||||
```
|
||||
|
||||
## Writing files
|
||||
|
||||
In order to write a BAM or SAM file, you must first create a `SAM.Header`.
|
||||
|
||||
A `SAM.Header` is constructed from a vector of `SAM.MetaInfo` objects.
|
||||
|
||||
For example, to create the following simple header:
|
||||
|
||||
```
|
||||
@HD VN:1.6 SO:coordinate
|
||||
@SQ SN:ref LN:45
|
||||
```
|
||||
|
||||
```julia
|
||||
julia> a = SAM.MetaInfo("HD", ["VN" => 1.6, "SO" => "coordinate"])
|
||||
SAM.MetaInfo:
|
||||
tag: HD
|
||||
value: VN=1.6 SO=coordinate
|
||||
|
||||
julia> b = SAM.MetaInfo("SQ", ["SN" => "ref", "LN" => 45])
|
||||
SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=ref LN=45
|
||||
|
||||
julia> h = SAM.Header([a, b])
|
||||
SAM.Header(SAM.MetaInfo[SAM.MetaInfo:
|
||||
tag: HD
|
||||
value: VN=1.6 SO=coordinate, SAM.MetaInfo:
|
||||
tag: SQ
|
||||
value: SN=ref LN=45])
|
||||
|
||||
```
|
||||
|
||||
Then to create the writer for a SAM file, construct a `SAM.Writer` using the header and an `IO` type:
|
||||
|
||||
```julia
|
||||
julia> samw = SAM.Writer(open("my-data.sam", "w"), h)
|
||||
SAM.Writer(IOStream(<file my-data.sam>))
|
||||
|
||||
```
|
||||
|
||||
To make a BAM Writer is slightly different, as you need to use a specific stream type from the [BGZFStreams][bgzfstreams] package:
|
||||
|
||||
```julia
|
||||
julia> using BGZFStreams
|
||||
|
||||
julia> bamw = BAM.Writer(BGZFStream(open("my-data.bam", "w"), "w"))
|
||||
BAM.Writer(BGZFStreams.BGZFStream{IOStream}(<mode=write>))
|
||||
|
||||
```
|
||||
|
||||
Once you have a BAM or SAM writer, you can use the `write` method to write `BAM.Record`s or `SAM.Record`s to file:
|
||||
|
||||
```julia
|
||||
julia> write(bamw, rec) # Here rec is a `BAM.Record`
|
||||
330780
|
||||
```
|
||||
|
||||
[samtools]: https://samtools.github.io/
|
||||
[samtools-spec]: https://samtools.github.io/hts-specs/SAMv1.pdf
|
||||
[samtags]: https://samtools.github.io/hts-specs/SAMtags.pdf
|
||||
[bgzfstreams]: https://github.com/BioJulia/BGZFStreams.jl
|
||||
[genomicfeatures]: https://github.com/BioJulia/GenomicFeatures.jl
|
Loading…
Reference in a new issue