Documentation roundup

2024-11-23 10:19:56 +00:00 · 2020-02-02 20:55:31 +11:00 · 2020-02-02 20:55:31 +11:00 · 30a9911e58
commit 30a9911e58
parent bf835d9ff0
1 changed files with 313 additions and 0 deletions
--- a/docs/src/man/hts-files.md
+++ b/docs/src/man/hts-files.md
@ -0,0 +1,313 @@
+# SAM and BAM
+
+
+## Introduction
+
+High-throughput sequencing (HTS) technologies generate very large numbers of nucleotide sequencing reads.
+One of the most common tasks in bioinformatics is to align these reads against known reference genomes, chromosomes, or contigs.
+BioAlignments supports several data formats commonly used for this kind of task.
+
+It offers high-performance tools for SAM and BAM, the most popular file formats for storing alignments.
+
+If you have questions about the SAM and BAM formats or any of the terminology used when discussing these formats, see the published [specification][samtools-spec], which is maintained by the [samtools group][samtools].
+
+A very simple SAM file looks like this:
+
+```
+@HD VN:1.6 SO:coordinate
+@SQ SN:ref LN:45
+r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
+r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
+r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
+r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
+r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
+r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
+```
+
+The first two lines are part of the "header", and the remaining lines are "records".
+Each record describes how a read aligns to some reference sequence.
+Often one record describes one read, but in cases such as chimeric reads and split alignments, multiple records apply to one read.
+In the example above, `r003` is a chimeric read, `r004` is a split alignment, and the two `r001` records are a mate pair.
+Again, we refer you to the official [specification][samtools-spec] for more details.
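+
+The second column of each record is a bitwise FLAG, which is how the example marks `r001` as a pair and the second `r003` record as supplementary.
+As a rough illustration that does not use BioAlignments itself, the sketch below decodes some of the flag bits defined in the specification; `FLAG_BITS` and `decodeflag` are hypothetical names introduced here.
+
+```julia
+# Hypothetical helper, not part of BioAlignments: names for some of the
+# FLAG bits defined in the SAM specification.
+const FLAG_BITS = [
+    0x001 => "paired",
+    0x002 => "proper pair",
+    0x004 => "unmapped",
+    0x010 => "reverse strand",
+    0x020 => "mate reverse strand",
+    0x040 => "first in pair",
+    0x080 => "last in pair",
+    0x100 => "secondary",
+    0x800 => "supplementary",
+]
+
+# Return the names of all listed bits that are set in `flag`.
+decodeflag(flag::Integer) = [name for (bit, name) in FLAG_BITS if (flag & bit) != 0]
+
+decodeflag(99)    # first `r001` record
+decodeflag(2064)  # second `r003` record
+```
+
+When working with parsed records, accessors such as `BAM.ismapped` and `BAM.isprimary` are likely a better fit than decoding the flag by hand.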
+
+A BAM file stores this same information but in a binary and compressible format that does not make for pretty printing here!
+
+## Reading SAM and BAM files
+
+A typical script that iterates over all records in a file looks like this:
+
+```julia
+using BioAlignments
+
+# Open a BAM file.
+reader = open(BAM.Reader, "data.bam")
+
+# Iterate over BAM records.
+for record in reader
+    # `record` is a BAM.Record object.
+    if BAM.ismapped(record)
+        # Print the mapped position.
+        println(BAM.refname(record), ':', BAM.position(record))
+    end
+end
+
+# Close the BAM file.
+close(reader)
+```
+
+BAM files are often extremely large.
+The iterator interface demonstrated above allocates a new object for each record, which can become a bottleneck when reading a BAM file.
+In-place reading reuses a single pre-allocated object for every record, so far less memory is allocated while reading:
+
+```julia
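+reader = open(BAM.Reader, "data.bam")
+
+# Pre-allocate one record and overwrite it on every iteration.
+record = BAM.Record()
+while !eof(reader)
+    read!(reader, record)
+    # do something
+end
+
+# Close the BAM file.
+close(reader)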
+
+## SAM and BAM Headers
+
+Both `SAM.Reader` and `BAM.Reader` implement the `header` function, which returns a `SAM.Header` object.
+To extract information from a header, use the `find` method, which returns the header entries that match a given SAM/BAM header tag.
+Again we refer you to the [specification][samtools-spec] for full details of all the different tags that can occur in headers, and what they mean.
+
+Below is an example of extracting all the information about the reference sequences from the header of a SAM file.
+In SAM/BAM, any description of a reference sequence is stored in the header, under a tag denoted `SQ` (think `reference SeQuence`!).
+
+```jlcon
+julia> reader = open(SAM.Reader, "data.sam");
+
+julia> find(header(reader), "SQ")
+7-element Array{Bio.Align.SAM.MetaInfo,1}:
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=Chr1 LN=30427671
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=Chr2 LN=19698289
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=Chr3 LN=23459830
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=Chr4 LN=18585056
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=Chr5 LN=26975502
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=chloroplast LN=154478
+ Bio.Align.SAM.MetaInfo:
+    tag: SQ
+  value: SN=mitochondria LN=366924
+
+```
+
+In the above we can see there were 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.
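+
+In the file itself, each header entry is a single line: an `@` followed by the two-letter tag, then tab-separated `KEY:VALUE` fields.
+The following is a minimal sketch of that layout in plain Julia, not the parser BioAlignments uses; `parseheaderline` is a hypothetical helper, and free-text `@CO` comment lines are not handled.
+
+```julia
+# Hypothetical helper, not part of BioAlignments: split one SAM header line
+# into its tag (e.g. "SQ") and its KEY => VALUE fields.
+function parseheaderline(line::AbstractString)
+    startswith(line, "@") || error("not a header line: $line")
+    fields = split(line, '\t')
+    tag = String(fields[1][2:end])
+    kvs = Pair{String,String}[]
+    for field in fields[2:end]
+        # Split at the first ':' only, because values may contain ':'.
+        key, value = split(field, ':'; limit=2)
+        push!(kvs, String(key) => String(value))
+    end
+    return tag, kvs
+end
+
+tag, kvs = parseheaderline("@SQ\tSN:Chr1\tLN:30427671")
+```
+
+Here `tag` holds the header tag and `kvs` holds the sequence name and length as strings, mirroring the `tag` and `value` parts printed for each `MetaInfo` above.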
+
+## SAM and BAM Records
+
+BioAlignments supports the following accessors for `SAM.Record` types.
+
+```@docs
+XAM.SAM.flag
+XAM.SAM.ismapped
+XAM.SAM.isprimary
+XAM.SAM.refname
+XAM.SAM.position
+XAM.SAM.rightposition
+XAM.SAM.isnextmapped
+XAM.SAM.nextrefname
+XAM.SAM.nextposition
+XAM.SAM.mappingquality
+XAM.SAM.cigar
+XAM.SAM.alignment
+XAM.SAM.alignlength
+XAM.SAM.tempname
+XAM.SAM.templength
+XAM.SAM.sequence
+XAM.SAM.seqlength
+XAM.SAM.quality
+XAM.SAM.auxdata
+```
+
+BioAlignments supports the following accessors for `BAM.Record` types.
+
+```@docs
+XAM.BAM.flag
+XAM.BAM.ismapped
+XAM.BAM.isprimary
+XAM.BAM.refid
+XAM.BAM.refname
+XAM.BAM.reflen
+XAM.BAM.position
+XAM.BAM.rightposition
+XAM.BAM.isnextmapped
+XAM.BAM.nextrefid
+XAM.BAM.nextrefname
+XAM.BAM.nextposition
+XAM.BAM.mappingquality
+XAM.BAM.cigar
+XAM.BAM.alignment
+XAM.BAM.alignlength
+XAM.BAM.tempname
+XAM.BAM.templength
+XAM.BAM.sequence
+XAM.BAM.seqlength
+XAM.BAM.quality
+XAM.BAM.auxdata
+```
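+
+Several of these accessors describe where a record sits on the reference, which follows from its start position and CIGAR string.
+As an illustration of that relationship only (it does not reproduce the library's implementation), the sketch below computes how many reference bases a CIGAR string covers; `refspan` is a hypothetical helper.
+
+```julia
+# Hypothetical helper, not part of BioAlignments: number of reference bases
+# covered by a CIGAR string. Only the M, D, N, = and X operations consume
+# reference bases; I, S, H and P do not.
+function refspan(cigar::AbstractString)
+    span = 0
+    for m in eachmatch(r"(\d+)([MIDNSHP=X])", cigar)
+        len = parse(Int, m.captures[1])
+        op = m.captures[2][1]
+        if op in "MDN=X"
+            span += len
+        end
+    end
+    return span
+end
+
+# The first `r001` record in the introduction starts at position 7
+# with CIGAR 8M2I4M1D3M.
+leftpos = 7
+rightpos = leftpos + refspan("8M2I4M1D3M") - 1
+```
+
+For that record the span is 16 bases, so the alignment covers reference positions 7 to 22.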
+
+## Accessing auxiliary data
+
+SAM and BAM records support the storing of optional data fields associated with tags.
+
+Tagged auxiliary data follows a format of `TAG:TYPE:VALUE`.
+`TAG` is a two-letter string, and each tag can only appear once per record.
+`TYPE` is a single case-sensitive letter which defines the format of `VALUE`.
+
+| Type | Description                       |
+|------|-----------------------------------|
+| 'A'  | Printable character               |
+| 'i'  | Signed integer                    |
+| 'f'  | Single-precision floating number  |
+| 'Z'  | Printable string, including space |
+| 'H'  | Byte array in Hex format          |
+| 'B'  | Integer or numeric array          |
+
+For more information about these tags and their types we refer you to the [SAM/BAM specification][samtools-spec] and the additional [optional fields specification][samtags] document.
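+
+To make the `TAG:TYPE:VALUE` layout concrete, here is a minimal sketch that parses the textual (SAM) form of a field in plain Julia.
+`parseaux` is a hypothetical helper, not a BioAlignments function, and it leaves `H` and `B` values as raw text.
+
+```julia
+# Hypothetical helper, not part of BioAlignments: parse one textual
+# TAG:TYPE:VALUE field and convert VALUE according to TYPE.
+function parseaux(field::AbstractString)
+    # Limit the split to three parts, because a 'Z' value may contain ':'.
+    tag, typ, value = split(field, ':'; limit=3)
+    parsed = if typ == "A"
+        value[1]                # single printable character
+    elseif typ == "i"
+        parse(Int, value)       # signed integer
+    elseif typ == "f"
+        parse(Float32, value)   # single-precision float
+    else
+        String(value)           # 'Z', and 'H'/'B' kept as raw text here
+    end
+    return String(tag) => parsed
+end
+
+parseaux("NM:i:1")
+parseaux("SA:Z:ref,29,-,6H5M,17,0;")
+```
+
+The type of the parsed value depends on the `TYPE` letter read at run time, so the same call can return an integer for one field and a string for another.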
+
+Some tags are reserved as predefined standard tags with specific uses.
+
+To access optional fields stored in tags, use `getindex` indexing syntax on the record object.
+Note that accessing optional fields is type-unstable in Julia, because the type of the stored value is not known until run time, when the tag is read.
+This can have a significant impact on performance.
+If you know the type of a value in advance, a type assertion alleviates the problem.
+
+Below is an example of looping over the records in a BAM file and using indexing syntax to get the data stored in the "NM" tag.
+Note the `UInt8` type assertion, which alleviates the type instability.
+
+```julia
+for record in open(BAM.Reader, "data.bam")
+    nm = record["NM"]::UInt8
+    # do something
+end
+```
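+
+The effect of the assertion can be seen without a BAM file.
+In the sketch below a `Dict{String,Any}` stands in for a record's optional fields, which is an assumption made purely for illustration; `aux`, `getnm_unstable` and `getnm` are hypothetical names.
+
+```julia
+# A Dict{String,Any} stands in for a record's optional fields (an assumption
+# for illustration; BAM records store them differently).
+aux = Dict{String,Any}("NM" => UInt8(1), "SA" => "ref,29,-,6H5M,17,0;")
+
+# Without the assertion, the compiler only knows the result is `Any`.
+getnm_unstable(d) = d["NM"]
+
+# With the assertion, the result is known to be a UInt8, and a TypeError
+# is thrown at run time if the stored value has any other type.
+getnm(d) = d["NM"]::UInt8
+
+getnm(aux)
+
+# Asserting the wrong type fails loudly instead of converting the value.
+failed = try
+    aux["SA"]::UInt8
+    false
+catch err
+    err isa TypeError
+end
+```
+
+A type assertion checks the type and never converts it, so if the integer width stored for a tag can differ between records, asserting a single concrete type may throw.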
+
+## Getting records in a range
+
+BioAlignments supports the BAI index to fetch records in a specific range from a BAM file.
+[Samtools][samtools] provides the `index` subcommand to create an index file (.bai) from a sorted BAM file.
+
+```console
+$ samtools index -b SRR1238088.sort.bam
+$ ls SRR1238088.sort.bam*
+SRR1238088.sort.bam     SRR1238088.sort.bam.bai
+```
+
+`eachoverlap(reader, chrom, range)` returns an iterator of BAM records overlapping the query interval:
+
+```julia
+reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai")
+for record in eachoverlap(reader, "Chr2", 10000:11000)
+    # `record` is a BAM.Record object
+    # ...
+end
+close(reader)
+```
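+
+Conceptually, a record overlaps the query when its aligned span and the query range share at least one reference position.
+The check can be sketched in plain Julia as follows, assuming closed, 1-based intervals; `overlaps` is a hypothetical helper and says nothing about how the index lookup is implemented.
+
+```julia
+# Hypothetical helper, not part of BioAlignments: true if the closed,
+# 1-based interval [leftpos, rightpos] shares at least one position
+# with `query`.
+overlaps(leftpos::Integer, rightpos::Integer, query::UnitRange) =
+    leftpos <= last(query) && rightpos >= first(query)
+
+overlaps(7, 22, 20:30)  # true: positions 20 to 22 are shared
+overlaps(7, 22, 23:30)  # false: the record ends before the query starts
+```
+
+The index exists so that this comparison does not have to be made against every record in the file.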
+
+## Getting records overlapping genomic features
+
+`eachoverlap` also accepts the `Interval` type defined in [GenomicFeatures.jl][genomicfeatures].
+
+This allows you to do things like first read in the genomic features from a GFF3 file, and then for each feature, iterate over all the BAM records that overlap with that feature.
+
+```julia
+# Load GFF3 module.
+using GenomicFeatures
+using BioAlignments
+
+# Load genomic features from a GFF3 file.
+features = open(collect, GFF3.Reader, "TAIR10_GFF3_genes.gff")
+
+# Keep mRNA features.
+filter!(x -> GFF3.featuretype(x) == "mRNA", features)
+
+# Open a BAM file and iterate over records overlapping mRNA transcripts.
+reader = open(BAM.Reader, "SRR1238088.sort.bam", index = "SRR1238088.sort.bam.bai")
+for feature in features
+    for record in eachoverlap(reader, feature)
+        # `record` overlaps `feature`.
+        # ...
+    end
+end
+close(reader)
+```
+
+## Writing files
+
+In order to write a BAM or SAM file, you must first create a `SAM.Header`.
+
+A `SAM.Header` is constructed from a vector of `SAM.MetaInfo` objects.
+
+For example, to create the following simple header:
+
+```
+@HD VN:1.6 SO:coordinate
+@SQ SN:ref LN:45
+```
+
+```julia
+julia> a = SAM.MetaInfo("HD", ["VN" => 1.6, "SO" => "coordinate"])
+SAM.MetaInfo:
+    tag: HD
+  value: VN=1.6 SO=coordinate
+
+julia> b = SAM.MetaInfo("SQ", ["SN" => "ref", "LN" => 45])
+SAM.MetaInfo:
+    tag: SQ
+  value: SN=ref LN=45
+
+julia> h = SAM.Header([a, b])
+SAM.Header(SAM.MetaInfo[SAM.MetaInfo:
+    tag: HD
+  value: VN=1.6 SO=coordinate, SAM.MetaInfo:
+    tag: SQ
+  value: SN=ref LN=45])
+
+```
+
+Then, to create the writer for a SAM file, construct a `SAM.Writer` from an `IO` object and the header:
+
+```julia
+julia> samw = SAM.Writer(open("my-data.sam", "w"), h)
+SAM.Writer(IOStream(<file my-data.sam>))
+
+```
+
+Creating a BAM writer is slightly different, as you need to use a specific stream type from the [BGZFStreams][bgzfstreams] package:
+
+```julia
+julia> using BGZFStreams
+
+julia> bamw = BAM.Writer(BGZFStream(open("my-data.bam", "w"), "w"), h)
+BAM.Writer(BGZFStreams.BGZFStream{IOStream}(<mode=write>))
+
+```
+
+Once you have a BAM or SAM writer, you can use the `write` method to write `BAM.Record`s or `SAM.Record`s to file:
+
+```julia
+julia> write(bamw, rec) # Here rec is a `BAM.Record`
+330780
+```
+
+[samtools]:      https://samtools.github.io/
+[samtools-spec]: https://samtools.github.io/hts-specs/SAMv1.pdf
+[samtags]:       https://samtools.github.io/hts-specs/SAMtags.pdf
+[bgzfstreams]:   https://github.com/BioJulia/BGZFStreams.jl
+[genomicfeatures]:   https://github.com/BioJulia/GenomicFeatures.jl