1
0
Fork 0
mirror of https://github.com/MillironX/XAM.jl.git synced 2024-12-25 22:28:18 +00:00
XAM.jl/dev/man/hts-files/index.html

108 lines
42 KiB
HTML
Raw Normal View History

2020-02-22 19:53:02 +00:00
<!DOCTYPE html>
2020-02-28 03:32:10 +00:00
<html lang="en"><head><meta charset="UTF-8"/><meta name="viewport" content="width=device-width, initial-scale=1.0"/><title>SAM and BAM · XAM.jl</title><link href="https://fonts.googleapis.com/css?family=Lato|Roboto+Mono" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.11.2/css/fontawesome.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.11.2/css/solid.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.11.2/css/brands.min.css" rel="stylesheet" type="text/css"/><link href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.11.1/katex.min.css" rel="stylesheet" type="text/css"/><script>documenterBaseURL="../.."</script><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.6/require.min.js" data-main="../../assets/documenter.js"></script><script src="../../siteinfo.js"></script><script src="../../../versions.js"></script><link class="docs-theme-link" rel="stylesheet" type="text/css" href="../../assets/themes/documenter-dark.css" data-theme-name="documenter-dark"/><link class="docs-theme-link" rel="stylesheet" type="text/css" href="../../assets/themes/documenter-light.css" data-theme-name="documenter-light" data-theme-primary/><script src="../../assets/themeswap.js"></script></head><body><div id="documenter"><nav class="docs-sidebar"><a class="docs-logo" href="../../"><img src="../../assets/logo.svg" alt="XAM.jl logo"/></a><div class="docs-package-name"><span class="docs-autofit">XAM.jl</span></div><form class="docs-search" action="../../search/"><input class="docs-search-query" id="documenter-search-query" name="q" type="text" placeholder="Search docs"/></form><ul class="docs-menu"><li><a class="tocitem" href="../../">Home</a></li><li class="is-active"><a class="tocitem" href>SAM and BAM</a><ul class="internal"><li><a class="tocitem" href="#Introduction-1"><span>Introduction</span></a></li><li><a class="tocitem" href="#Reading-SAM-and-BAM-files-1"><span>Reading SAM and BAM files</span></a></li><li><a class="tocitem" href="#SAM-and-BAM-Headers-1"><span>SAM and BAM Headers</span></a></li><li><a class="tocitem" href="#SAM-and-BAM-Records-1"><span>SAM and BAM Records</span></a></li><li><a class="tocitem" href="#Accessing-auxiliary-data-1"><span>Accessing auxiliary data</span></a></li><li><a class="tocitem" href="#Getting-records-in-a-range-1"><span>Getting records in a range</span></a></li><li><a class="tocitem" href="#Getting-records-overlapping-genomic-features-1"><span>Getting records overlapping genomic features</span></a></li><li><a class="tocitem" href="#Writing-files-1"><span>Writing files</span></a></li></ul></li><li><a class="tocitem" href="../api/">API Reference</a></li></ul><div class="docs-version-selector field has-addons"><div class="control"><span class="docs-label button is-static is-size-7">Version</span></div><div class="docs-selector control is-expanded"><div class="select is-fullwidth is-size-7"><select id="documenter-version-selector"></select></div></div></div></nav><div class="docs-main"><header class="docs-navbar"><nav class="breadcrumb"><ul class="is-hidden-mobile"><li class="is-active"><a href>SAM and BAM</a></li></ul><ul class="is-hidden-tablet"><li class="is-active"><a href>SAM and BAM</a></li></ul></nav><div class="docs-right"><a class="docs-edit-link" href="https://github.com/BioJulia/XAM.jl/blob/develop/docs/src/man/hts-files.md" title="Edit on GitHub"><span class="docs-icon fab"></span><span class="docs-label is-hidden-touch">Edit on GitHub</span></a><a class="docs-settings-button fas fa-cog" id="documenter-settings-button" href="#" title="Settings"></a><a class="docs-sidebar-button fa fa-bars is-hidden-desktop" id="documenter-sidebar-button" href="#"></a></div></header><article class="content" id="documenter-page"><h1 id="SAM-and-BAM-1"><a class="docs-heading-anchor" href="#SAM-and-BAM-1">SAM and BAM</a><a class="docs-heading-anchor-permalink" href="#SAM-and-BAM-1" title="Permalink"></a></h1><h2 id="Introduction-1"><
2020-02-22 19:53:02 +00:00
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1</code></pre><p>Where the first two lines are part of the &quot;header&quot;, and the following lines are &quot;records&quot;. Each record describes how a read aligns to some reference sequence. Sometimes one record describes one read, but there are other cases like chimeric reads and split alignments, where multiple records apply to one read. In the example above, <code>r003</code> is a chimeric read, and <code>r004</code> is a split alignment, and <code>r001</code> are mate pair reads. Again, we refer you to the official <a href="https://samtools.github.io/hts-specs/SAMv1.pdf">specification</a> for more details.</p><p>A BAM file stores this same information but in a binary and compressible format that does not make for pretty printing here!</p><h2 id="Reading-SAM-and-BAM-files-1"><a class="docs-heading-anchor" href="#Reading-SAM-and-BAM-files-1">Reading SAM and BAM files</a><a class="docs-heading-anchor-permalink" href="#Reading-SAM-and-BAM-files-1" title="Permalink"></a></h2><p>A typical script iterating over all records in a file looks like below:</p><pre><code class="language-julia">using XAM
# Open a BAM file.
reader = open(BAM.Reader, &quot;data.bam&quot;)
# Iterate over BAM records.
for record in reader
# `record` is a BAM.Record object.
if BAM.ismapped(record)
# Print the mapped position.
println(BAM.refname(record), &#39;:&#39;, BAM.position(record))
end
end
# Close the BAM file.
close(reader)</code></pre><p>The size of a BAM file is often extremely large. The iterator interface demonstrated above allocates an object for each record and that may be a bottleneck of reading data from a BAM file. In-place reading reuses a pre-allocated object for every record and less memory allocation happens in reading:</p><pre><code class="language-julia">reader = open(BAM.Reader, &quot;data.bam&quot;)
record = BAM.Record()
while !eof(reader)
2020-04-17 14:29:35 +00:00
empty!(record)
2020-02-22 19:53:02 +00:00
read!(reader, record)
# do something
end</code></pre><h2 id="SAM-and-BAM-Headers-1"><a class="docs-heading-anchor" href="#SAM-and-BAM-Headers-1">SAM and BAM Headers</a><a class="docs-heading-anchor-permalink" href="#SAM-and-BAM-Headers-1" title="Permalink"></a></h2><p>Both <code>SAM.Reader</code> and <code>BAM.Reader</code> implement the <code>header</code> function, which returns a <code>SAM.Header</code> object. To extract certain information out of the headers, you can use the <code>find</code> method on the header to extract information according to SAM/BAM tag. Again we refer you to the <a href="https://samtools.github.io/hts-specs/SAMv1.pdf">specification</a> for full details of all the different tags that can occur in headers, and what they mean.</p><p>Below is an example of extracting all the info about the reference sequences from the BAM header. In SAM/BAM, any description of a reference sequence is stored in the header, under a tag denoted <code>SQ</code> (think <code>reference SeQuence</code>!).</p><pre><code class="language-jlcon">julia&gt; reader = open(SAM.Reader, &quot;data.sam&quot;);
julia&gt; find(header(reader), &quot;SQ&quot;)
7-element Array{Bio.Align.SAM.MetaInfo,1}:
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=Chr1 LN=30427671
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=Chr2 LN=19698289
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=Chr3 LN=23459830
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=Chr4 LN=18585056
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=Chr5 LN=26975502
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=chloroplast LN=154478
Bio.Align.SAM.MetaInfo:
tag: SQ
value: SN=mitochondria LN=366924
2020-04-25 20:31:21 +00:00
</code></pre><p>In the above we can see there were 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.</p><h2 id="SAM-and-BAM-Records-1"><a class="docs-heading-anchor" href="#SAM-and-BAM-Records-1">SAM and BAM Records</a><a class="docs-heading-anchor-permalink" href="#SAM-and-BAM-Records-1" title="Permalink"></a></h2><h3 id="SAM.Record-1"><a class="docs-heading-anchor" href="#SAM.Record-1">SAM.Record</a><a class="docs-heading-anchor-permalink" href="#SAM.Record-1" title="Permalink"></a></h3><p>The <code>XAM</code> package supports the following accessors for <code>SAM.Record</code> types.</p><article class="docstring"><header><a class="docstring-binding" id="XAM.SAM.flag" href="#XAM.SAM.flag"><code>XAM.SAM.flag</code></a><span class="docstring-category">Function</span></header><section><div><pre><code class="language-julia">flag(record::Record)::UInt16</code></pre><p>Get the bitwise flag of <code>record</code>.</p></div><a class="docs-sourcelink" target="_blank" href="https://github.com/BioJulia/XAM.jl/blob/447e6b46c7a6a328b150ac981c7bc785eec2bef9/src/sam/record.jl#L150-L154">source</a></section></article><article class="docstring"><header><a class="docstring-binding" id="XAM.SAM.ismapped" href="#XAM.SAM.ismapped"><code>XAM.SAM.ismapped</code></a><span class="docstring-category">Function</span></header><section><div><pre><code class="language-julia">ismapped(record::Record)::Bool</code></pre><p>Test if <code>record</code> is mapped.</p></div><a class="docs-sourcelink" target="_blank" href="https://github.com/BioJulia/XAM.jl/blob/447e6b46c7a6a328b150ac981c7bc785eec2bef9/src/sam/record.jl#L164-L168">source</a></section></article><article class="docstring"><header><a class="docstring-binding" id="XAM.SAM.isprimary" href="#XAM.SAM.isprimary"><code>XAM.SAM.isprimary</code></a><span class="docstring-category">Function</span></header><section><div><pre><code class="language-julia">isprimary(record::Record)::Bool</code></pre><p>Test if <code>record</code> is a primary line of the read.</p><p>This is equivalent to <code>flag(record) &amp; 0x900 == 0</code>.</p></div><a class="docs-sourcelink" target="_blank" href="https://github.com/BioJulia/XAM.jl/blob/447e6b46c7a6a328b150ac981c7bc785eec2bef9/src/sam/record.jl#L173-L179">source</a></section></article><article class="docstring"><header><a class="docstring-binding" id="XAM.SAM.refname" href="#XAM.SAM.refname"><code>XAM.SAM.refname</code></a><span class="docstring-category">Function</span></header><section><div><pre><code class="language-julia">refname(record::Record)::String</code></pre><p>Get the reference sequence name of <code>record</code>.</p></div><a class="docs-sourcelink" target="_blank" href="https://github.com/BioJulia/XAM.jl/blob/447e6b46c7a6a328b150ac981c7bc785eec2bef9/src/sam/record.jl#L184-L188">source</a></section></article><article class="docstring"><header><a class="docstring-binding" id="XAM.SAM.position" href="#XAM.SAM.position"><code>XAM.SAM.position</code></a><span class="docstring-category">Function</span></header><section><div><pre><code class="language-julia">position(record::Record)::Int</code></pre><p>Get the 1-based leftmost mapping position of <code>record</code>.</p></div><a class="docs-sourcelink" target="_blank" href="https://github.com/BioJulia/XAM.jl/blob/447e6b46c7a6a328b150ac981c7bc785eec2bef9/src/sam/record.jl#L201-L205">source</a></section></article><article class="docstring"><header><a class="docstring-binding" id="XAM.SAM.rightposition" href="#XAM.SAM.rightposition"><code>XAM.SAM.rightposition</code></a><span class="docstring-category">Function</span></header><section><div><pre><code class="language-julia">rightposition(record::Record)::Int</code></pre><p>Get the 1-based rightmost mapping position of <code>record</code>.</p></div><a class="docs-sourcelink" target="_blank" href="https://github.com/BioJulia/XAM.jl/blob/447e6b46c7a6a328b150ac981c7bc785eec2bef9/src/sam/record.jl#L219-L223">source</a></section></article><article class="docstring"><header><a class="do
2020-02-22 19:53:02 +00:00
nm = record[&quot;NM&quot;]::UInt8
# do something
end</code></pre><h2 id="Getting-records-in-a-range-1"><a class="docs-heading-anchor" href="#Getting-records-in-a-range-1">Getting records in a range</a><a class="docs-heading-anchor-permalink" href="#Getting-records-in-a-range-1" title="Permalink"></a></h2><p>The <code>XAM</code> package supports the BAI index to fetch records in a specific range from a BAM file. <a href="https://samtools.github.io/">Samtools</a> provides <code>index</code> subcommand to create an index file (.bai) from a sorted BAM file.</p><pre><code class="language-console">$ samtools index -b SRR1238088.sort.bam
$ ls SRR1238088.sort.bam*
SRR1238088.sort.bam SRR1238088.sort.bam.bai</code></pre><p>The method <code>eachoverlap(reader, chrom, range)</code> returns an iterator of BAM records overlapping the query interval:</p><pre><code class="language-julia">reader = open(BAM.Reader, &quot;SRR1238088.sort.bam&quot;, index=&quot;SRR1238088.sort.bam.bai&quot;)
for record in eachoverlap(reader, &quot;Chr2&quot;, 10000:11000)
# `record` is a BAM.Record object
# ...
end
2020-02-28 03:20:23 +00:00
close(reader)</code></pre><h2 id="Getting-records-overlapping-genomic-features-1"><a class="docs-heading-anchor" href="#Getting-records-overlapping-genomic-features-1">Getting records overlapping genomic features</a><a class="docs-heading-anchor-permalink" href="#Getting-records-overlapping-genomic-features-1" title="Permalink"></a></h2><p>The <code>eachoverlap</code> method also accepts the <code>Interval</code> type defined in <a href="https://github.com/BioJulia/GenomicFeatures.jl">GenomicFeatures.jl</a>.</p><p>This allows you to do things like first read in the genomic features from a GFF3 file, and then for each feature, iterate over all the BAM records that overlap with that feature.</p><pre><code class="language-julia">using GenomicFeatures
2020-02-22 19:53:02 +00:00
using GFF3
using XAM
# Load genomic features from a GFF3 file.
features = open(collect, GFF3.Reader, &quot;TAIR10_GFF3_genes.gff&quot;)
# Keep mRNA features.
filter!(x -&gt; GFF3.featuretype(x) == &quot;mRNA&quot;, features)
# Open a BAM file and iterate over records overlapping mRNA transcripts.
reader = open(BAM.Reader, &quot;SRR1238088.sort.bam&quot;, index = &quot;SRR1238088.sort.bam.bai&quot;)
for feature in features
for record in eachoverlap(reader, feature)
# `record` overlaps `feature`.
# ...
end
end
close(reader)</code></pre><h2 id="Writing-files-1"><a class="docs-heading-anchor" href="#Writing-files-1">Writing files</a><a class="docs-heading-anchor-permalink" href="#Writing-files-1" title="Permalink"></a></h2><p>In order to write a BAM or SAM file, you must first create a <code>SAM.Header</code>.</p><p>A <code>SAM.Header</code> is constructed from a vector of <code>SAM.MetaInfo</code> objects.</p><p>For example, to create the following simple header:</p><pre><code class="language-none">@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45</code></pre><pre><code class="language-julia">julia&gt; a = SAM.MetaInfo(&quot;HD&quot;, [&quot;VN&quot; =&gt; 1.6, &quot;SO&quot; =&gt; &quot;coordinate&quot;])
SAM.MetaInfo:
tag: HD
value: VN=1.6 SO=coordinate
julia&gt; b = SAM.MetaInfo(&quot;SQ&quot;, [&quot;SN&quot; =&gt; &quot;ref&quot;, &quot;LN&quot; =&gt; 45])
SAM.MetaInfo:
tag: SQ
value: SN=ref LN=45
julia&gt; h = SAM.Header([a, b])
SAM.Header(SAM.MetaInfo[SAM.MetaInfo:
tag: HD
value: VN=1.6 SO=coordinate, SAM.MetaInfo:
tag: SQ
value: SN=ref LN=45])
</code></pre><p>Then to create the writer for a SAM file, construct a <code>SAM.Writer</code> using the header and an <code>IO</code> type:</p><pre><code class="language-julia">julia&gt; samw = SAM.Writer(open(&quot;my-data.sam&quot;, &quot;w&quot;), h)
SAM.Writer(IOStream(&lt;file my-data.sam&gt;))
2020-03-20 22:43:09 +00:00
</code></pre><p>To make a BAM Writer is slightly different, as you need to use a specific stream type from the <a href="https://github.com/BioJulia/BGZFStreams.jl">https://github.com/BioJulia/BGZFStreams.jl</a> package:</p><pre><code class="language-julia">julia&gt; using BGZFStreams
2020-02-22 19:53:02 +00:00
julia&gt; bamw = BAM.Writer(BGZFStream(open(&quot;my-data.bam&quot;, &quot;w&quot;), &quot;w&quot;))
BAM.Writer(BGZFStreams.BGZFStream{IOStream}(&lt;mode=write&gt;))
</code></pre><p>Once you have a BAM or SAM writer, you can use the <code>write</code> method to write <code>BAM.Record</code>s or <code>SAM.Record</code>s to file:</p><pre><code class="language-julia">julia&gt; write(bamw, rec) # Here rec is a `BAM.Record`
2020-04-25 20:31:21 +00:00
330780</code></pre></article><nav class="docs-footer"><a class="docs-footer-prevpage" href="../../">« Home</a><a class="docs-footer-nextpage" href="../api/">API Reference »</a></nav></div><div class="modal" id="documenter-settings"><div class="modal-background"></div><div class="modal-card"><header class="modal-card-head"><p class="modal-card-title">Settings</p><button class="delete"></button></header><section class="modal-card-body"><p><label class="label">Theme</label><div class="select"><select id="documenter-themepicker"><option value="documenter-light">documenter-light</option><option value="documenter-dark">documenter-dark</option></select></div></p><hr/><p>This document was generated with <a href="https://github.com/JuliaDocs/Documenter.jl">Documenter.jl</a> on <span class="colophon-date" title="Saturday 25 April 2020 20:31">Saturday 25 April 2020</span>. Using Julia version 1.4.1.</p></section><footer class="modal-card-foot"></footer></div></div></div></body></html>