mirror of https://github.com/MillironX/taxprofiler.git synced 2024-09-21 06:52:04 +00:00

Merge pull request #230 from genomic-medicine-sweden/documentation

Add documentation
James A. Fellows Yates 2023-02-03 16:34:09 +01:00 committed by GitHub
commit de5bdc36c5
GPG key ID: 4AEE18F83AFDEB23
4 changed files with 47 additions and 18 deletions

View file

@ -18,7 +18,7 @@ process {
publishDir = [
path: { "${params.outdir}/fastqc/raw" },
mode: params.publish_dir_mode,
pattern: '*.html'
pattern: '*.{html,zip}'
]
}
@ -28,7 +28,7 @@ process {
publishDir = [
path: { "${params.outdir}/fastqc/processed" },
mode: params.publish_dir_mode,
pattern: '*.html'
pattern: '*.{html,zip}'
]
}
@ -37,7 +37,7 @@ process {
publishDir = [
path: { "${params.outdir}/falco/raw" },
mode: params.publish_dir_mode,
pattern: '*.{html,txt}'
pattern: '*.{html,txt,zip}'
]
}
@ -46,7 +46,7 @@ process {
publishDir = [
path: { "${params.outdir}/falco/processed" },
mode: params.publish_dir_mode,
pattern: '*.{html,txt}'
pattern: '*.{html,txt,zip}'
]
}
@ -354,7 +354,7 @@ process {
publishDir = [
path: { "${params.outdir}/kraken2/${meta.db_name}/" },
mode: params.publish_dir_mode,
pattern: '*.{txt,report,fastq.gz}'
pattern: '*.{txt,fastq.gz}'
]
}
@ -393,7 +393,7 @@ process {
publishDir = [
path: { "${params.outdir}/krakenuniq/${meta.db_name}/" },
mode: params.publish_dir_mode,
pattern: '*.{txt,report,fastq.gz}'
pattern: '*.{txt,fastq.gz}'
]
}
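
The `pattern` changes above rely on glob alternation, where `{a,b}` accepts either suffix. As a rough illustration only (not part of this commit), the updated kraken2/KrakenUniq pattern `*.{txt,fastq.gz}` can be approximated with a POSIX shell `case` statement. The function name and the sample file names are hypothetical.

```bash
# Approximates the glob '*.{txt,fastq.gz}' with shell case patterns.
# Function and file names below are hypothetical, for illustration only.
matches_new_pattern() {
  case "$1" in
    *.txt|*.fastq.gz) return 0 ;;  # either suffix is accepted
    *) return 1 ;;                 # anything else would not be published
  esac
}

for f in sample.report.txt sample.classified.fastq.gz sample.report; do
  if matches_new_pattern "$f"; then
    echo "$f: matched"
  else
    echo "$f: not matched"
  fi
done
```

In this sketch a name ending in `.report.txt` is still accepted through `*.txt`, whereas a name ending in plain `.report` is not.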

View file

@ -13,7 +13,7 @@ The directories listed below will be created in the results directory after the
The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
- [FastQC](#fastqc) - Raw read QC
- [falco](#falco) - Alternative to FastQC for raw read QC
- [falco](#fastqc) - Alternative to FastQC for raw read QC
- [fastp](#fastp) - Adapter trimming for Illumina data
- [AdapterRemoval](#adapterremoval) - Adapter trimming for Illumina data
- [Porechop](#porechop) - Adapter removal for Oxford Nanopore data
@ -22,7 +22,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Filtlong](#filtlong) - Quality trimming and filtering for Nanopore data
- [Bowtie2](#bowtie2) - Host removal for Illumina reads
- [minimap2](#minimap2) - Host removal for Nanopore reads
- [samtoolsstats](#samtoolsstats) - Statistics from host removal
- [SAMtools stats](#samtoolsstats) - Statistics from host removal
- [SAMtools fastq](#samtoolsfastq) - Converts the alignment file to FASTQ format
- [Bracken](#bracken) - Taxonomic classifier using k-mers and abundance estimations
- [Kraken2](#kraken2) - Taxonomic classifier using exact k-mer matches
- [KrakenUniq](#krakenuniq) - Taxonomic classifier that combines the k-mer-based classification and the number of unique k-mers found in each species
@ -35,7 +36,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
### FastQC
### FastQC or falco
<details markdown="1">
<summary>Output files</summary>
@ -48,6 +49,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
> Falco produces identical output to FastQC but in the `falco/` directory.
![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
@ -68,6 +71,7 @@ It is used in nf-core/taxprofiler for adapter trimming of short-reads.
- `fastp`
- `<sample_id>.fastp.fastq.gz`: File with the trimmed unmerged fastq reads.
- `<sample_id>.merged.fastq.gz`: File with the reads that were successfully merged.
- `<sample_id>.*{log,html,json}`: Log files in different formats.
</details>
@ -189,7 +193,7 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) and/
</details>
By default nf-core/taxprofiler will only provide the `.log` file if host removal is turned on. You will only see the mapped (host) reads `.bam` file or the off-target reads in `.fastq` format in your results directory if you provide `--save_hostremoval_mapped` and ` --save_hostremoval_unmapped` respectively.
By default, nf-core/taxprofiler will only provide the `.log` file if host removal is turned on. You will only see the mapped (host) and unmapped reads in `.bam` format or the off-target reads in `.fastq` format in your results directory if you provide `--save_hostremoval_mapped` and `--save_hostremoval_unmapped` respectively.
> ⚠️ The resulting `.fastq` files may _not_ always be the 'final' reads that go into taxprofiling, if you also run other steps such as run merging etc.
@ -209,19 +213,33 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) or o
</details>
By default, nf-core taxprofiler will only provide the `.bam` file if host removal for long reads is turned on (i.e., `--save_hostremoval_mapped` and ` --save_hostremoval_unmapped`).
By default, nf-core/taxprofiler will only provide the `.bam` file containing mapped and unmapped reads if host removal for long reads is turned on (i.e., `--save_hostremoval_mapped` and `--save_hostremoval_unmapped`).
> minimap2 is not yet supported as a module in MultiQC, so there is no dedicated section in the MultiQC HTML. Instead, alignment statistics against the host genome are reported via the samtools stats module in the MultiQC report.
### Samtools stats
### SAMtools fastq
[Samtools stats](http://www.htslib.org/doc/samtools-stats.html) collects statistics from a `.sam`, `.bam`, or `.cram` alignment file and outputs in a text format.
[SAMtools fastq](http://www.htslib.org/doc/1.1/samtools.html) converts a `.sam`, `.bam`, or `.cram` alignment file to FASTQ format.
<details markdown="1">
<summary>Output files</summary>
- `samtoolsstats`
- `<sample_id>.stats`: File containing samtools stats output
- `<sample_id>.fq.gz`: Alignment file in FASTQ gzip format.
</details>
This directory will be present and contain the unmapped reads in `.fastq` format from long-read minimap2 host removal (for short-read unmapped reads, see [bowtie2](#bowtie2)), if `--save_hostremoval_unmapped` is supplied.
### SAMtools stats
[SAMtools stats](http://www.htslib.org/doc/samtools-stats.html) collects statistics from a `.sam`, `.bam`, or `.cram` alignment file and outputs in a text format.
<details markdown="1">
<summary>Output files</summary>
- `samtoolsstats`
- `<sample_id>.stats`: File containing samtools stats output.
</details>
@ -330,7 +348,7 @@ The most summary file is the `*combined_reports.txt` file which summarises resul
- `diamond`
- `<sample_id>.log`: A log file containing stdout information
- `<sample_id>.sam`: A file in SAM format that contains the aligned reads
- `<sample_id>*.{blast,xml,txt,daa,sam,tsv,paf}`: A file containing alignment information in various formats, or taxonomic information in a text-based format. Exact output depends on user choice.
</details>

View file

@ -610,7 +610,18 @@ A detailed description can be found [here](https://github.com/bbuchfink/diamond/
#### Kaiju custom database
To build a kaiju database, you need two components: a FASTA file with the protein sequences (the headers are the numeric NCBI taxon identifiers of the protein sequences), and you need to define the uppercase characters of the standard 20 amino acids you wish to include.
To build a kaiju database, you need three components: a FASTA file with the protein sequences, the NCBI taxonomy dump files, and a definition of the uppercase characters of the standard 20 amino acids you wish to include.
> ⚠️ The headers of the protein fasta file must be numeric NCBI taxon identifiers of the protein sequences.
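
A quick pre-flight check along these lines may help catch header problems before building. This is a minimal sketch that takes the warning literally (each header is `>` followed only by digits). The function name is hypothetical, and the expression would need adjusting if your headers legitimately carry anything besides the taxon ID.

```bash
# Prints every FASTA header line that is not '>' followed only by digits.
# Prints nothing when all headers are purely numeric.
# Hypothetical helper, assuming a strict reading of the header requirement.
non_numeric_headers() {
  grep '^>' "$1" | grep -Ev '^>[0-9]+$'
}
```

For example, `non_numeric_headers proteins.faa` lists the offending headers in `proteins.faa`, the file name used in the build command.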
To download the NCBI taxonomy files, please run the following commands:
```bash
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
unzip new_taxdump.zip
```
To build the database, run the following command (the contents of taxdump must be in the same location where you run the command):
```bash
kaiju-mkbwt -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa

View file

@ -296,8 +296,8 @@
"save_hostremoval_mapped": {
"type": "boolean",
"fa_icon": "fas fa-save",
"description": "Save mapped reads in BAM format from host removal",
"help_text": "Save the reads mapped to the reference genome in BAM format as output by the respective hostremoval alignment tool.\n\nThis can be useful if you wish to perform other analyses on the host organism (such as host-microbe interaction), however, you should consider whether the default mapping parameters of Bowtie2 (short-read) or minimap2 (long-read) are optimised to your context. "
"description": "Save mapped and unmapped reads in BAM format from host removal",
"help_text": "Save the reads mapped to the reference genome and off-target reads in BAM format as output by the respective hostremoval alignment tool.\n\nThis can be useful if you wish to perform other analyses on the host organism (such as host-microbe interaction), however, you should consider whether the default mapping parameters of Bowtie2 (short-read) or minimap2 (long-read) are optimised to your context."
},
"save_hostremoval_unmapped": {
"type": "boolean",