mirror of https://github.com/MillironX/taxprofiler.git synced 2024-11-13 07:13:10 +00:00

Merge branch 'dev' into final-reads-saving

This commit is contained in:
James A. Fellows Yates 2023-04-20 14:18:10 +02:00 committed by GitHub
commit 0245c4880b
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
12 changed files with 141 additions and 133 deletions


@ -11,6 +11,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### `Fixed`
- [#271](https://github.com/nf-core/taxprofiler/pull/271/files) Improved standardised table generation documentation and mOTUs manual database download tutorial (♥ to @prototaxites for reporting, fix by @jfy133)
- [#269](https://github.com/nf-core/taxprofiler/pull/269/files) Reduced output files in AWS full test output due to very large files
- [#270](https://github.com/nf-core/taxprofiler/pull/270/files) Fixed warning for host removal index parameter, and improved index checks (♥ to @prototaxites for reporting, fix by @jfy133)
- [#274](https://github.com/nf-core/taxprofiler/pull/274/files) Substituted the samtools/bam2fq module with samtools/fastq module (fix by @sofstam)
### `Dependencies`
### `Deprecated`


@ -374,13 +374,13 @@ process {
ext.prefix = { "${meta.id}_${meta.run_accession}.unmapped" }
}
withName: SAMTOOLS_BAM2FQ {
withName: SAMTOOLS_FASTQ {
ext.prefix = { "${meta.id}_${meta.run_accession}.unmapped" }
publishDir = [
[
path: { "${params.outdir}/samtools/bam2fq" },
path: { "${params.outdir}/samtools/fastq" },
mode: params.publish_dir_mode,
pattern: '*.fq.gz',
pattern: '*.fastq.gz',
enabled: params.save_hostremoval_unmapped
],
[


@ -21,7 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Bowtie2](#bowtie2) - Host removal for Illumina reads
- [minimap2](#minimap2) - Host removal for Nanopore reads
- [SAMtools stats](#samtools-stats) - Statistics from host removal
- [SAMtools bam2fq](#samtools-bam2fq) - Converts unmapped BAM file to fastq format (minimap2 only)
- [SAMtools fastq](#samtools-fastq) - Converts unmapped BAM file to fastq format (minimap2 only)
- [Analysis Ready Reads](#analysis-read-reads) - Optional results directory containing the final processed reads used as input for classification/profiling.
- [Bracken](#bracken) - Taxonomic classifier using k-mers and abundance estimations
- [Kraken2](#kraken2) - Taxonomic classifier using exact k-mer matches
@ -202,7 +202,7 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) and/
By default nf-core/taxprofiler will only provide the `.log` file if host removal is turned on. You will only have a `.bam` file if you specify `--save_hostremoval_bam`. This will contain _both_ mapped and unmapped reads. You will only get FASTQ files if you specify to save `--save_hostremoval_unmapped` - these contain only unmapped reads. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.
> Unmapped reads in FASTQ are only found in this directory for short-reads, for long-reads see [`samtools/bam2fq/`](#samtools-bam2fq)
> Unmapped reads in FASTQ are only found in this directory for short-reads, for long-reads see [`samtools/fastq/`](#samtools-fastq)
> ⚠️ The resulting `.fastq` files may _not_ always be the 'final' reads that go into taxprofiling, if you also run other steps such as run merging etc.
@ -229,11 +229,11 @@ By default, nf-core/taxprofiler will only provide the `.bam` file containing map
> minimap2 is not yet supported as a module in MultiQC and therefore there is no dedicated section in the MultiQC HTML. Rather, alignment statistics to host genome is reported via samtools stats module in MultiQC report.
> Unlike Bowtie2, minimap2 does not produce an unmapped FASTQ file by itself. See [`samtools/bam2fq`](#samtools-bam2fq)
> Unlike Bowtie2, minimap2 does not produce an unmapped FASTQ file by itself. See [`samtools/fastq`](#samtools-fastq)
### SAMtools bam2fq
### SAMtools fastq
[SAMtools bam2fq](http://www.htslib.org/doc/1.1/samtools.html) converts a `.sam`, `.bam`, or `.cram` alignment file to FASTQ format
[SAMtools fastq](http://www.htslib.org/doc/1.1/samtools.html) converts a `.sam`, `.bam`, or `.cram` alignment file to FASTQ format
<details markdown="1">
<summary>Output files</summary>


@ -296,7 +296,9 @@ nf-core/taxprofiler supports generation of Krona interactive pie chart plots for
##### Multi-Table Generation
In addition to per-sample profiles, the pipeline also supports generation of 'native' multi-sample taxonomic profiles (i.e., those generated by the taxonomic profiling tools themselves or additional utility scripts provided by the tool authors).
The main multiple-sample table from nf-core/taxprofiler is from a dedicated standalone tool originally developed for the pipeline - [Taxpasta](https://taxpasta.readthedocs.io/en/latest/). When providing `--run_profile_standardisation`, every classifier/profiler and database combination will get a standardised and multi-sample taxon table in the [`taxpasta/`](https://nf-co.re/taxprofiler/output) directory. These tables are structured in the same way, to facilitate comparison between the results of the classifier/profiler
In addition to per-sample profiles and standardised Taxpasta output, the pipeline also supports generation of 'native' multi-sample taxonomic profiles (i.e., those generated by the taxonomic profiling tools themselves or additional utility scripts provided by the tool authors), when providing `--run_profile_standardisation` to your pipeline.
These are executed on a per-database level. I.e., you will get a multi-sample taxon table for each database you provide for each tool and will be placed in the same directory as the directories containing the per-sample profiles.
@ -309,7 +311,7 @@ The following tools will produce multi-sample taxon tables:
- **MetaPhlAn3** (via MetaPhlAn's `merge_metaphlan_tables.py` script)
- **mOTUs** (via the `motus merge` command)
Note that the multi-sample tables from these folders are not inter-operable with each other as they can have different formats.
Note that the multi-sample tables from the 'native' tools in each folder are [not inter-operable](https://taxpasta.readthedocs.io/en/latest/tutorials/getting-started/) with each other as they can have different formats and can contain additional and different data. In this case we recommend using the standardised and merged output from Taxpasta, as described above.
### Updating the pipeline
@ -794,6 +796,8 @@ More information on the MetaPhlAn3 database can be found [here](https://github.c
mOTUs does not provide the ability to construct custom databases. Therefore we recommend using the prebuilt database of marker genes provided by the developers.
> ⚠️ **Do not change the directory name of the resulting database if moving to a central location** The database name of `db_mOTU/` is hardcoded in the mOTUs tool
To do this you need to have `mOTUs` installed on your machine.
```bash


@ -187,9 +187,9 @@
"git_sha": "c8e35eb2055c099720a75538d1b8adb3fb5a464c",
"installed_by": ["modules"]
},
"samtools/bam2fq": {
"samtools/fastq": {
"branch": "master",
"git_sha": "c8e35eb2055c099720a75538d1b8adb3fb5a464c",
"git_sha": "0f8a77ff00e65eaeebc509b8156eaa983192474b",
"installed_by": ["modules"]
},
"samtools/index": {


@ -1,56 +0,0 @@
process SAMTOOLS_BAM2FQ {
tag "$meta.id"
label 'process_low'
conda "bioconda::samtools=1.16.1"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/samtools:1.16.1--h6899075_1' :
'quay.io/biocontainers/samtools:1.16.1--h6899075_1' }"
input:
tuple val(meta), path(inputbam)
val split
output:
tuple val(meta), path("*.fq.gz"), emit: reads
path "versions.yml" , emit: versions
when:
task.ext.when == null || task.ext.when
script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
if (split){
"""
samtools \\
bam2fq \\
$args \\
-@ $task.cpus \\
-1 ${prefix}_1.fq.gz \\
-2 ${prefix}_2.fq.gz \\
-0 ${prefix}_other.fq.gz \\
-s ${prefix}_singleton.fq.gz \\
$inputbam
cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
} else {
"""
samtools \\
bam2fq \\
$args \\
-@ $task.cpus \\
$inputbam | gzip --no-name > ${prefix}_interleaved.fq.gz
cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
}


@ -1,55 +0,0 @@
name: samtools_bam2fq
description: |
The module uses bam2fq method from samtools to
convert a SAM, BAM or CRAM file to FASTQ format
keywords:
- bam2fq
- samtools
- fastq
tools:
- samtools:
description: Tools for dealing with SAM, BAM and CRAM files
homepage: None
documentation: http://www.htslib.org/doc/1.1/samtools.html
tool_dev_url: None
doi: ""
licence: ["MIT"]
input:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- inputbam:
type: file
description: BAM/CRAM/SAM file
pattern: "*.{bam,cram,sam}"
- split:
type: boolean
description: |
TRUE/FALSE value to indicate if reads should be separated into
/1, /2 and if present other, or singleton.
Note: choosing TRUE will generate 4 different files.
Choosing FALSE will produce a single file, which will be interleaved in case
the input contains paired reads.
output:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- versions:
type: file
description: File containing software versions
pattern: "versions.yml"
- reads:
type: file
description: |
FASTQ files, which will be either a group of 4 files (read_1, read_2, other and singleton)
or a single interleaved .fq.gz file if the user chooses not to split the reads.
pattern: "*.fq.gz"
authors:
- "@lescai"

modules/nf-core/samtools/fastq/main.nf (new file, 44 lines)

@ -0,0 +1,44 @@
process SAMTOOLS_FASTQ {
tag "$meta.id"
label 'process_low'
conda "bioconda::samtools=1.16.1"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/samtools:1.16.1--h6899075_1' :
'quay.io/biocontainers/samtools:1.16.1--h6899075_1' }"
input:
tuple val(meta), path(input)
val(interleave)
output:
tuple val(meta), path("*_{1,2}.fastq.gz") , optional:true, emit: fastq
tuple val(meta), path("*_interleaved.fastq.gz"), optional:true, emit: interleaved
tuple val(meta), path("*_singleton.fastq.gz") , optional:true, emit: singleton
tuple val(meta), path("*_other.fastq.gz") , optional:true, emit: other
path "versions.yml" , emit: versions
when:
task.ext.when == null || task.ext.when
script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
def output = ( interleave && ! meta.single_end ) ? "> ${prefix}_interleaved.fastq.gz" :
meta.single_end ? "-1 ${prefix}_1.fastq.gz -s ${prefix}_singleton.fastq.gz" :
"-1 ${prefix}_1.fastq.gz -2 ${prefix}_2.fastq.gz -s ${prefix}_singleton.fastq.gz"
"""
samtools \\
fastq \\
$args \\
--threads ${task.cpus-1} \\
-0 ${prefix}_other.fastq.gz \\
$input \\
$output
cat <<-END_VERSIONS > versions.yml
"${task.process}":
samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
END_VERSIONS
"""
}
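
The `output` ternary in the script block above decides which files `samtools fastq` is asked to write. As a rough illustration only (a Python sketch, not part of the module; the function name `fastq_output_args` is hypothetical), the same branching could be written as:

```python
def fastq_output_args(prefix: str, interleave: bool, single_end: bool) -> str:
    """Mirror the `output` ternary of SAMTOOLS_FASTQ (illustrative sketch only)."""
    # Interleaving only makes sense for paired-end input: write to stdout redirect.
    if interleave and not single_end:
        return f"> {prefix}_interleaved.fastq.gz"
    # Single-end input: one reads file plus singletons.
    if single_end:
        return f"-1 {prefix}_1.fastq.gz -s {prefix}_singleton.fastq.gz"
    # Paired-end, not interleaved: separate R1/R2 files plus singletons.
    return (
        f"-1 {prefix}_1.fastq.gz -2 {prefix}_2.fastq.gz "
        f"-s {prefix}_singleton.fastq.gz"
    )
```

In `LONGREAD_HOSTREMOVAL` the module is called with `interleave` set to `false`, so only the last two branches apply there. Reads with neither or both of the READ1/READ2 flags set are still written to `*_other.fastq.gz` through the unconditional `-0` argument, which is presumably why the subworkflow mixes the `fastq` and `other` outputs.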

modules/nf-core/samtools/fastq/meta.yml (new file, 62 lines)

@ -0,0 +1,62 @@
name: samtools_fastq
description: Converts a SAM/BAM/CRAM file to FASTQ
keywords:
- bam
- sam
- cram
- fastq
tools:
- samtools:
description: |
SAMtools is a set of utilities for interacting with and post-processing
short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li.
These files are generated as output by short read aligners like BWA.
homepage: http://www.htslib.org/
documentation: http://www.htslib.org/doc/samtools.html
doi: 10.1093/bioinformatics/btp352
licence: ["MIT"]
input:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- input:
type: file
description: BAM/CRAM/SAM file
pattern: "*.{bam,cram,sam}"
- interleave:
type: boolean
description: Set true for interleaved fastq file
output:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- versions:
type: file
description: File containing software versions
pattern: "versions.yml"
- fastq:
type: file
description: Compressed FASTQ file(s) with reads with either the READ1 or READ2 flag set in separate files.
pattern: "*_{1,2}.fastq.gz"
- interleaved:
type: file
description: Compressed FASTQ file with reads with either the READ1 or READ2 flag set in a combined file. Needs collated input file.
pattern: "*_interleaved.fastq.gz"
- singleton:
type: file
description: Compressed FASTQ file with singleton reads
pattern: "*_singleton.fastq.gz"
- other:
type: file
description: Compressed FASTQ file with reads with either both READ1 and READ2 flags set or unset
pattern: "*_other.fastq.gz"
authors:
- "@priyanka-surana"
- "@suzannejin"


@ -310,7 +310,7 @@
"type": "boolean",
"fa_icon": "fas fa-save",
"description": "Save reads from samples that went through the host-removal step",
"help_text": "Save only the reads NOT mapped to the reference genome in FASTQ format (as exported from `samtools view` and `bam2fq`).\n\nThis can be useful if you wish to perform other analyses on the off-target reads from the host mapping, such as manual profiling or _de novo_ assembly."
"help_text": "Save only the reads NOT mapped to the reference genome in FASTQ format (as exported from `samtools view` and `fastq`).\n\nThis can be useful if you wish to perform other analyses on the off-target reads from the host mapping, such as manual profiling or _de novo_ assembly."
}
},
"fa_icon": "fas fa-user-times"
@ -473,15 +473,18 @@
},
"motus_use_relative_abundance": {
"type": "boolean",
"description": "Turn on printing relative abundance instead of counts."
"description": "Turn on printing relative abundance instead of counts.",
"fa_icon": "fas fa-percent"
},
"motus_save_mgc_read_counts": {
"type": "boolean",
"description": "Turn on saving the mgc reads count."
"description": "Turn on saving the mgc reads count.",
"fa_icon": "fas fa-save"
},
"motus_remove_ncbi_ids": {
"type": "boolean",
"description": "Turn on removing NCBI taxonomic IDs."
"description": "Turn on removing NCBI taxonomic IDs.",
"fa_icon": "fas fa-address-card"
}
},
"fa_icon": "fas fa-align-center"
@ -496,7 +499,7 @@
"type": "boolean",
"fa_icon": "fas fa-toggle-on",
"description": "Turn on standardisation of taxon tables across profilers",
"help_text": "Turns on standardisation of output OTU tables across all tools; each into a TSV format following the following scheme:\n\n|TAXON | SAMPLE_A | SAMPLE_B |\n|-------------|----------------|-----------------|\n| taxon_a | 32 | 123 |\n| taxon_b | 1 | 5 |\n\nThis currently only is generated for mOTUs."
"help_text": "Turns on standardisation of output OTU tables across all tools.\n\nThis happens in two forms, firstly - if available - by a given classifiers/profilers 'native' profile merger and standardisation (for Bracken, Kaiju, Kraken, Centrifuge, MetaPhlAn3, mOTUs), and secondly for _all_ classifier/profilers in the pipeline using [`taxpasta`](https://taxpasta.readthedocs.io).\n\nIn the latter case, taxpasta generates a standardised output as follows:\n\n|TAXON | SAMPLE_A | SAMPLE_B |\n|-------------|----------------|-----------------|\n| taxon_a | 32 | 123 |\n| taxon_b | 1 | 5 |\n\nwhereas all other 'native' tools have varying format outputs. See pipeline [output](https://nf-co.re/taxprofiler) documentation for more information."
},
"standardisation_motus_generatebiom": {
"type": "boolean",

View file

@ -5,7 +5,7 @@
include { MINIMAP2_INDEX } from '../../modules/nf-core/minimap2/index/main'
include { MINIMAP2_ALIGN } from '../../modules/nf-core/minimap2/align/main'
include { SAMTOOLS_VIEW } from '../../modules/nf-core/samtools/view/main'
include { SAMTOOLS_BAM2FQ } from '../../modules/nf-core/samtools/bam2fq/main'
include { SAMTOOLS_FASTQ } from '../../modules/nf-core/samtools/fastq/main'
include { SAMTOOLS_INDEX } from '../../modules/nf-core/samtools/index/main'
include { SAMTOOLS_STATS } from '../../modules/nf-core/samtools/stats/main'
@ -38,8 +38,8 @@ workflow LONGREAD_HOSTREMOVAL {
SAMTOOLS_VIEW ( ch_minimap2_mapped , [], [] )
ch_versions = ch_versions.mix( SAMTOOLS_VIEW.out.versions.first() )
SAMTOOLS_BAM2FQ ( SAMTOOLS_VIEW.out.bam, false )
ch_versions = ch_versions.mix( SAMTOOLS_BAM2FQ.out.versions.first() )
SAMTOOLS_FASTQ ( SAMTOOLS_VIEW.out.bam, false )
ch_versions = ch_versions.mix( SAMTOOLS_FASTQ.out.versions.first() )
// Indexing whole BAM for host removal statistics
SAMTOOLS_INDEX ( MINIMAP2_ALIGN.out.bam )
@ -54,7 +54,7 @@ workflow LONGREAD_HOSTREMOVAL {
emit:
stats = SAMTOOLS_STATS.out.stats //channel: [val(meta), [reads ] ]
reads = SAMTOOLS_BAM2FQ.out.reads // channel: [ val(meta), [ reads ] ]
reads = SAMTOOLS_FASTQ.out.fastq.mix( SAMTOOLS_FASTQ.out.other) // channel: [ val(meta), [ reads ] ]
versions = ch_versions // channel: [ versions.yml ]
mqc = ch_multiqc_files
}


@ -35,7 +35,8 @@ if (params.shortread_qc_includeunmerged && !params.shortread_qc_mergepairs) exit
if (params.shortread_complexityfilter_tool == 'fastp' && ( params.perform_shortread_qc == false || params.shortread_qc_tool != 'fastp' )) exit 1, "ERROR: [nf-core/taxprofiler] cannot use fastp complexity filtering if preprocessing not turned on and/or tool is not fastp. Please specify --perform_shortread_qc and/or --shortread_qc_tool 'fastp'"
if (params.perform_shortread_hostremoval && !params.hostremoval_reference) { exit 1, "ERROR: [nf-core/taxprofiler] --shortread_hostremoval requested but no --hostremoval_reference FASTA supplied. Check input." }
if (!params.hostremoval_reference && params.hostremoval_reference_index) { exit 1, "ERROR: [nf-core/taxprofiler] --shortread_hostremoval_index provided but no --hostremoval_reference FASTA supplied. Check input." }
if (params.perform_shortread_hostremoval && !params.hostremoval_reference && params.shortread_hostremoval_index) { exit 1, "ERROR: [nf-core/taxprofiler] --shortread_hostremoval_index provided but no --hostremoval_reference FASTA supplied. Check input." }
if (params.perform_longread_hostremoval && !params.hostremoval_reference && params.longread_hostremoval_index) { exit 1, "ERROR: [nf-core/taxprofiler] --longread_hostremoval_index provided but no --hostremoval_reference FASTA supplied. Check input." }
if (params.hostremoval_reference ) { ch_reference = file(params.hostremoval_reference) }
if (params.shortread_hostremoval_index ) { ch_shortread_reference_index = Channel.fromPath(params.shortread_hostremoval_index).map{[[], it]} } else { ch_shortread_reference_index = [] }
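
The two read-type-specific index checks earlier in this hunk (for `shortread_hostremoval_index` and `longread_hostremoval_index`) share one shape: host removal is enabled for that read type, an index is given, and no reference FASTA is supplied. A minimal Python sketch of that logic, assuming the parameters arrive as a plain dictionary (the function name `hostremoval_index_errors` is hypothetical and not part of the pipeline):

```python
def hostremoval_index_errors(params: dict) -> list[str]:
    """Collect index-without-reference errors per read type (illustrative sketch)."""
    errors = []
    has_reference = bool(params.get("hostremoval_reference"))
    for read_type in ("shortread", "longread"):
        enabled = params.get(f"perform_{read_type}_hostremoval")
        index = params.get(f"{read_type}_hostremoval_index")
        # Same condition as the Groovy checks: enabled, index set, no reference.
        if enabled and not has_reference and index:
            errors.append(
                f"--{read_type}_hostremoval_index provided but no "
                "--hostremoval_reference FASTA supplied."
            )
    return errors
```

Unlike the Groovy code, which exits on the first failing check, this sketch collects all messages and returns them to the caller.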